HPC Unit 1

Chapter 1: Introduction

1.1 Introduction
For over 40 years, virtually all computers have followed a common machine
model known as the von Neumann computer, named after the Hungarian-born
mathematician John von Neumann.
A von Neumann computer uses the stored-program concept. The CPU executes
a stored program that specifies a sequence of read and write operations on the
memory.

Fig 1.1 Von Neumann concept

Basic design:
Memory is used to store both program instructions and data.
Program instructions are coded data which tell the computer to do
something.
Data is simply information to be used by the program.
A central processing unit (CPU) gets instructions and/or data from memory,
decodes the instructions and then sequentially performs them.
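The stored-program idea can be illustrated with a toy interpreter: instructions and data live in the same memory, and the CPU fetches, decodes and executes them one at a time. The Python sketch below is only illustrative; the tiny three-instruction machine is invented for the example and does not correspond to any real processor.

# Toy illustration of the stored-program (von Neumann) model.
# The "memory" holds both instructions and data; the CPU loop
# fetches, decodes and executes one instruction at a time.
memory = [
    ("LOAD", 10),    # acc = memory[10]
    ("ADD", 11),     # acc = acc + memory[11]
    ("STORE", 12),   # memory[12] = acc
    ("HALT", None),
    None, None, None, None, None, None,
    3,               # address 10: data
    4,               # address 11: data
    0,               # address 12: result goes here
]

pc, acc = 0, 0                       # program counter and accumulator
while True:
    op, addr = memory[pc]            # fetch and decode
    pc += 1
    if op == "LOAD":
        acc = memory[addr]           # execute: read from memory
    elif op == "ADD":
        acc += memory[addr]
    elif op == "STORE":
        memory[addr] = acc           # execute: write to memory
    elif op == "HALT":
        break

print(memory[12])                    # prints 7 (3 + 4), computed strictly sequentially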
What is Parallel Computing?
Traditionally, software has been written for serial computation:
- To be executed by a single computer having a single Central Processing
Unit (CPU);
- Problems are solved by a series of instructions, executed one after the
other by the CPU. Only one instruction may be executed at any moment
in time.
In the simplest sense, parallel computing is the simultaneous use of multiple
compute resources to solve a computational problem.
The compute resources can include:
- A single computer with multiple processors;
- An arbitrary number of computers connected by a network;
- A combination of both.
The computational problem usually demonstrates characteristics such as the
ability to be:

- Broken apart into discrete pieces of work that can be solved simultaneously;
- Execute multiple program instructions at any moment in time;
- Solved in less time with multiple compute resources than with a single
compute resource.
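As a concrete (if simplified) illustration, the Python sketch below breaks one problem, summing a large array, into discrete pieces of work that are solved simultaneously by a pool of worker processes and then combined. The chunk size and worker count are arbitrary choices for the example.

# Minimal sketch: break a problem into discrete pieces of work that
# are solved simultaneously, then combine the partial results.
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker solves one discrete piece of the problem.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4                                   # arbitrary for the example
    size = len(data) // n_workers
    chunks = [data[i*size:(i+1)*size] for i in range(n_workers - 1)]
    chunks.append(data[(n_workers - 1)*size:])      # last chunk takes the remainder

    with Pool(n_workers) as pool:
        partials = pool.map(partial_sum, chunks)    # pieces execute simultaneously

    print(sum(partials))                            # same answer as sum(data)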

Parallel computing is an evolution of serial computing that attempts to emulate
what has always been the state of affairs in the natural world: many complex,
interrelated events happening at the same time, yet within a sequence. Some
examples:
o Planetary and galactic orbits
o Weather and ocean patterns
o Tectonic plate drift
o Rush hour traffic in LA
o Automobile assembly line
o Daily operations within a business
o Building a shopping mall
o Ordering a hamburger at the drive through

Traditionally, parallel computing has been considered to be "the high end of
computing" and has been motivated by numerical simulations of complex
systems and "Grand Challenge Problems" such as:
o weather and climate
o chemical and nuclear reactions
o biological, human genome
o geological, seismic activity
o mechanical devices - from prosthetics to spacecraft
o electronic circuits
o manufacturing processes

Today, commercial applications are providing an equal or greater driving force in
the development of faster computers. These applications require the processing
of large amounts of data in sophisticated ways. Example applications include:
o parallel databases, data mining
o oil exploration
o web search engines, web based business services
o computer-aided diagnosis in medicine
o management of national and multi-national corporations
o advanced graphics and virtual reality, particularly in the entertainment industry
o networked video and multi-media technologies
o collaborative work environments

Ultimately, parallel computing is an attempt to maximize the infinite but
seemingly scarce commodity called time.

Why Use Parallel Computing?


There are two primary reasons for using parallel computing:
o Save time - wall clock time
o Solve larger problems

Other reasons might include:
o Taking advantage of non-local resources - using available compute
resources on a wide area network, or even the Internet when local
compute resources are scarce.
o Cost savings - using multiple "cheap" computing resources instead of
paying for time on a supercomputer.
o Overcoming memory constraints - single computers have very finite
memory resources. For large problems, using the memories of multiple
computers may overcome this obstacle.

Limits to serial computing - both physical and practical reasons pose significant
constraints to simply building ever faster serial computers:
o Transmission speeds - the speed of a serial computer is directly
dependent upon how fast data can move through hardware. Absolute
limits are the speed of light (30 cm/nanosecond) and the transmission
limit of copper wire (9 cm/nanosecond). Increasing speeds necessitate
increasing proximity of processing elements.
o Limits to miniaturization - processor technology is allowing an increasing
number of transistors to be placed on a chip. However, even with
molecular or atomic-level components, a limit will be reached on how
small components can be.
o Economic limitations - it is increasingly expensive to make a single
processor faster. Using a larger number of moderately fast commodity
processors to achieve the same (or better) performance is less expensive.

The future: during the past 10 years, the trends indicated by ever faster
networks, distributed systems, and multi-processor computer architectures
(even at the desktop level) suggest that parallelism is the future of computing.

1.2 Need Of High Speed Computing:


The traditional scientific paradigm is first to do theory (say on paper), and then
lab experiments to confirm or deny the theory.
The traditional engineering paradigm is first to do a design (say on paper), and
then build a laboratory prototype.
Both paradigms are being replaced by numerical experiments and numerical
prototyping.
There are several reasons for this.
1) Real phenomena are too complicated to model on paper (e.g., climate
prediction).
2) Real experiments are too hard, too expensive, too slow, or too dangerous for a
laboratory (e.g., oil reservoir simulation, large wind tunnels, overall aircraft
design, galactic evolution, whole factory or product life cycle design and
optimization, etc.).
Scientific and engineering problems requiring the most computing power to
simulate are commonly called "Grand Challenges". Predicting the climate 50
years hence, for example, is estimated to require computers computing at the
rate of 1 Tflop = 1 Teraflop = 10^12 floating point operations per second, and
with a memory size of 1 TB = 1 Terabyte = 10^12 bytes. Here is some commonly
used notation we will use to describe problem sizes:
1 Mflop = 1 Megaflop = 10^6 floating point operations per second
1 Gflop = 1 Gigaflop = 10^9 floating point operations per second
1 Tflop = 1 Teraflop = 10^12 floating point operations per second
1 MB = 1 Megabyte = 10^6 bytes
1 GB = 1 Gigabyte = 10^9 bytes
1 TB = 1 Terabyte = 10^12 bytes
1 PB = 1 Petabyte = 10^15 bytes
Suppose the atmosphere is modeled on a grid of about 5e9 cells and that it takes
100 flops (floating point operations) to update each cell by one time step of
dt = 1 minute. Then computing Climate(i,j,k,n+1) for all i,j,k
from Climate(i,j,k,n) takes about 100*5e9 = 5e11 floating point operations.

It clearly makes no sense to take longer than one minute to predict the weather one
minute from now; otherwise it is cheaper to look out the window.

Thus, we must compute at least at the rate of 5e11 flops / 60 secs ~ 8 Gflops.
Weather prediction (computing 24 hours to compute the weather 7 days hence)
requires computing 7 times faster, or 56 Gflops.
Climate prediction (computing 30 days, a long run, to compute the climate 50 years
hence), requires computing 50*12=600 times faster, or 4.8 Tflops.
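The arithmetic above is easy to check; the short Python snippet below just reproduces the rate estimates (the text's 56 Gflops and 4.8 Tflops figures come from rounding the base rate down to 8 Gflops before scaling).

# Reproduce the back-of-the-envelope flop-rate estimates.
flops_per_step = 100 * 5e9          # 100 flops per cell, ~5e9 cells
dt_seconds = 60                     # one simulated minute per step

realtime_rate = flops_per_step / dt_seconds          # keep up with real time
weather_rate = 7 * realtime_rate                     # 7 days simulated per day of computing
climate_rate = 600 * realtime_rate                   # 50 years simulated in ~30 days (600x)

print(f"real time : {realtime_rate/1e9:6.1f} Gflops")   # ~8.3 Gflops
print(f"weather   : {weather_rate/1e9:6.1f} Gflops")    # ~58 Gflops (text rounds to 56)
print(f"climate   : {climate_rate/1e12:6.2f} Tflops")   # ~5.0 Tflops (text rounds to 4.8)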
The actual grid resolution used in climate codes today is 4 degrees of latitude by 5
degrees of longitude, or about 450 km by 560 km, a rather coarse resolution.
A near term goal is to improve this resolution to 2 degrees by 2.5 degrees, which is
four times as much data.
The size of the overall database is enormous. NASA is putting up weather satellites
expected to collect 1TB/day for a period of years, totaling as much as 6 PB of data,
which no existing system is large enough to store.

1.3 Increase the speed of computers


We can increase the speed of computers in several ways.
By increasing the speed of the processing element using faster semiconductor
technology (advanced device technology).
By architectural methods; in turn, we can increase the speed of a computer by
applying parallelism:
1) Use parallelism within a single processor
- overlap the execution of a number of instructions by pipelining
or by using multiple functional units
- overlap the operation of different units.
2) Use parallelism in the problem
- use a number of interconnected processors working
cooperatively to solve the problem.

1.4 History of parallel computers:


A brief history of parallel computers is given below.
Vector supercomputers
Glory days: 1976-90
Famous examples: Cray machines
Characterized by:
The fastest clock rates, because vector pipelines can be very simple.
Vector processing.
Quite good vectorizing compilers.
High price tag; small market share.
Not always scalable because of the shared-memory bottleneck (vector
processors need more data per cycle than conventional processors).
Vector processing is back in various forms: SIMD extensions of
commodity microprocessors (e.g. Intel's SSE), vector processors for game
consoles (Cell), multithreaded vector processors (Cray), etc.
Vector processors went down temporarily because of:
Market issues, price/performance, microprocessor revolution, commodity
microprocessors.
Not enough parallelism for biggest problems. Hard to vectorize/parallelize
automatically.
Didn't scale down.

MPPs
Glory days: 1990-96


Famous examples: Intel hypercubes and Paragon, TMC Connection Machine,
IBM SP, Cray/SGI T3E.
Characterized by:
Scalable interconnection network, up to 1000's of processors. We'll
discuss these networks shortly
Commodity (or at least, modest) microprocessors.
Message passing programming paradigm.
Killed by:
Small market niche, especially as a modest number of processors
can do more and more.
Programming paradigm too hard.
Relatively slow communication (especially latency) compared to ever-faster
processors (this is actually no more and no less than another example of
the memory wall).
Today
A state of flux in hardware, but more stability in software, e.g., MPI and OpenMP.
Machines are being sold, and important problems are being solved, on all of the
following:
Vector SMPs, e.g., Cray X1, Hitachi, Fujitsu, NEC.
SMPs and ccNUMA, e.g., Sun, IBM, HP, SGI, Dell, hundreds of
custom boxes.
Distributed memory multiprocessors, e.g., Cray XT3, IBM Blue
Gene.
Clusters: Beowulf (Linux) and many manufacturers and assemblers.
A complete top-down view: At the highest level you have either a distributed
memory architecture with a scalable interconnection network, or an SMP
architecture with a bus. A distributed memory architecture may or may not
provide support for a global memory consistency model (such as cache
coherence, software distributed shared memory, coherent RDMA, etc.). On an
SMP architecture you expect hardware support for cache coherence. A
distributed memory architecture can be built from SMP or even (rarely) ccNUMA
boxes. Each box is treated as a tightly coupled node (with local processors and
uniformly accessed shared memory). Boxes communicate via message passing,
or (less frequently) with hardware or software memory coherence schemes. Both


on distributed and on shared memory architectures, the processors themselves
may support an internal form of task or data parallelism. Processors may be
vector processors, commodity microprocessors with multiple cores, or multiple
threads multiplexed over a single core, heterogeneous multicore processors, etc.
Programming: Typically MPI is supported over both distributed and shared-memory
substrates for portability (large existing base of code written and
optimized in MPI). OpenMP and POSIX threads are almost always available on
SMPs and ccNUMA machines. OpenMP implementations over distributed
memory machines with software support for cache coherence also exist, but
scaling this implementation is hard and is a subject of ongoing research.
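To give a flavour of the message-passing paradigm mentioned above, the sketch below is a minimal program using the mpi4py Python bindings (this assumes an MPI runtime and the mpi4py package are installed; it is an illustrative sketch, not code from this course). Each rank computes a partial sum of its own slice of the data and the root combines the results with a reduction.

# Minimal message-passing example with mpi4py.
# Run, for example, with:  mpiexec -n 4 python partial_sum.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()       # this process's id, 0 .. size-1
size = comm.Get_size()       # total number of processes

n = 1_000_000
# Split 0..n-1 among the ranks; each rank sums only its own slice.
local = sum(range(rank, n, size))

total = comm.reduce(local, op=MPI.SUM, root=0)   # combine partial sums on rank 0
if rank == 0:
    print(total)             # equals n*(n-1)//2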

Future
The end of Moore's Law?
Nanoscale electronics
Exotic architectures? Quantum, DNA/molecular.

1.5 Solving problems in parallel:

A simple job can be solved in parallel in many ways, as the following methods illustrate.

1.5.1 Method 1
Utilizing temporal parallelism:
Consider that 1000 students appeared for the exam, and there are 4 questions in each
answer book. If a single teacher is to correct these answer books, the following
instructions are given to the teacher.
1. Take an answer book from the pile of answer books.
2. Correct the answer to Q1, namely A1.
3. Repeat step 2 for answers to Q2, Q3,Q4 namely A2,A3,A4.
4. Add marks.
5. Put answer book in pile of corrected answer books.
6. Repeat steps 1 to 5 until no answer books are left.
Now, instead, ask 4 teachers to cooperate in correcting each answer book by sitting in
one line (a pipeline).
The first teacher corrects the answer to Q1, namely A1, of the first paper and passes
the paper to the second teacher, who corrects A2, and so on.
While the first three papers are being corrected, some teachers are idle (the pipeline
is still filling).
Time taken to correct A1 = Time to correct A2 = Time to correct A3 = Time to
correct A4 = 5 minutes. Thus the first answer book takes 20 min.
Total time taken to correct 1000 papers will be 20 + (999*5) = 5015 min. This is
about 1/4th of the time taken by a single teacher (1000 * 20 = 20000 min).
Temporal means pertaining to time.
The method works well if:
o The jobs to be done are identical.
o A job can be divided into mutually independent tasks.
o Each task takes about the same time.
o The number of tasks is small compared to the total number of jobs.

Let the number of jobs = n
Time to do one job = p
Each job is divided into k tasks
Time for each task = p/k
Time to complete n jobs with no pipeline processing = np
Time to complete n jobs with pipeline processing by k teachers = p + (n-1)p/k = p(k+n-1)/k
Speedup due to pipeline processing = np / [p(k+n-1)/k] = nk/(k+n-1) = k / [1 + (k-1)/n]
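A quick Python check of these formulas, using the numbers from the example (n = 1000 answer books, k = 4 teachers, p = 20 minutes per book), reproduces the 5015-minute figure and a speedup just under 4.

# Speedup of temporal (pipeline) parallelism: n jobs, k pipeline stages,
# p time units per job (so p/k per task).
def pipeline_time(n, k, p):
    return p + (n - 1) * p / k                # fill the pipeline, then one job per p/k

def pipeline_speedup(n, k, p):
    return (n * p) / pipeline_time(n, k, p)   # equals k / (1 + (k-1)/n)

n, k, p = 1000, 4, 20
print(pipeline_time(n, k, p))                 # 5015.0 minutes (vs 20000 serially)
print(pipeline_speedup(n, k, p))              # ~3.99, approaching k for large n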
Problems encountered:

Synchronization:
o Each stage must take an identical time; otherwise faster stages have to wait for slower ones.
Bubbles in pipeline:
o If some tasks are absent in a job (e.g., an unanswered question), that stage idles and a "bubble" is formed in the pipeline.
Fault tolerance:
o The method does not tolerate failure of a stage; if one teacher stops, the whole pipeline stalls.
Inter task communication:
o The time to pass a job between stages must be small compared to the time to do a task.
Scalability:
o The number of stages cannot be increased beyond the number of tasks in a job.

1.5.2 Method 2
Utilizing Data Parallelism:
Divide the answer books into four piles and give one pile to each teacher.
If each teacher takes 20 min to correct an answer book, the time taken to correct 1000
papers is 250 * 20 = 5000 min, since each teacher corrects 250 papers but all four work
simultaneously.
Let the number of jobs = n
Time to do one job = p
Let there be k teachers
Time to distribute the jobs to k teachers = kq
Time to complete n jobs by a single teacher = np
Time to complete n jobs by k teachers = kq + np/k
Speedup due to parallel processing = np / (kq + np/k) = knp/(k^2*q + np) = k / [1 + k^2*q/(np)]
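The same kind of check works here; the sketch below assumes an illustrative distribution overhead q and shows how the k^2*q/(np) term erodes the ideal speedup of k as more teachers are added.

# Speedup of data parallelism with static assignment: k workers, n jobs of
# time p each, plus k*q time for the head examiner to distribute the piles.
def data_parallel_speedup(n, k, p, q):
    t_parallel = k * q + n * p / k
    return (n * p) / t_parallel               # equals k / (1 + k*k*q / (n*p))

n, p, q = 1000, 20, 1                         # q = 1 minute per pile (illustrative)
for k in (2, 4, 8, 16):
    print(k, round(data_parallel_speedup(n, k, p, q), 2))
# Speedup stays close to k only while k*k*q is small compared with n*p.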
Advantages:
o No synchronization is required among the teachers.
o There are no bubbles, as there is no pipeline.
o It is more fault tolerant: if one teacher takes a break, only that teacher's pile is delayed.
o There is no inter-teacher communication while grading.
Disadvantages:
o The assignment of jobs is static; if teachers work at different speeds, the overall
time is determined by the slowest teacher.
o The set of jobs must be partitionable into independent subsets.
o The time to divide the jobs into piles must be small compared to the time to do them.

1.5.3 Method 3
Combined Temporal And Data Parallelism:
Combining methods 1 and 2 gives this method.
Two pipelines of teachers are formed, and each pipeline is given half of the total number
of jobs.
This halves the time taken by a single pipeline and reduces the time to complete the set
of jobs.
It is very efficient for numerical computing, in which a number of long vectors and large
matrices are used as data and can be processed in parallel.

1.5.4 Method 4
Data Parallelism with Dynamic Assignment:
A head examiner gives one answer book to each teacher.
All teachers simultaneously correct the paper.
A teacher who completes a paper goes to the head examiner for another paper.
If a second teacher completes at the same time, he queues up in front of the head examiner.
Advantages:
Balancing of the work assigned to each teacher.
Teacher is not forced to be idle.
No bubbles
Overall time is minimized
Disadvantages:
o Teachers may have to wait in the queue for the head examiner.
o The head examiner can become a bottleneck.
o The head examiner is idle between handing out papers.
o It is difficult to increase the number of teachers, as the bottleneck worsens.
If the speedup of a method is directly proportional to the number of teachers (processors),
then the method is said to scale well.
Let the total number of papers = n
Let there be k teachers
Time waited to get a paper = q
Time for each teacher to get, grade and return a paper = (q+p)
Total time to correct all papers by k teachers = n(q+p)/k
Speedup due to parallel processing = np / [n(q+p)/k] = k / [1 + (q/p)]
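A dynamic-assignment scheme like this maps naturally onto a worker pool that pulls jobs one at a time. The Python sketch below uses multiprocessing.Pool as the "head examiner"; the grading time is randomized to show that faster workers automatically pick up more papers (load balancing). The counts and timings are illustrative only.

# Data parallelism with dynamic assignment: workers fetch one job at a
# time from a central dispatcher (the Pool), so a fast worker
# automatically grades more papers than a slow one.
import random
import time
from multiprocessing import Pool, current_process

def grade(paper_id):
    time.sleep(random.uniform(0.01, 0.05))   # grading time varies per paper
    return paper_id, current_process().name

if __name__ == "__main__":
    papers = range(40)
    with Pool(4) as pool:                    # 4 "teachers"
        results = pool.imap_unordered(grade, papers, chunksize=1)
        counts = {}
        for paper_id, worker in results:
            counts[worker] = counts.get(worker, 0) + 1
    print(counts)    # papers graded per teacher; uneven counts, balanced finish times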

1.5.5 Method 5
Data Parallelism with Quasi-Dynamic Scheduling:
Method 4 can be made better by giving each teacher an unequal set of papers to
correct. Teachers 1, 2, 3 and 4 may be given 7, 9, 11 and 13 papers respectively; when a
teacher finishes a set, further papers are given. This staggers the completion times and
reduces the probability of a queue forming at the head examiner. The time to hand out a
set of papers is much smaller compared to the time to actually grade them. This method
is in between a purely static and a purely dynamic schedule. The jobs are coarser grained
in the sense that a bunch of jobs is assigned at a time, and the completion time of a bunch
will be longer than if one job were assigned at a time.
Table 1.1 Difference between temporal and data parallelism:

TEMPORAL PARALLELISM                               DATA PARALLELISM
Independent tasks of a job are assigned            Full jobs are assigned to processors
to pipeline stages
Tasks must take (nearly) equal time                Jobs may take different times
Bubbles in the pipeline lead to idling             No bubbles
Task assignment is static                          Job assignment is static, dynamic
                                                   or quasi-dynamic
Not tolerant to processor failure                  Tolerates processor failure
Efficient with fine-grained tasks                  Efficient with coarse-grained tasks

Data Parallel Processing With Specialized Processor:


Data parallel processing is more fault tolerant, but it requires each teacher to be
capable of correcting answers to all questions with equal ease.

1.5.6 Method 6
Specialist Data Parallelism:
There is a head examiner who dispatches answer papers to teachers. We
assume that teacher 1 (T1) grades only A1, teacher 2 (T2) grades only A2, and in general
teacher i (Ti) grades the answer Ai to question Qi.
Procedure:
1. Give one answer book each to T1, T2, T3, T4.
2. When a corrected answer paper is returned, check whether all questions are graded.
If yes, add the marks and put the paper in the output pile.
3. If not, check which questions are not graded.
4. For each i, if Ai is ungraded and teacher Ti is idle, send the paper to teacher Ti, or
to any other teacher Tp who is idle.
5. Repeat steps 2, 3 and 4 until no answer paper remains in the input pile.

1.5.7 Method 7
Coarse Grained Specialist Temporal Parallelism:
All teachers work independently and simultaneously at their own pace. A teacher
who finishes his pile early may, however, end up spending a lot of time inefficiently
waiting for the other teachers to complete their work.
Procedure:
Answer papers are divided into 4 equal piles and put in the in-trays of the 4 teachers.
Each teacher repeats the following steps 4 times, all teachers working simultaneously.
For teachers Ti (i = 1 to 4) do in parallel:
1. Take an answer paper from the in-tray.
2. Grade answer Ai to question Qi and put the paper in the out-tray.
3. Repeat steps 1 and 2 till no papers are left in the in-tray.
4. Check whether teacher ((i+1) mod 4)'s in-tray is empty.
5. As soon as it is empty, empty your own out-tray into the in-tray of that teacher.

1.5.8 Method 8
Agenda Parallelism:
Each answer book is thought of as an agenda of questions to be graded. All teachers
are asked to work on the first item on the agenda, namely, grade the answer to the first
question in all the papers. The head examiner gives one paper to each teacher and asks
him to grade the answer A1 to Q1. When a teacher finishes this, he is given another
paper. This is a data parallel method with dynamic scheduling and fine grain tasks.

1.6 Inter Task Dependency

The following assumptions were made in assigning tasks to teachers in the methods above:
o The answer to a question is independent of the answers to other questions.
o Teachers do not have to interact with one another.
o The same instructions are used to grade all answer books.
In general, however, the tasks of a job are interrelated. Some tasks can be done
independently and simultaneously, while others have to wait for the completion of
previous tasks. The interrelations of the various tasks of a job may be represented
graphically as a task graph:
o Circles represent tasks.
o The direction of an arrow shows precedence: the task at the tail of the arrow must
complete before the task at its head can begin.
Procedure: Recipe for Chinese vegetable fried rice:
T1: Clean and wash rice
T2: Boil water in a vessel with 1 teaspoon salt
T3: Put rice in boiling water with some oil and cook till soft
T4: Drain rice and cool
T5: Wash and scrape carrots
T6: Wash and string French beans
T7: Boil water with teaspoon salt in 2 vessels
T8: Drop carrots and French beans in boiling water
T9: Drain and cool carrots and French beans
T10: Dice carrots
T11: Dice French beans
T12: Peel onions and dice into small pieces
T13: Clean cauliflower and cut into small pieces
T14: Heat oil in an iron pan and fry the diced onion and cauliflower for 1 min in the heated oil
T15: Add diced carrots and French beans to the above and fry for 2 min
T16: Add the cooled cooked rice, chopped onions and soya sauce to the above and
stir and fry for 5 min


There are 16 tasks in this job; some of them have to be carried out in sequence, while
others can be carried out simultaneously. A graph showing the precedence relationship
among the tasks is given in the accompanying figure.
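Since the figure itself is not reproduced here, the Python sketch below encodes one plausible reading of the precedence constraints in the recipe as a dictionary and schedules the tasks level by level. The specific dependency edges are my own reading of the recipe, not taken from the original figure.

# One plausible task graph for the fried-rice recipe (edges are an
# assumed reading of the recipe, not the textbook figure):
# deps[t] lists the tasks that must finish before task t can start.
deps = {
    "T1": [], "T2": [], "T3": ["T1", "T2"], "T4": ["T3"],
    "T5": [], "T6": [], "T7": [], "T8": ["T5", "T6", "T7"],
    "T9": ["T8"], "T10": ["T9"], "T11": ["T9"],
    "T12": [], "T13": [], "T14": ["T12", "T13"],
    "T15": ["T10", "T11", "T14"], "T16": ["T4", "T15"],
}

# Level-by-level schedule: every task printed in the same step could,
# in principle, be done simultaneously by different cooks (processors).
done, level = set(), 0
while len(done) < len(deps):
    ready = [t for t in deps if t not in done and all(d in done for d in deps[t])]
    print(f"step {level}: {ready}")
    done.update(ready)
    level += 1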
