Unit01-Parallel Computing Introduction
Parallel Computing
Parallel computing is a type of computing architecture in which several processors execute or process an application or computation simultaneously. It refers to the process of breaking a larger problem down into smaller, independent and often similar parts, which can be executed simultaneously by multiple processors communicating via shared memory. After processing, the results are combined as part of an overall algorithm. Parallel computing is also known as parallel processing.
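As a concrete illustration (a minimal sketch; the array, thread count and names are made up, not taken from the text), the following C program splits an array sum into two independent parts, computes the partial sums simultaneously on two POSIX threads, and then combines the results:

    /* compile with: cc -pthread sum.c */
    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    static int data[N] = {1, 2, 3, 4, 5, 6, 7, 8};   /* problem data (illustrative) */
    static long partial[2];                          /* one partial result per thread */

    /* Each thread sums its own half of the array - a smaller, independent part */
    static void *partial_sum(void *arg) {
        int id = *(int *)arg;
        long s = 0;
        for (int i = id * (N / 2); i < (id + 1) * (N / 2); i++)
            s += data[i];
        partial[id] = s;
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        int id[2] = {0, 1};
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, partial_sum, &id[i]);  /* parts run simultaneously */
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("total = %ld\n", partial[0] + partial[1]);       /* results combined: 36 */
        return 0;
    }

The main levels at which parallelism is exploited are: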
1. Bit-level parallelism: increases the processor word size, which reduces the number of instructions the
processor must execute in order to perform an operation on variables larger than the word length.
2. Instruction-level parallelism: the two forms are the hardware approach and the software approach (see the sketch after these lists).
a. The hardware approach implements dynamic parallelism where the processor decides at
run-time which instructions to execute in parallel.
b. The software approach implements static parallelism where the compiler decides which
instructions to execute in parallel.
3. Task parallelism: the parallelization of computer code across multiple processors that runs several
different tasks at the same time on the same data.
4. Superword-level parallelism: a vectorization technique that can exploit parallelism of inline code. It
involves identifying scalar instructions in a large basic block that perform the same operation, and
combining them into a superword operation on a multi-word object, if dependences do not prevent
it.
Based on how frequently their subtasks need to communicate or synchronize, applications are also classified as exhibiting:
i. Fine-grained parallelism: where subtasks communicate several times per second
ii. Coarse-grained parallelism: where subtasks do not communicate several times per second
iii. Embarrassing parallelism: where subtasks rarely or never communicate
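As a small illustration of instruction-level parallelism (the values below are made up), the two multiplications in this C fragment are independent of each other, so a superscalar processor (the hardware approach) or an instruction-scheduling compiler (the software approach) may execute them in parallel, while the final addition must wait for both:

    #include <stdio.h>

    int main(void) {
        int x = 3, y = 4, p = 5, q = 6;
        /* These two multiplications do not depend on each other, so they can be
           issued in parallel by the hardware or scheduled in parallel by the compiler. */
        int a = x * y;
        int b = p * q;
        /* This addition depends on both results, so it cannot start until they finish. */
        int c = a + b;
        printf("%d\n", c);   /* prints 42 */
        return 0;
    }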
Parallel Computer
A parallel computer is a set of processors that are able to work cooperatively to solve a computational
problem. Parallel computers offer the potential to concentrate computational resources on important
computational problems. Computational resources include processors, memory, or I/O bandwidth, etc.
This definition is broad enough to include parallel supercomputers that have hundreds or thousands of processors,
networks of workstations, multiple-processor workstations, and embedded systems.
A parallel computer is simply a collection of processors, typically of the same type, interconnected in a certain
fashion to allow the coordination of their activities and the exchange of data. The processors are assumed to
be located within a small distance of one another, and are primarily used to solve a given problem jointly.
In distributed systems, by contrast, a set of possibly many different types of processors is distributed over a large
geographic area. In distributed systems, the primary goals are:
to use the available distributed resources, and
to collect information and transmit it over a network connecting the various processors
Multicomputer Model
The multicomputer is an idealized parallel computer model. Each node consists of a von Neumann machine
(a CPU and memory). A node can communicate with other nodes by sending and receiving messages over an
interconnection network. Each computer executes its own program. This program may access local memory
and may send and receive messages over the network.
The multicomputer is most similar to what is often called the distributed-memory MIMD (multiple instruction, multiple data) computer:
MIMD means that each processor can execute a separate stream of instructions on its own local data.
Distributed memory means that memory is distributed among the processors, rather than placed in a central location.
The principal difference between the multicomputer and the distributed-memory MIMD computer is the cost
of sending and receiving messages among the nodes: in a real machine, the cost of messaging between
two nodes depends on the location of the nodes and on other network traffic, whereas the idealized model treats it
as uniform. Examples of this class of machine include the IBM SP, Intel Paragon, Thinking Machines CM5, Cray T3D, Meiko CS-2, and nCUBE.
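A minimal message-passing sketch in C using MPI (illustrative only; these notes do not prescribe a particular library). Each process executes its own copy of the program, works only on its local variables, and communicates with the other node purely by sending and receiving messages:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which node am I? */

        if (rank == 0) {
            value = 42;                                           /* data local to node 0 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* message to node 1 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("node 1 received %d from node 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Run with, for example, mpirun -np 2 ./a.out so that two processes (nodes) are started.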
Multiprocessor Model
The multiprocessor, or shared-memory MIMD computer, is another important class of parallel computer. In the
idealized multiprocessor model, any processor can access any memory element in the same amount of time. Real
implementations usually introduce some form of memory hierarchy: for example, copies of frequently used data
items are stored in a cache associated with each processor, and access to this cache is much faster than access
to the shared memory. Programs developed for multicomputers can also execute efficiently on multiprocessors,
because shared memory permits an efficient implementation of message passing. Examples of this class of machine
include the Silicon Graphics Challenge, Sequent Symmetry, and many multiprocessor workstations.
Von Neumann Architecture
The von Neumann architecture is based on the stored-program computer concept, where program instructions and
data are stored in the same memory. This design is still used in most computers produced today, and it is the design
upon which many general-purpose computers are based. The key elements of the von Neumann architecture are:
data and instructions are both stored as binary digits
data and instructions are both stored in primary storage
instructions are fetched from memory one at a time and in order (serially)
the processor decodes and executes an instruction, before cycling around to fetch the next
instruction
the cycle continues until no more instructions are available
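The cycle can be sketched in C for a toy stored-program machine (the two-instruction machine, its opcodes and its memory layout are invented purely for illustration): the program and its data share one memory array, and instructions are fetched one at a time, in order, until a HALT is reached.

    #include <stdio.h>

    /* Invented opcodes for a toy stored-program machine */
    enum { HALT = 0, LOAD = 1, ADD = 2 };

    int main(void) {
        /* Program and data live in the same memory (stored-program concept):
           addresses 0-5 hold instructions, addresses 6-7 hold data. */
        int memory[8] = { LOAD, 6, ADD, 7, HALT, 0, 10, 32 };
        int pc = 0;   /* Program Counter */
        int ac = 0;   /* Accumulator    */

        for (;;) {
            int opcode  = memory[pc];       /* fetch the next instruction, in order */
            int operand = memory[pc + 1];
            pc += 2;                        /* PC now points at the following instruction */

            /* decode and execute before cycling around to fetch the next instruction */
            if (opcode == LOAD)      ac = memory[operand];
            else if (opcode == ADD)  ac += memory[operand];
            else                     break; /* HALT: no more instructions are available */
        }
        printf("AC = %d\n", ac);            /* prints 42 */
        return 0;
    }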
Registers
Registers are high speed storage areas in the CPU. All data must be stored in a register before it can be
processed.
MAR (Memory Address Register): holds the memory location of data that needs to be accessed
MDR (Memory Data Register): holds data that is being transferred to or from memory
AC (Accumulator): where intermediate arithmetic and logic results are stored
PC (Program Counter): contains the address of the next instruction to be executed
CIR (Current Instruction Register): contains the current instruction during processing
Table 1: Registers in the CPU – von Neumann architecture
Buses
Buses are the means by which data is transmitted from one part of a computer to another, connecting all
major internal components to the CPU and memory. A standard CPU system bus is comprised of a control
bus, data bus and address bus.
Address Bus: carries the addresses of data between the processor and memory
Data Bus: carries data between the processor, the memory unit and the input/output devices
Control Bus: carries control signals or commands from the CPU, and status signals from other devices, in order to control and coordinate all the activities within the computer
Table 2: Types of buses
Memory Unit
The memory unit consists of RAM, sometimes referred to as primary or main memory. Unlike secondary
memory, primary memory is faster and directly accessible by the CPU. RAM is split into partitions. Each
partition consists of an address and its contents (both in binary form). The addresses uniquely identify every
location in the memory. Loading data from permanent memory (hard drive), into the faster and directly
accessible temporary memory (RAM), allows the CPU to operate much faster.
Flynn's Taxonomy
Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be
classified along the two independent dimensions of Instruction Stream and Data Stream. Each of these
dimensions can have only one of two possible states – Single or Multiple – giving four classes: SISD, SIMD,
MISD and MIMD. The SIMD class is commonly subdivided into three further subcategories:
1. Array Processor
These receive the one (same) instruction but each parallel processing unit has its own separate and
distinct memory and register file. The modern term for an array processor is "single instruction,
multiple threads" (SIMT).
2. Pipelined Processor
These receive the one (same) instruction but then read data from a central resource, each processes
fragments of that data, then writes back the results to the same central resource. In Flynn's 1977
paper the resource is main memory. For modern CPUs the resource is now more typically the
register file. An alternative name for this type of register-based SIMD is "packed SIMD".
3. Associative Processor
These receive the one (same) instruction but in each parallel processing unit an independent
decision is made, based on data local to the unit, as to whether to perform the execution or whether
to skip it. The modern term for associative processor is "Predicated" (or masked) SIMD.
Some modern designs (GPUs in particular) take features of more than one of these subcategories. GPUs of
today are SIMT (single instruction, multiple threads) but are also associative, i.e., each processing element in
the SIMT array is also predicated.
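As a small sketch of register-based "packed SIMD" in C (assuming an x86 processor with SSE support; the data values are made up), a single instruction below adds four pairs of floats packed into 128-bit registers:

    #include <xmmintrin.h>   /* SSE intrinsics */
    #include <stdio.h>

    int main(void) {
        float a[4] = {1, 2, 3, 4};
        float b[4] = {10, 20, 30, 40};
        float c[4];

        __m128 va = _mm_loadu_ps(a);      /* load 4 floats into one 128-bit register */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   /* one instruction adds all 4 pairs (SIMD) */
        _mm_storeu_ps(c, vc);

        printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);   /* 11 22 33 44 */
        return 0;
    }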
CPU
Modern day CPUs consist of one or more cores. A core is a distinct execution unit with its own instruction
stream. Cores within a CPU may be organized into one or more sockets, each socket with its own distinct
memory. When a CPU consists of two or more sockets, the hardware infrastructure usually supports memory
sharing across sockets.
Node
A node is a standalone "computer in a box", usually comprising multiple CPUs/processors/cores,
memory, network interfaces, etc. Nodes are networked together to form a supercomputer.
Task
A task is a logically discrete section of computational work. A task is typically a program or program-like set
of instructions that is executed by a processor. A parallel program consists of multiple tasks running on
multiple processors.
Pipelining
Pipelining is the breaking of a task into steps performed by different processor units, with inputs streaming
through, much like an assembly line; a type of parallel computing.
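A small worked example with illustrative numbers: suppose a task is broken into 4 steps of equal length (1 time unit each) and 100 inputs stream through. A single unit performing all 4 steps itself needs 100 × 4 = 400 time units, whereas a 4-stage pipeline delivers its first result after 4 time units and one more result every unit after that, finishing in 4 + 99 = 103 time units.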
Shared Memory
In hardware, it describes a computer architecture where all processors have direct access to common physical
memory. As a programming model, it describes a model in which parallel tasks all have the same "picture" of memory
and can directly address and access the same logical memory locations, regardless of where the physical memory
actually exists.
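A minimal shared-memory sketch using POSIX threads (the counter, thread count and iteration count are illustrative): all four tasks directly address the same logical memory location, and a mutex coordinates their updates (a form of the synchronization described below).

    /* compile with: cc -pthread counter.c */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                             /* one shared memory location */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);                   /* coordinate access */
            counter++;                                   /* directly addressed by every task */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);              /* 400000 */
        return 0;
    }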
Distributed Memory
In hardware, distributed memory refers to network-based memory access for physical memory that is not common
(i.e., not shared among the processors). As a programming model, tasks can only logically "see" local machine memory and must use
communications to access memory on other machines where other tasks are executing.
Communications
Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as
through a shared memory bus or over a network.
Synchronization
The coordination of parallel tasks in real time, very often associated with communications. Synchronization
usually involves waiting by at least one task. Therefore, it may increase the wall clock execution time of a
parallel application.
Computational Granularity
In parallel computing, granularity is a quantitative or qualitative measure of the ratio of computation to
communication.
Coarse: relatively large amounts of computational work are done between communication events
Fine: relatively small amounts of computational work are done between communication events
Observed Speedup
It is one of the simplest and most widely used indicators for measuring the performance of a parallel
program. Speedup is defined as the ratio between the "wall-clock time of serial execution" and the "wall-clock
time of parallel execution".
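For example (hypothetical timings): if the serial version of a program needs 120 seconds of wall-clock time and the parallel version runs in 20 seconds on 8 processors, the observed speedup is 120 / 20 = 6. The gap between the speedup (6) and the number of processors (8) is typically accounted for by parallel overhead, discussed next.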
Parallel Overhead
It is the execution time required to coordinate parallel tasks, as opposed to time spent doing useful work.
Parallel overhead can include factors such as Task start-up time, Synchronizations, Data communications,
Software overhead imposed by parallel languages, libraries, operating system, etc. and Task termination
time.
Massively Parallel
It refers to the hardware that comprises a given parallel system, having many processing elements. The
meaning of "many" keeps increasing, but currently, the largest parallel computers are comprised of
processing elements numbering in the hundreds of thousands to millions.
Scalability
Scalability refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate
increase in parallel speedup with the addition of more resources. Factors that contribute to scalability
include hardware (particularly memory-CPU bandwidths and network communication properties), the
application algorithm, parallel overhead, and characteristics of the specific application.
Parallel computing infrastructure is typically housed within a single datacenter where several processors are
installed in a server rack. Computation requests are distributed by the application servers in small chunks,
which are then executed simultaneously on each server.
The importance of parallel computing continues to grow with the increasing usage of multicore processors
and GPUs. GPUs work together with CPUs to increase the throughput of data and the number of concurrent
calculations within an application. Using the power of parallelism, a GPU can complete more work than a
CPU in a given amount of time.
Typical application areas include:
Engineering fields
o Statistical power grid protection tests
o Aircraft design and simulation
o Motor drive controller design methods and space robot integration, etc.
Computer gaming
Industrial market for operator training and off-line controller tuning
Global Applications
Parallel computing is now being used extensively around the world, in a wide variety of applications. Some
of them are:
Research
Finance
Logistic services
Information processing services
Aerospace
Telecommunication
Defense
Health and medicine, and so on
1000:1 or greater, depending on the relative performance of the local computer, the network, and the
mechanisms used to move data to and from the network.