Week 1 - Parallel and Distributed Computing
Distributed Computing
FALL 2021
NATIONAL UNIVERSITY OF COMPUTER AND EMERGING
SCIENCES
Instructor
Muhammad Danish Khan
Lecturer, Department of Computer Science
FAST NUCES Karachi
[email protected]
Week 14: Distributed System Models and Enabling Technologies, Assignment Task(s)
Week 15: Distributed System Models and Enabling Technologies, Quiz-3, Project
Evaluations
Week 16: Distributed System Models and Enabling Technologies, Project Evaluations
LMS: Google Classroom
Section 5C Class Code: eg2u47t
https://classroom.google.com/c/Mzg4NTA1MTMzNDI3?cjc=eg2u47t
Operating System Concepts Revision
Program
◦ Set of instructions and associated data
◦ resides on the disk and is loaded by the operating system to perform some task.
◦ E.g., an executable file or a Python script file.
Process
◦ A program in execution.
◦ In order to run a program, the operating system's kernel is first asked to create a new
process, which is an environment in which a program executes.
◦ Consists of instructions, user-data and system-data segments, plus resources acquired at
runtime such as CPU time, memory, address space, and disk.
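A minimal POSIX-style sketch of the kernel being asked to create a new process, assuming a Unix-like system; the program run in the child ("ls") is an illustrative choice, not from the slides.

#include <unistd.h>     // fork, execlp
#include <sys/wait.h>   // waitpid
#include <cstdio>

int main() {
    pid_t pid = fork();               // ask the kernel to create a new process
    if (pid == 0) {
        // Child process: replace its image with the "ls" program (illustrative choice)
        execlp("ls", "ls", "-l", (char *)nullptr);
        std::perror("execlp");        // only reached if exec fails
        return 1;
    }
    int status = 0;
    waitpid(pid, &status, 0);         // parent waits for the child to finish
    std::printf("child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}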
Thread
◦ the smallest unit of execution in a process.
◦ A thread simply executes instructions serially.
◦ A process can have multiple threads running as part of it.
◦ Processes don't share any resources amongst themselves whereas threads
of a process can share the resources allocated to that particular process,
including memory address space.
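A minimal C++ sketch of the last point: two threads of one process update the same counter because they share the process's address space (the names below are illustrative).

#include <atomic>
#include <iostream>
#include <thread>

int main() {
    std::atomic<int> counter{0};      // lives in the address space shared by both threads

    auto work = [&counter] {
        for (int i = 0; i < 100000; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    };

    std::thread t1(work);             // both threads run inside the same process
    std::thread t2(work);
    t1.join();
    t2.join();

    std::cout << "counter = " << counter << '\n';   // prints 200000
    return 0;
}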
Multiprocessing Systems
◦ where multiple processes get scheduled on more than one CPU.
◦ Usually, this requires hardware support where a single system comes with
multiple cores
◦ or the execution takes place in a cluster of machines.
Parallel Execution
Parallelism
The term parallelism means that an application splits its tasks up
into smaller subtasks which can be processed in parallel, for instance
on multiple CPUs at the exact same time.
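A minimal C++ sketch of this idea, assuming a machine with at least two CPUs: the task of summing an array is split into two independent subtasks that can run at the same time.

#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(1000000, 1);
    long long part1 = 0, part2 = 0;
    auto mid = data.begin() + data.size() / 2;

    // Each thread sums one half of the data; the two subtasks are independent.
    std::thread t1([&] { part1 = std::accumulate(data.begin(), mid, 0LL); });
    std::thread t2([&] { part2 = std::accumulate(mid, data.end(), 0LL); });
    t1.join();
    t2.join();

    std::cout << "total = " << (part1 + part2) << '\n';   // prints 1000000
    return 0;
}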
Serial Execution vs. Parallel Execution
Transmission speeds - the speed of a serial computer is directly dependent upon how
fast data can move through hardware.
◦ Absolute limits are the speed of light (30 cm/nanosecond) and the transmission limit of copper wire
(9 cm/nanosecond). Increasing speeds necessitate increasing proximity of processing elements.
◦ A problem can often be solved in less time with multiple compute resources than with a single
compute resource.
LD  $12, (100)   ; load the value at memory address 100 into register $12
ADD $11, $12     ; uses $12 produced by the load
SUB $10, $11     ; uses $11 produced by the add
INC $10          ; uses $10 produced by the subtract
SW  $13, ($10)   ; stores $13 at the address now held in $10; each step depends on the previous one, so the stream executes serially
#include <iostream>   // for std::cin

int sample2();        // forward declaration so sample1() can call it

int sample1()
{
    int x = sample2();
    return x;
}

float sample3()
{
    float pi = 3.14f;
    return pi;
}

int sample2()
{
    int i;
    std::cin >> i;
    return i;
}
Parallel Computing: what for?
Example applications include:
… ..
Flynn's Taxonomy
Based on the number of concurrent instruction streams (single or multiple)
and data streams (single or multiple) available in the architecture
Single Instruction, Single Data (SISD)
It represents the organization of a single computer containing a control unit,
processor unit and a memory unit.
Single instruction: only one instruction stream is being acted on by the CPU
during any one clock cycle
Single data: only one data stream is being used as input during any one
clock cycle
This is the oldest and, until recently, the most prevalent form of computer
◦ Examples: most PCs, single CPU workstations and mainframes
Single Instruction, Multiple Data (SIMD)
Single instruction: All processing units execute the same instruction at any given clock cycle
Multiple data: Each processing unit can operate on a different data element
The processing units are made to operate under the control of a common control unit, thus
providing a single instruction stream and multiple data streams.
◦ Best suited for specialized problems characterized by a high degree of regularity, such as image processing.
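A minimal sketch of the SIMD pattern, assuming an x86 compiler with SSE support (<immintrin.h>): the single instruction behind _mm_add_ps adds four float elements at once.

#include <immintrin.h>
#include <cstdio>

int main() {
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);    // four data elements in one register
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
    __m128 c = _mm_add_ps(a, b);                      // one instruction, four additions

    float out[4];
    _mm_storeu_ps(out, c);                            // copy the result back to memory
    std::printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);   // 11.0 22.0 33.0 44.0
    return 0;
}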
Parallel Task
◦ A task that can be executed by multiple processors safely (yields correct results)
Serial Execution
◦ Execution of a program sequentially, one statement at a time. In the simplest sense, this is what
happens on a one processor machine. However, virtually all parallel tasks will have sections of a
parallel program that must be executed serially.
Parallel Execution
◦ Execution of a program by more than one task, with each task being able to execute the same or
different statement at the same moment in time.
Shared Memory
◦ From a strictly hardware point of view, describes a computer architecture where all processors have
direct (usually bus based) access to common physical memory.
◦ In a programming sense, it describes a model where parallel tasks all have the same "picture" of memory and can directly address and access the same
logical memory locations regardless of where the physical memory actually exists.
Distributed Memory
◦ In hardware, refers to network based memory access for physical memory that is not common. As a
programming model, tasks can only logically "see" local machine memory and must use
communications to access memory on other machines where other tasks are executing.
Communications
◦ Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as
through a shared memory bus or over a network, however the actual event of data exchange is
commonly referred to as communications regardless of the method employed.
Synchronization
◦ The coordination of parallel tasks in real time, very often associated with communications. Often
implemented by establishing a synchronization point within an application where a task may not
proceed further until another task(s) reaches the same or logically equivalent point.
◦ Synchronization usually involves waiting by at least one task, and can therefore cause a parallel
application's wall clock execution time to increase.
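A minimal C++20 sketch of such a synchronization point (std::barrier; the task names and sleep times are illustrative): neither task continues past arrive_and_wait() until both have reached it, so the faster task spends wall-clock time waiting.

#include <barrier>
#include <chrono>
#include <iostream>
#include <thread>

int main() {
    std::barrier sync_point(2);   // both tasks must arrive before either may continue

    auto task = [&](const char *name, int work_ms) {
        std::this_thread::sleep_for(std::chrono::milliseconds(work_ms));  // simulated work
        std::cout << name << " reached the synchronization point\n";
        sync_point.arrive_and_wait();                                     // wait for the other task
        std::cout << name << " proceeds\n";
    };

    std::thread t1(task, "task 1", 50);
    std::thread t2(task, "task 2", 200);
    t1.join();
    t2.join();
    return 0;
}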
Granularity
◦ In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.
◦ Coarse: relatively large amounts of computational work are done between communication events
◦ Fine: relatively small amounts of computational work are done between communication events
Observed Speedup
◦ Observed speedup of a code which has been parallelized, defined as:
wall-clock time of serial execution / wall-clock time of parallel execution
◦ One of the simplest and most widely used indicators for a parallel program's performance.
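A worked example of the definition above, using hypothetical numbers rather than figures from the slides:

\[
S \;=\; \frac{T_{\text{serial}}}{T_{\text{parallel}}} \;=\; \frac{80\ \text{s}}{20\ \text{s}} \;=\; 4
\]

Here a run that needs 80 s serially and 20 s on 8 processors shows an observed speedup of 4, less than 8 because serial sections and parallel overhead limit the gain.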
Parallel Overhead
◦ The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel
overhead can include factors such as:
◦ Task start-up time
◦ Synchronizations
◦ Data communications
◦ Software overhead imposed by parallel compilers, libraries, tools, operating system, etc.
◦ Task termination time
Massively Parallel
◦ Refers to the hardware that comprises a given parallel system - having many processors. The meaning of
many keeps increasing, but currently BG/L* pushes this number to 6 digits.
*Blue Gene is an IBM project aimed at designing supercomputers that can reach operating
speeds in the petaFLOPS (PFLOPS) range, with low power consumption.
Scalability
◦ Refers to a parallel system's (hardware and/or software) ability to
demonstrate a proportionate increase in parallel speedup with the addition of
more processors.
Shared Memory
Multiple processors can operate independently but share the same memory resources.
Changes in a memory location effected by one processor are visible to all other processors.
Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.
Shared Memory: UMA vs. NUMA
Uniform Memory Access (UMA):
◦ Most commonly represented today by Symmetric Multiprocessor (SMP) machines
◦ Identical processors with equal access and access times to memory
◦ Sometimes called CC-UMA - Cache Coherent UMA.
Disadvantages:
◦ Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can
geometrically increase traffic on the shared memory-CPU path and, for cache-coherent systems,
geometrically increase traffic associated with cache/memory management.
◦ Programmer responsibility for synchronization constructs that ensure "correct" access of global memory (see the sketch after this list).
◦ Expense: it becomes increasingly difficult and expensive to design and produce shared memory
machines with ever increasing numbers of processors.
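A minimal C++ sketch of such a synchronization construct on a shared-memory machine: a std::mutex guards updates to a shared accumulator (the names are illustrative).

#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    long long total = 0;          // shared accumulator in global memory
    std::mutex m;                 // protects 'total'

    auto add_chunk = [&](int n) {
        long long local = 0;
        for (int i = 0; i < n; ++i) local += i;   // compute privately first
        std::lock_guard<std::mutex> lock(m);      // synchronize the shared update
        total += local;
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t) pool.emplace_back(add_chunk, 1000);
    for (auto &th : pool) th.join();

    std::cout << "total = " << total << '\n';     // prints 1998000
    return 0;
}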
Distributed Memory
Like shared memory systems, distributed memory systems vary widely but share a common characteristic. Distributed
memory systems require a communication network to connect inter-processor memory.
Processors have their own local memory. Memory addresses in one processor do not map to another processor, so
there is no concept of global address space across all processors.
Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have
no effect on the memory of other processors. Hence, the concept of cache coherency does not apply.
When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define
how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.
The network "fabric" used for data transfer varies widely, though it can can be as simple as Ethernet.
Distributed Memory: Pro and Con
Advantages
◦ Memory is scalable with number of processors. Increase the number of processors and the size of
memory increases proportionately.
◦ Each processor can rapidly access its own memory without interference and without the overhead
incurred with trying to maintain cache coherency.
◦ Cost effectiveness: can use commodity, off-the-shelf processors and networking.
Disadvantages
◦ The programmer is responsible for many of the details associated with data communication between
processors.
◦ It may be difficult to map existing data structures, based on global memory, to this memory
organization.
◦ Non-uniform memory access (NUMA) times
Hybrid Distributed-Shared Memory
Comparison of Shared and Distributed Memory Architectures