CS326 Parallel and Distributed Computing: SPRING 2021 National University of Computer and Emerging Sciences
Distributed Computing
SPRING 2021
NATIONAL UNIVERSITY OF COMPUTER AND EMERGING SCIENCES
Parallel computing is the simultaneous use of multiple computing resources to
solve a computational problem.
◦ To be run using multiple CPUs/Cores
Parallel Task
◦ A task that can be executed by multiple processors safely (yields correct results)
Serial Execution
◦ Execution of a program sequentially, one statement at a time. In the simplest sense,
this is what happens on a one-processor machine. However, virtually all parallel
programs have sections that must be executed serially.
Parallel Execution
◦ Execution of a program by more than one task, with each task being able to execute the same or
different statement at the same moment in time.
Shared Memory
◦ From a strictly hardware point of view, describes a computer architecture where all processors have
direct (usually bus based) access to common physical memory.
◦ In a programming sense, it describes a model where parallel tasks all have the same "picture" of memory and can directly address and access the same
logical memory locations regardless of where the physical memory actually exists.
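For illustration, a minimal shared-memory sketch (pthreads; the shared counter and thread count are illustrative): both threads read and write the very same memory location by name, with no explicit data transfer.

/* Shared-memory sketch: threads in one process address the same variable
   directly; a mutex protects the concurrent update. */
#include <pthread.h>
#include <stdio.h>

static int shared_counter = 0;                 /* one logical memory location */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);                 /* protect the shared update */
    shared_counter++;                          /* both threads touch the same address */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_counter = %d\n", shared_counter);   /* prints 2 */
    return 0;
}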
Distributed Memory
◦ In hardware, refers to network based memory access for physical memory that is not common. As a
programming model, tasks can only logically "see" local machine memory and must use
communications to access memory on other machines where other tasks are executing.
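By contrast, a minimal distributed-memory sketch (assuming an MPI installation such as MPICH or Open MPI; the value 42 is illustrative): a task cannot see another task's memory, so the data must be sent and received explicitly.

/* Message-passing sketch: task 0 owns a value, task 1 must communicate to get it. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                       /* data lives in task 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* task 1 cannot "see" task 0's memory; it must receive a copy */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("task 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}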
Communications
◦ Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as
through a shared memory bus or over a network; however, the actual event of data exchange is
commonly referred to as communications regardless of the method employed.
Synchronization
◦ The coordination of parallel tasks in real time, very often associated with communications. Often
implemented by establishing a synchronization point within an application where a task may not
proceed further until another task(s) reaches the same or logically equivalent point.
◦ Synchronization usually involves waiting by at least one task, and can therefore cause a parallel
application's wall clock execution time to increase.
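A common form of synchronization point is a barrier; a minimal sketch using OpenMP (the thread count and print statements are illustrative only):

/* Synchronization-point sketch: every thread waits at the barrier,
   so no thread starts phase 2 until all have finished phase 1. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();
        printf("thread %d: phase 1 done\n", id);
        #pragma omp barrier                          /* every thread waits here ... */
        printf("thread %d: phase 2 starts\n", id);   /* ... before any continues */
    }
    return 0;
}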
Parallel Overhead
◦ The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel
overhead can include factors such as:
◦ Task start-up time
◦ Synchronizations
◦ Data communications
◦ Software overhead imposed by parallel compilers, libraries, tools, operating system, etc.
◦ Task termination time
Massively Parallel
◦ Refers to the hardware that comprises a given parallel system - having many processors. The meaning
of "many" keeps increasing, but currently BG/L* pushes this number to 6 digits.
*Blue Gene is an IBM project aimed at designing supercomputers that can reach operating
speeds in the petaFLOPS (PFLOPS) range, with low power consumption.
Scalability
◦ Refers to a parallel system's (hardware and/or software) ability to
demonstrate a proportionate increase in parallel speedup with the addition
of more processors.
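Speedup and efficiency are the usual measures of this; a small sketch of the standard definitions S(p) = T1/Tp and E(p) = S(p)/p, using made-up timings rather than measurements:

/* Speedup and efficiency from serial and parallel wall-clock times.
   The timings below are hypothetical numbers for illustration. */
#include <stdio.h>

int main(void) {
    double t_serial = 100.0;   /* T1: time on one processor (hypothetical) */
    double t_parallel = 30.0;  /* Tp: time on p processors (hypothetical) */
    int p = 4;

    double speedup = t_serial / t_parallel;    /* S(p) = T1 / Tp */
    double efficiency = speedup / p;           /* E(p) = S(p) / p */

    printf("speedup = %.2f, efficiency = %.2f\n", speedup, efficiency);
    return 0;  /* ideal scaling would give S(p) = p, E(p) = 1 */
}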
Multiple processors can operate independently but share the same memory resources.
Changes in a memory location effected by one processor are visible to all other processors.
Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.
Shared Memory: UMA vs. NUMA
Uniform Memory Access (UMA):
◦ Identical processors with equal access and access times to memory
◦ Sometimes called CC-UMA - Cache Coherent UMA.
◦ In principle, if a piece of data is repeatedly used, the effective latency of the memory
system can be reduced by the cache.
◦ The fraction of data references satisfied by the cache is called the cache hit ratio of the
computation on the system.
◦ Data reuse is critical for cache performance because if each data item is used only once, it
would still have to be fetched from DRAM on every use (a worked example follows this list).
◦ The cache consists of m blocks, called lines.
◦ In referring to the basic unit of the cache, the term line is used rather than the term block.
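As a worked example of the hit ratio's effect, take the figures used in the bandwidth example below (a 1-cycle cache and 100-cycle DRAM): the effective latency is the hit-weighted average of the two access times.

/* Effective memory latency as a function of cache hit ratio
   (1-cycle cache, 100-cycle DRAM, matching the example that follows). */
#include <stdio.h>

int main(void) {
    double cache_cycles = 1.0, dram_cycles = 100.0;
    double hit_ratios[] = {0.0, 0.5, 0.9, 0.99};

    for (int i = 0; i < 4; i++) {
        double h = hit_ratios[i];
        /* weighted average: hits served by the cache, misses by DRAM */
        double effective = h * cache_cycles + (1.0 - h) * dram_cycles;
        printf("hit ratio %.2f -> effective latency %.2f cycles\n", h, effective);
    }
    return 0;
}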
Impact of Memory Bandwidth
One commonly used technique to improve memory bandwidth is to
increase the size of the memory blocks.
◦ Consider again a memory system with a single-cycle cache and 100-cycle-latency DRAM,
with the processor operating at 1 GHz.
◦ If the block size is one word, the processor takes 100 cycles to fetch each word.
◦ If the block size is increased to four words, the processor can fetch a four-word
cache line every 100 cycles.
◦ Increasing the block size from one to four words did not change the latency of the
memory system; however, it increased the bandwidth four-fold.
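A quick back-of-the-envelope check of that claim, assuming 4-byte words (the other numbers mirror the example above):

/* Bandwidth for the 1 GHz / 100-cycle-DRAM example: fetching a whole
   cache line per miss raises bandwidth, not latency. */
#include <stdio.h>

int main(void) {
    double clock_hz = 1e9;        /* 1 GHz processor */
    double miss_cycles = 100.0;   /* DRAM latency per block fetch */
    double word_bytes = 4.0;      /* assume 4-byte words */

    for (int words_per_block = 1; words_per_block <= 4; words_per_block *= 4) {
        double bytes_per_fetch = words_per_block * word_bytes;
        double fetch_time = miss_cycles / clock_hz;        /* 100 ns per fetch */
        double bandwidth = bytes_per_fetch / fetch_time;   /* bytes per second */
        printf("block = %d word(s): %.0f MB/s\n",
               words_per_block, bandwidth / 1e6);           /* 40 MB/s vs. 160 MB/s */
    }
    return 0;
}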
◦ Another way of quickly estimating performance bounds is to estimate the
cache hit ratio.
◦ Spatial locality (also termed data locality) refers to the use of data elements
within relatively close storage locations.
◦ Sequential locality, a special case of spatial locality, occurs when data elements are arranged and accessed linearly,
such as traversing the elements of a one-dimensional array.
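A small C sketch of why spatial locality matters (the array size and loop bodies are illustrative): C stores a 2-D array row by row, so a row-major traversal touches consecutive addresses and reuses each fetched cache line, while a column-major traversal does not.

/* Spatial locality in practice: same arithmetic, very different access patterns. */
#include <stdio.h>

#define N 1024
static double a[N][N];

double sum_row_major(void) {       /* consecutive addresses: good locality */
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

double sum_col_major(void) {       /* stride-N accesses: poor locality */
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}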
Alternate Approaches for Hiding Memory Latency
Imagine sitting at your computer browsing the web during peak
network traffic hours.
The lack of response from your browser can be alleviated using one
of three simple approaches:
◦ Anticipate which pages we are going to browse ahead of time and issue
requests for them in advance: prefetching
◦ Open multiple browsers and access different pages in each browser, so that
while we are waiting for one page to load, we can be reading another:
multithreading
◦ Access a whole bunch of pages in one go, amortizing the latency
across the various accesses: spatial locality
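As a concrete illustration of the first approach, many compilers (GCC and Clang, for example) expose a software-prefetch hint; a minimal sketch, where the prefetch distance of 8 is an arbitrary illustrative choice:

/* Prefetching sketch: ask the hardware to start loading data we will need
   a few iterations from now, so its latency overlaps with useful work.
   __builtin_prefetch is a GCC/Clang extension. */
#include <stddef.h>

double sum_with_prefetch(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&x[i + 8]);   /* hint: fetch x[i+8] into the cache */
        s += x[i];                           /* work on the current element */
    }
    return s;
}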
Multithreading for Latency Hiding