
CS326 Parallel and Distributed Computing
SPRING 2021
NATIONAL UNIVERSITY OF COMPUTER AND EMERGING SCIENCES
Parallel computing is the simultaneous use of multiple computing resources to solve a computational problem.
◦ To be run using multiple CPUs/cores

A problem is broken into discrete parts that can be solved concurrently
◦ Each part is further broken down to a series of instructions

Instructions from each part execute simultaneously on different CPUs
Some Questions
Why are we building parallel systems?

Why do we need ever-increasing performance?

Why do we need to write parallel programs?

How do we write parallel programs? (both styles are sketched below)
◦ Task Parallelism
◦ Data Parallelism
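As a rough illustration (not from the slides), the C sketch below contrasts the two styles using OpenMP; the function names and the choice of operations are illustrative. Compile with -fopenmp.

/* Data parallelism: the same operation is applied to different parts
   of the data; OpenMP divides the loop iterations among the cores. */
void scale_array(double *a, int n, double factor) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] *= factor;
}

/* Task parallelism: two different tasks (a sum and a maximum) run
   concurrently over the same data. */
void sum_and_max(const double *a, int n, double *sum, double *max) {
    double s = 0.0, m = a[0];
    #pragma omp parallel sections
    {
        #pragma omp section
        for (int i = 0; i < n; i++) s += a[i];

        #pragma omp section
        for (int i = 1; i < n; i++) if (a[i] > m) m = a[i];
    }
    *sum = s;
    *max = m;
}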
Parallel VS Concurrent VS Distributed
In concurrent computing, a program is one in which multiple tasks can be in progress at any instant.

In parallel computing, a program is one in which multiple tasks cooperate closely to solve a problem.

In distributed computing, a program may need to cooperate with other programs to solve a problem.
Some General Parallel Terminologies
Task/Process
◦ A logically discrete section of computational work. A task is typically a program or
program-like set of instructions that is executed by a processor.

Parallel Task
◦ A task that can be executed by multiple processors safely (yields correct results)

Serial Execution
◦ Execution of a program sequentially, one statement at a time. In the simplest sense, this is what happens on a single-processor machine. However, virtually all parallel programs have sections that must be executed serially.
Parallel Execution
◦ Execution of a program by more than one task, with each task being able to execute the same or
different statement at the same moment in time.
Shared Memory
◦ From a strictly hardware point of view, describes a computer architecture where all processors have direct (usually bus-based) access to common physical memory.
◦ In a programming sense, it describes a model where parallel tasks all have the same "picture" of memory and can directly address and access the same logical memory locations regardless of where the physical memory actually exists.

Distributed Memory
◦ In hardware, refers to network-based memory access for physical memory that is not common. As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing.
Communications
◦ Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network; however, the actual event of data exchange is commonly referred to as communications regardless of the method employed.

Synchronization
◦ The coordination of parallel tasks in real time, very often associated with communications. Often
implemented by establishing a synchronization point within an application where a task may not
proceed further until another task(s) reaches the same or logically equivalent point.

◦ Synchronization usually involves waiting by at least one task, and can therefore cause a parallel
application's wall clock execution time to increase.
Parallel Overhead
◦ The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel
overhead can include factors such as:
◦ Task start-up time
◦ Synchronizations
◦ Data communications
◦ Software overhead imposed by parallel compilers, libraries, tools, operating system, etc.
◦ Task termination time

Massively Parallel
◦ Refers to the hardware that comprises a given parallel system - one having many processors. The meaning of "many" keeps increasing, but currently BG/L* pushes this number to six digits.

*Blue Gene is an IBM project aimed at designing supercomputers that can reach operating speeds in the petaFLOPS (PFLOPS) range, with low power consumption.
Scalability
◦ Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in parallel speedup with the addition of more processors.

◦ Factors that contribute to scalability include:
◦ Hardware - particularly memory-CPU bandwidth and network communications
◦ Application algorithm
◦ Parallel overhead
◦ Characteristics of your specific application and coding
Parallel Computer Memory Architectures
Shared Memory
Distributed Memory
Hybrid Distributed-Shared Memory
Shared Memory
Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all
memory as global address space.

Multiple processors can operate independently but share the same memory resources.
Changes in a memory location effected by one processor are visible to all other processors (illustrated in the sketch below).
Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.
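As a rough illustration of the shared-memory programming view (not part of the slides), the C sketch below starts two POSIX threads that update one counter in a single address space; a mutex coordinates the updates, and both threads see the same memory location. The counter name and iteration count are illustrative. Compile with -pthread.

#include <pthread.h>
#include <stdio.h>

/* One counter in a single shared address space: every thread sees the
   same memory location, so updates must be coordinated with a lock. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                 /* change is visible to all threads */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* prints 200000 */
    return 0;
}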
Shared Memory : UMA vs. NUMA
Uniform Memory Access (UMA):
◦ Identical processors with equal access and access times to memory
◦ Sometimes called CC-UMA - Cache Coherent UMA.

Non-Uniform Memory Access (NUMA):


◦ Not all processors have equal access time to all memories
◦ One SMP can directly access memory of another SMP
Distributed Memory
Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of global address space across all processors.
Distributed memory systems require a communication network to connect inter-processor memory.
When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated (see the sketch below).
The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.
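As a rough illustration of explicit communication (not part of the slides), the C sketch below uses MPI: rank 1 cannot read rank 0's local memory directly, so the value must be sent and received over the network. The value and message tag are illustrative.

#include <mpi.h>
#include <stdio.h>

/* Rank 0's memory cannot be read directly by rank 1, so the value is
   sent explicitly over the network.  Run with at least two processes,
   e.g. mpirun -np 2 ./a.out */
int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;               /* exists only in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}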
Parallel Computing Platforms
Implicit Parallelism: Trends in Microprocessor Architectures
Traditionally, a sequential computer consists of:
◦ a processor, memory, and a datapath
◦ all of which present bottlenecks to performance.

Increments in clock speed are severely diluted by the limitations of memory technology.
Pipelining and Superscalar Execution
Pipelining: overlapping various stages in instruction execution
◦ fetch, schedule, decode, operand fetch, execute, store, among others
◦ The Pentium 4, which operates at 2.0 GHz, has a 20-stage pipeline.
◦ The speed of a single pipeline is ultimately limited by the largest atomic task in the pipeline.
◦ In typical instruction traces, every fifth to sixth instruction is a branch instruction.
◦ This requires effective techniques for predicting branch destinations so that pipelines can be speculatively filled; a misprediction is costly.

Use multiple pipelines. The ability of a processor to issue multiple instructions in the same cycle is referred to as superscalar execution.
◦ During each clock cycle, multiple instructions are piped into the processor in parallel.
◦ These instructions are executed on multiple functional units.
A number of issues need to be resolved with superscalar execution:
◦ Data Dependency
◦ Resource Dependency
◦ Branch/Procedural Dependency

The processor needs the ability to issue instructions out of order to accomplish the desired reordering.
◦ The parallelism available with in-order issue of instructions can be highly limited.

Most current microprocessors are capable of out-of-order issue and completion.
◦ This model, also referred to as dynamic instruction issue, exploits maximum instruction-level parallelism.
Very Long Instruction Word (VLIW) Processors
The parallelism extracted by superscalar processors is often limited by the instruction look-ahead.

Instructions that can be executed concurrently are packed into groups and parceled off to the processor as a single long instruction word, to be executed on multiple functional units at the same time.
◦ The compiler has a larger context from which to select instructions.

The performance of VLIW processors is very sensitive to the compiler's ability to detect data and resource dependencies and read and write hazards, and to schedule instructions for maximum parallelism.
Limitation of Memory System Performance
The effective performance of a program on a computer relies not just on the speed of the processor but also on the ability of the memory system to feed data to the processor.

Latency: the time from issuing a memory request to receiving the requested word.
◦ The rate at which data can be pumped from the memory to the processor determines the bandwidth of the memory system.
Example
A processor operating at 1 GHz (1 ns clock) is connected to a DRAM with a latency of 100 ns (no caches).

The processor is capable of executing four instructions in each cycle of 1 ns (a peak of 4 GFLOPS).

Since the memory latency is equal to 100 cycles and the block size is one word, every time a memory request is made, the processor must wait 100 cycles before it can process the data. If each operation needs a fresh word from memory, the effective rate is therefore at most one operation per 100 ns, i.e. 10 MFLOPS, far below the peak.
Improving Effective Memory Latency Using Caches
Cache: a low-latency high-bandwidth storage between the processor and the
DRAM.
◦ The data needed by the processor is first fetched into the cache. All subsequent accesses to
data items residing in the cache are serviced by the cache.

◦ Thus, in principle, if a piece of data is repeatedly used, the effective latency of this memory
system can be reduced by the cache.

◦ The fraction of data references satisfied by the cache is called the cache hit ratio of the
computation on the system.

◦ Data reuse is critical for cache performance because if each data item is used only once, it
would still have to be fetched once per use from the DRAM
◦ The cache consists of m blocks, called lines.
◦ In referring to the basic unit of the cache, the term line is used rather than the term block.

◦ Line size: the length of a line.
◦ Line size may be as small as 32 bits (one word).

◦ The number of lines is considerably less than the number of main memory blocks.
Cache Read Operation (figure)
Impact of Memory Bandwidth
One commonly used technique to improve memory bandwidth is to increase the size of the memory blocks.
◦ Consider again a memory system with a single-cycle cache and 100-cycle-latency DRAM, with the processor operating at 1 GHz.
◦ If the block size is one word, the processor takes 100 cycles to fetch each word.
◦ If the block size is increased to four words, the processor can fetch a four-word cache line every 100 cycles.
◦ Increasing the block size from one to four words did not change the latency of the memory system. However, it increased the bandwidth four-fold.
◦ Another way of quickly estimating performance bounds is to estimate the cache hit ratio.

Temporal and Spatial Locality of Data:
◦ Temporal locality refers to the reuse of specific data and/or resources within a relatively small time duration.

◦ Spatial locality (also termed data locality) refers to the use of data elements within relatively close storage locations.
◦ Sequential locality, a special case of spatial locality, occurs when data elements are arranged and accessed linearly, such as when traversing the elements of a one-dimensional array. (A sketch of spatial locality in practice follows below.)
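As a rough illustration (not part of the slides), the C sketch below sums the same two-dimensional array twice. The row-wise loop walks memory contiguously and uses every word of each fetched cache line; the column-wise loop jumps a whole row between accesses and wastes most of each line, lowering the hit ratio. The array size is illustrative.

#define N 2048
static double a[N][N];   /* C stores this array row by row (row-major) */

/* Row-wise traversal: consecutive accesses touch adjacent addresses,
   so every word of each fetched cache line is used (good spatial
   locality, high hit ratio). */
double sum_rowwise(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-wise traversal of the same array: consecutive accesses are a
   whole row apart, so most of each fetched line goes unused (poor
   spatial locality, low hit ratio) even though the arithmetic is
   identical. */
double sum_colwise(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}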
Alternate Approaches for Hiding Memory Latency
Imagine sitting at your computer browsing the web during peak network traffic hours.
The lack of response from your browser can be alleviated using one of three simple approaches:
◦ Anticipate which pages we are going to browse ahead of time and issue requests for them in advance: prefetching
◦ Open multiple browsers and access different pages in each browser; while we are waiting for one page to load, we can be reading another: multithreading
◦ Access a whole bunch of pages in one go, amortizing the latency across the various accesses: spatial locality
Multithreading for Latency Hiding
◦ Consider a loop whose iterations each compute an independent dot-product. Because each dot-product is independent of the others, and therefore represents a concurrent unit of execution, the loop may be rewritten so that each dot-product is computed by a separate thread, as in the sketch below:
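A minimal sketch in C with POSIX threads, assuming the loop in question is a matrix-vector multiply in which each row's dot-product with the vector is independent; the matrix, sizes, and function names are illustrative, not taken from the slides. Compile with -pthread.

#include <pthread.h>

#define N 256            /* rows/columns; one thread per row, for illustration only */

static double a[N][N], b[N], c[N];

/* Each thread computes one independent dot-product c[row] = a[row] . b.
   While one thread is stalled on a memory access, another thread whose
   operands have already arrived can run, hiding part of the latency. */
static void *row_dot_product(void *arg) {
    int row = *(int *)arg;
    double s = 0.0;
    for (int j = 0; j < N; j++)
        s += a[row][j] * b[j];
    c[row] = s;
    return NULL;
}

void matvec_threaded(void) {
    pthread_t tid[N];
    int rows[N];
    for (int i = 0; i < N; i++) {
        rows[i] = i;
        pthread_create(&tid[i], NULL, row_dot_product, &rows[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(tid[i], NULL);
}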
◦ Multithreaded processors are capable of maintaining the contexts of a number of threads of computation with outstanding requests (memory accesses, I/O, or communication requests) and executing them as the requests are satisfied.

◦ Such machines rely on multithreaded processors that can switch the context of execution in every cycle.
◦ They are able to hide latency effectively, provided there is enough concurrency (threads) to keep the processor from idling.
Prefetching for Latency Hiding
In a typical program, a data item is loaded and used by a processor
in a small time window. If the load results in a cache miss, then the
use stalls.
A simple solution to this problem is to advance the load operation so that even if there is a cache miss, the data is likely to have arrived by the time it is used (a sketch follows below).
However, if the data item has been overwritten between load and
use, a fresh load is issued.
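As a rough illustration (not from the slides), the C sketch below uses the GCC/Clang __builtin_prefetch intrinsic to issue the fetch for a[i + DIST] several iterations before that element is used; the prefetch distance is an assumed, machine-dependent tuning parameter. Hardware prefetchers and compiler-inserted prefetches achieve the same effect automatically in many cases.

#define DIST 16   /* prefetch distance in elements; machine-dependent, assumed here */

/* The prefetch for a[i + DIST] is issued DIST iterations ahead of its
   use, so the miss latency overlaps with useful work on earlier
   elements instead of stalling the loop. */
double sum_with_prefetch(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 1 /* low temporal locality */);
        s += a[i];
    }
    return s;
}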
