The document discusses different types of parallel computer memory architectures and parallel programming models. It describes shared memory architectures including uniform memory access (UMA), non-uniform memory access (NUMA), and the advantages and disadvantages of shared memory. It also describes distributed memory and hybrid distributed-shared memory architectures. Finally, it summarizes common parallel programming models including shared memory, threads, message passing, data parallel, and others.
Introduction to Parallel Computing
Parallel Computer Memory Architectures
Shared Memory
• All processors access all memory as a global address space
• Multiple processors can operate independently but share the same memory resources
• Changes in a memory location effected by one processor are visible to all other processors
• Shared memory machines are classified as UMA and NUMA, based upon memory access times

Uniform Memory Access (UMA)
• Most commonly represented today by Symmetric Multiprocessor (SMP) machines
• Identical processors, each with equal access and equal access times to memory
• Sometimes called CC-UMA (Cache Coherent UMA): cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level

Non-Uniform Memory Access (NUMA)
• Often made by physically linking two or more SMPs
• One SMP can directly access the memory of another SMP
• Not all processors have equal access time to all memories; memory access across the link is slower (a short sketch of the practical consequence follows this section)
• If cache coherency is maintained, it may also be called CC-NUMA (Cache Coherent NUMA)
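One practical consequence of NUMA for programmers is data placement. The sketch below is a hedged illustration, not from the original slides: it assumes a Linux-style "first touch" policy, under which each memory page is allocated on the NUMA node of the processor that first writes it, so initializing an array in parallel (here with OpenMP) tends to keep each thread's pages in its local memory. The array size and loop bodies are made-up examples.

```c
/* Hypothetical first-touch illustration; assumes an OpenMP compiler
   (e.g. cc -fopenmp demo.c) and a first-touch NUMA placement policy. */
#include <stdlib.h>

int main(void) {
    long n = 1L << 26;                 /* illustrative array size */
    double *a = malloc(n * sizeof *a);

    /* Parallel initialization: each thread first-touches its own chunk,
       so those pages land on that thread's local NUMA node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 0.0;

    /* A later loop with the same static schedule reuses local pages,
       avoiding the slower cross-link memory accesses described above. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 2.0 * a[i] + 1.0;

    free(a);
    return 0;
}
```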
Shared Memory: Advantages and Disadvantages
• Advantages
  • The global address space provides a user-friendly programming perspective to memory
  • Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs
• Disadvantages
  • The primary disadvantage is the lack of scalability between memory and CPUs: adding more CPUs geometrically increases traffic on the shared memory-CPU path and, for cache coherent systems, geometrically increases traffic associated with cache/memory management
  • The programmer is responsible for the synchronization constructs that ensure "correct" access of global memory

Distributed Memory
• Processors have their own local memory; changes to a processor's local memory have no effect on the memory of other processors
• When a processor needs data that resides on another processor, it is usually the programmer's task to explicitly define how and when the data is communicated
• Synchronization between tasks is likewise the programmer's responsibility
• The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet

Distributed Memory: Advantages and Disadvantages
• Advantages
  • Memory is scalable with the number of processors: increase the number of processors and the size of memory increases proportionately
  • Each processor can rapidly access its own memory without interference and without the overhead incurred in trying to maintain global cache coherency
  • Cost effectiveness: commodity, off-the-shelf processors and networking can be used
• Disadvantages
  • The programmer is responsible for many of the details associated with data communication between processors
  • Non-uniform memory access times: data residing on a remote node takes longer to access than node-local data

Hybrid Distributed-Shared Memory
• The largest and fastest computers in the world today employ both shared and distributed memory architectures
• The shared memory component can be a shared memory machine and/or graphics processing units (GPUs)
• The distributed memory component is the networking of multiple shared memory or GPU machines
• Advantages and disadvantages: whatever is common to both shared and distributed memory architectures applies here as well. Increased scalability is an important advantage; increased programmer complexity is an important disadvantage

Parallel Programming Models

Parallel Programming Model
• A programming model provides an abstract view of the computing system
• It is an abstraction above hardware and memory architectures
• The value of a programming model is usually judged on its generality:
  • how well a range of different problems can be expressed, and
  • how well they execute on a range of different architectures
• The implementation of a programming model can take several forms, such as:
  • libraries invoked from traditional sequential languages,
  • language extensions, or
  • completely new execution models
• Parallel programming models in common use:
  • Shared Memory (without threads)
  • Threads
  • Distributed Memory / Message Passing
  • Data Parallel
  • Hybrid
  • Single Program Multiple Data (SPMD)
  • Multiple Program Multiple Data (MPMD)
• These models are NOT specific to a particular type of machine or memory architecture; any of these models can be implemented on any underlying hardware. Two examples:
  • A SHARED memory model on a DISTRIBUTED memory machine: machine memory was physically distributed across networked machines, but appeared to the user as a single shared memory (global address space). This approach is referred to as virtual shared memory
  • A DISTRIBUTED memory model on a SHARED memory machine: the SGI Origin 2000 employed the CC-NUMA type of shared memory architecture, where every task has direct access to a global address space spread across all machines. However, the ability to send and receive messages using MPI, as is commonly done over a network of distributed memory machines, was implemented and commonly used

Shared Memory Model - Without Threads
• Tasks share a common address space, which provides an efficient means of passing data between programs
• Various mechanisms such as locks and semaphores may be used to control access to the shared memory (see the sketch after this section)
• From the programmer's point of view:
  • The notion of data "ownership" is lacking, so there is no need to specify explicitly the communication of data between tasks; program development can often be simplified
  • A disadvantage in terms of performance is that it becomes more difficult to understand and manage data locality: keeping data local to the processor that works on it conserves the memory accesses, cache refreshes and bus traffic that occur when multiple processors use the same data
• Implementations
  • Native compilers and/or hardware translate user program variables into actual memory addresses, which are global
  • On stand-alone shared memory machines, this is straightforward
  • On distributed shared memory machines, memory is physically distributed across a network of machines, but made global through specialized hardware and software
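As a minimal sketch of this model on a POSIX system (assumed: Linux; the counts, structure names, and lack of error handling are illustrative): two processes share a single mmap'd region as their common address space, and a process-shared semaphore is the lock that controls access to it.

```c
/* Two processes, one shared address region, one semaphore as the lock.
   Illustrative sketch; compile on Linux with: cc demo.c -pthread */
#include <semaphore.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct shared { sem_t lock; long counter; };

int main(void) {
    /* One region visible to parent and child: a small global address space. */
    struct shared *s = mmap(NULL, sizeof *s, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    sem_init(&s->lock, 1 /* shared between processes */, 1);
    s->counter = 0;

    pid_t pid = fork();                  /* two tasks, same memory */
    for (int i = 0; i < 100000; i++) {
        sem_wait(&s->lock);              /* lock: one writer at a time */
        s->counter++;                    /* update is visible to both tasks */
        sem_post(&s->lock);
    }
    if (pid != 0) {                      /* parent waits for child, then reports */
        wait(NULL);
        printf("counter = %ld\n", s->counter);   /* prints 200000 */
    }
    return 0;
}
```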
Threads Model
• A type of shared memory programming model in which a single "heavy weight" process can have multiple "light weight", concurrent execution paths
• The main program a.out is scheduled by the native OS; a.out loads and acquires all of the necessary system and user resources to run. This is the "heavy weight" process
• a.out performs some serial work, and then creates a number of tasks (threads) that can be scheduled and run by the operating system concurrently
• Each thread has local data, but also shares the entire resources of a.out. This saves the overhead associated with replicating a program's resources for each thread ("light weight"). Each thread also benefits from a global memory view because it shares the memory space of a.out
• Threads communicate with each other through global memory (updating address locations). This requires synchronization constructs to ensure that no two threads update the same global address at the same time
• Threads can come and go, but a.out remains present to provide the necessary shared resources until the application has completed
• Implementations (a minimal Pthreads sketch follows this list):
  • POSIX Threads
    • Library based; requires parallel coding
    • C language only
    • Commonly referred to as Pthreads
    • Most hardware vendors now offer Pthreads in addition to their proprietary threads implementations
    • Very explicit parallelism; requires significant programmer attention to detail
  • OpenMP
    • Compiler directive based; can use serial code
    • Portable / multi-platform, including Unix and Windows platforms
    • Available in C/C++ and Fortran implementations
    • Can be very easy and simple to use
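The following is a minimal Pthreads sketch of the model described above (the thread count, the shared variable, and the work done are illustrative): main() plays the role of the "heavy weight" a.out, the workers are the "light weight" threads, and a mutex provides the synchronization needed for updates to global memory.

```c
/* Minimal Pthreads sketch; compile with: cc demo.c -pthread */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

long sum = 0;                                    /* global memory shared by all threads */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    long id = (long)arg;                         /* thread-local data */
    pthread_mutex_lock(&lock);                   /* no two threads update sum at once */
    sum += id;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    /* a.out does its serial work here, then creates the threads ... */
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    /* ... and remains present until they finish */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("sum = %ld\n", sum);                  /* 0+1+2+3 = 6 */
    return 0;
}
```

With OpenMP, the same idea reduces to directives around the update, for example #pragma omp parallel together with #pragma omp critical, with thread creation and joining handled by the compiler.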
Distributed Memory / Message Passing Model
• A set of tasks that use their own local memory during computation; multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines
• Tasks exchange data through communications, by sending and receiving messages
• Data transfer usually requires cooperative operations to be performed by each process: for example, a send operation must have a matching receive operation (see the sketch after this section)
• Implementations
  • From a programming perspective, message passing implementations usually comprise a library of subroutines; calls to these subroutines are embedded in source code
  • MPI specifications are available on the web at http://www.mpi-forum.org/docs/
  • MPI implementations exist for virtually all popular parallel computing platforms
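A minimal MPI sketch of a matching send/receive pair follows; it assumes two tasks (for example, mpirun -np 2 ./a.out), and the value and message tag are illustrative.

```c
/* Cooperative data transfer: the send in task 0 must be matched
   by the receive in task 1. Compile with mpicc, run with mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                          /* data in task 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);         /* matching receive */
        printf("task 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```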
Data Parallel Model
• The address space is treated globally
• A set of tasks works collectively on the same data structure; however, each task works on a different partition of that data structure (see the sketch after this section)
• On shared memory architectures, all tasks may have access to the data structure through global memory
• On distributed memory architectures, the data structure is split up and resides as "chunks" in the local memory of each task
• Implementations
  • Unified Parallel C (UPC): an extension to the C programming language for SPMD parallel programming. Compiler dependent. More information: http://upc.lbl.gov/
  • Global Arrays: provides a shared memory style programming environment in the context of distributed array data structures. Public domain library with C and Fortran77 bindings. More information: http://www.emsl.pnl.gov/docs/global/
  • X10: a PGAS-based parallel programming language being developed by IBM at the Thomas J. Watson Research Center. More information: http://x10-lang.org/
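Below is a sketch of the distributed-memory variant of this model, written in plain C with MPI rather than UPC or Global Arrays (which would express the partitioning more directly). The logical array size, the update applied, and the assumption that the size divides evenly among tasks are all illustrative.

```c
/* One logical array of N elements; each task stores and updates only
   its own "chunk" in local memory. Sketch only; assumes N % ntasks == 0. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    const long N = 1L << 20;                 /* global logical array size */
    int rank, ntasks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    long chunk = N / ntasks;                 /* this task's partition size */
    double *local = malloc(chunk * sizeof *local);

    /* Every task applies the SAME operation to a DIFFERENT partition;
       the global index of local[i] is rank * chunk + i. */
    for (long i = 0; i < chunk; i++)
        local[i] = 2.0 * (rank * chunk + i);

    printf("task %d owns global elements [%ld, %ld)\n",
           rank, rank * chunk, (rank + 1) * chunk);

    free(local);
    MPI_Finalize();
    return 0;
}
```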
Hybrid Model
• A hybrid model combines more than one of the previously described programming models (a combined hybrid/SPMD sketch appears at the end of this section)

Single Program Multiple Data (SPMD)
• A high level programming model that can be built upon any combination of the previously mentioned parallel programming models
• SINGLE PROGRAM: all tasks execute their copy of the same program simultaneously. This program can be threads, message passing, data parallel or hybrid
• MULTIPLE DATA: all tasks may use different data
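To close, here is a hedged sketch that combines the hybrid and SPMD ideas: every task runs this same program, MPI distributes the tasks across (possibly distributed-memory) nodes, and OpenMP threads share memory within each task. The printed messages and the division of roles are illustrative.

```c
/* SPMD + hybrid sketch: same program everywhere, MPI between tasks,
   OpenMP threads inside each task. Compile: mpicc -fopenmp demo.c */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, provided;
    /* Request thread support since this task will run OpenMP threads */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* SINGLE PROGRAM: identical code on every task.
       MULTIPLE DATA: rank and thread number select each worker's share. */
    if (rank == 0)
        printf("rank 0 might do extra coordination work here\n");

    #pragma omp parallel
    {
        /* Threads within one task share that task's memory (shared
           memory model); tasks communicate only via MPI messages. */
        printf("rank %d, thread %d working on its own slice\n",
               rank, omp_get_thread_num());
    }

    MPI_Finalize();
    return 0;
}
```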