
Term Paper

COMPUTER ORGANIZATION AND ARCHITECTURE (CSE211)

Topic: Shared Memory MIMD Architecture

Submitted To: Miss Ramanpreet Kaur Lamba, Lect. ARCH

Submitted By: Shivani, Reg. No.: 10900957, Roll No.: RK1902A10, Section: K1902

Shared Memory MIMD Architectures
Shivani, RK1902A10

Abstract— The processors are all connected to a "globally available" memory, via either software or hardware means. The operating system usually maintains its memory coherence. From a programmer's point of view, this memory model is better understood than the distributed memory model. Another advantage is that memory coherence is managed by the operating system and not by the written program. Two known disadvantages are: scalability beyond thirty-two processors is difficult, and the shared memory model is less flexible than the distributed memory model. There are further types of shared memory multiprocessors: UMA (Uniform Memory Access), COMA (Cache-Only Memory Access) and NUMA (Non-Uniform Memory Access).

I. INTRODUCTION

An MIMD (Multiple Instruction stream, Multiple Data stream) computer system has a number of independent processors operating upon separate data concurrently. Hence each processor has its own program memory or has access to program memory. Similarly, each processor has its own data memory or access to data memory. Clearly there needs to be a mechanism to load the program and data memories, and a mechanism for passing information between processors as they work on some problem. MIMD has clearly emerged as the architecture of choice for general-purpose multiprocessors. MIMD machines offer flexibility. With the correct hardware and software support, MIMDs can function as single-user machines focusing on high performance for one application, as multiprogrammed machines running many tasks simultaneously, or as some combination of these functions. There are two types of MIMD architectures: distributed memory MIMD architecture and shared memory MIMD architecture.

II. SHARED MEMORY

In computing, shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or to avoid redundant copies. Depending on context, programs may run on a single processor or on multiple separate processors. Using memory for communication inside a single program, for example among its multiple threads, is generally not referred to as shared memory.

A. In hardware

In computer hardware, shared memory refers to a (typically) large block of random access memory that can be accessed by several different central processing units (CPUs) in a multiple-processor computer system. A shared memory system is relatively easy to program since all processors share a single view of data, and communication between processors can be as fast as memory accesses to the same location. The issue with shared memory systems is that many CPUs need fast access to memory and will likely cache memory, which has two complications:

• CPU-to-memory connection becomes a bottleneck. Shared memory computers cannot scale very well; most of them have ten or fewer processors.

• Cache coherence: whenever one cache is updated with information that may be used by other processors, the change needs to be reflected to the other processors; otherwise the different processors will be working with incoherent data (see cache coherence and memory coherence). Such coherence protocols can, when they work well, provide extremely high-performance access to shared information between multiple processors. On the other hand, they can sometimes become overloaded and become a bottleneck to performance.

The alternatives to shared memory are distributed memory and distributed shared memory, each having a similar set of issues. See also Non-Uniform Memory Access.

B. In software

In computer software, shared memory is either

• a method of inter-process communication (IPC), i.e. a way of exchanging data between programs running at the same time: one process creates an area in RAM which other processes can access, or

• a method of conserving memory space by directing accesses to what would ordinarily be copies of a piece of data to a single instance instead, by using virtual memory mappings or with explicit support of the program in question. This is most often used for shared libraries and for XIP (execute in place).

Since both processes can access the shared memory area like regular working memory, this is a very fast way of communication (as opposed to other mechanisms of IPC such as named pipes, Unix domain sockets or CORBA). On the other hand, it is less powerful: for example, the communicating processes must be running on the same machine (whereas other IPC methods can use a computer network), and care must be taken to avoid issues if processes sharing memory are running on separate CPUs and the underlying architecture is not cache coherent.
IPC by shared memory is used for example to transfer images between the application and the X server on Unix systems, or inside the IStream object returned by CoMarshalInterThreadInterfaceInStream in the COM libraries under Windows.

Dynamic libraries are generally held in memory once and mapped to multiple processes, and only pages that had to be customized for the individual process (because a symbol resolved differently there) are duplicated, usually with a mechanism that transparently copies the page when a write is attempted, and then lets the write succeed on the private copy.
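This transparent copy-on-write behaviour can be reproduced with an ordinary private file mapping. The following minimal C sketch is my own illustration (the file path is hypothetical, and any readable, non-empty file works for the demo): MAP_PRIVATE lets every process share the unmodified pages, and the kernel silently duplicates a page the moment one process writes to it.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical file standing in for a shared library image. */
    int fd = open("/tmp/libexample.so", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* MAP_PRIVATE: pages stay shared between processes until written;
       the first write triggers a transparent per-process page copy. */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[0] = 'X';   /* the kernel copies this page; other processes still
                     see the original, unmodified data */

    munmap(p, 4096);
    close(fd);
    return 0;
}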
POSIX provides a standardized API for using shared memory, POSIX Shared Memory. This uses the function shm_open from sys/mman.h.

Unix System V provides an API for shared memory as well. This uses shmget from sys/shm.h.
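As a concrete illustration of the POSIX calls just named, the following minimal sketch (my own example, not from the original text; the segment name "/my_shm" and its size are arbitrary) creates a named segment that a second process can open and map the same way. The System V shmget/shmat pair follows the same create-then-attach pattern. On older glibc versions the program must be linked with -lrt.

#include <fcntl.h>      /* O_* constants */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>   /* shm_open, mmap */
#include <unistd.h>     /* ftruncate */

int main(void)
{
    /* Create (or open) a named shared memory object. */
    int fd = shm_open("/my_shm", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }

    ftruncate(fd, 4096);   /* set the size of the segment */

    /* MAP_SHARED: writes become visible to every process mapping "/my_shm". */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello from process A");   /* a second process can read this */

    munmap(p, 4096);
    close(fd);
    shm_unlink("/my_shm");   /* remove the name once all users are done */
    return 0;
}

On Linux the named object appears under the /dev/shm RAM disk mentioned below.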
BSD systems provide "anonymous mapped memory" which can be used by several processes.

Recent 2.6 Linux kernel builds have started to offer /dev/shm as shared memory in the form of a RAM disk, more specifically as a world-writable directory that is stored in memory. /dev/shm support is completely optional within the kernel configuration file. It is included by default in both Fedora and Ubuntu distributions.
III. SHARED MEMORY MIMD ARCHITECTURE

The distinguishing feature of shared memory systems is that, no matter how many memory blocks are used in them and how these memory blocks are connected to the processors, the address spaces of the memory blocks are unified into a global address space which is completely visible to all processors of the shared memory system. Issuing a certain memory address from any processor will access the same memory block location. However, according to the physical organization of the logically shared memory, two main types of shared memory system can be distinguished:

A. Physically shared memory systems

In physically shared memory systems, all memory blocks can be accessed uniformly by all processors.

B. Virtual (or distributed) shared memory systems

In distributed shared memory systems, the memory blocks are physically distributed among the processors as local memory units.

Shared memory systems are basically classified according to their memory organization, since this is the most fundamental design issue. Accordingly, shared memory systems can be divided into four main classes:

• Uniform memory access (UMA) machines

• Non-uniform memory access (NUMA) machines

• Cache-coherent non-uniform memory access (CC-NUMA) machines

• Cache-only memory access (COMA) machines

UMA machines belong to the physically shared memory architecture class, while NUMA, CC-NUMA, and COMA machines form the class of distributed shared memory architectures. The four classes cover the three generations of shared memory systems. The first generation contains the UMA machines, where the interconnection network was based either on the concept of a shared bus, in order to construct low-price multiprocessors, or on multistage networks, in order to build massively parallel shared memory systems.

Contention is an inherent consequence of sharing and, by introducing an additional shared hardware resource - the shared bus - it became a critical architectural bottleneck. The whole history of shared memory systems is about struggling against contention. Even in the first generation, local cache memories were introduced to reduce contention. However, despite the use of sophisticated cache systems, the scalability of first-generation shared memory systems was strongly limited. The number of effectively exploitable processors was in the range of 20-30 in shared bus machines and 100-200 in multistage network-based machines. The second-generation shared memory systems tried to physically distribute the shared memory among the processors in order to reduce the traffic, and consequently the contention, on the interconnection network. A further improvement was the replacement of the single shared bus by a more complex multibus or multistage network. The third-generation shared memory systems combine the advantages of the first two generations. CC-NUMA and COMA machines are highly scalable, massively parallel systems where contention is dramatically reduced by introducing large local cache memories. Because of the underlying cache coherence protocols, programming these machines is no more difficult than programming first-generation machines.

As in multicomputers, the quality of the interconnection network has a decisive impact on the speed, size and cost of the whole machine. Since in multiprocessors any processor must be able to access any memory location, even if it physically belongs to another processor, dynamic interconnection schemes are usually employed. Dynamic networks can be divided into two main classes according to their mode of operation. Those that provide continuous connection among the processors and memory blocks are called shared path networks. The other type of dynamic network does not provide a continuous connection among the processors and memory blocks; rather, a switching mechanism enables processors to be temporarily connected to memory blocks.


Dynamic networks have some drawbacks compared with the static networks applied in multicomputers. Dynamic networks are either too expensive (switching networks) or they can support only a limited number of processors (bus connection).

Uniprocessors have successfully demonstrated the benefits of cache memory in increasing memory bandwidth. Accordingly, most shared memory systems employ cache memories, too. However, the application of caches in a multiprocessor environment gives rise to the so-called cache consistency problem. In order to solve the problem of maintaining data consistency in the caches, a cache coherence protocol must be added to the traffic on the network. The extra traffic deriving from the protocol reduces the benefits of the caches; hence, careful design is necessary to introduce a protocol of minimal complexity. Cache coherence protocols are divided into two classes: hardware-based protocols and software-based protocols. Hardware-based protocols are strongly related to the type of interconnection network employed.

Figure I The Design Space and Classification of Shared Memory Architectures

C. Uniform Memory Access (UMA) Machines:

Contemporary uniform memory access machines are small-size single-bus multiprocessors. Large UMA machines with hundreds of processors and a switching network were typical in the early design of scalable shared memory systems. Famous representatives of that class of multiprocessors are the Denelcor HEP and the NYU Ultracomputer. They introduced many innovative features in their design, some of which even today represent a significant milestone in parallel computer architectures. However, these early systems contain neither cache memory nor local main memory, which turned out to be necessary to achieve high performance in scalable shared memory systems. Although the UMA architecture is not suitable for building scalable parallel computers, it is excellent for constructing small-size single-bus multiprocessors. Two such machines are the Encore Multimax of Encore Computer Corporation, representing the technology of the late 1980s, and the Power Challenge of Silicon Graphics Computing Systems, representing the technology of the 1990s.

Figure II UMA Machines

D. Non-Uniform Memory Access (NUMA) Machines:

Non-uniform memory access (NUMA) machines were designed to avoid the memory access bottleneck of UMA machines. The logically shared memory is physically distributed among the processing nodes of NUMA machines, leading to distributed shared memory architectures. On the one hand these parallel computers became highly scalable, but on the other hand they are very sensitive to data allocation in local memories: accessing a local memory segment of a node is much faster than accessing a remote memory segment. Not by chance, the structure and design of these machines resemble in many ways those of distributed memory multicomputers. The main difference is in the organization of the address space. In multiprocessors, a global address space is applied that is uniformly visible from each processor; that is, all processors can transparently access all memory locations. In multicomputers, the address space is replicated in the local memories of the processing elements. This difference in the address space is also reflected at the software level: distributed memory multicomputers are programmed on the basis of the message-passing paradigm, while NUMA machines are programmed on the basis of the global address space (shared memory) principle.

The problem of cache coherency does not appear in distributed memory multicomputers, since the message-passing paradigm explicitly handles different copies of the same data structure in the form of independent messages. In the shared memory paradigm, multiple accesses to the same global data structure are possible and can be accelerated if local copies of the global data structure are maintained in local caches. However, hardware-supported cache consistency schemes are not introduced into NUMA machines.
These systems can cache read-only code and data, as well as local data, but not shared modifiable data. This is the distinguishing feature between NUMA and CC-NUMA multiprocessors. Accordingly, NUMA machines are closer to multicomputers than to other shared memory multiprocessors, while CC-NUMA machines look like real shared memory systems.

In NUMA machines, as in multicomputers, the main design issues are the organization of processor nodes, the interconnection network, and the possible techniques to reduce remote memory accesses. Two examples of NUMA machines are the Hector and the Cray T3D multiprocessors.

Figure III NUMA Machines

1. Hector: Hector is a hierarchical NUMA machine consisting of stations connected by a hierarchy of ring networks. Stations are symmetric multiprocessors whose processing nodes are connected by a single bus. Nodes comprise three main units: a processor/cache unit, a memory unit, and the station bus interface which connects the otherwise separated processor and memory buses. The separation of the two buses enables other processors to access this memory while the processor performs memory access operations in off-node memory. The processing nodes of the machine are grouped into shared-bus symmetric multiprocessors, called stations. These are connected by bit-parallel local rings which are, in turn, interconnected by a single global ring. Hector provides a flat, global address space, where each processing node is assigned a range of addresses. The addressing scheme uses r+s+p bits, where r identifies the ring, s points to the station, and p addresses the slot inside the station. Although global cache consistency cannot be maintained in Hector, a snoopy protocol provides cache consistency among the nodes inside a station.
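The r+s+p addressing scheme amounts to a simple bit-field decode of the global address. The sketch below is my own illustration with assumed field widths, since the text does not give Hector's actual values of r, s and p:

#include <stdint.h>
#include <stdio.h>

/* Assumed widths for illustration only: 4 ring bits, 4 station bits,
   24 slot bits (not Hector's real parameters). */
#define R_BITS 4
#define S_BITS 4
#define P_BITS 24

static void decode(uint32_t addr)
{
    uint32_t ring    = addr >> (S_BITS + P_BITS);               /* top r bits    */
    uint32_t station = (addr >> P_BITS) & ((1u << S_BITS) - 1); /* middle s bits */
    uint32_t slot    = addr & ((1u << P_BITS) - 1);             /* low p bits    */
    printf("ring %u, station %u, slot 0x%x\n", ring, station, slot);
}

int main(void)
{
    decode(0x12ABCDEFu);   /* prints: ring 1, station 2, slot 0xabcdef */
    return 0;
}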
Memory accesses take place in a synchronized packet-transfer scheme controlled by a hierarchy of interface circuits: the station bus interface, the station controller (local ring interface), and the inter-ring interface. The station bus interface connects processing nodes to the station bus by forwarding station bus requests to the station controller. The station controller has a twofold role: first, it controls the allocation of the station bus between on-station requests, and second, it realises the local ring interface for the station. The station controller is thus also responsible for connecting the station to a local ring. The inter-ring interface is realized as a two-deep FIFO buffer that gives priority to packets moving in the global ring. This means that whenever a packet travels on the global ring it will reach its destination without delay.

The main advantages of the Hector machine can be stated as follows:

--- The hierarchical structure enables short transmission lines and good scalability.

--- The cost and the overall bandwidth of the structure grow linearly with the number of nodes.

--- The cost of a memory access grows incrementally with the distance between the processor and the memory location.

The main drawbacks of Hector are typical of all NUMA machines: lack of global cache consistency, and non-uniform memory access times which require careful software design.

2. Cray T3D: The Cray T3D is the most recent NUMA machine, designed with the intention of providing a highly scalable parallel supercomputer that can incorporate both the shared memory and the message-passing programming paradigms. As in other NUMA machines, the shared memory is distributed among the processing elements in order to avoid the memory access bottleneck, and there is no hardware support for cache coherency. However, a special software package and programming model, called CRAFT, manages coherence and guarantees the integrity of the data. The Cray T3D hardware structure is divided into two parts: the microarchitecture and the macroarchitecture. The microarchitecture is based on Digital's 21064 Alpha AXP microprocessor which, like other contemporary microprocessors, has two main weaknesses: limited address space and little or no latency-hiding capability.

Cray Research has designed a shell of circuitry around the core microprocessor to extend its capabilities in these areas. The Cray T3D system has up to 128 Gbytes of distributed shared memory, which requires at least 37 bits of physical address. In order to extend the number of address bits beyond the 34 provided by the Alpha chip, the Cray T3D employs a 32-entry register set. It is the task of the shell circuitry to check the virtual processing element number.

To improve the latency-hiding mechanism of the Alpha chip, Cray introduces a 16-word FIFO, called the prefetch queue, which permits 16 prefetch instructions to be performed without executing any load from the queue. The effect of a prefetch for a remote node is that the next free location of the prefetch queue is reserved for the data and a remote load operation is started for the requested data. When the processor needs the prefetched data, a load operation on the prefetch queue delivers the requested slot of the queue. If the data has not yet returned from the remote node, the processor is stalled.
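The prefetch queue can be modelled in software as a 16-slot FIFO: a prefetch reserves the next free slot and starts the remote load, and a later consume stalls until that slot's data has arrived. The sketch below is a simplified model under my own assumptions about the interface; it is not Cray's implementation.

#include <stdbool.h>
#include <stdint.h>

#define QUEUE_DEPTH 16   /* the T3D prefetch queue holds 16 words */

struct slot { bool ready; uint64_t data; };
static struct slot queue[QUEUE_DEPTH];
static int head, tail;   /* consume at head, reserve at tail */

/* Stub standing in for the hardware request; a real remote load would
   complete asynchronously and set ready when the data arrives. */
static void start_remote_load(uint64_t remote_addr, struct slot *dst)
{
    dst->data = remote_addr;   /* dummy payload */
    dst->ready = true;
}

/* Issue a prefetch: reserve the next free queue slot and start the load. */
void prefetch(uint64_t remote_addr)
{
    struct slot *s = &queue[tail];
    tail = (tail + 1) % QUEUE_DEPTH;
    s->ready = false;
    start_remote_load(remote_addr, s);
}

/* Consume the oldest prefetched word, stalling if it has not arrived. */
uint64_t consume(void)
{
    struct slot *s = &queue[head];
    head = (head + 1) % QUEUE_DEPTH;
    while (!s->ready)
        ;   /* the processor is stalled here until the data returns */
    return s->data;
}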
The macroarchitecture defines how to connect and integrate the nodes of the parallel computer, while the microarchitecture specifies the node organization. The two parts of the macroarchitecture are the memory system and the interconnection network. The memory system realises a distributed shared memory where any processing element can directly address any other processing element's memory.

E. Cache-Coherent Non-Uniform Memory Access (CC-NUMA) Machines

All the CC-NUMA machines share the common goal of building a scalable shared memory multiprocessor. The main difference among them is in the way the memory and cache coherence mechanisms are distributed among the processing nodes. Another distinguishing design issue is the selection of the interconnection network among the nodes. Together they demonstrate a progression from bus-based networks towards more general interconnection networks, and from the snoopy cache coherency protocol towards a directory scheme. The Wisconsin multicube architecture is the closest generalization of a single bus-based multiprocessor. It completely relies on the snoopy cache protocol, but in a hierarchical way. The main goal of the Stanford FLASH design was the efficient integration of cache-coherent shared memory with high-performance message passing. The FLASH applies a directory scheme for maintaining cache coherence. The figure below shows the design space of the CC-NUMA machines.

Figure IV CC-NUMA Machines

1. Wisconsin Multicube: The Wisconsin multicube architecture employs row and column buses forming a two-dimensional grid architecture. The three-dimensional generalization results in a cube architecture. The main memory is distributed along the column buses, and each data block of memory has a home column. All rows of processors work similarly to single bus-based multiprocessors. Each processing element contains a processor, a conventional cache memory to reduce memory latency, and a snoopy cache that monitors a row bus and a column bus in order to realize a write-back, write-invalidate cache coherence protocol. The possible states of blocks in memories are unmodified, in which the value in the main memory is correct and there can be several correct cached copies, and modified, in which the value in the main memory is stale and there exists exactly one correct cached copy. The possible states of blocks in caches are shared (the copy in the main memory is in a global unmodified state), modified, and invalid. Each cache controller contains a special data structure called the modified line table. This table stores the addresses of all modified data blocks residing in caches in that column. Notice that all the modified line tables in a given column should be identical. A cache controller can issue four types of consistency commands:

READ: the associated processor wishes to read a data block that is not present in its cache.

READ-MOD: the associated processor wishes to write a data block that is not in a modified state in its cache.

ALLOCATE: an entire block is to be written regardless of its current contents.

WRITE-BACK: the data block in the main memory should be updated and set into unmodified state.
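The state model behind these commands is compact enough to write down directly. The following C sketch is my own rendering of the write-back, write-invalidate protocol as described, showing the two state sets, the four commands, and one representative transition; it is not code from the Wisconsin project:

/* Per-block state kept by the main memory. */
enum mem_state {
    MEM_UNMODIFIED,   /* memory copy correct; several cached copies may exist */
    MEM_MODIFIED      /* memory copy stale; exactly one correct cached copy   */
};

/* Per-line state kept by each cache. */
enum line_state { LINE_SHARED, LINE_MODIFIED, LINE_INVALID };

/* The four consistency commands a cache controller can issue. */
enum command { CMD_READ, CMD_READ_MOD, CMD_ALLOCATE, CMD_WRITE_BACK };

/* Representative transition: READ-MOD invalidates every other cached copy
   (write-invalidate) and leaves the requester holding the only, modified
   copy, so the main-memory copy becomes stale. */
void on_read_mod(enum mem_state *mem, enum line_state *requester,
                 enum line_state others[], int n_others)
{
    for (int i = 0; i < n_others; i++)
        others[i] = LINE_INVALID;
    *requester = LINE_MODIFIED;
    *mem = MEM_MODIFIED;
}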
2. Stanford FLASH: The main design issue in the Stanford FLASH project is the efficient combination of directory-based cache-coherent shared memory architectures and state-of-the-art message-passing architectures, in order to reduce the high hardware overhead of distributed shared memory machines and the high software overhead of multicomputers. The FLASH node comprises a high-performance commodity microprocessor (MIPS T5) with its caches, a portion of the main memory, and the MAGIC chip.

The heart of the FLASH design is the MAGIC chip, which integrates the memory controller, network interface, programmable protocol processor and I/O controller. The MAGIC chip contains an embedded processor to provide flexibility for various cache coherence and message-passing protocols. The applied directory-based cache coherence protocol has two components: the directory data structure, and the handlers that realize the cache coherence protocol. The directory data structure is built on a semi-dynamic pointer allocation scheme for maintaining the list of sharing processors.
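To make the directory idea concrete: for every memory block, the directory records which nodes currently hold a copy. The sketch below shows one plausible shape for such an entry in the spirit of the semi-dynamic pointer allocation scheme; the field names and sizes are my assumptions, not the actual MAGIC data structure:

#include <stdint.h>

#define INLINE_PTRS 4   /* assumed number of sharer pointers kept in the entry */

/* Overflow node for sharer lists that outgrow the inline slots. */
struct dir_overflow {
    uint16_t node;                  /* ID of an additional sharing node */
    struct dir_overflow *next;
};

/* Per-block directory entry: tracks the sharers of one cache line. When the
   sharer list outgrows the inline slots, it spills into a dynamically
   allocated extension list - the "semi-dynamic" part of the scheme. */
struct dir_entry {
    uint8_t  dirty;                 /* set if one node holds a modified copy */
    uint8_t  n_sharers;             /* number of valid entries in sharers[]  */
    uint16_t sharers[INLINE_PTRS];  /* node IDs of caches holding a copy     */
    struct dir_overflow *more;      /* overflow list, NULL while unused      */
};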
Concerning the message-passing protocols of FLASH, it can support different message types. To implement a protocol, two components must be defined for the embedded processor of MAGIC: the message type, and the executing handler that realises the necessary protocol. Messages are optimized to support cache operations and hence they are cache-line sized. User messages are realized as long messages transferred in three stages: initiation, transfer, and reception.

Owing to the central role of the MAGIC controller, FLASH can also be conceived as a message-passing computer extended with coherent caches. Its organization demonstrates well that the two MIMD architecture types that were strictly distinguished in the past, the shared memory and distributed memory architectures, will probably be merged into a single class in the near future.

F. Cache-Only Memory Access (COMA) Machines

COMA machines try to avoid the problems of static memory allocation of NUMA and CC-NUMA machines by excluding main memory blocks from the local memory of nodes and employing only large caches as node memories. In these architectures only cache memories are present; no main memory is employed, either in the form of a central shared memory as in UMA machines, or in the form of a distributed main memory as in NUMA and CC-NUMA computers. Similarly to the way virtual memory has eliminated the need to handle memory addresses explicitly, COMA machines render static data allocation to local memories superfluous. In COMA machines data allocation is demand driven; according to the cache coherence scheme, data is always attracted to the local (cache) memory where it is needed.

In COMA machines, similar cache coherence schemes can be applied as in other shared memory systems. The only difference is that these techniques must be extended with the capability of finding the data on a cache read miss and of handling replacement. Since COMA machines are scalable parallel architectures, only cache coherence protocols that support large-scale parallel systems can be applied, that is, directory schemes and hierarchical cache coherence schemes. Two representative COMA architectures are the DDM (Data Diffusion Machine) and the KSR1.

IV. CONCLUSION

The main problem one is confronted with in shared-memory systems is that of the connection of the CPUs to each other and to the memory. As more CPUs are added, the collective bandwidth to the memory ideally should increase linearly with the number of processors, while each processor should preferably communicate directly with all others, without the much slower alternative of having to use the memory as an intermediate stage. Unfortunately, full interconnection is quite costly, growing as O(n²) while the number of processors grows only as O(n). So, various alternatives have been tried.
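To make the growth rates concrete, full interconnection needs one dedicated link per pair of processors, so (a standard counting argument, added here for clarity):

\text{links}(n) = \binom{n}{2} = \frac{n(n-1)}{2} = O(n^2), \qquad \text{processors}(n) = O(n).

For example, 32 processors would already require 496 point-to-point links.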
V. HELPFUL HINTS

A. Math

No calculations have been shared in the body text.

B. Units

There are no units to share in this publication.

C. Figures

All figures have been shown in the body text.

D. Equations

No algorithmic calculations have been used in the body text.

ACKNOWLEDGMENT

I would like to express my gratitude for the many helpful comments and suggestions I have received over the last few days regarding the expository and critical aspects of my term work, and especially for those comments which bear directly on various arguments for the central thesis of the term work. I am indebted to several people in this regard. Most importantly, I would like to thank my HOD (Head of Department) and my ARCH lecturer, Miss Ramanpreet Kaur Lamba, for her days of supervision. Her critical commentary on my work has played a major role in both the content and presentation of our discussion and arguments.

I also extend my appreciation to the several sources which provided various kinds of knowledge-base support for me during this period.

Yours obediently,

SHIVANI
