1/1 MULTIPROCESSORS (OR) SHARED-MEMORY MULTIPROCESSOR MODEL
Two categories of parallel computers are architecturally modeled below. These physical models are
distinguished by having a shared common memory or unshared distributed memories.
1. SHARED-MEMORY MULTIPROCESSORS:
There are three shared-memory multiprocessor models:
i. Uniform Memory-Access (UMA) model
ii. Non-Uniform Memory-Access (NUMA) model
iii. Cache-Only Memory Architecture (COMA) model
These models differ in how the memory and peripheral resources are shared or distributed.
i. UNIFORM MEMORY-ACCESS (UMA) MODEL: The physical memory is uniformly shared by all the processors. All processors have equal access time to all memory words, which is why it is called uniform memory access. Each processor may use a private cache. When all processors also have equal access to all peripheral devices, the system is called a symmetric multiprocessor.
ii. NON-UNIFORM MEMORY-ACCESS (NUMA) MODEL: A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word. The shared memory is physically distributed among the processors; these distributed memories are called local memories.
iii. CACHE-ONLY MEMORY ARCHITECTURE (COMA) MODEL: A multiprocessor using only cache memory assumes the COMA model. The COMA model is a special case of a NUMA machine in which the distributed main memories are converted to caches. There is no memory hierarchy at each processor node, and all the caches form a global address space. Examples of COMA machines include the Swedish Institute of Computer Science's Data Diffusion Machine and Kendall Square Research's KSR-1 machine.
1/8 CALCULATE CPI & MIPS FOR 1GHZ PROCESSOR
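The original problem data has not survived in these notes, so the following is a minimal sketch with a hypothetical instruction mix; it applies the standard relations CPI = total cycles / total instructions and MIPS rate = clock rate / (CPI x 10^6) to a 1 GHz clock:

clock_rate = 1e9             # 1 GHz = 1e9 clock cycles per second

# (instruction count, cycles per instruction) per class: assumed values,
# since the original instruction mix is not given in these notes.
mix = [
    (45000, 1),              # ALU operations
    (32000, 2),              # loads/stores
    (15000, 2),              # branches
    (8000, 4),               # floating-point operations
]

total_instructions = sum(n for n, _ in mix)
total_cycles = sum(n * c for n, c in mix)

cpi = total_cycles / total_instructions   # average cycles per instruction
mips = clock_rate / (cpi * 1e6)           # million instructions per second

print(f"CPI  = {cpi:.2f}")                # CPI  = 1.71
print(f"MIPS = {mips:.1f}")               # MIPS = 584.8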
1/2…DISTRIBUTED-MEMORY MULTICOMPUTER
A distributed-memory multicomputer system, modeled in the figure above, consists of multiple computers, often called nodes, interconnected by a message-passing network. Each node is an autonomous computer consisting of a processor, local memory, and sometimes attached disks or I/O peripherals. The message-passing network provides point-to-point static connections among the nodes. All local memories are private and are accessible only by the local processor. For this reason, traditional multicomputers have been called no-remote-memory-access (NORMA) machines. However, this restriction is gradually being removed in multicomputers with distributed shared memories. Internode communication is carried out by passing messages through the static connection network, as in the sketch below.
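A minimal sketch of NORMA-style internode communication, using Python's multiprocessing module as a stand-in for the message-passing network (the two-node layout and the values exchanged are illustrative assumptions):

from multiprocessing import Process, Pipe

def node(conn, node_id):
    # Local memory: private to this node and invisible to the other node.
    local = {"id": node_id, "value": node_id * 10}
    conn.send(local["value"])   # explicit message, not a remote load/store
    received = conn.recv()      # block until the peer's message arrives
    print(f"node {node_id} received {received}")

if __name__ == "__main__":
    a, b = Pipe()               # one point-to-point static connection
    p0 = Process(target=node, args=(a, 0))
    p1 = Process(target=node, args=(b, 1))
    p0.start(); p1.start()
    p0.join(); p1.join()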
1/3…ARCHITECTURE OF A VECTOR SUPERCOMPUTER (WITH NEAT DIAGRAM)
A vector computer is often built on top of a scalar processor: the vector processor is attached to the scalar processor as an optional feature. All instructions are first decoded by the scalar control unit. If the decoded instruction is a scalar operation or a program control operation, it will be directly executed by the scalar processor using the scalar functional pipelines. If the instruction is decoded as a vector operation, it will be sent to the vector control unit. This control unit supervises the flow of vector data between the main memory and the vector functional pipelines.
Vector Processor Models: It is a register-to-register architecture. Vector registers are used to hold the vector operands and the intermediate and final vector results. The vector functional pipelines retrieve operands from, and put results into, the vector registers. All vector registers are programmable in user instructions. The length of each vector register is usually fixed, say, sixty-four 64-bit component registers in a vector register in a Cray Series supercomputer. Other machines, like the Fujitsu VP2000 Series, use reconfigurable vector registers to dynamically match the register length with that of the vector operands. A register-to-register vector operation is sketched below.
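A minimal sketch of a register-to-register vector operation, assuming fixed-length vector registers of sixty-four components as in the Cray Series note above (the register names and the vadd helper are illustrative):

VLEN = 64                                  # fixed vector register length

v_regs = {name: [0] * VLEN for name in ("V0", "V1", "V2")}

def vadd(dst, src1, src2):
    # Operands come from vector registers and the result goes back into
    # a vector register; the pipeline never touches main memory here.
    for i in range(VLEN):
        v_regs[dst][i] = v_regs[src1][i] + v_regs[src2][i]

v_regs["V0"] = list(range(VLEN))           # pretend a vector load filled V0
v_regs["V1"] = [1] * VLEN
vadd("V2", "V0", "V1")                     # V2 <- V0 + V1
print(v_regs["V2"][:4])                    # [1, 2, 3, 4]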
1/5…SIMD SUPERCOMPUTERS
SIMD computers have a single instruction stream operating over multiple data streams. An operational model of an SIMD computer is specified by a 5-tuple M = (N, C, I, M, R), where:
1. N is the number of processing elements (PEs) in the machine.
2. C is the set of instructions directly executed by the control unit (CU), including scalar and program flow control instructions.
3. I is the set of instructions broadcast by the CU to all PEs for parallel execution.
4. M is the set of masking schemes, where each mask partitions the set of PEs into enabled and disabled subsets.
5. R is the set of data-routing functions, specifying various patterns to be set up in the interconnection network for inter-PE communications.
A sketch of masked SIMD execution follows.
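A minimal sketch of masked SIMD execution: one instruction is broadcast by the CU, and only the enabled subset of PEs applies it (N, pe_data and the mask are illustrative assumptions):

N = 8                                   # number of PEs
pe_data = list(range(N))                # one data element per PE
mask = [True, False, True, True, False, True, False, True]

def broadcast(op, data, mask):
    # Each enabled PE applies the broadcast op to its own data element;
    # disabled PEs sit the cycle out and keep their old value.
    return [op(x) if enabled else x for x, enabled in zip(data, mask)]

pe_data = broadcast(lambda x: x * 2, pe_data, mask)
print(pe_data)                          # [0, 1, 4, 6, 4, 10, 6, 14]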
1/6…STATIC CONNECTION NETWORKS
2/5…VIRTUAL MEMORY MODEL FOR MULTIPROCESSORS
1. PRIVATE VIRTUAL MEMORY
In this scheme, each processor has a separate virtual address space, but all processors share the same physical address space.
Advantages:
• Small processor address space
• Protection on a per-page or per-process basis
• Private memory maps, which require no locking
Disadvantages:
• The synonym problem: different virtual addresses, in the same or different virtual spaces, can point to the same physical page (see the sketch below)
• The same virtual address in different virtual spaces may point to different pages in physical memory
2. SHARED VIRTUAL MEMORY
• All processors share a single virtual address space, with each processor being given a portion of it.
• Some of the virtual addresses can be shared by multiple processors.
Advantages:
• All addresses are unique
• Synonyms are not allowed
Disadvantages:
• Processors must be capable of generating large virtual addresses (usually > 32 bits)
• Since the page table is shared, mutual exclusion must be used to guarantee atomic updates
• Segmentation must be used to confine each process to its own address space
• The address translation process is slower than with private (per-processor) virtual memory
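A minimal sketch of the synonym problem under private virtual memory, assuming plain dicts stand in for the per-processor page tables (all page and frame numbers are illustrative):

# Each processor has its own page table (virtual page -> physical frame);
# the physical address space is shared by both processors.
page_table_p0 = {0x10: 7, 0x11: 3}      # processor P0's private map
page_table_p1 = {0x20: 7, 0x11: 9}      # processor P1's private map

# Synonym: different virtual pages (0x10 on P0, 0x20 on P1) map to the
# same physical frame 7, so a copy cached under one name can go stale
# when the frame is written under the other name.
assert page_table_p0[0x10] == page_table_p1[0x20] == 7

# Ambiguity: the same virtual page number 0x11 maps to different
# physical frames on different processors.
assert page_table_p0[0x11] != page_table_p1[0x11]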
2/6 PAGE REPLACEMENT POLICIES
Page Traces: A page trace is a sequence of page frame numbers (PFNs) generated during the execution of a given program.
• Least recently used (LRU): replaces the page in R(t) which has the longest backward distance.
• Optimal (OPT) algorithm: replaces the page in R(t) which has the longest forward distance.
• First-in-first-out (FIFO): replaces the page in R(t) which has been in memory for the longest time.
• Least frequently used (LFU): replaces the page in R(t) which has been least referenced in the past.
• Circular FIFO: joins all the page frame entries into a circular FIFO queue, using a pointer to indicate the front of the queue.
• Random replacement: a trivial algorithm which chooses any page for replacement randomly.
A sketch of LRU and FIFO over a page trace follows.
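A minimal sketch of LRU and FIFO replacement over a page trace, assuming a fixed number of page frames (the trace and frame count are hypothetical; substitute the values from a given problem):

from collections import OrderedDict, deque

def lru_hits(trace, frames):
    resident = OrderedDict()                  # insertion order = recency order
    hits = 0
    for page in trace:
        if page in resident:
            hits += 1
            resident.move_to_end(page)        # mark as most recently used
        else:
            if len(resident) == frames:
                resident.popitem(last=False)  # evict longest backward distance
            resident[page] = True
    return hits

def fifo_hits(trace, frames):
    queue, resident, hits = deque(), set(), 0
    for page in trace:
        if page in resident:
            hits += 1                         # FIFO order is NOT updated on a hit
        else:
            if len(queue) == frames:
                resident.discard(queue.popleft())  # evict the oldest arrival
            queue.append(page)
            resident.add(page)
    return hits

trace = [1, 2, 3, 2, 1, 4, 1, 2]              # hypothetical page trace
for name, fn in (("LRU", lru_hits), ("FIFO", fifo_hits)):
    h = fn(trace, 3)
    print(f"{name}: {h}/{len(trace)} hits, hit ratio = {h / len(trace):.2f}")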
2/7…TLB (TRANSLATION LOOKASIDE BUFFER)
The TLB is a high-speed lookup table which stores the most recently or likely referenced page entries. A page entry is essentially a (virtual page number, page frame number) pair. The hope is that pages belonging to the same working set will be translated directly using the TLB entries.
Using a TLB and page tables (PTs) for address translation:
• Each virtual address is divided into 3 fields: the leftmost field holds the virtual page number, the middle field identifies the cache block number, and the rightmost field is the word address within the block.
• The first step of the translation is to use the virtual page number as a key to search the TLB for a match.
• The TLB can be implemented with a special associative memory (content-addressable memory) or with part of the cache memory.
• In case of a match (a hit) in the TLB, the page frame number is retrieved from the matched page entry. The cache block and word address are copied directly.
• In case a match cannot be found (a miss) in the TLB, a hashed pointer is used to identify one of the page tables where the desired page frame number can be retrieved, as in the sketch below.
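A minimal sketch of TLB-assisted translation, assuming a dict stands in for the associative TLB and another for the page table; the 12-bit offset (4 KiB pages) and all page numbers are illustrative assumptions:

PAGE_OFFSET_BITS = 12                   # assumed page size: 4 KiB

tlb = {0x42: 0x7A1}                     # (virtual page no. -> page frame no.)
page_table = {0x42: 0x7A1, 0x43: 0x113}

def translate(vaddr):
    vpn = vaddr >> PAGE_OFFSET_BITS     # leftmost field: virtual page number
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)  # block + word fields
    if vpn in tlb:                      # hit: frame number found directly
        pfn = tlb[vpn]
    else:                               # miss: fall back to a page table
        pfn = page_table[vpn]           # (hardware hashes into one of the PTs)
        tlb[vpn] = pfn                  # cache the entry for the working set
    return (pfn << PAGE_OFFSET_BITS) | offset       # offset copied unchanged

print(hex(translate(0x42ABC)))          # TLB hit  -> 0x7a1abc
print(hex(translate(0x43123)))          # TLB miss -> 0x113123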
2/8…SCHEMES USED FOR TRANSLATING VIRTUAL ADDRESSES INTO PHYSICAL ADDRESSES
ADDRESS TRANSLATION MECHANISMS
The process demands the translation of virtual addresses into physical addresses. Various schemes for virtual address translation are summarized below.
• The translation demands the use of translation maps, which can be implemented in various ways.
• Translation maps are stored in the cache, in associative memory, or in the main memory.
• To access these maps, a mapping function is applied to the virtual address. This function generates a pointer to the desired translation map.
• This mapping can be implemented with a hashing or congruence function.
• Hashing is a simple computer technique for converting a long page number into a short one with fewer bits, as in the sketch below.
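A minimal sketch of such a hashing (congruence) function, folding a long virtual page number into a short index into a translation map (the table size and page numbers are illustrative assumptions):

TABLE_SIZE = 256                        # map entries -> an 8-bit index

def hash_vpn(vpn):
    # Congruence function: a long page number becomes a short index.
    return vpn % TABLE_SIZE

for vpn in (0x00042, 0x10042, 0xABCDE):
    print(f"VPN {vpn:#07x} -> map index {hash_vpn(vpn)}")
# Note the collision: 0x00042 and 0x10042 hash to the same index, so each
# map entry must still store the full page number tag to verify a match.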
2/9 PAGE REPLACEMENT POLICY CALCULATE HIT RATIO
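The original trace for this problem is not preserved in these notes. As a hedged worked example using the hypothetical trace from the sketch under 2/6: for the trace 1, 2, 3, 2, 1, 4, 1, 2 with 3 page frames, LRU scores 4 hits out of 8 references (the second references to 2 and 1, then the later references to 1 and 2), giving a hit ratio of 4/8 = 0.50, while FIFO scores only 2 hits (hit ratio 2/8 = 0.25) because it does not refresh a page's position on a hit.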
3/1..ARBITRATION: Bus arbitration is the process of selecting one bus master from among several potential masters that request the shared bus at the same time; the arbiter grants control of the bus to exactly one requester at a time.
TYPES OF ARBITRATION
1).. CENTRAL ARBITRATION
• Potential masters are daisy-chained in a cascade.
• A special signal line propagates the bus-grant from the first master (at slot 1) to the last master (at slot n).
• All requests share the same bus-request line.
• The bus-request signals the rise of the bus-grant level, which in turn raises the bus-busy level.
2).. DISTRIBUTED ARBITRATION
• Uses arbitration numbers to resolve competition: when two or more devices compete for the bus, the winner is the one whose arbitration number is the largest, as determined by parallel contention arbitration (see the sketch after this list).
• All potential masters send their arbitration numbers to the shared-bus request/grant (SBRG) lines, and each compares its own number with the SBRG number.
• It is a priority-based scheme.
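A minimal sketch of distributed (parallel contention) arbitration, assuming each master drives a unique arbitration number onto the SBRG lines (the master names and numbers are illustrative):

def arbitrate(requesters):
    # requesters: {master: arbitration_number}. As masters see a larger
    # number than their own on the SBRG lines they withdraw, so the value
    # left on the lines converges to the largest competing number.
    sbrg = max(requesters.values())
    for master, number in requesters.items():
        if number == sbrg:              # only the top-priority master remains
            return master

print(arbitrate({"disk": 0b0101, "net": 0b0110, "gpu": 0b0011}))  # -> net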
3/2..MEMORY INTERLEAVING
1).. LOW-ORDER INTERLEAVING
• Low-order interleaving spreads contiguous memory locations across the m modules horizontally.
• This implies that the low-order a bits of the memory address are used to identify the memory module, while the high-order b bits give the word address within each module.
2).. HIGH-ORDER INTERLEAVING
• High-order interleaving uses the high-order a bits as the module address and the low-order b bits as the word address within each module.
A sketch of both address decodings follows.
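A minimal sketch of both decodings, assuming m = 2^a memory modules and 2^b words per module (the field widths are illustrative):

a, b = 2, 4                 # 4 modules, 16 words per module
m = 1 << a

def low_order(addr):
    # Low-order a bits pick the module, so consecutive addresses land
    # in consecutive modules (good for streaming/pipelined access).
    return addr & (m - 1), addr >> a            # (module, word)

def high_order(addr):
    # High-order a bits pick the module, so consecutive addresses
    # stay inside one module.
    return addr >> b, addr & ((1 << b) - 1)     # (module, word)

for addr in (0, 1, 2, 3):
    print(addr, "low:", low_order(addr), "high:", high_order(addr))
# Low-order: addresses 0..3 hit modules 0,1,2,3 (interleaved).
# High-order: addresses 0..3 all hit module 0.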
3/3 CACHE MAPPING..
1).. DIRECT-MAPPING CACHE
• Direct mapping of n/m = 2^(s-r) memory blocks to one block frame in the cache.
• Placement uses a modulo-m function: block Bj is mapped to block frame Bi, i.e. Bj -> Bi if i = j mod m.
• There is a unique block frame Bi that each Bj can load into, so there is no way to implement a block replacement policy.
• The memory address is divided into 3 fields: the lower w bits specify the word offset within each block; the upper s bits specify the block address in main memory; the leftmost (s-r) bits specify the tag to be matched.
• ADVANTAGES: simple hardware; no associative search; no block replacement policy needed; lower cost.
• DISADVANTAGES: rigid mapping; poorer hit ratio; prohibits parallel virtual address translation; a larger cache with more block frames is needed to avoid contention.
A sketch of the direct-mapped address split follows.
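A minimal sketch of direct-mapped placement, assuming m = 2^r block frames and 2^w words per block (the sizes are illustrative):

r, w = 3, 2                  # 8 frames, 4 words per block

def direct_map(addr):
    word = addr & ((1 << w) - 1)        # lower w bits: word offset
    block = addr >> w                   # upper s bits: block address j
    frame = block & ((1 << r) - 1)      # i = j mod m (the r index bits)
    tag = block >> r                    # leftmost s-r bits: tag to match
    return tag, frame, word

# Blocks j = 5 and j = 13 collide in frame 5; only the tag tells them apart.
for j in (5, 13):
    print(f"block {j} -> (tag, frame, word) = {direct_map(j << w)}")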
2).. FULLY ASSOCIATIVE CACHE
• Each block in main memory can be placed in any of the available block frames, as shown in the figure.
• Because of this flexibility, an s-bit tag is needed in each cache block. As s > r, this represents a significant increase in tag length.
• ADVANTAGES: offers the most flexibility in mapping cache blocks; higher hit ratio; allows a better block replacement policy with reduced block contention.
• DISADVANTAGES: higher hardware cost; only moderate cache sizes are practical; expensive search process.
3).. SET-ASSOCIATIVE CACHES
• In a k-way set-associative cache, the m cache block frames are divided into v = m/k sets, with k blocks per set.
• Each set is identified by a d-bit set number, where 2^d = v.
• The cache block tags are now reduced to s-d bits.
• In practice, the set size k (the associativity) is chosen as 2, 4, 8, 16 or 64, depending on a tradeoff among block size w, cache size m and other performance/cost factors.
• The tag is compared with the k tags within the identified set, as shown in Fig. 5.11a.
• Since k is rather small in practice, the k-way associative search is much more economical than full associativity.
• In general, a block Bj can be mapped into any one of the available frames Bf in set Si: Bj -> Bf in Si if i = j mod v.
A sketch of the set-index computation follows.
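A minimal sketch of k-way set-associative placement, assuming m frames split into v = m/k sets (the sizes are illustrative):

m, k = 8, 2                  # 8 frames, 2-way -> v = 4 sets
v = m // k

def set_of(block_j):
    return block_j % v       # Bj may go in any of the k frames of set Si

# Blocks 3, 7 and 11 all fall in set 3; a 2-way set can hold two of them
# at once, and a replacement policy (e.g. LRU) picks the victim for the third.
for j in (3, 7, 11):
    print(f"block {j} -> set {set_of(j)}")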
4/1..SNOOPY BUS PROTOCOLS
Snoopy protocols achieve data consistency among the caches and shared memory through a bus-watching mechanism. In the following diagram, two snoopy bus protocols create different results. Consider n processors (P1, P2, ..., Pn) maintaining consistent copies of block X in their local caches and in the shared-memory module marked X.
• Using a write-invalidate protocol, the processor P1 modifies (writes) its cache copy from X to X', and all other copies are invalidated via the bus. Invalidated blocks are called dirty, meaning they should not be used.
• The write-update protocol demands that the new block content X' be broadcast to all cache copies via the bus.
• The memory copy is also updated if write-through caches are used; with write-back caches, the memory copy is updated later, when the modified block is replaced.
A sketch of both protocols follows.
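A minimal sketch contrasting write-invalidate and write-update snooping, assuming each cache is a dict mapping a block name to a (value, valid) pair and the "bus" is a loop over the other caches (all names are illustrative):

caches = [{"X": ("X", True)} for _ in range(3)]   # P1, P2, P3 all hold X

def write_invalidate(writer, block, new_value):
    caches[writer][block] = (new_value, True)     # local copy becomes X'
    for i, cache in enumerate(caches):            # snooped on the bus
        if i != writer and block in cache:
            value, _ = cache[block]
            cache[block] = (value, False)         # other copies invalidated

def write_update(writer, block, new_value):
    for cache in caches:                          # broadcast X' on the bus
        if block in cache:
            cache[block] = (new_value, True)      # every copy is refreshed

write_invalidate(0, "X", "X'")
print(caches)   # P1 holds X' (valid); P2 and P3 hold stale X marked invalid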