
1/1…MULTIPROCESSORS {OR} SHARED MEMORY MULTI-PROCESSOR MODEL

Two categories of parallel computers are architecturally modeled below. These physical models are
distinguished by having a shared common memory or unshared distributed memories.
1. SHARED-MEMORY MULTIPROCESSORS:
There are three shared-memory multiprocessor models:
i. Uniform Memory-Access (UMA) model
ii. Non-Uniform Memory-Access (NUMA) model
iii. Cache-Only Memory Architecture (COMA) model
These models differ in how the memory and peripheral resources are shared or distributed.
i. UNIFORM MEMORY-ACCESS (UMA) MODEL: The physical memory is uniformly shared by all the processors. All processors have equal access time to all memory words, which is why it is called uniform memory access. Each processor may use a private cache. When all processors have equal access to all peripheral devices, the system is called a symmetric multiprocessor.
ii. NON UNIFORM-MEMORY-ACCESS (NUMA) MODEL:

A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word. The shared memory is physically distributed among all processors, in the form of local memories.
iii. CACHE-ONLY MEMORY ARCHITECTURE (COMA) MODEL
A multiprocessor using cache-only memory assumes the COMA model. The COMA model is a special case of a NUMA machine, in which the distributed main memories are converted to caches. There is no memory hierarchy at each processor node; all the caches form a global address space. Examples of COMA machines include the Swedish Institute of Computer Science's Data Diffusion Machine and Kendall Square Research's KSR-1 machine.
1/2…DISTRIBUTED-MEMORY MULTICOMPUTER
A distributed-memory multicomputer system, as modeled in the figure, consists of multiple computers, often called nodes, interconnected by a message-passing network. Each node is an autonomous computer consisting of a processor, local memory, and sometimes attached disks or I/O peripherals. The message-passing network provides point-to-point static connections among the nodes. All local memories are private and are accessible only by local processors. For this reason, traditional multicomputers have been called no-remote-memory-access (NORMA) machines. However, this restriction will gradually be removed in future multicomputers with distributed shared memories. Internode communication is carried out by passing messages through the static connection network.
1/3…ARCHITECTURE OF VECTOR SUPERCOMPUTER WITH NEAT DIAGRAM
A vector computer is often built on top of a scalar processor; the vector processor is attached to the scalar processor as an optional feature. All instructions are first decoded by the scalar control unit. If the decoded instruction is a scalar operation or a program control operation, it will be directly executed by the scalar processor using the scalar functional pipelines. If the instruction is decoded as a vector operation, it will be sent to the vector control unit, which supervises the flow of vector data between the main memory and the vector functional pipelines.
Vector Processor Models: This is a register-to-register architecture. Vector registers are used to hold the vector operands and the intermediate and final vector results. The vector functional pipelines retrieve operands from and put results into the vector registers. All vector registers are programmable in user instructions. The length of each vector register is usually fixed, say, sixty-four 64-bit component registers in a vector register in a Cray Series supercomputer. Other machines, like the Fujitsu VP2000 Series, use reconfigurable vector registers to dynamically match the register length with that of the vector operands.

1/4...DEFINE CLOCK RATE, CPI/IPC, MIPS, THROUGHPUT RATE & PARALLELIZING COMPILER

The inverse of the cycle time is the clock rate (f = 1/τ, in megahertz). Different machine instructions may require different numbers of clock cycles to execute; the average number of cycles per instruction (CPI) measures this. With C the total number of clock cycles and Ic the instruction count, CPI = C/Ic, and the CPU time is T = Ic × CPI × τ = Ic × CPI / f. Processor speed is often measured in terms of million instructions per second (MIPS). Throughput Rate: the number of programs a system can execute per unit time is called the system throughput Ws (in programs/second). A parallelizing (or vectorizing) compiler is a fully developed compiler which can automatically detect parallelism in source code and transform sequential codes into parallel constructs.
1/5…FLYNN’S CLASSIFICATION OF COMPUTER ARCHITECTURE
Michael Flynn (1972) introduced a classification of various computer architectures based on notions of
instruction and data streams.
1.SISD (Single Instruction stream over a Single Data stream) computers: Conventional sequential machines are called SISD computers. They are also called scalar processors, i.e., one instruction at a time, and each instruction has only one set of operands. Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle. Single data: only one data stream is being used as input during any one clock cycle.
2.SIMD (Single Instruction stream over Multiple Data streams) machines: a type of parallel computer. Single instruction: all processing units execute the same instruction issued by the control unit at any given clock cycle. Multiple data: each processing unit can operate on a different data element. The processors are connected to shared memory or an interconnection network providing multiple data to the processing units. This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units.
3.MIMD (Multiple Instruction streams over Multiple Data streams) machines
Multiple instruction: every processor may be executing a different instruction stream. Multiple data: every processor may be working with a different data stream; the multiple data streams are provided by shared memory or an interconnection network. Examples: most current supercomputers, networked parallel computer "grids" and multi-processor SMP computers, including some types of PCs.
4.MISD (Multiple Instruction streams and a Single Data stream) machines
A single data stream is fed into multiple processing units. Each processing unit operates on the data independently via independent instruction streams: the single data stream is forwarded to the different processing units, each of which is connected to a different control unit and executes the instructions given to it by that control unit. Few actual machines of this class have been built.

1/5… SIMD SUPERCOMPUTERS: SIMD computers have a single instruction stream over multiple data streams. An operational model of an SIMD computer is specified by a 5-tuple: M = (N, C, I, M, R), where
1. N is the number of processing elements (PEs) in the machine.
2. C is the set of instructions directly executed by the control unit (CU), including scalar and program flow control instructions.
3. I is the set of instructions broadcast by the CU to all PEs for parallel execution.
4. M is the set of masking schemes, where each mask partitions the set of PEs into enabled and disabled subsets.
5. R is the set of data-routing functions, specifying various patterns to be set up in the interconnection network for inter-PE communications.
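A minimal sketch of how a masking scheme partitions the PEs into enabled and disabled subsets; the PE count, data, and function names are invented for illustration:

    # Illustrative SIMD masked execution: one broadcast instruction,
    # applied only to the PEs enabled by the mask (all names hypothetical).
    N = 8                                  # number of PEs
    data = [3, 1, 4, 1, 5, 9, 2, 6]        # one operand per PE's local memory
    mask = [1, 0, 1, 1, 0, 1, 0, 1]        # masking scheme: enabled/disabled PEs

    def broadcast(op, data, mask):
        """CU broadcasts 'op' to all PEs; only enabled PEs execute it."""
        return [op(x) if m else x for x, m in zip(data, mask)]

    result = broadcast(lambda x: x * 2, data, mask)
    print(result)   # disabled PEs keep their old values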
1/6…STATIC CONNECTION NETWORKS

1/7… TYPES OF DATA DEPENDENCIES {OR} EXPLAIN 3 DEPENDENCIES

Data dependence: the ordering relationship between statements is indicated by the data dependence. Five types of data dependence are defined below:
1.FLOW DEPENDENCE: A statement S2 is flow dependent on S1 if an execution path exists from S1 to S2 and if at least one output (variable assigned) of S1 feeds in as input (operand to be used) to S2. Also called a RAW (read-after-write) hazard, denoted S1 → S2.
2.ANTIDEPENDENCE: Statement S2 is antidependent on statement S1 if S2 follows S1 in program order and if the output of S2 overlaps the input to S1. Also called a WAR (write-after-read) hazard.
3.OUTPUT DEPENDENCE: Two statements are output dependent if they produce (write) the same output variable. Also called a WAW (write-after-write) hazard.
4.I/O DEPENDENCE: Read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.
5.UNKNOWN DEPENDENCE: The dependence relation between two statements cannot be determined in the following situations: a variable appears more than once with subscripts having different coefficients of the loop variable, or the subscript is nonlinear in the loop index variable.
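The first three cases can be seen in a short straight-line fragment; the variable names are arbitrary:

    b, c = 2, 3
    a = b + c      # S1
    d = a * 2      # S2: flow dependent (RAW) on S1 -- reads the 'a' that S1 wrote
    b = 7          # S3: antidependent (WAR) on S1 -- overwrites the 'b' that S1 read
    d = b - 1      # S4: output dependent (WAW) on S2 -- both write 'd'
    print(a, b, d)

Reordering S3 before S1, or S4 before S2, would change the result, which is exactly what these dependences forbid.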
Control Dependence: This refers to the situation where the order of execution of statements cannot be determined before run time, for example with conditional statements, where the flow of execution depends on their outcome. Different paths taken after a conditional branch may introduce or eliminate data dependences among the instructions.
Resource dependence: Data and control dependences are based on the independence of the work to be done. Resource dependence is concerned with conflicts in using shared resources, such as registers, integer and floating-point ALUs, etc. ALU conflicts are called ALU dependence; memory (storage) conflicts are called storage dependence.
1/8 CALCULATE CPI & MIPS 400MHZ PROCESSOR
2/1..DIFFERENTIATE THE CHARACTERISTICS OF CISC AND RISC {OR} INSTRUCTION SET ARCHITECTURE
•CISC -> Many different instructions; many different operand data types; many different operand addressing formats; relatively small number of general-purpose registers; many instructions directly match high-level language constructions; unified cache for instructions and data (in most cases); microprogrammed control units and ROM in earlier processors (hard-wired control units now in some CISC systems).
•RISC -> Many fewer instructions than CISC (freeing chip space for more functional units!); fixed instruction format (e.g. 32 bits) and simple operand addressing; relatively large number of registers; small CPI (close to 1) and high clock rates; separate instruction and data caches; hard-wired control units.
2/2…ARCHITECTURE OF THE VLIW PROCESSOR AND ITS PIPELINE OPERATION
•VLIW = Very Long Instruction Word. •Instructions are usually hundreds of bits long. •Each instruction word essentially carries multiple "short instructions". •Each of the "short instructions" is effectively issued at the same time. •(This is related to the long words frequently used in microcode.) •Compilers for VLIW architectures should optimally try to predict branch outcomes to properly group instructions.
Pipelining in VLIW Processors: •Decoding of instructions is easier in VLIW than in superscalars, because each "region" of an instruction word is usually limited as to the type of instruction it can contain. •Code density in VLIW is less than in superscalars, because if a "region" of a VLIW word isn't needed in a particular instruction, it must still exist (to be filled with a "no-op"). •Superscalars can be compatible with scalar processors; this is difficult with VLIW parallel and non-parallel architectures.

2/3...Memory Hierarchy Technology

Storage devices such as registers, caches, main memory, disk devices, and backup storage are often organized as a hierarchy. The memory technology and storage organization at each level are characterized by five parameters:
1.ACCESS TIME t_i (round-trip time from the CPU to the ith level)
2.MEMORY SIZE s_i (number of bytes or words in level i)
3.COST PER BYTE c_i
4.TRANSFER BANDWIDTH b_i (rate of transfer between levels)
5.UNIT OF TRANSFER x_i (grain size for transfers between levels i and i+1)
Properties: 1.Inclusion: the implication of the inclusion property is that all items of information in the innermost memory level (cache) also appear in the outer memory levels. 2.Coherence: the requirement that copies of data items at successive memory levels be consistent is called the coherence property. 3.Locality: memory references are generated by the CPU for either instruction or data access, and they tend to cluster in certain regions of time and space.
2/4…RISC SUPERSCALAR AND PIPELINED PROCESSORS
This subclass of RISC processors allows multiple instructions to be issued simultaneously during each cycle. The effective CPI of a superscalar processor should be less than that of a generic scalar RISC processor. Clock rates of scalar RISC and superscalar RISC machines are similar. Representative RISC processors: 1.Sun SPARC 2.Intel i860 3.Motorola M88100 4.AMD 29000.
Superscalar processors are designed to exploit more instruction-level parallelism in user programs. Only independent instructions can be executed in parallel without causing a wait state. The amount of instruction-level parallelism varies widely depending on the type of code being executed.
PIPELINED PROCESSORS •A base scalar processor: issues one instruction per cycle; has a one-cycle latency for a simple operation; has a one-cycle latency between instruction issues; can be fully utilized if instructions can enter the pipeline at a rate of one per cycle. •For a variety of reasons, instructions might not be able to be pipelined as aggressively as in a base scalar processor. In these cases, we say the pipeline is underpipelined.
2/5…VIRTUAL MEMORY MODEL FOR MULTIPROCESSOR
1.PRIVATE VIRTUAL MEMORY
In this scheme, each processor has a separate virtual address space, but all processors share the same physical address space.
Advantages: •Small processor address space •Protection on a per-page or per-process basis •Private memory maps, which require no locking
Disadvantages: •The synonym problem: different virtual addresses in different/same virtual spaces point to the same physical page •The same virtual address in different virtual spaces may point to different pages in physical memory
2.SHARED VIRTUAL MEMORY
•All processors share a single shared virtual address space, with each processor being given a portion of it. •Some of the virtual addresses can be shared by multiple processors.
Advantages: •All addresses are unique •Synonyms are not allowed
Disadvantages: •Processors must be capable of generating large virtual addresses (usually > 32 bits) •Since the page table is shared, mutual exclusion must be used to guarantee atomic updates •Segmentation must be used to confine each process to its own address space •The address translation process is slower than with private (per-processor) virtual memory
2/6 PAGE REPLACEMENT POLICIES: Page Traces: A page trace is a sequence of page frame numbers (PFNs) generated during the execution of a given program.
•Least recently used (LRU): replaces the page in R(t) which has the longest backward distance.
•Optimal (OPT) algorithm: replaces the page in R(t) with the longest forward distance.
•First-in-first-out (FIFO): replaces the page in R(t) which has been in memory for the longest time.
•Least frequently used (LFU): replaces the page in R(t) which has been least referenced in the past.
•Circular FIFO: joins all the page frame entries into a circular FIFO queue, using a pointer to indicate the front of the queue.
•Random replacement: a trivial algorithm which chooses any page for replacement randomly.
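A minimal simulator for comparing hit ratios under FIFO and LRU on a page trace; the trace and frame count below are assumed examples, not figures from the source:

    from collections import OrderedDict, deque

    def fifo_hits(trace, frames):
        q, resident, hits = deque(), set(), 0
        for p in trace:
            if p in resident:
                hits += 1
            else:
                if len(q) == frames:           # evict the longest-resident page
                    resident.discard(q.popleft())
                q.append(p); resident.add(p)
        return hits / len(trace)

    def lru_hits(trace, frames):
        od, hits = OrderedDict(), 0
        for p in trace:
            if p in od:
                hits += 1; od.move_to_end(p)   # refresh recency on a hit
            else:
                if len(od) == frames:          # evict longest backward distance
                    od.popitem(last=False)
                od[p] = True
        return hits / len(trace)

    trace = [0, 1, 2, 4, 2, 3, 7, 2, 1, 3]     # assumed page trace
    print(fifo_hits(trace, 3), lru_hits(trace, 3))   # 0.1 vs 0.2

On this trace with three frames, LRU achieves twice the hit ratio of FIFO, which is the kind of comparison the "calculate hit ratio" problems below ask for.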
2/7…TLB
TRANSLATION LOOKASIDE BUFFER
The TLB is a high-speed lookup table which stores the most recently or likely referenced page entries. A page entry consists essentially of a (virtual page number, page frame number) pair. It is hoped that pages belonging to the same working set will be directly translated using the TLB entries.
The use of a TLB and PTs (page tables) for address translation: each virtual address is divided into three fields: the leftmost field holds the virtual page number, the middle field identifies the cache block number, and the rightmost field is the word address within the block. The first step of the translation is to use the virtual page number as a key to search through the TLB for a match. The TLB can be implemented with a special associative memory (content-addressable memory) or with part of the cache memory. In case of a match (a hit) in the TLB, the page frame number is retrieved from the matched page entry; the cache block and word address are copied directly. In case the match cannot be found (a miss) in the TLB, a hashed pointer is used to identify one of the page tables where the desired page frame number can be retrieved.
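A toy model of this lookup path; a dict stands in for the associative memory, the page size and table contents are assumed, and the hashed page-table pointer is reduced to a direct lookup:

    PAGE_BITS = 12                       # assume 4 KB pages
    tlb = {0x00042: 0x1A3}               # virtual page number -> page frame number
    page_table = {0x00042: 0x1A3, 0x00043: 0x2B7}

    def translate(vaddr):
        vpn, offset = vaddr >> PAGE_BITS, vaddr & ((1 << PAGE_BITS) - 1)
        if vpn in tlb:                   # TLB hit: frame number comes straight out
            pfn = tlb[vpn]
        else:                            # TLB miss: consult the page table, refill TLB
            pfn = page_table[vpn]
            tlb[vpn] = pfn
        return (pfn << PAGE_BITS) | offset   # the offset is copied directly

    print(hex(translate(0x43ABC)))       # -> 0x2b7abc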

2/8… SCHEMES USED FOR TRANSLATING VIRTUAL ADDRESSES INTO PHYSICAL ADDRESSES
ADDRESS TRANSLATION MECHANISMS
The process demands the translation of virtual addresses into physical addresses; various schemes for virtual address translation are summarized below. The translation demands the use of translation maps, which can be implemented in various ways. Translation maps are stored in the cache, in associative memory, or in the main memory. To access these maps, a mapping function is applied to the virtual address; this function generates a pointer to the desired translation map. The mapping can be implemented with a hashing or congruence function. Hashing is a simple computer technique for converting a long page number into a short one with fewer bits.
2/9 PAGE REPLACEMENT POLICY CALCULATE HIT RATIO
3/1..ARBITRATION:- In a bus system, arbitration is the process of selecting the next bus master from among the devices competing for the bus: only one potential master can be granted control of the bus at a time, and the arbiter decides which request to serve.

TYPES OF ARBITRATION
1)..CENTRAL ARBITRATION •Potential masters are daisy-chained in a cascade. •A special signal line propagates the bus-grant from the first master (at slot 1) to the last master (at slot n). •All requests share the same bus-request line. •The bus-request signals the rise of the bus-grant level, which in turn raises the bus-busy level.
2)..DISTRIBUTED ARBITRATION:- •Uses arbitration numbers to resolve arbitration competition. •When two or more devices compete for the bus, the winner is the one whose arbitration number is the largest, as determined by parallel contention arbitration. •All potential masters send their arbitration numbers to the shared-bus request/grant (SBRG) lines and compare their own number with the SBRG number. •A priority-based scheme.
3/2..MEMORY INTERLEAVING
1)..LOW-ORDER INTERLEAVING:- •Low-order interleaving spreads contiguous memory locations across the m modules horizontally. •This implies that the low-order a bits of the memory address are used to identify the memory module.
2)..HIGH-ORDER INTERLEAVING •High-order interleaving uses the high-order a bits as the module address and the low-order b bits as the word address within each module.
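A sketch of the two address decompositions; the module count and field widths are assumed for illustration:

    a, b = 3, 5                 # assume m = 2**a = 8 modules, 2**b words per module
    m = 1 << a

    def low_order(addr):        # low-order a bits pick the module
        return addr & (m - 1), addr >> a          # (module, word within module)

    def high_order(addr):       # high-order a bits pick the module
        return addr >> b, addr & ((1 << b) - 1)   # (module, word within module)

    for addr in range(4):       # consecutive addresses...
        print(low_order(addr), high_order(addr))
    # ...fall in different modules under low-order interleaving (allowing
    # overlapped access), but in the same module under high-order interleaving.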
3/3 CACHE MAPPING..
1)..DIRECT MAPPING CACHE:- •Direct mapping of n/m = 2^(s-r) memory blocks to one block frame in the cache. •Placement is by using a modulo-m function: block B_j is mapped to block frame B_i if i = j mod m. •There is a unique block frame B_i that each B_j can load into. •There is no way to implement a block replacement policy. THE ADDRESS IS DIVIDED INTO 3 FIELDS: –The lower w bits specify the word offset within each block. –The upper s bits specify the block address in main memory. –The leftmost (s-r) bits specify the tag to be matched.
•ADVANTAGES –Simple hardware –No associative search –No page replacement policy –Lower cost
•DISADVANTAGES –Rigid mapping –Poorer hit ratio –Prohibits parallel virtual address translation –Use larger cache size with more block frames to avoid contention
2)..FULLY ASSOCIATIVE CACHE Each block in main memory can be placed in any of the available block frames. •Because of this flexibility, an s-bit tag is needed in each cache block. •As s > r, this represents a significant increase in tag length.
•ADVANTAGES: –Offers the most flexibility in mapping cache blocks –Higher hit ratio –Allows a better block replacement policy with reduced block contention
•DISADVANTAGES: –Higher hardware cost –Only moderate cache sizes are practical –Expensive search process
3)..SET-ASSOCIATIVE CACHES
•In a k-way associative cache, the m cache block frames are divided into v = m/k sets, with k blocks per set. •Each set is identified by a d-bit set number, where 2^d = v. •The cache block tags are now reduced to s-d bits. •In practice, the set size k, or associativity, is chosen as 2, 4, 8, 16 or 64, depending on a tradeoff among block size w, cache size m and other performance/cost factors. •The tag is compared with the k tags within the identified set. •Since k is rather small in practice, the k-way associative search is much more economical than full associativity. •In general, a block B_j can be mapped into any one of the available frames B_f in a set S_i, defined as follows:

B_j → B_f ∈ S_i if j mod v = i
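The field extraction for the direct-mapped and set-associative schemes can be sketched as bit manipulation; the field widths below are assumed, not taken from the source:

    w, s, r = 2, 10, 4           # assumed: 2**w words/block, 2**s memory blocks, m = 2**r frames
    d = 2                        # assumed set field: v = 2**d sets (k = m/v ways)

    def direct_mapped(addr):     # tag | block-frame index | word offset
        word  = addr & ((1 << w) - 1)
        frame = (addr >> w) & ((1 << r) - 1)      # i = j mod m
        tag   = addr >> (w + r)                   # leftmost s - r bits
        return tag, frame, word

    def set_associative(addr):   # tag | set index | word offset
        word = addr & ((1 << w) - 1)
        set_ = (addr >> w) & ((1 << d) - 1)       # i = j mod v
        tag  = addr >> (w + d)                    # s - d tag bits
        return tag, set_, word

    print(direct_mapped(0b1101011101))            # (13, 7, 1)
    print(set_associative(0b1101011101))          # (53, 3, 1)

Note how shrinking the index field from r to d bits widens the tag, which is exactly the tag-length tradeoff described above.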

4)..SECTOR MAPPING CACHE
•Partition both the cache and main memory into fixed-size sectors, then use a fully associative search, i.e., each sector can be placed in any of the available sector frames. •Memory requests are destined for blocks, not for sectors. •These can be filtered out by comparing the sector tag in the memory address with all sector tags using a fully associative search. •If a matched sector frame is found (a cache hit), the block field is used to locate the desired block within the sector frame.
ADVANTAGES: •Flexible to implement various block replacement algorithms •Economical to perform a fully associative search over a limited number of sector tags •Sector partitioning offers more freedom in grouping cache lines at both ends of the mapping.
3/4… PIPELINE PROBLEM (1)..

(2)…
4/1..SNOOPY BUS PROTOCOL
Snoopy protocols achieve data consistency among the caches and shared memory through a bus-watching mechanism. In the following diagram, two snoopy bus protocols create different results. Consider three processors (P1, P2, Pn) maintaining consistent copies of block X in their local caches and in the shared-memory module marked X. Using a write-invalidate protocol, the processor P1 modifies (writes) its cache copy from X to X', and all other copies are invalidated via the bus. Invalidated blocks are called dirty, meaning they should not be used. The write-update protocol demands that the new block content X' be broadcast to all cache copies via the bus. The memory copy is also updated if write-through caches are used; with write-back caches, the memory copy is updated later, at block replacement time.
4/2 DIRECTORY-BASED CACHE COHERENCE SCHEME
4/3.. SIMD {OR} TWO MODELS FOR CONSTRUCTING SIMD SUPERCOMPUTERS
SIMD models are differentiated on the basis of the memory distribution and the addressing scheme used. Most SIMD computers use a single control unit and distributed memories, except for a few that use associative memories.
1.DISTRIBUTED MEMORY MODEL: Spatial parallelism is exploited among the PEs. A distributed-memory SIMD computer consists of an array of PEs (each supplied with local memory) which are controlled by the array control unit. Program and data are loaded into the control memory through the host computer and distributed from there to the PEs' local memories. An instruction is sent to the control unit for decoding. If it is a scalar or program control operation, it will be directly executed by a scalar processor attached to the control unit. If the decoded instruction is a vector operation, it will be broadcast to all the PEs for parallel execution.
2.SHARED MEMORY MODEL: An alignment network is used as the inter-PE memory communication network; this network is controlled by the control unit. The alignment network must be properly set to avoid access conflicts. The figure below shows a variation of the SIMD computer using shared memory among the PEs. Most SIMD computers were built with distributed memories.
4/4..PROCESSOR CONSISTENCY MODELS
Processor Consistency Model: The processor consistency model was introduced by Goodman in 1989 and is analogous to the PRAM consistency model, but with one small difference: it adds the restriction of memory coherence. Memory coherence means that all processes should see the write operations to any given memory address in the same order. A consistency model is a set of rules that govern the behavior of a distributed system. It establishes the circumstances in which the system's various parts can communicate with one another and decides how the system will react to modifications or errors. A distributed system's consistency model plays a key role in ensuring the system's consistency and dependability in the face of distributed-computing difficulties, including network delays and partial failures.
4/5… VECTOR ACCESS MEMORY
The flow of vector operands between the main memory and vector registers is usually pipelined with multiple access paths.
1.C-ACCESS MEMORY ORGANIZATION: The m-way low-order interleaved memory structure allows m words to be accessed concurrently and overlapped. The access cycles in different memory modules are staggered. The low-order a bits select the modules, and the high-order b bits select the word within each module, where m = 2^a and a + b = n is the address length.
2.S-ACCESS MEMORY ORGANIZATION: All memory modules are accessed simultaneously in a synchronized manner. The high-order (n-a) bits select the same offset word from each module.
3.C/S-ACCESS MEMORY ORGANIZATION: Here C-access and S-access are combined. n access buses are used, with m interleaved memory modules attached to each bus. The m modules on each bus are m-way interleaved to allow C-access. In each memory cycle, at most m·n words are fetched if the n buses are fully used with pipelined memory accesses.
4/6.. STORE-AND-FORWARD ROUTING & WORMHOLE ROUTING {OR} TWO MESSAGE-PASSING MECHANISMS
I.STORE-AND-FORWARD ROUTING: Packets are the basic unit of information flow in a store-and-forward network. Each node is required to use a packet buffer. A packet is transmitted from a source node to a destination node through a sequence of intermediate nodes. When a packet reaches an intermediate node, it is first stored in the buffer. It is then forwarded to the next node if the desired output channel and a packet buffer in the receiving node are both available.
II.WORMHOLE ROUTING: Packets are subdivided into smaller flits. Flit buffers are used in the hardware routers attached to nodes. The transmission from the source node to the destination node is done through a sequence of routers. All the flits in the same packet are transmitted in order, as inseparable companions, in a pipelined fashion.
4/7..DIFFERENT TYPES OF VECTOR INSTRUCTIONS

There are six types of vector instructions.

1.Vector-vector instructions: One or two vector operands are fetched from the respective vector registers, enter through a functional pipeline unit, and produce a result in another vector register.
F1: V_i → V_j    F2: V_i × V_j → V_k    Examples: V1 = sin(V2), V3 = V1 + V2
2.Vector-scalar instructions: Elements of a vector register are combined with a scalar value.
F3: s × V_i → V_j    Example: V2 = 6 + V1
3.Vector-memory instructions: These correspond to store/load between the vector registers (V) and the memory (M).
F4: M → V (vector load)    F5: V → M (vector store)    Examples: X → V1 (load), V2 → Y (store)
4.Vector reduction instructions: include maximum, minimum, sum, and mean value.
F6: V_i → s    F7: V_i × V_j → s
5.Gather and scatter instructions: Two vector registers are used to gather or scatter vector elements randomly throughout the memory, corresponding to the following mappings:
F8: M → V_i × V_j (gather)    F9: V_i × V_j → M (scatter)
6.Masking instructions: The mask vector is used to compress or to expand a vector to a shorter or longer index vector (bit-per-index correspondence).
F10: V_i × V_m → V_j (V_m is a binary vector)
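Gather and scatter can be sketched with index arrays; NumPy is used here purely for illustration, with assumed memory contents:

    import numpy as np

    M  = np.array([10, 20, 30, 40, 50, 60])   # main memory (assumed contents)
    Vj = np.array([4, 0, 2])                  # index vector (the V_j operand)

    Vi = M[Vj]                                # gather:  V_i <- M[V_j]  => [50 10 30]
    M[Vj] = np.array([-1, -2, -3])            # scatter: V_i -> M[V_j], spread through memory
    print(Vi, M)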
4/7.. ROUTING IN OMEGA NETWORKS
In general, an n-input Omega network has log2(n) stages, labeled from 0 to log2(n) − 1 from the input end to the output end. Data routing is controlled by inspecting the destination code in binary. When the ith high-order bit of the destination code is a 0, a 2 × 2 switch at stage i connects the input to the upper output; otherwise, the input is directed to the lower output.
Two switch settings are shown in Figs. 7.8a and b with respect to the permutations Π1 = (0,7,6,4,2)(1,3)(5) and Π2 = (0,6,4,7,3)(1,5)(2), respectively. The switch settings in Fig. 7.8a are for the implementation of Π1, which maps 0→7, 7→6, 6→4, 4→2, 2→0, 1→3, 3→1, 5→5. Consider the routing of a message from input 001 to output 011. This involves the use of switches A, B, and C. Since the most significant bit of the destination 011 is a "zero," switch A must be set straight so that the input 001 is connected to the upper output (labeled 2). The middle bit in 011 is a "one," thus input 4 to switch B is connected to the lower output with a "crossover" connection. The least significant bit in 011 is a "one," implying a flat connection in switch C. Similarly, the switches A, E, and D are set for routing a message from input 101 to output 101. There exists no conflict in any of the switch settings needed to implement the permutation Π1. For Π2, conflicts in switch settings do exist in three switches, identified as F, G, and H. The conflicts occurring at F are caused by the desired routings 000→110 and 100→111. To resolve the conflicts, one request must be blocked.
5/1.. TOMASULO’S ALGORITHM
•For register renaming, we need a set of program-invisible registers to which the programmable registers are re-mapped; register renaming was an implicit part of the original algorithm. •Tomasulo’s algorithm requires these program-invisible registers to be provided with the reservation stations of the functional units. •Let us assume that the functional units are internally pipelined, and can complete one operation in every clock cycle. Each reservation station entry holds: op, the operation to be carried out by the functional unit; opnd-1 and opnd-2, the two operand values needed for the operation; and t1 and t2, the two source tags associated with the operands.
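A reservation station entry with these fields might be modeled as below; this is a simplified sketch of the tag-matching idea, not the full issue/dispatch machinery:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ReservationStation:
        op: str                      # operation for the functional unit
        opnd1: Optional[float]       # operand value, or None while still pending
        opnd2: Optional[float]
        t1: Optional[int] = None     # source tag: which station will produce opnd1
        t2: Optional[int] = None     # source tag for opnd2

        def capture(self, tag, value):
            """Snoop a broadcast result and fill in any waiting operand."""
            if self.t1 == tag: self.opnd1, self.t1 = value, None
            if self.t2 == tag: self.opnd2, self.t2 = value, None

        def ready(self):
            return self.t1 is None and self.t2 is None

    rs = ReservationStation("MUL", None, 3.0, t1=4)   # waiting on station 4's result
    rs.capture(4, 2.5)                                # result broadcast with tag 4
    print(rs.ready(), rs.opnd1 * rs.opnd2)            # True 7.5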
5/2..OBJECT-ORIENTED MODEL
•Objects are dynamically created and manipulated.
•Processing is performed by sending and receiving messages among objects.
1)..CONCURRENT OOP
•The need for OOP arises from the abstraction and reusability concepts.
•Objects are program entities which encapsulate data and operations in a single unit.
•Objects are manipulated concurrently in OOP.
2)..ACTOR MODEL
•This is a framework for concurrent OOP.
•Actors -> independent components. •Communicate via asynchronous message passing.
•3 primitives -> create, send-to and become.
3)..PARALLELISM IN COOP: 3 common patterns for parallelism:
1) Pipeline concurrency 2) Divide and conquer 3) Cooperative problem solving
5/3…COMPILATION PHASES IN PARALLEL CODE GENERATION
1)..PARAFRASE AND PARAFRASE 2
•Transforms sequential Fortran 77 programs into parallel programs.
•Parafrase consists of about 100 program passes that are encoded and applied in a pass list.
•The pass list identifies dependences and converts the program to concurrent form.
•Parafrase 2 extends this to C and Pascal in addition to Fortran.
2)..PFC AND PARASCOPE
•Translates Fortran 77 to Fortran 90 code.
•The PFC package was extended to PFC+ for parallel code generation on shared-memory multiprocessors.
•PFC performs the analysis in the following steps:
1) Inter-procedural flow analysis
2) Transformation
3) Dependence analysis
4) Vector code generation
5/4… PARALLEL PROGRAMMING MODELS
1.SHARED-VARIABLE MODEL In this programming model, processors are active resources while memory and I/O devices are passive resources. A program is a collection of processes, and parallelism depends on how IPC (interprocess communication) is implemented. The process address space is shared. To ensure orderly IPC, a mutual exclusion property requires that a shared object be accessed by only one process at a time (a minimal sketch follows).
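Mutual exclusion on a shared variable can be sketched with a lock; the counter, thread count, and iteration count are arbitrary illustration choices:

    import threading

    counter = 0                       # the shared object
    lock = threading.Lock()           # enforces one process at a time

    def worker():
        global counter
        for _ in range(100_000):
            with lock:                # critical section: orderly IPC
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)                    # 400000, with no lost updates

Without the lock, concurrent read-modify-write on counter could lose updates, which is exactly the disorderly IPC the mutual exclusion property rules out.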
2.MESSAGE-PASSING MODEL Two processes D and E residing at different processor nodes may communicate with each other by passing messages through a direct network. The messages may be instructions, data, synchronization or interrupt signals, etc. Multicomputers are considered loosely coupled multiprocessors. •No shared memory •No mutual exclusion •Synchronization of sender and receiver processes, just like a telephone call •No buffer used •If one process is ready to communicate and the other is not, the one that is ready must be blocked.
3.DATA-PARALLEL MODEL:- •Used in SIMD computers. Parallelism is handled by hardware synchronization and flow control. •Fortran 90 -> a data-parallel language. •Requires predistributed data sets. Data parallelism: •this technique is used in array processors (SIMD). •Issue -> matching the problem size with the machine size. Array language extensions.
4.OBJECT-ORIENTED MODEL •Objects are dynamically created and manipulated. •Processing is performed by sending and receiving messages among objects. CONCURRENT OOP •The need for OOP arises from the abstraction and reusability concepts. •Objects are program entities which encapsulate data and operations in a single unit. •Objects are manipulated concurrently. PARALLELISM IN COOP: 1) Pipeline concurrency 2) Divide and conquer 3) Cooperative problem solving
5.FUNCTIONAL AND LOGIC MODELS •Functional programming languages -> Lisp, Sisal and Strand 88. Logic programming languages -> Concurrent Prolog and Parlog. Functional programming model: •should not produce any side effects •no concept of storage, assignment and branching •single-assignment and dataflow languages are functional in nature. Logic programming models: •used for knowledge processing from large databases.
5/5.. BRANCH PREDICTION •About 15% to 20% of instructions in a typical program are branch and jump instructions, including procedure returns. •Therefore, if hardware resources are to be fully utilized in a superscalar processor, the processor must start working on instructions beyond a branch, even before the branch instruction itself has completed. This is only possible through some form of branch prediction. •What can be the logical basis for branch prediction? To understand this, consider first the reasoning involved if one wishes to predict the result of a tossed coin: past outcomes are the only evidence available.
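One standard history-based scheme (not detailed in the source, shown here only to make the idea concrete) is a 2-bit saturating counter per branch:

    # 2-bit saturating counter predictor: states 0..3; predict taken when >= 2.
    # A standard textbook scheme, sketched as an assumption, not the source's design.
    table = {}

    def predict(pc):
        return table.get(pc, 1) >= 2          # start in the weakly-not-taken state

    def update(pc, taken):
        c = table.get(pc, 1)
        table[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

    for outcome in [True, True, False, True]:  # a loop branch's outcome history
        print(predict(0x400), outcome)         # prediction vs. actual outcome
        update(0x400, outcome)

The two-bit hysteresis means a single mispredicted iteration (such as a loop exit) does not immediately flip the prediction.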
5/6.. LANGUAGE FEATURES FOR PARALLEL PROGRAMMING
1.Optimization Features
2.Availability Features
3.Synchronization/communication Features
4.Control Of Parallelism
5.Data Parallelism Features
6.Process Management Features
