
Module-III

Chapter 5: Bus, Cache, and Shared Memory

Bus Systems:
• The system bus of a computer operates on a contention basis.
• Several active devices, such as processors, may request use of the bus at the same time.
• Only one of them can be granted access to the bus at a time.
• The effective bandwidth available to each processor is inversely proportional to
the number of processors contending for the bus.
• For this reason, most bus-based commercial multiprocessors have been small
in size.
• The simplicity and low cost of a bus system made it attractive in building
small multiprocessors ranging from 4 to 16 processors.
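
As a rough illustration (a back-of-the-envelope sketch, not a detailed bus model; the 800 MB/s figure is an assumed value), the per-processor share of an ideally arbitrated bus is the total bandwidth divided by the number of contenders:

# Rough estimate of per-processor bandwidth on a shared contention bus.
# Assumes ideal, overhead-free arbitration; all figures are illustrative.
def effective_bandwidth(bus_bw_mb_s: float, n_processors: int) -> float:
    """Per-processor share of a contended bus."""
    return bus_bw_mb_s / n_processors

for n in (4, 8, 16):
    print(n, "processors ->", effective_bandwidth(800.0, n), "MB/s each")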
Backplane Bus Specification
• A backplane bus interconnects processors, data storage, and
peripheral devices in tightly coupled hardware.
• The system bus must be designed to allow communication
between devices on the bus without disturbing the
internal activities of all the devices attached to the bus.
• Timing protocols must be established to arbitrate among multiple
requests. Operational rules must be set to ensure orderly data
transfers on the bus.
• Signal lines on the backplane are often functionally grouped into
several buses. Various functional boards are plugged into slots on
the backplane. Each slot is provided with one or more connectors
for inserting the boards (indicated by the vertical arrows in the backplane figure).
Data Transfer Bus
• Composed of data, address, and control lines
• Address lines are used to broadcast the data address and device address
– The number of address lines is proportional to the log of the address space size
• Data lines proportional to memory word length
• Control lines specify read/write, timing, and bus error conditions
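
As a sizing sketch (the 1 GiB address space and 64-bit word are assumed example values), the address line count grows with the base-2 log of the address space:

import math

# Address lines needed for a byte-addressable 1 GiB space (assumed size);
# data lines match an assumed 64-bit memory word.
address_space_bytes = 2**30
address_lines = math.ceil(math.log2(address_space_bytes))   # 30 lines
data_lines = 64
print(address_lines, "address lines,", data_lines, "data lines")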
Bus Arbitration and Control
• The process of assigning control of the DTB to a requester is called
arbitration.
• The requester is called a master, and the receiving end is called a slave.
• Interrupt lines are used to handle interrupts. Dedicated lines may be used to
synchronize parallel activities among the processor modules.
• Utility lines include signals that provide periodic timing and coordinate
the power-up and power-down sequences of the system.
• The backplane is made of signal lines and connectors.
• A special bus controller board is used to house the backplane control logic,
such as the system clock driver, arbiter, bus timer, and power driver
Functional Modules
• Arbiter: functional module that performs arbitration
• Bus timer: measures time for data transfers
• Interrupter: generates interrupt request and provides status/ID
to interrupt handler
• Location monitor: monitors data transfer
• Power monitor: monitors power source
• System clock driver: provides clock timing signal on the
utility bus
• Board interface logic: matches signal line impedance, propagation
time, and termination values
Physical Limitations
• Electrical, mechanical, and packaging
limitations restrict # of boards
• Can mount multiple backplane buses on the
same backplane chassis
• Difficult to scale a bus system due to
packaging constraints
Addressing and Timing Protocols
• Two types of printed circuit boards connected to a bus: active and passive
• Active devices like processors can act as bus masters or as slaves at
different times.
• Passive devices like memories can act only as slaves
• The master can initiate a bus cycle
– Only one can be in control at a time
• The slaves respond to requests by a master
– Multiple slaves can respond
Bus Addressing
• The backplane bus is driven by a digital clock with a fixed cycle time: bus
cycle
• The backplane has a limited physical size, so it will not skew information
• Factors affecting bus delay:
– The source's line drivers, the destination's receivers, slot capacitance,
line length, and bus loading effects
• The design should minimize overhead time, so that most bus cycles are used for
useful operations
• Identify each board with a slot number
• When slot # matches contents of high-order address lines, the board
is selected as a slave (slot addressing)
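
A minimal sketch of slot addressing (the 16-bit address and 4-bit slot field are assumed widths, not a real bus standard): each board compares the high-order address bits against its own slot number to decide whether it is the selected slave.

# Slot addressing sketch: high-order address bits carry the slot number.
# The 16-bit address with a 4-bit slot field is an assumed layout.
ADDR_BITS = 16
SLOT_BITS = 4
OFFSET_BITS = ADDR_BITS - SLOT_BITS

def is_selected(address: int, my_slot: int) -> bool:
    """True if the high-order bits of the address match this board's slot."""
    return (address >> OFFSET_BITS) == my_slot

print(is_selected(0x3ABC, 0x3))   # True: the slot field is 0x3
print(is_selected(0x3ABC, 0x4))   # False: addressed to another slot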
Broadcall and Broadcast
• Broadcall is a read operation in which multiple slaves place
their data on the bus lines.
• Special AND or OR operations are performed on the bus over
the data from the selected slaves.
• The broadcall operation is used to detect multiple interrupt sources.
• A broadcast is a write operation involving multiple slaves.
This operation is essential in implementing multicache
coherence on the bus.
• Timing protocols are needed to synchronize master and slave
operations.
Synchronous Timing
• All bus transaction steps take place at fixed clock edges as shown
in Figure
• The clock signals are broadcast to all potential masters and slaves.
• Clock cycle time determined by slowest device on bus
• Simple, requires less circuitry, and is suitable for devices with
relatively similar speeds.
Steps:
1. The data should first be stabilized on the data lines
2. The master uses a data-ready pulse to initiate the data transfer
3. The slave uses a data-accept pulse to signal completion of
the information transfer
Advantages:
1. simple to control
2. requires less control circuitry
3. cost is less
Disadvantages:
1. Only suitable for connecting devices with relatively similar speeds;
otherwise, the slowest device will slow down the entire bus operation
Asynchronous Timing

• Based on handshaking or interlocking mechanism as shown in Figure


• No fixed clock cycle is needed.
• The rising edge (1) of the data-ready signal from the master triggers
the rising edge (2) of the data-accept signal from the slave.
• The second signal triggers the falling edge (3) of the data-ready signal and
the removal of data from the bus.
• The third signal triggers the trailing edge (4) of the data-accept signal.
• This four-edge handshaking (interlocking) process is repeated until
all the data is transferred.
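
A minimal trace of the four-edge handshake (a sequential sketch; real hardware interlocks signal transitions rather than calling functions):

# Four-edge (interlocked) handshake sketch for transferring one word.
# Signals are modeled as booleans; the numbered edges match the text above.
def transfer_word(word, bus):
    bus["data"] = word
    bus["data_ready"] = True         # edge 1: master raises data-ready
    received = bus["data"]           # slave latches the data
    bus["data_accept"] = True        # edge 2: slave raises data-accept
    bus["data_ready"] = False        # edge 3: master drops data-ready
    bus["data"] = None               #         and removes data from the bus
    bus["data_accept"] = False       # edge 4: slave drops data-accept
    return received

bus = {"data": None, "data_ready": False, "data_accept": False}
for w in (0xDE, 0xAD, 0xBE, 0xEF):  # repeated until all data is transferred
    print(hex(transfer_word(w, bus)))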


• Advantages: Provides freedom of variable length clock signals for
different speed devices
– No response time restrictions
– More flexible
• Disadvantage: More complex and costly
Arbitration
• Process of selecting next bus master
• Bus tenure is duration of control
• Arbitrate on a fairness or priority basis
• Arbitration competition and bus transactions take place
concurrently on a parallel bus over separate lines

• Types of arbitration:
– Central arbitration
– Distributed arbitration
• Centralized bus arbitration:
– Each master can send a request
– All requests share the same bus-request line
– Allocation is based on priority
– Advantages: simplicity; additional devices can be added
– Disadvantages: fixed priority; slowness
Types of Central Arbiter

• Daisy-chaining technique is a hybrid of central and distributed arbitration.


• In this technique, all devices that can request the bus are attached serially.
• The central arbiter issues the grant signal to the closest device requesting it.
• Devices request the bus by passing a signal to their neighbours that are
closer to the central arbiter.
• If a closer device also requests the bus, the request from the more
distant device is blocked.
• Daisy-chaining is a low-cost technique, but it is susceptible to faults in the chain.
• It may lead to starvation of distant devices if high-priority (closer) devices
frequently request the bus.
• Simple scheme
• Easy to add devices
• Propagation of the bus-grant signal is slow
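
A sketch of daisy-chained grant propagation (devices modeled as a list ordered by distance from the arbiter): the grant stops at the first requesting device, which is exactly why distant devices can starve.

# Daisy-chain arbitration sketch: the bus-grant signal ripples away from
# the central arbiter and is absorbed by the closest requesting device.
def daisy_chain_grant(requests):
    """requests[i] is True if device i wants the bus; device 0 sits
    closest to the central arbiter. Returns the winner, or None."""
    for device, wants_bus in enumerate(requests):
        if wants_bus:
            return device     # closer device absorbs the grant
    return None               # no requester; the grant goes unused

print(daisy_chain_grant([False, True, False, True]))  # -> 1; device 3 is blocked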
Independent Requests and Grants

• Provide independent bus-request and grant signals for each master
• Requires a central arbiter, but can use a priority-based or fairness-based
policy
• More flexible and faster than a daisy-chained policy
• Requires a larger number of lines – costly
Distributed Arbitration

• Each master has its own arbiter and a unique arbitration number, as shown in the figure.
• The arbitration number is used to resolve arbitration competition.
• When two or more devices compete for the bus, the winner is the one whose
arbitration number is the largest; this is determined by parallel contention arbitration.
• All potential masters send their arbitration numbers to the shared-bus request/grant
(SBRG) lines, and each compares its own number with the SBRG number.
• If the SBRG number is greater, the requester is dismissed. At the end, the winner's
arbitration number remains on the arbitration bus. After the current bus transaction
is completed, the winner seizes control of the bus.
• This is a priority-based scheme.
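
A sketch of parallel contention arbitration (the wired-OR behavior of the SBRG lines is modeled bit-serially, which is an assumption about the line logic, not a specific bus standard):

# Parallel contention arbitration sketch. Contenders drive their
# arbitration numbers onto shared wired-OR lines; a contender that sees
# a larger number than its own is dismissed, bit by bit from the top.
def contention_winner(arb_numbers, width=8):
    contenders = set(arb_numbers)
    lines = 0                             # SBRG lines modeled as an integer
    for bit in range(width - 1, -1, -1):  # most significant bit first
        mask = 1 << bit
        if any(n & mask for n in contenders):
            lines |= mask
            # anyone driving 0 where the line reads 1 drops out
            contenders = {n for n in contenders if n & mask}
    return max(contenders)                # winner's number remains on the bus

print(contention_winner([0b0101, 0b0110, 0b0011]))   # -> 6 (0b0110 wins)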
Transfer Modes

– Address-only transfer: no data
– Compelled-data transfer: an address transfer followed by a block of
one or more data transfers to one or more contiguous addresses
– Packet-data transfer: an address transfer followed by a fixed-length
block of data transfers from a set of contiguous addresses
– Connected: carries out a master's request and a slave's response in a
single bus transaction
– Split: splits the request and response into separate transactions
• Split transactions allow devices with long latency or access time to use bus
resources more efficiently
• A complete split transaction may require two or more connected bus transactions

Cache Addressing Models

• Most multiprocessor systems use private caches for each processor, as shown in Fig. 5.6
• An interconnection network sits between the caches and main memory
• Caches can be addressed using either a physical address or a virtual address
• Two different cache design models are:
– Physical address cache
– Virtual address cache
Physical Address Cache
• When the cache is addressed by the physical address, it is called a physical address cache.
The cache is indexed and tagged with the physical address.
• Cache lookup must occur after address translation in the TLB or MMU. No aliasing is allowed,
so the address is always uniquely translated without confusion.
• After cache miss, load a block from main memory
• Use either write-back or write-through policy

• Advantages:
• No cache flushing on a context switch
• No aliasing problem thus fewer cache bugs in OS kernel.
• Simple design
• Requires little intervention from OS kernel

• Disadvantages:
• Slowdown in accessing the cache until the MMU/TLB finishes translating the
address
Virtual Address Caches
• When a cache is indexed or tagged with virtual address it is called virtual address
cache.
• In this model both cache and MMU translation or validation are done in parallel.
• Physical address saved in tags for write back
• More efficient access to cache
Advantages:
– do address translation only on a cache miss
– faster for hits because no address translation
– More efficient access to cache
Disadvantages:
• Cache flushing is required on a context switch (for example, without flushing,
local data segments would get erroneous hits on virtual addresses already cached
before the virtual address space changed).
• Aliasing problem: different logically addressed data can have the same index/tag in the
cache.
• Confusion can arise if two or more processors access the same physical cache location.
• Remedy: apply special tagging with a process key or with a physical address.
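
A toy sketch of the aliasing hazard in a virtually indexed cache (the page size, page table, and cache geometry are all invented for illustration): two virtual addresses that map to the same physical location land in different cache lines, so a write through one alias is invisible through the other.

# Aliasing sketch: two virtual pages map to one physical frame, but a
# virtually indexed cache treats the two aliases as unrelated lines.
PAGE = 4096
page_table = {0x10: 0x2, 0x23: 0x2}       # two virtual pages -> same frame

def phys(vaddr):
    """Translate a virtual address through the toy page table."""
    return page_table[vaddr // PAGE] * PAGE + vaddr % PAGE

cache = {}                                # virtual index -> cached value

def vcache_write(vaddr, value):
    cache[vaddr % (8 * PAGE)] = value     # indexed by virtual address bits

a1, a2 = 0x10 * PAGE + 0x40, 0x23 * PAGE + 0x40
print(phys(a1) == phys(a2))               # True: same physical location
vcache_write(a1, "new")
print(cache.get(a2 % (8 * PAGE)))          # None: the alias misses (stale data)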
Block Placement Schemes
Direct Mapping Cache
• Direct mapping of n/m memory blocks to one block frame in the cache
• Placement is by using a modulo-m function: Bj → Bi if i = j mod m, where
– i = cache line number
– j = main memory block number
– m = number of lines in the cache
• That is, we divide the memory block number by the number of cache lines,
and the remainder is the cache line address
• There is a unique block frame Bi into which each Bj can be loaded
• Simplest organization to implement
Advantages
• Simple hardware
• No associative search
• No block replacement policy needed
• Lower cost
• Higher speed
Disadvantages
• Rigid mapping
• Poorer hit ratio
• Prohibits parallel virtual address translation
• Use larger cache size with more block frames to avoid contention
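
A minimal direct-mapped lookup sketch built on the i = j mod m rule (the eight-frame geometry is an assumed example):

# Direct-mapped cache sketch: memory block j may live only in frame j mod m.
M_FRAMES = 8
frames = [None] * M_FRAMES                # each frame holds a tag (j // m)

def access(block_j):
    i = block_j % M_FRAMES                # the single frame block j maps to
    tag = block_j // M_FRAMES
    hit = frames[i] == tag
    if not hit:
        frames[i] = tag                   # rigid mapping: evict unconditionally
    return hit

print([access(j) for j in (3, 11, 3)])    # [False, False, False]: 3 and 11 collide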
Fully Associative Cache
• Each block in main memory can be placed in any of the available block frames
• An s-bit tag is needed in each cache block (s > r, where r is the number of
block-frame address bits in a direct-mapped cache)
• An m-way associative search requires the tag to be compared with all cache
block tags
• Use an associative memory to achieve a parallel comparison with all tags
concurrently
• Advantages:
– Offers most flexibility in mapping cache blocks
– Higher hit ratio
– Allows better block replacement policy with reduced block contention
• Disadvantages:
– Higher hardware cost
– Only moderate size cache
– Expensive search process
Set Associative Caches

• A compromise between fully associative and direct-mapped caches:
– The cache is divided into a number of sets
– Each set contains a number of lines
– A given block maps to any line in a specific set
• Use direct mapping to determine which set in the cache corresponds to a
given memory block; the block could then be in any line of that set
• To compute cache set number:
– SetNum = j mod v
• j = main memory block number
• v = number of sets in cache
Example: a two-way set-associative cache with v = 2 sets and two slots per set.
Block j maps to set j mod 2, so blocks 0, 2, and 4 contend for the slots of set 0,
while blocks 1, 3, and 5 contend for the slots of set 1.
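
A sketch of set-associative placement using SetNum = j mod v (the two-set, two-way geometry matches the example above; LRU replacement within a set is one assumed choice among several possible policies):

# Set-associative sketch: block j maps to set j mod v and may occupy any
# of the k slots in that set; within a set, replacement here is LRU.
from collections import OrderedDict

V_SETS, K_WAYS = 2, 2
sets = [OrderedDict() for _ in range(V_SETS)]   # block tag -> None, LRU order

def access(block_j):
    s = sets[block_j % V_SETS]                  # direct-mapped set selection
    hit = block_j in s
    if hit:
        s.move_to_end(block_j)                  # refresh LRU position
    else:
        if len(s) == K_WAYS:
            s.popitem(last=False)               # evict least recently used
        s[block_j] = None                       # associative placement in set
    return hit

print([access(j) for j in (0, 2, 4, 2)])        # [False, False, False, True]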
Sector Mapping Cache
• Partition cache and main memory into fixed size sectors then use fully
associative search
• Use sector tags for search and block fields within sector to find block
• Only the missing block is loaded on a miss
• The ith block in a sector is placed into the ith block frame in the destined
sector frame
• A valid/invalid bit is attached to each block frame
Cache Performance Issues
• Cycle count: number of machine cycles needed for cache
access, update, and coherence
• Hit ratio: how effectively the cache can reduce
the overall memory access time
• Program trace driven simulation: present
snapshots of program behavior and cache
responses
• Analytical modeling: provide insight into the
underlying processes
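
Hit ratio and cycle count combine into an effective access time; a standard first-order estimate (the cycle figures below are illustrative, not measurements) is t_eff = h·t_cache + (1 − h)·t_mem:

# Effective memory access time from the cache hit ratio (first-order model).
def effective_access_time(hit_ratio, t_cache=1.0, t_memory=50.0):
    """Average cycles per access; timing parameters are assumed values."""
    return hit_ratio * t_cache + (1.0 - hit_ratio) * t_memory

for h in (0.90, 0.95, 0.99):
    print(f"hit ratio {h:.2f} -> {effective_access_time(h):.2f} cycles")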
Shared Memory Organizations

• Memory interleaving provides a higher bandwidth for pipelined access of
contiguous memory locations.
• Methods for allocating and deallocating main memory to multiple user
programs are considered for optimizing memory utilization.
Interleaved Memory Organization
• Goal is to close the speed gap b/t CPU/cache and main memory access
• Provides higher b/w for pipelined access of contiguous memory locations
• The main memory is built with multiple modules.
• These memory modules are connected to a system bus or a switching
network to which other resources are also connected.
• Once presented with a memory address, each memory module returns with
one word per cycle.
• It is possible to present different addresses to different memory modules so
that parallel access of multiple words can be done simultaneously or in a
pipelined fashion.
• Can present different addresses to different modules for parallel/pipelined
access
• m = 2^a modules, each with w = 2^b words
• Varying linear address assignments are possible
• Supports both random and block accesses
Addressing Formats
• Low-order interleaving: spread contiguous locations across modules
horizontally
– Lower a bits identify module, b for word
– Supports block access in pipeline fashion
• High-order: contiguous locations within same module
– Higher a bits identify module, b for word
– Cannot support block access
Example: sixteen words stored in memory. Accessing them sequentially would take
a long time, but with memory interleaving the modules can be accessed in parallel.
Storing the sixteen words requires 2^4 = 16 addresses, i.e., a 4-bit address, split
here into 2 module-select bits and 2 word-select bits (a = b = 2).

Low-order interleaving (the low 2 address bits select the module):

        mod1  mod2  mod3  mod4
         1     2     3     4
         5     6     7     8
         9    10    11    12
        13    14    15    16


High-order interleaving (the high 2 address bits select the module):

        mod1  mod2  mod3  mod4
         1     5     9    13
         2     6    10    14
         3     7    11    15
         4     8    12    16
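
A sketch of the two address decompositions, using the a = b = 2 field widths from the example above:

# Address decomposition for a = 2 module-select bits and b = 2 word bits.
A_BITS, B_BITS = 2, 2

def low_order(addr):
    """Low-order interleaving: the low a bits pick the module."""
    return addr & ((1 << A_BITS) - 1), addr >> A_BITS    # (module, word)

def high_order(addr):
    """High-order interleaving: the high a bits pick the module."""
    return addr >> B_BITS, addr & ((1 << B_BITS) - 1)    # (module, word)

print(low_order(6))    # address 0110 -> module 2, word 1
print(high_order(6))   # address 0110 -> module 1, word 2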


Pipelined Memory Access
• Access of the m memory modules can be overlapped in a pipelined fashion.
• For this purpose, the memory cycle (called the major cycle) is subdivided into m
minor cycles.
• An eight-way interleaved memory (with m = 8 and w = 8, and thus a = b = 3) is
shown in Fig. 5.16a.
• Let q be the major cycle and t the minor cycle. These two cycle times are related
as follows: t = q/m, where
– m = degree of interleaving
– q = total time to complete the access of one word (major cycle)
– t = actual time needed to produce one word (minor cycle)
• The total block access time is 2q, while the effective access time of each word
is reduced to t.
• The timing of the pipelined access of the 8 contiguous memory words is shown in
Figure.
• This type of concurrent access of contiguous words has been called a C-access
memory scheme
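
A timing sketch of C-access for the eight-way example (m = 8; the 100 ns major cycle is an assumed figure):

# C-access timing sketch for an eight-way interleaved memory (m = 8).
m = 8
q = 100.0            # major cycle: one full module access, in ns (assumed)
t = q / m            # minor cycle: interval between successive word accesses

# Word i is presented to its module at time i*t and completes q later.
finish = [i * t + q for i in range(m)]
print("minor cycle t =", t, "ns")
print("word completion times:", finish)
print("block access time ~", finish[-1], "ns, close to 2q =", 2 * q, "ns")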
Fault Tolerance
• High- and low-order interleaving can be combined to yield many
different interleaved memory organizations.
• Sequential addresses are assigned within each memory module in the
high-order interleaved memory.
• This makes it easier to isolate faulty memory modules in a
memory bank of m memory modules.
• When one module failure is detected, the remaining modules can
still be used by opening a window in the address space.
• This fault isolation cannot be carried out in a low-order
interleaved memory, in which a module failure may paralyze the
entire memory bank.
• Thus a low-order interleaved memory is not fault-tolerant.
Memory Allocation Schemes
• Virtual memory allows many s/w processes time-shared use of main
memory
• Memory manager handles the swapping
• It monitors amount of available main memory and decides which processes
should reside and which to remove.
Allocation Policies
• Memory swapping: process of moving blocks of data between
memory levels
• Nonpreemptive allocation: if memory is full, swaps out some of the allocated
processes; easier to implement, but less efficient
• Preemptive allocation: has the freedom to preempt an executing process;
more complex, expensive, and flexible
• Local allocation: considers only the resident working set of the
faulting process; used by most computers
• Global allocation: considers the history of the working sets of all resident
processes in making a swapping decision
Swapping Systems
• Allow swapping only at entire process level
• Swap device: a configurable section of a disk set aside for temporary storage
of swapped data
• Swap space: portion of disk set aside
• Depending on system, may swap entire processes only, or the necessary pages
Swapping in UNIX
• System calls that result in a swap:
– Allocation of space for child process being created
– Increase in size of a process address space
– Increased space demand by stack for a process
– Demand for space by a returning process swapped out previously
• Special process 0 is the swapper
Demand Paging Systems
• Allows only pages to be transferred b/t main memory and swap device
• Pages are brought in only on demand
• Allows process address space to be larger than physical address space
• Offers flexibility to dynamically accommodate large # of processes in
physical memory on time-sharing basis
Sequential and Weak Consistency Models
• Memory inconsistency: when memory access order differs from program
execution order
• Sequential consistency: memory accesses (I and D) consistent with
program execution order
Memory Consistency Issues
• Memory model: behavior of a shared memory system as observed
by processors
• Choosing a memory model – compromise between a strong
model minimally restricting s/w and a weak model offering
efficient implementation
• Primitive memory operations: load, store, swap
Event Orderings
• Event ordering helps determine if a memory
event is legal for concurrent accesses
• Program order: the order in which memory
accesses occur during the execution of a single process,
without any reordering
• Consistency models specify the order by which
events from one process should be observed
by another
Primitive Memory Operations
• A load by Pi is complete with respect to Pk when the issue of a store
to the same location by Pk does not affect the value
returned by the load
• A store by Pi is complete with respect to Pk when a load issued
to the same address by Pk returns the value defined by this
store
• A load is globally performed if it is performed with respect to
all processors and if the store that is the source of
the returned value has been performed with respect to all processors
Difficulty in Maintaining Correctness on an MIMD

• If there is no synchronization among instruction streams, then a large number
of different instruction interleavings is possible
• The execution order could change, leading to even more possibilities
• If accesses are not atomic, then different processors can
observe different interleavings
Atomicity
• Categories of memory behavior:
– Program order preserved and uniform observation sequence
by all processors
– Out of program order allowed and uniform observation
sequence by all processors
– Out of program order allowed and nonuniform sequences
observed by different processors
• Atomic memory accesses: memory updates are known to all
processors at the same time
• Non-atomic: having individual program orders that conform is
not a sufficient condition for sequential consistency
– Multiprocessor cannot be strongly ordered
Sequential Consistency
• Sufficient conditions:
– Before a load is allowed to perform with respect to any other processor, all
previous loads must be globally performed and all previous stores must
be performed with respect to all processors
– Before a store is allowed to perform with respect to any other processor, all
previous loads must be globally performed and all previous stores must
be performed with respect to all processors
Sequential Consistency Axioms
• A load always returns the value written by the latest store
to the same location
• The memory order conforms to a total binary order in
which shared memory is accessed in real time over all
loads/stores
• If two operations appear in a particular program order, they appear
in the same memory order
• The swap operation is atomic with respect to stores; no other store
can intervene between the load and store parts of a swap
• All stores and swaps must eventually terminate
Implementation Considerations
• A single-port memory services one operation at a time
• The order in which the (conceptual) switch connecting processors to the
shared memory is thrown determines the global order of memory access operations
• Strong ordering preserves the program order in all
processors
• The sequential consistency model leads to poor memory
performance due to the imposed strong ordering of
memory events
Weak Consistency Models
• Multiprocessor model may range from strong (sequential)
consistency to various degrees of weak consistency
• Two models are considered:
– The DSB (Dubois, Scheurich, and Briggs) model
– The TSO (total store order) model
DSB Model

Dubois, Scheurich, and Briggs (DSB) derived a weak consistency model by relating memory
request ordering to synchronization points in the program. We call this
the DSB model, specified by the following three conditions:
• All previous synchronization accesses must be performed before a
load or a store access is allowed to perform with respect to any other processor.
• All previous load and store accesses must be performed before a
synchronization access is allowed to perform with respect to any other
processor.
• Synchronization accesses are sequentially consistent with respect to one
another.
TSO Model

The TSO (total store order) weak consistency model is specified by six behavioral axioms:
• A load returns the latest store result.
• The memory order is a total binary relation over all pairs of store
operations.
• If two stores appear in a particular program order, then they must
also appear in the same memory order.
• If a memory operation follows a load in program order, then it
must also follow the load in memory order.
• A swap operation is atomic with respect to other stores – no other
store can intervene between the load and store parts of a swap.
• All stores and swaps must eventually terminate.
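
The classic store-buffer litmus test separates the two models. The brute-force enumeration below (a sketch, not a formal memory-model checker) shows that sequential consistency never yields r1 = r2 = 0, whereas TSO, by letting a load bypass an earlier buffered store to a different location, additionally permits that outcome:

# Store-buffer litmus test. P1: store X=1; load Y into r1.
#                           P2: store Y=1; load X into r2.
# Enumerate every sequentially consistent interleaving by brute force.
from itertools import permutations

t1 = [("store", "X"), ("load", "Y", "r1")]
t2 = [("store", "Y"), ("load", "X", "r2")]

outcomes = set()
for order in set(permutations([0, 0, 1, 1])):    # which thread issues next
    mem, regs, idx = {"X": 0, "Y": 0}, {}, [0, 0]
    for tid in order:
        op = (t1, t2)[tid][idx[tid]]
        idx[tid] += 1
        if op[0] == "store":
            mem[op[1]] = 1
        else:
            regs[op[2]] = mem[op[1]]
    outcomes.add((regs["r1"], regs["r2"]))

print(sorted(outcomes))   # (0, 0) never appears under sequential consistency
# Under TSO each load may bypass the earlier buffered store, so the
# additional outcome (r1, r2) = (0, 0) becomes observable.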
