Unit 4


Memory System

Memory Hierarchy

A memory unit is an essential component in any digital computer since it is needed for storing
programs and data.

Typically, a memory unit can be classified into two categories:

1. The memory unit that establishes direct communication with the CPU is called Main
Memory. The main memory is often referred to as RAM (Random Access Memory).
2. The memory units that provide backup storage are called Auxiliary Memory. For instance,
magnetic disks and magnetic tapes are the most commonly used auxiliary memories.

Apart from the basic classifications of a memory unit, the memory hierarchy consists of all of the
storage devices available in a computer system, ranging from the slow but high-capacity auxiliary
memory to the relatively faster main memory.

The following image illustrates the components in a typical memory hierarchy.

Auxiliary Memory

Auxiliary memory is known as the lowest-cost, highest-capacity and slowest-access storage in a
computer system. Auxiliary memory provides storage for programs and data that are kept for
long-term storage or when not in immediate use. The most common examples of auxiliary memories are
magnetic tapes and magnetic disks.
A magnetic disk is a digital computer memory that uses a magnetization process to write, rewrite and
access data. For example, hard drives, zip disks, and floppy disks.

Magnetic tape is a storage medium that allows for data archiving, collection, and backup for different
kinds of data.

Main Memory

The main memory in a computer system is often referred to as Random Access Memory (RAM).
This memory unit communicates directly with the CPU and with auxiliary memory devices through
an I/O processor.

The programs that are not currently required in the main memory are transferred into auxiliary
memory to provide space for currently used programs and data.

I/O Processor

The primary function of an I/O Processor is to manage the data transfers between auxiliary memories
and the main memory.

Cache Memory

The data or contents of the main memory that are used frequently by the CPU are stored in the cache
memory so that the processor can access that data in a shorter time. Whenever the CPU needs to
access memory, it first checks the cache memory for the required data. If the data is found in the
cache memory, it is read from that fast memory. Otherwise, the CPU moves on to the main memory for
the required data.

Main Memory

The main memory acts as the central storage unit in a computer system. It is a relatively large and
fast memory which is used to store programs and data during run-time operations.

The primary technology used for the main memory is based on semiconductor integrated circuits.
The integrated circuits for the main memory are classified into two major units.

1. RAM (Random Access Memory) integrated circuit chips


2. ROM (Read Only Memory) integrated circuit chips

RAM integrated circuit chips

The RAM integrated circuit chips are further classified into two possible operating
modes, static and dynamic.

A static RAM is composed primarily of flip-flops that store the binary information. The stored
information is volatile, i.e. it remains valid only as long as power is applied to the system.
Static RAM is easy to use and takes less time to perform read and write operations than
dynamic RAM.
A dynamic RAM stores the binary information in the form of electric charges on capacitors. The
capacitors are provided inside the chip by MOS transistors. Dynamic RAM consumes less power and
provides a larger storage capacity in a single memory chip.

RAM chips are available in a variety of sizes and are used as per the system requirement. The
following block diagram demonstrates the chip interconnection in a 128 * 8 RAM chip.

o A 128 * 8 RAM chip has a memory capacity of 128 words of eight bits (one byte) per word.
This requires a 7-bit address and an 8-bit bidirectional data bus.
o The 8-bit bidirectional data bus allows the transfer of data either from memory to CPU during
a read operation or from CPU to memory during a write operation.
o The read and write inputs specify the memory operation, and the two chip select (CS)
control inputs are for enabling the chip only when the microprocessor selects it.
o The bidirectional data bus is constructed using three-state buffers.
o The output generated by three-state buffers can be placed in one of the three possible states
which include a signal equivalent to logic 1, a signal equal to logic 0, or a high-impedance
state.
o The following function table specifies the operations of a 128 * 8 RAM chip.
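(A typical function table for such a chip, as given in standard texts, is:)

CS1  CS2  RD  WR  Memory function   State of data bus
0    0    x   x   Inhibit           High-impedance
0    1    x   x   Inhibit           High-impedance
1    0    0   0   Inhibit           High-impedance
1    0    0   1   Write             Input data to RAM
1    0    1   x   Read              Output data from RAM
1    1    x   x   Inhibit           High-impedance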

o From the functional table, we can conclude that the unit is in operation only when CS1 = 1
and CS2 = 0. The bar on top of the second select variable indicates that this input is enabled
when it is equal to 0.

ROM integrated circuit

The primary component of the main memory is RAM integrated circuit chips, but a portion of
memory may be constructed with ROM chips.

A ROM memory is used for keeping programs and data that are permanently resident in the
computer.

Apart from the permanent storage of data, the ROM portion of main memory is needed for storing an
initial program called a bootstrap loader. The primary function of the bootstrap loader program is
to start the computer's operating system software when power is turned on.

ROM chips are also available in a variety of sizes and are also used as per the system requirement.
The following block diagram demonstrates the chip interconnection in a 512 * 8 ROM chip.

o A ROM chip has a similar organization to a RAM chip. However, a ROM can only perform the read
operation; the data bus can only operate in an output mode.
o The 9-bit address lines in the ROM chip specify any one of the 512 bytes stored in it.
o The value for chip select 1 and chip select 2 must be 1 and 0 for the unit to operate.
Otherwise, the data bus is said to be in a high-impedance state.

Auxiliary Memory

An auxiliary memory is known as the lowest-cost, highest-capacity and slowest-access storage in a
computer system. It is where programs and data are kept for long-term storage or when not in
immediate use. The most common examples of auxiliary memories are magnetic tapes and magnetic
disks.

Magnetic Disks

A magnetic disk is a type of memory constructed using a circular plate of metal or plastic coated
with magnetized material. Usually, both sides of the disk are used to carry out read/write
operations. Several disks may be stacked on one spindle, with a read/write head available for each
surface.

The following image shows the structural representation for a magnetic disk.

o The memory bits are stored in the magnetized surface in spots along the concentric circles
called tracks.
o The concentric circles (tracks) are commonly divided into sections called sectors.
Magnetic Tape

Magnetic tape is a storage medium that allows data archiving, collection, and backup for different
kinds of data. The magnetic tape is constructed using a plastic strip coated with a magnetic recording
medium.

The bits are recorded as magnetic spots on the tape along several tracks. Usually, seven or nine bits
are recorded simultaneously to form a character together with a parity bit.

Magnetic tape units can be halted, started to move forward or in reverse, or can be rewound.
However, they cannot be started or stopped fast enough between individual characters. For this
reason, information is recorded in blocks referred to as records.

Associative Memory

An associative memory can be considered as a memory unit whose stored data can be identified for
access by the content of the data itself rather than by an address or memory location.

Associative memory is often referred to as Content Addressable Memory (CAM).

When a write operation is performed on associative memory, no address or memory location is given
to the word. The memory itself is capable of finding an empty unused location to store the word.

On the other hand, when the word is to be read from an associative memory, the content of the word,
or part of the word, is specified. The words which match the specified content are located by the
memory and are marked for reading.

The following diagram shows the block representation of an Associative memory.


From the block diagram, we can say that an associative memory consists of a memory array and
logic for 'm' words with 'n' bits per word.

The functional registers like the argument register A and key register K each have n bits, one for
each bit of a word. The match register M consists of m bits, one for each memory word.

The words which are kept in the memory are compared in parallel with the content of the argument
register.

The key register (K) provides a mask for choosing a particular field or key in the argument word. If
the key register contains a binary value of all 1's, then the entire argument is compared with each
memory word. Otherwise, only those bits in the argument that have 1's in their corresponding
position of the key register are compared. Thus, the key provides a mask for identifying a piece of
information which specifies how the reference to memory is made.

The following diagram can represent the relation between the memory array and the external
registers in an associative memory.
The cells present inside the memory array are marked by the letter C with two subscripts. The first
subscript gives the word number and the second specifies the bit position in the word. For instance,
the cell Cij is the cell for bit j in word i.

A bit Aj in the argument register is compared with all the bits in column j of the array provided that
Kj = 1. This process is done for all columns j = 1, 2, 3......, n.

If a match occurs between all the unmasked bits of the argument and the bits in word i, the
corresponding bit Mi in the match register is set to 1. If one or more unmasked bits of the argument
and the word do not match, Mi is cleared to 0.
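As an illustration, here is a minimal C sketch of this parallel match operation, simulating the memory array as an array of bytes; the names and sizes are illustrative, not part of any standard CAM interface.

    #include <stdio.h>

    #define M 4 /* number of words in the memory array */

    /* Simulated CAM match: match[i] is set to 1 when every unmasked bit of
       the argument A (mask given by key K) equals the corresponding bit of
       word i, mirroring the Mi match-register bits described above. */
    void cam_match(const unsigned char word[M], unsigned char A,
                   unsigned char K, int match[M]) {
        for (int i = 0; i < M; i++) {
            /* XOR exposes differing bits; AND with K keeps only unmasked ones */
            match[i] = ((word[i] ^ A) & K) == 0;
        }
    }

    int main(void) {
        unsigned char word[M] = { 0xA5, 0x3C, 0xA0, 0xFF };
        unsigned char A = 0xA7; /* argument register */
        unsigned char K = 0xF0; /* key register: compare the high nibble only */
        int match[M];
        cam_match(word, A, K, match);
        for (int i = 0; i < M; i++)
            printf("word %d: match = %d\n", i, match[i]); /* words 0 and 2 match */
        return 0;
    }

With K set to all 1's (0xFF), the entire argument would be compared with each memory word, as described above.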

Cache Memory

The data or contents of the main memory that are used frequently by the CPU are stored in the cache
memory so that the processor can access that data in a shorter time. Whenever the CPU needs
to access memory, it first checks the cache memory. If the data is not found in the cache memory,
the CPU moves on to the main memory.

Cache memory is placed between the CPU and the main memory. The block diagram for a cache
memory can be represented as:
The cache is the fastest component in the memory hierarchy and approaches the speed of CPU
components.

The basic operation of a cache memory is as follows:

o When the CPU needs to access memory, the cache is examined. If the word is found in the
cache, it is read from the fast memory.
o If the word addressed by the CPU is not found in the cache, the main memory is accessed to
read the word.
o A block of words containing the one just accessed is then transferred from main memory to cache
memory. The block size may vary from one word (the one just accessed) to about 16 words adjacent
to the one just accessed.
o The performance of the cache memory is frequently measured in terms of a quantity
called hit ratio.
o When the CPU refers to memory and finds the word in cache, it is said to produce a hit.
o If the word is not found in the cache, it is in main memory and it counts as a miss.
o The ratio of the number of hits to the total number of CPU references to memory (hits plus
misses) is the hit ratio.
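As a small worked illustration of this definition (not from the original text):

    #include <stdio.h>

    /* Hit ratio = hits / (hits + misses), per the definition above. */
    double hit_ratio(unsigned long hits, unsigned long misses) {
        return (double)hits / (double)(hits + misses);
    }

    int main(void) {
        /* e.g. 970 hits in 1000 total references to memory */
        printf("hit ratio = %.2f\n", hit_ratio(970, 30)); /* prints 0.97 */
        return 0;
    }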

Virtual Memory
Virtual memory is a technique whereby large programs can store themselves in the form of pages
during their execution, with only the required pages or portions of processes loaded into the main
memory. This technique is useful because it provides a large virtual memory for user programs even
when the physical memory available is very small.
In real scenarios, most processes never need all their pages at once, for the following reasons:

• Error handling code is not needed unless that specific error occurs, some of which are quite
rare.
• Arrays are often over-sized for worst-case scenarios, and only a small fraction of the arrays
are actually used in practice.
• Certain features of certain programs are rarely used.

Benefits of having Virtual Memory

1. Large programs can be written, as the virtual address space available is huge compared to
physical memory.
2. Less I/O is required, which leads to faster and easier swapping of processes.
3. More physical memory is available, because programs are stored in virtual memory and so occupy
very little space in actual physical memory.

What is Demand Paging?


The basic idea behind demand paging is that when a process is swapped in, its pages are not swapped
in all at once. Rather, they are swapped in only when the process needs them (on demand). The
component that does this is termed a lazy swapper, although a pager is a more accurate term.

Initially, only those pages are loaded which will be required by the process immediately.
The pages that are not moved into memory are marked as invalid in the page table. For an
invalid entry, the rest of the entry is empty. Pages that are loaded in memory are marked as valid,
along with the information about where to find the swapped-out page.
When the process requires any page that is not loaded into memory, a page fault trap is
triggered and the following steps are followed:

1. The memory address requested by the process is first checked, to verify the request made by
the process.
2. If it is found to be invalid, the process is terminated.
3. If the request by the process is valid, a free frame is located, possibly from a free-frame
list, into which the required page will be moved.
4. A disk operation is scheduled to move the necessary page from disk to the specified memory
location. (This will usually block the process on an I/O wait, allowing some other process to
use the CPU in the meantime.)
5. When the I/O operation is complete, the process's page table is updated with the new frame
number, and the invalid bit is changed to valid.
6. The instruction that caused the page fault must now be restarted from the beginning.
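A minimal, self-contained C sketch of this valid/invalid-bit mechanism (a toy model, not a real OS implementation; all names and sizes are illustrative):

    #include <stdio.h>

    #define NPAGES 8  /* pages in the process's address space */
    #define NFRAMES 3 /* physical frames available */

    int valid[NPAGES];    /* 0 = page not in memory (invalid entry) */
    int frame_of[NPAGES]; /* frame number, meaningful only when valid */
    int next_free = 0;

    /* Access a page; on a fault, load it into a free frame (steps 3-6 above,
       with the disk transfer elided). Returns the frame, or -1 if no free
       frame remains, i.e. page replacement would be needed. */
    int access_page(int page) {
        if (!valid[page]) { /* page fault trap */
            if (next_free >= NFRAMES)
                return -1; /* no free frame: see Page Replacement below */
            frame_of[page] = next_free++; /* step 3: locate a free frame */
            valid[page] = 1;              /* step 5: invalid bit changed to valid */
            printf("page fault: page %d -> frame %d\n", page, frame_of[page]);
        }
        return frame_of[page];
    }

    int main(void) {
        int refs[] = { 0, 1, 0, 2, 1 }; /* faults occur only on first use */
        for (int i = 0; i < 5; i++)
            access_page(refs[i]);
        return 0;
    }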

There are cases when no pages are loaded into memory initially; pages are loaded only when demanded
by the process by generating page faults. This is called Pure Demand Paging.
The major issue with demand paging is the overhead of servicing page faults: each fault requires a
disk transfer before the faulting instruction can be restarted. This is not a big issue for small
programs, but for larger programs that fault frequently it affects performance drastically.

Page Replacement

As studied in Demand Paging, only certain pages of a process are loaded into memory initially.
This allows us to fit a greater number of processes into memory at the same time. But what happens
when a process requests more pages and no free memory is available to bring them in? The following
steps can be taken to deal with this problem:

1. Put the process in the wait queue, until any other process finishes its execution thereby
freeing frames.
2. Or, remove some other process completely from the memory to free frames.
3. Or, find some pages that are not being used right now and move them to the disk to get free
frames. This technique is called page replacement and is the most commonly used approach. Several
algorithms exist to carry out page replacement efficiently.

Basic Page Replacement

• Find the location of the page requested by the ongoing process on the disk.
• Find a free frame. If there is a free frame, use it. If there is no free frame, use a page-
replacement algorithm to select an existing frame to be replaced; such a frame is known
as the victim frame.
• Write the victim frame to disk. Change all related page tables to indicate that this page is no
longer in memory.
• Move the required page into the freed frame. Adjust all related page and frame tables to
indicate the change.
• Restart the process that was waiting for this page.

FIFO Page Replacement

• A very simple way of doing page replacement is FIFO (First In First Out), sketched below.
• As new pages are requested and swapped in, they are added to the tail of a queue, and the page
at the head becomes the victim.
• It is not an effective way of page replacement, but it can be used for small systems.
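A minimal C sketch of FIFO replacement (illustrative only): the frames act as a circular queue and the oldest resident page is evicted.

    #include <stdio.h>

    #define NFRAMES 3

    /* Count page faults for a reference string under FIFO replacement. */
    int fifo_faults(const int *refs, int n) {
        int frames[NFRAMES], head = 0, used = 0, faults = 0;
        for (int i = 0; i < n; i++) {
            int j, hit = 0;
            for (j = 0; j < used; j++)
                if (frames[j] == refs[i]) { hit = 1; break; }
            if (hit)
                continue;
            faults++;
            if (used < NFRAMES) {
                frames[used++] = refs[i]; /* a free frame is still available */
            } else {
                frames[head] = refs[i];   /* head of the queue is the victim */
                head = (head + 1) % NFRAMES;
            }
        }
        return faults;
    }

    int main(void) {
        int refs[] = { 7, 0, 1, 2, 0, 3, 0, 4 };
        printf("FIFO page faults: %d\n", fifo_faults(refs, 8)); /* prints 7 */
        return 0;
    }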

LRU Page Replacement


LRU (Least Recently Used) replaces the page that has not been referenced for the longest period of
time, on the assumption that pages used recently are likely to be used again soon. A sketch
follows.
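A matching C sketch of LRU replacement (illustrative only): each resident page is stamped with the time of its last use, and the least recently used page is the victim.

    #include <stdio.h>

    #define NFRAMES 3

    /* Count page faults for a reference string under LRU replacement. */
    int lru_faults(const int *refs, int n) {
        int pages[NFRAMES], last_use[NFRAMES], used = 0, faults = 0;
        for (int t = 0; t < n; t++) {
            int j, hit = -1;
            for (j = 0; j < used; j++)
                if (pages[j] == refs[t]) { hit = j; break; }
            if (hit >= 0) { last_use[hit] = t; continue; } /* hit: refresh stamp */
            faults++;
            if (used < NFRAMES) {
                j = used++; /* free frame available */
            } else {
                j = 0;      /* evict the least recently used page */
                for (int k = 1; k < NFRAMES; k++)
                    if (last_use[k] < last_use[j]) j = k;
            }
            pages[j] = refs[t];
            last_use[j] = t;
        }
        return faults;
    }

    int main(void) {
        int refs[] = { 7, 0, 1, 2, 0, 3, 0, 4 };
        printf("LRU page faults: %d\n", lru_faults(refs, 8)); /* prints 6 */
        return 0;
    }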

Thrashing
A process that is spending more time paging than executing is said to be thrashing. In other words,
the process doesn't have enough frames to hold all the pages needed for its execution, so it is
swapping pages in and out very frequently to keep executing. Sometimes, pages which will be
required in the near future have to be swapped out.
Initially, when CPU utilization is low, the process scheduling mechanism loads multiple processes
into memory at the same time to increase the level of multiprogramming, allocating a limited
number of frames to each process. As memory fills up, each process starts to spend a lot of time
waiting for its required pages to be swapped in, again leading to low CPU utilization because most
of the processes are waiting for pages. The scheduler responds by loading still more processes to
increase CPU utilization, and as this continues, at some point the complete system comes to a halt.
To prevent thrashing we must provide processes with as many frames as they really need "right
now".

A secondary storage device refers to any non-volatile storage device that is internal or external to the
computer. It can be any storage device beyond the primary storage that enables permanent data storage.
A secondary storage device is also known as an auxiliary storage device, backup storage device, tier 2
storage, or external storage. These devices store virtually all programs and applications on a computer,
including the operating system, device drivers, applications and general user data.

The Secondary storage media can be fixed or removable. Fixed Storage media is an internal storage
medium like a hard disk that is fixed inside the computer. A storage medium that is portable and can
be taken outside the computer is termed removable storage media. The main advantage of using
secondary storage devices is:

o In secondary storage devices, the stored data might not be under the direct control of the
operating system. For example, many organizations store their archival data or critical
documents on secondary storage drives that their main network cannot directly access, to ensure
their preservation in the event of a data breach.
o Since these drives do not interact directly with the main infrastructure and can be situated in a
remote or secure site, it is unlikely that a hacker can access these drives unless they're
physically stolen.

Why do we need Secondary Storage?

Computers use main memory such as random access memory (RAM) and cache to hold data that is
being processed. However, this type of memory is volatile, and it loses its data when the computer is
switched off. General-purpose computers, such as personal computers and tablets, need to store
programs and data for later use.

That's why secondary storage is needed to keep programs and data long term. Secondary storage is
non-volatile and able to retain data for long-term storage. It is used for various purposes, such
as holding backup data for future restores or disaster recovery, long-term archiving of data that
is not frequently accessed, and storage of non-critical data on lower-performing, less expensive
drives.

Without secondary storage, all programs and data would be lost when the computer is switched off.

Characteristics of Secondary Storage Devices

These are some characteristics of secondary memory which distinguish it from primary memory:

o It is non-volatile, which means it retains data when power is switched off


o It allows for the storage of data ranging from a few megabytes to petabytes.
o It is cheaper as compared to primary memory.
o Secondary storage devices like CDs and flash drives can transfer the data from one device to
another.

Types of Secondary Storage Device

Here are the two types of secondary storage devices, i.e., fixed storage and removable storage.

1. Fixed Storage

Fixed storage is an internal media device used by a computer system to store data. Usually, these are
referred to as the fixed disk drives or Hard Drives.

Despite the name, fixed storage devices are not literally immovable. They can be removed from the
system for repair work, maintenance, or an upgrade. In general, however, this cannot be done
without a proper toolkit to open up the computer system and provide physical access, which needs
to be done by an engineer.
Technically, almost all of the data being processed on a computer system is stored on some built-in
fixed storage device. We have the following types of fixed storage:

o Internal flash memory (rare)


o SSD (solid-state disk) units
o Hard disk drives (HDD)

2. Removable Storage

Removable storage is an external media device that is used by a computer system to store data.
Usually, these are referred to as removable disk drives or external drives. Removable storage is
any storage device that can be removed from a computer system while the system is running.
Examples of external devices include CDs, DVDs, Blu-ray disk drives, diskettes and USB drives.
Removable storage makes it easier for a user to transfer data from one computer system to another.

The main benefit of removable disks as a storage medium is that they can provide the fast data
transfer rates associated with storage area networks (SANs). We have the following types of
removable storage:

o Optical discs (CDs, DVDs, Blu-ray discs)


o Memory cards
o Floppy disks
o Magnetic tapes
o Disk packs
o Paper storage (punched tapes, punched cards)

Classification of Secondary Storage Devices

The following image shows the classification of commonly used secondary storage devices.
Sequential Access Storage Device

It is a class of data storage devices that read stored data in a sequence. This is in contrast to
random access memory (RAM), where data can be accessed in any order. Magnetic tape is the most
common sequential access storage device.

i. Magnetic tape: It is a medium for magnetic recording, made of a thin, magnetizable coating
on a long, narrow strip of plastic film. Devices that record and play audio and video using
magnetic tape are tape recorders and videotape recorders. A device that stores computer data
on magnetic tape is known as a tape drive.
It was a key technology in early computer development, allowing unparalleled amounts of data
to be mechanically created, stored for long periods, and rapidly accessed.

Direct Access Storage Devices

A direct-access storage device (DASD) is another name for secondary storage devices that store data
in discrete locations with a unique address, such as hard disk drives, optical drives and most magnetic
storage devices.

1. Magnetic disks: A magnetic disk is a storage device that uses a magnetization process to write,
rewrite and access data. It is covered with a magnetic coating and stores data in the form of tracks,
spots and sectors. Hard disks, zip disks and floppy disks are common examples of magnetic disks.

i. Floppy Disk: A floppy disk is a flexible disk with a magnetic coating on it, and it is packaged
inside a protective plastic envelope. These are among the oldest portable storage devices that
could store up to 1.44 MB of data, but now they are not used due to very little memory storage.
ii. Hard Disk Drive (HDD): A hard disk drive comprises a series of circular disks
called platters arranged one over the other, almost half an inch apart, around a spindle. Platters
are made of non-magnetic material like aluminium alloy and coated with 10-20 nm of magnetic
material. Early disk packs used platters up to 14 inches in diameter, while modern drives use
3.5-inch or 2.5-inch platters; they rotate at speeds varying from 4200 rpm (rotations per minute)
for personal computers to 15000 rpm for servers.
Data is stored by magnetizing or demagnetizing the magnetic coating. A read/write head arm is used
to read data from and write data to the disks. A typical modern HDD has a capacity in
terabytes (TB).

2. Optical Disk: An optical disk is any computer disk that uses optical storage techniques and
technology to read and write data. It is a computer storage disk that stores data digitally and uses laser
beams to read and write data.

i. CD Drive: CD stands for Compact Disk. CDs are circular disks that use optical rays, usually
lasers, to read and write data. They are very cheap as you can get 700 MB of storage space for
less than a dollar. CDs are inserted in CD drives built into the CPU cabinet. They are portable
as you can eject the drive, remove the CD and carry it with you. There are three types of CDs:
o CD-ROM (Compact Disk - Read Only Memory): The data on these CDs is recorded by the
manufacturer. Proprietary software, audio and video are released on CD-ROMs.
o CD-R (Compact Disk - Recordable): The user can write data once on the CD-R. It
cannot be deleted or modified later.
o CD-RW (Compact Disk - Rewritable): Data can repeatedly be written and deleted on
these optical disks.
ii. DVD Drive: DVD stands for Digital Versatile Disc (originally Digital Video Disc). A DVD is an
optical device that can store several times the data held by a CD: 4.7 GB for a single-layer disc,
versus a CD's roughly 700 MB. They are usually used to store rich multimedia files that need high
storage capacity. DVDs also come in three varieties: read-only, recordable and rewritable.
iii. Blu-ray Disk: A Blu-ray Disk (BD) is an optical storage medium that stores high-definition
(HD) video and other multimedia files. BD uses a shorter-wavelength laser than CD/DVD, enabling
the beam to focus more tightly on the disk and pack in more data. BDs can store up to
128 GB of data.

3. Memory Storage Devices: A memory device contains millions or billions of interconnected memory
cells that store data. These cells are built from transistors that are switched on or off to
represent the 1s and 0s of binary code, allowing a computer to read and write information. This
category includes USB drives, flash memory devices, and SD and memory cards, which you'll
recognize as the storage medium used in digital cameras.

i. Flash Drive: A flash drive is a small, ultra-portable storage device. USB flash drives make it
easy to move files from one device to another. Flash drives connect to computers and other devices
via a built-in USB Type-A or USB-C plug, making each one a USB device and cable combination.
Flash drives are often referred to as pen drives, thumb drives, or jump drives. The terms USB
drive and solid-state drive (SSD) are also sometimes used, but most of the time those refer to
larger, less mobile USB-based storage devices like external hard drives.
These days, a USB flash drive can hold up to 2 TB of storage. They're more expensive per
gigabyte than an external hard drive, but they have prevailed as a simple, convenient solution
for storing and transferring smaller files.
A pen drive has the following advantages in computer organization:
o Transfer Files: A pen drive is a device plugged into a USB port of the system that is
used to transfer files, documents, and photos to a PC and vice versa.
o Portability: The lightweight nature and smaller size of a pen drive make it possible to
carry it from place to place, making data transportation an easier task.
o Backup Storage: Most pen drives now come with password encryption, so important information
related to family, medical records, and photos can be stored on them as a backup.
o Transport Data: Professionals and students can easily transport large data files, video, and
audio lectures on a pen drive and access them from anywhere. Independent PC technicians can store
work-related utility tools, various programs, and files on a high-speed 64 GB pen drive and move
from one site to another.
ii. Memory card: A memory card or memory cartridge is an electronic data storage device used
for storing digital information, typically using flash memory. These are commonly used in
portable electronic devices such as digital cameras, mobile phones, laptop computers, tablets,
PDAs, portable media players, video game consoles, synthesizers, electronic keyboards and
digital pianos. They allow memory to be added to such devices without compromising ergonomics, as
the card is usually contained within the device rather than protruding like a USB flash drive.

Difference between Primary and Secondary Memory

Below are some main differences between primary and secondary memory in computer organization.

Primary Memory vs. Secondary Memory

1. Primary memory is directly accessed by the Central Processing Unit (CPU). Secondary memory is
not accessed directly by the CPU; instead, data from secondary memory is first loaded into Random
Access Memory (RAM) and then sent to the processing unit.
2. RAM provides much faster access to data than secondary memory; computers can quickly process
data by loading software programs and required files into primary memory (RAM). Secondary memory
is slower at data access; typically, primary memory is about six times faster than secondary
memory.
3. Primary memory is volatile and gets completely erased when a computer is shut down. Secondary
memory is non-volatile, which means it can hold on to its data with or without an electrical power
supply.

RAID (Redundant arrays of independent disks) concept will be given as separate file.

Multiprocessors and Thread-Level Parallelism

We saw renewed interest in developing multiprocessors in the early 2000s, for several reasons:

- The slowdown in uniprocessor performance due to the diminishing returns in exploiting
instruction-level parallelism.
- The difficulty of dissipating the heat generated by uniprocessors with high clock rates.
- The demand for high-performance servers, where thread-level parallelism is natural.
For all these reasons, multiprocessor architectures have become increasingly attractive.

A Taxonomy of Parallel Architectures


The idea of using multiple processors both to increase performance and to improve
availability dates back to the earliest electronic computers. About 30 years ago, Flynn
proposed a simple model of categorizing all computers that is still useful today. He
looked at the parallelism in the instruction and data streams called for by the instructions
at the most constrained component of the multiprocessor, and placed all computers in one
of four categories:
1. Single instruction stream, single data stream (SISD)—This category is the uniprocessor.

2. Single instruction stream, multiple data streams (SIMD)—The same instruction is executed by
multiple processors using different data streams. Each processor has its own data memory (hence
multiple data), but there is a single instruction memory and control processor, which fetches and
dispatches instructions. Vector architectures are the largest class of processors of this type.

3. Multiple instruction streams, single data stream (MISD)—No commercial multiprocessor of this
type has been built to date, but some may be in the future. Some special-purpose stream processors
approximate a limited form of this (there is only a single data stream that is operated on by
successive functional units).
4. Multiple instruction streams, multiple data streams (MIMD)—Each processor fetches
its own instructions and operates on its own data. The processors are often off-the-shelf
microprocessors. This is a coarse model, as some multiprocessors are hybrids of these
categories. Nonetheless, it is useful to put a framework on the design space.

1. MIMDs offer flexibility. With the correct hardware and software support, MIMDs
can function as single-user multiprocessors focusing on high performance for one
application, as multiprogrammed multiprocessors running many tasks simultaneously, or
as some combination of these functions.
2. MIMDs can build on the cost/performance advantages of off-the-shelf
microprocessors. In fact, nearly all multiprocessors built today use the same
microprocessors found in workstations and single-processor servers.
With an MIMD, each processor is executing its own instruction stream. In many cases,
each processor executes a different process. Recall from the last chapter that a process is
a segment of code that may be run independently, and that the state of the process contains
all the information necessary to execute that program on a processor. In a
multiprogrammed environment, where the processors may be running independent tasks,
each process is typically independent of the processes on other processors.
It is also useful to be able to have multiple processors executing a single program and
sharing the code and most of their address space. When multiple processes share code
and data in this way, they are often called threads. Today, the term thread is often used in a
casual way to refer to multiple loci of execution
that may run on different processors, even when they do not share an address space. To
take advantage of an MIMD multiprocessor with n processors, we must usually have at
least n threads or processes to execute. The independent threads are typically identified by
the programmer or created by the compiler. Since the parallelism in this situation is
contained in the threads, it is called thread-level parallelism.
Threads may vary from large-scale, independent processes–for example, independent
programs running in a multiprogrammed fashion on different processors– to parallel
iterations of a loop, automatically generated by a compiler and each executing for
perhaps less than a thousand instructions. Although the size of a thread is important in
considering how to exploit thread-level parallelism efficiently, the important qualitative
distinction is that such parallelism is identified at a high level by the software system and
that the threads consist of hundreds to millions of instructions that may be executed in
parallel. In contrast, instruction-level parallelism is identified primarily by the
hardware, though with software help in some cases, and is found and exploited one
instruction at a time.
Existing MIMD multiprocessors fall into two classes, depending on the number of
processors involved, which in turn dictate a memory organization and interconnect
strategy. We refer to the multiprocessors by their memory organization, because what
constitutes a small or large number of processors is likely to change over time.
The first group, which we call centralized shared-memory architectures, had at most a few dozen
processors in 2000.
For multiprocessors with small processor counts, it is possible for the processors to share
a single centralized memory and to interconnect the processors and memory by a bus. With
large caches, the bus and the single memory, possibly with multiple banks, can satisfy the
memory demands of a small number of processors. By replacing a single bus with multiple
buses, or even a switch, a centralized shared memory design can be scaled to a few dozen
processors. Although scaling beyond that is technically conceivable, sharing a centralized
memory, even organized as multiple banks, becomes less attractive as the number of
processors sharing it increases.
Because there is a single main memory that has a symmetric relationship to all processors
and a uniform access time from any processor, these multiprocessors are often called
symmetric (shared-memory) multiprocessors ( SMPs), and this style of architecture is
sometimes called UMA for uniform memory access. This type of centralized shared-
memory architecture is currently by far the most popular organization.
The second group consists of multiprocessors with physically distributed memory. To
support larger processor counts, memory must be distributed among the processors rather
than centralized; otherwise the memory system would not be able to support the bandwidth
demands of a larger number of processors without incurring excessively long access
latency. With the rapid increase in processor performance and the associated increase in a
processor's memory bandwidth requirements, the scale of multiprocessor for which
distributed memory is preferred over a single, centralized memory continues to decrease
(which is another reason not to label these designs simply as small or large scale). Of course,
the larger number of processors raises the need for a high-bandwidth interconnect.

Distributed-memory multiprocessor
Distributing the memory among the nodes has two major benefits. First, it is a cost-
effective way to scale the memory bandwidth, if most of the accesses are to the local
memory in the node. Second, it reduces the latency for accesses to the local memory. These
two advantages make distributed memory attractive at smaller processor counts as
processors get ever faster and require more memory bandwidth and lower memory latency.
The key disadvantage for a distributed memory architecture is that communicating data
between processors becomes somewhat more complex and has higher latency, at least
when there is no contention, because the processors no longer share a single centralized
memory. As we will see shortly, the use of distributed memory leads to two different
paradigms for interprocessor communication.
Typically, I/O as well as memory is distributed among the nodes of the multiprocessor,
and the nodes may be small SMPs (2-8 processors). The use of multiple processors in a node,
together with a memory and a network interface, is quite useful from the cost-efficiency
viewpoint.

Challenges for Parallel Processing

• Limited parallelism available in programs
– We need new algorithms that can have better parallel performance.
• Suppose you want to achieve a speedup of 80 with 100 processors. What fraction
of the original computation can be sequential?
By Amdahl's law:

Speedup = 1 / (Fraction_parallel / 100 + (1 - Fraction_parallel))

Setting Speedup = 80 and solving: 80 × (Fraction_parallel / 100 + 1 - Fraction_parallel) = 1,
so 80 - 79.2 × Fraction_parallel = 1, giving

Fraction_parallel = 79 / 79.2 = 0.9975

That is, only 0.25% of the original computation can be sequential.

Data Communication Models for Multiprocessors

– shared memory: access shared address space implicitly via load and store
operations.
– message-passing: done by explicitly passing messages among the
processors
• can invoke software with Remote Procedure Call (RPC)

• often via library, such as MPI: Message Passing Interface

• also called "synchronous communication", since communication causes synchronization
between the 2 processes
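For illustration, a minimal C sketch of explicit message passing using the MPI library mentioned above (assumes an MPI installation and at least two ranks, e.g. run with mpirun -np 2):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* explicit send: processor 0 passes a message to processor 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* the matching receive synchronizes the two processes */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }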

Message-Passing Multiprocessor

- The address space can consist of multiple private address spaces that are
logically disjoint and cannot be addressed by a remote processor

- The same physical address on two different processors refers to two


different locations in two different memories.

Multicomputer (cluster):

- can even consist of completely separate computers connected on a LAN.


- cost-effective for applications that require little or no communication.
Symmetric Shared-Memory Architectures
Multilevel caches can substantially reduce the memory bandwidth demands of a processor. This
approach is extremely cost-effective, and it can work in a plug-and-play fashion by placing the
processor-and-cache subsystem on a board that plugs into the bus backplane.

Developed by:
• IBM – one-chip multiprocessor
• AMD and Intel – two-processor designs
• Sun – 8-processor multicore
Symmetric shared-memory machines support caching of:
• Shared data
• Private data
Private data are used by a single processor. When a private item is cached, its location is
migrated to the cache. Since no other processor uses the data, the program behavior is identical
to that in a uniprocessor.

Shared data are used by multiple processors.
When shared data are cached, the shared value may be replicated in multiple caches. The advantages
are reduced access latency and reduced memory contention; however, replication induces a new
problem: cache coherence.

Cache Coherence

Unfortunately, caching shared data introduces a new problem: the view of memory held by two
different processors is through their individual caches, and without any additional precautions,
the processors could end up seeing two different values. That is, if two different processors hold
two different values for the same location, we have what is generally referred to as the cache
coherence problem.
Cache coherence problem for a single memory location:
• Informally:
– “Any read must return the most recent write”
– Too strict and too difficult to implement
• Better:
– “Any write must eventually be seen by a read”
– All writes are seen in proper order (“serialization”)
• Two rules to ensure this:
– “If P writes x and then P1 reads it, P’s write will be seen by P1 if the read
and write are sufficiently far apart”
– Writes to a single location are serialized: seen in one order
• Latest write will be seen
• Otherwise could see writes in illogical order (could see older
value after a newer value)

The definition contains two different aspects of memory system:


• Coherence
• Consistency
A memory system is coherent if:
• Program order is preserved.
• A processor does not continue to read the old data value after another processor's write.
• Writes to the same location are serialized.
The above three properties are sufficient to ensure coherence. When a written value will be
seen is also important; this issue is defined by the memory consistency model. Coherence and
consistency are complementary.

Basic schemes for enforcing coherence

Coherence cache provides:


• Migration: a data item can be moved to a local cache and used there in a transparent
fashion.
• Replication: shared data that are being simultaneously read can be replicated.
• Both are critical to performance in accessing shared data.
To support these, we adopt a hardware solution by introducing protocols to maintain coherent
caches, named cache coherence protocols. These protocols are implemented by tracking the state of
any sharing of a data block.
Two classes of Protocols
• Directory based
• Snooping based

Directory based
• Sharing status of a block of physical memory is kept in one location called the
directory.
• Directory-based coherence has slightly higher implementation overhead than
snooping.
• It can scale to larger processor count.

Snooping
• Every cache that has a copy of the data also has a copy of the sharing status of the
block.
• No centralized state is kept.
• Caches are all accessible via some broadcast medium (bus or switch).
• Cache controllers monitor, or snoop, on the medium to determine whether or not
they have a copy of a block that is requested on a bus or switch access.
Snooping protocols are popular with multiprocessors whose caches are attached to a single shared
memory, as they can use the existing physical connection (the bus to memory) to interrogate
the status of the caches. A snoop-based cache coherence scheme is implemented on a
shared bus, or any communication medium that broadcasts cache misses to all the processors.

Basic Snoopy Protocols


• Write strategies
– Write-through: memory is always up-to-date
– Write-back: snoop in caches to find most recent copy

• Write Invalidate Protocol


– Multiple readers, single writer
– Write to shared data: an invalidate is sent to all caches which snoop and
invalidate any copies
• Read miss: further read will miss in the cache and fetch a new
copy of the data.

• Write Broadcast/Update Protocol (typically write through)


– Write to shared data: broadcast on bus, processors snoop, and update any
copies
– Read miss: memory/cache is always up-to-date.
• Write serialization: bus serializes requests!
– Bus is single point of arbitration

Examples of Basic Snooping Protocols

Write Invalidate
Write Update

Assume neither cache initially holds X and the value of X in memory is 0

Example Protocol

• Snooping coherence protocol is usually implemented by incorporating a finite-


state controller in each node
• Logically, think of a separate controller associated with each cache block
– That is, snooping operations or cache requests for different blocks can
proceed independently
• In implementations, a single controller allows multiple operations to distinct
blocks to proceed in interleaved fashion
– that is, one operation may be initiated before another is completed, even
though only one cache access or one bus access is allowed at a time
Example Write Back Snoopy Protocol

• Invalidation protocol, write-back cache


– Snoops every address on bus
– If it has a dirty copy of requested block, provides that block in response to
the read request and aborts the memory access
• Each memory block is in one state:
– Clean in all caches and up-to-date in memory (Shared)
– OR Dirty in exactly one cache (Exclusive)
– OR Not in any caches
• Each cache block is in one state (track these):
– Shared: the block can be read
– OR Exclusive: this cache has the only copy, it is writeable, and it is dirty
– OR Invalid: the block contains no data (as in a uniprocessor cache)
• Read misses: cause all caches to snoop bus
• Writes to clean blocks are treated as misses
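The per-block states above can be sketched in C as a simple transition function (a toy model of the CPU side only; a real controller also reacts to snooped bus events):

    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE } block_state; /* the 3 states above */
    typedef enum { CPU_READ, CPU_WRITE } cpu_op;

    /* Next state for a CPU access; *bus_miss is set when a read/write miss
       must be placed on the bus for other caches to snoop. */
    block_state next_state(block_state s, cpu_op op, int *bus_miss) {
        *bus_miss = 0;
        switch (s) {
        case INVALID:            /* block contains no data */
            *bus_miss = 1;       /* read miss or write miss goes on the bus */
            return (op == CPU_READ) ? SHARED : EXCLUSIVE;
        case SHARED:             /* block can be read */
            if (op == CPU_WRITE) {
                *bus_miss = 1;   /* a write to a clean block is treated as a miss */
                return EXCLUSIVE;
            }
            return SHARED;       /* read hit */
        case EXCLUSIVE:          /* only copy, writeable, dirty */
            return EXCLUSIVE;    /* read or write hit */
        }
        return s;
    }

    int main(void) {
        int miss;
        block_state s = INVALID;
        s = next_state(s, CPU_READ, &miss);  /* -> SHARED, miss placed on bus */
        s = next_state(s, CPU_WRITE, &miss); /* -> EXCLUSIVE, write miss on bus */
        printf("final state = %d, last access was a miss: %d\n", s, miss);
        return 0;
    }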

Write-Back State Machine – CPU


State Transitions for Each Cache Block is as shown below
• CPU may read/write hit/miss to the block
• May place write/read miss on bus
• May receive read/write miss from bus

Cache Coherence State Diagram


Conclusion
• "End" of uniprocessor speedup => multiprocessors
• Parallelism challenges: % parallelizable, long latency to remote memory
• Centralized vs. distributed memory
– Small MP vs. lower latency, larger BW for Larger MP
• Message Passing vs. Shared Address
– Uniform access time vs. Non-uniform access time
• Snooping cache over shared medium for smaller MP by invalidating other cached
copies on write
• Sharing cached data ⇒ Coherence (values returned by a read), Consistency
(when a written value will be returned by a read)
• Shared medium serializes writes ⇒ Write consistency

Implementation Complications
• Write Races:
– Cannot update cache until bus is obtained
• Otherwise, another processor may get bus first,
and then write the same cache block!
– Two step process:
• Arbitrate for bus
• Place miss on bus and complete operation
– If miss occurs to block while waiting for bus,
handle miss (invalidate may be needed) and then restart.
– Split transaction bus:
• Bus transaction is not atomic:
can have multiple outstanding transactions for a block
• Multiple misses can interleave,
allowing two caches to grab block in the Exclusive state
• Must track and prevent multiple misses for one block
• Must support interventions and invalidations

Performance Measurement
• Overall cache performance is a combination of
– Uniprocessor cache miss traffic
– Traffic caused by communication – invalidation and subsequent cache
misses
• Changing the processor count, cache size, and block size can affect these two
components of miss rate
• Uniprocessor miss rate: compulsory, capacity, conflict
• Communication miss rate: coherence misses
– True sharing misses + false sharing misses

True and False Sharing Miss


• True sharing miss
– The first write by a PE to a shared cache block causes an invalidation to
establish ownership of that block
– When another PE attempts to read a modified word in that cache block, a
miss occurs and the resultant block is transferred
• False sharing miss
– Occurs when a block is invalidated (and a subsequent reference causes a miss) because some word
in the block, other than the one being read, is written to
– The block is shared, but no word in the block is actually shared, and this
miss would not occur if the block size were a single word
• Assume that words x1 and x2 are in the same cache block, which is in the shared
state in the caches of P1 and P2. Assuming the following sequence of events,
identify each miss as a true sharing miss or a false sharing miss.

Time   P1          P2
1      Write x1
2                  Read x2
3      Write x1
4                  Write x2
5      Read x2

Example Result
• 1: True sharing miss (invalidates the copy of the block in P2)
• 2: False sharing miss
– x2 was invalidated by P1's write of x1, but that value of x1 is not used in
P2
• 3: False sharing miss
– The block containing x1 is marked shared due to the read in P2, but P2 did
not read x1. A write miss is required to obtain exclusive access to the
block
• 4: False sharing miss
• 5: True sharing miss

Distributed Shared-Memory Architectures


Distributed shared-memory architectures
• Separate memory per processor
– Local or remote access via memory controller
– The physical address space is statically distributed
Coherence Problems
• Simple approach: uncacheable
– shared data are marked as uncacheable and only private data are kept in
caches
– very long latency to access memory for shared data
• Alternative: directory for memory blocks
– The directory per memory tracks state of every block in every cache
• which caches have a copies of the memory block, dirty vs. clean, ...
– Two additional complications
• The interconnect cannot be used as a single point of arbitration like
the bus
• Because the interconnect is message oriented, many messages
must have explicit responses
To prevent directory becoming the bottleneck, we distribute directory entries with
memory, each keeping track of which processors have copies of their memory blocks
Directory Protocols
• Similar to Snoopy Protocol: Three states
– Shared: 1 or more processors have the block cached, and the value in
memory is up-to-date (as well as in all the caches)
– Uncached: no processor has a copy of the cache block (not valid in any
cache)
– Exclusive: Exactly one processor has a copy of the cache block, and it has
written the block, so the memory copy is out of date
• The processor is called the owner of the block

• In addition to tracking the state of each cache block, we must track the processors
that have copies of the block when it is shared (usually a bit vector for each memory
block: 1 if processor has copy)

• Keep it simple(r):
– Writes to non-exclusive data => write miss
– Processor blocks until access completes
– Assume messages received and acted upon in order sent

Messages for Directory Protocols

• local node: the node where a request originates


• home node: the node where the memory location and directory entry of an address
reside
• remote node: the node that has a copy of a cache block (exclusive or shared)

State Transition Diagram for Individual Cache Block


• Comparing to snooping protocols:
– identical states
– stimulus is almost identical
– writing a shared cache block is treated as a write miss (without fetching the
block)
– cache block must be in exclusive state when it is written
– any shared block must be up to date in memory
• write miss: data fetch and selective invalidate operations sent by the directory
controller (broadcast in snooping protocols)
Directory Operations: Requests and Actions
• Message sent to directory causes two actions:
– Update the directory
– More messages to satisfy request
• Block is in Uncached state: the copy in memory is the current value; only possible
requests for that block are:
– Read miss: the requesting processor is sent the data from memory, and the requestor is made
the only sharing node; the state of the block is made Shared.
– Write miss: the requesting processor is sent the value and becomes the sharing
node. The block is made Exclusive to indicate that the only valid copy is
cached. Sharers indicates the identity of the owner.
• Block is Shared => the memory value is up-to-date:
– Read miss: requesting processor is sent back the data from memory &
requesting processor is added to the sharing set.
– Write miss: requesting processor is sent the value. All processors in the set
Sharers are sent invalidate messages, & Sharers is set to identity of
requesting processor. The state of the block is made Exclusive.
• Block is Exclusive: current value of the block is held in the cache of the processor
identified by the set Sharers (the owner) => three possible directory requests:
– Read miss: the owner processor is sent a data fetch message, causing the state of
the block in the owner's cache to transition to Shared, and causing the owner to send
the data to the directory, where it is written to memory and sent back to the requesting
processor.
The identity of the requesting processor is added to the set Sharers, which still
contains the identity of the processor that was the owner (since it still has a
readable copy). The state is made Shared.
– Data write-back: owner processor is replacing the block and hence must
write it back, making memory copy up-to-date
(the home directory essentially becomes the owner), the block is now
Uncached, and the Sharer set is empty.
– Write miss: block has a new owner. A message is sent to old owner
causing the cache to send the value of the block to the directory from
which it is sent to the requesting processor, which becomes the new owner.
Sharers is set to identity of new owner, and state of block is made
Exclusive.
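As an illustration, a directory entry per memory block can be sketched in C as a state plus a Sharers bit vector (names and sizes are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state;

    struct dir_entry {
        dir_state state;
        uint8_t sharers; /* bit i set => processor i has a copy (8 processors) */
    };

    /* Read miss from processor p on a Shared block: send the data from
       memory and add p to the sharing set. */
    void read_miss_shared(struct dir_entry *e, int p) {
        e->sharers |= (uint8_t)(1u << p);
    }

    /* Write miss from processor p on a Shared block: invalidate messages go
       to every processor in Sharers, then p becomes the sole owner. */
    void write_miss_shared(struct dir_entry *e, int p) {
        e->sharers = (uint8_t)(1u << p);
        e->state = DIR_EXCLUSIVE;
    }

    int main(void) {
        struct dir_entry e = { DIR_SHARED, 0 };
        read_miss_shared(&e, 2);
        read_miss_shared(&e, 5);
        write_miss_shared(&e, 0);
        printf("state = %d, sharers = 0x%02x\n", e.state, e.sharers); /* 0x01 */
        return 0;
    }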

Synchronization : The Basics


Synchronization mechanisms are typically built with user-level software routines that rely
on hardware-supplied synchronization instructions.
• Why Synchronize?
Need to know when it is safe for different processes to use shared data
• Issues for Synchronization:
– Uninterruptable instruction to fetch and update memory (atomic
operation);
– User level synchronization operation using this primitive;
– For large scale MPs, synchronization can be a bottleneck; techniques to
reduce contention and latency of synchronization

Uninterruptable Instruction to Fetch and Update Memory


• Atomic exchange: interchange a value in a register for a value in memory
0 ⇒ synchronization variable is free
1 ⇒ synchronization variable is locked and unavailable
– Set register to 1 & swap
– New value in register determines success in getting lock
0 if you succeeded in setting the lock (you were first)
1 if other processor had already claimed access
– Key is that exchange operation is indivisible
• Test-and-set: tests a value and sets it if the value passes the test
• Fetch-and-increment: it returns the value of a memory location and atomically
increments it
– 0 ⇒ synchronization variable is free
• Hard to have read & write in 1 instruction: use 2 instead
• Load linked (or load locked) + store conditional
– Load linked returns the initial value
– Store conditional returns 1 if it succeeds (no other store to same memory
location since preceding load) and 0 otherwise
• Example doing atomic swap with LL & SC:
try: mov R3,R4 ; mov exchange value
ll R2,0(R1) ; load linked
sc R3,0(R1) ; store conditional
beqz R3,try ; branch store fails (R3 = 0)
mov R4,R2 ; put load value in R4

Example doing fetch & increment with LL & SC:
try: ll R2,0(R1) ; load linked
addi R2,R2,#1 ; increment (OK if reg–reg)
sc R2,0(R1) ; store conditional
beqz R2,try ; branch store fails (R2 = 0)

User Level Synchronization—Operation Using this Primitive


• Spin locks: processor continuously tries to acquire, spinning around a loop trying
to get the lock
li R2,#1
lockit: exch R2,0(R1) ;atomic exchange
bnez R2,lockit ;already locked?
• What about MP with cache coherency?
– Want to spin on cache copy to avoid full memory latency
– Likely to get cache hits for such variables
• Problem: exchange includes a write, which invalidates all other copies; this
generates considerable bus traffic
• Solution: start by simply repeatedly reading the variable; when it changes, then
try exchange (“test and test&set”):
try: li R2,#1
lockit: lw R3,0(R1) ;load var
bnez R3,lockit ;≠ 0 ⇒ not free ⇒ spin
exch R2,0(R1) ;atomic exchange
bnez R2,try ;already locked?
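The same test-and-test&set idea can be expressed with C11 atomics (a sketch; atomic_exchange plays the role of the exch instruction above):

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int lock_var; /* 0 = free, 1 = locked */

    void acquire(atomic_int *lock) {
        for (;;) {
            while (atomic_load(lock) != 0)
                ; /* spin on the cached copy; hits in the cache, no bus traffic */
            if (atomic_exchange(lock, 1) == 0) /* atomic exchange, as in exch */
                return; /* we were first: lock acquired */
            /* otherwise another processor won the race; go back to spinning */
        }
    }

    void release(atomic_int *lock) {
        atomic_store(lock, 0);
    }

    int main(void) {
        acquire(&lock_var);
        puts("in critical section");
        release(&lock_var);
        return 0;
    }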
Memory Consistency Models

• What is consistency? When must a processor see the new value? e.g., seems that

P1: A = 0; P2: B = 0;
..... .....
A = 1; B = 1;
L1: if (B == 0) ... L2: if (A == 0) ...
• Impossible for both if statements L1 & L2 to be true?
– What if write invalidate is delayed & processor continues?
• Memory consistency models:
what are the rules for such cases?
• Sequential consistency: result of any execution is the same as if the accesses of
each processor were kept in order and the accesses among different processors
were interleaved ⇒ assignments before ifs above
– SC: delay all memory accesses until all invalidates done
• There are schemes for faster execution than strict sequential consistency
• Not an issue for most programs; they are synchronized
– A program is synchronized if all access to shared data are ordered by
synchronization operations
write (x)
...
release (s) {unlock}
...
acquire (s) {lock}
...
read(x)
• Only those programs willing to be nondeterministic are not synchronized: a "data
race" means the outcome is a function of processor speed
• Several Relaxed Models for Memory Consistency since most programs are
synchronized; characterized by their attitude towards: RAR, WAR, RAW, WAW
to different addresses
Relaxed Consistency Models: The Basics

• Key idea: allow reads and writes to complete out of order, but use
synchronization operations to enforce ordering, so that a synchronized
program behaves as if the processor were sequentially consistent
– By relaxing orderings, we may obtain performance advantages
– This also specifies the range of legal compiler optimizations on shared data
– Unless synchronization points are clearly defined and programs
are synchronized, the compiler could not interchange a read and a write
of 2 shared data items because it might affect the semantics of the
program
• 3 major sets of relaxed orderings:
1. W→R ordering (all writes completed before next read)
• Because it retains ordering among writes, many programs that
operate under sequential consistency operate under this model
without additional synchronization. This is called processor consistency.
2. W→W ordering (all writes completed before next write)
3. R→W and R→R orderings, a variety of models depending on
ordering restrictions and how synchronization operations enforce
ordering
• There are many complexities in relaxed consistency models: defining precisely
what it means for a write to complete, and deciding when processors can see
values that they have written
