OS Unit-4 Material

Unit-IV

(chapter-1)
Memory Management Strategies: Introduction, Swapping, Contiguous Memory Allocation, Paging.
***
Background:
Memory management is the functionality of an operating system that handles or
manages primary memory and moves processes back and forth between main memory and disk
during execution. Memory management keeps track of each and every memory location,
regardless of whether it is allocated to some process or free. It determines how much memory is to
be allocated to each process and decides which process will get memory at what time. It tracks
whenever some memory gets freed or unallocated and updates the status accordingly.
Base and Limit Registers:
The base register holds the smallest legal physical memory address;
The limit register specifies the size of the range.

Logical vs. Physical Address Space


 Logical address – generated by the CPU; also referred to as virtual address
 Physical address – address seen by the memory unit
 Logical and physical addresses are the same in compile-time and load-time address-
binding schemes; logical (virtual) and physical addresses differ in execution-time
address-binding scheme
 The set of all logical addresses generated by a program is a logical address space.
 The set of all physical addresses corresponding to these logical addresses is a physical
address space.
Address Binding:
A user program goes through several steps (compilation, linking, and loading) before it runs, and
addresses may be represented in different ways during these steps. Addresses in the
source program are generally symbolic. A compiler typically binds these symbolic addresses to
relocatable addresses. The linkage editor or loader in turn binds the relocatable addresses to
absolute addresses.

Each binding is a mapping from one address space to another. Classically, the binding of
instructions and data to memory addresses can be done at any step along the way:

 Compile time: If you know at compile time where the process will reside in memory,
then absolute code can be generated.
 Load time: If it is not known at compile time where the process will reside in memory,
then the compiler must generate relocatable code.
 Execution time: If the process can be moved during its execution from one memory
segment to another, then binding must be delayed until run time.
Dynamic loading:

In dynamic loading, a routine is not loaded until it is called. All routines are kept on disk
in a relocatable load format. The main program is loaded into memory and is executed. When a
routine needs to call another routine, the calling routine first checks to see whether the other
routine has been loaded. If it has not, the relocatable linking loader is called to load the desired
routine into memory and to update the program’s address tables to reflect this change. Then
control is passed to the newly loaded routine.

The advantage of dynamic loading is that a routine is loaded only when it is needed. This
method is particularly useful when large amounts of code are needed to handle infrequently
occurring cases, such as error routines. In this case, although the total program size may be large,
the portion that is used (and hence loaded) may be much smaller.

Dynamic loading does not require special support from the operating system. It is the
responsibility of the users to design their programs to take advantage of such a method.
Operating systems may help the programmer, however, by providing library routines to
implement dynamic loading.
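On POSIX systems, one library facility for this is the dlopen() interface. The sketch below is only an illustration of the idea of loading a routine when it is first needed; the library name libmathx.so and the symbol cube are hypothetical.

/* Dynamic loading sketch: the routine is loaded only when it is first needed.
   Build with: gcc demo.c -ldl  (library name and symbol are hypothetical)   */
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>

int main(void)
{
    /* The routine is not in memory at program start; load it on first use. */
    void *handle = dlopen("./libmathx.so", RTLD_LAZY);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return EXIT_FAILURE;
    }

    /* Look up the routine's address in the newly loaded module. */
    int (*cube)(int) = (int (*)(int)) dlsym(handle, "cube");
    if (cube == NULL) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return EXIT_FAILURE;
    }

    printf("cube(3) = %d\n", cube(3));   /* control passes to the loaded routine */
    dlclose(handle);                      /* unload when no longer needed        */
    return 0;
}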
Static vs Dynamic Linking:
When static linking is used, the linker combines all the modules needed
by a program into a single executable program to avoid any runtime dependency.
When dynamic linking is used, it is not required to link the actual module or library with the
program, rather a reference to the dynamic module is provided at the time of compilation and
linking. Dynamic Link Libraries (DLL) in Windows and Shared Objects in Unix are good
examples of dynamic libraries.
***
3.1 Swapping:

A process must be in memory to be executed. A process, however, can be swapped temporarily
out of memory to a backing store and then brought back into memory for continued execution
(Figure 8.5). Swapping makes it possible for the total physical address space of all processes to
exceed the real physical memory of the system, thus increasing the degree of multiprogramming
in a system.

Standard Swapping

Standard swapping involves moving processes between main memory and a backing
store. The backing store is commonly a fast disk. It must be large enough to accommodate copies
of all memory images for all users, and it must provide direct access to these memory images.
The system maintains a ready queue consisting of all processes whose memory images are on
the backing store or in memory and are ready to run. Whenever the CPU scheduler decides to
execute a process, it calls the dispatcher. The dispatcher checks to see whether the next process
in the queue is in memory. If it is not, and if there is no free memory region, the dispatcher swaps
out a process currently in memory and swaps in the desired process. It then reloads registers and
transfers control to the selected process.

The context-switch time in such a swapping system is fairly high. To get an idea of the
context-switch time, let us assume that the user process is 100 MB in size and the backing store is
a standard hard disk with a transfer rate of 50 MB per second. The actual transfer of the 100-MB
process to or from main memory takes

100 MB / 50 MB per second = 2 seconds = 2,000 milliseconds

Since we must swap the process both out and in, the total swap time is about 4,000 milliseconds.
Notice that the major part of the swap time is transfer time. The total transfer time is directly
proportional to the amount of memory swapped.
***
3.2 Contiguous Memory Allocation:
The main memory must accommodate both the operating system and the various user processes.

The memory is usually divided into two partitions:


 One for the resident operating system
 One for the user processes.

We can place the operating system in either low memory or high memory. The major factor
affecting this decision is the location of the interrupt vector. Since the interrupt vector is often in
low memory, programmers usually place the operating system in low memory as well.

In contiguous memory allocation, each process is contained in a single section of memory that
is contiguous to the section containing the next process.

Memory Protection:
We can prevent a process from accessing memory that does not belong to it. If we have a system
with a relocation register together with a limit register, we can enforce this protection. The
relocation register contains the value of the smallest physical address; the limit register contains
the range of logical addresses (for example, relocation = 100040 and limit = 74600). Each logical
address must fall within the range specified by the limit register. The MMU maps the logical
address dynamically by adding the value in the relocation register. This mapped address is sent
to memory (Figure 8.6).

When the CPU scheduler selects a process for execution, the dispatcher loads the
relocation and limit registers with the correct values as part of the context switch. Because every
address generated by a CPU is checked against these registers, we can protect both the operating
system and the other users’ programs and data from being modified by this running process.
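A minimal sketch of the check the MMU performs for each CPU-generated address, using the example values above (relocation = 100040, limit = 74600):

#include <stdio.h>
#include <stdlib.h>

#define RELOCATION 100040UL   /* smallest physical address of the process      */
#define LIMIT       74600UL   /* size of the process's legal logical range     */

/* Return the physical address for a logical address, or "trap" on a violation. */
static unsigned long translate(unsigned long logical)
{
    if (logical >= LIMIT) {                    /* outside the legal range        */
        fprintf(stderr, "trap: addressing error at %lu\n", logical);
        exit(EXIT_FAILURE);                    /* the OS would handle the trap   */
    }
    return logical + RELOCATION;               /* dynamic relocation by the MMU  */
}

int main(void)
{
    printf("logical 0     -> physical %lu\n", translate(0));
    printf("logical 74599 -> physical %lu\n", translate(74599));
    /* translate(74600) would trap to the operating system */
    return 0;
}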
Memory Allocation:

One of the simplest methods for allocating memory is to divide memory into several
fixed-sized partitions. Each partition may contain exactly one process. Thus, the degree of
multiprogramming is bound by the number of partitions. In this multiple-partition method,
when a partition is free, a process is selected from the input queue and is loaded into the free
partition. When the process terminates, the partition becomes available for another process.

In the variable-partition scheme, the operating system keeps a table indicating which parts of
memory are available and which are occupied. Initially, all memory is available for user
processes and is considered one large block of available memory, a hole. Eventually, as you will
see, memory contains a set of holes of various sizes.

As processes enter the system, they are put into an input queue. The operating system takes into
account the memory requirements of each process and the amount of available memory space in
determining which processes are allocated memory. When a process is allocated space, it is
loaded into memory, and it can then compete for CPU time. When a process terminates, it
releases its memory, which the operating system may then fill with another process from the
input queue.

This is an instance of the general dynamic storage-allocation problem, which concerns how to satisfy a request of
size n from a list of free holes. There are many solutions to this problem. The first-fit, best-fit,
and worst-fit strategies are the ones most commonly used to select a free hole from the set of
available holes.

• First fit. Allocate the first hole that is big enough. Searching can start either at the beginning of
the set of holes or at the location where the previous first-fit search ended. We can stop searching
as soon as we find a free hole that is large enough.
• Best fit. Allocate the smallest hole that is big enough. We must search the entire list, unless the
list is ordered by size. This strategy produces the smallest leftover hole.
• Worst fit. Allocate the largest hole. Again, we must search the entire list, unless it is sorted by
size. This strategy produces the largest leftover hole, which may be more useful than the smaller
leftover hole from a best-fit approach.

Simulations have shown that both first fit and best fit are better than worst fit in terms of
decreasing time and storage utilization. Neither first fit nor best fit is clearly better than the other
in terms of storage utilization, but first fit is generally faster.
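A rough sketch of first-fit selection over a list of free holes; the hole sizes and the request size below are made up for illustration. Best fit would instead scan the whole list for the smallest adequate hole, and worst fit for the largest.

#include <stdio.h>

/* Return the index of the first hole large enough for the request, or -1. */
static int first_fit(const int hole_size[], int nholes, int request)
{
    for (int i = 0; i < nholes; i++)
        if (hole_size[i] >= request)
            return i;               /* stop at the first hole that fits */
    return -1;                      /* no hole can satisfy the request  */
}

int main(void)
{
    int holes[] = { 100, 500, 200, 300, 600 };   /* free holes, in address order */
    int n = sizeof holes / sizeof holes[0];
    int request = 212;

    int i = first_fit(holes, n, request);
    if (i >= 0) {
        printf("allocate %d KB from hole %d (%d KB), leftover %d KB\n",
               request, i, holes[i], holes[i] - request);
        holes[i] -= request;        /* the remainder stays on the free list */
    }
    return 0;
}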

Fragmentation:
As processes are loaded and removed from memory, the free memory space is broken into little
pieces. After some time, processes cannot be allocated to these memory blocks because the
blocks are too small, and the blocks remain unused. This problem is known as fragmentation.
Fragmentation is of two types –
Internal fragmentation
The memory block assigned to a process is bigger than the process requested. Some portion of
the block is left unused, as it cannot be used by another process. Internal fragmentation can be
reduced by assigning the smallest partition that is still large enough for the process.
External fragmentation
Total free memory space is enough to satisfy a request or to hold a process, but it is not
contiguous, so it cannot be used.

 One solution to the problem of external fragmentation is compaction. The goal is to
shuffle the memory contents so as to place all free memory together in one large block.
 The following diagram shows how fragmentation can cause waste of memory and how a
compaction technique can be used to create more free memory out of fragmented memory.
***

3.3 Paging:
Paging is a memory management technique in which process address space is broken into
blocks of the same size called pages (size is power of 2). The size of the process is measured in
the number of pages.
Similarly, main memory is divided into small fixed-sized blocks of (physical) memory
called frames and the size of a frame is kept the same as that of a page to have optimum
utilization of the main memory and to avoid external fragmentation.
 Size of the page is equal to the size of the frame.
 Paging avoids external fragmentation
Basic Method:
The hardware support for paging is illustrated in Figure 8.10. Every address generated by the
CPU is divided into two parts: a page number (p) and a page offset (d). The page number is
used as an index into a page table. The page table contains the base address of each page in
physical memory. This base address is combined with the page offset to define the physical
memory address that is sent to the memory unit.

The logical address is divided as follows:

| page number (p) | page offset (d) |

where p is an index into the page table and d is the displacement within the page.
The paging model of memory is shown in Figure 8.11.

How the mapping works (page size = 4 bytes):
For example, logical address 0 is page 0, offset 0. Indexing into the page table, we find that page 0 is in frame
5. Thus, logical address 0 maps to physical address 20 [= (5 × 4) + 0]. Logical address 3 (page 0,
offset 3) maps to physical address 23 [= (5 × 4) + 3]. Logical address 4 is page 1, offset 0;
according to the page table, page 1 is mapped to frame 6. Thus, logical address 4 maps to
physical address 24 [= (6 × 4) + 0]. Logical address 13 (page 3, offset 1) maps to physical
address 9, since page 3 is in frame 2 [= (2 × 4) + 1].
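A small sketch of the translation applied above, assuming the page table from the figure (page 0 → frame 5, page 1 → frame 6, page 2 → frame 1, page 3 → frame 2) and a 4-byte page size:

#include <stdio.h>

#define PAGE_SIZE 4                            /* 4-byte pages, as in the example */

static const int page_table[] = { 5, 6, 1, 2 };   /* page number -> frame number  */

static int translate(int logical)
{
    int p = logical / PAGE_SIZE;               /* page number (high-order bits)   */
    int d = logical % PAGE_SIZE;               /* page offset (low-order bits)    */
    return page_table[p] * PAGE_SIZE + d;      /* frame base + displacement       */
}

int main(void)
{
    int addrs[] = { 0, 3, 4, 13 };
    for (int i = 0; i < 4; i++)
        printf("logical %2d -> physical %2d\n", addrs[i], translate(addrs[i]));
    /* prints 0 -> 20, 3 -> 23, 4 -> 24, 13 -> 9 */
    return 0;
}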
Free frames:

Hardware Support:
 Page table is kept in main memory
 Page-table base register (PTBR) points to the page table
 Page-table length register (PTLR) indicates size of the page table
 In this scheme every data/instruction access requires two memory accesses. One for the
page table and one for the data/instruction.
 The two memory access problem can be solved by the use of a special fast-lookup
hardware cache called associative memory or translation look-aside buffers
(TLBs)
 Some TLBs store address-space identifiers (ASIDs) in each TLB entry –
uniquely identifies each process to provide address-space protection for that
process

Using TLB:
The TLB contains only a few of the page-table entries. When a logical address is generated by
the CPU, its page number is presented to the TLB. If the page number is found (a TLB hit), its
frame number is immediately available and is used to access memory.

If the page number is not in the TLB (known as a TLB miss), a memory reference to the page
table must be made.
Protection:
One additional bit is generally attached to each entry in the page table: a valid–invalid bit. When
this bit is set to valid, the associated page is in the process’s logical address space and is thus a
legal (or valid) page. When the bit is set to invalid, the page is not in the process’s logical
address space. Illegal addresses are trapped by use of the valid–invalid bit. The operating system
sets this bit for each page to allow or disallow access to the page.
Shared Pages:
Reentrant code (pure code) is non-self-modifying code: it never changes during execution. Thus,
two or more processes can execute the same code at the same time. Consider, for example, three
processes sharing a three-page editor, each page 50 KB in size (the large page size is used to
simplify the figure). Each process has its own data page.

Each process has its own copy of registers and data storage to hold the data for the process's
execution. The data for two different processes will, of course, be different. Only one copy of the
editor need be kept in physical memory. Each user's page table maps onto the same physical
copy of the editor, but data pages are mapped onto different frames. Thus, a significant amount
of memory is saved.

Advantages and Disadvantages of Paging:


Here is a list of advantages and disadvantages of paging −
 Paging reduces external fragmentation, but it still suffers from internal fragmentation.
 Paging is simple to implement and is regarded as an efficient memory-management
technique.
 Because pages and frames are of equal size, swapping becomes very easy.
 The page table requires extra memory space, so paging may not be a good choice for a
system with a small amount of RAM.

***

References:

Abraham Silberschatz, Greg Gagne, and Peter Baer Galvin, "Operating System Concepts, Ninth
Edition "
UNIT-IV

CHAPTER-2
Virtual Memory Management: Introduction, Demand Paging, Copy-on-Write, Page Replacement,
Allocation of Frames, Thrashing.
***
3.2 Background:
• Virtual memory is something that appears to be present but actually is not.
• Users work with virtual memory.
Example: 16 GB of virtual memory backed by 4 GB of main memory.
• It gives an illusion to the user.
• A programmer can write a program that requires more memory than the capacity of
main memory; such a program is executed in main memory using virtual memory
techniques.
• With virtual memory, only part of a program needs to be in memory for execution.
• The logical address space (virtual memory) can therefore be much larger than the physical
address space (main memory).
• Virtual memory allows address spaces to be shared by several processes.
• More programs can run concurrently.
• Less I/O is needed to load or swap processes.
Virtual memory involves the separation of logical memory as perceived by users from physical
memory. This separation allows an extremely large virtual memory to be provided for
programmers when only a smaller physical memory is available (Figure 9.1). Virtual memory
makes the task of programming much easier, because the programmer no longer needs to worry
about the amount of physical memory available; she can concentrate instead on the problem to be
programmed.
The virtual address space of a process refers to the logical (or virtual) view of how a process is
stored in memory. Typically, this view is that a process begins at a certain logical address—say,
address 0—and exists in contiguous memory, as shown in Figure 9.2. In fact, physical memory
may be organized in page frames, and the physical page frames assigned to a process may not
be contiguous. It is up to the memory management unit (MMU) to map logical pages to physical
page frames in memory.

Note in Figure 9.2 that we allow the heap to grow upward in memory as it is used for dynamic
memory allocation. Similarly, we allow for the stack to grow downward in memory through
successive function calls. The large blank space (or hole) between the heap and the stack is part
of the virtual address space but will require actual physical pages only if the heap or stack grows.
Virtual address spaces that include holes are known as sparse address spaces. Using a sparse
address space is beneficial because the holes can be filled as the stack or heap segments grow or
if we wish to dynamically link libraries (or possibly other shared objects) during program
execution.

In addition to separating logical memory from physical memory, virtual memory allows files
and memory to be shared by two or more processes through page sharing (Section 8.5.4). This
leads to the following benefits:
• System libraries can be shared by several processes through mapping of the shared object
into a virtual address space. Although each process considers the libraries to be part of its
virtual address space, the actual pages where the libraries reside in physical memory are
shared by all the processes (Figure 9.3). Typically, a library is mapped read-only into the
space of each process that is linked with it.
• Similarly, virtual memory enables processes to share memory. Recall that two or more processes
can communicate through the use of shared memory. Virtual memory allows one process to create
a region of memory that it can share with another process. Processes sharing this region consider it
part of their virtual address space, yet the actual physical pages of memory are shared, much
as is illustrated in Figure 9.3.

• Pages can be shared during process creation with the fork() system call, thus speeding up
process creation.
We further explore these and other benefits of virtual memory later in this chapter.
First, though, we discuss implementing virtual memory through demand paging.
***

3.2.1 Demand Paging:

Pages are loaded only when they are demanded during program execution. Pages that are
never accessed are thus never loaded into physical memory. This technique is known as demand
paging.

A demand-paging system is similar to a paging system with swapping (Figure 9.4) where
processes reside in secondary memory (usually a disk). When we want to execute a process, we
swap it into memory. Rather than swapping the entire process into memory, though, we use a
lazy swapper. A lazy swapper never swaps a page into memory unless that page will be needed.
In the context of a demand-paging system, use of the term “swapper” is technically incorrect. A
swapper manipulates entire processes, whereas a pager is concerned with the individual pages of
a process. We thus use “pager,” rather than “swapper,” in connection with demand paging.
Basic Concepts:

Accessing a page whose page-table entry is marked valid results in a page hit.

Accessing a page whose entry is marked invalid results in a page fault.

The procedure for handling this page fault is straightforward (Figure 9.6):

1. We check an internal table (usually kept with the process control block) for this process
to determine whether the reference was a valid or an invalid memory access.
2. If the reference was invalid, we terminate the process. If it was valid but we have not yet
brought in that page, we now page it in.
3. We find a free frame (by taking one from the free-frame list, for example).
4. We schedule a disk operation to read the desired page into the newly allocated frame.
5. When the disk read is complete, we modify the internal table kept with the process and
the page table to indicate that the page is now in memory.
6. We restart the instruction that was interrupted by the trap. The process can now access
the page as though it had always been in memory.

The hardware to support demand paging is the same as the hardware for paging and swapping:
 Page table
 Secondary memory
• Page table. This table has the ability to mark an entry invalid through a valid–invalid bit or a
special value of protection bits.
• Secondary memory. This memory holds those pages that are not present in main memory.
The secondary memory is usually a high-speed disk. It is known as the swap device, and the
section of disk used for this purpose is known as swap space.

Effective access time = (1 − p) × ma + p × page-fault time.

To compute the effective access time, we must know how much time is needed to service a page
fault. A page fault causes the following sequence to occur:
1. Trap to the operating system.
2. Save the user registers and process state.
3. Determine that the interrupt was a page fault.
4. Check that the page reference was legal and determine the location of the page on the
disk.
5. Issue a read from the disk to a free frame:
a. Wait in a queue for this device until the read request is serviced.
b. Wait for the device seek and/or latency time.
c. Begin the transfer of the page to a free frame.
6. While waiting, allocate the CPU to some other user (CPU scheduling, optional).
7. Receive an interrupt from the disk I/O subsystem (I/O completed).
8. Save the registers and process state for the other user (if step 6 is executed).
9. Determine that the interrupt was from the disk.
10. Correct the page table and other tables to show that the desired page is now in
memory.
11. Wait for the CPU to be allocated to this process again.
12. Restore the user registers, process state, and new page table, and then resume the
interrupted instruction.
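A worked illustration of the effective-access-time formula above, using assumed values (a 200-nanosecond memory-access time and an 8-millisecond page-fault service time; both figures are illustrative, not taken from this document):

#include <stdio.h>

int main(void)
{
    double ma    = 200.0;        /* memory-access time, in nanoseconds (assumed)   */
    double fault = 8000000.0;    /* page-fault service time: 8 ms, in ns (assumed) */

    /* effective access time = (1 - p) * ma + p * page-fault time */
    for (int k = 0; k <= 2; k++) {
        double p   = k * 0.0005;
        double eat = (1.0 - p) * ma + p * fault;
        printf("p = %.4f  ->  effective access time = %8.1f ns\n", p, eat);
    }
    /* Even p = 0.0005 (one fault in 2,000 accesses) raises the effective
       access time from 200 ns to roughly 4,200 ns.                       */
    return 0;
}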

Advantages
Following are the advantages of Demand Paging −
 Large virtual memory.
 More efficient use of memory.
 There is no limit on degree of multiprogramming.
Disadvantages
 The number of tables and the amount of processor overhead for handling page faults are
greater than in the case of simple paged management techniques.

***
3.2.2 Copy-on-Write:
We illustrated earlier how a process can start quickly by demand-paging in the page containing the
first instruction. However, process creation using the fork() system call may initially bypass the
need for demand paging by using a technique similar to page sharing. This technique provides
rapid process creation and minimizes the number of new pages that must be allocated to the
newly created process.
Recall that the fork() system call creates a child process that is a duplicate of its parent.
Traditionally, fork() worked by creating a copy of the parent’s address space for the child,
duplicating the pages belonging to the parent. However, considering that many child processes
invoke the exec() system call immediately after creation, the copying of the parent’s address
space may be unnecessary. Instead, we can use a technique known as copy-on-write, which
works by allowing the parent and child processes initially to share the same pages. These shared
pages are marked as copy-on-write pages, meaning that if either process writes to a shared page,
a copy of the shared page is created. Copy-on-write is illustrated in Figures 9.7 and 9.8, which
show the contents of the physical memory before and after process 1 modifies page C.

For example, assume that the child process attempts to modify a page containing portions of the
stack, with the pages set to be copy-on-write. The operating system will create a copy of this
page, mapping it to the address space of the child process. The child process will then modify its
copied page and not the page belonging to the parent process. Obviously, when the copy-on-
write technique is used, only the pages that are modified by either process are copied; all
unmodified pages can be shared by the parent and child processes. Note, too, that only pages that
can be modified need be marked as copy-on-write. Pages that cannot be modified (pages
containing executable code) can be shared by the parent and child. Copy-on-write is a
common technique used by several operating systems, including Windows XP, Linux, and
Solaris.

When it is determined that a page is going to be duplicated using copy-on-write, it is important to
note the location from which the free page will be allocated. Many operating systems provide a
pool of free pages for such requests. These free pages are typically allocated when the stack or
heap for a process must expand or when there are copy-on-write pages to be managed.

Operating systems typically allocate these pages using a technique known as zero-fill-on-
demand. Zero-fill-on-demand pages are zeroed out before being allocated, thus erasing their
previous contents.

Several versions of UNIX (including Solaris and Linux) provide a variation of the fork() system
call—vfork() (for virtual memory fork)—that operates differently from fork() with copy-on-
write. With vfork(), the parent process is suspended, and the child process uses the address space
of the parent. Because vfork() does not use copy-on-write, if the child process changes any pages
of the parent’s address space, the altered pages will be visible to the parent once it resumes.
Therefore, vfork() must be used with caution to ensure that the child process does not modify the
address space of the parent. vfork() is intended to be used when the child process calls exec()
immediately after creation. Because no copying of pages takes place, vfork() is an extremely
efficient method of process creation and is sometimes used to implement UNIX command-line
shell interfaces.
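A small sketch showing the behavior that copy-on-write guarantees: after fork(), a write by the child goes to the child's private copy of the page, so the parent's value is unchanged. Whether the kernel achieves this with copy-on-write or by copying up front is invisible to the program.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int value = 42;                 /* lives in a page shared copy-on-write after fork */

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return EXIT_FAILURE;
    }

    if (pid == 0) {                 /* child */
        value = 99;                 /* first write triggers copying of the shared page */
        printf("child : value = %d\n", value);
        _exit(0);
    }

    waitpid(pid, NULL, 0);          /* parent waits, then reads its own page */
    printf("parent: value = %d\n", value);   /* still 42 */
    return 0;
}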
***
3.2.3 Page Replacement:

Whenever a page fault occurs and all frames are already filled with pages, the requested page has
to replace an existing page.
Basic Page Replacement:

1. Find the location of the desired page on the disk.


2. Find a free frame:
a. If there is a free frame, use it.
b. If there is no free frame, use a page-replacement algorithm to select a victim frame.
c. Write the victim frame to the disk; change the page and frame tables accordingly.
3. Read the desired page into the newly freed frame; change the page and frame tables.
4. Continue the user process from where the page fault occurred.

3.2.3.1) FIFO Page Replacement:


 Oldest page in main memory is the one which will be selected for replacement.
 Easy to implement, keep a list, replace pages from the tail and add new pages at the
head.

Number of page faults = 15
Total references in the reference string = 20
Page-fault ratio = 15/20
Belady’s anomaly: for some page-replacement algorithms, the page-fault rate may increase as
the number of allocated frames increases. We would expect that giving more memory to a
process would improve its performance. In some early research, investigators noticed that
this assumption was not always true. Belady’s anomaly was discovered as a result.
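A minimal FIFO simulation, assuming the 20-entry reference string and three frames used in the text's figure (7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1); under those assumptions it reports the 15 faults quoted above.

#include <stdio.h>
#include <stdbool.h>

#define NFRAMES 3

int main(void)
{
    int ref[] = { 7,0,1,2,0,3,0,4,2,3,0,3,2,1,2,0,1,7,0,1 };
    int n = sizeof ref / sizeof ref[0];

    int frame[NFRAMES] = { -1, -1, -1 };   /* -1 means the frame is empty        */
    int next   = 0;                        /* index of the oldest (FIFO) frame   */
    int faults = 0;

    for (int i = 0; i < n; i++) {
        bool hit = false;
        for (int j = 0; j < NFRAMES; j++)
            if (frame[j] == ref[i]) { hit = true; break; }

        if (!hit) {                        /* page fault: replace the oldest page */
            frame[next] = ref[i];
            next = (next + 1) % NFRAMES;
            faults++;
        }
    }
    printf("FIFO page faults: %d out of %d references\n", faults, n);
    return 0;
}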

3.2.3.2) Optimal Page algorithm:


 An optimal page-replacement algorithm has the lowest page-fault rate of all algorithms.
An optimal page-replacement algorithm exists, and has been called OPT or MIN.
 Replace the page that will not be used for the longest period of time. Use the time
when a page is to be used.

Number of page faults = 9

Total references in the reference string = 20

Page-fault ratio = 9/20


3.2.3.3) Least Recently Used (LRU) algorithm:
 Page which has not been used for the longest time in main memory is the one which
will be selected for replacement.
 Easy to implement, keep a list, replace pages by looking back into time.

Number of page faults = 12

Total references in the reference string = 20

Page-fault ratio = 12/20

An LRU page-replacement algorithm may require substantial hardware assistance. The problem
is to determine an order for the frames defined by the time of last use. Two implementations are
feasible:
 Counters. In the simplest case, we associate with each page-table entry a time-of-use
field and add to the CPU a logical clock or counter. The clock is incremented for every
memory reference. Whenever a reference to a page is made, the contents of the clock
register are copied to the time-of-use field in the page-table entry for that page. In this
way, we always have the “time” of the last reference to each page. We replace the page
with the smallest time value. This scheme requires a search of the page table to find the
LRU page and a write to memory (to the time-of-use field in the page table) for each
memory access. The times must also be maintained when page tables are changed (due to
CPU scheduling). Overflow of the clock must be considered.
 Stack. Another approach to implementing LRU replacement is to keep a stack of page
numbers. Whenever a page is referenced, it is removed from the stack and put on the top.
In this way, the most recently used page is always at the top of the stack and the least
recently used page is always at the bottom (Figure 9.16). Because entries must be
removed from the middle of the stack, it is best to implement this approach by using a
doubly linked list with a head pointer and a tail pointer. Removing a page and putting it
on the top of the stack then requires changing six pointers at worst. Each update is a little
more expensive, but there is no search for a replacement; the tail pointer points to
the bottom of the stack, which is the LRU page. This approach is particularly appropriate
for software or microcode implementations of LRU replacement.
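A sketch of the counter approach described above, simulated over the same assumed reference string and three frames (it reports the 12 faults quoted earlier): each frame records the logical-clock value of its last use, and the smallest value identifies the LRU victim.

#include <stdio.h>

#define NFRAMES 3

int main(void)
{
    int ref[] = { 7,0,1,2,0,3,0,4,2,3,0,3,2,1,2,0,1,7,0,1 };
    int n = sizeof ref / sizeof ref[0];

    int frame[NFRAMES]    = { -1, -1, -1 };  /* page held by each frame           */
    int last_use[NFRAMES] = {  0,  0,  0 };  /* logical clock value at last use   */
    int faults = 0;

    for (int t = 1; t <= n; t++) {
        int page = ref[t - 1], hit = -1, victim = 0;

        for (int j = 0; j < NFRAMES; j++) {
            if (frame[j] == page) hit = j;
            if (last_use[j] < last_use[victim]) victim = j;  /* smallest time = LRU */
        }

        if (hit >= 0) {
            last_use[hit] = t;               /* refresh the time-of-use field       */
        } else {
            frame[victim]    = page;         /* replace the least recently used page */
            last_use[victim] = t;
            faults++;
        }
    }
    printf("LRU page faults: %d out of %d references\n", faults, n);
    return 0;
}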

3.2.3.4) LRU-Approximation Page Replacement:

The reference bit for a page is set by the hardware whenever that page is referenced (either a read
or a write to any byte in the page). Reference bits are associated with each entry in the page
table. Initially, all bits are cleared (to 0) by the operating system. As a user process executes, the
bit associated with each page referenced is set (to 1) by the hardware. After some time, we can
determine which pages have been used and which have not been used by examining the reference
bits, although we do not know the order of use. This information is the basis for many page-
replacement algorithms that approximate LRU replacement.

3.2.3.4.1 Additional-Reference-Bits Algorithm:

We can gain additional ordering information by recording the reference bits at regular intervals.
We can keep an 8-bit byte for each page in a table in memory. At regular intervals (say, every
100 milliseconds), a timer interrupt transfers control to the operating system. The operating
system shifts the reference bit for each page into the high-order bit of its 8-bit byte, shifting the
other bits right by 1 bit and discarding the low-order bit. These 8-bit shift registers contain the
history of page use for the last eight time periods. If the shift register contains 00000000, for
example, then the page has not been used for eight time periods. A page that is used at least once
in each period has a shift register value of 11111111. A page with a history register value of
11000100 has been used more recently than one with a value of 01110111.

3.2.3.4.2 Second-Chance Algorithm:

The basic algorithm of second-chance replacement is a FIFO replacement algorithm. When a
page has been selected, however, we inspect its reference bit. If the value is 0, we proceed to
replace this page; but if the reference bit is set to 1, we give the page a second chance and
move on to select the next FIFO page. When a page gets a second chance, its reference bit is
cleared, and its arrival time is reset to the current time. Thus, a page that is given a second
chance will not be replaced until all other pages have been replaced (or given second chances). In
addition, if a page is used often enough to keep its reference bit set, it will never be replaced.

One way to implement the second-chance algorithm (sometimes referred to as the clock
algorithm) is as a circular queue. A pointer (that is, a hand on the clock) indicates which page is
to be replaced next. When a frame is needed, the pointer advances until it finds a page with a 0
reference bit. As it advances, it clears the reference bits (Figure 9.17). Once a victim page is
found, the page is replaced, and the new page is inserted in the circular queue in that position.
Notice that, in the worst case, when all bits are set, the pointer cycles through the whole queue,
giving each page a second chance. It clears all the reference bits before selecting the next page
for replacement. Second-chance replacement degenerates to FIFO replacement if all bits are set.
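A rough sketch of the clock-hand victim search only (the surrounding page-table bookkeeping, and the hardware that sets the reference bits, are assumed to exist elsewhere); the resident pages below are made up for illustration.

#include <stdio.h>

#define NFRAMES 4

static int page_in_frame[NFRAMES] = { 10, 11, 12, 13 };  /* resident pages (made up)   */
static int reference_bit[NFRAMES] = {  1,  0,  1,  1 };  /* set by hardware on access  */
static int hand = 0;                                     /* the clock hand             */

/* Advance the hand until a frame with reference bit 0 is found; that frame is
   the victim. Frames passed over get a second chance (their bit is cleared). */
static int select_victim(void)
{
    for (;;) {
        if (reference_bit[hand] == 0) {
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        reference_bit[hand] = 0;          /* give this page a second chance */
        hand = (hand + 1) % NFRAMES;
    }
}

int main(void)
{
    int v = select_victim();
    printf("victim frame %d (page %d)\n", v, page_in_frame[v]);
    return 0;
}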

3.2.3.4.3 Enhanced Second-Chance Algorithm:

We can enhance the second-chance algorithm by considering the reference bit and the
modify bit as an ordered pair. With these two bits, we have the following four possible classes:
1. (0, 0) neither recently used nor modified—best page to replace
2. (0, 1) not recently used but modified—not quite as good, because the page will need to be
written out before replacement
3. (1, 0) recently used but clean—probably will be used again soon
4. (1, 1) recently used and modified—probably will be used again soon, and the page will
need to be written out to disk before it can be replaced
3.2.3.5) Counting-Based Page Replacement:

we can keep a counter of the number of references that have been made to each page and
develop the following two schemes.
• The least frequently used (LFU) page-replacement algorithm requires that the page with the
smallest count be replaced. The reason for this selection is that an actively used page should have
a large reference count. A problem arises, however, when a page is used heavily during the
initial phase of a process but then is never used again. Since it was used heavily, it has a large
count and remains in memory even though it is no longer needed. One solution is to shift the
counts right by 1 bit at regular intervals, forming an exponentially decaying average usage count.
• The most frequently used (MFU) page-replacement algorithm is based on the argument that
the page with the smallest count was probably just brought in and has yet to be used.

3.2.3.6) Page-Buffering Algorithms:

Systems commonly keep a pool of free frames. When a page fault occurs, a victim frame is
chosen as before. However, the desired page is read into a free frame from the pool before the
victim is written out. This procedure allows the process to restart as soon as possible, without
waiting for the victim page to be written out. When the victim is later written out, its frame is
added to the free-frame pool.

An expansion of this idea is to maintain a list of modified pages. Whenever the paging device is
idle, a modified page is selected and is written to the disk. Its modify bit is then reset. This
scheme increases the probability that a page will be clean when it is selected for replacement and
will not need to be written out.

Another modification is to keep a pool of free frames but to remember which page was in each
frame. Since the frame contents are not modified when a frame is written to the disk, the old
page can be reused directly from the free-frame pool if it is needed before that frame is reused.
No I/O is needed in this case. When a page fault occurs, we first check whether the desired page
is in the free-frame pool. If it is not, we must select a free frame and read the desired page into it.
When the FIFO replacement algorithm mistakenly replaces a page that is still in active use, that
page can be retrieved quickly from the free-frame buffer, and no I/O is necessary. The free-frame
buffer provides protection against the relatively poor, but simple, FIFO replacement algorithm.

***
3.2.4 Allocation of Frames:
How do we allocate the fixed amount of free memory among the various processes? If we have
93 free frames and two processes, how many frames does each process get?

Minimum Number of Frames


One reason for allocating at least a minimum number of frames involves performance.
Obviously, as the number of frames allocated to each process decreases, the page-fault rate
increases, slowing process execution. In addition, remember that, when a page fault occurs
before an executing instruction is complete, the instruction must be restarted. Consequently, we
must have enough frames to hold all the different pages that any single instruction can reference.

Allocation Algorithms
 Equal allocation
 Proportional allocation
The easiest way to split m frames among n processes is to give everyone an equal share, m/n
frames (ignoring frames needed by the operating system for the moment). For instance, if there
are 93 frames and five processes, each process will get 18 frames. The three leftover frames can
be used as a free-frame buffer pool. This scheme is called equal allocation.

An alternative is to recognize that various processes will need differing amounts of memory.
Consider a system with a 1-KB frame size. If a small student process of 10 KB and an interactive
database of 127 KB are the only two processes running in a system with 62 free frames, it does
not make much sense to give each process 31 frames. The student process does not need more
than 10 frames, so the other 21 are, strictly speaking, wasted. To solve this problem, we can use
proportional allocation, in which we allocate available memory to each process according to its
size. Let the size of the virtual memory for process pi be si, and define

S = Σ si

Then, if the total number of available frames is m, we allocate ai frames to process pi, where ai
is approximately

ai = (si / S) × m

Of course, we must adjust each ai to be an integer that is greater than the minimum number of
frames required by the instruction set, with a sum not exceeding m. With proportional allocation,
we would split 62 frames between two processes, one of 10 pages and one of 127 pages, by
allocating 4 frames and 57 frames, respectively, since

10/137 × 62 ≈ 4 and 127/137 × 62 ≈ 57
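A tiny sketch of this computation for the two processes in the example above; the adjustment needed to respect each process's minimum frame count is omitted.

#include <stdio.h>

int main(void)
{
    int size[] = { 10, 127 };        /* si: virtual-memory size of each process */
    int nproc  = 2;
    int m      = 62;                 /* total available frames                  */

    int S = 0;
    for (int i = 0; i < nproc; i++)
        S += size[i];                /* S = sum of all si (here 137)            */

    for (int i = 0; i < nproc; i++) {
        int ai = size[i] * m / S;    /* ai ~ (si / S) * m, truncated to an int  */
        printf("process %d: %d of %d frames\n", i, ai, m);
    }
    /* prints 4 and 57 frames, matching the worked example */
    return 0;
}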
Global versus Local Allocation
Another important factor in the way frames are allocated to the various processes is page
replacement. With multiple processes competing for frames, we can classify page-replacement
algorithms into two broad categories: global replacement and local replacement. Global
replacement allows a process to select a replacement frame from the set of all frames, even if that
frame is currently allocated to some other process; that is, one process can take a frame from
another. Local replacement requires that each process select from only its own set of allocated
frames.

Non-Uniform Memory Access


In systems with multiple CPUs a given CPU can access some sections of main memory faster
than it can access others. These performance differences are caused by how CPUs and memory
are interconnected in the system. Frequently, such a system is made up of several system boards,
each containing multiple CPUs and some memory. The system boards are interconnected in
various ways, ranging from system buses to high-speed network connections like InfiniBand. As
you might expect, the CPUs on a particular board can access the memory on that board with less
delay than they can access memory on other boards in the system. Systems in which memory
access times vary significantly are known collectively as non-uniform memory access (NUMA)
systems, and without exception, they are slower than systems in which memory and CPUs are
located on the same motherboard.
***
3.2.5 Thrashing
If a process does not have the number of frames it needs to support the pages in active use, it will
quickly page-fault. At this point, it must replace some page. However, since all its pages are in
active use, it must replace a page that will be needed again right away. Consequently, it quickly
faults again, and again, and again, replacing pages that it must bring back in immediately.

 Thrashing occurs when a process spends more time on paging or swapping activity
than on its execution.
 During thrashing, the CPU is so busy swapping that it cannot respond to user programs
as much as is required.

This high paging activity is called thrashing. A process is thrashing if it is spending more
time paging than executing.

Cause of Thrashing
Thrashing results in severe performance problems. Consider the following scenario, which is
based on the actual behaviour of early paging systems. The operating system monitors CPU
utilization. If CPU utilization is too low, we increase the degree of multiprogramming by
introducing a new process to the system. A global page-replacement algorithm is used; it
replaces pages without regard to the process to which they belong. Now suppose that a process
enters a new phase in its execution and needs more frames. It starts faulting and taking frames
away from other processes. These processes need those pages, however, and so they also fault,
taking frames from other processes. These faulting processes must use the paging device to
swap pages in and out. As they queue up for the paging device, the ready queue empties. As
processes wait for the paging device, CPU utilization decreases.

The CPU scheduler sees the decreasing CPU utilization and increases the degree of
multiprogramming as a result. The new process tries to get started by taking frames from running
processes, causing more page faults and a longer queue for the paging device. As a result, CPU
utilization drops even further, and the CPU scheduler tries to increase the degree of
multiprogramming even more. Thrashing has occurred, and system throughput plunges. The
page-fault rate increases tremendously. As a result, the effective memory-access time increases.
No work is getting done, because the processes are spending all their time paging.

This phenomenon is illustrated in Figure 9.18, in which CPU utilization is plotted against the
degree of multiprogramming. As the degree of multiprogramming increases, CPU utilization
also increases, although more slowly, until a maximum is reached. If the degree of
multiprogramming is increased even further, thrashing sets in, and CPU utilization drops sharply.
At this point, to increase CPU utilization and stop thrashing, we must decrease the degree of
multiprogramming.

Working-Set Model
As mentioned, the working-set model is based on the assumption of locality. This model uses a
parameter, Δ, to define the working-set window. The idea is to examine the most recent Δ page
references. The set of pages in the most recent Δ page references is the working set (Figure
9.20). If a page is in active use, it will be in the working set. If it is no longer being used, it will
drop from the working set Δ time units after its last reference. Thus, the working set is an
approximation of the program's locality.

For example, given the sequence of memory references shown in Figure 9.20, if Δ = 10 memory
references, then the working set at time t1 is {1, 2, 5, 6, 7}. By time t2, the working set has
changed to {3, 4}. The accuracy of the working set depends on the selection of Δ. If Δ is too
small, it will not encompass the entire locality; if Δ is too large, it may overlap several localities.
In the extreme, if Δ is infinite, the working set is the set of pages touched during the process
execution.

The most important property of the working set, then, is its size. If we compute the working-set
size, WSSi, for each process in the system, we can then consider that

D = Σ WSSi

where D is the total demand for frames. Each process is actively using the pages in its working
set. Thus, process i needs WSSi frames. If the total demand is greater than the total number of
available frames (D > m), thrashing will occur, because some processes will not have enough
frames.
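A small sketch that computes the working set at a given time from a reference trace, using an assumed window Δ = 10; the trace and page numbers below are made up for illustration.

#include <stdio.h>
#include <stdbool.h>

#define DELTA 10   /* working-set window, in page references (assumed) */

/* Print the working set at time t: the distinct pages referenced in the
   last DELTA references of the trace up to and including position t-1. */
static void working_set(const int ref[], int t)
{
    bool seen[32] = { false };               /* assumes page numbers < 32 */
    int wss = 0;

    int start = (t - DELTA > 0) ? t - DELTA : 0;
    printf("working set at t=%d: {", t);
    for (int i = start; i < t; i++) {
        if (!seen[ref[i]]) {
            seen[ref[i]] = true;
            printf(" %d", ref[i]);
            wss++;
        }
    }
    printf(" }  WSS = %d\n", wss);
}

int main(void)
{
    int ref[] = { 1,2,3,4,5,6,7,7,7,7,5,1,6,2,3,4,1,2,3,4 };  /* made-up trace */
    int n = sizeof ref / sizeof ref[0];

    working_set(ref, 10);   /* window covering references 0..9      */
    working_set(ref, n);    /* window covering the last 10 references */
    return 0;
}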
Page-Fault Frequency
The working-set model is successful, and knowledge of the working set can be useful for
prepaging, but it seems a clumsy way to control thrashing.

A strategy that uses the page-fault frequency (PFF) takes a more direct approach.
The specific problem is how to prevent thrashing. Thrashing has a high page-fault rate. Thus, we
want to control the page-fault rate. When it is too high, we know that the process needs more
frames. Conversely, if the page-fault rate is too low, then the process may have too many frames.
We can establish upper and lower bounds on the desired page-fault rate (Figure 9.21). If the actual
page-fault rate exceeds the upper limit, we allocate the process another frame. If the page-fault rate
falls below the lower limit, we remove a frame from the process. Thus, we can directly measure
and control the page-fault rate to prevent thrashing.

As with the working-set strategy, we may have to swap out a process. If the page-fault rate
increases and no free frames are available, we must select some process and swap it out to backing
store. The freed frames are then distributed to processes with high page-fault rates.

***
References:

Abraham Silberschatz, Greg Gagne, and Peter Baer Galvin, "Operating System Concepts, Ninth
Edition "
UNIT-4
CHAPTER-3
Storage Management: Overview of Disk Structure, Disk Attachment, Disk Scheduling,
RAID Structure, Stable-Storage Implementation.

4.3.1 Overview of Disk Structure:

Mass-storage structure refers to the storage of large amounts of data in a persistent and
machine-readable fashion. Mass storage includes devices with removable and non-removable
media. It does not include random-access memory (RAM).

Magnetic Disks

 Traditional magnetic disks have the following basic structure:


o One or more platters in the form of disks covered with magnetic
media. Hard disk platters are made of rigid metal, while "floppy" disks
are made of more flexible plastic.
o Each platter has two working surfaces. Older hard disk drives would
sometimes not use the very top or bottom surface of a stack of platters,
as these surfaces were more susceptible to potential damage.
o Each working surface is divided into a number of concentric rings
called tracks. The collection of all tracks that are the same distance
from the edge of the platter, ( i.e. all tracks immediately above one
another in the following diagram ) is called a cylinder.
o Each track is further divided into sectors, traditionally containing 512
bytes of data each, although some modern disks occasionally use larger
sector sizes. ( Sectors also include a header and a trailer, including
checksum information among other things. Larger sector sizes reduce
the fraction of the disk consumed by headers and trailers, but increase
internal fragmentation and the amount of disk that must be marked bad
in the case of errors. )
o The data on a hard drive is read by read-write heads. The standard
configuration ( shown below ) uses one head per surface, each on a
separate arm, and controlled by a common arm assembly which moves
all heads simultaneously from one cylinder to another. ( Other
configurations, including independent read-write heads, may speed up
disk access, but involve serious technical difficulties. )
o The storage capacity of a traditional disk drive is equal to the number
of heads ( i.e. the number of working surfaces ), times the number of
tracks per surface, times the number of sectors per track, times the
number of bytes per sector. A particular physical block of data is
specified by providing the head-sector-cylinder number at which it is
located.
Figure 10.1 - Moving-head disk mechanism.

 In operation the disk rotates at high speed, such as 7200 rpm ( 120 revolutions
per second. ) The rate at which data can be transferred from the disk to the
computer is composed of several steps:
o The positioning time, also known as the seek time or random access
time, is the time required to move the heads from one cylinder to
another and for the heads to settle down after the move. This is
typically the slowest step in the process and the predominant
bottleneck to overall transfer rates.
o The rotational latency is the amount of time required for the desired
sector to rotate around and come under the read-write head. This can
range anywhere from zero to one full revolution, and on the average
will equal one-half revolution. This is another physical step and is
usually the second slowest step behind seek time. ( For a disk rotating
at 7200 rpm, the average rotational latency would be 1/2 revolution /
120 revolutions per second, or just over 4 milliseconds, a long time by
computer standards.
o The transfer rate, which is the time required to move the data
electronically from the disk to the computer. ( Some authors may also
use the term transfer rate to refer to the overall transfer rate, including
seek time and rotational latency as well as the electronic data transfer
rate. )
 Disk heads "fly" over the surface on a very thin cushion of air. If they should
accidentally contact the disk, then a head crash occurs, which may or may not
permanently damage the disk or even destroy it completely. For this reason it
is normal to park the disk heads when turning a computer off, which means to
move the heads off the disk or to an area of the disk where there is no data
stored.
 Floppy disks are normally removable. Hard drives can also be removable, and
some are even hot-swappable, meaning they can be removed while the
computer is running, and a new hard drive inserted in their place.
 Disk drives are connected to the computer via a cable known as the I/O
Bus. Some of the common interface formats include Enhanced Integrated
Drive Electronics, EIDE; Advanced Technology Attachment, ATA; Serial
ATA, SATA; Universal Serial Bus, USB; Fibre Channel, FC; and Small
Computer Systems Interface, SCSI.
 The host controller is at the computer end of the I/O bus, and the disk
controller is built into the disk itself. The CPU issues commands to the host
controller via I/O ports. Data is transferred between the magnetic surface and
onboard cache by the disk controller, and then the data is transferred from that
cache to the host controller and the motherboard memory at electronic speeds.

Solid-State Disks

 As technologies improve and economics change, old technologies are often
used in different ways. One example of this is the increasing use of solid-
state disks, or SSDs.
 SSDs use memory technology as a small fast hard disk. Specific
implementations may use either flash memory or DRAM chips protected by a
battery to sustain the information through power cycles.
 Because SSDs have no moving parts they are much faster than traditional hard
drives, and certain problems such as the scheduling of disk accesses simply do
not apply.
 However SSDs also have their weaknesses: They are more expensive than
hard drives, generally not as large, and may have shorter life spans.
 SSDs are especially useful as a high-speed cache of hard-disk information that
must be accessed quickly. One example is to store filesystem meta-data, e.g.
directory and inode information, that must be accessed quickly and often.
Another variation is a boot disk containing the OS and some application
executables, but no vital user data. SSDs are also used in laptops to make them
smaller, faster, and lighter.
 Because SSDs are so much faster than traditional hard disks, the throughput of
the bus can become a limiting factor, causing some SSDs to be connected
directly to the system PCI bus for example.

Magnetic Tapes

 Magnetic tapes were once used for common secondary storage before the days
of hard disk drives, but today are used primarily for backups.
 Accessing a particular spot on a magnetic tape can be slow, but once reading
or writing commences, access speeds are comparable to disk drives.
 Capacities of tape drives can range from 20 to 200 GB, and compression can
double that capacity.
Disk Structure:

 The traditional head-sector-cylinder (HSC) numbers are mapped to linear block
addresses by numbering the first sector on the first head on the outermost track as
sector 0. Numbering proceeds with the rest of the sectors on that same track, and then
the rest of the tracks on the same cylinder, before proceeding through the rest of the
cylinders to the center of the disk. In modern practice these linear block addresses are
used in place of the HSC numbers for a variety of reasons:
1. The linear length of tracks near the outer edge of the disk is much longer than
for those tracks located near the center, and therefore it is possible to squeeze
many more sectors onto outer tracks than onto inner ones.
2. All disks have some bad sectors, and therefore disks maintain a few spare
sectors that can be used in place of the bad ones. The mapping of spare sectors
to bad sectors is managed internally by the disk controller.
3. Modern hard drives can have thousands of cylinders, and hundreds of sectors
per track on their outermost tracks. These numbers exceed the range of HSC
numbers for many ( older ) operating systems, and therefore disks can be
configured for any convenient combination of HSC values that falls within the
total number of sectors physically on the drive.
 There is a limit to how closely packed individual bits can be placed on a physical
media, but that limit is growing increasingly more packed as technological advances
are made.
 Modern disks pack many more sectors into outer cylinders than inner ones, using one
of two approaches:
o With Constant Linear Velocity, CLV, the density of bits is uniform from
cylinder to cylinder. Because there are more sectors in outer cylinders, the disk
spins slower when reading those cylinders, causing the rate of bits passing
under the read-write head to remain constant. This is the approach used by
modern CDs and DVDs.
o With Constant Angular Velocity, CAV, the disk rotates at a constant angular
speed, with the bit density decreasing on outer cylinders. ( These disks would
have a constant number of sectors per track on all cylinders. )

***

4.3.2 Disk Attachment:

Disk drives can be attached either directly to a particular host ( a local disk ) or to a network.

Host-Attached Storage

 Local disks are accessed through I/O Ports as described earlier.


 The most common interfaces are IDE or ATA, each of which allows up to two
drives per host controller.
 SATA is similar with simpler cabling.
 High end workstations or other systems in need of larger number of disks
typically use SCSI disks:
o The SCSI standard supports up to 16 targets on each SCSI bus, one of
which is generally the host adapter and the other 15 of which can be
disk or tape drives.
o A SCSI target is usually a single drive, but the standard also supports
up to 8 units within each target. These would generally be used for
accessing individual disks within a RAID array.
o The SCSI standard also supports multiple host adapters in a single
computer, i.e. multiple SCSI busses.
o Modern advancements in SCSI include "fast" and "wide" versions, as
well as SCSI-2.
o SCSI cables may be either 50 or 68 conductors. SCSI devices may be
external as well as internal.
 Fibre Channel ( FC ) is a high-speed serial architecture that can operate over
optical fiber or four-conductor copper wires, and has two variants:
o A large switched fabric having a 24-bit address space. This variant
allows for multiple devices and multiple hosts to interconnect, forming
the basis for the storage-area networks, SANs.
o The arbitrated loop, FC-AL, that can address up to 126 devices
( drives and controllers. )

Network-Attached Storage

 Network-attached storage connects storage devices to computers using a
remote procedure call ( RPC ) interface, typically with something like NFS
filesystem mounts. This is convenient because it gives several computers in a
group common access to, and common naming conventions for, shared storage.
 NAS can be implemented using SCSI cabling; alternatively, iSCSI uses Internet
protocols and standard network connections, allowing long-distance remote
access to shared files.
 NAS allows computers to easily share data storage, but tends to be less
efficient than standard host-attached storage.

Figure 10.2 - Network-attached storage.

Storage-Area Network

 A Storage-Area Network ( SAN ) connects computers and storage devices in a
network, using storage protocols instead of network protocols.
 One advantage of this is that storage access does not tie up regular networking
bandwidth.
 SAN is very flexible and dynamic, allowing hosts and devices to attach and
detach on the fly.
 SAN is also controllable, allowing restricted access to certain hosts and
devices.
Figure 10.3 - Storage-area network.

***

4.3.3 Disk Scheduling:

 As mentioned earlier, disk transfer speeds are limited primarily by seek times
and rotational latency. When multiple requests are to be processed there is also
some inherent delay in waiting for other requests to be processed.
 For a series of disk requests, bandwidth is measured as the total amount of data
transferred divided by the total time from the first request being made to the last
transfer being completed.
 Both bandwidth and access time can be improved by processing requests in a good
order.
 Disk requests include the disk address, memory address, number of sectors to transfer,
and whether the request is for reading or writing.

FCFS Scheduling

 First-Come First-Serve is simple and intrinsically fair, but not very efficient.
Consider in the following sequence the wild swing from cylinder 122 to 14
and then back to 124:

Figure 10.4 - FCFS disk scheduling.


SSTF Scheduling

 Shortest Seek Time First scheduling is more efficient, but may lead to
starvation if a constant stream of requests arrives for the same general area of
the disk.
 SSTF reduces the total head movement to 236 cylinders, down from the 640
required for the same set of requests under FCFS. Note, however, that the distance
could be reduced still further, to 208, by servicing 37 and then 14 first before
processing the rest of the requests ( a small sketch comparing FCFS and SSTF
follows Figure 10.5 ).

Figure 10.5 - SSTF disk scheduling.
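The following minimal sketch totals the head movement under FCFS order and under a greedy
shortest-seek-first order. It assumes the example request queue that Figures 10.4 and 10.5
appear to use ( 98, 183, 37, 122, 14, 124, 65, 67 ) with the head starting at cylinder 53;
if the figures use a different queue, the assumed values here would need to change.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NREQ 8

    /* Assumed example queue; the head starts at cylinder 53. */
    static const int queue[NREQ] = { 98, 183, 37, 122, 14, 124, 65, 67 };

    /* Total head movement when requests are served in arrival ( FCFS ) order. */
    static int fcfs_movement(int head)
    {
        int total = 0;
        for (int i = 0; i < NREQ; i++) {
            total += abs(queue[i] - head);
            head = queue[i];
        }
        return total;
    }

    /* Total head movement when the closest pending request is always served next ( SSTF ). */
    static int sstf_movement(int head)
    {
        int pending[NREQ], total = 0;
        memcpy(pending, queue, sizeof(pending));

        for (int served = 0; served < NREQ; served++) {
            int best = -1;
            for (int i = 0; i < NREQ; i++)      /* find the nearest unserved request */
                if (pending[i] >= 0 &&
                    (best < 0 || abs(pending[i] - head) < abs(pending[best] - head)))
                    best = i;
            total += abs(pending[best] - head);
            head = pending[best];
            pending[best] = -1;                 /* mark as served */
        }
        return total;
    }

    int main(void)
    {
        printf("FCFS total head movement: %d cylinders\n", fcfs_movement(53));  /* 640 */
        printf("SSTF total head movement: %d cylinders\n", sstf_movement(53));  /* 236 */
        return 0;
    }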

SCAN Scheduling

 The SCAN algorithm, also known as the elevator algorithm, moves the head back
and forth from one end of the disk to the other, similarly to an elevator processing
requests in a tall building.

Figure 10.6 - SCAN disk scheduling.

 Under the SCAN algorithm, if a request arrives just ahead of the moving head
then it will be processed right away, but if it arrives just after the head has
passed, then it will have to wait for the head to pass going the other way on the
return trip. This leads to a fairly wide variation in access times, which can be
improved upon.
 Consider, for example, when the head reaches the high end of the disk:
Requests with high cylinder numbers just missed the passing head, which
means they are all fairly recent requests, whereas requests with low numbers
may have been waiting for a much longer time. Making the return scan from
high to low then ends up accessing recent requests first and making older
requests wait that much longer.

C-SCAN Scheduling

 The Circular-SCAN algorithm improves upon SCAN by treating all requests in a
circular queue fashion - once the head reaches the end of the disk, it returns to
the other end without processing any requests, and then starts again from the
beginning of the disk:

Figure 10.7 - C-SCAN disk scheduling.

LOOK Scheduling

 LOOK scheduling improves upon SCAN by looking ahead at the queue of pending
requests, and not moving the head any farther toward the end of the disk than is
necessary. The following diagram illustrates the circular form of LOOK:

Figure 10.8 - C-LOOK disk scheduling.
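The sketch below prints the order in which SCAN and C-LOOK would service a pending queue.
It reuses the same assumed example queue as above ( 98, 183, 37, 122, 14, 124, 65, 67, head
at 53 ) and assumes the head happens to be moving toward higher-numbered cylinders; the
figures may start with the opposite direction, so treat this purely as a model of the two
orderings rather than a reproduction of the diagrams.

    #include <stdio.h>
    #include <stdlib.h>

    #define NREQ 8

    /* Assumed example queue and starting head position. */
    static int queue[NREQ] = { 98, 183, 37, 122, 14, 124, 65, 67 };

    static int ascending(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    /* Print the service order for SCAN and C-LOOK, assuming the head is
     * currently moving toward higher cylinder numbers. */
    static void print_orders(int head)
    {
        qsort(queue, NREQ, sizeof(int), ascending);

        int split = 0;                          /* index of first request >= head */
        while (split < NREQ && queue[split] < head)
            split++;

        printf("SCAN  : ");
        for (int i = split; i < NREQ; i++) printf("%d ", queue[i]);   /* sweep up       */
        for (int i = split - 1; i >= 0; i--) printf("%d ", queue[i]); /* then back down */
        printf("\n");

        printf("C-LOOK: ");
        for (int i = split; i < NREQ; i++) printf("%d ", queue[i]);   /* sweep up                   */
        for (int i = 0; i < split; i++) printf("%d ", queue[i]);      /* jump to lowest, sweep again */
        printf("\n");
    }

    int main(void)
    {
        print_orders(53);
        return 0;
    }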


Selection of a Disk-Scheduling Algorithm

 With very low loads all algorithms are equal, since there will normally only be
one request to process at a time.
 For slightly larger loads, SSTF offers better performance than FCFS, but may
lead to starvation when loads become heavy enough.
 For busier systems, SCAN and LOOK algorithms eliminate starvation
problems.
 The actual optimal algorithm may be something even more complex than
those discussed here, but the incremental improvements are generally not
worth the additional overhead.
 Some improvement to overall filesystem access times can be made by
intelligent placement of directory and/or inode information. If those structures
are placed in the middle of the disk instead of at the beginning of the disk,
then the maximum distance from those structures to data blocks is reduced to
only one-half of the disk size. If those structures can be further distributed and
furthermore have their data blocks stored as close as possible to the
corresponding directory structures, then that reduces still further the overall
time to find the disk block numbers and then access the corresponding data
blocks.
 On modern disks the rotational latency can be almost as significant as the seek
time; however, it is not within the OS's control to account for it, because
modern disks do not reveal their internal sector mapping schemes,
( particularly when bad blocks have been remapped to spare sectors. )
o Some disk manufacturers provide for disk scheduling algorithms
directly on their disk controllers, ( which do know the actual geometry
of the disk as well as any remapping ), so that if a series of requests are
sent from the computer to the controller then those requests can be
processed in an optimal order.
o Unfortunately there are some considerations that the OS must take into
account that are beyond the abilities of the on-board disk-scheduling
algorithms, such as priorities of some requests over others, or the need
to process certain requests in a particular order. For this reason OSes
may elect to spoon-feed requests to the disk controller one at a time in
certain situations.

***

4.3.4 RAID Structure:

 The general idea behind RAID is to employ a group of hard drives together with some
form of duplication, either to increase reliability or to speed up operations, ( or
sometimes both. )
 RAID originally stood for Redundant Array of Inexpensive Disks, and was designed
to use a bunch of cheap small disks in place of one or two larger more expensive ones.
Today RAID systems employ large possibly expensive disks as their components,
switching the definition to Independent disks.
Improvement of Reliability via Redundancy

 The more disks a system has, the greater the likelihood that one of them will
go bad at any given time. Hence adding disks to a system actually decreases the
Mean Time To Failure ( MTTF ) of the system as a whole.
 If, however, the same data were copied onto multiple disks, then the data would
not be lost unless both ( or all ) copies of the data were damaged
simultaneously, which is a MUCH lower probability than for a single disk
going bad. More specifically, the second disk would have to go bad before the
first disk was repaired, which brings the Mean Time To Repair ( MTTR ) into play.
For example, if two disks were involved, each with an MTTF of 100,000 hours and
an MTTR of 10 hours, then the mean time to data loss would be roughly
MTTF^2 / ( 2 x MTTR ) = 100,000^2 / ( 2 x 10 ) = 500 * 10^6 hours, or about
57,000 years!
 This is the basic idea behind disk mirroring, in which a system contains
identical data on two or more disks.
o Note that a power failure during a write operation could cause both
disks to contain corrupt data, if both disks were writing simultaneously
at the time of the power failure. One solution is to write to the two
disks in series, so that they will not both become corrupted ( at least
not in the same way ) by a power failure. An alternate solution
involves non-volatile RAM as a write cache, which is not lost in the
event of a power failure and which is protected by error-correcting
codes.

Improvement in Performance via Parallelism

 There is also a performance benefit to mirroring, particularly with respect to
reads. Since every block of data is duplicated on multiple disks, read
operations can be satisfied from any available copy, and multiple disks can be
reading different data blocks simultaneously in parallel. ( Writes could
possibly be sped up as well through careful scheduling algorithms, but it
would be complicated in practice. )
 Another way of improving disk access time is with striping, which basically
means spreading data out across multiple disks that can be accessed
simultaneously.
o With bit-level striping the bits of each byte are striped across multiple
disks. For example if 8 disks were involved, then each 8-bit byte would
be read in parallel by 8 heads on separate disks. A single disk read
would access 8 * 512 bytes = 4K worth of data in the time normally
required to read 512 bytes. Similarly if 4 disks were involved, then two
bits of each byte could be stored on each disk, for 2K worth of disk
access per read or write operation.
o Block-level striping spreads a filesystem across multiple disks on a
block-by-block basis, so if block N were located on disk 0, then block
N + 1 would be on disk 1, and so on. This is particularly useful when
filesystems are accessed in clusters of physical blocks. Other striping
possibilities exist, with block-level striping being the most common ( a small
mapping sketch follows this list ).
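The sketch below models block-level striping as just described: logical block N lives on
disk ( N mod ndisks ) at offset ( N / ndisks ) within that disk. The number of disks is a
parameter assumed for the example; real arrays may also use larger stripe units than a
single block.

    #include <stdio.h>

    /* Where a logical block lands under simple block-level striping. */
    struct location { int disk; long block; };

    static struct location stripe_map(long logical_block, int ndisks)
    {
        struct location loc;
        loc.disk  = (int)(logical_block % ndisks);  /* round-robin across the disks   */
        loc.block = logical_block / ndisks;         /* offset within the chosen disk  */
        return loc;
    }

    int main(void)
    {
        for (long n = 0; n < 8; n++) {
            struct location loc = stripe_map(n, 4); /* e.g. a 4-disk stripe set */
            printf("logical block %ld -> disk %d, block %ld\n", n, loc.disk, loc.block);
        }
        return 0;
    }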
RAID Levels

 Mirroring provides reliability but is expensive; striping improves
performance, but does not improve reliability. Accordingly there are a number
of different schemes that combine the principles of mirroring and striping in
different ways, in order to balance reliability versus performance versus cost.
These are described by different RAID levels, as follows: ( In the diagram that
follows, "C" indicates a copy, and "P" indicates parity, i.e. checksum bits. )
1. Raid Level 0 - This level includes striping only, with no mirroring.
2. Raid Level 1 - This level includes mirroring only, no striping.
3. Raid Level 2 - This level stores error-correcting codes on additional
disks, allowing for any damaged data to be reconstructed by
subtraction from the remaining undamaged data. Note that this scheme
requires only three extra disks to protect 4 disks worth of data, as
opposed to full mirroring. ( The number of disks required is a function
of the error-correcting algorithms, and the means by which the
particular bad bit(s) is(are) identified. )
4. Raid Level 3 - This level is similar to level 2, except that it takes
advantage of the fact that each disk is still doing its own error-
detection, so that when an error occurs, there is no question about
which disk in the array has the bad data. As a result a single parity bit
is all that is needed to recover the lost data from an array of disks.
Level 3 also includes striping, which improves performance. The
downside with the parity approach is that every disk must take part in
every disk access, and the parity bits must be constantly calculated and
checked, reducing performance. Hardware-level parity calculations and
NVRAM cache can help with both of those issues. In practice level 3 is
greatly preferred over level 2.
5. Raid Level 4 - This level is similar to level 3, employing block-level
striping instead of bit-level striping. The benefits are that multiple
blocks can be read independently, and changes to a block only require
writing two blocks ( data and parity ) rather than involving all disks.
Note that new disks can be added seamlessly to the system provided
they are initialized to all zeros, as this does not affect the parity results.
6. Raid Level 5 - This level is similar to level 4, except the parity blocks
are distributed over all disks, thereby more evenly balancing the load
on the system. For any given block on the disk(s), one of the disks will
hold the parity information for that block and the other N-1 disks will
hold the data. Note that the same disk cannot hold both data and parity
for the same block, as both would be lost in the event of a disk crash.
7. Raid Level 6 - This level extends raid level 5 by storing multiple bits
of error-recovery codes, ( such as the Reed-Solomon codes ), for each
bit position of data, rather than a single parity bit. In the example
shown below 2 bits of ECC are stored for every 4 bits of data, allowing
data recovery in the face of up to two simultaneous disk failures. Note
that this still involves only 50% increase in storage needs, as opposed
to 100% for simple mirroring which could only tolerate a single disk
failure.
Figure 10.11 - RAID levels.
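The following sketch illustrates the parity idea behind RAID levels 4 and 5: the parity
block is the XOR of the data blocks in a stripe, and any one lost block can be rebuilt by
XOR-ing the survivors. The parity-disk rotation shown ( stripe number mod number of disks )
is just one common choice assumed for the example; real RAID 5 implementations use more
elaborate layouts.

    #include <stdio.h>
    #include <string.h>

    #define NDISKS     5        /* assumed: 4 data blocks + 1 parity block per stripe */
    #define BLOCKSIZE  8        /* tiny block size, just for the illustration         */

    /* RAID 5 rotates the parity disk from stripe to stripe; this simple
     * rotation is a modelling assumption, not a standard layout. */
    static int parity_disk(long stripe)
    {
        return (int)(stripe % NDISKS);
    }

    /* Parity block = XOR of the data blocks in the stripe. */
    static void compute_parity(unsigned char data[][BLOCKSIZE], int ndata,
                               unsigned char parity[BLOCKSIZE])
    {
        memset(parity, 0, BLOCKSIZE);
        for (int d = 0; d < ndata; d++)
            for (int i = 0; i < BLOCKSIZE; i++)
                parity[i] ^= data[d][i];
    }

    int main(void)
    {
        unsigned char data[NDISKS - 1][BLOCKSIZE] = {
            "blockA", "blockB", "blockC", "blockD"
        };
        unsigned char parity[BLOCKSIZE];
        compute_parity(data, NDISKS - 1, parity);

        /* Pretend disk 2 failed: rebuild its block from the others plus parity. */
        unsigned char rebuilt[BLOCKSIZE];
        memcpy(rebuilt, parity, BLOCKSIZE);
        for (int d = 0; d < NDISKS - 1; d++)
            if (d != 2)
                for (int i = 0; i < BLOCKSIZE; i++)
                    rebuilt[i] ^= data[d][i];

        printf("stripe 0 parity stored on disk %d\n", parity_disk(0));
        printf("rebuilt block from failed disk: %s\n", (char *)rebuilt);  /* "blockC" */
        return 0;
    }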

 There are also two RAID levels which combine RAID levels 0 and 1 ( striping
and mirroring ) in different combinations, designed to provide both
performance and reliability at the expense of increased cost.
o RAID level 0 + 1 disks are first striped, and then the striped disks
mirrored to another set. This level generally provides better
performance than RAID level 5.
o RAID level 1 + 0 mirrors disks in pairs, and then stripes the mirrored
pairs. The storage capacity, performance, etc. are all the same, but
there is an advantage to this approach in the event of multiple disk
failures, as illustrated below:
 In diagram (a) below, the 8 disks have been divided into two
sets of four, each of which is striped, and then one stripe set is
used to mirror the other set.
 If a single disk fails, it wipes out the entire stripe set,
but the system can keep on functioning using the
remaining set.
 However if a second disk from the other stripe set now
fails, then the entire system is lost, as a result of two
disk failures.
 In diagram (b), the same 8 disks are divided into four sets of
two, each of which is mirrored, and then the file system is
striped across the four sets of mirrored disks.
 If a single disk fails, then that mirror set is reduced to a
single disk, but the system rolls on, and the other three
mirror sets continue mirroring.
 Now if a second disk fails, ( that is not the mirror of the
already failed disk ), then another one of the mirror sets
is reduced to a single disk, but the system can continue
without data loss.
 In fact the second arrangement could handle as many as
four simultaneously failed disks, as long as no two of
them were from the same mirror pair ( see the sketch after
Figure 10.12 ).

Figure 10.12 - RAID 0 + 1 and 1 + 0
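The small sketch below checks whether a given set of failed disks loses data under a RAID
1 + 0 layout like diagram (b): data survives as long as no mirror pair has lost both of its
disks. The 8-disk size and the pairing of disks 2k and 2k+1 are assumptions made for the
example, not a fixed convention.

    #include <stdio.h>
    #include <stdbool.h>

    #define NDISKS 8             /* as in the 8-disk example above                    */
    #define NPAIRS (NDISKS / 2)  /* assumption: disks 2k and 2k+1 mirror each other   */

    /* Data survives under RAID 1+0 as long as no mirror pair lost BOTH disks. */
    static bool survives_1_0(const bool failed[NDISKS])
    {
        for (int p = 0; p < NPAIRS; p++)
            if (failed[2 * p] && failed[2 * p + 1])
                return false;
        return true;
    }

    int main(void)
    {
        bool failed[NDISKS] = { false };

        failed[0] = failed[3] = failed[4] = failed[7] = true;  /* four failures, all different pairs */
        printf("four failures, different pairs: %s\n",
               survives_1_0(failed) ? "data survives" : "data lost");

        failed[1] = true;                                      /* now disks 0 AND 1 -- a whole pair  */
        printf("add failure of disk 1:          %s\n",
               survives_1_0(failed) ? "data survives" : "data lost");
        return 0;
    }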

Selecting a RAID Level

 Trade-offs in selecting the optimal RAID level for a particular application
include cost, volume of data, need for reliability, need for performance, and
rebuild time, the latter of which can affect the likelihood that a second disk
will fail while the first failed disk is being rebuilt.
 Other decisions include how many disks are involved in a RAID set and how
many disks to protect with a single parity bit. More disks in the set increases
performance but increases cost. Protecting more disks per parity bit saves cost,
but increases the likelihood that a second disk will fail before the first bad disk
is repaired.
Extensions

 RAID concepts have been extended to tape drives ( e.g. striping tapes for
faster backups or parity checking tapes for reliability ), and for broadcasting of
data.

Problems with RAID

 RAID protects against physical errors, but not against any number of bugs or
other errors that could write erroneous data.
 ZFS adds an extra level of protection by including data block checksums in all
inodes along with the pointers to the data blocks. If data are mirrored and one
copy has the correct checksum and the other does not, then the data with the
bad checksum will be replaced with a copy of the data with the good
checksum. This increases reliability greatly over RAID alone, at a cost of a
performance hit that is acceptable because ZFS is so fast to begin with.

Figure 10.13 - ZFS checksums all metadata and data.
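The toy sketch below models only the repair decision described above, not ZFS itself: given
two mirrored copies and the expected checksum stored separately in the metadata, return the
copy whose checksum matches and overwrite the bad copy with the good one. The checksum
function here is deliberately trivial; ZFS uses much stronger checksums ( e.g. Fletcher or
SHA-256 ) held in its block pointers.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define BLOCKSIZE 16

    /* Toy checksum for the illustration only. */
    static uint32_t checksum(const unsigned char *block)
    {
        uint32_t sum = 0;
        for (int i = 0; i < BLOCKSIZE; i++)
            sum = sum * 31 + block[i];
        return sum;
    }

    /* Self-healing read for a mirrored pair: return the copy whose checksum
     * matches the expected value, repairing the other copy if it is bad. */
    static const unsigned char *healing_read(unsigned char copy0[BLOCKSIZE],
                                             unsigned char copy1[BLOCKSIZE],
                                             uint32_t expected)
    {
        int ok0 = (checksum(copy0) == expected);
        int ok1 = (checksum(copy1) == expected);

        if (ok0 && !ok1) { memcpy(copy1, copy0, BLOCKSIZE); return copy0; }  /* heal copy 1 */
        if (ok1 && !ok0) { memcpy(copy0, copy1, BLOCKSIZE); return copy1; }  /* heal copy 0 */
        return ok0 ? copy0 : NULL;   /* both good, or both bad ( unrecoverable ) */
    }

    int main(void)
    {
        unsigned char copy0[BLOCKSIZE] = "important data";
        unsigned char copy1[BLOCKSIZE] = "important data";
        uint32_t expected = checksum(copy0);   /* kept in the metadata, apart from the data */

        copy1[0] ^= 0xFF;                      /* silently corrupt one mirror copy */
        const unsigned char *good = healing_read(copy0, copy1, expected);

        printf("read returned: %s\n", (const char *)good);
        printf("copy1 healed:  %s\n", memcmp(copy0, copy1, BLOCKSIZE) == 0 ? "yes" : "no");
        return 0;
    }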

 Another problem with traditional filesystems is that their sizes are fixed and
relatively difficult to change. Where RAID sets are involved it becomes even
harder to adjust filesystem sizes, because a filesystem cannot span multiple
volumes.
 ZFS solves these problems by pooling RAID sets, and by dynamically
allocating space to filesystems as needed. Filesystem sizes can be limited by
quotas, and space can also be reserved to guarantee that a filesystem will be
able to grow later, but these parameters can be changed at any time by the
filesystem's owner. Otherwise filesystems grow and shrink dynamically as
needed.
Figure 10.14 - (a) Traditional volumes and file systems. (b) a ZFS pool and file systems.

***

4.3.5 Stable-Storage Implementation:

 The concept of stable storage involves a storage medium in which data is never lost,
even in the face of equipment failure in the middle of a write operation.
 To implement this requires two ( or more ) copies of the data, with separate failure
modes.
 An attempted disk write results in one of three possible outcomes:
1. The data is successfully and completely written.
2. The data is partially written, but not completely. The last block written may be
garbled.
3. No writing takes place at all.
 Whenever an equipment failure occurs during a write, the system must detect it and
restore itself to a consistent state. To do this requires two physical blocks
for every logical block, and the following procedure:
1. Write the data to the first physical block.
2. After step 1 has completed, write the data to the second physical block.
3. Declare the operation complete only after both physical writes have completed
successfully.
 During recovery the pair of blocks is examined.
o If both blocks are identical and there is no sign of damage, then no further
action is necessary.
o If one block contains a detectable error but the other does not, then the
damaged block is replaced with the good copy. ( This will either undo the
operation or complete the operation, depending on which block is damaged
and which is undamaged. )
o If neither block shows damage but the data in the blocks differ, then replace
the data in the first block with the data in the second block. ( Undo the
operation. )
Because the sequence of operations described above is slow, stable storage usually includes
NVRAM as a cache, and declares a write operation complete once it has been written to the
NVRAM.
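The following is a minimal in-memory model of the write and recovery procedure above. Each
physical block is represented by its contents plus a flag saying whether a read would detect
damage; real implementations work with on-disk blocks protected by error-detecting codes, so
the structure and names here are assumptions for the sketch only.

    #include <stdio.h>
    #include <string.h>
    #include <stdbool.h>

    #define BLOCKSIZE 16

    /* Model of one physical block: its contents plus whether a read would
     * detect damage ( e.g. a garbled block after an interrupted write ). */
    struct block {
        char data[BLOCKSIZE];
        bool damaged;
    };

    /* Stable write: write the first copy, and only after that completes
     * write the second copy, so a crash can garble at most one of them. */
    static void stable_write(struct block *b1, struct block *b2, const char *data)
    {
        strncpy(b1->data, data, BLOCKSIZE - 1);
        b1->data[BLOCKSIZE - 1] = '\0';
        b1->damaged = false;
        /* ... a crash here leaves b2 untouched ... */
        strncpy(b2->data, data, BLOCKSIZE - 1);
        b2->data[BLOCKSIZE - 1] = '\0';
        b2->damaged = false;
    }

    /* Recovery after a crash, following the three rules in the text. */
    static void recover(struct block *b1, struct block *b2)
    {
        if (b1->damaged && !b2->damaged)          *b1 = *b2;  /* replace the damaged copy      */
        else if (b2->damaged && !b1->damaged)     *b2 = *b1;
        else if (!b1->damaged && !b2->damaged &&
                 strcmp(b1->data, b2->data) != 0) *b1 = *b2;  /* differ: keep second ( undo )  */
        /* identical and undamaged: nothing to do */
    }

    int main(void)
    {
        struct block b1 = { "old value", false }, b2 = { "old value", false };

        stable_write(&b1, &b2, "old value");          /* a completed stable write       */

        /* Simulate a crash after copy 1 was rewritten but before copy 2. */
        strncpy(b1.data, "new value", BLOCKSIZE - 1);
        recover(&b1, &b2);
        printf("after recovery both hold: \"%s\" / \"%s\"\n", b1.data, b2.data);
        return 0;
    }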

****
