
DEPARTMENT OF COMPUTER SCIENCE

AND ENGINEERING

COURSE MATERIAL

Faculty Name: G.SREE HARIKA Course: II B.Tech.


Subject with Code: Operating Systems (19A05403T) Semester/Branch: II/CSE
UNIT –IV
PART 1: DEADLOCKS

1) RESOURCES

A resource can be a hardware device (e.g., a tape drive) or a piece of information (e.g., a locked record in
a database).

1.1 Preemptable and Nonpreemptable Resources

Resources come in two types: preemptable and nonpreemptable.


A preemptable resource is one that can be taken away from the process owning it with no ill effects.
Memory is an example of a preemptable resource.

A nonpreemptable resource, in contrast, is one that cannot be taken away from its current owner without
causing the computation to fail. A CD-ROM recorder is an example of a nonpreemptable resource: taking it away in the middle of a burn ruins the disc.

The sequence of events required to use a resource is given below in an abstract form.
1. Request the resource.
2. Use the resource.
3. Release the resource.
If the resource is not available when it is requested, the requesting process is forced to wait. In some
operating systems, the process is automatically blocked when a resource request fails, and awakened when
it becomes available.

1.2 Resource Acquisition

One way of allowing user management of resources is to associate a semaphore with each resource.

Figure 6-1. Using a semaphore to protect resources. (a) One resource. (b) Two resources

Sometimes processes need two or more resources. They can be acquired sequentially, as shown in Fig. 6-1(b). If more than two resources are needed, they are just acquired one after another.
Figure 6-2. (a) Deadlock-free code.

Figure 6-2. (b) Code with a potential deadlock

Now let us consider a situation with two processes, A and B, and two resources. Two scenarios are depicted
in Fig. 6-2. In Fig. 6-2(a), both processes ask for the resources in the same order. In Fig. 6-2(b), they ask for
them in a different order. This difference may seem minor, but it is not.
In Fig. 6-2(a), one of the processes will acquire the first resource before the other one. That process will
then successfully acquire the second resource and do its work. If the other process attempts to acquire
resource 1 before it has been released, the other process will simply block until it becomes available.
In Fig. 6-2(b), the situation is different. It might happen that one of the processes acquires both resources
and effectively blocks out the other process until it is done. However, it might also happen that process A
acquires resource 1 and process B acquires resource 2. Each one will now block when trying to acquire the
other one. Neither process will ever run again. This situation is a deadlock.
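
This behavior can be sketched with POSIX threads and mutexes. The sketch below is only an illustration of Fig. 6-2, with resource_1 and resource_2 standing in for the two semaphores; it is not code from the figure itself.

    #include <pthread.h>

    pthread_mutex_t resource_1 = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t resource_2 = PTHREAD_MUTEX_INITIALIZER;

    /* Deadlock-free: every process locks in the same order (Fig. 6-2(a)). */
    void *process_a(void *arg) {
        pthread_mutex_lock(&resource_1);
        pthread_mutex_lock(&resource_2);
        /* ... use both resources ... */
        pthread_mutex_unlock(&resource_2);
        pthread_mutex_unlock(&resource_1);
        return NULL;
    }

    /* Potential deadlock: B locks in the opposite order (Fig. 6-2(b)).
       If A holds resource_1 while B holds resource_2, each blocks forever
       waiting for the other. */
    void *process_b(void *arg) {
        pthread_mutex_lock(&resource_2);
        pthread_mutex_lock(&resource_1);
        /* ... use both resources ... */
        pthread_mutex_unlock(&resource_1);
        pthread_mutex_unlock(&resource_2);
        return NULL;
    }

Note that the fix is purely a matter of ordering: making process_b take resource_1 first removes the deadlock without any other change.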

2) INTRODUCTION TO DEADLOCKS

Deadlock can be defined formally as follows:


A set of processes is deadlocked if each process in the set is waiting for an event that only another process
in the set can cause.

2.1 Conditions for Resource Deadlocks


Coffman et al. (1971) showed that four conditions must hold for there to be a (resource) deadlock:

1. Mutual exclusion condition. Each resource is either currently assigned to exactly one process or is
available.
2. Hold and wait condition. Processes currently holding resources that were granted earlier can request new
resources.
3. No preemption condition. Resources previously granted cannot be forcibly taken away from a process.
They must be explicitly released by the process holding them.
4. Circular wait condition. There must be a circular chain of two or more processes, each of which is
waiting for a resource held by the next member of the chain.
All four of these conditions must be present for a resource deadlock to occur. If one of them is absent, no
resource deadlock is possible.

3) THE OSTRICH ALGORITHM

The simplest approach is the ostrich algorithm: stick your head in the sand and pretend there is no problem at all.
This is reasonable if deadlocks occur very rarely and the cost of prevention is high.
UNIX and Windows take this approach for some of the more complex resource relationships they manage.
It is a trade-off between convenience (the engineering approach) and correctness (the mathematical approach).

To make this contrast more specific, consider an operating system that blocks the caller when an
open system call on a physical device such as a CD-ROM drive or a printer cannot be carried out because
the device is busy. Typically it is up to the device driver to decide what action to take under such
circumstances. Blocking or returning an error code are two obvious possibilities. If one process successfully
opens the CD-ROM drive and another successfully opens the printer and then each process tries to open the
other one and blocks trying, we have a deadlock. Few current systems will detect this.

4) DEADLOCK DETECTION AND RECOVERY

A second technique is detection and recovery. When this technique is used, the system does not attempt to prevent deadlocks from occurring. Instead, it lets them occur, tries to detect when this happens, and then takes some action to recover after the fact. In this section we will look at some of the ways deadlocks can be detected and some of the ways recovery from them can be handled.

Let us begin with the simplest case: only one resource of each type exists. Such a system might have one
scanner, one CD recorder, one plotter, and one tape drive, but no more than one of each class of
resource.

For such a system, we can construct a resource graph. If this graph contains one or more cycles, a deadlock exists. Any process that is part of a cycle is deadlocked. If no cycles exist, the system is not deadlocked.

As an example of a more complex system than the ones we have looked at so far, consider a system with seven processes, A through G, and six resources, R through W. The state of which resources are currently owned and which ones are currently being requested is as follows:

1. Process A holds R and wants S.


2. Process B holds nothing but wants T.
3. Process C holds nothing but wants S.
4. Process D holds U and wants S and T.
5. Process E holds T and wants V.
6. Process F holds W and wants S.
7. Process G holds V and wants U.

The question is: "Is this system deadlocked, and if so, which processes are involved?”

To answer this question, we can construct the resource graph of Fig. 6-5(a). This graph contains one cycle, which can be seen by visual inspection. The cycle is shown in Fig. 6-5(b). From this cycle, we can see that processes D, E, and G are all deadlocked. Processes A, C, and F are not deadlocked, because S can be allocated to any one of them, which then finishes and returns it. Then the other two can take it in turn and also complete.

Figure 6-5. (a) A resource graph, (b) A cycle extracted from (a).

Below we give a simple algorithm that inspects a graph and terminates either when it has found a cycle or when it has shown that none exists. It uses one dynamic data structure, L, a list of nodes, as well as the list of arcs. During the algorithm, arcs are marked to indicate that they have already been inspected, to prevent repeated inspections.
The algorithm operates by carrying out the following steps as specified:

1. For each node, N in the graph, perform the following five steps with N as the starting node.
2. Initialize L to the empty list, and designate all the arcs as unmarked.
3. Add the current node to the end of L and check to see if the node now appears in L two times. If it does,
the graph contains a cycle (listed in L) and the algorithm terminates.
4. From the given node, see if there are any unmarked outgoing arcs. If so, go to step 5; if not, go to step
6.
5. Pick an unmarked outgoing arc at random and mark it. Then follow it to the new current node and go to
step 3.
6. If this node is the initial node, the graph does not contain any cycles and the algorithm terminates.
Otherwise, we have now reached a dead end. Remove it and go back to the previous node, that is, the
one that was current just before this one, make that one the current node, and go to step 3.

To see how the algorithm works in practice, let us use it on the graph of Fig. 6-5(a). The order of
processing the nodes is arbitrary, so let us just inspect them from left to right, top to bottom, first
running the algorithm starting at R, then successively A, B, C, S, D, T, E, F, and so forth. If we hit a
cycle, the algorithm stops.

We start at R and initialize L to the empty list. Then we add R to the list and move to the only possibility, A, and add it to L, giving L = [R, A]. From A we go to S, giving L = [R, A, S]. S has no outgoing arcs, so it is a dead end, forcing us to backtrack to A. Since A has no unmarked outgoing arcs, we backtrack to R, completing our inspection of R.

Now we restart the algorithm starting at A, resetting L to the empty list. This search, too, quickly stops, so we start again at B. From B we continue to follow outgoing arcs until we get to D, at which time L = [B, T, E, V, G, U, D]. Now we must make a (random) choice. If we pick S we come to a dead end and backtrack to D. The second time we pick T and update L to be [B, T, E, V, G, U, D, T], at which point we discover the cycle and stop the algorithm.
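
The algorithm above can be sketched in C as a recursive depth-first search, where an on_path array plays the role of the list L: a node reached while already on the current path closes a cycle. The graph representation here (a small adjacency matrix) is an assumption for illustration.

    #include <stdbool.h>

    #define MAX_NODES 16            /* processes + resources, assumed size */

    int n_nodes;                            /* nodes actually in use        */
    bool arc[MAX_NODES][MAX_NODES];         /* arc[i][j]: an arc from i to j */

    /* Steps 3-6: extend the path; back out of dead ends. */
    static bool dfs(int node, bool on_path[]) {
        if (on_path[node])                  /* node appears twice: cycle    */
            return true;
        on_path[node] = true;               /* add node to the end of L     */
        for (int next = 0; next < n_nodes; next++)
            if (arc[node][next] && dfs(next, on_path))
                return true;
        on_path[node] = false;              /* dead end: backtrack          */
        return false;
    }

    /* Step 1: try every node in the graph as the starting node. */
    bool graph_has_cycle(void) {
        for (int start = 0; start < n_nodes; start++) {
            bool on_path[MAX_NODES] = { false };   /* step 2: L empty       */
            if (dfs(start, on_path))
                return true;
        }
        return false;
    }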

Deadlock Detection with Multiple Resources of Each Type:

When multiple copies of some of the resources exist, a different approach is needed to detect deadlocks. We now present a matrix-based algorithm for detecting deadlock among n processes, P1 through Pn. The algorithm uses an existing resource vector E, which gives the total number of instances of each resource in existence. For example, if class 1 is tape drives, then E1 = 2 means the system has two tape drives.

Figure 6-6. The four data structures needed by the deadlock detection algorithm
The deadlock detection algorithm can now be given as follows.
1. Look for an unmarked process, Pi, for which the i-th row of R (its request row) is less than or equal to A.
2. If such a process is found, add the i-th row of C to A, mark the process, and go back to step 1.
3. If no such process exists, the algorithm terminates.

When the algorithm finishes, all the unmarked processes, if any, are deadlocked.

As an example of how the deadlock detection algorithm works, consider Fig. 6-7. Here we have three processes and four resource classes, which we have arbitrarily labeled tape drives, plotters, scanners, and CD-ROM drives. Process 1 has one scanner. Process 2 has two tape drives and a CD-ROM drive. Process 3 has a plotter and two scanners. Each process needs additional resources, as shown by the R matrix.

Figure 6-7. An example for the deadlock detection algorithm

To run the deadlock detection algorithm, we look for a process whose resource request can be satisfied.
The first one cannot be satisfied because there is no CD-ROM drive available. The second cannot
be satisfied either, because there is no scanner free. Fortunately, the third one can be satisfied, so
process 3 runs
and eventually returns all its resources, giving
A = (2 2 2 0)
At this point process 2 can run and return its resources, giving
A = (4 2 2 1)
Now the remaining process can run. There is no deadlock in the system.
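
A direct C rendering of the three steps, initialized with the matrices described for Fig. 6-7 (rows are processes 1 to 3; columns are tape drives, plotters, scanners, and CD-ROM drives). The variable names follow the data structures of Fig. 6-6.

    #include <stdbool.h>

    #define N_PROC 3
    #define N_RES  4

    int C[N_PROC][N_RES] = {{0,0,1,0}, {2,0,0,1}, {0,1,2,0}}; /* allocation */
    int R[N_PROC][N_RES] = {{2,0,0,1}, {1,0,1,0}, {2,1,0,0}}; /* requests   */
    int A[N_RES]         = {2,1,0,0};                         /* available  */

    bool deadlocked[N_PROC];

    void detect(void) {
        bool marked[N_PROC] = { false };
        bool progress = true;
        while (progress) {
            progress = false;
            for (int i = 0; i < N_PROC; i++) {
                if (marked[i]) continue;
                bool can_run = true;            /* step 1: row i of R <= A? */
                for (int j = 0; j < N_RES; j++)
                    if (R[i][j] > A[j]) { can_run = false; break; }
                if (can_run) {                  /* step 2: it finishes and  */
                    for (int j = 0; j < N_RES; j++)
                        A[j] += C[i][j];        /* returns its resources    */
                    marked[i] = true;
                    progress = true;
                }
            }
        }                                       /* step 3: nothing runnable */
        for (int i = 0; i < N_PROC; i++)
            deadlocked[i] = !marked[i];         /* unmarked => deadlocked   */
    }

Running detect() on these values marks process 3 first (A becomes 2 2 2 0), then process 2 (A becomes 4 2 2 1), then process 1, so no process is deadlocked, exactly as in the text.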

Recovery from Deadlock


Suppose that our deadlock detection algorithm has succeeded and detected a deadlock. What next?
Some way is needed to recover and get the system going again. In this section we will discuss
various ways of recovering from deadlock.
1. Recovery through Preemption
In some cases it may be possible to temporarily take a resource away from its current owner and
give it to another process.
For example, to take a laser printer away from its owner, the operator can collect all the sheets
already printed and put them in a pile. Then the process can be suspended (marked as not
runnable). At this point the printer can be assigned to another process. When that process finishes,
the pile of printed sheets can be put back in the printer's output tray and the original process
restarted.
The ability to take a resource away from a process, have another process use it, and then give it
back without the process noticing it is highly dependent on the nature of the resource. Recovering
this way is frequently difficult or impossible. Choosing the process to suspend depends largely on
which ones have resources that can easily be taken back.

2. Recovery through Rollback


If the system designers and machine operators know that deadlocks are likely, they can arrange to
have processes checkpointed periodically. Checkpointing a process means that its state is written
to a file so that it can be restarted later. The checkpoint contains not only the memory image, but
also the resource state.
When a deadlock is detected, it is easy to see which resources are needed. To do the recovery, a
process that owns a needed resource is rolled back to a point in time before it acquired that
resource by starting one of its earlier checkpoints. In effect, the process is reset to an earlier
moment when it did not have the resource, which is now assigned to one of the deadlocked
processes. If the restarted process tries to acquire the resource again, it will have to wait until it
becomes available.

3.Recovery through Killing Processes


The crudest but simplest way to break a deadlock is to kill one or more processes. One possibility
is to kill a process in the cycle. With a little luck, the other processes will be able to continue. If
this does not help, it can be repeated until the cycle is broken.

5) DEADLOCK AVOIDANCE

1. Resource Trajectories
2. Safe and unsafe states
3. The Banker’s algorithm for single Resource
4. The Banker’s algorithm for multiple resource

Resource Trajectories:
Processes A and B
•Resources: printer and plotter
•A needs printer from I1 to I3
•A needs plotter from I2 to I4
•B needs plotter from I5 to I7
•B needs printer from I6 to I8

The horizontal (vertical) axis represents the number of instructions executed by process A (B).
Every point in the diagram represents a joint state of the two processes.
If the system ever enters the box bounded by I1 and I2 on the sides and by I5 and I6 on the top and bottom, it will eventually deadlock when it gets to the intersection of I2 and I6.
At this point, A is requesting the plotter and B is requesting the printer, and both are already assigned.
The entire box is unsafe and must not be entered.
At point t, B is requesting a resource, and the system must decide whether to grant it or not.
If the grant is made, the system will enter an unsafe region and eventually deadlock; the only safe thing to do at point t is to run process A until it gets to I4. Beyond that, any trajectory will do.
Figure 3.4: Two process resource trajectories.

Safe and unsafe states:

C: Current Allocation Matrix


A: Resources Available
E: Resources in Existence
Adding up all the instances of a resource that have been allocated, plus all the instances that are available, gives the number of instances of that resource class that exist.
At any instant of time, there is a current state consisting of E, A, C, and R (the Request Matrix).
A state is said to be safe if it is not deadlocked and there is some scheduling order in which every process can run to completion, even if all of them suddenly request their maximum number of resources immediately.
In the example of Fig. 3.5, a total of 10 instances of the resource exist, so with 7 resources already allocated, there are 3 still free.
The upper state of Fig. 3.5 is safe, because there exists a sequence of allocations (the scheduler runs B first) that allows all processes to complete; by careful scheduling, deadlock can be avoided.
The lower state of Fig. 3.5 is not safe: if the scheduler runs A and A gets another resource, no sequence guarantees completion.
An unsafe state is not necessarily a deadlocked state.
The difference between a safe state and an unsafe state is that from a safe state the system can guarantee that all processes will finish; from an unsafe state, no such guarantee can be given.

Figure 3.5: Demonstration that the upper state is safe and the lower state is not.

The Banker’s algorithm for single Resource:

First let us consider the situation where there is one resource type. Think of it as units of money (1K dollars): a banker (the OS) has a certain number of units in his bank, and a number of customers can each borrow a certain number of units from the bank and later pay the loan back (release the resources). The customers have credit lines that cannot exceed the number of units initially in the bank, and once a customer has borrowed his maximum number of units, he will pay the whole loan back.

The banker will grant a loan request only if it does not lead to an unsafe state. E.g., the banker initially has 10K, and four customers A, B, C, and D have credit lines of 6K, 5K, 4K, and 7K respectively. The state when no loans have been made is then:

HAS MAX Free:10


A 0 6
B 0 5
C 0 4
D 0 7
Say at some stage the loan situation is

HAS MAX Free:2


A 1 6
B 1 5
C 2 4
D 4 7

Suppose B borrows one more unit:

HAS MAX Free:1


A 1 6
B 2 5
C 2 4
D 4 7

This is unsafe: if ALL customers were to ask for their MAXIMUM remaining credit, NONE could be satisfied, and we would have deadlock. So, in this case the banker will not grant B the loan; i.e., we go back to

HAS MAX Free:2


A 1 6
B 1 5
C 2 4
D 4 7

This state is safe because with 2K left, C can borrow her max remaining credit:

HAS MAX Free:0


A 1 6
B 1 5
C 4 4
D 4 7

C can then finish and release her 4 units:

HAS MAX Free:4


A 1 6
B 1 5
D 4 7

The remaining state is safe. Why?

D can borrow its remaining 3 units

HAS MAX Free:1


A 1 6
B 1 5
D 7 7

and terminate

HAS MAX Free:8


A 1 6
B 1 5

Now B can borrow its remaining 4 units


HAS MAX Free:4
A 1 6
B 5 5
and terminate

HAS MAX Free:9


A 1 6

and now A can go

HAS MAX Free:4

A 6 6
and terminate

HAS MAX Free:10

So a safe state is one from which there exists a sequence in which ALL processes can get their maximum required resources (one at a time), finish, and release all their resources.
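
The single-resource banker can be sketched in C as follows. The arrays mirror the tables above (A=1, B=1, C=2, D=4, Free=2); request() is a hypothetical entry point, not a real OS call.

    #include <stdbool.h>

    #define N_CUST 4

    int has[N_CUST]  = {1, 1, 2, 4};    /* current loans (A, B, C, D)  */
    int maxc[N_CUST] = {6, 5, 4, 7};    /* credit lines                */
    int free_units   = 2;

    /* Safe: some order exists in which every customer can borrow up to
       the maximum, finish, and repay the whole loan. */
    static bool is_safe(int loan[], int avail) {
        bool done[N_CUST] = { false };
        for (int finished = 0; finished < N_CUST; finished++) {
            int i;
            for (i = 0; i < N_CUST; i++)
                if (!done[i] && maxc[i] - loan[i] <= avail)
                    break;
            if (i == N_CUST) return false;  /* nobody can finish: unsafe */
            avail += loan[i];               /* i borrows the rest, finishes,
                                               and repays everything      */
            done[i] = true;
        }
        return true;
    }

    /* Grant a loan only if the resulting state is still safe. */
    bool request(int i, int amount) {
        if (amount > free_units || has[i] + amount > maxc[i])
            return false;
        has[i] += amount;                   /* "pencil in" the loan      */
        if (!is_safe(has, free_units - amount)) {
            has[i] -= amount;               /* unsafe: roll it back      */
            return false;
        }
        free_units -= amount;
        return true;
    }

With this state, request(1, 1) (B asking for one more unit) is refused, reproducing the unsafe case shown above.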

The Banker's algorithm for multiple resource types:


The banker's algorithm for multiple resource types extends the method described above. Say we have five processes, 6 tape drives (Tp), 3 plotters (Pl), 4 printers (Pr), and 2 CD players (CD). At some stage the state is:

            HAS            STILL NEEDS
         Tp Pl Pr CD      Tp Pl Pr CD
    A     3  0  1  1       1  1  0  0
    B     0  1  0  0       0  1  1  2
    C     1  1  1  0       3  1  0  0
    D     1  1  0  1       0  0  1  0
    E     0  0  0  0       2  1  1  0

    E = 6 3 4 2   (Existing resources)
    P = 5 3 2 2   (Possessed resources)
    A = 1 0 2 0   (Available resources)

The column sums of the HAS matrix equal the vector P. Also, E = P + A, or A = E - P.

Checking whether the state is safe:

1. Find a process (row) R whose unmet resource needs are ALL less than or equal to A, i.e., a process that can be granted all its unmet resources. If no such row exists, the state is unsafe.
2. Assume process R takes all the resources it still needs and finishes. Mark it as finished and release all its resources back to A.
3. Repeat steps 1 and 2 until either no unfinished process qualifies, in which case the state is unsafe and there is a potential for deadlock, or all processes finish, in which case the original state was safe.
Applying this check to the state above: D is the only process whose STILL NEEDS row fits entirely within A = 1 0 2 0, so grant D its one remaining printer:

            HAS            STILL NEEDS
         Tp Pl Pr CD      Tp Pl Pr CD
    A     3  0  1  1       1  1  0  0
    B     0  1  0  0       0  1  1  2
    C     1  1  1  0       3  1  0  0
    D     1  1  1  1       0  0  0  0
    E     0  0  0  0       2  1  1  0

    E = 6 3 4 2,  P = 5 3 3 2,  A = 1 0 1 0

This is safe, as D can now finish. D finishes and returns all its resources:

            HAS            STILL NEEDS
         Tp Pl Pr CD      Tp Pl Pr CD
    A     3  0  1  1       1  1  0  0
    B     0  1  0  0       0  1  1  2
    C     1  1  1  0       3  1  0  0
    E     0  0  0  0       2  1  1  0

    E = 6 3 4 2,  P = 4 2 2 1,  A = 2 1 2 1

Now E can go: its remaining need (2 1 1 0) fits within A. Granting it gives:

            HAS            STILL NEEDS
         Tp Pl Pr CD      Tp Pl Pr CD
    A     3  0  1  1       1  1  0  0
    B     0  1  0  0       0  1  1  2
    C     1  1  1  0       3  1  0  0
    E     2  1  1  0       0  0  0  0

    E = 6 3 4 2,  P = 6 3 3 1,  A = 0 0 1 1

This is safe, as E can now finish. E finishes and releases its resources:

            HAS            STILL NEEDS
         Tp Pl Pr CD      Tp Pl Pr CD
    A     3  0  1  1       1  1  0  0
    B     0  1  0  0       0  1  1  2
    C     1  1  1  0       3  1  0  0

    E = 6 3 4 2,  P = 4 2 2 1,  A = 2 1 2 1

Now A can go.

Granting A its remaining need (1 1 0 0) gives:

            HAS            STILL NEEDS
         Tp Pl Pr CD      Tp Pl Pr CD
    A     4  1  1  1       0  0  0  0
    B     0  1  0  0       0  1  1  2
    C     1  1  1  0       3  1  0  0

    E = 6 3 4 2,  P = 5 3 3 1,  A = 1 0 2 1

This is safe, as A can now finish.

A finishes and releases its resources:

            HAS            STILL NEEDS
         Tp Pl Pr CD      Tp Pl Pr CD
    B     0  1  0  0       0  1  1  2
    C     1  1  1  0       3  1  0  0

    E = 6 3 4 2,  P = 1 2 1 0,  A = 5 1 3 2

Now B can go.

Granting B its remaining need (0 1 1 2) gives:

            HAS            STILL NEEDS
         Tp Pl Pr CD      Tp Pl Pr CD
    B     0  2  1  2       0  0  0  0
    C     1  1  1  0       3  1  0  0

    E = 6 3 4 2,  P = 1 3 2 2,  A = 5 0 2 0

This is safe, as B can now finish.

B finishes and releases its resources:

            HAS            STILL NEEDS
         Tp Pl Pr CD      Tp Pl Pr CD
    C     1  1  1  0       3  1  0  0

    E = 6 3 4 2,  P = 1 1 1 0,  A = 5 2 3 2

Now C can go.

Granting C its remaining need (3 1 0 0) gives:

            HAS            STILL NEEDS
         Tp Pl Pr CD      Tp Pl Pr CD
    C     4  2  1  0       0  0  0  0

    E = 6 3 4 2,  P = 4 2 1 0,  A = 2 1 3 2

This is safe, as C can now finish.

C finishes, leaving:

    E = 6 3 4 2,  P = 0 0 0 0,  A = 6 3 4 2

All processes could finish, so the original state was safe.

So the banker's algorithm, upon receiving a request, "pencils" it into the state and checks whether the resulting state is still safe. If it is, the request is granted; if not, the request is postponed.
DEADLOCK PREVENTION:

 Attacking the Mutual Exclusion condition


 Attacking the Hold and wait condition
 Attacking the No preemption condition
 Attacking the Circular wait condition

Attacking the Mutual Exclusion condition


Ensure that no resource is ever assigned exclusively to a single process, e.g., by spooling everything.
• Drawback: not all resources can be spooled.
Attacking the Hold and wait condition
Require each process to request all its resources before starting.
Problem: processes may not know in advance how many resources they will need; also, resources are held even when not in use, so this is not an optimal way of using them.
Attacking the No preemption condition
Forcibly take away the resource. Not realistic.
Attacking the Circular wait condition
Solution 1: A process is entitled to only a single resource at any time.
• Solution 2: Global numbering of all resources:
• Give a unique number to each resource; all requests must be made in numerical order.
Figure 6-14. Summary of approaches to deadlock prevention.

PART 2: FILE SYSTEMS

1) FILES:

• Many important applications need to store large amounts of information, kept in files.


• The information must survive the termination of the process using it.
• Multiple processes must be able to access the information concurrently.
• Disks are used to store files
• Information is stored in blocks on the disks
• Can read and write blocks
• Use file system as an abstraction to deal with accessing the information kept in blocks on a disk
• Files are created by a process
• Thousands of them on a disk
• Managed by the OS.
• OS structures them, names them, protects them
• Two ways of looking at file system
• User-how do we name a file, protect it, organize the files
• Implementation-how are they organized on a disk
• Start with user, then go to implementer

• The user point of view


Naming
Structure
Directories

File Naming:

 One to 8 letters are supported in all current OSs
 Unix and MS-DOS (FAT-16) file systems are discussed here
 FAT (16 and 32) were used in the first Windows systems
 The latest Windows systems use NTFS, the native Windows file system
 All OSs use a suffix (extension) as part of the name
 Unix does not always enforce a meaning for the suffixes
 DOS does enforce a meaning

Suffix Examples
File Structure

 Byte sequences
 Maximum flexibility-can put anything in
 Unix and Windows use this approach
 Fixed length records (card images in the old days)
 Tree of records- uses key field to find records in the tree

Three kinds of files. (a) Byte sequence.


(b) Record sequence. (c) Tree.

File Types

• Regular- contains user information


• Directories
• Character special files- model serial (e.g. printers) I/O devices
Block special files-model disks
Regular Files
• ASCII or binary
• ASCII
• Printable
• Can use pipes to connect programs if they produce/consume ASCII
Binary File Types
• Two Unix examples
• Executable (magic field identifies file as being executable)
• Archive-compiled, not linked library procedures
• Every OS must recognize its own executable
a) An executable file (a magic number identifies it as an executable file)
b) An archive: library procedures compiled but not linked

File Access
• Sequential access- read from the beginning, can’t skip around
• Corresponds to magnetic tape
• Random access- start where you want to start
• Came into play with disks
• Necessary for many applications, e.g. airline reservation system

File Attributes
File operations
• Create -with no data, sets some attributes
• Delete-to free disk space
• Open- after create, gets attributes and disk addresses into main memory
• Close- frees table space used by attributes and addresses
• Read-usually from current pointer position. Need to specify buffer into which data is placed
• Write-usually to current position
• Append- at the end of the file
• Seek-puts file pointer at specific place in file. Read or write from that position on
• Get Attributes-e.g. make needs most recent modification times to arrange for group compilation
• Set Attributes-e.g. protection attributes
• Rename

How can system calls be used?


An example-copyfile abc xyz

• Copies file abc to xyz


• If xyz exists it is over-written
• If it does not exist, it is created
• Uses system calls (read, write)
• Reads and writes in 4K chunks
• Read (system call) into a buffer
• Write (system call) from buffer to output file
Fig: copyfile abc xyz
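
A sketch of the copy program in C, using only system calls (open, creat, read, write, close) and a 4-KB buffer, as described above; error handling is kept minimal.

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BUF_SIZE 4096                 /* copy in 4-KB chunks */

    int main(int argc, char *argv[]) {
        char buffer[BUF_SIZE];
        if (argc != 3)
            exit(1);                      /* usage: copyfile abc xyz */
        int in_fd  = open(argv[1], O_RDONLY);
        int out_fd = creat(argv[2], 0644);   /* overwritten if it exists */
        if (in_fd < 0 || out_fd < 0)
            exit(2);
        while (1) {
            ssize_t n = read(in_fd, buffer, BUF_SIZE);  /* read a chunk  */
            if (n <= 0)
                break;                    /* 0 = end of file, < 0 = error */
            if (write(out_fd, buffer, n) != n)          /* write it out  */
                exit(3);
        }
        close(in_fd);
        close(out_fd);
        return 0;
    }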
Memory-Mapped Files:

(a) Segmented process before mapping files into its address space.
(b) Process after mapping existing file abc into one segment and creating a new segment for xyz.

System calls:

MAP: add a file to the process's virtual address space.

UNMAP: remove a file from the process's virtual address space.
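
On POSIX systems the MAP/UNMAP idea corresponds to the mmap and munmap calls. A minimal sketch, with error checks omitted for brevity:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    char first_byte_of(const char *path) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);                    /* get the file size          */
        /* MAP: the file now appears as ordinary bytes in memory */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        char c = p[0];                     /* read the file with a load  */
        munmap(p, st.st_size);             /* UNMAP: remove the mapping  */
        close(fd);
        return c;
    }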

2) DIRECTORIES

A file system may contain millions of files; at that scale it becomes very hard to manage them. To manage these files, they are grouped, and each group is kept in a directory. A directory structure provides a mechanism for organizing many files in the file system.

File systems normally have directories or folders.

Single-Level Directory Systems:


The simplest form of directory system is having one directory containing all the files. Sometimes it is
called the root directory.
The advantages of this scheme are its simplicity and the ability to locate files quickly: there is only one place to look, after all. It is often used on simple embedded devices such as telephones, digital cameras, and some portable music players.
Single-Level Directory Systems:

A single level directory system contains 4 files owned by 3 different people, A, B, and C

Hierarchical Directory System:

With this approach, there can be as many directories as are needed to group the files in natural ways.
Furthermore, if multiple users share a common file server, as is the case on many company networks, each user
can have a private root directory for his or her own hierarchy. This approach is shown in Fig. 4-7. Here, the
directories A, B, and C contained in the root directory each belong to a different user, two of whom have created
subdirectories for projects they are working on.

Hierarchical Directory System:

Path names
 Absolute path name: /usr/carl/cs310/midterm/answers. Starts at the root directory.
 Relative path name: cs310/midterm/answers. Interpreted relative to the current (working) directory.
Fig: UNIX Path

Operations on the DIRECTORIES :

• Create: creates a directory
• Delete: a directory has to be empty to be deleted
• Opendir: must be done before any operations on a directory
• Closedir
• Readdir: returns the next entry in an open directory
• Rename
• Link: links a file into another directory
• Unlink: gets rid of a directory entry

System calls for DIRECTORIES:

Readdir-reads next entry in open directory


Rename
Link-links file to path. File can appear in multiple directories!
Unlink-what it sounds like. Only unlinks from pathname specified in call
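
A minimal sketch of the directory calls on a POSIX system, listing every entry in a directory:

    #include <dirent.h>
    #include <stdio.h>

    void list_dir(const char *path) {
        DIR *d = opendir(path);            /* must be done first          */
        if (d == NULL)
            return;
        struct dirent *e;
        while ((e = readdir(d)) != NULL)   /* next entry, or NULL at end  */
            printf("%s\n", e->d_name);
        closedir(d);
    }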

3) FILE SYSTEM IMPLEMENTATION

 File system layout


 Implementing files
 Implementing directories
 Shared files
 Log structured file systems
 Journaling file system
 Virtual file system
• Files stored on disks. Disks broken up into one or more partitions, with separate fs on each partition
• Sector 0 of disk is the Master Boot Record
• Used to boot the computer
• End of MBR has partition table. Has starting and ending addresses of each partition.
• One of the partitions is marked active in the master boot table
• Boot computer => BIOS reads/executes MBR
• MBR finds active partition and reads in first block (boot block)
• Program in boot block locates the OS for that partition and reads it in
• All partitions start with a boot block

A Possible File System Layout

• Superblock contains info about the fs (e.g. type of fs, number of blocks, …)
• i-nodes contain info about files

Implementing files:
• Most important implementation issue
• Methods
• Contiguous allocation
• Linked list allocation
• Linked list using table
• I-nodes
Contiguous Allocation

a) Contiguous allocation of disk space for 7 files.


(b) The state of the disk after files D and F have been removed
The good:
• Easy to implement
• Read performance is great: only one seek is needed to locate the first block of the file; the rest is easy
The bad:
• The disk becomes fragmented over time
• CD-ROMs use contiguous allocation because the file system size is known in advance
• DVDs are stored in a few consecutive 1-GB files because the DVD standard only allows a 1-GB file max

Linked List Allocation

Storing a file as a linked list of disk blocks.

The good
• Gets rid of fragmentation
The bad
• Random access is slow. Need to chase pointers to get to a block

Linked lists using a table in memory


• Put pointers in table in memory
• File Allocation Table (FAT)
• Windows

The Solution-Linked List Allocation Using a Table in Memory


Figure 4-12. Linked list allocation using a file allocation table in main memory

• The bad: the table becomes really big
• E.g., a 200-GB disk with 1-KB blocks needs a table of roughly 600 MB
• The table size grows linearly with the disk size

I-nodes
 Keep the data structure in memory only for active (open) files
 The data structure lists the disk addresses of the blocks and the attributes of the file
 k active files, at most n blocks per file => at most k*n block addresses
 This solves the growth problem
 How big is n? Solution: the last entry in the table points to a disk block which contains pointers to further disk blocks

Figure 4-13. An example i-node.
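
A sketch of the i-node idea in C: a fixed set of direct block addresses, plus a last entry that points to a block of further addresses. The sizes, the in-memory "disk", and read_indirect are illustrative assumptions, not values from the figure.

    #define N_DIRECT     10    /* direct disk addresses in the i-node   */
    #define PTRS_PER_BLK 256   /* addresses held by one indirect block  */

    struct inode {
        /* attributes (owner, size, times, ...) would go here */
        unsigned direct[N_DIRECT];   /* addresses of the first blocks   */
        unsigned single_indirect;    /* block holding further addresses */
    };

    unsigned disk[1 << 20];          /* simulated disk contents         */

    static unsigned read_indirect(unsigned blk, unsigned index) {
        return disk[blk * PTRS_PER_BLK + index];   /* simulated read    */
    }

    /* Map a block number within the file to a disk address. */
    unsigned block_addr(const struct inode *ip, unsigned file_blk) {
        if (file_blk < N_DIRECT)
            return ip->direct[file_blk];           /* the common case   */
        return read_indirect(ip->single_indirect, file_blk - N_DIRECT);
    }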


Implementing Directories
 To open a file, the path name is used to locate its directory entry
 The directory entry specifies the file's block addresses by providing
 the address of the first block (contiguous allocation), or
 the number of the first block (linked allocation), or
 the number of the i-node

(a) Fixed-size entries with the disk addresses and attributes in the directory entry (DOS).
(b) Each entry just refers to an i-node; the attributes are kept in the i-node (Unix).

 How do we deal with variable length names?


 Problem is that names have gotten very long
 Two approaches
 Fixed header followed by the variable-length names (in-line)
 Heap: each directory entry holds a pointer to the name, which is stored in a heap

Two ways of handling long file names in a directory. (a) In-line. (b) In a heap.

Shared Files
A file system containing a shared file is a directed acyclic graph (DAG) rather than a tree.

 If B or C adds new blocks, how does other owner find out?


 Use special i-node for shared files-indicates that file is shared
 Use symbolic link - a special file put in B’s directory if C is the owner. Contains the path name of
the file to which it is linked
I-node problem
 If C removes the file, B's directory still points to the i-node of the shared file
 If the i-node is later re-used for another file, B's entry points to the wrong i-node
 The solution is to leave the i-node in place and only decrease its count of owners

(a) Situation prior to linking. (b) After the link is created. (c) After the original owner removes the file.
Symbolic links
 Symbolic link solves problem
 Can have too many symbolic links and they take time to follow
 Big advantage-can point to files on other machines
Log-Structured File Systems
CPUs are faster, and disks and memories are (much) bigger, but disk seek time has not decreased
• Caches are bigger, so most reads can be served from the cache
• We therefore want to optimize writes, because the disk still needs to be updated
• Structure the disk as a log: collect writes and periodically send them to a segment on the disk. Writes tend to be very small
• Each segment has a summary of its contents (i-nodes, directories, ...)
• Keep an i-node map on disk, and cache it in memory, to locate i-nodes
• A cleaner thread compacts the log: it scans a segment for current i-nodes, discarding the ones no longer in use and bringing the current ones into memory
• A writer thread then writes the current ones out into a new segment
• This works well in Unix, but it is not compatible with most existing file systems and is not widely used
Journaling File Systems
We want to guard against lost files when there are crashes. Consider what happens when a file has to be removed:
• Remove the file from its directory.
• Release the i-node to the pool of free i-nodes.
• Return all the disk blocks to the pool of free disk blocks.
If there is a crash somewhere in this process, we have a mess.
o Keep a journal (i.e., a list) of actions before you take them, write the journal to disk, then perform the actions. The system can then recover from a crash by replaying the journal.
o The journaled operations must be idempotent; the data structures must be arranged to make this so.
o "Mark block n as free" is an idempotent operation, whereas adding freed blocks to the end of a list is not, as the sketch below shows.
o NTFS (Windows) and Linux file systems use journaling
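
The idempotence point can be made concrete with a small sketch; the bitmap and list here are hypothetical data structures, not those of any particular file system.

    unsigned char free_map[8192];             /* 1 bit per block         */

    void mark_block_free(unsigned n) {        /* idempotent: replaying   */
        free_map[n / 8] |= 1 << (n % 8);      /* it twice has the same   */
    }                                         /* effect as doing it once */

    unsigned free_list[1024];
    int free_count;

    void append_free_block(unsigned n) {      /* NOT idempotent:         */
        free_list[free_count++] = n;          /* replaying it inserts    */
    }                                         /* block n a second time   */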
Virtual File Systems:
There may be multiple file systems on the same machine
o Windows handles them as separate drives
o Unix integrates them into a single hierarchy via the VFS
o The VFS receives calls from user processes
o and passes lower-level calls to the actual (concrete) file system
The VFS also supports the Network File System (NFS): a file can be on a remote machine

VFS: how it works
o A file system registers with the VFS (e.g., at boot time)
o At registration time, the fs provides the list of addresses of the function calls the VFS wants
o The VFS gets info from the new fs's i-node and puts it in a v-node
o It makes an entry in the fd table of the calling process
o When the process issues a call (e.g., read), function pointers direct it to the concrete function calls

. A simplified view of the data structures and code used by the VFS and concrete file system to do a read.

4) MANAGEMENT AND OPTIMIZATION

Disk-Space Management
Since all the files are normally stored on disk one of the main concerns of file system is management of disk
space.

Block Size
The main question that arises when storing files in fixed-size blocks is the size of the block. If the block is too large, space gets wasted; if the block is too small, time gets wasted. So, to choose a correct block size, some information about the file-size distribution is required. Performance and space utilization are inherently in conflict.

Keeping track of free blocks


After a block size has been chosen, the next issue is how to keep track of the free blocks. Two methods are widely used:
 Using a linked list: a linked list of disk blocks, with each block holding as many free disk block numbers as will fit.
 Bitmap: a disk with n blocks has a bitmap with n bits. Free blocks are represented by 1s and allocated blocks by 0s, as seen in the figure below and in the sketch that follows.
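
A sketch of the bitmap method in C, using the convention above (1 = free, 0 = allocated); the disk size is made up.

    #define N_BLOCKS 65536
    unsigned char bitmap[N_BLOCKS / 8];        /* loaded from disk in reality */

    int alloc_block(void) {                    /* find a free block and       */
        for (int n = 0; n < N_BLOCKS; n++)     /* mark it allocated           */
            if (bitmap[n / 8] & (1 << (n % 8))) {
                bitmap[n / 8] &= ~(1 << (n % 8));
                return n;
            }
        return -1;                             /* no free block: disk full    */
    }

    void release_block(int n) {                /* return a block to the pool  */
        bitmap[n / 8] |= 1 << (n % 8);
    }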

Disk quotas
Multiuser operating systems often provide a mechanism for enforcing disk quotas. A system administrator
assigns each user a maximum allotment of files and blocks and the operating system makes sure that the users do
not exceed their quotas. Quotas are kept track of on a per-user basis in a quota table.

File-system Backups
If a computer's file system is irrevocably lost, whether due to hardware or software failure, restoring all the information will be difficult, time-consuming, and in many cases impossible. So it is advised to always have file-system backups.
Backing up files is time-consuming and occupies a large amount of space, so doing it efficiently and conveniently is important. Below are a few points to be considered before creating backups of files.

 Should the entire file system be backed up, or only a part of it?
 Not wanting to back up files that haven't changed since the previous backup leads to incremental dumps: only those files which have changed since the previous backup are saved. Recovery, however, gets more complicated in such cases.
 Since there is an immense amount of data, it is generally desirable to compress the data before taking a backup.
 It is difficult to perform a backup on an active file system, since the backup may be inconsistent.
 Making backups introduces many security issues.

There are two ways of dumping a disk to a backup disk:
 Physical dump: the dump starts at block 0 of the disk, writes all the disk blocks onto the output disk in order, and stops after copying the last one.
Advantages: simplicity and great speed.
Disadvantages: the inability to skip selected directories, make incremental dumps, or restore individual files upon request.
 Logical dump: the dump starts at one or more specified directories and recursively dumps all files and directories found there that have changed since some given base date. This is the most commonly used way.
The figure above depicts a popular algorithm used in many UNIX systems, wherein squares depict directories and circles depict files. The algorithm dumps all the files and directories that have been modified, and also the ones on the path to a modified file or directory. The dump algorithm maintains a bitmap indexed by i-node number, with several bits per i-node; bits are set and cleared in this map as the algorithm proceeds. Although logical dumping is straightforward, there are a few issues associated with it:
 Since the free block list is not a file, it is not dumped, and hence it must be reconstructed from scratch after all the dumps have been restored.
 If a file is linked to two or more directories, it is important that the file is restored only one time and that all the directories that are supposed to point to it do so.
 UNIX files may contain holes.
 Special files, named pipes, and all other files that are not real should never be dumped.

File-system Consistency
To deal with inconsistent file systems, most computers have a utility program that checks file-system
consistency. For example, UNIX has fsck and Windows has sfc. This utility can be run whenever the system is
booted. The utility programs perform two kinds of consistency checks.
 Blocks: To check block consistency the program builds two tables, each one containing a counter for each
block, initially set to 0. If the file system is consistent, each block will have a 1 either in the first table or
in the second table as you can see in the figure below.
If both tables contain a 0 for some block, that block is missing and will be reported as a missing block. The two other problem situations are a block that appears more than once in the free list, and the same data block being present in two or more files (see the sketch after this list).
 In addition to checking to see that each block is properly accounted for, the file-system checker also checks the directory system. It too uses a table of counters, but per file rather than per block. It counts how many directory entries point to each file; the link count stored in a file's i-node starts at 1 when the file is created and is incremented each time a (hard) link is made to it. In a consistent file system, both counts agree.
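
A sketch of the block check in C. The two counter tables correspond to the two tables in the figure; filling them (by walking every i-node and the free list) is left abstract, and report is a stand-in for the checker's output.

    #include <stdio.h>

    #define N_BLOCKS 65536
    int in_files[N_BLOCKS];   /* times each block occurs in some file     */
    int in_free [N_BLOCKS];   /* times each block occurs in the free list */

    static void report(const char *problem, int b) {
        printf("block %d: %s\n", b, problem);
    }

    void check_blocks(void) {
        /* ... first walk all i-nodes, incrementing in_files[b] for each
           block b in use, then walk the free list, incrementing in_free[b] ... */
        for (int b = 0; b < N_BLOCKS; b++) {
            if (in_files[b] + in_free[b] == 0)
                report("missing (add it to the free list)", b);
            else if (in_free[b] > 1)
                report("duplicated in the free list", b);
            else if (in_files[b] > 1)
                report("data block present in two or more files", b);
        }
    }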
File-system Performance
Since the access to disk is much slower than access to memory, many file systems have been designed with
various optimizations to improve performance as described below.

Caching
The most common technique used to reduce disk access time is the block cache or buffer cache: a collection of blocks that logically belong on the disk but are kept in memory for performance reasons. The usual algorithm works as follows: when a disk access is initiated, the cache is checked first to see if the disk block is present. If it is, the read request can be satisfied without a disk access; otherwise the disk block is first copied to the cache and then the read request is processed.

The figure above depicts how to quickly determine whether a block is present in the cache: hash the device and disk address, and look the result up in a hash table of cached blocks.

Block Read Ahead


Another technique to improve file-system performance is to try to get blocks into the cache before they are needed, to increase the hit rate. This works well only when files are read sequentially. When the file system is asked for block k of a file, it fetches it, and then checks whether block k+1 is already in the cache; if it is not, it schedules a read for block k+1, on the expectation that it will be needed shortly.
Reducing disk arm motion
Another way to increase file-system performance is to reduce disk-arm motion by putting blocks that are likely to be accessed in sequence close to each other, preferably in the same cylinder. If all the i-nodes are near the start of the disk, as in the figure above, the average distance between an i-node and its blocks will be about half the number of cylinders, requiring long seeks. To improve performance, the placement of the i-nodes can be modified as shown below.

Defragmenting Disks
Due to continuous creation and removal of files the disks get badly fragmented with files and holes all over the
place. As a consequence, when a new file is created, the blocks used for it may be spread all over the disk, giving
poor performance. The performance can be restored by moving files around to make them contiguous and to put
all (or at least most) of the free space in one or more large contiguous regions on the disk.
PART 3
SECONDARY STORAGE STRUCTURE

1) OVERVIEW OF SECONDARY STORAGE STRUCTURE

1.1 Magnetic Disks


 Traditional magnetic disks have the following basic structure:
o One or more platters in the form of disks covered with magnetic media. Hard disk platters are
made of rigid metal, while "floppy" disks are made of more flexible plastic.
o Each platter has two working surfaces. Older hard disk drives would sometimes not use the very
top or bottom surface of a stack of platters, as these surfaces were more susceptible to potential
damage.
o Each working surface is divided into a number of concentric rings called tracks. The collection of
all tracks that are the same distance from the edge of the platter, ( i.e. all tracks immediately above
one another in the following diagram ) is called a cylinder.
o Each track is further divided into sectors, traditionally containing 512 bytes of data each, although
some modern disks occasionally use larger sector sizes. ( Sectors also include a header and a
trailer, including checksum information among other things. Larger sector sizes reduce the
fraction of the disk consumed by headers and trailers, but increase internal fragmentation and the
amount of disk that must be marked bad in the case of errors. )
o The data on a hard drive is read by read-write heads. The standard configuration ( shown below )
uses one head per surface, each on a separate arm, and controlled by a common arm
assembly which moves all heads simultaneously from one cylinder to another. ( Other
configurations, including independent read-write heads, may speed up disk access, but involve
serious technical difficulties. )
o The storage capacity of a traditional disk drive is equal to the number of heads ( i.e. the number of
working surfaces ), times the number of tracks per surface, times the number of sectors per track,
times the number of bytes per sector. A particular physical block of data is specified by providing
the head-sector-cylinder number at which it is located.
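
The capacity rule and the cylinder-head-sector addressing can be written down directly. This is a generic sketch with made-up geometry fields, not the parameters of any particular drive.

    struct geometry {
        unsigned heads;              /* number of working surfaces    */
        unsigned cylinders;          /* tracks per surface            */
        unsigned sectors_per_track;
        unsigned bytes_per_sector;   /* traditionally 512             */
    };

    /* heads x tracks per surface x sectors per track x bytes per sector */
    unsigned long long capacity(struct geometry g) {
        return (unsigned long long)g.heads * g.cylinders
             * g.sectors_per_track * g.bytes_per_sector;
    }

    /* The usual mapping from a cylinder-head-sector triple to the linear
       block addresses described in section 1.4 (sector s counted from 0). */
    unsigned lba(struct geometry g, unsigned c, unsigned h, unsigned s) {
        return (c * g.heads + h) * g.sectors_per_track + s;
    }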

Figure 10.1 - Moving-head disk mechanism.

 In operation the disk rotates at high speed, such as 7200 rpm ( 120 revolutions per second. ) The rate at
which data can be transferred from the disk to the computer is composed of several steps:
o The positioning time, a.k.a. the seek time or random access time is the time required to move the
heads from one cylinder to another, and for the heads to settle down after the move. This is
typically the slowest step in the process and the predominant bottleneck to overall transfer rates.
o The rotational latency is the amount of time required for the desired sector to rotate around and come under the read-write head. This can range anywhere from zero to one full revolution, and on the average will equal one-half revolution. This is another physical step and is usually the second slowest step behind seek time. ( For a disk rotating at 7200 rpm, the average rotational latency would be 1/2 revolution / 120 revolutions per second, or just over 4 milliseconds, a long time by computer standards. )
o The transfer rate, which is the time required to move the data electronically from the disk to the
computer. ( Some authors may also use the term transfer rate to refer to the overall transfer rate,
including seek time and rotational latency as well as the electronic data transfer rate. )
 Disk heads "fly" over the surface on a very thin cushion of air. If they should accidentally contact the disk,
then a head crash occurs, which may or may not permanently damage the disk or even destroy it
completely. For this reason it is normal to park the disk heads when turning a computer off, which means
to move the heads off the disk or to an area of the disk where there is no data stored.
 Floppy disks are normally removable. Hard drives can also be removable, and some are even hot-
swappable, meaning they can be removed while the computer is running, and a new hard drive inserted in
their place.
 Disk drives are connected to the computer via a cable known as the I/O Bus. Some of the common
interface formats include Enhanced Integrated Drive Electronics, EIDE; Advanced Technology
Attachment, ATA; Serial ATA, SATA, Universal Serial Bus, USB; Fiber Channel, FC, and Small
Computer Systems Interface, SCSI.
 The host controller is at the computer end of the I/O bus, and the disk controller is built into the disk
itself. The CPU issues commands to the host controller via I/O ports. Data is transferred between the
magnetic surface and onboard cache by the disk controller, and then the data is transferred from that
cache to the host controller and the motherboard memory at electronic speeds.

1.2 Solid-State Disks


 As technologies improve and economics change, old technologies are often used in different ways. One example of this is the increasing use of solid-state disks, or SSDs.
 SSDs use memory technology as a small fast hard disk. Specific implementations may use either flash
memory or DRAM chips protected by a battery to sustain the information through power cycles.
 Because SSDs have no moving parts they are much faster than traditional hard drives, and certain
problems such as the scheduling of disk accesses simply do not apply.
 However SSDs also have their weaknesses: They are more expensive than hard drives, generally not as
large, and may have shorter life spans.
 SSDs are especially useful as a high-speed cache of hard-disk information that must be accessed quickly.
One example is to store filesystem meta-data, e.g. directory and inode information, that must be accessed
quickly and often. Another variation is a boot disk containing the OS and some application executables,
but no vital user data. SSDs are also used in laptops to make them smaller, faster, and lighter.
 Because SSDs are so much faster than traditional hard disks, the throughput of the bus can become a
limiting factor, causing some SSDs to be connected directly to the system PCI bus for example.

1.3 Magnetic Tapes


 Magnetic tapes were once used for common secondary storage before the days of hard disk drives, but
today are used primarily for backups.
 Accessing a particular spot on a magnetic tape can be slow, but once reading or writing commences,
access speeds are comparable to disk drives.
 Capacities of tape drives can range from 20 to 200 GB, and compression can double that capacity.
1.4 Disk Structure
 The traditional head-sector-cylinder, HSC numbers are mapped to linear block addresses by numbering the
first sector on the first head on the outermost track as sector 0. Numbering proceeds with the rest of the
sectors on that same track, and then the rest of the tracks on the same cylinder before proceeding through
the rest of the cylinders to the center of the disk. In modern practice these linear block addresses are used
in place of the HSC numbers for a variety of reasons:
1. The linear length of tracks near the outer edge of the disk is much longer than for those tracks
located near the center, and therefore it is possible to squeeze many more sectors onto outer tracks
than onto inner ones.
2. All disks have some bad sectors, and therefore disks maintain a few spare sectors that can be used in place of the bad ones. The mapping of spare sectors to bad sectors is managed internally by the disk controller.
3. Modern hard drives can have thousands of cylinders, and hundreds of sectors per track on their
outermost tracks. These numbers exceed the range of HSC numbers for many ( older ) operating
systems, and therefore disks can be configured for any convenient combination of HSC values
that falls within the total number of sectors physically on the drive.
 There is a limit to how closely packed individual bits can be placed on a physical media, but that limit is
growing increasingly more packed as technological advances are made.
 Modern disks pack many more sectors into outer cylinders than inner ones, using one of two approaches:
o With Constant Linear Velocity, CLV, the density of bits is uniform from cylinder to cylinder.
Because there are more sectors in outer cylinders, the disk spins slower when reading those
cylinders, causing the rate of bits passing under the read-write head to remain constant. This is the
approach used by modern CDs and DVDs.
o With Constant Angular Velocity, CAV, the disk rotates at a constant angular speed, with the bit
density decreasing on outer cylinders. ( These disks would have a constant number of sectors per
track on all cylinders. )

2) DISK ATTACHMENT

Disk drives can be attached either directly to a particular host ( a local disk ) or to a network.

2.1 Host-Attached Storage


 Local disks are accessed through I/O Ports as described earlier.
 The most common interfaces are IDE or ATA, each of which allows up to two drives per host controller.
 SATA is similar with simpler cabling.
 High end workstations or other systems in need of larger number of disks typically use SCSI disks:
o The SCSI standard supports up to 16 targets on each SCSI bus, one of which is generally the host
adapter and the other 15 of which can be disk or tape drives.
o A SCSI target is usually a single drive, but the standard also supports up to 8 units within each
target. These would generally be used for accessing individual disks within a RAID array. ( See
below. )
o The SCSI standard also supports multiple host adapters in a single computer, i.e. multiple SCSI
busses.
o Modern advancements in SCSI include "fast" and "wide" versions, as well as SCSI-2.
o SCSI cables may be either 50 or 68 conductors. SCSI devices may be external as well as internal.
o See wikipedia for more information on the SCSI interface.
 FC is a high-speed serial architecture that can operate over optical fiber or four-conductor copper wires,
and has two variants:
o A large switched fabric having a 24-bit address space. This variant allows for multiple devices
and multiple hosts to interconnect, forming the basis for the storage-area networks, SANs, to be
discussed in a future section.
o The arbitrated loop, FC-AL, which can address up to 126 devices ( drives and controllers. )
2.2 Network-Attached Storage
 Network attached storage connects storage devices to computers using a remote procedure call, RPC,
interface, typically with something like NFS filesystem mounts. This is convenient for allowing several
computers in a group common access and naming conventions for shared storage.
 NAS can be implemented using SCSI cabling, or ISCSI uses Internet protocols and standard network
connections, allowing long-distance remote access to shared files.
 NAS allows computers to easily share data storage, but tends to be less efficient than standard host-
attached storage.

Figure 10.2 - Network-attached storage.

2.3 Storage-Area Network


 A Storage-Area Network, SAN, connects computers and storage devices in a network, using storage
protocols instead of network protocols.
 One advantage of this is that storage access does not tie up regular networking bandwidth.
 SAN is very flexible and dynamic, allowing hosts and devices to attach and detach on the fly.
 SAN is also controllable, allowing restricted access to certain hosts and devices.

Figure 10.3 - Storage-area network.

3) DISK SCHEDULING

 As mentioned earlier, disk transfer speeds are limited primarily by seek times and rotational
latency. When multiple requests are to be processed there is also some inherent delay in waiting for other
requests to be processed.
 Bandwidth is measured by the amount of data transferred divided by the total amount of time from the
first request being made to the last transfer being completed, ( for a series of disk requests. )
 Both bandwidth and access time can be improved by processing requests in a good order.
 Disk requests include the disk address, memory address, number of sectors to transfer, and whether the
request is for reading or writing.

3.1 FCFS Scheduling


 First-Come First-Serve is simple and intrinsically fair, but not very efficient. Consider in the following
sequence the wild swing from cylinder 122 to 14 and then back to 124:

Figure 10.4 - FCFS disk scheduling.

3.2 SSTF Scheduling


 Shortest Seek Time First scheduling is more efficient, but may lead to starvation if a constant stream of
requests arrives for the same general area of the disk.
 SSTF reduces the total head movement to 236 cylinders, down from 640 required for the same set of
requests under FCFS. Note, however that the distance could be reduced still further to 208 by starting
with 37 and then 14 first before processing the rest of the requests.
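
Both policies are easy to state in C. The sketch below uses the request queue of Figures 10.4 and 10.5 (98, 183, 37, 122, 14, 124, 65, 67, head initially at cylinder 53): fcfs() returns 640 and sstf() returns 236 for this input.

    #include <stdbool.h>
    #include <stdlib.h>

    int fcfs(int head, const int req[], int n) {    /* serve in arrival order */
        int moved = 0;
        for (int i = 0; i < n; i++) {
            moved += abs(req[i] - head);
            head = req[i];
        }
        return moved;
    }

    int sstf(int head, const int req[], int n) {    /* nearest request next   */
        bool done[64] = { false };                  /* assumes n <= 64        */
        int moved = 0;
        for (int served = 0; served < n; served++) {
            int best = -1;
            for (int i = 0; i < n; i++)
                if (!done[i] && (best < 0 ||
                    abs(req[i] - head) < abs(req[best] - head)))
                    best = i;
            moved += abs(req[best] - head);
            head = req[best];
            done[best] = true;
        }
        return moved;
    }

    /* Example: int q[] = {98, 183, 37, 122, 14, 124, 65, 67};
       fcfs(53, q, 8) == 640, sstf(53, q, 8) == 236. */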

Figure 10.5 - SSTF disk scheduling.

3.3 SCAN Scheduling


 The SCAN algorithm, a.k.a. the elevator algorithm moves back and forth from one end of the disk to the
other, similarly to an elevator processing requests in a tall building.
Figure 10.6 - SCAN disk scheduling.
 Under the SCAN algorithm, If a request arrives just ahead of the moving head then it will be processed
right away, but if it arrives just after the head has passed, then it will have to wait for the head to pass
going the other way on the return trip. This leads to a fairly wide variation in access times which can be
improved upon.
 Consider, for example, when the head reaches the high end of the disk: Requests with high cylinder
numbers just missed the passing head, which means they are all fairly recent requests, whereas requests
with low numbers may have been waiting for a much longer time. Making the return scan from high to
low then ends up accessing recent requests first and making older requests wait that much longer.

3.4 C-SCAN Scheduling


 The Circular-SCAN algorithm improves upon SCAN by treating all requests in a circular queue fashion -
Once the head reaches the end of the disk, it returns to the other end without processing any requests, and
then starts again from the beginning of the disk:

Figure 10.7 - C-SCAN disk scheduling.

3.5 LOOK Scheduling


 LOOK scheduling improves upon SCAN by looking ahead at the queue of pending requests, and not
moving the heads any farther towards the end of the disk than is necessary. The following diagram
illustrates the circular form of LOOK:

Figure 10.8 - C-LOOK disk scheduling.

3.6 Selection of a Disk-Scheduling Algorithm


 With very low loads all algorithms are equal, since there will normally only be one request to process at a
time.
 For slightly larger loads, SSTF offers better performance than FCFS, but may lead to starvation when
loads become heavy enough.
 For busier systems, SCAN and LOOK algorithms eliminate starvation problems.
 The actual optimal algorithm may be something even more complex than those discussed here, but the
incremental improvements are generally not worth the additional overhead.
 Some improvement to overall filesystem access times can be made by intelligent placement of directory
and/or inode information. If those structures are placed in the middle of the disk instead of at the
beginning, then the maximum distance from those structures to data blocks is reduced to only one-half of
the disk size. If those structures can be further distributed, with data blocks stored as close as possible
to their corresponding directory structures, then that reduces still further the overall time to find the
disk block numbers and then access the corresponding data blocks.
 On modern disks the rotational latency can be almost as significant as the seek time; however, it is not
within the OS's control to account for it, because modern disks do not reveal their internal sector
mapping schemes ( particularly when bad blocks have been remapped to spare sectors ).
o Some disk manufacturers provide disk-scheduling algorithms directly on their disk controllers
( which do know the actual geometry of the disk as well as any remapping ), so that if a series of
requests is sent from the computer to the controller, those requests can be processed in an
optimal order.
o Unfortunately there are some considerations that the OS must take into account that are beyond
the abilities of the on-board disk-scheduling algorithms, such as priorities of some requests over
others, or the need to process certain requests in a particular order. For this reason OSes may elect
to spoon-feed requests to the disk controller one at a time in certain situations.

4)RAID STRUCTURE

 The general idea behind RAID is to employ a group of hard drives together with some form of duplication,
either to increase reliability or to speed up operations, ( or sometimes both. )
 RAID originally stood for Redundant Array of Inexpensive Disks, and was designed to use a number of
small, cheap disks in place of one or two larger, more expensive ones. Today RAID systems employ large,
possibly expensive disks as their components, and the "I" is now taken to stand for Independent disks.

4.1 Improvement of Reliability via Redundancy


 The more disks a system has, the greater the likelihood that one of them will go bad at any given time.
Hence increasing the number of disks on a system actually decreases the Mean Time To Failure, MTTF, of the
system.
 If, however, the same data were copied onto multiple disks, then the data would not be lost unless both ( or
all ) copies of the data were damaged simultaneously, which is a MUCH lower probability than that of a
single disk going bad. More specifically, the second disk would have to go bad before the first disk was
repaired, which brings the Mean Time To Repair into play. For example, if two disks were involved,
each with an MTTF of 100,000 hours and an MTTR of 10 hours, then the Mean Time To Data Loss would
be 500 * 10^6 hours, or about 57,000 years!
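The 57,000-year figure follows from the standard mean-time-to-data-loss approximation for a mirrored pair,
assuming independent failures: data is lost only if the second disk fails within the repair window of the
first, so MTTDL ~= MTTF^2 / ( 2 * MTTR ) = ( 100,000 )^2 / ( 2 * 10 ) = 500 * 10^6 hours, and
500 * 10^6 / 8,760 hours-per-year ~= 57,000 years.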
 This is the basic idea behind disk mirroring, in which a system contains identical data on two or more
disks.
o Note that a power failure during a write operation could cause both disks to contain corrupt data,
if both disks were writing simultaneously at the time of the power failure. One solution is to write
to the two disks in series, so that they will not both become corrupted ( at least not in the same
way ) by a power failure. An alternate solution involves non-volatile RAM as a write cache,
which is not lost in the event of a power failure and which is protected by error-correcting codes.

4.2 Improvement in Performance via Parallelism


 There is also a performance benefit to mirroring, particularly with respect to reads. Since every block of
data is duplicated on multiple disks, read operations can be satisfied from any available copy, and
multiple disks can be reading different data blocks simultaneously in parallel. ( Writes could possibly be
sped up as well through careful scheduling algorithms, but it would be complicated in practice. )
 Another way of improving disk access time is with striping, which basically means spreading data out
across multiple disks that can be accessed simultaneously.
o With bit-level striping the bits of each byte are striped across multiple disks. For example if 8
disks were involved, then each 8-bit byte would be read in parallel by 8 heads on separate disks. A
single disk read would access 8 * 512 bytes = 4K worth of data in the time normally required to
read 512 bytes. Similarly if 4 disks were involved, then two bits of each byte could be stored on
each disk, for 2K worth of disk access per read or write operation.
o Block-level striping spreads a filesystem across multiple disks on a block-by-block basis, so if
block N were located on disk 0, then block N + 1 would be on disk 1, and so on. This is
particularly useful when filesystems are accessed in clusters of physical blocks. Other striping
possibilities exist, with block-level striping being the most common.
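The block-level mapping just described is simple modular arithmetic: logical block N lives on disk
( N mod ndisks ) at offset ( N / ndisks ) within that disk. A minimal sketch:

/* Block-level striping: map a logical block number to a disk and an
 * offset within that disk. */
void stripe_map(unsigned long block, unsigned int ndisks,
                unsigned int *disk, unsigned long *offset) {
    *disk   = block % ndisks;   /* blocks rotate round-robin across disks */
    *offset = block / ndisks;   /* position of the block within its disk */
}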

4.3 RAID Levels


 Mirroring provides reliability but is expensive; striping improves performance but does not improve
reliability. Accordingly there are a number of different schemes that combine the principles of mirroring
and striping in different ways, in order to balance reliability versus performance versus cost. These are
described by different RAID levels, as follows: ( In the diagram that follows, "C" indicates a copy, and
"P" indicates parity, i.e. checksum bits. )
1. Raid Level 0 - This level includes striping only, with no mirroring.
2. Raid Level 1 - This level includes mirroring only, no striping.
3. Raid Level 2 - This level stores error-correcting codes on additional disks, allowing any
damaged data to be reconstructed from the remaining data plus the error-correcting codes. Note that
this scheme requires only three extra disks to protect 4 disks worth of data, as opposed to full
mirroring. ( The number of disks required is a function of the error-correcting algorithm, and the
means by which the particular bad bit(s) is(are) identified. )
4. Raid Level 3 - This level is similar to level 2, except that it takes advantage of the fact that each
disk is still doing its own error-detection, so that when an error occurs, there is no question about
which disk in the array has the bad data. As a result a single parity bit is all that is needed to
recover the lost data from an array of disks. Level 3 also includes striping, which improves
performance. The downside with the parity approach is that every disk must take part in every
disk access, and the parity bits must be constantly calculated and checked, reducing performance.
Hardware-level parity calculations and NVRAM cache can help with both of those issues. In
practice level 3 is greatly preferred over level 2.
5. Raid Level 4 - This level is similar to level 3, employing block-level striping instead of bit-level
striping. The benefits are that multiple blocks can be read independently, and changes to a block
only require writing two blocks ( data and parity ) rather than involving all disks. ( The XOR
arithmetic behind this parity scheme is sketched after the figure below. ) Note that new disks can be
added seamlessly to the system, provided they are initialized to all zeros, as this does not affect
the parity results.
6. Raid Level 5 - This level is similar to level 4, except the parity blocks are distributed over all
disks, thereby more evenly balancing the load on the system. For any given block on the disk(s),
one of the disks will hold the parity information for that block and the other N-1 disks will hold
the data. Note that the same disk cannot hold both data and parity for the same block, as both
would be lost in the event of a disk crash.
7. Raid Level 6 - This level extends RAID level 5 by storing multiple bits of error-recovery codes
( such as Reed-Solomon codes ) for each bit position of data, rather than a single parity bit. In
the example shown below, 2 bits of ECC are stored for every 4 bits of data, allowing data recovery
in the face of up to two simultaneous disk failures. Note that this still involves only a 50% increase
in storage needs, as opposed to 100% for simple mirroring, which could only tolerate a single disk
failure.
Figure 10.11 - RAID levels.
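The parity used by levels 3 through 5 is a bytewise XOR across the data disks, which is what makes both
single-disk reconstruction and the two-write update possible. A sketch ( illustrative, not any particular
RAID implementation ):

#include <stddef.h>

/* Parity block = XOR of all data blocks; any one lost block can then be
 * rebuilt by XORing the parity with the surviving blocks. */
void compute_parity(unsigned char *parity,
                    const unsigned char *const data[], int ndisks, size_t len) {
    for (size_t i = 0; i < len; i++) {
        unsigned char p = 0;
        for (int d = 0; d < ndisks; d++)
            p ^= data[d][i];
        parity[i] = p;
    }
}

/* Small write: only the changed data block and the parity block need
 * rewriting, since new_parity = old_parity ^ old_data ^ new_data. */
void update_parity(unsigned char *parity, const unsigned char *old_data,
                   const unsigned char *new_data, size_t len) {
    for (size_t i = 0; i < len; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}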

 There are also two RAID levels which combine RAID levels 0 and 1 ( striping and mirroring ) in different
combinations, designed to provide both performance and reliability at the expense of increased cost.
o In RAID level 0 + 1, disks are first striped, and then the striped set is mirrored to a second set.
This level generally provides better performance than RAID level 5.
o RAID level 1 + 0 mirrors disks in pairs, and then stripes the mirrored pairs. The storage capacity,
performance, etc. are all the same, but there is an advantage to this approach in the event of
multiple disk failures, as illustrated below:
 In diagram (a) below, the 8 disks have been divided into two sets of four, each of which is
striped, and then one stripe set is used to mirror the other set.
 If a single disk fails, it wipes out the entire stripe set, but the system can keep on
functioning using the remaining set.
 However if a second disk from the other stripe set now fails, then the entire system
is lost, as a result of two disk failures.
 In diagram (b), the same 8 disks are divided into four sets of two, each of which is
mirrored, and then the file system is striped across the four sets of mirrored disks.
 If a single disk fails, then that mirror set is reduced to a single disk, but the system
rolls on, and the other three mirror sets continue mirroring.
 Now if a second disk fails, ( that is not the mirror of the already failed disk ), then
another one of the mirror sets is reduced to a single disk, but the system can
continue without data loss.
 In fact the second arrangement could handle as many as four simultaneously failed
disks, as long as no two of them were from the same mirror pair.
Figure 10.12 - RAID 0 + 1 and 1 + 0
4.4 Selecting a RAID Level
 Trade-offs in selecting the optimal RAID level for a particular application include cost, volume of data,
need for reliability, need for performance, and rebuild time, the latter of which can affect the likelihood
that a second disk will fail while the first failed disk is being rebuilt.
 Other decisions include how many disks are involved in a RAID set and how many disks to protect with a
single parity bit. More disks in the set increases performance but increases cost. Protecting more disks per
parity bit saves cost, but increases the likelihood that a second disk will fail before the first bad disk is
repaired.

4.5 Extensions
 RAID concepts have been extended to tape drives ( e.g. striping tapes for faster backups or parity checking
tapes for reliability ), and for broadcasting of data.

4.6 Problems with RAID


 RAID protects against physical errors, but not against software bugs or other errors that write
erroneous data to the disk.
 ZFS adds an extra level of protection by including data block checksums in all inodes along with the
pointers to the data blocks. If data are mirrored and one copy has the correct checksum and the other does
not, then the data with the bad checksum will be replaced with a copy of the data with the good
checksum. This increases reliability greatly over RAID alone, at a cost of a performance hit that is
acceptable because ZFS is so fast to begin with.

Figure 10.13 - ZFS checksums all metadata and data.
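A sketch of that self-healing read; the structure layout and the helper functions here are assumptions for
illustration, not the actual ZFS code:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

uint64_t checksum(const void *data, size_t len);            /* assumed helper */
bool read_block(long addr, void *buf, size_t len);          /* assumed helper */
bool write_block(long addr, const void *buf, size_t len);   /* assumed helper */

struct block_ptr {
    long     addr[2];   /* addresses of the two mirrored copies */
    uint64_t sum;       /* checksum stored with the pointer, not the block */
};

/* Return the first copy whose contents match the stored checksum, and
 * repair the other copy if it had failed its checksum. */
bool healing_read(const struct block_ptr *bp, void *buf, size_t len) {
    int bad = -1;
    for (int i = 0; i < 2; i++) {
        if (read_block(bp->addr[i], buf, len) && checksum(buf, len) == bp->sum) {
            if (bad >= 0)
                write_block(bp->addr[bad], buf, len);   /* heal the bad copy */
            return true;
        }
        bad = i;                        /* this copy failed its checksum */
    }
    return false;                       /* both copies are bad */
}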


 Another problem with traditional filesystems is that their sizes are fixed, and relatively difficult to
change. Where RAID sets are involved it becomes even harder to adjust filesystem sizes, because a
filesystem cannot span multiple volumes.
 ZFS solves these problems by pooling RAID sets, and by dynamically allocating space to filesystems as
needed. Filesystem sizes can be limited by quotas, and space can also be reserved to guarantee that a
filesystem will be able to grow later, but these parameters can be changed at any time by the filesystem's
owner. Otherwise filesystems grow and shrink dynamically as needed.

Figure 10.14 - (a) Traditional volumes and file systems. (b) a ZFS pool and file systems.

5) STABLE STORAGE MANAGEMENT

 The concept of stable storage ( first presented in chapter 6 ) involves a storage medium in which data
is never lost, even in the face of equipment failure in the middle of a write operation.
 To implement this requires two ( or more ) copies of the data, with separate failure modes.
 An attempted disk write results in one of three possible outcomes:
1. The data is successfully and completely written.
2. The data is partially written, but not completely. The last block written may be garbled.
3. No writing takes place at all.
 Whenever an equipment failure occurs during a write, the system must detect it, and return the system
back to a consistent state. To do this requires two physical blocks for every logical block, and the
following procedure:
1. Write the data to the first physical block.
2. After step 1 has completed, write the data to the second physical block.
3. Declare the operation complete only after both physical writes have completed successfully.
 During recovery the pair of blocks is examined.
o If both blocks are identical and there is no sign of damage, then no further action is necessary.
o If one block contains a detectable error but the other does not, then the damaged block is replaced
with the good copy. ( This will either undo the operation or complete the operation, depending on
which block is damaged and which is undamaged. )
o If neither block shows damage but the data in the blocks differ, then replace the data in the first
block with the data in the second block. ( Undo the operation. )
 Because the sequence of operations described above is slow, stable storage usually includes NVRAM as a
cache, and declares a write operation complete once it has been written to the NVRAM.
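A sketch of the two-block write procedure described above; write_block() is an assumed helper that returns
success only after its write has completed and been verified:

#include <stdbool.h>

bool write_block(int disk, long blockno, const void *data);   /* assumed */

/* Stable write: the logical block is declared written only after both
 * physical copies have been written, in order, and verified. */
bool stable_write(long blockno, const void *data) {
    if (!write_block(0, blockno, data))   /* step 1: first physical block */
        return false;
    if (!write_block(1, blockno, data))   /* step 2: second block, only after
                                             step 1 has fully completed */
        return false;
    return true;                          /* step 3: operation complete */
}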
