Notes On Lesson
If unsafe ⇒ Pi must wait, and the old resource-allocation state is restored.
Example of Banker's Algorithm
5 processes P0 through P4; 3 resource types: A (10 instances), B (5 instances), and C (7 instances).
Snapshot at time T0:
     Allocation   Max      Available
     A B C        A B C    A B C
P0   0 1 0        7 5 3    3 3 2
P1   2 0 0        3 2 2
P2   3 0 2        9 0 2
P3   2 1 1        2 2 2
P4   0 0 2        4 3 3
The content of the matrix Need is defined to be Max − Allocation.
Need
     A B C
P0   7 4 3
P1   1 2 2
P2   6 0 0
P3   0 1 1
P4   4 3 1
The system is in a safe state since the sequence <P1, P3, P4, P2, P0> satisfies the safety criteria.
Suppose P1 requests (1,0,2). Check that Request ≤ Available (that is, (1,0,2) ≤ (3,3,2)) ⇒ true.
     Allocation   Need     Available
     A B C        A B C    A B C
P0   0 1 0        7 4 3    2 3 0
P1   3 0 2        0 2 0
P2   3 0 2        6 0 0
P3   2 1 1        0 1 1
P4   0 0 2        4 3 1
Executing the safety algorithm shows that the sequence <P1, P3, P4, P0, P2> satisfies the safety requirement.
Can a request for (3,3,0) by P4 be granted?
Can a request for (0,2,0) by P0 be granted?
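Both questions can be answered mechanically: pretend to grant the request, run the safety algorithm on the resulting state, and roll back if it is unsafe. The following is a minimal C sketch of that check (the data encodes the state above after P1's request; the function and variable names are our own illustration, not part of the original notes):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define N 5 /* processes P0..P4 */
#define M 3 /* resource types A, B, C */

/* State after P1's request for (1,0,2) was granted. */
int alloc[N][M] = {{0,1,0},{3,0,2},{3,0,2},{2,1,1},{0,0,2}};
int need[N][M]  = {{7,4,3},{0,2,0},{6,0,0},{0,1,1},{4,3,1}};
int avail[M]    = {2,3,0};

/* Safety algorithm: returns true if some safe sequence exists. */
bool is_safe(void)
{
    int work[M];
    bool finish[N] = {false};
    memcpy(work, avail, sizeof work);

    bool progress = true;
    while (progress) {
        progress = false;
        for (int i = 0; i < N; i++) {
            if (finish[i]) continue;
            int j = 0;
            while (j < M && need[i][j] <= work[j]) j++;
            if (j == M) {                   /* Need_i <= Work */
                for (j = 0; j < M; j++)
                    work[j] += alloc[i][j]; /* P_i finishes, releases all */
                finish[i] = true;
                progress = true;
            }
        }
    }
    for (int i = 0; i < N; i++)
        if (!finish[i]) return false;
    return true;
}

/* Tentatively grant Request_i, then test safety; roll back if unsafe. */
bool try_request(int i, const int req[M])
{
    for (int j = 0; j < M; j++)
        if (req[j] > need[i][j] || req[j] > avail[j])
            return false;                   /* error, or P_i must wait */
    for (int j = 0; j < M; j++) {           /* pretend to allocate */
        avail[j] -= req[j];
        alloc[i][j] += req[j];
        need[i][j] -= req[j];
    }
    if (is_safe()) return true;
    for (int j = 0; j < M; j++) {           /* unsafe: restore old state */
        avail[j] += req[j];
        alloc[i][j] -= req[j];
        need[i][j] += req[j];
    }
    return false;
}

int main(void)
{
    int r4[M] = {3,3,0}, r0[M] = {0,2,0};
    printf("P4 (3,3,0): %s\n", try_request(4, r4) ? "grant" : "deny");
    printf("P0 (0,2,0): %s\n", try_request(0, r0) ? "grant" : "deny");
    return 0;
}

Run on this data, the sketch denies P4's request because resource A is unavailable (3 > 2), and denies P0's request because the resulting state would be unsafe.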
4.5 DEADLOCK DETECTION
4.5.1 Methods
Allow system to enter deadlock state
Detection algorithm
Recovery scheme
4.5.2 Single Instance of Each Resource Type
Maintain wait-for graph
Nodes are processes.
Pi → Pj if Pi is waiting for Pj.
Periodically invoke an algorithm that searches for a cycle in the graph.
An algorithm to detect a cycle in a graph requires on the order of n² operations, where n is the number of vertices in the graph.
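As a concrete illustration (a minimal sketch of our own, not from the notes), a depth-first search over an adjacency matrix can find such a cycle:

#include <stdbool.h>

#define MAXP 16
bool waits_for[MAXP][MAXP];   /* waits_for[i][j]: Pi waits for Pj */

static bool dfs(int n, int u, int color[]) {
    color[u] = 1;                       /* in progress */
    for (int v = 0; v < n; v++) {
        if (!waits_for[u][v]) continue;
        if (color[v] == 1) return true; /* back edge: cycle => deadlock */
        if (color[v] == 0 && dfs(n, v, color)) return true;
    }
    color[u] = 2;                       /* finished */
    return false;
}

bool has_deadlock(int n) {
    int color[MAXP] = {0};
    for (int u = 0; u < n; u++)
        if (color[u] == 0 && dfs(n, u, color))
            return true;
    return false;
}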
Resource-Allocation Graph and Wait-for Graph
4.5.3 Several Instances of a Resource Type
Available: A vector of length m indicates the number of available resources of each
type.
Allocation: An n x m matrix defines the number of resources of each type currently
allocated to each process.
Request: An n x m matrix indicates the current request of each process. If Request[i][j] = k, then process Pi is requesting k more instances of resource type Rj.
Detection Algorithm
1. Let Work and Finish be vectors of length m and n, respectively. Initialize:
(a) Work = Available
(b) For i = 1, 2, ..., n, if Allocationi ≠ 0, then Finish[i] = false; otherwise, Finish[i] = true.
2. Find an index i such that both:
(a) Finish[i] == false
(b) Requesti ≤ Work
If no such i exists, go to step 4.
3. Work = Work + Allocationi
Finish[i] = true
go to step 2.
4. If Finish[i] == false for some i, 1 ≤ i ≤ n, then the system is in a deadlock state.
Moreover, if Finish[i] == false, then Pi is deadlocked.
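A minimal C sketch of these steps (our own illustration; it mirrors the Banker's safety algorithm, but tests Requesti ≤ Work instead of Needi ≤ Work):

#include <stdbool.h>
#include <string.h>

#define N 5 /* processes */
#define M 3 /* resource types */

/* Returns true if deadlock exists; deadlocked[i] marks the stuck processes. */
bool detect(int alloc[N][M], int request[N][M], int avail[M], bool deadlocked[N])
{
    int work[M];
    bool finish[N];
    memcpy(work, avail, sizeof work);

    /* Step 1: processes holding nothing are trivially finished. */
    for (int i = 0; i < N; i++) {
        finish[i] = true;
        for (int j = 0; j < M; j++)
            if (alloc[i][j] != 0) { finish[i] = false; break; }
    }

    /* Steps 2-3: repeatedly find some P_i with Request_i <= Work. */
    bool progress = true;
    while (progress) {
        progress = false;
        for (int i = 0; i < N; i++) {
            if (finish[i]) continue;
            int j = 0;
            while (j < M && request[i][j] <= work[j]) j++;
            if (j == M) {
                for (j = 0; j < M; j++)
                    work[j] += alloc[i][j];   /* optimistically reclaim */
                finish[i] = true;
                progress = true;
            }
        }
    }

    /* Step 4: any unfinished process is deadlocked. */
    bool any = false;
    for (int i = 0; i < N; i++) {
        deadlocked[i] = !finish[i];
        if (deadlocked[i]) any = true;
    }
    return any;
}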
Example of Detection Algorithm
Five processes P
0
through P
4
;
three resource types
A (7 instances), B (2 instances), and C (6 instances).
Snapshot at time T
0
:
72
     Allocation   Request   Available
     A B C        A B C     A B C
P0   0 1 0        0 0 0     0 0 0
P1   2 0 0        2 0 2
P2   3 0 3        0 0 0
P3   2 1 1        1 0 0
P4   0 0 2        0 0 2
Sequence <P0, P2, P3, P1, P4> will result in Finish[i] = true for all i.
P2 requests an additional instance of type C.
Request
     A B C
P0   0 0 0
P1   2 0 1
P2   0 0 1
P3   1 0 0
P4   0 0 2
State of the system?
We can reclaim the resources held by process P0, but there are insufficient resources to fulfill the other processes' requests.
Deadlock exists, consisting of processes P1, P2, P3, and P4.
4.5.4 Detection-Algorithm Usage
When, and how often, to invoke depends on:
How often is a deadlock likely to occur?
How many processes will need to be rolled back? (one for each disjoint cycle)
If detection algorithm is invoked arbitrarily, there may be many cycles in the resource
graph and so we would not be able to tell which of the many deadlocked processes
caused the deadlock.
4.6 RECOVERY FROM DEADLOCK
4.6.1 Process Termination
Abort all deadlocked processes.
Abort one process at a time until the deadlock cycle is eliminated.
In which order should we choose to abort?
Priority of the process.
How long process has computed, and how much longer to completion.
Resources the process has used.
Resources process needs to complete.
How many processes will need to be terminated.
Is process interactive or batch?
4.6.2 Recovery from Deadlock: Resource Preemption
Selecting a victim: minimize cost.
Rollback: return to some safe state, and restart the process from that state.
Starvation: the same process may always be picked as victim; include the number of rollbacks in the cost factor.
4.7 COMBINED APPROACH TO DEADLOCK HANDLING
Combine the three basic approaches
prevention
avoidance
detection
allowing the use of the optimal approach for each class of resources in the system.
Partition resources into hierarchically ordered classes.
Use most appropriate technique for handling deadlocks within each class.
*****************************************
UNIT - III
STORAGE MANAGEMENT
1.MEMORY MANAGEMENT
1.1 BACKGROUND
1.1.1 Introduction
Program must be brought into memory and placed within a process for it to be run.
Input queue: collection of processes on the disk that are waiting to be brought into memory to run the program.
User programs go through several steps before being run.
1.1.2 Binding of Instructions and Data to Memory
Compile time: If memory location known a priori, absolute code can be generated;
must recompile code if starting location changes.
Load time: Must generate relocatable code if memory location is not known at
compile time.
Execution time: Binding delayed until run time if the process can be moved during its
execution from one memory segment to another. Need hardware support for address
maps (e.g., base and limit registers).
Multistep Processing of a User Program
1.1.3 Logical vs. Physical Address Space
The concept of a logical address space that is bound to a separate physical address
space is central to proper memory management.
Logical address: generated by the CPU; also referred to as a virtual address.
Physical address: address seen by the memory unit.
Logical and physical addresses are the same in compile-time and load-time address-binding schemes; logical (virtual) and physical addresses differ in the execution-time address-binding scheme.
1.1.4 Memory-Management Unit (MMU)
Hardware device that maps virtual to physical address.
In MMU scheme, the value in the relocation register is added to every address
generated by a user process at the time it is sent to memory.
The user program deals with logical addresses; it never sees the real physical
addresses.
Dynamic relocation using a relocation register
1.1.5 Dynamic Loading
Routine is not loaded until it is called
Better memory-space utilization; unused routine is never loaded.
Useful when large amounts of code are needed to handle infrequently occurring cases.
No special support from the operating system is required; dynamic loading is implemented through program design.
1.1.6 Dynamic Linking
Linking postponed until execution time.
Small piece of code, stub, used to locate the appropriate memory-resident library
routine.
Stub replaces itself with the address of the routine, and executes the routine.
Operating system support is needed to check if the routine is in the process's memory address space.
Dynamic linking is particularly useful for libraries.
1.2 SWAPPING
A process can be swapped temporarily out of memory to a backing store, and then
brought back into memory for continued execution.
Backing store: fast disk large enough to accommodate copies of all memory images for all users; must provide direct access to these memory images.
Roll out, roll in: swapping variant used for priority-based scheduling algorithms; a lower-priority process is swapped out so a higher-priority process can be loaded and executed.
Major part of swap time is transfer time; total transfer time is directly proportional to
the amount of memory swapped.
Modified versions of swapping are found on many systems, e.g., UNIX, Linux, and Windows.
Schematic View of Swapping
1.3 CONTIGUOUS ALLOCATION
1.3.1 Introduction
Main memory is usually divided into two partitions:
Resident operating system, usually held in low memory with interrupt vector.
User processes then held in high memory.
1.3.2 Memory Protection
Single-partition allocation
Relocation-register scheme used to protect user processes from each other, and
from changing operating-system code and data.
Relocation register contains the value of the smallest physical address; the limit register contains the range of logical addresses; each logical address must be less than the limit register.
Hardware Support for Relocation and Limit Registers
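A minimal C sketch of the check the hardware performs on every memory reference (our own illustration; the register values are assumptions):

#include <stdio.h>
#include <stdlib.h>

unsigned long limit = 16384;        /* size of the partition (logical range) */
unsigned long relocation = 100040;  /* smallest physical address */

unsigned long translate(unsigned long logical) {
    if (logical >= limit) {
        fprintf(stderr, "trap: addressing error (logical %lu >= limit)\n", logical);
        exit(EXIT_FAILURE);         /* the OS would deliver a trap instead */
    }
    return logical + relocation;    /* physical = logical + relocation */
}

int main(void) {
    printf("logical 300 -> physical %lu\n", translate(300));
    return 0;
}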
Multiple-partition allocation
Hole block of available memory; holes of various size are scattered throughout
memory.
77
When a process arrives, it is allocated memory from a hole large enough to
accommodate it.
Operating system maintains information about:
a) allocated partitions b) free partitions (hole)
1.3.3 Dynamic Storage-Allocation Problem
First-fit: Allocate the first hole that is big enough.
Best-fit: Allocate the smallest hole that is big enough; must search entire list, unless
ordered by size. Produces the smallest leftover hole.
Worst-fit: Allocate the largest hole; must also search entire list. Produces the largest
leftover hole.
First-fit and best-fit better than worst-fit in terms of speed and storage utilization.
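To make the first-fit policy concrete, here is a small C sketch over a singly linked free-hole list (our own illustration; the structure and function names are assumptions):

#include <stddef.h>

struct hole {
    size_t start, size;
    struct hole *next;
};

/* Returns the start address of an allocated block, or (size_t)-1 on failure. */
size_t first_fit(struct hole **list, size_t request) {
    for (struct hole **pp = list; *pp; pp = &(*pp)->next) {
        struct hole *h = *pp;
        if (h->size >= request) {       /* first hole big enough wins */
            size_t addr = h->start;
            h->start += request;        /* shrink the hole... */
            h->size  -= request;
            if (h->size == 0)           /* ...or unlink it entirely */
                *pp = h->next;          /* (caller owns the node's memory) */
            return addr;
        }
    }
    return (size_t)-1;                  /* no single hole is big enough, even
                                           if total free space would suffice */
}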
1.3.4 Fragmentation
External Fragmentation: total memory space exists to satisfy a request, but it is not contiguous.
Internal Fragmentation: allocated memory may be slightly larger than the requested memory; this size difference is memory internal to a partition, but not being used.
Reduce external fragmentation by compaction
Shuffle memory contents to place all free memory together in one large block.
Compaction is possible only if relocation is dynamic, and is done at execution time.
I/O problem
Latch job in memory while it is involved in I/O.
Do I/O only into OS buffers.
1.4 PAGING
1.4.1 Basic method
Logical address space of a process can be noncontiguous; process is allocated physical
memory whenever the latter is available.
Divide physical memory into fixed-sized blocks called frames (size is power of 2,
between 512 bytes and 8192 bytes).
Divide logical memory into blocks of same size called pages.
Keep track of all free frames.
To run a program of size n pages, need to find n free frames and load program.
Set up a page table to translate logical to physical addresses.
Internal fragmentation.
1.4.1.1 Address Translation Scheme
Address generated by CPU is divided into:
Page number (p): used as an index into a page table, which contains the base address of each page in physical memory.
Page offset (d): combined with the base address to define the physical memory address that is sent to the memory unit.
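A small C sketch of this translation for 4096-byte pages (our own illustration; the page-table contents are made-up values):

#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

unsigned long page_table[] = {5, 6, 1, 2};  /* page -> frame (assumed) */

unsigned long translate(unsigned long logical) {
    unsigned long p = logical >> PAGE_SHIFT;        /* page number */
    unsigned long d = logical & (PAGE_SIZE - 1);    /* page offset */
    return (page_table[p] << PAGE_SHIFT) | d;       /* frame base + offset */
}

int main(void) {
    printf("logical 0x%x -> physical 0x%lx\n", 0x1234, translate(0x1234));
    return 0;
}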
Address Translation Architecture
Paging Example
1.4.1.2 Implementation of Page Table
Page table is kept in main memory.
Page-table base register (PTBR) points to the page table.
Page-table length register (PTLR) indicates size of the page table.
In this scheme every data/instruction access requires two memory accesses. One for
the page table and one for the data/instruction.
The two memory access problem can be solved by the use of a special fast-lookup
hardware cache called associative memory or translation look-aside buffers (TLBs)
1.4.1.3 Associative Memory
Associative memory: parallel search.
Address translation of (p, d):
If page number p is in an associative register, get the frame # out.
Otherwise, get the frame # from the page table in memory.
1.4.2 Paging Hardware With TLB
1.4.2.1. Effective Access Time
Associative lookup = ε time units
Assume memory cycle time is 1 microsecond.
Hit ratio: percentage of times that a page number is found in the associative registers; the ratio is related to the number of associative registers.
Hit ratio = α
Effective Access Time (EAT)
EAT = (1 + ε)α + (2 + ε)(1 − α)
    = 2 + ε − α
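For instance (our own numbers): with ε = 0.2 microseconds and α = 0.8, EAT = (1.2)(0.8) + (2.2)(0.2) = 1.4 microseconds, a 40% slowdown relative to a single 1-microsecond memory access.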
1.4.3 Memory Protection
Memory protection implemented by associating protection bit with each frame.
Valid-invalid bit attached to each entry in the page table:
valid indicates that the associated page is in the process's logical address space, and is thus a legal page.
invalid indicates that the page is not in the process's logical address space.
Valid (v) or Invalid (i) Bit In A Page Table
1.4.4 Page Table Structure
Hierarchical Paging
Hashed Page Tables
Inverted Page Tables
1.4.4.1 Hierarchical Page Tables
Break up the logical address space into multiple page tables.
A simple technique is a two-level page table.
Two-Level Paging Example
A logical address (on 32-bit machine with 4K page size) is divided into:
a page number consisting of 20 bits.
a page offset consisting of 12 bits.
Since the page table is paged, the page number is further divided into:
a 10-bit page number.
a 10-bit page offset.
Thus, a logical address is as follows:
p1 is an index into the outer page table, and p2 is the displacement within the page of the outer page table.
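A small C sketch of how the hardware (or a simulator) would split such an address (our own illustration):

#include <stdio.h>

void split(unsigned int logical) {
    unsigned int d  = logical & 0xFFF;          /* 12-bit offset */
    unsigned int p2 = (logical >> 12) & 0x3FF;  /* 10-bit inner index */
    unsigned int p1 = logical >> 22;            /* 10-bit outer index */
    printf("p1=%u p2=%u d=%u\n", p1, p2, d);
}

int main(void) {
    split(0x00403004);  /* prints p1=1 p2=3 d=4 */
    return 0;
}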
Two-Level Page-Table Scheme
Address-Translation Scheme
Address-translation scheme for a two-level 32-bit paging architecture
1.4.4.2 Hashed Page Tables
Common in address spaces > 32 bits.
The virtual page number is hashed into a page table. This page table contains a chain
of elements hashing to the same location.
Virtual page numbers are compared in this chain searching for a match. If a match is
found, the corresponding physical frame is extracted.
Hashed Page Table
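A minimal C sketch of the chained lookup just described (the names and table size are our own assumptions):

#include <stddef.h>

struct hpt_entry {
    unsigned long vpn;        /* virtual page number */
    unsigned long frame;      /* physical frame */
    struct hpt_entry *next;   /* chain of entries hashing to this slot */
};

#define TABLE_SIZE 1024
struct hpt_entry *table[TABLE_SIZE];

/* Returns the frame for vpn, or -1 if not present (page fault). */
long lookup(unsigned long vpn) {
    for (struct hpt_entry *e = table[vpn % TABLE_SIZE]; e; e = e->next)
        if (e->vpn == vpn)    /* compare along the chain until a match */
            return (long)e->frame;
    return -1;
}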
Inverted Page Table
One entry for each real page of memory.
Entry consists of the virtual address of the page stored in that real memory location,
with information about the process that owns that page.
Decreases memory needed to store each page table, but increases time needed to search
the table when a page reference occurs.
Use hash table to limit the search to one or at most a few page-table entries.
1.4.4.3 Inverted Page Table Architecture
1.4.5 Shared Pages
Shared code
One copy of read-only (reentrant) code shared among processes (i.e., text editors,
compilers, window systems).
Shared code must appear in same location in the logical address space of all
processes.
Private code and data
Each process keeps a separate copy of the code and data.
The pages for the private code and data can appear anywhere in the logical address
space.
Shared Pages Example
1.5 SEGMENTATION
1.5.1 Basic method
Memory-management scheme that supports user view of memory.
A program is a collection of segments. A segment is a logical unit such as:
main program,
procedure,
function,
method,
object,
local variables, global variables,
common block,
stack,
symbol table, arrays
User's View of a Program
1.5.2 Segmentation Architecture
Logical address consists of a two-tuple:
<segment-number, offset>
Segment table: maps two-dimensional user addresses to one-dimensional physical addresses; each table entry has:
base: contains the starting physical address where the segment resides in memory.
limit: specifies the length of the segment.
Segment-table base register (STBR) points to the segment table's location in memory.
Segment-table length register (STLR) indicates the number of segments used by a program;
segment number s is legal if s < STLR.
Relocation.
dynamic
by segment table
Sharing.
shared segments
same segment number
Allocation.
first fit/best fit
external fragmentation
Protection. With each entry in segment table associate:
validation bit = 0 ⇒ illegal segment
read/write/execute privileges
Protection bits associated with segments; code sharing occurs at segment level.
Since segments vary in length, memory allocation is a dynamic storage-allocation
problem.
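A minimal C sketch of the segment-table lookup with the limit check (our own illustration; the segment values follow a common textbook example and are assumptions):

#include <stdio.h>
#include <stdlib.h>

struct seg { unsigned long base, limit; };

struct seg segtab[] = { {1400, 1000}, {6300, 400}, {4300, 400} };
unsigned int stlr = 3;  /* number of segments (STLR) */

unsigned long translate(unsigned int s, unsigned long offset) {
    if (s >= stlr || offset >= segtab[s].limit) {
        fprintf(stderr, "trap: addressing error\n");
        exit(EXIT_FAILURE);
    }
    return segtab[s].base + offset;
}

int main(void) {
    printf("(2, 53) -> %lu\n", translate(2, 53)); /* 4300 + 53 = 4353 */
    return 0;
}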
A segmentation example is shown in the following diagram
Segmentation Hardware
Example of Segmentation
1.6 SEGMENTATION WITH PAGING
1.6.1 Multics
The MULTICS system solved problems of external fragmentation and lengthy search
times by paging the segments.
Solution differs from pure segmentation in that the segment-table entry contains not
the base address of the segment, but rather the base address of a page table for this
segment.
MULTICS Address Translation Scheme
1.6.2 Intel 386
As shown in the following diagram, the Intel 386 uses segmentation with paging for
memory management with a two-level paging scheme.
Intel 80386 Address Translation
2.VIRTUAL MEMORY
2.1.BACKGROUND
Virtual memory: separation of user logical memory from physical memory.
Only part of the program needs to be in memory for execution.
Logical address space can therefore be much larger than physical address
space.
Allows address spaces to be shared by several processes.
Allows for more efficient process creation.
Virtual memory can be implemented via:
Demand paging
Demand segmentation
Virtual Memory That is Larger Than Physical Memory
2.2DEMAND PAGING
2.2.1 Basic concepts
Bring a page into memory only when it is needed.
Less I/O needed
Less memory needed
Faster response
More users
Page is needed ⇒ reference to it
invalid reference ⇒ abort
not-in-memory ⇒ bring to memory
Transfer of a Paged Memory to Contiguous Disk Space
Valid-Invalid Bit
With each page table entry a valid-invalid bit is associated
(1 ⇒ in-memory, 0 ⇒ not-in-memory).
Initially, the valid-invalid bit is set to 0 on all entries.
Example of a page table snapshot.
During address translation, if the valid-invalid bit in a page table entry is 0 ⇒ page fault.
Page Table When Some Pages Are Not in Main Memory
Page Fault
If there is ever a reference to a page, the first reference will trap to the OS ⇒ page fault.
OS looks at another table to decide:
Invalid reference ⇒ abort.
Just not in memory ⇒ page it in:
Get an empty frame.
Swap the page into the frame.
Reset tables; set validation bit = 1.
Restart the instruction. Restart is tricky for instructions such as:
block move
auto increment/decrement location
Steps in Handling a Page Fault
What happens if there is no free frame?
Page replacement: find some page in memory that is not really in use, and swap it out.
We need a replacement algorithm.
Performance: we want an algorithm that results in the minimum number of page faults.
The same page may be brought into memory several times.
2.2.2 Performance of Demand Paging
Page Fault Rate: 0 ≤ p ≤ 1.0
if p = 0, no page faults
if p = 1, every reference is a fault
Effective Access Time (EAT)
EAT = (1 − p) × memory access
    + p × (page fault overhead
    + [swap page out]
    + swap page in
    + restart overhead)
2.2.3 Demand Paging Example
Memory access time = 1 microsecond
50% of the time the page that is being replaced has been modified and therefore needs to be swapped out.
Swap page time = 10 msec = 10,000 microseconds; with the 50% chance of a swap-out, the average page-fault service time is about 15,000 microseconds.
EAT = (1 − p) × 1 + p × 15,000
    = 1 + 14,999p ≈ 1 + 15,000p (in microseconds)
2.3 PROCESS CREATION
Virtual memory allows other benefits during process creation:
- Copy-on-Write
- Memory-Mapped Files
2.3.1 Copy-on-Write
Copy-on-Write (COW) allows both parent and child processes to initially share the
same pages in memory.
If either process modifies a shared page, only then is the page copied.
COW allows more efficient process creation as only modified pages are copied.
Free pages are allocated from a pool of zeroed-out pages.
2.3.2 Memory-Mapped Files
Memory-mapped file I/O allows file I/O to be treated as routine memory access by
mapping a disk block to a page in memory.
A file is initially read using demand paging. A page-sized portion of the file is read
from the file system into a physical page. Subsequent reads/writes to/from the file are
treated as ordinary memory accesses.
Simplifies file access by treating file I/O through memory rather than read() and write() system calls.
Memory Mapped Files
2.4 PAGE REPLACEMENT
2.4.1 Need for Page replacement
Prevent over-allocation of memory by modifying page-fault service routine to include
page replacement.
Use a modify (dirty) bit to reduce the overhead of page transfers: only modified pages are written to disk.
Page replacement completes the separation between logical memory and physical memory: a large virtual memory can be provided on top of a smaller physical memory.
Need For Page Replacement
2.4.2 Basic Page Replacement
Find the location of the desired page on disk.
Find a free frame:
- If there is a free frame, use it.
- If there is no free frame, use a page replacement algorithm to select a victim
frame.
Read the desired page into the (newly) free frame. Update the page and frame tables.
Restart the process.
Page Replacement
2.4.3 Page Replacement Algorithms
Want lowest page-fault rate.
Evaluate algorithm by running it on a particular string of memory references (reference
string) and computing the number of page faults on that string.
In all our examples, the reference string is
1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5.
2.4.3.1 First-In-First-Out (FIFO) Algorithm
Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
3 frames (3 pages can be in memory at a time per process): 9 page faults
4 frames: 10 page faults
FIFO Replacement exhibits Belady's Anomaly: more frames can produce more page faults.
FIFO Page Replacement
FIFO Illustrating Belady's Anomaly
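The fault counts above can be reproduced with a short simulation; the following C sketch (our own, not from the notes) counts FIFO faults for a given number of frames:

#include <stdio.h>
#include <stdbool.h>

int fifo_faults(const int *refs, int n, int frames) {
    int mem[16];
    int count = 0, faults = 0, next = 0;  /* next: FIFO victim index */
    for (int i = 0; i < n; i++) {
        bool hit = false;
        for (int j = 0; j < count; j++)
            if (mem[j] == refs[i]) { hit = true; break; }
        if (hit) continue;
        faults++;
        if (count < frames)
            mem[count++] = refs[i];       /* free frame available */
        else {
            mem[next] = refs[i];          /* replace the oldest page */
            next = (next + 1) % frames;
        }
    }
    return faults;
}

int main(void) {
    int refs[] = {1,2,3,4,1,2,5,1,2,3,4,5};
    printf("3 frames: %d faults\n", fifo_faults(refs, 12, 3)); /* 9 */
    printf("4 frames: %d faults\n", fifo_faults(refs, 12, 4)); /* 10 */
    return 0;
}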
2.4.3.2 Optimal Algorithm
Replace page that will not be used for longest period of time.
4 frames example, reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5: 6 page faults.
How do you know this? (It requires future knowledge of the reference string.)
Used for measuring how well your algorithm performs.
Optimal Page Replacement
2.4.3.3 Least Recently Used (LRU) Algorithm
Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
Counter implementation
Every page entry has a counter; every time page is referenced through this entry,
copy the clock into the counter.
When a page needs to be changed, look at the counters to determine which are to
change.
LRU Page Replacement
Stack implementation keep a stack of page numbers in a double link form:
Page referenced:
move it to the top
requires 6 pointers to be changed
No search for replacement
Use Of A Stack to Record The Most Recent Page References
2.4.3.4 LRU Approximation Algorithms
2.4.3.4.1 Additional Reference bit algorithm
Reference bit
With each page associate a bit, initially = 0
When a page is referenced, the bit is set to 1.
Replace the one which is 0 (if one exists). We do not know the order, however.
Second chance
Need reference bit.
Clock replacement.
If the page to be replaced (in clock order) has reference bit = 1, then:
set reference bit 0.
leave page in memory.
replace next page (in clock order), subject to same rules.
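A minimal C sketch of the clock hand's victim selection (our own illustration):

#define NFRAMES 8

int ref_bit[NFRAMES];   /* hardware-set reference bits */
int hand = 0;           /* clock hand */

/* Returns the frame to replace, giving referenced pages a second chance. */
int clock_victim(void) {
    for (;;) {
        if (ref_bit[hand] == 0) {        /* not recently used: victim */
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        ref_bit[hand] = 0;               /* second chance: clear and move on */
        hand = (hand + 1) % NFRAMES;
    }
}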
2.4.3.4.2 Second-Chance (clock) Page-Replacement Algorithm
2.4.3.5 Counting Algorithms
Keep a counter of the number of references that have been made to each page.
LFU Algorithm: replaces page with smallest count.
MFU Algorithm: based on the argument that the page with the smallest count was
probably just brought in and has yet to be used.
2.5 ALLOCATION OF FRAMES
Each process needs minimum number of pages.
Example: IBM 370 needs 6 pages to handle the SS MOVE instruction:
the instruction is 6 bytes and might span 2 pages.
2 pages to handle the from address.
2 pages to handle the to address.
Two major allocation schemes.
fixed allocation
priority allocation
2.5.1 Fixed Allocation
Equal allocation: e.g., if 100 frames and 5 processes, give each process 20 frames.
Proportional allocation: allocate according to the size of the process.
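A small worked example using the usual formula ai = (si / S) × m (the numbers are our own): with m = 62 free frames and process sizes s1 = 10 and s2 = 127 pages (so S = 137), proportional allocation gives a1 = (10/137) × 62 ≈ 4 frames and a2 = (127/137) × 62 ≈ 57 frames.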
2.5.2 Priority Allocation
Use a proportional allocation scheme using priorities rather than size.
If process Pi generates a page fault,
select for replacement one of its frames, or
select for replacement a frame from a process with a lower priority number.
2.5.3 Global vs. Local Allocation
Global replacement: a process selects a replacement frame from the set of all frames; one process can take a frame from another.
Local replacement: each process selects from only its own set of allocated frames.
2.6 THRASHING
2.6.1 Causes of Thrashing
If a process does not have enough pages, the page-fault rate is very high. This leads
to:
low CPU utilization.
operating system thinks that it needs to increase the degree of multiprogramming.
another process added to the system.
Thrashing: a process is busy swapping pages in and out.
Thrashing
Why does paging work?
Locality model
Process migrates from one locality to another.
Localities may overlap.
Why does thrashing occur?
Σ size of localities > total memory size
2.6.2 Working-Set Model
Δ ≡ working-set window ≡ a fixed number of page references
Example: 10,000 instructions
WSSi (working set of process Pi) = total number of pages referenced in the most recent Δ (varies in time)
if Δ too small, it will not encompass the entire locality.
if Δ too large, it will encompass several localities.
if Δ = ∞, it will encompass the entire program.
D = Σ WSSi ≡ total demand frames
if D > m ⇒ Thrashing
Policy: if D > m, then suspend one of the processes.
Working-set model
Keeping Track of the Working Set
Approximate with interval timer + a reference bit
Example: Δ = 10,000 references
Timer interrupts after every 5,000 time units.
Keep 2 bits in memory for each page.
Whenever the timer interrupts, copy all reference bits and then set them to 0.
If one of the bits in memory = 1 ⇒ page is in the working set.
Why is this not completely accurate?
Improvement = 10 bits and interrupt every 1000 time units.
Page-Fault Frequency Scheme
Establish acceptable page-fault rate.
If actual rate too low, process loses frame.
If actual rate too high, process gains frame.
3.CASE STUDY:
3.1MEMORY MANAGEMENT IN LINUX
Rather than describing the theory of memory management in operating systems,
this section tries to pinpoint the main features of the Linux implementation. Although you
do not need to be a Linux virtual memory guru to implement mmap, a basic overview of
how things work is useful. What follows is a fairly lengthy description of the data
structures used by the kernel to manage memory. Once the necessary background has
been covered, we can get into working with these structures.
3.1.1. Address Types
Linux is, of course, a virtual memory system, meaning that the addresses seen by user
programs do not directly correspond to the physical addresses used by the hardware.
Virtual memory introduces a layer of indirection that allows a number of nice things.
With virtual memory, programs running on the system can allocate far more memory than
is physically available; indeed, even a single process can have a virtual address space
larger than the system's physical memory. Virtual memory also allows the program to
play a number of tricks with the process's address space, including mapping the program's
memory to device memory.
Thus far, we have talked about virtual and physical addresses, but a number of the details
have been glossed over. The Linux system deals with several types of addresses, each
with its own semantics. Unfortunately, the kernel code is not always very clear on exactly
which type of address is being used in each situation, so the programmer must be careful.
The following is a list of address types used in Linux. Figure 15-1 shows how these
address types relate to physical memory.
User virtual addresses
These are the regular addresses seen by user-space programs. User addresses are
either 32 or 64 bits in length, depending on the underlying hardware architecture,
and each process has its own virtual address space.
Physical addresses
The addresses used between the processor and the system's memory. Physical
addresses are 32- or 64-bit quantities; even 32-bit systems can use larger physical
addresses in some situations.
Bus addresses
The addresses used between peripheral buses and memory. Often, they are the
same as the physical addresses used by the processor, but that is not necessarily
the case. Some architectures can provide an I/O memory management unit
(IOMMU) that remaps addresses between a bus and main memory. An IOMMU
can make life easier in a number of ways (making a buffer scattered in memory
appear contiguous to the device, for example), but programming the IOMMU is
an extra step that must be performed when setting up DMA operations. Bus
addresses are highly architecture dependent, of course.
Kernel logical addresses
These make up the normal address space of the kernel. These addresses map some
portion (perhaps all) of main memory and are often treated as if they were
physical addresses. On most architectures, logical addresses and their associated
physical addresses differ only by a constant offset. Logical addresses use the
hardware's native pointer size and, therefore, may be unable to address all of
physical memory on heavily equipped 32-bit systems. Logical addresses are
usually stored in variables of type unsigned long or void *. Memory returned from
kmalloc has a kernel logical address.
Kernel virtual addresses
Kernel virtual addresses are similar to logical addresses in that they are a mapping
from a kernel-space address to a physical address. Kernel virtual addresses do not
necessarily have the linear, one-to-one mapping to physical addresses that
characterize the logical address space, however. All logical addresses are kernel
virtual addresses, but many kernel virtual addresses are not logical addresses. For
example, memory allocated by vmalloc has a virtual address (but no direct
physical mapping). The kmap function (described later in this chapter) also returns
virtual addresses. Virtual addresses are usually stored in pointer variables.
Figure 15-1. Address types used in Linux
If you have a logical address, the macro __pa() (defined in <asm/page.h>) returns its associated physical address. Physical addresses can be mapped back to logical addresses with __va(), but only for low-memory pages.
Different kernel functions require different types of addresses. It would be nice if there
were different C types defined, so that the required address types were explicit, but we
have no such luck. In this chapter, we try to be clear on which types of addresses are used
where.
3.1.2. Physical Addresses and Pages
Physical memory is divided into discrete units called pages. Much of the system's internal
handling of memory is done on a per-page basis. Page size varies from one architecture to
the next, although most systems currently use 4096-byte pages. The constant PAGE_SIZE
(defined in <asm/page.h>) gives the page size on any given architecture.
If you look at a memory address, virtual or physical, it is divisible into a page number and an offset within the page. If 4096-byte pages are being used, for example, the 12 least-significant bits are the offset, and the remaining, higher bits indicate the page number. If you discard the offset and shift the rest of an address to the right, the result is called a page frame number (PFN). Shifting bits to convert between page frame numbers and addresses is a fairly common operation; the macro PAGE_SHIFT tells how many bits must be shifted to make this conversion.
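For illustration, the conversion amounts to a shift in each direction (a sketch of our own; real kernel code takes PAGE_SHIFT from <asm/page.h>):

#define PAGE_SHIFT 12   /* assumed: 4096-byte pages */

static inline unsigned long addr_to_pfn(unsigned long addr)
{
    return addr >> PAGE_SHIFT;          /* drop the in-page offset */
}

static inline unsigned long pfn_to_addr(unsigned long pfn)
{
    return pfn << PAGE_SHIFT;           /* base address of that page frame */
}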
3.1.3. High and Low Memory
The difference between logical and kernel virtual addresses is highlighted on 32-bit
systems that are equipped with large amounts of memory. With 32 bits, it is possible to
address 4 GB of memory. Linux on 32-bit systems has, until recently, been limited to
substantially less memory than that, however, because of the way it sets up the virtual
address space.
The kernel (on the x86 architecture, in the default configuration) splits the 4-GB virtual
address space between user-space and the kernel; the same set of mappings is used in
both contexts. A typical split dedicates 3 GB to user space, and 1 GB for kernel space.
[1]
The kernel's code and data structures must fit into that space, but the biggest consumer of
kernel address space is virtual mappings for physical memory. The kernel cannot directly
manipulate memory that is not mapped into the kernel's address space. The kernel, in
other words, needs its own virtual address for any memory it must touch directly. Thus,
for many years, the maximum amount of physical memory that could be handled by the
kernel was the amount that could be mapped into the kernel's portion of the virtual
address space, minus the space needed for the kernel code itself. As a result, x86-based
Linux systems could work with a maximum of a little under 1 GB of physical memory.
[1]
Many non-x86 architectures are able to efficiently do without the
kernel/user-space split described here, so they can work with up to a 4-GB
kernel address space on 32-bit systems. The constraints described in this
section still apply to such systems when more than 4 GB of memory are
installed, however.
In response to commercial pressure to support more memory while not breaking 32-bit
application and the system's compatibility, the processor manufacturers have added
"address extension" features to their products. The result is that, in many cases, even 32-
bit processors can address more than 4 GB of physical memory. The limitation on how
much memory can be directly mapped with logical addresses remains, however. Only the
lowest portion of memory (up to 1 or 2 GB, depending on the hardware and the kernel
configuration) has logical addresses;
[2]
the rest (high memory) does not. Before accessing
a specific high-memory page, the kernel must set up an explicit virtual mapping to make
that page available in the kernel's address space. Thus, many kernel data structures must
be placed in low memory; high memory tends to be reserved for user-space process
pages.
[2]
The 2.6 kernel (with an added patch) can support a "4G/4G" mode on
x86 hardware, which enables larger kernel and user virtual address spaces
at a mild performance cost.
The term "high memory" can be confusing to some, especially since it has other meanings
in the PC world. So, to make things clear, we'll define the terms here:
Low memory
Memory for which logical addresses exist in kernel space. On almost every
system you will likely encounter, all memory is low memory.
High memory
Memory for which logical addresses do not exist, because it is beyond the address
range set aside for kernel virtual addresses.
On i386 systems, the boundary between low and high memory is usually set at just under
1 GB, although that boundary can be changed at kernel configuration time. This boundary
is not related in any way to the old 640 KB limit found on the original PC, and its
placement is not dictated by the hardware. It is, instead, a limit set by the kernel itself as it
splits the 32-bit address space between kernel and user space.
We will point out limitations on the use of high memory as we come to them in this
chapter.
3.1.4. The Memory Map and Struct Page
Historically, the kernel has used logical addresses to refer to pages of physical memory.
The addition of high-memory support, however, has exposed an obvious problem with
that approach: logical addresses are not available for high memory. Therefore, kernel
functions that deal with memory are increasingly using pointers to struct page (defined in
<linux/mm.h>) instead. This data structure is used to keep track of just about everything
the kernel needs to know about physical memory; there is one struct page for each physical
page on the system. Some of the fields of this structure include the following:
atomic_t count;
The number of references there are to this page. When the count drops to 0, the
page is returned to the free list.
void *virtual;
The kernel virtual address of the page, if it is mapped; NULL, otherwise. Low-
memory pages are always mapped; high-memory pages usually are not. This field
does not appear on all architectures; it generally is compiled only where the kernel
virtual address of a page cannot be easily calculated. If you want to look at this
field, the proper method is to use the page_address macro, described below.
unsigned long flags;
A set of bit flags describing the status of the page. These include PG_locked, which
indicates that the page has been locked in memory, and PG_reserved, which
prevents the memory management system from working with the page at all.
There is much more information within struct page, but it is part of the deeper black magic
of memory management and is not of concern to driver writers.
The kernel maintains one or more arrays of struct page entries that track all of the physical
memory on the system. On some systems, there is a single array called mem_map. On
some systems, however, the situation is more complicated. Nonuniform memory access
(NUMA) systems and those with widely discontiguous physical memory may have more
than one memory map array, so code that is meant to be portable should avoid direct
access to the array whenever possible. Fortunately, it is usually quite easy to just work
with struct page pointers without worrying about where they come from.
Some functions and macros are defined for translating between struct page pointers and
virtual addresses:
struct page *virt_to_page(void *kaddr);
This macro, defined in <asm/page.h>, takes a kernel logical address and returns its
associated struct page pointer. Since it requires a logical address, it does not work
with memory from vmalloc or high memory.
struct page *pfn_to_page(int pfn);
Returns the struct page pointer for the given page frame number. If necessary, it
checks a page frame number for validity with pfn_valid before passing it to
pfn_to_page.
void *page_address(struct page *page);
Returns the kernel virtual address of this page, if such an address exists. For high
memory, that address exists only if the page has been mapped. This function is
defined in <linux/mm.h>. In most situations, you want to use a version of kmap
rather than page_address.
#include <linux/highmem.h>
void *kmap(struct page *page);
void kunmap(struct page *page);
kmap returns a kernel virtual address for any page in the system. For low-memory
pages, it just returns the logical address of the page; for high-memory pages, kmap
creates a special mapping in a dedicated part of the kernel address space.
Mappings created with kmap should always be freed with kunmap; a limited
number of such mappings is available, so it is better not to hold on to them for too
long. kmap calls maintain a counter, so if two or more functions both call kmap on
the same page, the right thing happens. Note also that kmap can sleep if no
mappings are available.
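As a small usage sketch of the pattern just described (our own, for a 2.6-era kernel):

#include <linux/highmem.h>
#include <linux/string.h>

static void zero_page(struct page *page)
{
    void *vaddr = kmap(page);   /* may sleep; creates a mapping if needed */
    memset(vaddr, 0, PAGE_SIZE);
    kunmap(page);               /* release the mapping promptly */
}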
#include <linux/highmem.h>
#include <asm/kmap_types.h>
void *kmap_atomic(struct page *page, enum km_type type);
void kunmap_atomic(void *addr, enum km_type type);
kmap_atomic is a high-performance form of kmap. Each architecture maintains a
small list of slots (dedicated page table entries) for atomic kmaps; a caller of
kmap_atomic must tell the system which of those slots to use in the type argument.
The only slots that make sense for drivers are KM_USER0 and KM_USER1 (for code
running directly from a call from user space), and KM_IRQ0 and KM_IRQ1 (for
interrupt handlers). Note that atomic kmaps must be handled atomically; your
code cannot sleep while holding one. Note also that nothing in the kernel keeps
two functions from trying to use the same slot and interfering with each other
(although there is a unique set of slots for each CPU). In practice, contention for
atomic kmap slots seems to not be a problem.
We see some uses of these functions when we get into the example code, later in this
chapter and in subsequent chapters.
3.1.5. Page Tables
On any modern system, the processor must have a mechanism for translating virtual
addresses into its corresponding physical addresses. This mechanism is called a page
table; it is essentially a multilevel tree-structured array containing virtual-to-physical
mappings and a few associated flags. The Linux kernel maintains a set of page tables
even on architectures that do not use such tables directly.
A number of operations commonly performed by device drivers can involve manipulating
page tables. Fortunately for the driver author, the 2.6 kernel has eliminated any need to
work with page tables directly. As a result, we do not describe them in any detail; curious
readers may want to have a look at Understanding The Linux Kernel by Daniel P. Bovet
and Marco Cesati (O'Reilly) for the full story.
3.1.6. Virtual Memory Areas
The virtual memory area (VMA) is the kernel data structure used to manage distinct
regions of a process's address space. A VMA represents a homogeneous region in the
virtual memory of a process: a contiguous range of virtual addresses that have the same
permission flags and are backed up by the same object (a file, say, or swap space). It
corresponds loosely to the concept of a "segment," although it is better described as "a
memory object with its own properties." The memory map of a process is made up of (at
least) the following areas:
- An area for the program's executable code (often called text)
- Multiple areas for data, including initialized data (that which has an explicitly
assigned value at the beginning of execution), uninitialized data (BSS),
[3]
and the
program stack
[3]
The name BSS is a historical relic from an old assembly operator
meaning "block started by symbol." The BSS segment of
executable files isn't stored on disk, and the kernel maps the zero
page to the BSS address range.
- One area for each active memory mapping
The memory areas of a process can be seen by looking in /proc/<pid>/maps (in which pid, of course, is replaced by a process ID). /proc/self is a special case of /proc/pid, because it always refers to the current process. The fields in each line of the maps output are:
start-end perm offset major:minor inode image
Each field in /proc/*/maps (except the image name) corresponds to a field in struct
vm_area_struct:
start
end
The beginning and ending virtual addresses for this memory area.
perm
A bit mask with the memory area's read, write, and execute permissions. This field
describes what the process is allowed to do with pages belonging to the area. The
last character in the field is either p for "private" or s for "shared."
offset
Where the memory area begins in the file that it is mapped to. An offset of 0
means that the beginning of the memory area corresponds to the beginning of the
file.
major
minor
The major and minor numbers of the device holding the file that has been mapped.
Confusingly, for device mappings, the major and minor numbers refer to the disk
partition holding the device special file that was opened by the user, and not the
device itself.
inode
The inode number of the mapped file.
image
The name of the file (usually an executable image) that has been mapped.
3.1.6.1 The vm_area_struct structure
When a user-space process calls mmap to map device memory into its address space, the
system responds by creating a new VMA to represent that mapping. A driver that
supports mmap (and, thus, that implements the mmap method) needs to help that process
by completing the initialization of that VMA. The driver writer should, therefore, have at
least a minimal understanding of VMAs in order to support mmap.
Let's look at the most important fields in struct vm_area_struct (defined in <linux/mm.h>).
These fields may be used by device drivers in their mmap implementation. Note that the
kernel maintains lists and trees of VMAs to optimize area lookup, and several fields of
vm_area_struct are used to maintain this organization. Therefore, VMAs can't be created at
will by a driver, or the structures break. The main fields of VMAs are as follows (note the
similarity between these fields and the /proc output we just saw):
unsigned long vm_start; unsigned long vm_end;
The virtual address range covered by this VMA. These fields are the first two
fields shown in /proc/*/maps.
struct file *vm_file;
A pointer to the struct file structure associated with this area (if any).
unsigned long vm_pgoff;
The offset of the area in the file, in pages. When a file or device is mapped, this is
the file position of the first page mapped in this area.
unsigned long vm_flags;
A set of flags describing this area. The flags of the most interest to device driver
writers are VM_IO and VM_RESERVED. VM_IO marks a VMA as being a memory-
mapped I/O region. Among other things, the VM_IO flag prevents the region from
being included in process core dumps. VM_RESERVED tells the memory
management system not to attempt to swap out this VMA; it should be set in most
device mappings.
struct vm_operations_struct *vm_ops;
A set of functions that the kernel may invoke to operate on this memory area. Its
presence indicates that the memory area is a kernel "object," like the struct file we
have been using throughout the book.
void *vm_private_data;
A field that may be used by the driver to store its own information.
Like struct vm_area_struct, the vm_operations_struct is defined in <linux/mm.h>; it includes the
operations listed below. These operations are the only ones needed to handle the process's
memory needs, and they are listed in the order they are declared. Later in this chapter,
some of these functions are implemented.
void (*open)(struct vm_area_struct *vma);
The open method is called by the kernel to allow the subsystem implementing the
VMA to initialize the area. This method is invoked any time a new reference to
the VMA is made (when a process forks, for example). The one exception
happens when the VMA is first created by mmap; in this case, the driver's mmap
method is called instead.
void (*close)(struct vm_area_struct *vma);
When an area is destroyed, the kernel calls its close operation. Note that there's no
usage count associated with VMAs; the area is opened and closed exactly once by
each process that uses it.
struct page *(*nopage)(struct vm_area_struct *vma, unsigned long address, int *type);
When a process tries to access a page that belongs to a valid VMA, but that is
currently not in memory, the nopage method is called (if it is defined) for the
related area. The method returns the struct page pointer for the physical page after,
perhaps, having read it in from secondary storage. If the nopage method isn't
defined for the area, an empty page is allocated by the kernel.
int (*populate)(struct vm_area_struct *vm, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
This method allows the kernel to "prefault" pages into memory before they are
accessed by user space. There is generally no need for drivers to implement the
populate method.
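To make the open/close methods concrete, a driver might supply trivial implementations like the following sketch (2.6-era API; the function names are our own):

#include <linux/mm.h>
#include <linux/kernel.h>

static void simple_vma_open(struct vm_area_struct *vma)
{
    printk(KERN_NOTICE "VMA open, virt %lx, phys %lx\n",
           vma->vm_start, vma->vm_pgoff << PAGE_SHIFT);
}

static void simple_vma_close(struct vm_area_struct *vma)
{
    printk(KERN_NOTICE "VMA close\n");
}

static struct vm_operations_struct simple_vm_ops = {
    .open  = simple_vma_open,
    .close = simple_vma_close,
};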
UNIT IV
FILE SYSTEMS
1. FILE-SYSTEM INTERFACE
1.1 FILE CONCEPT
1.1.1 Definition
Contiguous logical address space
Types:
Data
numeric
character
binary
Program
1.1.2 File Structure
None - sequence of words, bytes
Simple record structure
Lines
Fixed length
Variable length
Complex Structures
Formatted document
Relocatable load file
Can simulate last two with first method by inserting appropriate control characters.
Who decides:
Operating system
Program
1.1.3 File Attributes
Name: the only information kept in human-readable form.
Type: needed for systems that support different types.
Location: pointer to file location on device.
Size: current file size.
Protection: controls who can do reading, writing, executing.
Time, date, and user identification: data for protection, security, and usage monitoring.
Information about files is kept in the directory structure, which is maintained on the disk.
1.1.4 File Operations
Create
Write
Read
Reposition within file (file seek)
Delete
Truncate
Open(Fi): search the directory structure on disk for entry Fi, and move the content of the entry to memory.
Close(Fi): move the content of entry Fi in memory to the directory structure on disk.
1.1.5 File Types: Name, Extension
1.2 ACCESS METHODS
Sequential Access
read next
write next
reset
no read after last write
(rewrite)
Direct Access
read n
write n
position to n
read next
write next
rewrite n
n = relative block number
Sequential-access File
Simulation of Sequential Access on a Direct-access File
Example of Index and Relative Files
1.3 DIRECTORY STRUCTURE
1.3.1 Introduction
A collection of nodes containing information about all files.
A Typical File-system Organization
1.3.2 Information in a Device Directory
Name
Type
Address
Current length
Maximum length
Date last accessed (for archival)
Date last updated (for dump)
Owner ID (who pays)
Protection information (discuss later)
1.3.3 Operations Performed on Directory
Search for a file
Create a file
Delete a file
List a directory
Rename a file
Traverse the file system
1.3.4 Organize the Directory (Logically) to Obtain
Efficiency: locating a file quickly.
Naming: convenient to users.
Two users can have the same name for different files.
The same file can have several different names.
Grouping: logical grouping of files by properties (e.g., all Java programs, all games, ...).
1.3.5 Single-Level Directory
A single directory for all users.
1.3.6 Two-Level Directory
Separate directory for each user.
1.3.7 Tree-Structured Directories
Efficient searching
Grouping Capability
Current directory (working directory)
cd /spell/mail/prog
type list
Absolute or relative path name
Creating a new file is done in current directory.
Delete a file
rm <file-name>
Creating a new subdirectory is done in current directory.
mkdir <dir-name>
Example: if in current directory /mail
mkdir count
1.3.8 Acyclic-Graph Directories
Have shared subdirectories and files.
Acyclic-Graph Directories
Two different names (aliasing)
If dict deletes list ⇒ dangling pointer.
Solutions:
Backpointers, so we can delete all pointers.
Variable size records a problem.
Backpointers using a daisy chain organization.
Entry-hold-count solution.
1.3.9 General Graph Directory
How do we guarantee no cycles?
Allow only links to file not subdirectories.
Garbage collection.
Every time a new link is added use a cycle detection
algorithm to determine whether it is OK.
1.4 FILE SYSTEM MOUNTING
A file system must be mounted before it can be accessed.
An unmounted file system (i.e., Fig. 11-11(b)) is mounted at a mount point.
(a) Existing. (b) Unmounted Partition
Mount Point
1.5 FILE SHARING
1.5.1 Introduction
Sharing of files on multi-user systems is desirable.
Sharing may be done through a protection scheme.
On distributed systems, files may be shared across a network.
Network File System (NFS) is a common distributed file-sharing method.
1.5.2 File Sharing Multiple Users
User IDs identify users, allowing permissions and protections to be per-user
Group IDs allow users to be in groups, permitting group access rights
1.5.3 File Sharing Remote File Systems
Uses networking to allow file system access between systems
Manually via programs like FTP
Automatically, seamlessly using distributed file systems
Semi automatically via the world wide web
Client-server model allows clients to mount remote file systems from servers
Server can serve multiple clients
Client and user-on-client identification is insecure or complicated
NFS is standard UNIX client-server file sharing protocol
CIFS is standard Windows protocol
Standard operating system file calls are translated into remote calls
Distributed Information Systems (distributed naming services) such as LDAP, DNS,
NIS, Active Directory implement unified access to information needed for remote
computing
1.5.4 File Sharing Failure Modes
Remote file systems add new failure modes, due to network failure, server failure
Recovery from failure can involve state information about status of each remote
request
Stateless protocols such as NFS include all information in each request, allowing
easy recovery but less security
1.5.5 File Sharing Consistency Semantics
Consistency semantics specify how multiple users are to access a shared file
simultaneously
Similar to Ch 7 process synchronization algorithms
Tend to be less complex due to disk I/O and network latency (for remote file systems).
Andrew File System (AFS) implemented complex remote file sharing
semantics
Unix file system (UFS) implements:
Writes to an open file visible immediately to other users of the same open file
Sharing file pointer to allow multiple users to read and write concurrently
AFS has session semantics
Writes only visible to sessions starting after the file is closed
1.6 PROTECTION
1.6.1 What is Protection?
File owner/creator should be able to control:
what can be done
by whom
1.6.2 Types of Access
Types of access
Read
Write
Execute
Append
Delete
List
1.6.3 Access Lists and Groups
Mode of access: read, write, execute
Three classes of users
                  R W X
a) owner access  7 ⇒ 1 1 1
b) group access  6 ⇒ 1 1 0
c) public access 1 ⇒ 0 0 1
Ask manager to create a group (unique name), say G, and add some users to the
group.
For a particular file (say game) or subdirectory, define an appropriate access:
chmod 761 game   (owner: rwx, group: rw-, public: --x)
Attach a group to a file:
chgrp G game
2. FILE SYSTEM IMPLEMENTATION
File System Structure
File System Implementation
Directory Implementation
Allocation Methods
Free-Space Management
Efficiency and Performance
Recovery
Log-Structured File Systems
NFS
2.1. FILE-SYSTEM STRUCTURE
File structure
Logical storage unit
Collection of related information
File system resides on secondary storage (disks).
File system organized into layers.
File control block: storage structure consisting of information about a file.
Layered File System
A Typical File Control Block
In-Memory File System Structures
The following figure illustrates the necessary file system structures provided by the
operating systems.
Figure 12-3(a) refers to opening a file.
Figure 12-3(b) refers to reading a file.
In-Memory File System Structures
Virtual File Systems
Virtual File Systems (VFS) provide an object-oriented way of implementing file
systems.
VFS allows the same system call interface (the API) to be used for different types of
file systems.
The API is to the VFS interface, rather than any specific type of file system.
2.2.DIRECTORY IMPLEMENTATION
Linear list of file names with pointers to the data blocks.
simple to program
time-consuming to execute
Hash Table: linear list with hash data structure.
decreases directory search time
collisions: situations where two file names hash to the same location
fixed size
2.3.ALLOCATION METHODS
An allocation method refers to how disk blocks are allocated for files:
Contiguous allocation
Linked allocation
Indexed allocation
2.3.1.Contiguous Allocation
Each file occupies a set of contiguous blocks on the disk.
Simple: only starting location (block #) and length (number of blocks) are required.
Random access.
Wasteful of space (dynamic storage-allocation problem).
Files cannot grow.
Contiguous Allocation of Disk Space
Extent-Based Systems
Many newer file systems (e.g., the Veritas File System) use a modified contiguous allocation scheme.
Extent-based file systems allocate disk blocks in extents.
An extent is a contiguous set of disk blocks. Extents are allocated for file allocation. A file consists of one or more extents.
2.3.2.Linked Allocation
Each file is a linked list of disk blocks: blocks may be scattered anywhere on the
disk.
Linked Allocation (Cont.)
Simple: need only the starting address.
Free-space management system: no waste of space.
No random access.
Mapping
Linked Allocation
File-Allocation Table
2.3.3.Indexed Allocation
Need index table
Random access
Dynamic access without external fragmentation, but have overhead of index block.
Mapping from logical to physical in a file of maximum size of 256K words and block
size of 512 words. We need only 1 block for index table.
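To see why one index block suffices (our arithmetic): 256K / 512 = 512 data blocks, so the index table needs 512 entries; at one word per entry, these fit exactly in a single 512-word block. For a logical address LA, Q = LA / 512 selects the index-table entry and R = LA mod 512 is the offset within the data block.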
Indexed Allocation Mapping
Mapping from logical to physical in a file of unbounded length (block size of 512
words).
Linked scheme: link blocks of the index table (no limit on size).
Two-level index (maximum file size is 512³).
2.4.FREE-SPACE MANAGEMENT
Bit vector (n blocks)
Free-Space Management (Cont.)
Bit map requires extra space. Example:
block size = 2¹² bytes
disk size = 2³⁰ bytes (1 gigabyte)
n = 2³⁰ / 2¹² = 2¹⁸ bits (or 32K bytes)
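A small C sketch of scanning the bit map for a free block (our own; it assumes the GCC/Clang __builtin_ctz intrinsic and the textbook convention that a 1 bit marks a free block):

static int first_free_block(const unsigned int map[], int nwords)
{
    for (int w = 0; w < nwords; w++)
        if (map[w] != 0)                        /* some 1 bit => a free block */
            return w * 32 + __builtin_ctz(map[w]); /* index of first 1 bit */
    return -1;                                  /* no free block */
}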
Easy to get contiguous files
Linked list (free list)
Cannot get contiguous space easily
No waste of space
Grouping
Counting
Need to protect:
Pointer to free list
Bit map
Must be kept on disk
Copy in memory and disk may differ.
Cannot allow for block[i] to have a situation where bit[i] = 1 in memory and
bit[i] = 0 on disk.
Solution:
Set bit[i] = 1 in disk.
Allocate block[i]
Set bit[i] = 1 in memory
2.5.EFFICIENCY AND PERFORMANCE
Efficiency dependent on:
disk allocation and directory algorithms
types of data kept in a file's directory entry
Performance
disk cache: separate section of main memory for frequently used blocks
free-behind and read-ahead: techniques to optimize sequential access
improve PC performance by dedicating a section of memory as a virtual disk, or RAM disk.
Various Disk-Caching Locations
Page Cache
A page cache caches pages rather than disk blocks using virtual memory techniques.
Memory-mapped I/O uses a page cache.
Routine I/O through the file system uses the buffer (disk) cache.
This leads to the following figure.
2.6.RECOVERY
Consistency checking: compares data in the directory structure with data blocks on disk, and tries to fix inconsistencies.
Use system programs to back up data from disk to another storage device (floppy
disk, magnetic tape).
Recover lost file or disk by restoring data from backup.
2.7.LOG STRUCTURED FILE SYSTEMS
Log structured (or journaling) file systems record each update to the file system as a
transaction.
All transactions are written to a log. A transaction is considered committed once it is
written to the log. However, the file system may not yet be updated.
The transactions in the log are asynchronously written to the file system. When the
file system is modified, the transaction is removed from the log.
If the file system crashes, all remaining transactions in the log must still be
performed.
The Sun Network File System (NFS)
An implementation and a specification of a software system for accessing remote
files across LANs (or WANs).
The implementation is part of the Solaris and SunOS operating systems running on
Sun workstations using an unreliable datagram protocol (UDP/IP) and Ethernet.
NFS (Cont.)
Interconnected workstations viewed as a set of independent machines with
independent file systems, which allows sharing among these file systems in a
transparent manner.
A remote directory is mounted over a local file system directory. The mounted
directory looks like an integral subtree of the local file system, replacing the
subtree descending from the local directory.
Specification of the remote directory for the mount operation is
nontransparent; the host name of the remote directory has to be provided.
Files in the remote directory can then be accessed in a transparent manner.
Subject to access-rights accreditation, potentially any file system (or directory
within a file system) can be mounted remotely on top of any local directory.
Three Independent File Systems
Mounting in NFS
NFS Mount Protocol
Establishes initial logical connection between server and client.
Mount operation includes name of remote directory to be mounted and name of
server machine storing it.
Mount request is mapped to corresponding RPC and forwarded to mount server
running on server machine.
Export list specifies local file systems that server exports for mounting, along
with names of machines that are permitted to mount them.
Following a mount request that conforms to its export list, the server returns a file
handle: a key for further accesses.
File handle: a file-system identifier and an inode number to identify the mounted
directory within the exported file system.
The mount operation changes only the user's view and does not affect the server
side.
NFS Protocol
Provides a set of remote procedure calls for remote file operations. The procedures
support the following operations:
searching for a file within a directory
reading a set of directory entries
manipulating links and directories
accessing file attributes
reading and writing files
NFS servers are stateless; each request has to provide a full set of arguments.
Modified data must be committed to the server's disk before results are returned to
the client (lose advantages of caching).
The NFS protocol does not provide concurrency-control mechanisms.
Three Major Layers of NFS Architecture
UNIX file-system interface (based on the open, read, write, and close calls, and file
descriptors).
Virtual File System (VFS) layer: distinguishes local files from remote ones, and
local files are further distinguished according to their file-system types.
The VFS activates file-system-specific operations to handle local requests
according to their file-system types.
Calls the NFS protocol procedures for remote requests.
NFS service layer: the bottom layer of the architecture; implements the NFS protocol.
Schematic View of NFS Architecture
NFS Path-Name Translation
Performed by breaking the path into component names and performing a separate
NFS lookup call for every pair of component name and directory vnode.
To make lookup faster, a directory name lookup cache on the client's side holds the
vnodes for remote directory names.
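The component-at-a-time translation can be sketched as below; struct vnode and nfs_lookup are hypothetical stand-ins for the client's vnode type and LOOKUP RPC stub, not a real NFS client API:

    /* Sketch of NFS path-name translation: one LOOKUP call per path
       component. struct vnode and nfs_lookup() are hypothetical. */
    #include <stdio.h>
    #include <string.h>

    struct vnode { const char *name; };

    /* stub: a real client would send a LOOKUP RPC to the server here */
    static struct vnode *nfs_lookup(struct vnode *dir, const char *component)
    {
        printf("LOOKUP %s in %s\n", component, dir->name);
        return dir;
    }

    struct vnode *translate(struct vnode *root, char *path)
    {
        struct vnode *v = root;
        for (char *c = strtok(path, "/"); c && v; c = strtok(NULL, "/"))
            v = nfs_lookup(v, c);          /* one separate call per component */
        return v;
    }

    int main(void)
    {
        struct vnode root = { "/" };
        char path[] = "usr/local/bin";
        translate(&root, path);
        return 0;
    }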
NFS Remote Operations
Nearly one-to-one correspondence between regular UNIX system calls and the NFS
protocol RPCs (except opening and closing files).
NFS adheres to the remote-service paradigm, but employs buffering and caching
techniques for the sake of performance.
3. CASE STUDIES
3.1. THE LINUX SYSTEM
History
Design Principles
Kernel Modules
Process Management
Scheduling
Memory Management
File Systems
Input and Output
Interprocess Communication
Network Structure
Security
History
Linux is a modern, free operating system based on UNIX standards.
First developed as a small but self-contained kernel in 1991 by Linus Torvalds, with
the major design goal of UNIX compatibility.
Its history has been one of collaboration by many users from all around the world,
corresponding almost exclusively over the Internet.
It has been designed to run efficiently and reliably on common PC hardware, but
also runs on a variety of other platforms.
The core Linux operating system kernel is entirely original, but it can run much
existing free UNIX software, resulting in an entire UNIX-compatible operating
system free from proprietary code.
The Linux Kernel
Version 0.01 (May 1991) had no networking, ran only on 80386-compatible Intel
processors and on PC hardware, had extremely limited device-driver support, and
supported only the Minix file system.
Linux 1.0 (March 1994) included these new features:
Support for UNIX's standard TCP/IP networking protocols
BSD-compatible socket interface for network programming
Device-driver support for running IP over an Ethernet
Enhanced file system
Support for a range of SCSI controllers for
high-performance disk access
Extra hardware support
Version 1.2 (March 1995) was the final PC-only Linux kernel.
Linux 2.0
Released in June 1996, 2.0 added two major new capabilities:
Support for multiple architectures, including a fully 64-bit native Alpha port.
Support for multiprocessor architectures
Other new features included:
Improved memory-management code
Improved TCP/IP performance
Support for internal kernel threads, for handling dependencies between
loadable modules, and for automatic loading of modules on demand.
Standardized configuration interface
Available for Motorola 68000-series processors, Sun Sparc systems, and for PC and
PowerMac systems.
The Linux System
Linux uses many tools developed as part of Berkeley's BSD operating system, MIT's
X Window System, and the Free Software Foundation's GNU project.
The main system libraries were started by the GNU project, with improvements
provided by the Linux community.
Linux networking-administration tools were derived from 4.3BSD code; recent BSD
derivatives such as FreeBSD have borrowed code from Linux in return.
The Linux system is maintained by a loose network of developers collaborating over
the Internet, with a small number of public ftp sites acting as de facto standard
repositories.
Linux Distributions
Standard, precompiled sets of packages, or distributions, include the basic Linux
system, system installation and management utilities, and ready-to-install packages
of common UNIX tools.
The first distributions managed these packages by simply providing a means of
unpacking all the files into the appropriate places; modern distributions include
advanced package management.
Early distributions included SLS and Slackware. Red Hat and Debian are popular
distributions from commercial and noncommercial sources, respectively.
The RPM Package file format permits compatibility among the various Linux
distributions.
Linux Licensing
The Linux kernel is distributed under the GNU General Public License (GPL), the
terms of which are set out by the Free Software Foundation.
Anyone using Linux, or creating their own derivative of Linux, may not make the
derived product proprietary; software released under the GPL may not be
redistributed as a binary-only product.
Design Principles
Linux is a multiuser, multitasking system with a full set of UNIX-compatible tools.
Its file system adheres to traditional UNIX semantics, and it fully implements the
standard UNIX networking model.
Main design goals are speed, efficiency, and standardization.
Linux is designed to be compliant with the relevant POSIX documents; at least two
Linux distributions have achieved official POSIX certification.
The Linux programming interface adheres to the SVR4 UNIX semantics, rather
than to BSD behavior.
Components of a Linux System
Components of a Linux System (Cont.)
Like most UNIX implementations, Linux is composed of three main bodies of code;
the most important distinction is between the kernel and all other components.
The kernel is responsible for maintaining the important abstractions of the operating
system.
Kernel code executes in kernel mode with full access to all the physical
resources of the computer.
All kernel code and data structures are kept in the same single address space.
Components of a Linux System (Cont.)
The system libraries define a standard set of functions through which applications
interact with the kernel, and which implement much of the operating-system
functionality that does not need the full privileges of kernel code.
The system utilities perform individual specialized management tasks.
Kernel Modules
Sections of kernel code that can be compiled, loaded, and unloaded independent of
the rest of the kernel.
A kernel module may typically implement a device driver, a file system, or a
networking protocol.
The module interface allows third parties to write and distribute, on their own terms,
device drivers or file systems that could not be distributed under the GPL.
Kernel modules allow a Linux system to be set up with a standard, minimal kernel,
without any extra device drivers built in.
Three components to Linux module support:
module management
driver registration
conflict resolution
Module Management
Supports loading modules into memory and letting them talk to the rest of the kernel.
Module loading is split into two separate sections:
Managing sections of module code in kernel memory
Handling symbols that modules are allowed to reference
The module requestor manages loading requested, but currently unloaded, modules;
it also regularly queries the kernel to see whether a dynamically loaded module is
still in use, and will unload it when it is no longer actively needed.
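A minimal loadable-module sketch using the standard module_init/module_exit entry points (it must be built against kernel headers with kbuild; shown only to illustrate the load/unload hooks the module manager invokes):

    /* Minimal Linux loadable-module sketch. */
    #include <linux/module.h>
    #include <linux/init.h>

    static int __init hello_init(void)
    {
        printk(KERN_INFO "module loaded\n");
        return 0;                      /* nonzero would abort the load */
    }

    static void __exit hello_exit(void)
    {
        printk(KERN_INFO "module unloaded\n");
    }

    module_init(hello_init);           /* called at insmod/modprobe time */
    module_exit(hello_exit);           /* called at rmmod time */
    MODULE_LICENSE("GPL");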
Driver Registration
Allows modules to tell the rest of the kernel that a new driver has become available.
The kernel maintains dynamic tables of all known drivers, and provides a set of
routines to allow drivers to be added to or removed from these tables at any time.
Registration tables include the following items:
Device drivers
File systems
Network protocols
Binary format
Conflict Resolution
A mechanism that allows different device drivers to reserve hardware resources and
to protect those resources from accidental use by another driver
The conflict resolution module aims to:
Prevent modules from clashing over access to hardware resources
Prevent autoprobes from interfering with existing device drivers
Resolve conflicts with multiple drivers trying to access the same hardware
Process Management
UNIX process management separates the creation of processes and the running of a
new program into two distinct operations.
The fork system call creates a new process.
A new program is run after a call to execve.
Under UNIX, a process encompasses all the information that the operating system
must maintain to track the context of a single execution of a single program.
Under Linux, process properties fall into three groups: the process's identity,
environment, and context.
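A small sketch of the two-step model in C ("/bin/ls" is just an illustrative target program):

    /* Sketch: fork() creates the process; execve() then replaces its
       image with a new program. */
    #include <unistd.h>
    #include <sys/wait.h>
    #include <stdio.h>

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {                               /* child */
            char *argv[] = { "ls", "-l", NULL };
            char *envp[] = { "PATH=/bin:/usr/bin", NULL };
            execve("/bin/ls", argv, envp);
            perror("execve");                         /* reached only on failure */
            _exit(1);
        }
        waitpid(pid, NULL, 0);                        /* parent waits */
        return 0;
    }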
Process Identity
Process ID (PID). The unique identifier for the process; used to specify processes to
the operating system when an application makes a system call to signal, modify, or
wait for another process.
Credentials. Each process must have an associated user ID and one or more group
IDs that determine the process's rights to access system resources and files.
Personality. Not traditionally found on UNIX systems, but under Linux each
process has an associated personality identifier that can slightly modify the
semantics of certain system calls.
Used primarily by emulation libraries to request that system calls be compatible with
certain specific flavors of UNIX.
Process Environment
The process's environment is inherited from its parent, and is composed of two null-
terminated vectors:
The argument vector lists the command-line arguments used to invoke the
running program; conventionally starts with the name of the program itself
The environment vector is a list of NAME=VALUE pairs that associates
named environment variables with arbitrary textual values.
Passing environment variables among processes and inheriting variables by a
process's children are flexible means of passing information to components of the
user-mode system software.
The environment-variable mechanism provides a customization of the operating
system that can be set on a per-process basis, rather than being configured for the
system as a whole.
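Both vectors are visible to a new program; the three-argument form of main used below is a common, though non-standard-C, extension for receiving the environment vector:

    /* Sketch: the two null-terminated vectors as seen by a new program. */
    #include <stdio.h>

    int main(int argc, char *argv[], char *envp[])
    {
        for (int i = 0; i < argc; i++)        /* argument vector */
            printf("arg[%d] = %s\n", i, argv[i]);
        for (int i = 0; envp[i] != NULL; i++) /* NAME=VALUE pairs */
            printf("env[%d] = %s\n", i, envp[i]);
        return 0;
    }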
Process Context
The (constantly changing) state of a running program at any point in time.
The scheduling context is the most important part of the process context; it is the
information that the scheduler needs to suspend and restart the process.
The kernel maintains accounting information about the resources currently being
consumed by each process, and the total resources consumed by the process in its
lifetime so far.
The file table is an array of pointers to kernel file structures. When making file I/O
system calls, processes refer to files by their index into this table.
Process Context (Cont.)
Whereas the file table lists the existing open files, the
file-system context applies to requests to open new files. The current root and default
directories to be used for new file searches are stored here.
The signal-handler table defines the routine in the process's address space to be
called when specific signals arrive.
The virtual-memory context of a process describes the full contents of its private
address space.
Processes and Threads
Linux uses the same internal representation for processes and threads; a thread is
simply a new process that happens to share the same address space as its parent.
A distinction is only made when a new thread is created by the clone system call.
fork creates a new process with its own entirely new process context
clone creates a new process with its own identity, but that is allowed to share
the data structures of its parent
Using clone gives an application fine-grained control over exactly what is shared
between two threads.
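A Linux-specific sketch of clone with CLONE_VM, showing that the child's write is visible to the parent because the address space is shared (a real thread library passes several more flags):

    /* Sketch: clone() sharing the parent's address space. Linux-only. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    static int shared = 0;

    static int worker(void *arg)
    {
        (void)arg;
        shared = 42;                       /* visible to parent: VM is shared */
        return 0;
    }

    int main(void)
    {
        char *stack = malloc(64 * 1024);
        int pid = clone(worker, stack + 64 * 1024,  /* stack grows down */
                        CLONE_VM | SIGCHLD, NULL);
        waitpid(pid, NULL, 0);
        printf("shared = %d\n", shared);   /* prints 42 */
        free(stack);
        return 0;
    }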
Scheduling
The job of allocating CPU time to different tasks within an operating system.
While scheduling is normally thought of as the running and interrupting of
processes, in Linux, scheduling also includes the running of the various kernel
tasks.
Running kernel tasks encompasses both tasks that are requested by a running
process and tasks that execute internally on behalf of a device driver.
Kernel Synchronization
A request for kernel-mode execution can occur in two ways:
A running program may request an operating system service, either explicitly
via a system call, or implicitly, for example, when a page fault occurs.
A device driver may deliver a hardware interrupt that causes the CPU to start
executing a kernel-defined handler for that interrupt.
Kernel synchronization requires a framework that will allow the kernel's critical
sections to run without interruption by another critical section.
Kernel Synchronization (Cont.)
Linux uses two techniques to protect critical sections:
1. Normal kernel code is nonpreemptible: when a timer interrupt is received while a
process is executing a kernel system service routine, the kernel's need_resched flag
is set so that the scheduler will run once the system call has completed and control
is about to be returned to user mode.
2. The second technique applies to critical sections that occur in interrupt
service routines.
By using the processors interrupt control hardware to disable interrupts
during a critical section, the kernel guarantees that it can proceed without the risk
of concurrent access of shared data structures.
Kernel Synchronization (Cont.)
To avoid performance penalties, Linux's kernel uses a synchronization architecture
that allows long critical sections to run without having interrupts disabled for the
critical section's entire duration.
Interrupt service routines are separated into a top half and a bottom half.
The top half is a normal interrupt service routine, and runs with recursive
interrupts disabled.
The bottom half is run, with all interrupts enabled, by a miniature scheduler
that ensures that bottom halves never interrupt themselves.
This architecture is completed by a mechanism for disabling selected bottom
halves while executing normal, foreground kernel code.
Interrupt Protection Levels
Each level may be interrupted by code running at a higher level, but will never be
interrupted by code running at the same or a lower level.
User processes can always be preempted by another process when a time-sharing
scheduling interrupt occurs.
Process Scheduling
Linux uses two process-scheduling algorithms:
A time-sharing algorithm for fair preemptive scheduling between multiple
processes
A real-time algorithm for tasks where absolute priorities are more important
than fairness
A process's scheduling class defines which algorithm to apply.
For time-sharing processes, Linux uses a prioritized, credit-based algorithm.
The crediting rule factors in both the process's history and its priority.
This crediting system automatically prioritizes interactive or I/O-bound
processes.
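For reference, the recrediting rule in this scheduler is commonly given as credits = credits/2 + priority, so a process that sleeps and keeps unspent credits ends up with more credits, and hence higher effective priority, than a CPU-bound process of the same static priority.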
Process Scheduling (Cont.)
Linux implements the FIFO and round-robin real-time scheduling classes; in both
cases, each process has a priority in addition to its scheduling class.
The scheduler runs the process with the highest priority; for equal-priority
processes, it runs the process that has been waiting the longest.
FIFO processes continue to run until they either exit or block.
A round-robin process will be preempted after a while and moved to the end of
the scheduling queue, so that round-robin processes of equal priority
automatically time-share between themselves.
Symmetric Multiprocessing
Linux 2.0 was the first Linux kernel to support SMP hardware; separate processes
or threads can execute in parallel on separate processors.
To preserve the kernel's nonpreemptible synchronization requirements, SMP
imposes the restriction, via a single kernel spinlock, that only one processor at a time
may execute kernel-mode code.
Memory Management
Linux's physical memory-management system deals with allocating and freeing
pages, groups of pages, and small blocks of memory.
It has additional mechanisms for handling virtual memory, memory mapped into the
address space of running processes.
Splitting of Memory in a Buddy Heap
Managing Physical Memory
The page allocator allocates and frees all physical pages; it can allocate ranges of
physically-contiguous pages on request.
The allocator uses a buddy-heap algorithm to keep track of available physical
pages.
Each allocatable memory region is paired with an adjacent partner.
Whenever two allocated partner regions are both freed up they are combined
to form a larger region.
If a small memory request cannot be satisfied by allocating an existing small
free region, then a larger free region will be subdivided into two partners to
satisfy the request.
Memory allocations in the Linux kernel occur either statically (drivers reserve a
contiguous area of memory during system boot time) or dynamically (via the page
allocator).
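The buddy rule itself reduces to address arithmetic: with power-of-two region sizes, a region's partner differs from it in exactly one address bit. A tiny illustrative sketch (not the kernel's actual allocator):

    /* Sketch of the buddy pairing rule only. */
    #include <stdio.h>

    /* Buddy of the region at 'offset' with size 'size' (both powers of 2). */
    unsigned long buddy_of(unsigned long offset, unsigned long size)
    {
        return offset ^ size;      /* flip the bit that 'size' occupies */
    }

    int main(void)
    {
        /* a 16 KB region at offset 48 KB pairs with the one at 32 KB */
        printf("%lu\n", buddy_of(48 * 1024, 16 * 1024)); /* prints 32768 */
        return 0;
    }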
Virtual Memory
The VM system maintains the address space visible to each process: It creates pages
of virtual memory on demand, and manages the loading of those pages from disk or
their swapping back out to disk as required.
The VM manager maintains two separate views of a process's address space:
A logical view describing instructions concerning the layout of the address
space.
The address space consists of a set of nonoverlapping regions, each
representing a contiguous, page-aligned subset of the address space.
A physical view of each address space which is stored in the hardware page
tables for the process.
Virtual Memory (Cont.)
Virtual memory regions are characterized by:
The backing store, which describes from where the pages for a region come;
regions are usually backed by a file or by nothing (demand-zero memory)
The region's reaction to writes (page sharing or copy-on-write).
The kernel creates a new virtual address space
1. When a process runs a new program with the exec system call
2. Upon creation of a new process by the fork system call
The Linux kernel reserves a constant, architecture-dependent region of the virtual
address space of every process for its own internal use.
This kernel virtual-memory area contains two regions:
A static area that contains page table references to every available physical
page of memory in the system, so that there is a simple translation from
physical to virtual addresses when running kernel code.
The remainder of the reserved section is not reserved for any specific purpose;
its page-table entries can be modified to point to any other areas of memory.
Executing and Loading User Programs
Linux maintains a table of functions for loading programs; it gives each function
the opportunity to try loading the given file when an exec system call is made.
The registration of multiple loader routines allows Linux to support both the ELF
and a.out binary formats.
Initially, binary-file pages are mapped into virtual memory; only when a program
tries to access a given page will a page fault result in that page being loaded into
physical memory.
An ELF-format binary file consists of a header followed by several page-aligned
sections; the ELF loader works by reading the header and mapping the sections of
the file into separate regions of virtual memory.
Memory Layout for ELF Programs
Static and Dynamic Linking
A program whose necessary library functions are embedded directly in the
program's executable binary file is statically linked to its libraries.
The main disadvantage of static linkage is that every program generated must
contain copies of exactly the same common system library functions.
Dynamic linking is more efficient in terms of both physical memory and disk-space
usage because it loads the system libraries into memory only once.
FILE SYSTEMS
To the user, Linux's file system appears as a hierarchical directory tree obeying
UNIX semantics.
Internally, the kernel hides implementation details and manages the multiple
different file systems via an abstraction layer, that is, the virtual file system (VFS).
The Linux VFS is designed around object-oriented principles and is composed of
two components:
A set of definitions that define what a file object is allowed to look like
The inode-object and the file-object structures represent individual files
the file system object represents an entire file system
A layer of software to manipulate those objects.
The Linux Ext2fs File System
Ext2fs uses a mechanism similar to that of BSD Fast File System (ffs) for locating
data blocks belonging to a specific file.
The main differences between ext2fs and ffs concern their disk allocation policies.
In ffs, the disk is allocated to files in blocks of 8 KB, with blocks being
subdivided into fragments of 1 KB to store small files or partially filled blocks
at the end of a file.
Ext2fs does not use fragments; it performs its allocations in smaller units. The
default block size on ext2fs is 1 KB, although 2 KB and 4 KB blocks are also
supported.
Ext2fs uses allocation policies designed to place logically adjacent blocks of a
file into physically adjacent blocks on disk, so that it can submit an I/O request
for several disk blocks as a single operation.
Ext2fs Block-Allocation Policies
The Linux Proc File System
The proc file system does not store data; rather, its contents are computed on
demand according to user file I/O requests.
proc must implement a directory structure and the file contents within; it must then
define a unique and persistent inode number for each of the directories and files it contains.
It uses this inode number to identify just what operation is required when a
user tries to read from a particular file inode or perform a lookup in a
particular directory inode.
When data is read from one of these files, proc collects the appropriate
information, formats it into text form and places it into the requesting
process's read buffer.
Input and Output
The Linux device-oriented file system accesses disk storage through two caches:
Data is cached in the page cache, which is unified with the virtual memory
system
Metadata is cached in the buffer cache, a separate cache indexed by the
physical disk block.
Linux splits all devices into three classes:
block devices allow random access to completely independent, fixed-size blocks
of data
character devices include most other devices; they don't need to support the
functionality of regular files
network devices are interfaced via the kernel's networking subsystem
Device-Driver Block Structure
Block Devices
Provide the main interface to all disk devices in a system.
The block buffer cache serves two main purposes:
it acts as a pool of buffers for active I/O
it serves as a cache for completed I/O
The request manager manages the reading and writing of buffer contents to and
from a block device driver.
Character Devices
A device driver which does not offer random access to fixed blocks of data.
A character device driver must register a set of functions which implement the
driver's various file I/O operations.
The kernel performs almost no preprocessing of a file read or write request to a
character device, but simply passes on the request to the device.
The main exception to this rule is the special subset of character device drivers
which implement terminal devices, for which the kernel maintains a standard
interface.
Interprocess Communication
Like UNIX, Linux informs processes that an event has occurred via signals.
There is a limited number of signals, and they cannot carry information: Only the
fact that a signal occurred is available to a process.
The Linux kernel does not use signals to communicate with processes that are
running in kernel mode; rather, communication within the kernel is accomplished
via scheduling states and wait_queue structures.
Passing Data Between Processes
The pipe mechanism allows a child process to inherit a communication channel to
its parent; data written to one end of the pipe can be read at the other.
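A minimal pipe sketch in C: the child inherits the channel across fork, and bytes written at one end arrive in order at the other:

    /* Sketch: a pipe inherited across fork. */
    #include <unistd.h>
    #include <stdio.h>
    #include <sys/wait.h>

    int main(void)
    {
        int fd[2];
        char buf[32];
        pipe(fd);                                  /* fd[0]=read, fd[1]=write */
        if (fork() == 0) {                         /* child inherits both ends */
            close(fd[0]);
            write(fd[1], "hello", 6);
            _exit(0);
        }
        close(fd[1]);
        read(fd[0], buf, sizeof buf);
        printf("parent read: %s\n", buf);
        wait(NULL);
        return 0;
    }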
Shared memory offers an extremely fast way of communicating; any data written by
one process to a shared memory region can be read immediately by any other
process that has mapped that region into its address space.
To obtain synchronization, however, shared memory must be used in conjunction
with another interprocess-communication mechanism.
Shared Memory Object
The shared-memory object acts as a backing store for shared-memory regions in the
same way as a file can act as backing store for a memory-mapped region.
Shared-memory mappings direct page faults to map in pages from a persistent
shared-memory object.
Shared-memory objects remember their contents even if no processes are currently
mapping them into virtual memory.
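A sketch using an anonymous shared mapping as the shared-memory region; as noted above, real synchronization must come from another mechanism (the wait here only orders the final read):

    /* Sketch: an anonymous shared mapping between parent and child. */
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        int *p = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        *p = 0;
        if (fork() == 0) {          /* child writes into the shared page */
            *p = 99;
            _exit(0);
        }
        wait(NULL);                 /* crude ordering, not real synchronization */
        printf("parent sees %d\n", *p);  /* prints 99 */
        munmap(p, sizeof(int));
        return 0;
    }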
Network Structure
Networking is a key area of functionality for Linux.
It supports the standard Internet protocols for UNIX-to-UNIX
communications.
It also implements protocols native to non-UNIX operating systems, in
particular, protocols used on PC networks, such as AppleTalk and IPX.
Internally, networking in the Linux kernel is implemented by three layers of
software:
The socket interface
Protocol drivers
Network device drivers
Network Structure (Cont.)
The most important set of protocols in the Linux networking system is the internet
protocol suite.
It implements routing between different hosts anywhere on the network.
On top of the routing protocol are built the UDP, TCP and ICMP protocols.
Security
The pluggable authentication modules (PAM) system is available under Linux.
PAM is based on a shared library that can be used by any system component that
needs to authenticate users.
Access control under UNIX systems, including Linux, is performed through the use
of unique numeric identifiers (uid and gid).
Access control is performed by assigning objects a protection mask, which specifies
which access modes (read, write, or execute) are to be granted to processes with
owner, group, or world access.
Security
Linux augments the standard UNIX setuid mechanism in two ways:
It implements the POSIX specification's saved user-id mechanism, which
allows a process to repeatedly drop and reacquire its effective uid.
It has added a process characteristic that grants just a subset of the rights of
the effective uid.
Linux provides another mechanism that allows a client to selectively pass access to a
single file to some server process without granting it any other privileges.
3.2.WINDOWS 2000
History
Design Principles
System Components
Environmental Subsystems
File system
Networking
Programmer Interface
Windows 2000
32-bit preemptive multitasking operating system for Intel microprocessors.
Key goals for the system:
portability
security
POSIX compliance
multiprocessor support
extensibility
international support
compatibility with MS-DOS and MS-Windows applications.
Uses a micro-kernel architecture.
Available in four versions: Professional, Server, Advanced Server, and Datacenter Server.
In 1996, more NT server licenses were sold than UNIX licenses.
History
In 1988, Microsoft decided to develop a new technology (NT) portable operating
system that supported both the OS/2 and POSIX APIs.
Originally, NT was supposed to use the OS/2 API as its native environment but
during development NT was changed to use the Win32 API, reflecting the popularity
of Windows 3.0.
Design Principles
Extensibility: layered architecture.
Executive, which runs in protected mode, provides the basic system services.
On top of the executive, several server subsystems operate in user mode.
Modular structure allows additional environmental subsystems to be added
without affecting the executive.
Portability: 2000 can be moved from one hardware architecture to another with
relatively few changes.
Written in C and C++.
Processor-dependent code is isolated in a dynamic link library (DLL) called
the hardware abstraction layer (HAL).
Design Principles (Cont.)
Reliability: 2000 uses hardware protection for virtual memory, and software
protection mechanisms for operating system resources.
Compatibility: applications that follow the IEEE 1003.1 (POSIX) standard can be
compiled to run on 2000 without changing the source code.
Performance: 2000 subsystems can communicate with one another via high-
performance message passing.
Preemption of low priority threads enables the system to respond quickly to
external events.
Designed for symmetrical multiprocessing
2000 Architecture
Layered system of modules.
Protected mode: HAL, kernel, executive.
User mode: collection of subsystems
Environmental subsystems emulate different operating systems.
Protection subsystems provide security functions.
Depiction of 2000 Architecture
System Components Kernel
Foundation for the executive and the subsystems.
Never paged out of memory; execution is never preempted.
Four main responsibilities:
thread scheduling
interrupt and exception handling
low-level processor synchronization
recovery after a power failure
Kernel is object-oriented; uses two sets of objects:
dispatcher objects control dispatching and synchronization (events, mutants,
mutexes, semaphores, threads and timers).
control objects (asynchronous procedure calls, interrupts, power notify, power
status, process and profile objects.)
Kernel Process and Threads
The process has a virtual memory address space, information (such as a base
priority), and an affinity for one or more processors.
Threads are the unit of execution scheduled by the kernel's dispatcher.
Each thread has its own state, including a priority, processor affinity, and
accounting information.
A thread can be one of six states: ready, standby, running, waiting, transition, and
terminated.
Kernel Scheduling
The dispatcher uses a 32-level priority scheme to determine the order of thread
execution. Priorities are divided into two classes:
The real-time class contains threads with priorities ranging from 16 to 31.
The variable class contains threads having priorities from 0 to 15.
Characteristics of 2000's priority strategy:
Tends to give very good response times to interactive threads that are using
the mouse and windows.
Enables I/O-bound threads to keep the I/O devices busy.
Compute-bound threads soak up the spare CPU cycles in the background.
Kernel Scheduling (Cont.)
Scheduling can occur when a thread enters the ready or wait state, when a thread
terminates, or when an application changes a thread's priority or processor affinity.
Real-time threads are given preferential access to the CPU; but 2000 does not
guarantee that a real-time thread will start to execute within any particular time
limit. (This is known as soft real time.)
Windows 2000 Interrupt Request Levels
Kernel Trap Handling
The kernel provides trap handling when exceptions and interrupts are generated by
hardware or software.
Exceptions that cannot be handled by the trap handler are handled by the kernel's
exception dispatcher.
The interrupt dispatcher in the kernel handles interrupts by calling either an
interrupt service routine (such as in a device driver) or an internal kernel routine.
The kernel uses spin locks that reside in global memory to achieve multiprocessor
mutual exclusion.
Executive Object Manager
2000 uses objects for all its services and entities; the object manager supervises
use of all the objects.
Generates an object handle
Checks security.
Keeps track of which processes are using each object.
Objects are manipulated by a standard set of methods, namely create, open, close,
delete, query name, parse and security.
Executive Naming Objects
The 2000 executive allows any object to be given a name, which may be either
permanent or temporary.
Object names are structured like file path names in MS-DOS and UNIX.
2000 implements a symbolic link object, which is similar to symbolic links in UNIX
that allow multiple nicknames or aliases to refer to the same file.
A process gets an object handle by creating an object, by opening an existing one, by
receiving a duplicated handle from another process, or by inheriting a handle from a
parent process.
Each object is protected by an access control list.
Executive Virtual Memory Manager
The design of the VM manager assumes that the underlying hardware supports
virtual-to-physical mapping, a paging mechanism, transparent cache coherence on
multiprocessor systems, and virtual address aliasing.
The VM manager in 2000 uses a page-based management scheme with a page size
of 4 KB.
The 2000 VM manager uses a two-step process to allocate memory.
The first step reserves a portion of the process's address space.
The second step commits the allocation by assigning space in the 2000 paging
file.
Virtual-Memory Layout
Virtual Memory Manager (Cont.)
The virtual address translation in 2000 uses several data structures.
Each process has a page directory that contains 1024 page directory entries of
size 4 bytes.
Each page directory entry points to a page table which contains 1024 page table
entries (PTEs) of size 4 bytes.
Each PTE points to a 4 KB page frame in physical memory.
A 10-bit integer can represent all the values from 0 to 1023, and can therefore select
any entry in the page directory, or in a page table.
This property is used when translating a virtual address pointer to a byte address in
physical memory.
A page can be in one of six states: valid, zeroed, free, standby, modified, and bad.
Virtual-to-Physical Address Translation
10 bits for the page-directory entry, 10 bits for the page-table entry, and 12 bits
for the byte offset in the page.
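The 10 + 10 + 12 split can be shown directly in C (illustrative values only):

    /* Sketch: decomposing a 32-bit virtual address into page-directory
       index, page-table index, and byte offset. */
    #include <stdio.h>

    int main(void)
    {
        unsigned va = 0x12345678;
        unsigned dir    = (va >> 22) & 0x3FF;   /* page-directory index  */
        unsigned table  = (va >> 12) & 0x3FF;   /* page-table index      */
        unsigned offset =  va        & 0xFFF;   /* byte within 4 KB page */
        printf("dir=%u table=%u offset=0x%X\n", dir, table, offset);
        return 0;
    }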
Page File Page-Table Entry
Executive Process Manager
Provides services for creating, deleting, and using threads and processes.
Issues such as parent/child relationships or process hierarchies are left to the
particular environmental subsystem that owns the process.
Executive Local Procedure Call Facility
The LPC passes requests and results between client and server processes within a
single machine.
In particular, it is used to request services from the various 2000 subsystems.
When an LPC channel is created, one of three types of message passing techniques
must be specified.
First type is suitable for small messages, up to 256 bytes; port's message queue
is used as intermediate storage, and the messages are copied from one process
to the other.
Second type avoids copying large messages by pointing to a shared memory
section object created for the channel.
Third method, called quick LPC, is used by the graphical display portions of the
Win32 subsystem.
Executive I/O Manager
The I/O manager is responsible for
file systems
cache management
device drivers
network drivers
Keeps track of which installable file systems are loaded, and manages buffers for I/O
requests.
Works with VM Manager to provide memory-mapped file I/O.
Controls the 2000 cache manager, which handles caching for the entire I/O system.
Supports both synchronous and asynchronous operations, provides timeouts for
drivers, and has mechanisms for one driver to call another.
File I/O
Executive Security Reference Monitor
The object-oriented nature of 2000 enables the use of a uniform mechanism to
perform runtime access validation and audit checks for every entity in the system.
Whenever a process opens a handle to an object, the security reference monitor
checks the process's security token and the object's access control list to see whether
the process has the necessary rights.
Executive Plug-and-Play Manager
Plug-and-Play (PnP) manager is used to recognize and adapt to changes in the
hardware configuration.
When new devices are added (for example, PCI or USB), the PnP manager loads the
appropriate driver.
The manager also keeps track of the resources used by each device.
Environmental Subsystems
User-mode processes layered over the native 2000 executive services to enable 2000
to run programs developed for other operating systems.
2000 uses the Win32 subsystem as the main operating environment; Win32 is used
to start all processes. It also provides all the keyboard, mouse and graphical display
capabilities.
MS-DOS environment is provided by a Win32 application called the virtual DOS
machine (VDM), a user-mode process that is paged and dispatched like any other
2000 thread.
Environmental Subsystems (Cont.)
16-Bit Windows Environment:
Provided by a VDM that incorporates Windows on Windows.
Provides the Windows 3.1 kernel routines and stub routines for window-manager
and GDI functions.
The POSIX subsystem is designed to run POSIX applications following the POSIX.1
standard which is based on the UNIX model.
Environmental Subsystems (Cont.)
The OS/2 subsystem runs OS/2 applications.
Logon and security subsystems authenticate users logging on to Windows 2000
systems. Users are required to have account names and passwords.
- The authentication package authenticates users whenever they attempt to
access an object in the system. Windows 2000 uses Kerberos as the default
authentication package.
FILE SYSTEM
The fundamental structure of the 2000 file system (NTFS) is a volume.
Created by the 2000 disk administrator utility.
Based on a logical disk partition.
May occupy a portion of a disk, an entire disk, or span several disks.
All metadata, such as information about the volume, is stored in a regular file.
NTFS uses clusters as the underlying unit of disk allocation.
A cluster is a number of disk sectors that is a power of two.
Because the cluster size is smaller than for the 16-bit FAT file system, the
amount of internal fragmentation is reduced.
File System Internal Layout
NTFS uses logical cluster numbers (LCNs) as disk addresses.
A file in NTFS is not a simple byte stream, as in MS-DOS or UNIX; rather, it is a
structured object consisting of attributes.
Every file in NTFS is described by one or more records in an array stored in a
special file called the Master File Table (MFT).
Each file on an NTFS volume has a unique ID called a file reference.
64-bit quantity that consists of a 48-bit file number and a 16-bit sequence
number.
Can be used to perform internal consistency checks.
The NTFS name space is organized by a hierarchy of directories; the index root
contains the top level of the B+ tree.
File System Recovery
All file system data structure updates are performed inside transactions that are
logged.
Before a data structure is altered, the transaction writes a log record that
contains redo and undo information.
After the data structure has been changed, a commit record is written to the
log to signify that the transaction succeeded.
After a crash, the file system data structures can be restored to a consistent
state by processing the log records.
File System Recovery (Cont.)
This scheme does not guarantee that all the user file data can be recovered after a
crash, just that the file system data structures (the metadata files) are undamaged
and reflect some consistent state prior to the crash.
The log is stored in the third metadata file at the beginning of the volume.
The logging functionality is provided by the 2000 log file service.
File System Security
Security of an NTFS volume is derived from the 2000 object model.
Each file object has a security descriptor attribute stored in its MFT record.
This attribute contains the access token of the owner of the file, and an access
control list that states the access privileges that are granted to each user that has
access to the file.
Volume Management and Fault Tolerance
FtDisk, the fault tolerant disk driver for 2000, provides several ways to combine
multiple SCSI disk drives into one logical volume.
Logically concatenate multiple disks to form a large logical volume, a volume set.
Interleave multiple physical partitions in round-robin fashion to form a stripe set
(also called RAID level 0, or disk striping).
Variation: stripe set with parity, or RAID level 5.
Disk mirroring, or RAID level 1, is a robust scheme that uses a mirror set: two
equally sized partitions on two disks with identical data contents.
To deal with disk sectors that go bad, FtDisk uses a hardware technique called
sector sparing and NTFS uses a software technique called cluster remapping.
Volume Set On Two Drives
Stripe Set on Two Drives
Stripe Set With Parity on Three Drives
Mirror Set on Two Drives
File System Compression
To compress a file, NTFS divides the file's data into compression units, which are
blocks of 16 contiguous clusters.
For sparse files, NTFS uses another technique to save space.
Clusters that contain all zeros are not actually allocated or stored on disk.
Instead, gaps are left in the sequence of virtual cluster numbers stored in the
MFT entry for the file.
When reading a file, if a gap in the virtual cluster numbers is found, NTFS just
zero-fills that portion of the caller's buffer.
File System Reparse Points
A reparse point returns an error code when accessed. The reparse data tells the I/O
manager what to do next.
Reparse points can be used to provide the functionality of UNIX mounts
Reparse points can also be used to access files that have been moved to offline
storage.
Networking
2000 supports both peer-to-peer and client/server networking; it also has facilities
for network management.
To describe networking in 2000, we refer to two of the internal networking
interfaces:
NDIS (Network Device Interface Specification) Separates network adapters
from the transport protocols so that either can be changed without affecting
the other.
TDI (Transport Driver Interface) Enables any session layer component to
use any available transport mechanism.
2000 implements transport protocols as drivers that can be loaded and unloaded
from the system dynamically.
Networking Protocols
The server message block (SMB) protocol is used to send I/O requests over the
network. It has four message types:
- Session control
- File
- Printer
- Message
The Network Basic Input/Output System (NetBIOS) is a hardware-abstraction
interface for networks. Used to:
Establish logical names on the network.
Establish logical connections or sessions between two logical names on the
network.
Support reliable data transfer for a session via NetBIOS requests or SMBs
Networking Protocols (Cont.)
NetBEUI (NetBIOS Extended User Interface): default protocol for Windows 95
peer networking and Windows for Workgroups; used when 2000 wants to share
resources with these networks.
2000 uses the TCP/IP Internet protocol to connect to a wide variety of operating
systems and hardware platforms.
PPTP (Point-to-Point Tunneling Protocol) is used to communicate between Remote
Access Server modules running on 2000 machines that are connected over the
Internet.
The 2000 NWLink protocol connects the NetBIOS to Novell NetWare networks.
Networking Protocols (Cont.)
The Data Link Control protocol (DLC) is used to access IBM mainframes and HP
printers that are directly connected to the network.
2000 systems can communicate with Macintosh computers via the Apple Talk
protocol if an 2000 Server on the network is running the Windows 2000 Services for
Macintosh package.
Networking Dist. Processing Mechanisms
2000 supports distributed applications via NetBIOS, named pipes and
mailslots, Windows Sockets, Remote Procedure Calls (RPC), and Network Dynamic
Data Exchange (NetDDE).
NetBIOS applications can communicate over the network using NetBEUI, NWLink,
or TCP/IP.
Named pipes are a connection-oriented messaging mechanism, named via the
Uniform Naming Convention (UNC).
Mailslots are a connectionless messaging mechanism used for broadcast
applications, such as finding components on the network.
Winsock, the Windows Sockets API, is a session-layer interface that provides a
standardized interface to many transport protocols that may have different
addressing schemes.
Distributed Processing Mechanisms (Cont.)
The 2000 RPC mechanism follows the widely-used Distributed Computing
Environment standard for RPC messages, so programs written to use 2000 RPCs are
very portable.
RPC messages are sent using NetBIOS, or Winsock on TCP/IP networks, or
named pipes on LAN Manager networks.
2000 provides the Microsoft Interface Definition Language to describe the
remote procedure names, arguments, and results.
Networking Redirectors and Servers
In 2000, an application can use the 2000 I/O API to access files from a remote
computer as if they were local, provided that the remote computer is running an MS-
NET server.
A redirector is the client-side object that forwards I/O requests to remote files,
where they are satisfied by a server.
For performance and security, the redirectors and servers run in kernel mode.
Access to a Remote File
The application calls the I/O manager to request that a file be opened (we assume
that the file name is in the standard UNC format).
The I/O manager builds an I/O request packet.
The I/O manager recognizes that the access is for a remote file, and calls a driver
called a Multiple Universal Naming Convention Provider (MUP).
The MUP sends the I/O request packet asynchronously to all registered redirectors.
A redirector that can satisfy the request responds to the MUP.
To avoid asking all the redirectors the same question in the future, the MUP
uses a cache to remember which redirector can handle this file.
Access to a Remote File (Cont.)
The redirector sends the network request to the remote system.
The remote system's network drivers receive the request and pass it to the server
driver.
The server driver hands the request to the proper local file system driver.
The proper device driver is called to access the data.
The results are returned to the server driver, which sends the data back to the
requesting redirector.
Networking Domains
NT uses the concept of a domain to manage global access rights within groups.
A domain is a group of machines running NT server that share a common security
policy and user database.
2000 provides three models of setting up trust relationships.
One-way: A trusts B.
Two-way, transitive: A trusts B, B trusts C, so A, B, and C trust each other.
Crosslink: allows authentication to bypass the hierarchy to cut down on
authentication traffic.
Name Resolution in TCP/IP Networks
On an IP network, name resolution is the process of converting a computer name to
an IP address.
e.g., www.bell-labs.com resolves to 135.104.1.14
2000 provides several methods of name resolution:
Windows Internet Name Service (WINS)
broadcast name resolution
domain name system (DNS)
a host file
an LMHOSTS file
Name Resolution (Cont.)
WINS consists of two or more WINS servers that maintain a dynamic database of
name to IP address bindings, and client software to query the servers.
WINS uses the Dynamic Host Configuration Protocol (DHCP), which automatically
updates address configurations in the WINS database, without user or administrator
intervention.
Programmer Interface Access to Kernel Obj.
A process gains access to a kernel object named XXX by calling the CreateXXX
function to open a handle to XXX; the handle is unique to that process.
A handle can be closed by calling the CloseHandle function; the system may delete
the object if the count of processes using the object drops to 0.
Given a handle to a process and the handle's value, a second process can get a
handle to the same object, and thus share it.
Programmer Interface Process Management
A process is started via the CreateProcess routine, which loads any dynamic link
libraries used by the process and creates a primary thread.
Additional threads can be created by the CreateThread function.
Every dynamic link library or executable file that is loaded into the address space of
a process is identified by an instance handle.
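A hedged Win32 sketch of these calls; "child.exe" is a placeholder command line, and error handling is trimmed:

    /* Sketch: starting a process and waiting for it. */
    #include <windows.h>

    int main(void)
    {
        STARTUPINFOA si = { sizeof(si) };
        PROCESS_INFORMATION pi;
        char cmd[] = "child.exe";            /* hypothetical program name */
        if (CreateProcessA(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL,
                           &si, &pi)) {
            WaitForSingleObject(pi.hProcess, INFINITE);
            CloseHandle(pi.hThread);         /* handle to the primary thread */
            CloseHandle(pi.hProcess);
        }
        return 0;
    }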
Process Management (Cont.)
Scheduling in Win32 utilizes four priority classes:
- IDLE_PRIORITY_CLASS (priority level 4)
- NORMAL_PRIORITY_CLASS (level 8, typical for most processes)
- HIGH_PRIORITY_CLASS (level 13)
- REALTIME_PRIORITY_CLASS (level 24)
To provide performance levels needed for interactive programs, 2000 has a special
scheduling rule for processes in the NORMAL_PRIORITY_CLASS.
2000 distinguishes between the foreground process that is currently selected on
the screen, and the background processes that are not currently selected.
When a process moves into the foreground, 2000 increases the scheduling
quantum by some factor, typically 3.
Process Management (Cont.)
The kernel dynamically adjusts the priority of a thread depending on whether it is
I/O-bound or CPU-bound.
To synchronize the concurrent access to shared objects by threads, the kernel
provides synchronization objects, such as semaphores and mutexes.
In addition, threads can synchronize by using the WaitForSingleObject or
WaitForMultipleObjects functions.
Another method of synchronization in the Win32 API is the critical section.
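A small Win32 sketch combining these: a kernel mutex guards a counter, and WaitForSingleObject is used both to acquire the mutex and to wait for the thread:

    /* Sketch: mutex-based synchronization between threads. */
    #include <windows.h>

    static HANDLE mtx;
    static long counter;

    DWORD WINAPI worker(LPVOID arg)
    {
        (void)arg;
        WaitForSingleObject(mtx, INFINITE);   /* acquire the mutex */
        counter++;
        ReleaseMutex(mtx);                    /* release it */
        return 0;
    }

    int main(void)
    {
        mtx = CreateMutex(NULL, FALSE, NULL);
        HANDLE t = CreateThread(NULL, 0, worker, NULL, 0, NULL);
        WaitForSingleObject(t, INFINITE);     /* also a wait on an object */
        CloseHandle(t);
        CloseHandle(mtx);
        return 0;
    }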
Programmer Interface Interprocess Comm.
Win32 applications can have interprocess communication by sharing kernel objects.
An alternate means of interprocess communications is message passing, which is
particularly popular for Windows GUI applications.
One thread sends a message to another thread or to a window.
A thread can also send data with the message.
Every Win32 thread has its own input queue from which the thread receives
messages.
This is more reliable than the shared input queue of 16-bit Windows, because with
separate queues, one stuck application cannot block input to the other applications.
Programmer Interface Memory Management
Virtual memory:
- VirtualAlloc reserves or commits virtual memory.
- VirtualFree decommits or releases the memory.
These functions enable the application to determine the virtual address at
which the memory is allocated.
An application can use memory by memory-mapping a file into its address space;
this is a multistage process.
Two processes share memory by mapping the same file into their virtual
memory.
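A sketch of the reserve-then-commit sequence described above (error handling omitted):

    /* Sketch: reserve address space, commit it, then release. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        SIZE_T size = 1 << 20;                               /* 1 MB */
        void *base = VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
        void *mem  = VirtualAlloc(base, size, MEM_COMMIT, PAGE_READWRITE);
        ((char *)mem)[0] = 1;                                /* now usable */
        printf("committed at %p\n", mem);
        VirtualFree(base, 0, MEM_RELEASE);                   /* release all */
        return 0;
    }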
**********************************
UNIT-V
I/O SYSTEMS
1. I/O SYSTEMS
The basic model of the UNIX I/O system is a sequence of bytes that can be
accessed either randomly or sequentially. There are no access methods and no
control blocks in a typical UNIX user process. Different programs expect various
levels of structure, but the kernel does not impose structure on I/O. For instance,
the convention for text files is lines of ASCII characters separated by a single
newline character (the ASCII line-feed character), but the kernel knows nothing
about this convention. For the purposes of most programs, the model is further
simplified to being a stream of data bytes, or an I/O stream. It is this single
common data form that makes the characteristic UNIX tool-based approach work
[Kernighan & Pike, 1984]. An I/O stream from one program can be fed as input to
almost any other program. (This kind of traditional UNIX I/O stream should not
be confused with the Eighth Edition stream I/O system or with the System V,
Release 3 STREAMS, both of which can be accessed as traditional I/O streams.)
Descriptors and I/O
UNIX processes use descriptors to reference I/O streams. Descriptors are small
unsigned integers obtained from the open and socket system calls. A read or
write system call can be applied to a descriptor to transfer data. The close
system call can be used to deallocate any descriptor. Descriptors represent
underlying objects supported by the kernel, and are created by system calls
specific to the type of object. In 4.4BSD, three kinds of objects can be
represented by descriptors: files, pipes, and sockets.
- A file is a linear array of bytes with at least one name. A file exists until
all its names are deleted explicitly and no process holds a descriptor for
it. A process acquires a descriptor for a file by opening that file's name
with the open system call. I/O devices are accessed as files.
- A pipe is a linear array of bytes, as is a file, but it is used solely as an
I/O stream, and it is unidirectional. It also has no name, and thus cannot
be opened with open. Instead, it is created by the pipe system call,
which returns two descriptors, one of which accepts input that is sent to
the other descriptor reliably, without duplication, and in order. The
system also supports a named pipe or FIFO. A FIFO has properties
identical to a pipe, except that it appears in the filesystem; thus, it can
be opened using the open system call. Two processes that wish to
communicate each open the FIFO: One opens it for reading, the other
for writing.
- A socket is a transient object that is used for interprocess
communication; it exists only as long as some process holds a descriptor
referring to it. A socket is created by the socket system call, which
returns a descriptor for it. There are different kinds of sockets that
support various communication semantics, such as reliable delivery of
data, preservation of message ordering, and preservation of message
boundaries.
In systems before 4.2BSD, pipes were implemented using the filesystem; when
sockets were introduced in 4.2BSD, pipes were reimplemented as sockets.
The kernel keeps for each process a descriptor table, which is a table that the
kernel uses to translate the external representation of a descriptor into an
internal representation. (The descriptor is merely an index into this table.) The
descriptor table of a process is inherited from that process's parent, and thus
access to the objects to which the descriptors refer also is inherited. The main
ways that a process can obtain a descriptor are by opening or creation of an
object, and by inheritance from the parent process. In addition, socket IPC
allows passing of descriptors in messages between unrelated processes on the
same machine.
Every valid descriptor has an associated file offset in bytes from the beginning of the
object. Read and write operations start at this offset, which is updated after each data
transfer. For objects that permit random access, the file offset also may be set with the
lseek system call. Ordinary files permit random access, and some devices do, as well.
Pipes and sockets do not.
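A short C sketch of descriptors and the per-descriptor offset; /etc/hosts is just a convenient readable file:

    /* Sketch: read() advances the file offset; lseek() repositions it. */
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        char buf[8];
        int fd = open("/etc/hosts", O_RDONLY);   /* a small unsigned integer */
        read(fd, buf, 8);                        /* offset now 8 */
        lseek(fd, 0, SEEK_SET);                  /* rewind: random access */
        read(fd, buf, 8);                        /* same 8 bytes again */
        printf("fd = %d\n", fd);
        close(fd);                               /* deallocate the descriptor */
        return 0;
    }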
When a process terminates, the kernel reclaims all the descriptors that were in
use by that process. If the process was holding the final reference to an object,
the object's manager is notified so that it can do any necessary cleanup actions,
such as final deletion of a file or deallocation of a socket.
2. I/O Hardware
There are three basic hardware operations: read from a hardware module, write to a
module, and write to the output data stream. The read from a module also has a variation
that allows reading into a variable or directly into the output data stream. In addition,
there is a clear crate statement which performs the appropriate operation for that crate.
Output, either explicitly done with the output statement, or implicitly done by a hardware
read operation, is assumed to be in units of 4-byte integers. Each time a code section is
called, it produces a single bank of 4-byte integers. The bank header (including bank
length) is inserted automatically.
3. A KERNEL I/O SUBSYSTEM
3.1 KERNEL MODULES
3.1.1 Introduction
Sections of kernel code that can be compiled, loaded, and unloaded independently of
the rest of the kernel
A kernel module may typically implement a device driver, a file system, or a
networking protocol
The module interface allows third parties to write and distribute, on their own terms,
device drivers or file systems that could not be distributed under the GPL
Kernel modules allow a Linux system to be set up with a standard, minimal kernel,
without any extra device drivers built in
Three components to Linux module support:
module management
driver registration
conflict resolution
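Before looking at these components, a minimal loadable-module sketch in C shows the load/unload cycle they support (module and message names are illustrative; the module would be built against the kernel headers, loaded with insmod, and unloaded with rmmod):

    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/module.h>

    /* Runs when the module is loaded into kernel memory. */
    static int __init hello_init(void) {
        printk(KERN_INFO "hello: module loaded\n");
        return 0;
    }

    /* Runs when the module is unloaded again. */
    static void __exit hello_exit(void) {
        printk(KERN_INFO "hello: module unloaded\n");
    }

    module_init(hello_init);
    module_exit(hello_exit);
    MODULE_LICENSE("GPL");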
3.1.2 Module Management
Supports loading modules into memory and letting them talk to the rest of the kernel
Module loading is split into two separate sections:
Managing sections of module code in kernel memory
Handling symbols that modules are allowed to reference
The module requestor manages loading requested, but currently unloaded, modules;
it also regularly queries the kernel to see whether a dynamically loaded module is
still in use, and will unload it when it is no longer actively needed
3.1.3 Driver Registration
Allows modules to tell the rest of the kernel that a new driver has become available
The kernel maintains dynamic tables of all known drivers, and provides a set of
routines to allow drivers to be added to or removed from these tables at any time
Registration tables include the following items:
Device drivers
File systems
Network protocols
Binary formats
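As a sketch of registration, the fragment below adds an (empty) character device driver to the kernel's table of known drivers and removes it on unload; the major number 240 (from the local/experimental range) and the name "demo" are assumptions for illustration:

    #include <linux/fs.h>
    #include <linux/module.h>

    #define DEMO_MAJOR 240  /* assumed major number (local/experimental range) */

    static struct file_operations demo_fops = {
        .owner = THIS_MODULE,
        /* .open, .read, .write, ... would be filled in by a real driver */
    };

    /* Add the driver to the dynamic table of character devices. */
    static int __init demo_init(void) {
        return register_chrdev(DEMO_MAJOR, "demo", &demo_fops);
    }

    /* Remove it again; registration tables can change at any time. */
    static void __exit demo_exit(void) {
        unregister_chrdev(DEMO_MAJOR, "demo");
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");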
3.1.4 Conflict Resolution
A mechanism that allows different device drivers to reserve hardware resources and
to protect those resources from accidental use by another driver
The conflict resolution module aims to:
Prevent modules from clashing over access to hardware resources
Prevent autoprobes from interfering with existing device drivers
Resolve conflicts with multiple drivers trying to access the same hardware
4. STREAMS
A STREAM is a full-duplex communication channel between a user-level process
and a device.
A STREAM consists of:
- a STREAM head that interfaces with the user process
- a driver end that interfaces with the device
- zero or more STREAM modules between them
Each module contains a read queue and a write queue
Message passing is used to communicate between queues
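On a System V system that supports STREAMS, a module is inserted between the STREAM head and the driver end with the I_PUSH ioctl. A hedged sketch, assuming a Solaris-style environment where /dev/term/a is a STREAMS terminal device and ldterm is an available line-discipline module:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stropts.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/dev/term/a", O_RDWR);   /* assumed STREAMS device */
        if (fd < 0) { perror("open"); return 1; }

        /* Push a module onto the stream; its read and write queues are
         * now traversed by every message passing through the stream. */
        if (ioctl(fd, I_PUSH, "ldterm") < 0) perror("I_PUSH");

        close(fd);
        return 0;
    }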
4.1 The STREAMS Structure
5 PERFORMANCE
I/O is a major factor in system performance:
Demands CPU to execute device driver, kernel I/O code
Context switches due to interrupts
Data copying
Network traffic especially stressful
5.1 Improving Performance
Reduce number of context switches
Reduce data copying
Reduce interrupts by using large transfers, smart controllers, polling
Use DMA
Balance CPU, memory, bus, and I/O performance for highest throughput
6. MASS-STORAGE SYSTEMS
6.1 Overview of Mass Storage Structure
Magnetic disks provide bulk of secondary storage of modern computers
Drives rotate at 60 to 200 times per second
Transfer rate is the rate at which data flows between the drive and the computer
Positioning time (random-access time) is time to move disk arm to desired
cylinder (seek time) and time for desired sector to rotate under the disk head
(rotational latency)
Head crash results from disk head making contact with the disk surface
That's bad
Disks can be removable
Drive attached to computer via I/O bus
Busses vary, including EIDE, ATA, SATA, USB, Fibre Channel, SCSI
Host controller in computer uses bus to talk to disk controller built into drive or
storage array
Moving-head Disk Mechanism
Magnetic tape
Was early secondary-storage medium
Relatively permanent and holds large quantities of data
Access time slow
Random access ~1000 times slower than disk
Mainly used for backup, storage of infrequently-used data, transfer medium
between systems
Kept in spool and wound or rewound past read-write head
Once data under head, transfer rates comparable to disk
20-200GB typical storage
Common technologies are 4mm, 8mm, 19mm, LTO-2 and SDLT
6.2 DISK STRUCTURE
6.2.1 Introduction
Disk drives are addressed as large 1-dimensional arrays of logical blocks, where the
logical block is the smallest unit of transfer.
The 1-dimensional array of logical blocks is mapped into the sectors of the disk
sequentially.
Sector 0 is the first sector of the first track on the outermost cylinder.
Mapping proceeds in order through that track, then the rest of the tracks in
that cylinder, and then through the rest of the cylinders from outermost to
innermost.
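A small C sketch of this sequential mapping, assuming a hypothetical fixed geometry of 63 sectors per track and 16 tracks per cylinder (real drives hide their geometry and vary sectors per track across zones):

    #include <stdio.h>

    #define SECTORS_PER_TRACK   63   /* assumed geometry */
    #define TRACKS_PER_CYLINDER 16

    /* Map a logical block number to (cylinder, track, sector),
     * filling each track, then each cylinder, in order. */
    static void block_to_chs(long block, long *cyl, long *track, long *sector) {
        *sector = block % SECTORS_PER_TRACK;
        *track  = (block / SECTORS_PER_TRACK) % TRACKS_PER_CYLINDER;
        *cyl    = block / (SECTORS_PER_TRACK * TRACKS_PER_CYLINDER);
    }

    int main(void) {
        long c, t, s;
        block_to_chs(5000, &c, &t, &s);
        printf("block 5000 -> cylinder %ld, track %ld, sector %ld\n", c, t, s);
        return 0;
    }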
6.2.2 Disk Attachment
Host-attached storage accessed through I/O ports talking to I/O busses
SCSI itself is a bus, up to 16 devices on one cable, SCSI initiator requests operation
and SCSI targets perform tasks
Each target can have up to 8 logical units (disks attached to the device controller)
FC is high-speed serial architecture
Can be a switched fabric with a 24-bit address space, the basis of storage area
networks (SANs) in which many hosts attach to many storage units
Can be arbitrated loop (FC-AL) of 126 devices
6.2.3 Network-Attached Storage
Network-attached storage (NAS) is storage made available over a network rather
than over a local connection (such as a bus)
NFS and CIFS are common protocols
Implemented via remote procedure calls (RPCs) between host and storage
New iSCSI protocol uses IP network to carry the SCSI protocol
6.2.4 Storage Area Network
Common in large storage environments (and becoming more common)
Multiple hosts attached to multiple storage arrays: flexible
6.3 DISK SCHEDULING
6.3.1 Introduction
The operating system is responsible for using the hardware efficiently; for the disk
drives, this means having fast access time and high disk bandwidth.
Access time has two major components
Seek time is the time for the disk arm to move the heads to the cylinder
containing the desired sector.
Rotational latency is the additional time waiting for the disk to rotate the
desired sector to the disk head.
Minimize seek time
Seek time is approximately proportional to seek distance
Disk bandwidth is the total number of bytes transferred, divided by the total time
between the first request for service and the completion of the last transfer.
Several algorithms exist to schedule the servicing of disk I/O requests.
We illustrate them with a request queue (0-199).
98, 183, 37, 122, 14, 124, 65, 67
Head pointer 53
6.3.2 FCFS
Illustration shows total head movement of 640 cylinders.
6.3.3 SSTF
Selects the request with the minimum seek time from the current head position.
SSTF scheduling is a form of SJF scheduling; may cause starvation of some
requests.
Illustration shows total head movement of 236 cylinders.
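Both totals can be checked with a short C sketch that replays the example queue (head initially at cylinder 53) under FCFS and SSTF:

    #include <stdio.h>
    #include <stdlib.h>

    #define N 8
    static const int queue[N] = {98, 183, 37, 122, 14, 124, 65, 67};

    /* FCFS: service requests strictly in arrival order. */
    static int fcfs(int head) {
        int total = 0;
        for (int i = 0; i < N; i++) {
            total += abs(queue[i] - head);
            head = queue[i];
        }
        return total;
    }

    /* SSTF: always service the pending request nearest the head. */
    static int sstf(int head) {
        int done[N] = {0}, total = 0;
        for (int served = 0; served < N; served++) {
            int best = -1;
            for (int i = 0; i < N; i++)
                if (!done[i] && (best < 0 ||
                        abs(queue[i] - head) < abs(queue[best] - head)))
                    best = i;
            total += abs(queue[best] - head);
            head = queue[best];
            done[best] = 1;
        }
        return total;
    }

    int main(void) {
        printf("FCFS: %d cylinders\n", fcfs(53));  /* 640 */
        printf("SSTF: %d cylinders\n", sstf(53));  /* 236 */
        return 0;
    }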
6.3.4 SCAN
The disk arm starts at one end of the disk, and moves toward the other end, servicing
requests until it gets to the other end of the disk, where the head movement is
reversed and servicing continues.
Sometimes called the elevator algorithm.
Illustration shows total head movement of 208 cylinders.
6.3.5 C-SCAN
Provides a more uniform wait time than SCAN.
The head moves from one end of the disk to the other, servicing requests as it
goes. When it reaches the other end, however, it immediately returns to the
beginning of the disk, without servicing any requests on the return trip.
Treats the cylinders as a circular list that wraps around from the last cylinder to
the first one.
6.3.6 C-LOOK
Version of C-SCAN
Arm only goes as far as the last request in each direction, then reverses direction
immediately, without first going all the way to the end of the disk.
6.3.7 Selecting a Disk-Scheduling Algorithm
SSTF is common and has a natural appeal
SCAN and C-SCAN perform better for systems that place a heavy load on the disk.
Performance depends on the number and types of requests.
Requests for disk service can be influenced by the file-allocation method.
The disk-scheduling algorithm should be written as a separate module of the
operating system, allowing it to be replaced with a different algorithm if necessary.
Either SSTF or LOOK is a reasonable choice for the default algorithm.
6.4. DISK MANAGEMENT
Low-level formatting, or physical formatting: dividing a disk into sectors that
the disk controller can read and write.
To use a disk to hold files, the operating system still needs to record its own data
structures on the disk.
Partition the disk into one or more groups of cylinders.
Logical formatting or making a file system.
Boot block initializes system.
The bootstrap is stored in ROM.
Bootstrap loader program.
Methods such as sector sparing used to handle bad blocks.
Booting from a Disk in Windows 2000
6.5. SWAP-SPACE MANAGEMENT
Swap-space: virtual memory uses disk space as an extension of main memory.
Swap-space can be carved out of the normal file system or, more commonly, it can
be in a separate disk partition.
Swap-space management
4.3BSD allocates swap space when a process starts; it holds the text segment (the
program) and the data segment.
Kernel uses swap maps to track swap-space use.
Solaris 2 allocates swap space only when a page is forced out of physical
memory, not when the virtual memory page is first created.
Data Structures for Swapping on Linux Systems
7. RAID
RAID is an acronym first defined by David A. Patterson, Garth A. Gibson, and Randy
Katz at the University of California, Berkeley in 1987 to describe a redundant array of
inexpensive disks [1], a technology that allowed computer users to achieve high levels of
storage reliability from low-cost and less reliable PC-class disk-drive components, via the
technique of arranging the devices into arrays for redundancy.
Marketers representing industry RAID manufacturers later reinvented the term to describe
a redundant array of independent disks as a means of dissociating a "low cost"
expectation from RAID technology [2].
"RAID" is now used as an umbrella term for computer data storage schemes that can
divide and replicate data among multiple hard disk drives. The different
schemes/architectures are named by the word RAID followed by a number, as in RAID 0,
RAID 1, etc. RAID's various designs involve two key design goals: increase data
reliability and/or increase input/output performance. When multiple physical disks are set
up to use RAID technology, they are said to be in a RAID array. This array distributes
data across multiple disks, but the array is seen by the computer user and operating
system as one single disk. RAID can be set up to serve several different purposes.
Each level below is listed with its description, the minimum number of disks it
requires, and its space efficiency for an array of n disks.
RAID 0: "Striped set without parity" or "Striping". Minimum disks: 2. Space efficiency: n.
Provides improved performance and additional storage but no redundancy or fault
tolerance. Because there is no redundancy, this level is not actually a Redundant Array
of Inexpensive Disks, i.e. not true RAID. However, because of the similarities to RAID
(especially the need for a controller to distribute data across multiple disks), simple
stripe sets are normally referred to as RAID 0. Any disk failure destroys the array,
which has greater consequences with more disks in the array (at a minimum, catastrophic
data loss is twice as severe compared to single drives without RAID). A single disk
failure destroys the entire array because when data is written to a RAID 0 drive, the
data is broken into fragments. The number of fragments is dictated by the number of
disks in the array. The fragments are written to their respective disks simultaneously
on the same sector. This allows smaller sections of the entire chunk of data to be read
off the drive in parallel, increasing bandwidth. RAID 0 does not implement error
checking, so any error is unrecoverable. More disks in the array means higher bandwidth,
but greater risk of data loss.
RAID 1: "Mirrored set without parity" or "Mirroring". Minimum disks: 2. Space
efficiency: 1 (the size of the smallest disk).
Provides fault tolerance from disk errors and failure of all but one of the drives.
Increased read performance occurs when using a multi-threaded operating system that
supports split seeks, as well as a very small performance reduction when writing. The
array continues to operate so long as at least one drive is functioning. Using RAID 1
with a separate controller for each disk is sometimes called duplexing.
RAID 2: Hamming-code parity. Minimum disks: 3.
Disks are synchronized and striped in very small stripes, often in single bytes/words.
Hamming-code error correction is calculated across corresponding bits on disks, and is
stored on multiple parity disks.
RAID 3: Striped set with dedicated parity, bit-interleaved parity, or byte-level
parity. Minimum disks: 3. Space efficiency: n-1.
This mechanism provides fault tolerance similar to RAID 5. However, because the strip
across the disks is a lot smaller than a filesystem block, reads and writes to the array
perform like a single drive with a high linear write performance. For this to work
properly, the drives must have synchronised rotation. If one drive fails, the
performance doesn't change.
RAID 4: Block-level parity. Minimum disks: 3. Space efficiency: n-1.
Identical to RAID 3, but does block-level striping instead of byte-level striping. In
this setup, files can be distributed between multiple disks. Each disk operates
independently, which allows I/O requests to be performed in parallel, though data
transfer speeds can suffer due to the type of parity. The error detection is achieved
through dedicated parity and is stored in a separate, single disk unit.
RAID 5: Striped set with distributed parity or interleave parity. Minimum disks: 3.
Space efficiency: n-1.
Distributed parity requires all drives but one to be present to operate; drive failure
requires replacement, but the array is not destroyed by a single drive failure. Upon
drive failure, any subsequent reads can be calculated from the distributed parity such
that the drive failure is masked from the end user. The array will have data loss in the
event of a second drive failure and is vulnerable until the data that was on the failed
drive is rebuilt onto a replacement drive. A single drive failure in the set will result
in reduced performance of the entire set until the failed drive has been replaced and
rebuilt.
RAID 6: Striped set with dual distributed parity. Minimum disks: 4. Space
efficiency: n-2.
Provides fault tolerance from two drive failures; the array continues to operate with up
to two failed drives. This makes larger RAID groups more practical, especially for
high-availability systems. This becomes increasingly important because large-capacity
drives lengthen the time needed to recover from the failure of a single drive.
Single-parity RAID levels are vulnerable to data loss until the failed drive is rebuilt:
the larger the drive, the longer the rebuild will take. Dual parity gives time to
rebuild the array without the data being at risk if a (single) additional drive fails
before the rebuild is complete.
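The dedicated or distributed parity used by RAID levels 3 through 5 is a bytewise XOR across the data blocks of a stripe: losing any one block, the array can recompute it from the parity and the surviving blocks. A C sketch with illustrative block and disk counts:

    #include <stdio.h>
    #include <string.h>

    #define NDATA 3   /* data disks in the stripe (illustrative) */
    #define BLOCK 8   /* bytes per block (tiny, for illustration) */

    /* Parity block = XOR of all data blocks in the stripe. */
    static void make_parity(unsigned char data[NDATA][BLOCK],
                            unsigned char parity[BLOCK]) {
        memset(parity, 0, BLOCK);
        for (int d = 0; d < NDATA; d++)
            for (int b = 0; b < BLOCK; b++)
                parity[b] ^= data[d][b];
    }

    /* Rebuild a lost block by XOR-ing parity with the survivors. */
    static void rebuild(unsigned char data[NDATA][BLOCK],
                        const unsigned char parity[BLOCK], int lost) {
        memcpy(data[lost], parity, BLOCK);
        for (int d = 0; d < NDATA; d++)
            if (d != lost)
                for (int b = 0; b < BLOCK; b++)
                    data[lost][b] ^= data[d][b];
    }

    int main(void) {
        unsigned char data[NDATA][BLOCK] = { "disk A!", "disk B!", "disk C!" };
        unsigned char parity[BLOCK];
        make_parity(data, parity);

        memset(data[1], 0, BLOCK);            /* simulate losing disk B */
        rebuild(data, parity, 1);
        printf("recovered: %s\n", (char *)data[1]);   /* prints "disk B!" */
        return 0;
    }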
7.1. STABLE STORAGE
Stable storage is a classification of computer data storage technology that guarantees
atomicity for any given write operation and allows software to be written that is robust
against some hardware and power failures. To be considered atomic, upon reading back a
just written-to portion of the disk, the storage subsystem must return either the write data
or the data that was on that portion of the disk before the write operation. Most computer
disk drives are not considered stable storage because they do not guarantee atomic write:
an error could be returned upon subsequent read of the disk where it was just written to in
lieu of either the new or prior data.
Multiple techniques have been developed to achieve the atomic property from
weakly-atomic devices such as disks. Writing data to a disk in two places in a specific way is one
technique and can be done by application software. Most often though, stable storage
functionality is achieved by mirroring data on separate disks via RAID technology (level
1 or greater). The RAID controller implements the disk writing algorithms that enable
separate disks to act as stable storage. The RAID technique is robust against some single
disk failure in an array of disks whereas the software technique of writing to separate
areas of the same disk only protects against some kinds of internal disk media failures
such as bad sectors in single disk arrangements.
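A minimal sketch of the write-in-two-places technique, assuming ordinary files stand in for two separate disk areas and omitting the checksums a real implementation would add so that recovery can tell a torn write from a good copy:

    #include <fcntl.h>
    #include <unistd.h>

    /* Write the record to two locations, syncing after each write.
     * Because the copies are written one at a time, a crash leaves
     * at least one location holding either the old or the new data. */
    static int stable_write(const char *path1, const char *path2,
                            const void *buf, size_t len) {
        const char *paths[2] = { path1, path2 };
        for (int i = 0; i < 2; i++) {
            int fd = open(paths[i], O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0) return -1;
            if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
                close(fd);
                return -1;
            }
            close(fd);
        }
        return 0;
    }

    int main(void) {
        const char msg[] = "record v2";
        /* copy1.dat / copy2.dat are hypothetical stand-ins for two disk areas */
        return stable_write("copy1.dat", "copy2.dat", msg, sizeof msg) ? 1 : 0;
    }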
7.2. TERTIARY STORAGE
Tertiary storage or tertiary memory, provides a third level of storage. Typically it
involves a robotic mechanism which will mount (insert) and dismount removable mass
storage media into a storage device according to the system's demands; this data is often
copied to secondary storage before use. It is primarily used for archival of rarely accessed
information since it is much slower than secondary storage (e.g. 5-60 seconds vs. 1-10
milliseconds). This is primarily useful for extraordinarily large data stores, accessed
without human operators. Typical examples include tape libraries and optical jukeboxes.
When a computer needs to read information from the tertiary storage, it will first consult a
catalog database to determine which tape or disc contains the information. Next, the
computer will instruct a robotic arm to fetch the medium and place it in a drive. When the
computer has finished reading the information, the robotic arm will return the medium to
its place in the library.
8. CASE STUDY
8.1 IO in Linux
Input and Output
The Linux device-oriented file system accesses disk storage through two caches:
Data is cached in the page cache, which is unified with the virtual memory system
Metadata is cached in the buffer cache, a separate cache indexed by the physical
disk block.
Linux splits all devices into three classes:
block devices allow random access to completely independent, fixed size blocks of
data
character devices include most other devices; they don't need to support the
functionality of regular files.
network devices are interfaced via the kernel's networking subsystem
Device-Driver Block Structure
Block Devices
Provide the main interface to all disk devices in a system.
The block buffer cache serves two main purposes:
it acts as a pool of buffers for active I/O
it serves as a cache for completed I/O
The request manager manages the reading and writing of buffer contents to and from a
block device driver.
Character Devices
A device driver which does not offer random access to fixed blocks of data.
A character device driver must register a set of functions which implement the driver's
various file I/O operations.
The kernel performs almost no preprocessing of a file read or write request to a
character device, but simply passes on the request to the device.
The main exception to this rule is the special subset of character device drivers which
implement terminal devices, for which the kernel maintains a standard interface.