OS Course File
UNIT-1
What is an Operating System?
A program that acts as an intermediary between a user of a computer and the computer hardware
Operating system goals:
Execute user programs and make solving user problems easier
Make the computer system convenient to use
Use the computer hardware in an efficient manner
OS is a resource allocator
Manages all resources
Decides between conflicting requests for efficient and fair resource use
OS is a control program
Controls execution of programs to prevent errors and improper use of the computer
No universally accepted definition
“Everything a vendor ships when you order an operating system” is a good approximation,
but it varies wildly.
“The one program running at all times on the computer” is the kernel. Everything else is
either a system program (ships with the operating system) or an application program.
Computer-system operation
One or more CPUs, device controllers connect through common bus providing access to
shared memory
Concurrent execution of CPUs and devices competing for memory cycles
Interrupt Handling
The operating system preserves the state of the CPU by storing registers and the program
counter
Determines which type of interrupt has occurred:
polling
vectored interrupt system
Separate segments of code determine what action should be taken for each type of
interrupt
Interrupt Timeline
Storage Hierarchy
Multiprocessor Systems
Two types:
1. Asymmetric Multiprocessing
2. Symmetric Multiprocessing
Clustered Systems
Like multiprocessor
systems, but multiple systems working together
Usually sharing storage via a storage-area network (SAN)
Provides a high-availability service which survives failures
Asymmetric clustering has one machine in hot-standby mode
Symmetric clustering has multiple nodes running applications, monitoring each other
Some clusters are for high-performance computing (HPC)
Applications must be written to use parallelization
OSes provide environments in which programs run, and services for the users of the system,
including:
User Interfaces - Means by which users can issue commands to the system. Depending
on the system these may be a command-line interface ( e.g. sh, csh, ksh, tcsh, etc. ), a
GUI interface ( e.g. Windows, X-Windows, KDE, Gnome, etc. ), or a batch command
system. The latter are generally older systems using punch cards of job-control
language, JCL, but may still be used today for specialty systems designed for a single
purpose.
Program Execution - The OS must be able to load a program into RAM, run the
program, and terminate the program, either normally or abnormally.
I/O Operations - The OS is responsible for transferring data to and from I/O devices,
including keyboards, terminals, printers, and storage devices.
File-System Manipulation - In addition to raw data storage, the OS is also responsible
for maintaining directory and subdirectory structures, mapping file names to specific
blocks of data storage, and providing tools for navigating and utilizing the file system.
Communications - Inter-process communications, IPC, either between processes
running on the same processor, or between processes running on separate processors or
separate machines. May be implemented as either shared memory or message passing, (
or some systems may offer both. )
Error Detection - Both hardware and software errors must be detected and handled
appropriately, with a minimum of harmful repercussions. Some systems may include
complex error avoidance or recovery systems, including backups, RAID drives, and other
redundant systems. Debugging and diagnostic tools aid users and administrators in
tracing down the cause of problems.
Resource Allocation - E.g. CPU cycles, main memory, storage space, and peripheral
devices. Some resources are managed with generic systems and others with very
carefully designed and specially tuned systems, customized for a particular resource and
operating environment.
Accounting - Keeping track of system activity and resource usage, either for billing
purposes or for statistical record keeping that can be used to optimize future performance.
Protection and Security - Preventing harm to the system and to resources, either through
wayward internal processes or malicious outsiders. Authentication, ownership, and
restricted access are obvious parts of this system. Highly secure systems may log all
process activity down to excruciating detail, and security regulations dictate the storage of
those records on permanent non-erasable media for extended times in secure ( off-site )
facilities.
Command Interpreter
Gets and processes the next user request, and launches the requested programs.
In some systems the CI may be incorporated directly into the kernel.
More commonly the CI is a separate program that launches once the user logs in
or otherwise accesses the system.
UNIX, for example, provides the user with a choice of different shells, which may
either be configured to launch automatically at login, or which may be changed on
the fly. ( Each of these shells uses a different configuration file of initial settings
and commands that are executed upon startup. )
Different shells provide different functionality, in terms of certain commands that
are implemented directly by the shell without launching any external programs.
Most provide at least a rudimentary command interpretation structure for use in
shell script programming ( loops, decision constructs, variables, etc. )
An interesting distinction is the processing of wild card file naming and I/O re-
direction. On UNIX systems those details are handled by the shell, and the
program which is launched sees only a list of filenames generated by the shell
from the wild cards. On a DOS system, the wild cards are passed along to the
programs, which can interpret the wild cards as the program sees fit.
Figure 1.2 - The Bourne shell command interpreter in Solaris 10
Graphical User Interface, GUI
Generally implemented as a desktop metaphor, with file folders, trash cans, and
resource icons.
Icons represent some item on the system, and respond accordingly when the icon
is activated.
First developed in the early 1970's at Xerox PARC research facility.
In some systems the GUI is just a front end for activating a traditional command
line interpreter running in the background. In others the GUI is a true graphical
shell in its own right.
Mac has traditionally provided ONLY the GUI interface. With the advent of OSX
( based partially on UNIX ), a command line interface has also become available.
Because mice and keyboards are impractical for small mobile devices, these
normally use a touch-screen interface today, that responds to various patterns of
swipes or "gestures". When these first came out they often had a physical
keyboard and/or a trackball of some kind built in, but today a virtual keyboard is
more commonly implemented on the touch screen.
System calls provide a means for user or application programs to call upon the services of
the operating system.
Generally written in C or C++, although some are written in assembly for optimal
performance.
The figure below illustrates the sequence of system calls required to copy a file:
Figure 1.5 - Example of how system calls are used.
You can use "strace" to see more examples of the large number of system calls invoked
by a single simple command. Read the man page for strace, and try some simple
examples. ( strace mkdir temp, strace cd temp, strace date > t.t, strace cp t.t t.2, etc. )
Most programmers do not use the low-level system calls directly, but instead use an
"Application Programming Interface", API. The following shows the read( ) call
available in the API on UNIX-based systems:
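A minimal sketch of its use, assuming a file named data.txt exists in the current
directory ( the filename and buffer size here are illustrative only ):

#include <fcntl.h>      /* open( ) */
#include <stdio.h>      /* perror( ), printf( ) */
#include <unistd.h>     /* read( ), close( ) */

int main( void ) {
    char buffer[ 128 ];
    int fd = open( "data.txt", O_RDONLY );             /* system call: open the file */
    if( fd < 0 ) { perror( "open" ); return 1; }
    ssize_t n = read( fd, buffer, sizeof( buffer ) );  /* read up to 128 bytes */
    if( n < 0 ) { perror( "read" ); return 1; }
    printf( "read %zd bytes\n", n );
    close( fd );                                       /* release the descriptor */
    return 0;
}

The prototype is ssize_t read( int fd, void *buf, size_t count ); it returns the number of
bytes actually read, zero at end-of-file, or -1 on error.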
The use of APIs instead of direct system calls provides for greater program portability between
different systems. The API then makes the appropriate system calls through the system call
interface, using a table lookup to access specific numbered system calls, as shown in Figure 1.6:
Figure 1.6 - The handling of a user application invoking the open( ) system call
Parameters are generally passed to system calls via registers, or less commonly, by values
pushed onto the stack. Large blocks of data are generally accessed indirectly, through a
memory address passed in a register or on the stack.
Six major categories, as outlined in Figure 1.8 and the following six subsections:
Process control system calls include end, abort, load, execute, create process,
terminate process, get/set process attributes, wait for time or event, signal event,
and allocate and free memory.
Processes must be created, launched, monitored, paused, resumed, and eventually
stopped.
When one process pauses or stops, then another must be launched or resumed
When processes stop abnormally it may be necessary to provide core dumps
and/or other diagnostic or recovery tools.
Compare DOS ( a single-tasking system ) with UNIX ( a multi-tasking system ).
o When a process is launched in DOS, the command interpreter first unloads
as much of itself as it can to free up memory, then loads the process and
transfers control to it. The interpreter does not resume until the process has
completed, as shown in Figure 1.9:
Figure 1.9 - MS-DOS execution. (a) At system startup. (b) Running a program.
File management system calls include create file, delete file, open, close, read,
write, reposition, get file attributes, and set file attributes.
These operations may also be supported for directories as well as ordinary files.
Device management system calls include request device, release device, read,
write, reposition, get/set device attributes, and logically attach or detach devices.
Devices may be physical ( e.g. disk drives ), or virtual / abstract ( e.g. files,
partitions, and RAM disks ).
Some systems represent devices as special files in the file system, so that
accessing the "file" calls upon the appropriate device drivers in the OS. See for
example the /dev directory on any UNIX system.
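As a small illustration of this idea, the following sketch reads a few bytes from the
special file /dev/urandom on a Linux or UNIX-like system that provides it; to the
program the device looks just like an ordinary file:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main( void ) {
    unsigned char b[ 4 ];
    int fd = open( "/dev/urandom", O_RDONLY );   /* the device is opened like a file */
    if( fd < 0 ) { perror( "open" ); return 1; }
    read( fd, b, sizeof( b ) );                  /* the device driver supplies the data */
    printf( "%u %u %u %u\n", b[ 0 ], b[ 1 ], b[ 2 ], b[ 3 ] );
    close( fd );
    return 0;
}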
Information maintenance system calls include calls to get/set the time, date,
system data, and process, file, or device attributes.
Systems may also provide the ability to dump memory at any time, single step
programs pausing execution after each instruction, and tracing the operation of
programs, all of which can help to debug programs.
1.4.5 Communication
1.4.6 Protection
System programs provide OS functionality through separate applications, which are not
part of the kernel or command interpreters. They are also known as system utilities or
system applications.
Most systems also ship with useful applications such as calculators and simple editors, (
e.g. Notepad ). Some debate arises as to the border between system and non-system
applications.
System programs may be divided into these categories:
o File management - programs to create, delete, copy, rename, print, list, and
generally manipulate files and directories.
o Status information - Utilities to check on the date, time, number of users,
processes running, data logging, etc. System registries are used to store and recall
configuration information for particular applications.
o File modification - e.g. text editors and other tools which can change file
contents.
o Programming-language support - E.g. Compilers, linkers, debuggers, profilers,
assemblers, library archive management, interpreters for common languages, and
support for make.
o Program loading and execution - loaders, dynamic loaders, overlay loaders,
etc., as well as interactive debuggers.
o Communications - Programs for providing connectivity between processes and
users, including mail, web browsers, remote logins, file transfers, and remote
command execution.
o Background services - System daemons are commonly started when the system
is booted, and run for as long as the system is running, handling necessary
services. Examples include network daemons, print servers, process schedulers,
and system error monitoring services.
Most operating systems today also come complete with a set of application programs to
provide additional services, such as copying files or checking the time and date.
Most users' view of the system is determined by their command interpreter and the
application programs. Most never make system calls, even through the API, ( with the
exception of simple ( file ) I/O in user-written programs. )
Requirements define properties which the finished system must have, and are a
necessary first step in designing any large complex system.
o User requirements are features that users care about and understand, and
are written in commonly understood vernacular. They generally do not
include any implementation details, and are written similar to the product
description one might find on a sales brochure or the outside of a shrink-
wrapped box.
o System requirements are written for the developers, and include more
details about implementation specifics, performance requirements,
compatibility constraints, standards compliance, etc. These requirements
serve as a "contract" between the customer and the developers, ( and
between developers and subcontractors ), and can get quite detailed.
Requirements for operating systems can vary greatly depending on the planned
scope and usage of the system. ( Single user / multi-user, specialized system /
general purpose, high/low security, performance needs, operating environment,
etc. )
1.6.3 Implementation
When DOS was originally written its developers had no idea how big and important it would
eventually become. It was written by a few programmers in a relatively short amount of time,
without the benefit of modern software engineering techniques, and then gradually grew over
time to exceed its original expectations. It does not break the system into subsystems, and has no
distinction between user and kernel modes, allowing all programs direct access to the underlying
hardware. ( Note that user versus kernel mode was not supported by the 8088 chip set anyway,
so that really wasn't an option back then. )
The original UNIX OS used a simple layered approach, but almost all the OS was in one big
layer, not really breaking the OS down into layered subsystems:
Figure 1.12 - Traditional UNIX system structure
1.7.3 Microkernels
The basic idea behind microkernels is to remove all non-essential services from
the kernel, and implement them as system applications instead, thereby making
the kernel as small and efficient as possible.
Most microkernels provide basic process and memory management, and message
passing between other services, and not much more.
Security and protection can be enhanced, as most services are performed in user
mode, not kernel mode.
System expansion can also be easier, because it only involves adding more system
applications, not rebuilding a new kernel.
Mach was the first and most widely known microkernel, and now forms a major
component of Mac OSX.
Windows NT was originally a microkernel, but suffered from performance
problems relative to Windows 95. NT 4.0 improved performance by moving more
services into the kernel, and now XP is back to being more monolithic.
Another microkernel example is QNX, a real-time OS for embedded systems.
1.7.4 Modules
Most OSes today do not strictly adhere to one architecture, but are hybrids of
several.
1.7.5.1 Mac OS X
The Mac OSX architecture relies on the Mach microkernel for basic
system management services, and the BSD kernel for additional services.
Application services and dynamically loadable modules ( kernel
extensions ) provide the rest of the OS functionality:
Virtual Machines
The concept of a virtual machine is to provide an interface that looks like independent
hardware, to multiple different OSes running simultaneously on the same physical
hardware. Each OS believes that it has access to and control over its own CPU, RAM,
I/O devices, hard drives, etc.
One obvious use for this system is for the development and testing of software that must
run on multiple platforms and/or OSes.
One obvious difficulty involves the sharing of hard drives, which are generally
partitioned into separate smaller virtual disks for each OS.
1.8.1 History
Virtual machines first appeared as the VM Operating System for IBM mainframes
in 1972.
1.8.2 Benefits
Each OS runs independently of all the others, offering protection and security
benefits.
( Sharing of physical resources is not commonly implemented, but may be done
as if the virtual machines were networked together. )
Virtual machines are a very useful tool for OS development, as they allow a user
full access to and control over a virtual machine, without affecting other users
operating the real machine.
As mentioned before, this approach can also be useful for product development
and testing of SW that must run on multiple OSes / HW platforms.
1.8.3 Simulation
1.8.5 Implementation
1.8.6.1 VMware
Process Concepts
A process includes:
program counter
stack
data section
Process in Memory
Process State
Process Control Block (PCB)
Information associated with each process:
Process state
Program counter
CPU registers
CPU scheduling information
Memory-management information
Accounting information
I/O status information
Schedulers
Long-term scheduler (or job scheduler) – selects which processes should be brought
into the ready queue
Short-term scheduler (or CPU scheduler) – selects which process should be executed
next and allocates CPU
Process Creation
A parent process creates child processes, which, in turn create other processes, forming
a tree of processes
Generally, a process is identified and managed via a process identifier (pid)
Resource sharing
Parent and children share all resources
Children share subset of parent’s resources
Parent and child share no resources
Child has a program loaded into it
UNIX examples
fork system call creates new process
exec system call used after a fork to replace the process’ memory space with a new
program
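A minimal sketch of this fork( ) / exec( ) pattern, in which the parent also waits for the
child to finish ( the program being exec'ed, ls, is just an example ):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main( void ) {
    pid_t pid = fork( );                       /* create a new ( child ) process */
    if( pid < 0 ) { perror( "fork" ); exit( 1 ); }
    if( pid == 0 ) {                           /* child: replace its memory with a new program */
        execlp( "ls", "ls", "-l", (char *) NULL );
        perror( "exec" );                      /* only reached if exec fails */
        exit( 1 );
    }
    wait( NULL );                              /* parent: wait for the child to terminate */
    printf( "child %d complete\n", (int) pid );
    return 0;
}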
Process Termination
Process executes last statement and asks the operating system to delete it (exit)
Output data from child to parent (via wait)
Process’ resources are deallocated by operating system
Parent may terminate execution of children processes (abort)
Child has exceeded allocated resources
Task assigned to child is no longer required
If parent is exiting
Some operating systems do not allow a child to continue if its parent terminates
Multithreading Models
Many-to-One
One-to-One
Many-to-Many
Many-to-One
Many user-level threads mapped to single kernel thread
Examples:
One-to-One
Many-to-Many Model
Allows many user-level threads to be mapped to many kernel threads
Two-level Model
Similar to M:M, except that it allows a user thread to be bound to a kernel thread
Examples
IRIX
HP-UX
Tru64 UNIX
Solaris 8 and earlier
Thread Libraries
Thread library provides programmer with API for creating and managing threads
Two primary ways of implementing
Library entirely in user space
Kernel-level library supported by the OS
Pthreads
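A minimal Pthreads sketch ( the summing task is purely illustrative ) that creates one
thread, waits for it, and prints the result; compile with cc -pthread:

#include <pthread.h>
#include <stdio.h>

int sum;                                       /* shared with the created thread */

void *runner( void *param ) {                  /* thread start routine */
    int upper = *(int *) param;
    sum = 0;
    for( int i = 1; i <= upper; i++ )
        sum += i;
    pthread_exit( 0 );
}

int main( void ) {
    int n = 10;
    pthread_t tid;
    pthread_attr_t attr;
    pthread_attr_init( &attr );                /* default thread attributes */
    pthread_create( &tid, &attr, runner, &n ); /* create the worker thread */
    pthread_join( tid, NULL );                 /* wait for it to finish */
    printf( "sum = %d\n", sum );
    return 0;
}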
CPU SCHEDULING
Almost all programs have some alternating cycle of CPU number crunching and waiting
for I/O of some kind. (Even a simple fetch from memory takes a long time relative to
CPU speeds.)
In a simple system running a single process, the time spent waiting for I/O is wasted, and
those CPU cycles are lost forever.
A scheduling system allows one process to use the CPU while another is waiting for I/O,
thereby making full use of otherwise lost CPU cycles.
The challenge is to make the overall system as "efficient" and "fair" as possible, subject
to varying and often dynamic conditions, and where "efficient" and "fair" are somewhat
subjective terms, often subject to shifting priority policies.
CPU bursts vary from process to process, and from program to program, but an
extensive study shows frequency patterns similar to that shown in Figure 2.2:
Figure 2.2 - Histogram of CPU-burst durations.
Whenever the CPU becomes idle, it is the job of the CPU Scheduler ( a.k.a. the
short-term scheduler ) to select another process from the ready queue to run next.
The storage structure for the ready queue and the algorithm used to select the next
process are not necessarily a FIFO queue. There are several alternatives to choose
from, as well as numerous adjustable parameters for each algorithm, which is the
basic subject of this entire chapter.
2.1.4 Dispatcher
The dispatcher is the module that gives control of the CPU to the process
selected by the scheduler. This function involves:
o Switching context.
o Switching to user mode.
o Jumping to the proper location in the newly loaded program.
The dispatcher needs to be as fast as possible, as it is run on every context switch.
The time consumed by the dispatcher is known as dispatch latency.
There are several different criteria to consider when trying to select the "best" scheduling
algorithm for a particular situation and environment, including:
o CPU utilization - Ideally the CPU would be busy 100% of the time, so as to
waste 0 CPU cycles. On a real system CPU usage should range from 40% ( lightly
loaded ) to 90% ( heavily loaded. )
o Throughput - Number of processes completed per unit time. May range from 10
/ second to 1 / hour depending on the specific processes.
o Turnaround time - Time required for a particular process to complete, from
submission time to completion. ( Wall clock time. )
o Waiting time - How much time processes spend in the ready queue waiting their
turn to get on the CPU.
( Load average - The average number of processes sitting in the ready
queue waiting their turn to get into the CPU. Reported in 1-minute, 5-
minute, and 15-minute averages by "uptime" and "who". )
o Response time - The time taken in an interactive program from the issuance of a
command to the commencement of a response to that command.
In general one wants to optimize the average value of a criterion ( maximize CPU
utilization and throughput, and minimize all the others. ) However sometimes one wants
to do something different, such as to minimize the maximum response time.
Sometimes it is more desirable to minimize the variance of a criterion than to optimize its
average value. I.e. users are more accepting of a consistent, predictable system than an
inconsistent one, even if it is a little bit slower.
The following subsections will explain several common scheduling strategies, looking at only a
single CPU burst each for a small number of processes. Obviously real systems have to deal with
a lot more simultaneous processes executing their CPU-I/O burst cycles.
2.3.1 First-Come First-Serve Scheduling, FCFS
FCFS is very simple - Just a FIFO queue, like customers waiting in line at the
bank or the post office or at a copying machine.
Unfortunately, however, FCFS can yield some very long average wait times,
particularly if the first process to get there takes a long time. For example,
consider the following three processes, with burst times of 24 ms for P1 and 3 ms
each for P2 and P3:
In the first Gantt chart below, process P1 arrives first. The average waiting time
for the three processes is ( 0 + 24 + 27 ) / 3 = 17.0 ms.
In the second Gantt chart below, the same three processes have an average wait
time of ( 0 + 3 + 6 ) / 3 = 3.0 ms. The total run time for the three bursts is the
same, but in the second case two of the three finish much quicker, and the other
process is only delayed by a short amount.
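A small sketch that computes the FCFS average waiting time for the burst list above
( reorder the array to reproduce the second Gantt chart ):

#include <stdio.h>

int main( void ) {
    int burst[ ] = { 24, 3, 3 };       /* P1, P2, P3 arriving in this order */
    int n = 3, finish = 0, totalWait = 0;
    for( int i = 0; i < n; i++ ) {
        totalWait += finish;           /* each process waits for all earlier bursts */
        finish    += burst[ i ];
    }
    printf( "average waiting time = %.1f ms\n", (double) totalWait / n );
    return 0;
}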
FCFS can also block the system in a busy dynamic system in another way, known
as the convoy effect. When one CPU intensive process blocks the CPU, a number
of I/O intensive processes can get backed up behind it, leaving the I/O devices
idle. When the CPU hog finally relinquishes the CPU, then the I/O processes pass
through the CPU quickly, leaving the CPU idle while everyone queues up for I/O,
and then the cycle repeats itself when the CPU intensive process gets back to the
ready queue.
Priority scheduling is a more general case of SJF, in which each job is assigned a
priority and the job with the highest priority gets scheduled first. ( SJF uses the
inverse of the next expected burst time as its priority - The smaller the expected
burst, the higher the priority. )
Note that in practice, priorities are implemented using integers within a fixed
range, but there is no agreed-upon convention as to whether "high" priorities use
large numbers or small numbers. This book uses low numbers for high priorities,
with 0 being the highest possible priority.
For example, the following Gantt chart is based upon these process burst times
and priorities ( P1 = 10 ms at priority 3, P2 = 1 ms at priority 1, P3 = 2 ms at
priority 4, P4 = 1 ms at priority 5, P5 = 5 ms at priority 2 ), and yields an average
waiting time of 8.2 ms:
Round robin scheduling is similar to FCFS scheduling, except that CPU bursts are
assigned with limits called time quantum.
When a process is given the CPU, a timer is set for whatever value has been set
for a time quantum.
o If the process finishes its burst before the time quantum timer expires, then
it is swapped out of the CPU just like the normal FCFS algorithm.
o If the timer goes off first, then the process is swapped out of the CPU and
moved to the back end of the ready queue.
The ready queue is maintained as a circular queue, so when all processes have had
a turn, then the scheduler gives the first process another turn, and so on.
RR scheduling can give the effect of all processes sharing the CPU equally,
although the average wait time can be longer than with other scheduling
algorithms. In the following example the average wait time is 5.66 ms.
Figure 2.5 - The way in which turnaround time varies with the time quantum.
In general, turnaround time is minimized if most processes finish their next CPU
burst within one time quantum. For example, with three processes of 10 ms bursts
each, the average turnaround time for a 1 ms quantum is 29, and for a 10 ms quantum
it reduces to 20. However, if the quantum is made too large, then RR just degenerates to
FCFS. A rule of thumb is that 80% of CPU bursts should be smaller than the time
quantum.
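A rough sketch of a round-robin simulator that reproduces the turnaround figures quoted
above ( all processes are assumed to arrive at time 0 ):

#include <stdio.h>

double rr_avg_turnaround( const int burst[ ], int n, int quantum ) {
    int remaining[ n ], time = 0, done = 0, total = 0;
    for( int i = 0; i < n; i++ ) remaining[ i ] = burst[ i ];
    while( done < n ) {
        for( int i = 0; i < n; i++ ) {               /* one pass around the ready queue */
            if( remaining[ i ] == 0 ) continue;
            int slice = remaining[ i ] < quantum ? remaining[ i ] : quantum;
            time += slice;
            remaining[ i ] -= slice;
            if( remaining[ i ] == 0 ) { total += time; done++; }   /* completion time */
        }
    }
    return (double) total / n;
}

int main( void ) {
    int burst[ ] = { 10, 10, 10 };                   /* three 10 ms bursts */
    printf( "quantum 1:  %.0f ms\n", rr_avg_turnaround( burst, 3, 1 ) );    /* 29 */
    printf( "quantum 10: %.0f ms\n", rr_avg_turnaround( burst, 3, 10 ) );   /* 20 */
    return 0;
}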
When processes can be readily categorized, then multiple separate queues can be
established, each implementing whatever scheduling algorithm is most
appropriate for that type of job, and/or with different parametric adjustments.
Scheduling must also be done between queues, that is scheduling one queue to get
time relative to other queues. Two common options are strict priority ( no job in a
lower priority queue runs until all higher priority queues are empty ) and round-
robin ( each queue gets a time slice in turn, possibly of different sizes. )
Note that under this algorithm jobs cannot switch from queue to queue - Once
they are assigned a queue, that is their queue until they finish.
Contention scope refers to the scope in which threads compete for the use of
physical CPUs.
On systems implementing many-to-one and many-to-many threads, Process
Contention Scope, PCS, occurs, because competition occurs between threads that
are part of the same process. ( This is the management / scheduling of multiple
user threads on a single kernel thread, and is managed by the thread library. )
System Contention Scope, SCS, involves the system scheduler scheduling kernel
threads to run on one or more CPUs. Systems implementing one-to-one threads (
XP, Solaris 9, Linux ), use only SCS.
PCS scheduling is typically done with priority, where the programmer can set
and/or change the priority of threads created by his or her programs. Even time
slicing is not guaranteed among threads of equal priority.
When multiple processors are available, then the scheduling gets more complicated,
because now there is more than one CPU which must be kept busy and in effective use at
all times.
Load sharing revolves around balancing the load between multiple processors.
Multi-processor systems may be heterogeneous, ( different kinds of CPUs ), or
homogeneous, ( all the same kind of CPU ). Even in the latter case there may be special
scheduling constraints, such as devices which are connected via a private bus to only one
of the CPUs. This book will restrict its discussion to homogeneous systems.
2.5.1 Approaches to Multiple-Processor Scheduling
Processors contain cache memory, which speeds up repeated accesses to the same
memory locations.
If a process were to switch from one processor to another each time it got a time
slice, the data in the cache ( for that process ) would have to be invalidated and re-
loaded from main memory, thereby obviating the benefit of the cache.
Therefore SMP systems attempt to keep processes on the same processor, via
processor affinity. Soft affinity occurs when the system attempts to keep
processes on the same processor but makes no guarantees. Linux and some other
OSes support hard affinity, in which a process specifies that it is not to be moved
between processors.
Main memory architecture can also affect process affinity, if particular CPUs
have faster access to memory on the same chip or board than to other memory
loaded elsewhere. ( Non-Uniform Memory Access, NUMA. ) As shown below, if
a process has an affinity for a particular CPU, then it should preferentially be
assigned memory storage in "local" fast access areas.
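A Linux-specific sketch of hard affinity using sched_setaffinity( ), pinning the calling
process to CPU 0 ( this call is Linux-only; other systems expose affinity differently ):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main( void ) {
    cpu_set_t set;
    CPU_ZERO( &set );                                 /* start with an empty CPU mask */
    CPU_SET( 0, &set );                               /* allow only CPU 0 */
    if( sched_setaffinity( 0, sizeof( set ), &set ) != 0 ) {   /* pid 0 = calling process */
        perror( "sched_setaffinity" );
        return 1;
    }
    printf( "pinned to CPU 0\n" );
    return 0;
}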
Traditional SMP required multiple CPU chips to run multiple kernel threads
concurrently.
Recent trends are to put multiple CPUs ( cores ) onto a single chip, which appear
to the system as multiple processors.
Compute cycles can be blocked by the time needed to access memory, whenever
the needed data is not already present in the cache. ( Cache misses. ) In Figure
2.10, as much as half of the CPU cycles are lost to memory stall.
2.6 Background
Recall that back in Chapter 3 we looked at cooperating processes ( those that can affect or
be affected by other simultaneously running processes ), and as an example, we used the
producer-consumer cooperating processes:
Producer:
item nextProduced;
while( true ) {
    while( ( ( in + 1 ) % BUFFER_SIZE ) == out )
        ;   /* do nothing - buffer is full */
    buffer[ in ] = nextProduced;
    in = ( in + 1 ) % BUFFER_SIZE;
}
Consumer:
item nextConsumed;
while( true ) {
    while( in == out )
        ;   /* do nothing - buffer is empty */
    nextConsumed = buffer[ out ];
    out = ( out + 1 ) % BUFFER_SIZE;
}
The only problem with the above code is that the maximum number of items
which can be placed into the buffer is BUFFER_SIZE - 1. One slot is unavailable
because there always has to be a gap between the producer and the consumer.
We could try to overcome this deficiency by introducing a counter variable, as
shown in the following code segments:
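A reconstruction of those code segments in the usual textbook form ( item,
nextProduced, and nextConsumed are as in the earlier fragments ):

/* shared data */
#define BUFFER_SIZE 10
item buffer[ BUFFER_SIZE ];
int in = 0, out = 0;
int counter = 0;                       /* number of items currently in the buffer */

/* producer */
while( true ) {
    while( counter == BUFFER_SIZE )
        ;                              /* do nothing - buffer is full */
    buffer[ in ] = nextProduced;
    in = ( in + 1 ) % BUFFER_SIZE;
    counter++;
}

/* consumer */
while( true ) {
    while( counter == 0 )
        ;                              /* do nothing - buffer is empty */
    nextConsumed = buffer[ out ];
    out = ( out + 1 ) % BUFFER_SIZE;
    counter--;
}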
Unfortunately we have now introduced a new problem, because both the producer
and the consumer are adjusting the value of the variable counter, which can lead
to a condition known as a race condition. In this condition a piece of code may or
may not work correctly, depending on which of two simultaneous processes
executes first, and more importantly if one of the processes gets interrupted such
that the other process runs between important steps of the first process. ( Bank
balance example discussed in class. )
The particular problem above comes from the producer executing "counter++" at
the same time the consumer is executing "counter--". If one process gets part way
through making the update and then the other process butts in, the value of
counter can get left in an incorrect state.
But, you might say, "Each of those are single instructions - How can they get
interrupted halfway through?" The answer is that although they are single
instructions in C++, they are actually three steps each at the hardware level: (1)
Fetch counter from memory into a register, (2) increment or decrement the
register, and (3) Store the new value of counter back to memory. If the
instructions from the two processes get interleaved, there could be serious
problems, such as illustrated by the following:
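One such interleaving, reconstructed here in the usual form and assuming counter starts
at 5, is:

T0: producer executes register1 = counter           { register1 = 5 }
T1: producer executes register1 = register1 + 1     { register1 = 6 }
T2: consumer executes register2 = counter           { register2 = 5 }
T3: consumer executes register2 = register2 - 1     { register2 = 4 }
T4: producer executes counter = register1           { counter = 6 }
T5: consumer executes counter = register2           { counter = 4 }

After one complete produce and one complete consume, counter should still be 5, but this
interleaving leaves it at 4.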
Exercise: What would be the resulting value of counter if the order of statements
T4 and T5 were reversed? ( What should the value of counter be after one
producer and one consumer, assuming the original value was 5? )
Note that race conditions are notoriously difficult to identify and debug, because
by their very nature they only occur on rare occasions, and only when the timing
is just exactly right. ( or wrong! :-) ) Race conditions are also very difficult to
reproduce. :-(
Obviously the solution is to only allow one process at a time to manipulate the
value "counter". This is a very common occurrence among cooperating processes,
so let's look at some ways in which this is done, as well as some classic problems
in this area.
A solution to the critical section problem must satisfy the following three conditions:
1. Mutual Exclusion - Only one process at a time can be executing in their critical
section.
2. Progress - If no process is currently executing in their critical section, and one or
more processes want to execute their critical section, then only the processes not
in their remainder sections can participate in the decision, and the decision cannot
be postponed indefinitely. ( i.e. processes cannot be blocked forever waiting to get
into their critical sections. )
3. Bounded Waiting - There exists a limit as to how many other processes can get
into their critical sections after a process requests entry into their critical section
and before that request is granted. ( i.e. a process requesting entry into their
critical section will get a turn eventually, and there is a limit as to how many other
processes get to go first. )
We assume that all processes proceed at a non-zero speed, but no assumptions can be
made regarding the relative speed of one process versus another.
Kernel processes can also be subject to race conditions, which can be especially
problematic when updating commonly shared kernel data structures such as open file
tables or virtual memory management. Accordingly kernels can take on one of two
forms: non-preemptive kernels, which do not allow a process running in kernel mode
to be preempted, and preemptive kernels, which do.
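The solution analyzed below is Peterson's solution for two processes; since the listing
itself is not reproduced above, here is a sketch of it ( i is the current process, j the other,
and both flag[ ] and turn are shared ):

int turn;                    /* whose turn it is to enter the critical section */
boolean flag[ 2 ];           /* flag[ i ] = true means process i wants to enter */

/* structure of process i ( with j == 1 - i ) */
do {
    flag[ i ] = true;        /* announce interest */
    turn = j;                /* politely give the other process the turn */
    while( flag[ j ] && turn == j )
        ;                    /* busy wait while the other process has the turn */

    /* critical section */

    flag[ i ] = false;       /* exiting - no longer interested */

    /* remainder section */
} while( true );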
To prove that the solution is correct, we must examine the three conditions listed above:
1. Mutual exclusion - If one process is executing their critical section when the
other wishes to do so, the second process will become blocked by the flag of the
first process. If both processes attempt to enter at the same time, the last process
to execute "turn = j" will be blocked.
2. Progress - Each process can only be blocked at the while if the other process
wants to use the critical section ( flag[ j ] == true ), AND it is the other process's
turn to use the critical section ( turn == j ). If both of those conditions are true,
then the other process ( j ) will be allowed to enter the critical section, and upon
exiting the critical section, will set flag[ j ] to false, releasing process i. The shared
variable turn assures that only one process at a time can be blocked, and the flag
variable allows one process to release the other when exiting their critical section.
3. Bounded Waiting - As each process enters their entry section, they set the turn
variable to be the other process's turn. Since no process ever sets it back to their
own turn, this ensures that each process will have to let the other process go first
at most one time before it becomes their turn again.
Note that the instruction "turn = j" is atomic, that is it is a single machine instruction
which cannot be interrupted.
To generalize the solution(s) expressed above, each process when entering their critical
section must set some sort of lock, to prevent other processes from entering their critical
sections simultaneously, and must release the lock when exiting their critical section, to
allow other processes to proceed. Obviously it must be possible to attain the lock only
when no other process has already set a lock. Specific implementations of this general
procedure can get quite complicated, and may include hardware solutions as outlined in
this section.
One simple solution to the critical section problem is to simply prevent a process from
being interrupted while in their critical section, which is the approach taken by non-
preemptive kernels. Unfortunately this does not work well in multiprocessor
environments, due to the difficulties in disabling and then re-enabling interrupts on all
processors. There is also a question as to how this approach affects timing if the clock
interrupt is disabled.
Another approach is for hardware to provide certain atomic operations. These operations
are guaranteed to operate as a single instruction, without interruption. One such operation
is the "Test and Set", which simultaneously sets a boolean lock variable and returns its
previous value, as shown in Figures 2.14 and 2.15:
Figures 2.14 and 2.15 illustrate "test_and_set( )" function
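Since those figures are not reproduced here, the following sketch shows the usual form
of test_and_set( ) and of mutual exclusion built on top of it ( written as ordinary C for
readability; in reality the hardware executes the whole function as one atomic
instruction ):

boolean test_and_set( boolean *target ) {
    boolean rv = *target;    /* return the old value ...          */
    *target = true;          /* ... and set the lock, atomically  */
    return rv;
}

boolean lock = false;        /* shared */

/* structure of each process */
do {
    while( test_and_set( &lock ) )
        ;                    /* busy wait until the old value was false */

    /* critical section */

    lock = false;            /* release the lock */

    /* remainder section */
} while( true );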
The above examples satisfy the mutual exclusion requirement, but unfortunately do not
guarantee bounded waiting. If there are multiple processes trying to get into their critical
sections, there is no guarantee of what order they will enter, and any one process could
have the bad luck to wait forever until they got their turn in the critical section. ( Since
there is no guarantee as to the relative rates of the processes, a very fast process could
theoretically release the lock, whip through their remainder section, and re-lock the lock
before a slower process got a chance. As more and more processes are involved vying for
the same resource, the odds of a slow process getting locked out completely increase. )
Figure 5.7 illustrates a solution using test-and-set that does satisfy this requirement, using
two shared data structures, boolean lock and boolean waiting[ N ], where N is the number
of processes in contention for critical sections:
The key feature of the above algorithm is that a process blocks on the AND of the critical
section being locked and that this process is in the waiting state. When exiting a critical
section, the exiting process does not just unlock the critical section and let the other
processes have a free-for-all trying to get in. Rather it first looks in an orderly progression
( starting with the next process on the list ) for a process that has been waiting, and if it
finds one, then it releases that particular process from its waiting state, without unlocking
the critical section, thereby allowing a specific process into the critical section while
continuing to block all the others. Only if there are no other processes currently waiting is
the general lock removed, allowing the next process to come along access to the critical
section.
Unfortunately, hardware level locks are especially difficult to implement in multi-
processor architectures. Discussion of such issues is left to books on advanced computer
architecture.
The hardware solutions presented above are often difficult for ordinary programmers to
access, particularly on multi-processor machines, and particularly because they are often
platform-dependent.
Therefore most systems offer a software API equivalent called mutex locks or simply
mutexes. ( For mutual exclusion )
The terminology when using mutexes is to acquire a lock prior to entering a critical
section, and to release it when exiting, as shown in Figure 2.17:
Figure 2.17 - Solution to the critical-section problem using mutex locks
Just as with hardware locks, the acquire step will block the process if the lock is in use by
another process, and both the acquire and release operations are atomic.
Acquire and release can be implemented as shown here, based on a boolean variable
"available":
One problem with the implementation shown here, ( and in the hardware solutions
presented earlier ), is the busy loop used to block processes in the acquire phase. These
types of locks are referred to as spinlocks, because the CPU just sits and spins while
blocking the process.
Spinlocks are wasteful of CPU cycles, and are a really bad idea on single-CPU, single-
threaded machines, because the spinlock blocks the entire computer, and doesn't allow
any other process to release the lock. ( Until the scheduler kicks the spinning process off
of the CPU. )
On the other hand, spinlocks do not incur the overhead of a context switch, so they are
effectively used on multi-threaded machines when it is expected that the lock will be
released after a short time.
2.11 Semaphores
A more robust alternative to simple mutexes is to use semaphores, which are integer
variables for which only two ( atomic ) operations are defined, the wait and signal
operations, as shown in the following figure.
Note that not only must the variable-changing steps ( S-- and S++ ) be indivisible, it is
also necessary that for the wait operation when the test proves false that there be no
interruptions before S gets decremented. It IS okay, however, for the busy loop to be
interrupted when the test is true, which prevents the system from hanging forever.
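The classic busy-waiting definitions referred to above, sketched here ( S is a shared
integer, and each operation must execute atomically ):

wait( S ) {
    while( S <= 0 )
        ;          /* busy wait until the semaphore becomes positive */
    S--;
}

signal( S ) {
    S++;
}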
o Counting semaphores can take on any integer value, and are usually used
to count the number remaining of some limited resource. The counter is
initialized to the number of such resources available in the system, and
whenever the counting semaphore is greater than zero, then a process can
enter a critical section and use one of the resources. When the counter gets
to zero ( or negative in some implementations ), then the process blocks
until another process frees up a resource and increments the counting
semaphore with a signal call. ( The binary semaphore can be seen as just a
special case where the number of resources initially available is just one. )
o Semaphores can also be used to synchronize certain operations between
processes. For example, suppose it is important that process P1 execute
statement S1 before process P2 executes statement S2.
First we create a semaphore named synch that is shared by the two
processes, and initialize it to zero.
Then in process P1 we insert the code:
S1;
signal( synch );
and in process P2 we insert the code:
wait( synch );
S2;
The big problem with semaphores as described above is the busy loop in the wait
call, which consumes CPU cycles without doing any useful work. This type of
lock is known as a spinlock, because the lock just sits there and spins while it
waits. While this is generally a bad thing, it does have the advantage of not
invoking context switches, and so it is sometimes used in multi-processing
systems when the wait time is expected to be short - One thread spins on one
processor while another completes their critical section on another processor.
An alternative approach is to block a process when it is forced to wait for an
available semaphore, and swap it out of the CPU. In this implementation each
semaphore needs to maintain a list of processes that are blocked waiting for it, so
that one of the processes can be woken up and swapped back in when the
semaphore becomes available. ( Whether it gets swapped back into the CPU
immediately or whether it needs to hang out in the ready queue for a while is a
scheduling problem. )
The new definition of a semaphore and the corresponding wait and signal
operations are shown as follows:
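A sketch of that definition in the usual textbook form; block( ) suspends the calling
process, and wakeup( P ) moves a blocked process P back to the ready queue:

typedef struct {
    int value;                     /* may become negative - see the note below */
    struct process *list;          /* processes blocked on this semaphore */
} semaphore;

wait( semaphore *S ) {
    S->value--;
    if( S->value < 0 ) {
        /* add this process to S->list */
        block( );                  /* suspend the calling process */
    }
}

signal( semaphore *S ) {
    S->value++;
    if( S->value <= 0 ) {
        /* remove a process P from S->list */
        wakeup( P );               /* move P back to the ready queue */
    }
}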
Note that in this implementation the value of the semaphore can actually become
negative, in which case its magnitude is the number of processes waiting for that
semaphore. This is a result of decrementing the counter before checking its value.
Key to the success of semaphores is that the wait and signal operations be atomic,
that is no other process can execute a wait or signal on the same semaphore at the
same time. ( Other processes could be allowed to do other things, including
working with other semaphores, they just can't have access to this semaphore. )
On single processors this can be implemented by disabling interrupts during the
execution of wait and signal; Multiprocessor systems have to use more complex
methods, including the use of spinlocking.
One important problem that can arise when using semaphores to block processes
waiting for a limited resource is the problem of deadlocks, which occur when
multiple processes are blocked, each waiting for a resource that can only be freed
by one of the other ( blocked ) processes, as illustrated in the following example
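The classic illustration, reconstructed here: two processes each hold one semaphore and
wait for the other, with both semaphores S and Q initialized to 1:

        P0                      P1
    wait( S );              wait( Q );
    wait( Q );              wait( S );
      . . .                   . . .
    signal( S );            signal( Q );
    signal( Q );            signal( S );

If P0 executes wait( S ) and then P1 executes wait( Q ), each process blocks forever at its
second wait, since each is waiting for a signal that only the other ( blocked ) process can
issue.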
The following classic problems are used to test virtually every new proposed synchronization
algorithm.
In the readers-writers problem there are some processes ( termed readers ) who
only read the shared data, and never change it, and there are other processes (
termed writers ) who may change the data in addition to or instead of reading it.
There is no limit to how many readers can access the data simultaneously, but
when a writer accesses the data, it needs exclusive access.
There are several variations to the readers-writers problem, most centered around
relative priorities of readers versus writers.
o The first readers-writers problem gives priority to readers. In this problem,
if a reader wants access to the data, and there is not already a writer
accessing it, then access is granted to the reader. A solution to this
problem can lead to starvation of the writers, as there could always be
more readers coming along to access the data. ( A steady stream of readers
will jump ahead of waiting writers as long as there is currently already
another reader accessing the data, because the writer is forced to wait until
the data is idle, which may never happen if there are enough readers. )
o The second readers-writers problem gives priority to the writers. In this
problem, when a writer wants access to the data it jumps to the head of the
queue - All waiting readers are blocked, and the writer gets access to the
data as soon as it becomes available. In this solution the readers may be
starved by a steady stream of writers.
The following code is an example of the first readers-writers problem, and
involves an important counter and two binary semaphores:
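A sketch of that code, using the variable names explained below ( semaphore operations
written as wait / signal; both semaphores are initialized to 1 and readcount to 0 ):

/* shared */
semaphore rw_mutex = 1;      /* held by a writer, or by the group of readers */
semaphore mutex = 1;         /* protects readcount */
int readcount = 0;

/* writer */
do {
    wait( rw_mutex );
        /* ... writing is performed ... */
    signal( rw_mutex );
} while( true );

/* reader */
do {
    wait( mutex );
    readcount++;
    if( readcount == 1 )
        wait( rw_mutex );    /* first reader locks out the writers */
    signal( mutex );
        /* ... reading is performed ... */
    wait( mutex );
    readcount--;
    if( readcount == 0 )
        signal( rw_mutex );  /* last reader lets the writers back in */
    signal( mutex );
} while( true );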
o readcount is used by the reader processes, to count the number of readers
currently accessing the data.
o mutex is a semaphore used only by the readers for controlled access to
readcount.
o rw_mutex is a semaphore used to block and release the writers. The first
reader to access the data will set this lock and the last reader to exit will
release it; The remaining readers do not touch rw_mutex. ( Eighth edition
called this variable wrt. )
o Note that the first reader to come along will block on rw_mutex if there is
currently a writer accessing the data, and that all following readers will
only block on mutex for their turn to increment readcount.
2.13 Monitors
Semaphores can be very useful for solving concurrency problems, but only if
programmers use them properly. If even one process fails to abide by the proper use of
semaphores, either accidentally or deliberately, then the whole system breaks down. (
And since concurrency problems are by definition rare events, the problem code may
easily go unnoticed and/or be heinous to debug. )
For this reason a higher-level language construct has been developed, called monitors.
A monitor is essentially a class, in which all data is private, and with the special
restriction that only one method within any given monitor object may be active at
the same time. An additional restriction is that monitor methods may only access
the shared data within the monitor and any data passed to them as parameters. I.e.
they cannot access any data external to the monitor.
But now there is a potential problem - If process P within the monitor issues a
signal that would wake up process Q also within the monitor, then there would be
two processes running simultaneously within the monitor, violating the exclusion
requirement. Accordingly there are two possible solutions to this dilemma:
Signal and wait - When process P issues the signal to wake up process Q, P then waits, either
for Q to leave the monitor or on some other condition.
Signal and continue - When P issues the signal, Q waits, either for P to exit the monitor or for
some other condition.
There are arguments for and against either choice. Concurrent Pascal offers a third alternative -
The signal call causes the signaling process to immediately exit the monitor, so that the waiting
process can then wake up and proceed.
Java and C# ( C sharp ) offer monitors built in to the language. Erlang offers
similar but different constructs.
This solution to the dining philosophers uses monitors, and the restriction that a
philosopher may only pick up chopsticks when both are available. There are also
two key data structures in use in this solution:
1. enum { THINKING, HUNGRY, EATING } state[ 5 ]; A philosopher may
only set their state to eating when neither of their adjacent neighbors is
eating. ( state[ ( i + 1 ) % 5 ] != EATING && state[ ( i + 4 ) % 5 ] !=
EATING ).
2. condition self[ 5 ]; This condition is used to delay a hungry philosopher
who is unable to acquire chopsticks.
In the following solution philosophers share a monitor, DiningPhilosophers, and
eat using the following sequence of operations:
1. DiningPhilosophers.pickup( ) - Acquires chopsticks, which may block the
process.
2. eat
3. DiningPhilosophers.putdown( ) - Releases the chopsticks.
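A sketch of the monitor described above, in the usual monitor pseudocode ( C has no
monitor construct, so this is pseudocode rather than compilable code; test( ) is the usual
helper that checks whether philosopher i may start eating ):

monitor DiningPhilosophers {
    enum { THINKING, HUNGRY, EATING } state[ 5 ];
    condition self[ 5 ];

    void pickup( int i ) {
        state[ i ] = HUNGRY;
        test( i );                       /* try to acquire both chopsticks */
        if( state[ i ] != EATING )
            self[ i ].wait( );           /* block until a neighbor signals us */
    }

    void putdown( int i ) {
        state[ i ] = THINKING;
        test( ( i + 4 ) % 5 );           /* left neighbor may now be able to eat */
        test( ( i + 1 ) % 5 );           /* right neighbor may now be able to eat */
    }

    void test( int i ) {
        if( state[ ( i + 4 ) % 5 ] != EATING &&
            state[ i ] == HUNGRY &&
            state[ ( i + 1 ) % 5 ] != EATING ) {
            state[ i ] = EATING;
            self[ i ].signal( );
        }
    }

    initialization_code( ) {
        for( int i = 0; i < 5; i++ )
            state[ i ] = THINKING;
    }
}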
Unfortunately the use of monitors to restrict access to resources still only works if
programmers make the requisite acquire and release calls properly. One option
would be to place the resource allocation code into the monitor, thereby
eliminating the option for programmers to bypass or ignore the monitor, but then
that would substitute the monitor's scheduling algorithms for whatever other
scheduling algorithms may have been chosen for that particular resource. Chapter
14 on Protection presents more advanced methods for enforcing "nice"
cooperation among processes contending for shared resources.
Concurrent Pascal, Mesa, C#, and Java all implement monitors as described here.
Erlang provides concurrency support using a similar mechanism.
UNIT-III
Main Memory
3.1 Background
Obviously memory accesses and memory management are a very important part of
modern computer operation. Every instruction has to be fetched from memory before it
can be executed, and most instructions involve retrieving data from memory or storing
data in memory or both.
The advent of multi-tasking OSs compounds the complexity of memory management,
because as processes are swapped in and out of the CPU, so must their code and data be
swapped in and out of memory, all at high speeds and without interfering with any other
processes.
Shared memory, virtual memory, the classification of memory as read-only versus read-
write, and concepts like copy-on-write forking all further complicate the issue.
It should be noted that from the memory chips' point of view, all memory accesses are
equivalent. The memory hardware doesn't know what a particular part of memory is
being used for, nor does it care. This is almost true of the OS as well, although not
entirely.
The CPU can only access its registers and main memory. It cannot, for example, make
direct access to the hard drive, so any data stored there must first be transferred into the
main memory chips before the CPU can work with it. ( Device drivers communicate with
their hardware via interrupts and "memory" accesses, sending short instructions for
example to transfer data from the hard drive to a specified location in main memory. The
disk controller monitors the bus for such instructions, transfers the data, and then notifies
the CPU that the data is there with another interrupt, but the CPU never gets direct access
to the disk. )
Memory accesses to registers are very fast, generally one clock tick, and a CPU may be
able to execute more than one machine instruction per clock tick.
Memory accesses to main memory are comparatively slow, and may take a number of
clock ticks to complete. This would require intolerable waiting by the CPU if it were not
for an intermediary fast memory cache built into most modern CPUs. The basic idea of
the cache is to transfer chunks of memory at a time from the main memory to the cache,
and then to access individual memory locations one at a time from the cache.
User processes must be restricted so that they only access memory locations that
"belong" to that particular process. This is usually implemented using a base register and
a limit register for each process, as shown in Figures 3.1 and 3.2 below. Every memory
access made by a user process is checked against these two registers, and if a memory
access is attempted outside the valid range, then a fatal error is generated. The OS
obviously has access to all existing memory locations, as this is necessary to swap users'
code and data in and out of memory. It should also be obvious that changing the contents
of the base and limit registers is a privileged activity, allowed only to the OS kernel.
Figure 3.1 - A base and a limit register define a logical address space
Figure 3.2 - Hardware address protection with base and limit registers
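The check in Figure 3.2 amounts to the following comparison on every user-mode
memory access, sketched here in C ( trap_to_os and access_memory are made-up
placeholder names; the real check is done by hardware ):

/* performed by hardware for every address generated in user mode */
if( address < base || address >= base + limit )
    trap_to_os( );              /* addressing error - fatal to the offending process */
else
    access_memory( address );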
User programs typically refer to memory addresses with symbolic names such as "i",
"count", and "averageTemperature". These symbolic names must be mapped or bound to
physical memory addresses, which typically occurs in several stages:
o Compile Time - If it is known at compile time where a program will reside in
physical memory, then absolute code can be generated by the compiler,
containing actual physical addresses. However if the load address changes at
some later time, then the program will have to be recompiled. DOS .COM
programs use compile time binding.
o Load Time - If the location at which a program will be loaded is not known at
compile time, then the compiler must generate relocatable code, which references
addresses relative to the start of the program. If that starting address changes, then
the program must be reloaded but not recompiled.
o Execution Time - If a program can be moved around in memory during the
course of its execution, then binding must be delayed until execution time. This
requires special hardware, and is the method implemented by most modern OSes.
Figure 3.3 shows the various stages of the binding processes and the units involved in
each stage:
Figure 3.3 - Multistep processing of a user program
The address generated by the CPU is a logical address, whereas the address actually seen
by the memory hardware is a physical address.
Addresses bound at compile time or load time have identical logical and physical
addresses.
Addresses created at execution time, however, have different logical and physical
addresses.
o In this case the logical address is also known as a virtual address, and the two
terms are used interchangeably by our text.
o The set of all logical addresses used by a program composes the logical address
space, and the set of all corresponding physical addresses composes the physical
address space.
The run time mapping of logical to physical addresses is handled by the memory-
management unit, MMU.
o The MMU can take on many forms. One of the simplest is a modification of the
base-register scheme described earlier.
o The base register is now termed a relocation register, whose value is added to
every memory request at the hardware level.
Note that user programs never see physical addresses. User programs work entirely in
logical address space, and any memory references or manipulations are done using purely
logical addresses. Only when the address gets sent to the physical memory chips is the
physical memory address generated.
Rather than loading an entire program into memory at once, dynamic loading loads up
each routine as it is called. The advantage is that unused routines need never be loaded,
reducing total memory usage and generating faster program startup times. The downside
is the added complexity and overhead of checking to see if a routine is loaded every time
it is called and then loading it up if it is not already loaded.
With static linking, library modules get fully included in executable modules, wasting
both disk space and main memory, because every program that included a certain
routine from the library would have to have their own copy of that routine linked into
their executable code.
With dynamic linking, however, only a stub is linked into the executable module,
containing references to the actual library module linked in at run time.
o This method saves disk space, because the library routines do not need to be fully
included in the executable modules, only the stubs.
o We will also learn that if the code section of the library routines is reentrant,
( meaning it does not modify the code while it runs, making it safe to re-enter it ),
then main memory can be saved by loading only one copy of dynamically linked
routines into memory and sharing the code amongst all processes that are
concurrently using it. ( Each process would have their own copy of the data
section of the routines, but that may be small relative to the code segments. )
Obviously the OS must manage shared routines in memory.
o An added benefit of dynamically linked libraries ( DLLs, also known as shared
libraries or shared objects on UNIX systems ) involves easy upgrades and
updates. When a program uses a routine from a standard library and the routine
changes, then the program must be re-built ( re-linked ) in order to incorporate the
changes. However if DLLs are used, then as long as the stub doesn't change, the
program can be updated merely by loading new versions of the DLLs onto the
system. Version information is maintained in both the program and the DLLs, so
that a program can specify a particular version of the DLL if necessary.
o In practice, the first time a program calls a DLL routine, the stub will recognize
the fact and will replace itself with the actual routine from the DLL library.
Further calls to the same routine will access the routine directly and not incur the
overhead of the stub access. ( Following the UML Proxy Pattern. )
o ( Additional information regarding dynamic linking is available at
http://www.iecc.com/linker/linker10.html )
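On POSIX systems the run-time linker can also be invoked explicitly through the dlopen( ) interface, which makes the "resolve the real routine at run time" idea easy to experiment with. The sketch below assumes a Linux-style shared math library ( libm.so.6 ); the library name, and whether a -ldl link flag is needed, vary by platform:

    #include <stdio.h>
    #include <dlfcn.h>      /* dlopen, dlsym, dlclose */

    int main(void)
    {
        /* Nothing from the math library is mapped until this call.
           "libm.so.6" is a platform-specific assumption.            */
        void *handle = dlopen("libm.so.6", RTLD_LAZY);
        if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        /* Look up the routine by name, much as a stub would resolve it. */
        double (*cosine)(double) = (double (*)(double)) dlsym(handle, "cos");
        if (!cosine) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        printf("cos(0.0) = %f\n", cosine(0.0));
        dlclose(handle);
        return 0;
    }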
3.3 Swapping and Contiguous Memory Allocation
One approach to memory management is to load each process into a contiguous space.
The operating system is allocated space first, usually at either low or high memory
locations, and then the remaining available memory is allocated to processes as needed.
( The OS is usually loaded low, because that is where the interrupt vectors are located,
but on older systems part of the OS was loaded high to make more room in low memory (
within the 640K barrier ) for user processes. )
The system shown in Figure 3.6 below allows protection against user programs accessing
areas that they should not, allows programs to be relocated to different memory starting
addresses as needed, and allows the memory space devoted to the OS to grow or shrink
dynamically as needs change.
3.3.3. Fragmentation
All the memory allocation strategies suffer from external fragmentation, though first and
best fits experience the problems more so than worst fit. External fragmentation means
that the available memory is broken up into lots of little pieces, none of which is big
enough to satisfy the next memory requirement, although the sum total could.
The amount of memory lost to fragmentation may vary with algorithm, usage patterns,
and some design decisions such as which end of a hole to allocate and which end to save
on the free list.
Statistical analysis of first fit, for example, shows that for N blocks of allocated memory,
another 0.5 N will be lost to fragmentation.
Internal fragmentation also occurs, with all memory allocation strategies. This is caused
by the fact that memory is allocated in blocks of a fixed size, whereas the actual memory
needed will rarely be that exact size. For a random distribution of memory requests, on
the average 1/2 block will be wasted per memory request, because on the average the last
allocated block will be only half full.
o Note that the same effect happens with hard drives, and that modern hardware
gives us increasingly larger drives and memory at the expense of ever larger block
sizes, which translates to more memory lost to internal fragmentation.
o Some systems use variable size blocks to minimize losses due to internal
fragmentation.
If the programs in memory are relocatable, ( using execution-time address binding ), then
the external fragmentation problem can be reduced via compaction, i.e. moving all
processes down to one end of physical memory. This only involves updating the
relocation register for each process, as all internal work is done using logical addresses.
Another solution as we will see in upcoming sections is to allow processes to use non-
contiguous blocks of physical memory, with a separate relocation register for each block.
3.4 Segmentation
3.4.1 Basic Method
Most users ( programmers ) do not think of their programs as existing in one continuous
linear address space.
Rather they tend to think of their memory in multiple segments, each dedicated to a
particular use, such as code, data, the stack, the heap, etc.
Memory segmentation supports this view by providing addresses with a segment number
( mapped to a segment base address ) and an offset from the beginning of that segment.
For example, a C compiler might generate 5 segments for the user code, library code,
global ( static ) variables, the stack, and the heap, as shown in Figure 3.7:
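Address translation under segmentation can be sketched as a lookup in a small base/limit table. The segment numbers and base/limit values below are invented purely for illustration:

    #include <stdio.h>
    #include <stdlib.h>

    struct segment { unsigned long base, limit; };

    /* Hypothetical segment table: code, data, stack ( values are invented ). */
    struct segment seg_table[] = {
        { 0x1000, 0x400 },   /* segment 0: code  */
        { 0x6000, 0x800 },   /* segment 1: data  */
        { 0x9000, 0x300 },   /* segment 2: stack */
    };

    unsigned long translate(unsigned s, unsigned long offset)
    {
        if (offset >= seg_table[s].limit) {       /* protection check */
            fprintf(stderr, "segment violation: seg %u offset %lu\n", s, offset);
            exit(EXIT_FAILURE);
        }
        return seg_table[s].base + offset;        /* physical address */
    }

    int main(void)
    {
        printf("<1, 0x53> -> 0x%lx\n", translate(1, 0x53));   /* 0x6053 */
        return 0;
    }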
3.5 Paging
Paging is a memory management scheme that allows a process's physical memory to be
non-contiguous, and which eliminates the problems of external fragmentation by allocating
memory in equal-sized blocks known as pages.
Paging eliminates most of the problems of the other methods discussed previously, and is
the predominant memory management technique used today.
The basic idea behind paging is to divide physical memory into a number of equal sized
blocks called frames, and to divide a program's logical memory space into blocks of the
same size called pages.
Any page ( from any process ) can be placed into any available frame.
The page table is used to look up what frame a particular page is stored in at the moment.
In the following example, for instance, page 2 of the program's logical memory is
currently stored in frame 3 of physical memory:
( DOS used to use an addressing scheme with 16-bit segment numbers and 16-bit offsets, on
hardware that only supported 20-bit physical addresses. The result was a resolution of
starting segment addresses finer than the size of a single segment, and multiple segment-offset
combinations that mapped to the same physical hardware address. )
Consider the following micro example, in which a process has 16 bytes of logical
memory, mapped in 4 byte pages into 32 bytes of physical memory. ( Presumably some
other processes would be consuming the remaining 16 bytes of physical memory. )
Figure 3.12 - Paging example for a 32-byte memory with 4-byte pages
Note that paging is like having a table of relocation registers, one for each page of the
logical memory.
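The micro example can be reproduced directly in code: with 4-byte pages the low 2 bits of a logical address are the offset and the remaining bits are the page number. The page-table contents below ( page 0 -> frame 5, 1 -> 6, 2 -> 1, 3 -> 2 ) are an assumption matching the typical textbook figure, not something dictated by the text:

    #include <stdio.h>

    #define PAGE_SIZE   4                     /* 4-byte pages     */
    #define OFFSET_BITS 2                     /* log2(PAGE_SIZE)  */

    /* Assumed page table for the 16-byte process ( 4 pages ):
       page 0 -> frame 5, 1 -> 6, 2 -> 1, 3 -> 2.                 */
    int page_table[4] = { 5, 6, 1, 2 };

    int translate(int logical)
    {
        int page   = logical >> OFFSET_BITS;          /* upper bits             */
        int offset = logical & (PAGE_SIZE - 1);       /* low 2 bits             */
        return page_table[page] * PAGE_SIZE + offset; /* frame * size + offset  */
    }

    int main(void)
    {
        for (int a = 0; a < 16; a++)
            printf("logical %2d -> physical %2d\n", a, translate(a));
        return 0;
    }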
There is no external fragmentation with paging. All blocks of physical memory are used,
and there are no gaps in between and no problems with finding the right sized hole for a
particular chunk of memory.
There is, however, internal fragmentation. Memory is allocated in chunks the size of a
page, and on the average, the last page will only be half full, wasting on the average half
a page of memory per process. ( Possibly more, if processes keep their code and data in
separate pages. )
Larger page sizes waste more memory, but are more efficient in terms of overhead.
Modern trends have been to increase page sizes, and some systems even have multiple
size pages to try and make the best of both worlds.
Page table entries ( frame numbers ) are typically 32 bit numbers, allowing access to
2^32 physical page frames. If those frames are 4 KB in size each, that translates to 16 TB
of addressable physical memory. ( 32 + 12 = 44 bits of physical address space. )
When a process requests memory ( e.g. when its code is loaded in from disk ), free
frames are allocated from a free-frame list, and inserted into that process's page table.
Processes are blocked from accessing anyone else's memory because all of their memory
requests are mapped through their page table. There is no way for them to generate an
address that maps into any other process's memory space.
The operating system must keep track of each individual process's page table, updating it
whenever the process's pages get moved in and out of memory, and applying the correct
page table when processing system calls for a particular process. This all increases the
overhead involved when swapping processes in and out of the CPU. ( The currently
active page table must be updated to reflect the process that is currently running. )
Figure 3.13 - Free frames (a) before allocation and (b) after allocation
Page lookups must be done for every memory reference, and whenever a process gets
swapped in or out of the CPU, its page table must be swapped in and out too, along with
the instruction registers, etc. It is therefore appropriate to provide hardware support for
this operation, in order to make it as fast as possible and to make process switches as fast
as possible also.
One option is to use a set of registers for the page table. For example, the DEC PDP-11
uses 16-bit addressing and 8 KB pages, resulting in only 8 pages per process. ( It takes 13
bits to address 8 KB of offset, leaving only 3 bits to define a page number. )
An alternate option is to store the page table in main memory, and to use a single register
( called the page-table base register, PTBR ) to record where in memory the page table is
located.
o Process switching is fast, because only the single register needs to be changed.
o However memory access is now twice as slow, because every memory access now
requires two memory accesses - one to fetch the frame number from the page table
in memory, and then another one to access the desired memory location.
o The solution to this problem is to use a very special high-speed memory device
called the translation look-aside buffer, TLB.
The benefit of the TLB is that it can search an entire table for a key value
in parallel, and if it is found anywhere in the table, then the corresponding
lookup value is returned.
Figure 3.14 - Paging hardware with TLB
The TLB is very expensive, however, and therefore very small. ( Not large
enough to hold the entire page table. ) It is therefore used as a cache
device.
Addresses are first checked against the TLB, and if the info is not
there ( a TLB miss ), then the frame is looked up from main
memory and the TLB is updated.
If the TLB is full, then replacement strategies range from least-
recently used, LRU to random.
Some TLBs allow some entries to be wired down, which means
that they cannot be removed from the TLB. Typically these would
be kernel frames.
Some TLBs store address-space identifiers, ASIDs, to keep track
of which process "owns" a particular entry in the TLB. This allows
entries from multiple processes to be stored simultaneously in the
TLB without granting one process access to some other process's
memory location. Without this feature the TLB has to be flushed
clean with every process switch.
The percentage of time that the desired information is found in the TLB is
termed the hit ratio.
( Eighth Edition Version: ) For example, suppose that it takes 100
nanoseconds to access main memory, and only 20 nanoseconds to search
the TLB. So a TLB hit takes 120 nanoseconds total ( 20 to find the frame
number and then another 100 to go get the data ), and a TLB miss takes
220 ( 20 to search the TLB, 100 to go get the frame number, and then
another 100 to go get the data. ) So with an 80% TLB hit ratio, the average
memory access time would be:
for a 40% slowdown to get the frame number. A 98% hit rate would yield 122
nanoseconds average access time ( you should verify this ), for a 22% slowdown.
( Ninth Edition Version: ) The ninth edition ignores the 20 nanoseconds
required to search the TLB, yielding
0.80 * 100 + 0.20 * 200 = 120 nanoseconds,
for a 20% slowdown to get the frame number. A 99% hit rate would yield 101
nanoseconds average access time ( you should verify this ), for a 1% slowdown.
3.5.3 Protection
The page table can also help to protect processes from accessing memory that they
shouldn't, or their own memory in ways that they shouldn't.
A bit or bits can be added to the page table to classify a page as read-write, read-only,
read-write-execute, or some combination of these sorts of things. Then each memory
reference can be checked to ensure it is accessing the memory in the appropriate mode.
Valid / invalid bits can be added to "mask off" entries in the page table that are not in use
by the current process, as shown by example in Figure 3.12 below.
Note that the valid / invalid bits described above cannot block all illegal memory
accesses, due to the internal fragmentation. ( Areas of memory in the last page that are
not entirely filled by the process, and may contain data left over by whoever used that
frame last. )
Many processes do not use all of the page table available to them, particularly in modern
systems with very large potential page tables. Rather than waste memory by creating a
full-size page table for every process, some systems use a page-table length register,
PTLR, to specify the length of the page table.
Paging systems can make it very easy to share blocks of memory, by simply mapping the
same frame number into the page tables of multiple processes. This may be done with
either code or data.
If code is reentrant, that means that it does not write to or change the code in any way ( it
is non self-modifying ), and it is therefore safe to re-enter it. More importantly, it means
the code can be shared by multiple processes, so long as each has its own copy of the
data and registers, including the instruction register.
In the example given below, three different users are running the editor simultaneously,
but the code is only loaded into memory ( in the page frames ) one time.
Some systems also implement shared memory in this fashion.
Most modern computer systems support logical address spaces of 2^32 to 2^64 bytes.
With a 2^32 address space and a 4K ( 2^12 ) page size, this leaves 2^20 entries in the page
table. At 4 bytes per entry, this amounts to a 4 MB page table, which is too large to
reasonably keep in contiguous memory. ( And to swap in and out of memory with each
process switch. ) Note that with 4K pages, this would take 1024 pages just to hold the
page table!
One option is to use a two-tier paging system, i.e. to page the page table.
For example, the 20 bits described above could be broken down into two 10-bit page
numbers. The first identifies an entry in the outer page table, which identifies where in
memory to find one page of an inner page table. The second 10 bits finds a specific entry
in that inner page table, which in turn identifies a particular frame in physical memory.
( The remaining 12 bits of the 32 bit logical address are the offset within the 4K frame. )
Figure 3.17 A two-level page-table scheme
Figure 3.18 - Address translation for a two-level 32-bit paging architecture
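Splitting a 32-bit logical address into the 10 / 10 / 12 pattern described above is just bit masking. A minimal sketch ( the actual table contents would of course be filled in by the OS ):

    #include <stdio.h>

    int main(void)
    {
        unsigned int logical = 0x00ABCDEF;             /* arbitrary example */

        unsigned int outer  = (logical >> 22) & 0x3FF; /* top 10 bits       */
        unsigned int inner  = (logical >> 12) & 0x3FF; /* next 10 bits      */
        unsigned int offset =  logical        & 0xFFF; /* low 12 bits       */

        printf("outer index %u, inner index %u, offset 0x%03X\n",
               outer, inner, offset);

        /* Translation would then proceed as:
           inner_table = outer_table[outer];
           frame       = inner_table[inner];
           physical    = frame * 4096 + offset;                              */
        return 0;
    }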
The VAX architecture divides 32-bit addresses into 4 equal-sized sections, and each page is
512 bytes, yielding an address form of:
section ( 2 bits ) | page number ( 21 bits ) | offset ( 9 bits )
With a 64-bit logical address space and 4K pages, there are 52 bits worth of page
numbers, which is still too many even for two-level paging. One could increase the
paging level, but with 10-bit page tables it would take 7 levels of indirection, which
would be prohibitively slow memory access. So some other approach must be used.
One common data structure for accessing data that is sparsely distributed over a broad
range of possible values is with hash tables. Figure 3.19 below illustrates a hashed page
table using chain-and-bucket hashing:
Figure 3.19 - Hashed page table
Another approach is to use an inverted page table. Instead of a table listing all of the
pages for a particular process, an inverted page table lists all of the pages currently loaded
in memory, for all processes. ( I.e. there is one entry per frame instead of one entry per
page. )
Access to an inverted page table can be slow, as it may be necessary to search the entire
table in order to find the desired page ( or to discover that it is not there. ) Hashing the
table can help speed up the search process.
Inverted page tables prohibit the normal method of implementing shared memory, which
is to map multiple logical pages to a common physical frame. ( Because each frame is
now mapped to one and only one process. )
In IA-32 segmentation a logical address consists of a segment selector and an offset; the
selector is used to index into one of the descriptor tables ( the global or local descriptor
table, GDT or LDT ).
o The descriptor tables contain 8-byte descriptions of each segment, including base
and limit registers.
o Logical linear addresses are generated by looking up the selector in the descriptor
table and adding the appropriate base address to the offset, as shown in Figure
3.22:
Figure 3.22 - IA-32 segmentation
Pentium paging normally uses a two-tier paging scheme, with the first 10 bits being a
page number for an outer page table ( a.k.a. page directory ), and the next 10 bits being a
page number within one of the 1024 inner page tables, leaving the remaining 12 bits as an
offset into a 4K page.
A special bit in the page directory can indicate that this page is a 4MB page, in which
case the remaining 22 bits are all used as offset and the inner tier of page tables is not
used.
The CR3 register points to the page directory for the current process, as shown in Figure
3.23 below.
If the inner page table is currently swapped out to disk, then the page directory will have
an "invalid bit" set, and the remaining 31 bits provide information on where to find the
swapped out page table on the disk.
Figure 3.23 - Paging in the IA-32 architecture.
VIRTUAL MEMORY
1 Background
Preceding sections talked about how to avoid memory fragmentation by breaking process
memory requirements down into smaller bites ( pages ), and storing the pages non-
contiguously in memory. However the entire process still had to be stored in memory
somewhere.
In practice, most real processes do not need all their pages, or at least not all at once, for
several reasons:
1. Error handling code is not needed unless that specific error occurs, some of which
are quite rare.
2. Arrays are often over-sized for worst-case scenarios, and only a small fraction of
the arrays are actually used in practice.
3. Certain features of certain programs are rarely used, such as the routine to balance
the federal budget. :-)
The ability to load only the portions of processes that were actually needed ( and only
when they were needed ) has several benefits:
o Programs could be written for a much larger address space ( virtual memory space
) than physically exists on the computer.
o Because each process is only using a fraction of their total address space, there is
more memory left for other programs, improving CPU utilization and system
throughput.
o Less I/O is needed for swapping processes in and out of RAM, speeding things
up.
Figure below shows the general layout of virtual memory, which can be much larger than
physical memory:
Figure 3.25 - Diagram showing virtual memory that is larger than physical memory
Figure 3.25 shows virtual address space, which is the programmer's logical view of
process memory storage. The actual physical layout is controlled by the process's page
table.
Note that the address space shown in Figure 3.25 is sparse - A great hole in the middle of
the address space is never used, unless the stack and/or the heap grow to fill the hole.
Virtual memory also allows the sharing of files and memory by multiple processes, with
several benefits:
o System libraries can be shared by mapping them into the virtual address space of
more than one process.
o Processes can also share virtual memory by mapping the same block of memory
to more than one process.
o Process pages can be shared during a fork( ) system call, eliminating the need to
copy all of the pages of the original ( parent ) process.
The basic idea behind demand paging is that when a process is swapped in, its pages are
not swapped in all at once. Rather they are swapped in only when the process needs them.
( on demand. ) This is termed a lazy swapper, although a pager is a more accurate term.
When a process is first swapped in, the pager only loads into memory those pages that it
expects the process to need ( right away. )
Pages that are not loaded into memory are marked as invalid in the page table, using the
invalid bit. ( The rest of the page table entry may either be blank or contain information
about where to find the swapped-out page on the hard drive. )
If the process only ever accesses pages that are loaded in memory ( memory resident
pages ), then the process runs exactly as if all the pages were loaded in to memory.
Figure 3.29 - Page table when some pages are not in main memory.
On the other hand, if a page is needed that was not originally loaded up, then a page fault
trap is generated, which must be handled in a series of steps:
1. The memory address requested is first checked, to make sure it was a valid
memory request.
2. If the reference was invalid, the process is terminated. Otherwise, the page must
be paged in.
3. A free frame is located, possibly from a free-frame list.
4. A disk operation is scheduled to bring in the necessary page from disk. ( This will
usually block the process on an I/O wait, allowing some other process to use the
CPU in the meantime. )
5. When the I/O operation is complete, the process's page table is updated with the
new frame number, and the invalid bit is changed to indicate that this is now a
valid page reference.
6. The instruction that caused the page fault must now be restarted from the
beginning, ( as soon as this process gets another turn on the CPU. )
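The six steps above can be mimicked by a toy user-space simulation, sketched below. The "disk" and "memory" are just arrays, the free-frame list is a counter, and page replacement is omitted; it illustrates the control flow only, not any real kernel:

    #include <stdio.h>
    #include <string.h>

    #define NPAGES    8
    #define NFRAMES   4
    #define PAGE_SIZE 256

    struct pte { int valid; int frame; };

    struct pte page_table[NPAGES];          /* all zero: every page invalid  */
    char disk[NPAGES][PAGE_SIZE];           /* backing store                 */
    char memory[NFRAMES][PAGE_SIZE];        /* physical frames               */
    int  next_free = 0;                     /* trivial free-frame "list"     */
    int  faults = 0;

    int read_byte(int page, int offset)
    {
        if (page < 0 || page >= NPAGES || offset < 0 || offset >= PAGE_SIZE)
            return -1;                                    /* steps 1-2       */
        if (!page_table[page].valid) {                    /* page fault      */
            faults++;
            if (next_free >= NFRAMES)
                return -1;     /* out of frames - would need page replacement */
            int frame = next_free++;                      /* step 3          */
            memcpy(memory[frame], disk[page], PAGE_SIZE); /* step 4: "I/O"   */
            page_table[page].frame = frame;               /* step 5          */
            page_table[page].valid = 1;
            /* step 6: a real CPU would restart the faulting instruction;
               here we simply fall through and complete the access.          */
        }
        return memory[page_table[page].frame][offset];
    }

    int main(void)
    {
        for (int p = 0; p < NPAGES; p++)
            memset(disk[p], 'a' + p, PAGE_SIZE);
        printf("%c %c %c\n", read_byte(2, 10), read_byte(2, 11), read_byte(5, 0));
        printf("page faults taken: %d\n", faults);        /* 2 */
        return 0;
    }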
In an extreme case, NO pages are swapped in for a process until they are requested by
page faults. This is known as pure demand paging.
In theory each instruction could generate multiple page faults. In practice this is very rare,
due to locality of reference, covered in section 9.6.1.
The hardware necessary to support virtual memory is the same as for paging and
swapping: A page table and secondary memory. ( Swap space, whose allocation is
discussed in chapter 12. )
A crucial part of the process is that the instruction must be restarted from scratch once the
desired page has been made available in memory. For most simple instructions this is not
a major difficulty. However there are some architectures that allow a single instruction to
modify a fairly large block of data, ( which may span a page boundary ), and if some of
the data gets modified before the page fault occurs, this could cause problems. One
solution is to access both ends of the block before executing the instruction, guaranteeing
that the necessary pages get paged in before the instruction begins.
3.2.2 Performance of Demand Paging
Obviously there is some slowdown and performance hit whenever a page fault occurs and
the system has to go get the missing page from disk, but just how big a hit is it exactly?
There are many steps that occur when servicing a page fault ( see book for full details ),
and some of the steps are optional or variable. But just for the sake of discussion, suppose
that a normal memory access requires 200 nanoseconds, and that servicing a page fault
takes 8 milliseconds. ( 8,000,000 nanoseconds, or 40,000 times a normal memory
access. ) With a page fault rate of p, ( on a scale from 0 to 1 ), the effective access time is
now:
( 1 - p ) * ( 200 ) + p * 8000000
= 200 + 7,999,800 * p
which clearly depends heavily on p! Even if only one access in 1000 causes a page fault, the
effective access time jumps from 200 nanoseconds to 8.2 microseconds, a slowdown of a factor
of about 40 times. In order to keep the slowdown less than 10%, the page fault rate must be less
than 0.0000025, or one in 399,990 accesses.
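The arithmetic above is easy to reproduce. A small sketch using the same 200-nanosecond and 8-millisecond figures, evaluated at the two fault rates just mentioned:

    #include <stdio.h>

    int main(void)
    {
        double mem_ns   = 200.0;        /* normal memory access time          */
        double fault_ns = 8000000.0;    /* page-fault service time ( 8 ms )   */

        double rates[] = { 0.001, 0.0000025 };
        for (int i = 0; i < 2; i++) {
            double p   = rates[i];
            double eat = (1 - p) * mem_ns + p * fault_ns;  /* effective access time */
            printf("p = %.7f -> effective access time = %.1f ns ( %.1fx slowdown )\n",
                   p, eat, eat / mem_ns);
        }
        return 0;
    }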
A subtlety is that swap space is faster to access than the regular file system, because it
does not have to go through the whole directory structure. For this reason some systems
will transfer an entire process from the file system to swap space before starting up the
process, so that future paging all occurs from the ( relatively ) faster swap space.
Some systems use demand paging directly from the file system for binary code ( which
never changes, and hence never needs to be written back out when its pages are replaced -
it can simply be re-read from the original file ), and reserve the swap space for data
segments that must be stored. This approach is used by both Solaris and BSD UNIX.
In order to make the most use of virtual memory, we load several processes into memory
at the same time. Since we only load the pages that are actually needed by each process at
any given time, there is room to load many more processes than if we had to load in the
entire process.
However memory is also needed for other purposes ( such as I/O buffering ), and what
happens if some process suddenly decides it needs more pages and there aren't any free
frames available? There are several possible solutions to consider:
1. Adjust the memory used by I/O buffering, etc., to free up some frames for user
processes. The decision of how to allocate memory for I/O versus user processes
is a complex one, yielding different policies on different systems. ( Some allocate
a fixed amount for I/O, and others let the I/O system contend for memory along
with everything else. )
2. Put the process requesting more pages into a wait queue until some free frames
become available.
3. Swap some process out of memory completely, freeing up its page frames.
4. Find some page in memory that isn't being used right now, and swap that page
only out to disk, freeing up a frame that can be allocated to the process requesting
it. This is known as page replacement, and is the most common solution. There
are many different algorithms for page replacement, which is the subject of the
remainder of this section.
The previously discussed page-fault processing assumed that there would be free frames
available on the free-frame list. Now the page-fault handling must be modified to free up
a frame if necessary, as follows:
1. Find the location of the desired page on the disk, either in swap space or in the file
system.
2. Find a free frame:
a. If there is a free frame, use it.
b. If there is no free frame, use a page-replacement algorithm to select an
existing frame to be replaced, known as the victim frame.
c. Write the victim frame to disk. Change all related page tables to indicate
that this page is no longer in memory.
3. Read in the desired page and store it in the frame. Adjust all related page and
frame tables to indicate the change.
4. Restart the process that was waiting for this page.
Figure 3.32 - Page replacement.
Note that step 3c adds an extra disk write to the page-fault handling, effectively doubling
the time required to process a page fault. This can be alleviated somewhat by assigning a
modify bit, or dirty bit to each page, indicating whether or not it has been changed since it
was last loaded in from disk. If the dirty bit has not been set, then the page is unchanged,
and does not need to be written out to disk. Otherwise the page write is required. It
should come as no surprise that many page replacement strategies specifically look for
pages that do not have their dirty bit set, and preferentially select clean pages as victim
pages. It should also be obvious that unmodifiable code pages never get their dirty bits
set.
There are two major requirements to implement a successful demand paging system. We
must develop a frame-allocation algorithm and a page-replacement algorithm. The
former centers around how many frames are allocated to each process ( and to other
needs ), and the latter deals with how to select a page for replacement when there are no
free frames available.
The overall goal in selecting and tuning these algorithms is to generate the fewest number
of overall page faults. Because disk access is so slow relative to memory access, even
slight improvements to these algorithms can yield large improvements in overall system
performance.
Algorithms are evaluated using a given string of memory accesses known as a reference
string, which can be generated in one of ( at least ) three common ways:
1. Randomly generated, either evenly distributed or with some distribution curve
based on observed system behavior. This is the fastest and easiest approach, but
may not reflect real performance well, as it ignores locality of reference.
2. Specifically designed sequences. These are useful for illustrating the properties of
comparative algorithms in published papers and textbooks, ( and also for
homework and exam problems. :-) )
3. Recorded memory references from a live system. This may be the best approach,
but the amount of data collected can be enormous, on the order of a million
addresses per second. The volume of collected data can be reduced by making
two important observations:
1. Only the page number that was accessed is relevant. The offset within that
page does not affect paging operations.
2. Successive accesses within the same page can be treated as a single page
request, because all requests after the first are guaranteed to be page hits.
( Since there are no intervening requests for other pages that could remove
this page from the page table. )
So for example, if pages were of size 100 bytes, then the sequence of
address requests ( 0100, 0432, 0101, 0612, 0634, 0688, 0132, 0038, 0420 )
would reduce to page requests ( 1, 4, 1, 6, 1, 0, 4 )
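Both reductions are one line each in code: divide the address by the page size to keep only the page number, and drop consecutive repeats. A sketch using the 100-byte pages from the example:

    #include <stdio.h>

    int main(void)
    {
        int addr[] = { 100, 432, 101, 612, 634, 688, 132, 38, 420 };
        int n = sizeof(addr) / sizeof(addr[0]);
        int page_size = 100;
        int last = -1;

        printf("reference string:");
        for (int i = 0; i < n; i++) {
            int page = addr[i] / page_size;   /* keep only the page number   */
            if (page != last)                 /* drop consecutive repeats    */
                printf(" %d", page);
            last = page;
        }
        printf("\n");                         /* prints: 1 4 1 6 1 0 4       */
        return 0;
    }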
As the number of available frames increases, the number of page faults should decrease,
as shown in Figure 3.33:
Although FIFO is simple and easy, it is not always optimal, or even efficient.
An interesting effect that can occur with FIFO is Belady's anomaly, in which increasing
the number of frames available can actually increase the number of page faults that
occur! Consider, for example, the following chart based on the page sequence ( 1, 2, 3, 4,
1, 2, 5, 1, 2, 3, 4, 5 ) and a varying number of available frames. Obviously the maximum
number of faults is 12 ( every request generates a fault ), and the minimum number is 5 (
each page loaded only once ), but in between there are some interesting results:
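A short FIFO simulator makes the anomaly easy to verify: for the reference string above it reports 9 faults with 3 frames but 10 faults with 4 frames. The simulator below is only a sketch:

    #include <stdio.h>

    int fifo_faults(const int *ref, int n, int nframes)
    {
        int frames[16];
        int used = 0, next = 0, faults = 0;

        for (int i = 0; i < n; i++) {
            int hit = 0;
            for (int j = 0; j < used; j++)
                if (frames[j] == ref[i]) { hit = 1; break; }
            if (!hit) {
                faults++;
                if (used < nframes) {
                    frames[used++] = ref[i];          /* still a free frame   */
                } else {
                    frames[next] = ref[i];            /* evict the oldest     */
                    next = (next + 1) % nframes;
                }
            }
        }
        return faults;
    }

    int main(void)
    {
        int ref[] = { 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 };
        int n = sizeof(ref) / sizeof(ref[0]);
        for (int f = 1; f <= 7; f++)
            printf("%d frames -> %d faults\n", f, fifo_faults(ref, n, f));
        /* 3 frames give 9 faults but 4 frames give 10 - Belady's anomaly.    */
        return 0;
    }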
The discovery of Belady's anomaly led to the search for an optimal page-replacement
algorithm, which is simply the one that yields the lowest possible page-fault rate, and
which does not suffer from Belady's anomaly.
Such an algorithm does exist, and is called OPT or MIN. This algorithm is simply
"Replace the page that will not be used for the longest time in the future."
For example, Figure 3.36 shows that by applying OPT to the same reference string used
for the FIFO example, the minimum number of possible page faults is 9. Since 6 of the
page-faults are unavoidable ( the first reference to each new page ), FIFO can be shown
to require 3 times as many ( extra ) page faults as the optimal algorithm. ( Note: The book
claims that only the first three page faults are required by all algorithms, indicating that
FIFO is only twice as bad as OPT. )
Unfortunately OPT cannot be implemented in practice, because it requires foretelling the
future, but it makes a nice benchmark for the comparison and evaluation of real proposed
new algorithms.
In practice most page-replacement algorithms try to approximate OPT by predicting
( estimating ) in one fashion or another what page will not be used for the longest period
of time. The basis of FIFO is the prediction that the page that was brought in the longest
time ago is the one that will not be needed again for the longest future time, but as we
shall see, there are many other prediction methods, all striving to match the performance
of OPT.
Figure 3.36 - Optimal page-replacement algorithm
The prediction behind LRU, the Least Recently Used, algorithm is that the page that has
not been used in the longest time is the one that will not be used again in the near future.
( Note the distinction between FIFO and LRU: The former looks at the oldest load time,
and the latter looks at the oldest use time. )
Some view LRU as analogous to OPT, except looking backwards in time instead of
forwards. ( OPT has the interesting property that for any reference string S and its reverse
R, OPT will generate the same number of page faults for S and for R. It turns out that
LRU has this same property. )
Figure 9.15 illustrates LRU for our sample string, yielding 12 page faults, ( as compared
to 15 for FIFO and 9 for OPT. )
LRU is considered a good replacement policy, and is often used. The problem is how
exactly to implement it. There are two simple approaches commonly used:
1. Counters. Every memory access increments a counter, and the current value of
this counter is stored in the page table entry for that page. Then finding the LRU
page involves simply searching the table for the page with the smallest counter
value. Note that overflow of the counter must be considered. ( A sketch of this
approach appears below. )
2. Stack. Another approach is to use a stack, and whenever a page is accessed, pull
that page from the middle of the stack and place it on the top. The LRU page will
always be at the bottom of the stack. Because this requires removing objects from
the middle of the stack, a doubly linked list is the recommended data structure.
Note that both implementations of LRU require hardware support, either for incrementing
the counter or for managing the stack, as these operations must be performed for every
memory access.
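A user-space sketch of the counter approach: every access stamps the page with a global counter, and the victim is the page with the smallest stamp. The reference string is the usual textbook one, and the simulation reports the 12 faults quoted earlier; in a real system the stamping would have to be done by hardware on every memory reference, which is exactly the support few systems provide:

    #include <stdio.h>

    #define NFRAMES 3

    int lru_faults(const int *ref, int n)
    {
        int page[NFRAMES], stamp[NFRAMES];
        int used = 0, clock = 0, faults = 0;

        for (int i = 0; i < n; i++) {
            int hit = -1;
            for (int j = 0; j < used; j++)
                if (page[j] == ref[i]) hit = j;

            if (hit >= 0) { stamp[hit] = ++clock; continue; }  /* update use time */

            faults++;
            int victim = 0;
            if (used < NFRAMES) {
                victim = used++;                               /* free frame      */
            } else {
                for (int j = 1; j < NFRAMES; j++)              /* smallest stamp  */
                    if (stamp[j] < stamp[victim]) victim = j;  /* = least recent  */
            }
            page[victim]  = ref[i];
            stamp[victim] = ++clock;
        }
        return faults;
    }

    int main(void)
    {
        int ref[] = { 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1 };
        printf("LRU faults with %d frames: %d\n", NFRAMES,
               lru_faults(ref, sizeof(ref) / sizeof(ref[0])));   /* prints 12 */
        return 0;
    }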
Neither LRU nor OPT exhibits Belady's anomaly. Both belong to a class of page-
replacement algorithms called stack algorithms, which can never exhibit Belady's
anomaly. A stack algorithm is one in which the pages kept in memory for a frame set of
size N will always be a subset of the pages kept for a frame size of N + 1. In the case of
LRU, ( and particularly the stack implementation thereof ), the top N pages of the stack
will be the same for all frame set sizes of N or anything larger.
Figure 3.38 - Use of a stack to record the most recent page references.
Unfortunately full implementation of LRU requires hardware support, and few systems
provide the full hardware support necessary.
However many systems offer some degree of HW support, enough to approximate LRU
fairly well. ( In the absence of ANY hardware support, FIFO might be the best available
choice. )
In particular, many systems provide a reference bit for every entry in a page table, which
is set anytime that page is accessed. Initially all bits are set to zero, and they can also all
be cleared at any time. One bit of precision is enough to distinguish pages that have been
accessed since the last clear from those that have not, but does not provide any finer grain
of detail.
Finer grain is possible by storing the most recent 8 reference bits for each page in an 8-bit
byte in the page table entry, which is interpreted as an unsigned int.
o At periodic intervals ( clock interrupts ), the OS takes over, and right-shifts each
of the reference bytes by one bit.
o The high-order ( leftmost ) bit is then filled in with the current value of the
reference bit, and the reference bits are cleared.
o At any given time, the page with the smallest value for the reference byte is the
LRU page.
Obviously the specific number of bits used and the frequency with which the reference
byte is updated are adjustable, and are tuned to give the fastest performance on a given
hardware platform.
The second chance algorithm is essentially a FIFO, except the reference bit is used to
give pages a second chance at staying in the page table.
o When a page must be replaced, the page table is scanned in a FIFO ( circular
queue ) manner.
o If a page is found with its reference bit not set, then that page is selected as the
next victim.
o If, however, the next page in the FIFO does have its reference bit set, then it is
given a second chance:
The reference bit is cleared, and the FIFO search continues.
If some other page is found that did not have its reference bit set, then that
page will be selected as the victim, and this page ( the one being given the
second chance ) will be allowed to stay in the page table.
If, however, there are no other pages that do not have their reference bit
set, then this page will be selected as the victim when the FIFO search
circles back around to this page on the second pass.
If all reference bits in the table are set, then second chance degrades to FIFO, but also
requires a complete search of the table for every page-replacement.
As long as there are some pages whose reference bits are not set, then any page
referenced frequently enough gets to stay in the page table indefinitely.
This algorithm is also known as the clock algorithm, from the hands of the clock moving
around the circular queue.
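The clock algorithm itself is only a few lines once reference bits are available. The sketch below simulates it in user space; in a real OS the reference bits would be set by the paging hardware, not by the program:

    #include <stdio.h>

    #define NFRAMES 4

    int page[NFRAMES];     /* page loaded in each frame, -1 = empty */
    int refbit[NFRAMES];   /* reference bit for each frame          */
    int hand = 0;          /* the "clock hand"                      */

    /* Pick a victim frame: clear and skip frames whose reference bit is set,
       stop at the first frame whose bit is already clear.                   */
    int choose_victim(void)
    {
        for (;;) {
            if (refbit[hand] == 0) {
                int victim = hand;
                hand = (hand + 1) % NFRAMES;
                return victim;
            }
            refbit[hand] = 0;                 /* second chance */
            hand = (hand + 1) % NFRAMES;
        }
    }

    void access(int p)
    {
        for (int i = 0; i < NFRAMES; i++)
            if (page[i] == p) { refbit[i] = 1; return; }      /* hit   */

        int v = choose_victim();                              /* fault */
        printf("fault on page %d -> loaded into frame %d\n", p, v);
        page[v]   = p;
        refbit[v] = 1;
    }

    int main(void)
    {
        for (int i = 0; i < NFRAMES; i++) { page[i] = -1; refbit[i] = 0; }
        int ref[] = { 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 };
        for (int i = 0; i < 12; i++) access(ref[i]);
        return 0;
    }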
There are several algorithms based on counting the number of references that have been
made to a given page, such as:
o Least Frequently Used, LFU: Replace the page with the lowest reference count.
A problem can occur if a page is used frequently initially and then not used any
more, as the reference count remains high. A solution to this problem is to right-
shift the counters periodically, yielding a time-decaying average reference count.
o Most Frequently Used, MFU: Replace the page with the highest reference count.
The logic behind this idea is that pages that have already been referenced a lot
have been in the system a long time, and we are probably done with them,
whereas pages referenced only a few times have only recently been loaded, and
we still need them.
In general counting-based algorithms are not commonly used, as their implementation is
expensive and they do not approximate OPT well.
There are a number of page-buffering algorithms that can be used in conjunction with the afore-
mentioned algorithms, to improve overall performance and sometimes make up for inherent
weaknesses in the hardware and/or the underlying page-replacement algorithms:
Maintain a certain minimum number of free frames at all times. When a page-fault
occurs, go ahead and allocate one of the free frames from the free list first, to get the
requesting process up and running again as quickly as possible, and then select a victim
page to write to disk and free up a frame as a second step.
Keep a list of modified pages, and when the I/O system is otherwise idle, have it write
these pages out to disk, and then clear the modify bits, thereby increasing the chance of
finding a "clean" page for the next potential victim.
Keep a pool of free frames, but remember what page was in it before it was made free.
Since the data in the page is not actually cleared out when the page is freed, it can be
made an active page again without having to load in any new data from disk. This is
useful when an algorithm mistakenly replaces a page that in fact is needed again soon.
3.8 Applications and Page Replacement
Some applications ( most notably database programs ) understand their data accessing
and caching needs better than the general-purpose OS, and should therefore be given
free rein to do their own memory management.
Sometimes such programs are given a raw disk partition to work with, containing raw
data blocks and no file system structure. It is then up to the application to use this disk
partition as extended memory or for whatever other reasons it sees fit.
We said earlier that there were two important tasks in virtual memory management: a page-
replacement strategy and a frame-allocation strategy. This section covers the second part of that
pair.
The absolute minimum number of frames that a process must be allocated is dependent
on system architecture, and corresponds to the worst-case scenario of the number of
pages that could be touched by a single ( machine ) instruction.
If an instruction ( and its operands ) spans a page boundary, then multiple pages could be
needed just for the instruction fetch.
Memory references in an instruction touch more pages, and if those memory locations
can span page boundaries, then multiple pages could be needed for operand access also.
The worst case involves indirect addressing, particularly where multiple levels of indirect
addressing are allowed. Left unchecked, a pointer to a pointer to a pointer to a pointer to
a . . . could theoretically touch every page in the virtual address space in a single machine
instruction, requiring every virtual page be loaded into physical memory simultaneously.
For this reason architectures place a limit ( say 16 ) on the number of levels of indirection
allowed in an instruction, which is enforced with a counter initialized to the limit and
decremented with every level of indirection in an instruction - If the counter reaches zero,
then an "excessive indirection" trap occurs. This example would still require a minimum
frame allocation of 17 per process.
Equal Allocation - If there are m frames available and n processes to share them, each
process gets m / n frames, and the leftovers are kept in a free-frame buffer pool.
Proportional Allocation - Allocate the frames proportionally to the size of the process,
relative to the total size of all processes. So if the size of process i is S_i, and S is the sum
of all S_i, then the allocation for process P_i is a_i = m * S_i / S. ( See the sketch below. )
Variations on proportional allocation could consider priority of process rather than just
their size.
Obviously all allocations fluctuate over time as the number of available free frames, m,
fluctuates, and all are also subject to the constraints of minimum allocation. ( If the
minimum allocations cannot be met, then processes must either be swapped out or not
allowed to start until more free frames become available. )
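A quick sketch of proportional allocation, using invented process sizes ( 10 and 127 pages ) and 62 free frames; truncation leftovers would go back to the free-frame pool:

    #include <stdio.h>

    int main(void)
    {
        int m = 62;                          /* free frames to hand out       */
        int size[] = { 10, 127 };            /* S_i for two processes         */
        int n = 2, total = 0;

        for (int i = 0; i < n; i++) total += size[i];        /* S             */

        int given = 0;
        for (int i = 0; i < n; i++) {
            int a = m * size[i] / total;     /* a_i = m * S_i / S ( truncated ) */
            given += a;
            printf("process %d gets %d frames\n", i, a);     /* 4 and 57      */
        }
        printf("%d frames left for the free pool\n", m - given);
        return 0;
    }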
One big question is whether frame allocation ( page replacement ) occurs on a local or
global level.
With local replacement, the number of frames allocated to a process is fixed, and page
replacement occurs only amongst the pages allocated to this process.
With global replacement, any page may be a potential victim, whether it currently
belongs to the process seeking a free frame or not.
Local page replacement allows processes to better control their own page fault rates, and
leads to more consistent performance of a given process over different system load levels.
Global page replacement is overall more efficient, and is the more commonly used
approach.
The above arguments all assume that all memory is equivalent, or at least has equivalent
access times.
This may not be the case in multiple-processor systems, especially where each CPU is
physically located on a separate circuit board which also holds some portion of the
overall system memory.
In these latter systems, CPUs can access memory that is physically located on the same
board much faster than the memory on the other boards.
The basic solution is akin to processor affinity - At the same time that we try to schedule
processes on the same CPU to minimize cache misses, we also try to allocate memory for
those processes on the same boards, to minimize access times.
The presence of threads complicates the picture, especially when the threads get loaded
onto different processors.
Solaris uses an lgroup as a solution, in a hierarchical fashion based on relative latency.
For example, all processors and RAM on a single board would probably be in the same
lgroup. Memory assignments are made within the same lgroup if possible, or to the next
nearest lgroup otherwise. ( Where "nearest" is defined as having the lowest access time. )
3.5 Thrashing
If a process cannot maintain its minimum required number of frames, then it must be
swapped out, freeing up frames for other processes. This is an intermediate level of CPU
scheduling.
But what about a process that can keep its minimum, but cannot keep all of the frames
that it is currently using on a regular basis? In this case it is forced to page out pages that
it will need again in the very near future, leading to large numbers of page faults.
A process that is spending more time paging than executing is said to be thrashing.
To prevent thrashing we must provide processes with as many frames as they really need
"right now", but how do we know what that is?
The working set model is based on the concept of locality, and defines a working set
window, of length delta. Whatever pages are included in the most recent delta page
references are said to be in the process's working set window, and comprise its current
working set, as illustrated in Figure 9.20:
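Computing a working set from a reference string is straightforward: at any instant it is just the set of distinct pages among the last delta references. A sketch with an invented reference string and delta = 10:

    #include <stdio.h>

    #define DELTA 10

    /* Working-set size at position i = number of distinct pages among the
       last DELTA references ending at i.  Assumes page numbers < 64.        */
    int wss(const int *ref, int i)
    {
        int seen[64] = { 0 };
        int count = 0;
        int start = (i + 1 >= DELTA) ? i + 1 - DELTA : 0;
        for (int j = start; j <= i; j++)
            if (!seen[ref[j]]) { seen[ref[j]] = 1; count++; }
        return count;
    }

    int main(void)
    {
        int ref[] = { 1, 2, 1, 5, 7, 7, 7, 7, 5, 1,
                      6, 2, 3, 4, 1, 2, 3, 4, 4, 4 };
        int n = sizeof(ref) / sizeof(ref[0]);
        for (int i = DELTA - 1; i < n; i++)
            printf("t=%2d  working-set size = %d\n", i, wss(ref, i));
        return 0;
    }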
A more direct approach is to recognize that what we really want to control is the page-
fault rate, and to allocate frames based on this directly measurable value. If the page-fault
rate exceeds a certain upper bound then that process needs more frames, and if it is below
a given lower bound, then it can afford to give up some of its frames to other processes.
( I suppose a page-replacement strategy could be devised that would select victim frames
based on the process with the lowest current page-fault frequency. )
Note that there is a direct relationship between the page-fault rate and the working-set, as
a process moves from one locality to another:
UNIT-IV
FILE-SYSTEM INTERFACE
Windows ( and some other systems ) use special file extensions to indicate
the type of each file, e.g. .exe, .txt, or .doc.
Macintosh stores a creator attribute for each file, according to the program that
first created it with the create( ) system call.
UNIX stores magic numbers at the beginning of certain files. ( Experiment with
the "file" command, especially in directories such as /bin and /dev )
Some files contain an internal structure, which may or may not be known to the
OS.
For the OS to support particular file formats increases the size and complexity of
the OS.
UNIX treats all files as sequences of bytes, with no further consideration of the
internal structure. ( With the exception of executable binary programs, which it
must know how to load and find the first executable statement, etc. )
Macintosh files have two forks - a resource fork, and a data fork. The resource
fork contains information relating to the UI, such as icons and button images, and
can be modified independently of the data fork, which contains the code or data as
appropriate.
4.1.5 Internal File Structure
Disk files are accessed in units of physical blocks, typically 512 bytes or some
power-of-two multiple thereof. ( Larger physical disks use larger block sizes, to
keep the range of block numbers within the range of a 32-bit integer. )
Internally, files are organized in logical units, which may be as small as a
single byte, or may be a larger size corresponding to some data record or structure
size.
The number of logical units which fit into one physical block determines its
packing, and has an impact on the amount of internal fragmentation ( wasted
space ) that occurs.
As a general rule, half a physical block is wasted for each file, and the larger the
block sizes the more space is lost to internal fragmentation.
A sequential access file emulates magnetic tape operation, and generally supports
a few operations:
o read next - read a record and advance the tape to the next position.
o write next - write a record and advance the tape to the next position.
o rewind
o skip n records - May or may not be supported. N may be limited to
positive numbers, or may be limited to +/- 1.
A direct-access file allows jumping to any record and reading it directly. Operations supported include:
o read n - read record number n. ( Note an argument is now required. )
o write n - write record number n. ( Note an argument is now required. )
o jump to record n - could be 0 or the end of file.
o Query current record - used to return back to this record later.
o Sequential access can be easily emulated using direct access. The inverse
is complicated and inefficient.
Figure 4.3- Simulation of sequential access on a direct-access file.
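The emulation shown in Figure 4.3 amounts to keeping a current-position pointer cp and turning each sequential operation into a direct-access one. A sketch using the standard C library's fseek( ) / fread( ) as the direct-access primitives; the fixed record size and the data file name are assumptions:

    #include <stdio.h>

    #define RECSIZE 64              /* assumed fixed record size             */

    static FILE *f;
    static long  cp = 0;            /* current record position               */

    /* read_next: read record cp directly, then advance cp.                  */
    int read_next(char *buf)
    {
        if (fseek(f, cp * RECSIZE, SEEK_SET) != 0) return -1;
        if (fread(buf, RECSIZE, 1, f) != 1) return -1;
        cp = cp + 1;
        return 0;
    }

    void rewind_file(void) { cp = 0; }

    int main(void)
    {
        char rec[RECSIZE];
        f = fopen("records.dat", "rb");        /* hypothetical data file      */
        if (!f) { perror("records.dat"); return 1; }
        while (read_next(rec) == 0)
            printf("read record %ld\n", cp - 1);
        fclose(f);
        return 0;
    }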
An indexed access scheme can be easily built on top of a direct access system.
Very large files may require a multi-tiered indexing scheme, i.e. indexes of
indexes.
An obvious extension to the two-tiered directory structure, and the one with
which we are all most familiar.
Each user / process has the concept of a current directory from which all
( relative ) searches take place.
Files may be accessed using either absolute pathnames ( relative to the root of the
tree ) or relative pathnames ( relative to the current directory. )
Directories are stored the same as any other file in the system, except there is a bit
that identifies them as directories, and they have some special structure that the
OS understands.
One question for consideration is whether or not to allow the removal of
directories that are not empty - Windows requires that directories be emptied first,
and UNIX provides an option for deleting entire sub-trees.
Figure 4.7 - Tree-structured directory structure.
When the same files need to be accessed in more than one place in the directory
structure ( e.g. because they are being shared by more than one user / process ), it
can be useful to provide an acyclic-graph structure. ( Note the directed arcs from
parent to child. )
UNIX provides two types of links for implementing the acyclic-graph structure. (
See "man ln" for more details. )
o A hard link ( usually just called a link ) involves multiple directory entries
that both refer to the same file. Hard links are only valid for ordinary files
in the same filesystem.
o A symbolic link, that involves a special file, containing information about
where to find the linked file. Symbolic links may be used to link
directories and/or files in other filesystems, as well as ordinary files in the
current filesystem.
Windows only supports symbolic links, termed shortcuts.
Hard links require a reference count, or link count for each file, keeping track of
how many directory entries are currently referring to this file. Whenever one of
the references is removed the link count is reduced, and when it reaches zero, the
disk space can be reclaimed.
For symbolic links there is some question as to what to do with the symbolic links
when the original file is moved or deleted:
o One option is to find all the symbolic links and adjust them also.
o Another is to leave the symbolic links dangling, and discover that they are
no longer valid the next time they are used.
o What if the original file is removed, and replaced with another file having
the same name before the symbolic link is next used?
Figure 4.8 - Acyclic-graph directory structure.
If cycles are allowed in the graphs, then several problems can arise:
o Search algorithms can go into infinite loops. One solution is to not follow
links in search algorithms. ( Or not to follow symbolic links, and to only
allow symbolic links to refer to directories. )
o Sub-trees can become disconnected from the rest of the tree and still not
have their reference counts reduced to zero. Periodic garbage collection is
required to detect and resolve this problem. ( chkdsk in DOS and fsck in
UNIX search for these problems, among others, even though cycles are
not supposed to be allowed in either system. Disconnected disk blocks that
are not marked as free are added back to the file systems with made-up file
names, and can usually be safely deleted. )
The basic idea behind mounting file systems is to combine multiple file systems into one
large tree structure.
The mount command is given a filesystem to mount and a mount point ( directory ) on
which to attach it.
Once a file system is mounted onto a mount point, any further references to that directory
actually refer to the root of the mounted file system.
Any files ( or sub-directories ) that had been stored in the mount point directory prior to
mounting the new filesystem are now hidden by the mounted filesystem, and are no
longer available. For this reason some systems only allow mounting onto empty
directories.
Filesystems can only be mounted by root, unless root has previously configured certain
filesystems to be mountable onto certain pre-determined mount points. ( E.g. root may
allow users to mount floppy filesystems to /mnt or something like it. ) Anyone can run
the mount command to see what filesystems are currently mounted.
Filesystems may be mounted read-only, or have other restrictions imposed.
Figure 4.10 - File system. (a) Existing system. (b) Unmounted volume.
The traditional Windows OS runs an extended two-tier directory structure, where the first
tier of the structure separates volumes by drive letters, and a tree structure is implemented
below that level.
Macintosh runs a similar system, where each new volume that is found is automatically
mounted and added to the desktop when it is found.
More recent Windows systems allow filesystems to be mounted to any directory in the
filesystem, much like UNIX.
4.5 File Sharing
The advent of the Internet introduces issues for accessing files stored on remote
computers
o The original method was ftp, allowing individual files to be transported
across systems as needed. Ftp can be either account and password
controlled, or anonymous, not requiring any user name or password.
o Various forms of distributed file systems allow remote file systems to be
mounted onto a local directory structure, and accessed using normal file
access commands. ( The actual files are still transported across the
network as needed, possibly using ftp as the underlying transport
mechanism. )
o The WWW has made it easy once again to access files on remote systems
without mounting their filesystems, generally using ( anonymous ) ftp as
the underlying file transport mechanism.
The Domain Name System, DNS, provides for a unique naming system
across all of the Internet.
Domain names are maintained by the Network Information System, NIS,
which unfortunately has several security issues. NIS+ is a more secure
version, but has not yet gained the same widespread acceptance as NIS.
Microsoft's Common Internet File System, CIFS, establishes a network
login for each user on a networked system with shared file access. Older
Windows systems used domains, and newer systems ( XP, 2000 ), use
active directories. User names must match across the network for this
system to be valid.
A newer approach is the Lightweight Directory-Access Protocol, LDAP,
which provides a secure single sign-on for all users to access all resources
on a network. This is a secure system which is gaining in popularity, and
which has the maintenance advantage of combining authorization
information in one central location.
Consistency Semantics deals with the consistency between the views of shared
files on a networked system. When one user changes the file, when do other users
see the changes?
At first glance this appears to have all of the synchronization issues discussed in
Chapter 6. Unfortunately the long delays involved in network operations prohibit
the use of atomic operations as discussed in that chapter.
4.6 Protection
Files must be kept safe for reliability ( against accidental damage ), and protection
( against deliberate malicious access. ) The former is usually managed with backup
copies. This section discusses the latter.
One simple protection scheme is to remove all access to a file. However this makes the
file unusable, so some sort of controlled access must be arranged.
One approach is to have complicated Access Control Lists, ACL, which specify
exactly what access is allowed or denied for specific users or groups.
o The AFS uses this system for distributed access.
o Control is very finely adjustable, but may be complicated, particularly
when the specific users involved are unknown. ( AFS allows some wild
cards, so for example all users on a certain remote system may be trusted,
or a given username may be trusted when accessing from any remote
system. )
UNIX uses a set of 9 access control bits, in three groups of three. These
correspond to R, W, and X permissions for each of the Owner, Group, and Others.
( See "man chmod" for full details. ) The RWX bits control the following
privileges for ordinary files and directories:
o R - Read the contents of the file. / List the contents of the directory.
o W - Write ( change ) the contents of the file. / Create or delete files within the directory.
o X - Execute the file as a program. / Access ( cd into ) the directory and the files within it.
In addition there are some special bits that can also be applied:
o The set user ID ( SUID ) bit and/or the set group ID ( SGID ) bits applied
to executable files temporarily change the identity of whoever runs the
program to match that of the owner / group of the executable program.
This allows users running specific programs to have access to files ( while
running that program ) to which they would normally be unable to
access. Setting of these two bits is usually restricted to root, and must be
done with caution, as it introduces a potential security leak.
o The sticky bit on a directory modifies write permission, allowing users to
only delete files for which they are the owner. This allows everyone to
create files in /tmp, for example, but to only delete files which they have
created, and not anyone else's.
o The SUID, SGID, and sticky bits are indicated with an s, s, and t in the
positions for execute permission for the user, group, and others,
respectively. If the letter is lower case, ( s, s, t ), then the corresponding
execute permission is also given. If it is upper case, ( S, S, T ), then the
corresponding execute permission is NOT given.
o The numeric form of chmod is needed to set these advanced bits.
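For reference, the special bits occupy the high-order octal digit of the numeric mode ( 4xxx = SUID, 2xxx = SGID, 1xxx = sticky ). The sketch below uses the chmod( ) system call on hypothetical paths:

    #include <stdio.h>
    #include <sys/stat.h>    /* chmod, S_ISUID, S_ISGID, S_ISVTX */

    int main(void)
    {
        /* SUID plus rwxr-xr-x, i.e. octal mode 4755 ( hypothetical path ).   */
        if (chmod("/usr/local/bin/myprog", S_ISUID | 0755) != 0)
            perror("chmod myprog");

        /* Sticky bit plus rwxrwxrwx on a shared directory, i.e. mode 1777.   */
        if (chmod("/tmp/shared", S_ISVTX | 0777) != 0)
            perror("chmod shared");

        return 0;
    }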
Some systems can apply passwords, either to individual files, or to specific sub-
directories, or to the entire system. There is a trade-off between the number of
passwords that must be maintained ( and remembered by the users ) and the
amount of information that is vulnerable to a lost or forgotten password.
Older systems which did not originally have multi-user file access permissions (
DOS and older versions of Mac ) must now be retrofitted if they are to share files
on a network.
Access to a file requires access to all the files along its path as well. In a cyclic
directory structure, users may have different access to the same file accessed
through different paths.
Sometimes just the knowledge of the existence of a file of a certain name is a
security ( or privacy ) concern. Hence the distinction between the R and X bits on
UNIX directories.
Hard disks have two important properties that make them suitable for secondary storage
of files in file systems: (1) Blocks of data can be rewritten in place, and (2) they are direct
access, allowing any block of data to be accessed with only ( relatively ) minor
movements of the disk heads and rotational latency.
Disks are usually accessed in physical blocks, rather than a byte at a time. Block sizes
may range from 512 bytes to 4K or larger.
File systems organize storage on disk drives, and can be viewed as a layered design:
o At the lowest layer are the physical devices, consisting of the magnetic media,
motors & controls, and the electronics connected to them and controlling them.
Modern disks put more and more of the electronic controls directly on the disk
drive itself, leaving relatively little work for the disk controller card to perform.
o I/O Control consists of device drivers, special software programs ( often written
in assembly ) which communicate with the devices by reading and writing special
codes directly to and from memory addresses corresponding to the controller
card's registers. Each controller card ( device ) on a system has a different set of
addresses ( registers, a.k.a. ports ) that it listens to, and a unique set of command
codes and results codes that it understands.
o The basic file system level works directly with the device drivers in terms of
retrieving and storing raw blocks of data, without any consideration for what is in
each block. Depending on the system, blocks may be referred to with a single
block number, ( e.g. block # 234234 ), or with head-sector-cylinder combinations.
o The file organization module knows about files and their logical blocks, and how
they map to physical blocks on the disk. In addition to translating from logical to
physical blocks, the file organization module also maintains the list of free blocks,
and allocates free blocks to files as needed.
o The logical file system deals with all of the meta data associated with a file ( UID,
GID, mode, dates, etc ), i.e. everything about the file except the data itself. This
level manages the directory structure and the mapping of file names to file control
blocks, FCBs, which contain all of the meta data as well as block number
information for finding the data on the disk.
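To make the FCB idea concrete, here is a minimal sketch of what such a structure might hold; the field names, sizes, and the number of direct block pointers are assumptions for illustration, not the on-disk format of any particular filesystem ( the UNIX inode, for example, differs in detail ).

/* Hypothetical file control block (FCB) layout -- a simplified sketch only. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define DIRECT_BLOCKS 12

typedef struct fcb {
    uint32_t uid;                   /* owner's user ID                          */
    uint32_t gid;                   /* owning group ID                          */
    uint16_t mode;                  /* permission bits (rwx for owner/group/other) */
    uint64_t size;                  /* file size in bytes                       */
    time_t   atime, mtime, ctime;   /* access, modification, change times       */
    uint32_t direct[DIRECT_BLOCKS]; /* block numbers of the first data blocks   */
    uint32_t indirect;              /* block holding further block numbers      */
} fcb_t;

int main(void)
{
    printf("size of this sketch FCB: %zu bytes\n", sizeof(fcb_t));
    return 0;
}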
The layered approach to file systems means that much of the code can be used uniformly
for a wide variety of different file systems, and only certain layers need to be filesystem
specific. Common file systems in use include the UNIX file system ( UFS ), the Berkeley
Fast File System ( FFS ), the Windows file systems FAT, FAT32, and NTFS, the CD-ROM
file system ISO 9660, and on Linux the extended file systems ext2 and ext3 ( among some
40 others supported. )
Figure 4.13- Layered file system.
4.8.1 Overview
Physical disks are commonly divided into smaller units called partitions. They can
also be combined into larger units, but that is most commonly done for RAID
installations and is left for later chapters.
Partitions can either be used as raw devices ( with no structure imposed upon
them ), or they can be formatted to hold a filesystem ( i.e. populated with FCBs
and initial directory structures as appropriate. ) Raw partitions are generally used
for swap space, and may also be used for certain programs such as databases that
choose to manage their own disk storage system. Partitions containing filesystems
can generally only be accessed using the file system structure by ordinary users,
but can often be accessed as a raw device also by root.
The boot block is accessed as part of a raw partition, by the boot program prior to
any operating system being loaded. Modern boot programs understand multiple
OSes and filesystem formats, and can give the user a choice of which of several
available systems to boot.
The root partition contains the OS kernel and at least the key portions of the OS
needed to complete the boot process. At boot time the root partition is mounted,
and control is transferred from the boot program to the kernel found there. ( Older
systems required that the root partition lie completely within the first 1024
cylinders of the disk, because that was as far as the boot program could reach.
Once the kernel had control, then it could access partitions beyond the 1024
cylinder boundary. )
Continuing with the boot process, additional filesystems get mounted, adding
their information into the appropriate mount table structure. As a part of the
mounting process the file systems may be checked for errors or inconsistencies,
either because they are flagged as not having been closed properly the last time
they were used, or just as a matter of general principle. Filesystems may be mounted either
automatically or manually. In UNIX a mount point is indicated by setting a flag in
the in-memory copy of the inode, so all future references to that inode get re-
directed to the root directory of the mounted filesystem.
Directories need to be fast to search, insert, and delete, with a minimum of wasted disk
space.
Linear List
A linear list is the simplest and easiest directory structure to set up, but it does
have some drawbacks.
Finding a file ( or verifying one does not already exist upon creation ) requires a
linear search.
Deletions can be done by moving all entries, flagging an entry as deleted, or by
moving the last entry into the newly vacant position.
Sorting the list makes searches faster, at the expense of more complex insertions
and deletions.
A linked list makes insertions and deletions into a sorted list easier, with overhead
for the links.
More complex data structures, such as B-trees, could also be considered.
Hash Table
A hash table can also be used: the directory entries are still stored in a linear list, but a
hash structure keyed on the file name points directly into that list, greatly decreasing the
search time. Collisions must be handled, and the table is typically of fixed size ( or must
be rebuilt as the directory grows ).
There are three major methods of storing files on disks: contiguous, linked, and indexed.
Linked Allocation
Disk files can be stored as linked lists, with the expense of the storage space
consumed by each link. ( E.g. a block may be 508 bytes instead of 512. )
Linked allocation involves no external fragmentation, does not require pre-known
file sizes, and allows files to grow dynamically at any time.
Unfortunately linked allocation is only efficient for sequential access files, as
random access requires starting at the beginning of the list for each new location
access.
Allocating clusters of blocks reduces the space wasted by pointers, at the cost of
internal fragmentation.
Another big problem with linked allocation is reliability if a pointer is lost or
damaged. Doubly linked lists provide some protection, at the cost of additional
overhead and wasted space.
The File Allocation Table, FAT, used by DOS is a variation of linked allocation, where all
the links are stored in a separate table at the beginning of the disk. The benefit of this
approach is that the FAT table can be cached in memory, greatly improving random access
speeds.
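The following C sketch illustrates the FAT idea described above ( it is not DOS's actual on-disk format ): each FAT entry names the block that follows it in the same file, so a file's blocks can be found by walking the cached table rather than the disk. The table size, sentinel value, and block numbers are made up for illustration.

/* Sketch of linked allocation via a File Allocation Table (FAT).
 * fat[i] holds the number of the block that follows block i in the same file;
 * FAT_EOF marks the last block of a file.  Values are illustrative only. */
#include <stdio.h>

#define FAT_EOF (-1)

int main(void)
{
    int fat[16] = {0};

    /* A hypothetical file whose data occupies blocks 9 -> 12 -> 3 -> end. */
    fat[9] = 12;  fat[12] = 3;  fat[3] = FAT_EOF;

    /* Random access still requires walking the chain, but the walk touches
     * only the in-memory FAT, not the disk itself. */
    for (int block = 9; block != FAT_EOF; block = fat[block])
        printf("next data block: %d\n", block);

    return 0;
}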
Figure 4.19- File-allocation table.
Indexed Allocation combines all of the indexes for accessing each file into a
common block ( for that file ), as opposed to spreading them all over the disk or
storing them in a FAT table.
Performance
The optimal allocation method is different for sequential access files than for
random access files, and is also different for small files than for large files.
Some systems support more than one allocation method, which may require
specifying how the file is to be used ( sequential or random access ) at the time it
is allocated. Such systems also provide conversion utilities.
Some systems have been known to use contiguous access for small files, and
automatically switch to an indexed scheme when file sizes surpass a certain
threshold.
And of course some systems adjust their allocation schemes ( e.g. block sizes ) to
best match the characteristics of the hardware for optimum performance.
Bit Vector
One simple approach is to use a bit vector, in which each bit represents a disk
block, set to 1 if free or 0 if allocated.
Fast algorithms exist for quickly finding contiguous blocks of a given size
The down side is the size of the bitmap itself: for example, a 40 GB disk with 1 KB
blocks requires over 5 MB just to store the bitmap.
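As a minimal sketch of the bit-vector technique ( using the convention above, 1 = free, 0 = allocated ), the block count and the initial contents below are illustrative values only.

/* Sketch: free-space management with a bit vector, one bit per disk block. */
#include <stdio.h>
#include <stdint.h>

#define NBLOCKS 64

static uint8_t bitmap[NBLOCKS / 8];

static int  block_is_free(int b)  { return (bitmap[b / 8] >> (b % 8)) & 1; }
static void mark_allocated(int b) { bitmap[b / 8] &= ~(1u << (b % 8)); }
static void mark_free(int b)      { bitmap[b / 8] |=  (1u << (b % 8)); }

/* Return the first free block, or -1 if the disk is full. */
static int first_free(void)
{
    for (int b = 0; b < NBLOCKS; b++)
        if (block_is_free(b))
            return b;
    return -1;
}

int main(void)
{
    for (int b = 0; b < NBLOCKS; b++) mark_free(b);  /* start with an empty disk   */
    mark_allocated(0);  mark_allocated(1);           /* pretend blocks 0-1 hold metadata */

    printf("first free block: %d\n", first_free()); /* prints 2 */
    return 0;
}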
Linked List
A linked list can also be used to keep track of all free blocks.
Traversing the list and/or finding a contiguous block of a given size are not easy,
but fortunately are not frequently needed operations. Generally the system just
adds and removes single blocks from the beginning of the list.
The FAT table keeps track of the free list as just one more linked list on the table.
Figure 4.22- Linked free-space list on disk.
Grouping
A variation on linked list free lists is to use links of blocks of indices of free
blocks. If a block holds up to N addresses, then the first block in the linked-list
contains up to N-1 addresses of free blocks and a pointer to the next block of free
addresses.
Counting
When there are multiple contiguous blocks of free space then the system can keep
track of the starting address of the group and the number of contiguous free
blocks. As long as the average length of a contiguous group of free blocks is
greater than two this offers a savings in space needed for the free list. ( Similar to
compression techniques used for graphics images when a group of pixels all the
same color is encountered. )
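A brief sketch of the counting idea follows: each free-list entry records a starting block and a run length, so one entry can describe many contiguous free blocks. The run values are made up for illustration.

/* Sketch: "counting" free-space representation -- (start block, run length) pairs. */
#include <stdio.h>

struct free_run { unsigned start; unsigned count; };

int main(void)
{
    /* Illustrative free space: blocks 2-5, 8-9, and 100-999 are free. */
    struct free_run runs[] = { {2, 4}, {8, 2}, {100, 900} };
    unsigned total = 0;

    for (unsigned i = 0; i < sizeof runs / sizeof runs[0]; i++)
        total += runs[i].count;

    printf("%u free blocks described by %zu entries\n",
           total, sizeof runs / sizeof runs[0]);
    return 0;
}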
Space Maps
Sun's ZFS file system was designed for HUGE numbers and sizes of files,
directories, and even file systems.
The resulting data structures could be VERY inefficient if not implemented
carefully. For example, freeing up a 1 GB file on a 1 TB file system could involve
updating thousands of blocks of free list bit maps if the file was spread across the
disk.
ZFS uses a combination of techniques, starting with dividing the disk up into (
hundreds of ) metaslabs of a manageable size, each with its own space map.
Free blocks are managed using the counting technique, but rather than write the
information to a table, it is recorded in a log-structured transaction record.
Adjacent free blocks are also coalesced into a larger single free block.
An in-memory space map is constructed using a balanced tree data structure,
constructed from the log data.
The combination of the in-memory tree and the on-disk log provide for very fast
and efficient management of these very large files and free blocks.
Efficiency
UNIX pre-allocates inodes, which occupy space even before any files are
created.
UNIX also distributes inodes across the disk, and tries to store data files near their
inode, to reduce the distance of disk seeks between the inodes and the data.
Some systems use variable size clusters depending on the file size.
The more data that is stored in a directory ( e.g. last access time ), the more often
the directory blocks have to be re-written.
As technology advances, addressing schemes have had to grow as well.
o Sun's ZFS file system uses 128-bit pointers, which should theoretically
never need to be expanded. ( The mass required to store 2^128 bytes with
atomic storage would be at least 272 trillion kilograms! )
Kernel table sizes used to be fixed, and could only be changed by rebuilding the
kernels. Modern tables are dynamically allocated, but that requires more
complicated algorithms for accessing them.
Performance
Page replacement strategies can be complicated with a unified cache, as one needs
to decide whether to replace process or file pages, and how many pages to
guarantee to each category of pages. Solaris, for example, has gone through many
variations, resulting in priority paging giving process pages priority over file I/O
pages, and setting limits so that neither can knock the other completely out of
memory.
Another issue affecting performance is the question of whether to implement
synchronous writes or asynchronous writes. Synchronous writes occur in the
order in which the disk subsystem receives them, without caching; Asynchronous
writes are cached, allowing the disk subsystem to schedule writes in a more
efficient order ( See Chapter 12. ) Metadata writes are often done synchronously.
Some systems support flags to the open call requiring that writes be synchronous,
for example for the benefit of database systems that require their writes be
performed in a required order.
The type of file access can also have an impact on optimal page replacement
policies. For example, LRU is not necessarily a good policy for sequential access
files. For these types of files progression normally goes in a forward direction
only, and the most recently used page will not be needed again until after the file
has been rewound and re-read from the beginning, ( if it is ever needed at all. ) On
the other hand, we can expect to need the next page in the file fairly soon. For this
reason sequential access files often take advantage of two special policies:
o Free-behind frees up a page as soon as the next page in the file is
requested, with the assumption that we are now done with the old page
and won't need it again for a long time.
o Read-ahead reads the requested page and several subsequent pages at the
same time, with the assumption that those pages will be needed in the near
future. This is similar to the track caching that is already performed by the
disk controller, except it saves the future latency of transferring data from
the disk controller memory into motherboard main memory.
The caching system and asynchronous writes speed up disk writes considerably,
because the disk subsystem can schedule physical writes to the disk to minimize
head movement and disk seek times. ( See Chapter 12. ) Reads, on the other hand,
must be done more synchronously in spite of the caching system, with the result
that disk writes can counter-intuitively be much faster on average than disk reads.
MASS-STORAGE STRUCTURE
Magnetic Disks
In operation the disk rotates at high speed, such as 7200 rpm ( 120 revolutions per
second. ) The time required to transfer data between the disk and the computer
is composed of several components:
o The positioning time, a.k.a. the seek time or random access time is the
time required to move the heads from one cylinder to another, and for the
heads to settle down after the move. This is typically the slowest step in
the process and the predominant bottleneck to overall transfer rates.
o The rotational latency is the amount of time required for the desired
sector to rotate around and come under the read-write head. This can range
anywhere from zero to one full revolution, and on the average will equal
one-half revolution. This is another physical step and is usually the second
slowest step behind seek time. ( For a disk rotating at 7200 rpm, the
average rotational latency would be 1/2 revolution / 120 revolutions per
second, or just over 4 milliseconds, a long time by computer standards. )
o The transfer time is the time required to move the data electronically
from the disk to the computer, and is governed by the transfer rate. ( Some
authors use the term transfer rate to refer to the overall rate, including seek
time and rotational latency as well as the electronic data transfer rate. )
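The following short sketch adds up the three components just described for one request. The drive parameters ( 7200 rpm, a 9 ms average seek, a 100 MB/s transfer rate, a 4 KB request ) are illustrative assumptions, not measurements of any particular disk.

/* Sketch: rough service-time estimate for one disk request. */
#include <stdio.h>

int main(void)
{
    double rpm           = 7200.0;
    double avg_seek_ms   = 9.0;
    double transfer_mb_s = 100.0;
    double request_kb    = 4.0;                      /* one 4 KB block            */

    double rotation_ms    = 60000.0 / rpm;           /* one full revolution       */
    double avg_latency_ms = rotation_ms / 2.0;       /* half a revolution, on average */
    double transfer_ms    = request_kb / 1024.0 / transfer_mb_s * 1000.0;

    printf("seek %.1f ms + latency %.2f ms + transfer %.3f ms = %.2f ms\n",
           avg_seek_ms, avg_latency_ms, transfer_ms,
           avg_seek_ms + avg_latency_ms + transfer_ms);
    return 0;
}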
Disk heads "fly" over the surface on a very thin cushion of air. If they should
accidentally contact the disk, then a head crash occurs, which may or may not
permanently damage the disk or even destroy it completely. For this reason it is
normal to park the disk heads when turning a computer off, which means to move
the heads off the disk or to an area of the disk where there is no data stored.
Floppy disks are normally removable. Hard drives can also be removable, and
some are even hot-swappable, meaning they can be removed while the computer
is running, and a new hard drive inserted in their place.
Disk drives are connected to the computer via a cable known as the I/O Bus.
Some of the common interface formats include Enhanced Integrated Drive
Electronics, EIDE; Advanced Technology Attachment, ATA; Serial ATA, SATA,
Universal Serial Bus, USB; Fiber Channel, FC, and Small Computer Systems
Interface, SCSI.
The host controller is at the computer end of the I/O bus, and the disk controller
is built into the disk itself. The CPU issues commands to the host controller via
I/O ports. Data is transferred between the magnetic surface and onboard cache by
the disk controller, and then the data is transferred from that cache to the host
controller and the motherboard memory at electronic speeds.
As technologies improve and economics change, old technologies are often used
in different ways. One example of this is the increasing use of solid-state disks,
or SSDs.
SSDs use memory technology as a small fast hard disk. Specific implementations
may use either flash memory or DRAM chips protected by a battery to sustain the
information through power cycles.
Because SSDs have no moving parts they are much faster than traditional hard
drives, and certain problems such as the scheduling of disk accesses simply do not
apply.
However SSDs also have their weaknesses: They are more expensive than hard
drives, generally not as large, and may have shorter life spans.
SSDs are especially useful as a high-speed cache of hard-disk information that
must be accessed quickly. One example is to store filesystem meta-data, e.g.
directory and inode information, that must be accessed quickly and often. Another
variation is a boot disk containing the OS and some application executables, but
no vital user data. SSDs are also used in laptops to make them smaller, faster, and
lighter.
Because SSDs are so much faster than traditional hard disks, the throughput of the
bus can become a limiting factor, causing some SSDs to be connected directly to
the system PCI bus for example.
Magnetic Tapes - Magnetic tapes were once used for common secondary storage before the
days of hard disk drives, but today are used primarily for backups.
Accessing a particular spot on a magnetic tape can be slow, but once reading or
writing commences, access speeds are comparable to disk drives.
Capacities of tape drives can range from 20 to 200 GB, and compression can
double that capacity.
The traditional head-sector-cylinder, HSC numbers are mapped to linear block addresses
by numbering the first sector on the first head on the outermost track as sector 0.
Numbering proceeds with the rest of the sectors on that same track, and then the rest of
the tracks on the same cylinder before proceeding through the rest of the cylinders to the
center of the disk. In modern practice these linear block addresses are used in place of the
HSC numbers for a variety of reasons:
1. The linear length of tracks near the outer edge of the disk is much longer than for
those tracks located near the center, and therefore it is possible to squeeze many
more sectors onto outer tracks than onto inner ones.
2. All disks have some bad sectors, and therefore disks maintain a few spare sectors
that can be used in place of the bad ones. The mapping of spare sectors to bad
sectors is managed internally by the disk controller.
3. Modern hard drives can have thousands of cylinders, and hundreds of sectors per
track on their outermost tracks. These numbers exceed the range of HSC numbers
for many ( older ) operating systems, and therefore disks can be configured for
any convenient combination of HSC values that falls within the total number of
sectors physically on the drive.
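As a hedged sketch of the numbering just described, the conversion from a ( cylinder, head, sector ) address to a linear block address can be written as below; the geometry constants are illustrative, and CHS sectors are conventionally numbered starting at 1.

/* Sketch: mapping CHS addresses to linear block addresses, with sector 0 being
 * the first sector of the first head on the outermost cylinder. */
#include <stdio.h>

#define HEADS             16     /* tracks per cylinder (illustrative) */
#define SECTORS_PER_TRACK 63     /* sectors per track   (illustrative) */

static unsigned long chs_to_lba(unsigned c, unsigned h, unsigned s)
{
    return ((unsigned long)c * HEADS + h) * SECTORS_PER_TRACK + (s - 1);
}

int main(void)
{
    printf("LBA of (0,0,1) = %lu\n", chs_to_lba(0, 0, 1));   /* 0    */
    printf("LBA of (0,1,1) = %lu\n", chs_to_lba(0, 1, 1));   /* 63   */
    printf("LBA of (1,0,1) = %lu\n", chs_to_lba(1, 0, 1));   /* 1008 */
    return 0;
}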
There is a limit to how closely packed individual bits can be placed on a physical medium,
but that limit keeps increasing as technological advances are made.
Modern disks pack many more sectors into outer cylinders than inner ones, using one of
two approaches:
o With Constant Linear Velocity, CLV, the density of bits is uniform from cylinder
to cylinder. Because there are more sectors in outer cylinders, the disk spins
slower when reading those cylinders, causing the rate of bits passing under the
read-write head to remain constant. This is the approach used by modern CDs and
DVDs.
o With Constant Angular Velocity, CAV, the disk rotates at a constant angular
speed, with the bit density decreasing on outer cylinders. ( These disks would
have a constant number of sectors per track on all cylinders. )
Host-Attached Storage
Network-Attached Storage
Storage-Area Network
As mentioned earlier, disk transfer speeds are limited primarily by seek times and
rotational latency. When multiple requests are to be processed there is also some
inherent delay in waiting for other requests to be processed.
Bandwidth is measured by the amount of data transferred divided by the total amount of
time from the first request being made to the last transfer being completed, ( for a series
of disk requests. )
Both bandwidth and access time can be improved by processing requests in a good order.
Disk requests include the disk address, memory address, number of sectors to transfer,
and whether the request is for reading or writing.
FCFS Scheduling
First-Come First-Serve is simple and intrinsically fair, but not very efficient.
Consider in the following sequence the wild swing from cylinder 122 to 14 and
then back to 124:
SSTF Scheduling
Shortest Seek Time First scheduling is more efficient, but may lead to starvation
if a constant stream of requests arrives for the same general area of the disk.
SSTF reduces the total head movement to 236 cylinders, down from 640 required
for the same set of requests under FCFS. Note, however that the distance could be
reduced still further to 208 by starting with 37 and then 14 first before processing
the rest of the requests.
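The sketch below computes total head movement under FCFS and SSTF. The request queue and starting cylinder ( 98, 183, 37, 122, 14, 124, 65, 67, with the head at cylinder 53 ) are assumed from the classic example that the 640 and 236 figures refer to; they are assumptions, not values stated explicitly in these notes.

/* Sketch: total head movement for FCFS vs. SSTF disk scheduling. */
#include <stdio.h>
#include <stdlib.h>

#define NREQ 8

static int movement_fcfs(int head, const int *q, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++) { total += abs(q[i] - head); head = q[i]; }
    return total;
}

static int movement_sstf(int head, const int *q, int n)
{
    int done[NREQ] = {0}, total = 0;
    for (int served = 0; served < n; served++) {
        int best = -1;
        for (int i = 0; i < n; i++)          /* pick the closest pending request */
            if (!done[i] && (best < 0 || abs(q[i] - head) < abs(q[best] - head)))
                best = i;
        total += abs(q[best] - head);
        head = q[best];
        done[best] = 1;
    }
    return total;
}

int main(void)
{
    int queue[NREQ] = {98, 183, 37, 122, 14, 124, 65, 67};
    int start = 53;
    printf("FCFS: %d cylinders\n", movement_fcfs(start, queue, NREQ));  /* 640 */
    printf("SSTF: %d cylinders\n", movement_sstf(start, queue, NREQ));  /* 236 */
    return 0;
}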
SCAN Scheduling
The SCAN algorithm, a.k.a. the elevator algorithm moves back and forth from
one end of the disk to the other, similarly to an elevator processing requests in a
tall building.
Under the SCAN algorithm, if a request arrives just ahead of the moving head
then it will be processed right away, but if it arrives just after the head has passed,
then it will have to wait for the head to pass going the other way on the return trip.
This leads to a fairly wide variation in access times which can be improved upon.
Consider, for example, when the head reaches the high end of the disk: Requests
with high cylinder numbers just missed the passing head, which means they are
all fairly recent requests, whereas requests with low numbers may have been
waiting for a much longer time. Making the return scan from high to low then
ends up accessing recent requests first and making older requests wait that much
longer.
C-SCAN Scheduling
LOOK Scheduling
With very low loads all algorithms are equal, since there will normally only be
one request to process at a time.
For slightly larger loads, SSTF offers better performance than FCFS, but may lead
to starvation when loads become heavy enough.
For busier systems, SCAN and LOOK algorithms eliminate starvation problems.
The actual optimal algorithm may be something even more complex than those
discussed here, but the incremental improvements are generally not worth the
additional overhead.
Some improvement to overall filesystem access times can be made by intelligent
placement of directory and/or inode information. If those structures are placed in
the middle of the disk instead of at the beginning of the disk, then the maximum
distance from those structures to data blocks is reduced to only one-half of the
disk size. If those structures can be further distributed and furthermore have their
data blocks stored as close as possible to the corresponding directory structures,
then that reduces still further the overall time to find the disk block numbers and
then access the corresponding data blocks.
On modern disks the rotational latency can be almost as significant as the seek
time; however it is not within the OS's control to account for that, because
modern disks do not reveal their internal sector mapping schemes, ( particularly
when bad blocks have been remapped to spare sectors. )
o Some disk manufacturers provide for disk scheduling algorithms directly
on their disk controllers, ( which do know the actual geometry of the disk
as well as any remapping ), so that if a series of requests are sent from the
computer to the controller then those requests can be processed in an
optimal order.
o Unfortunately there are some considerations that the OS must take into
account that are beyond the abilities of the on-board disk-scheduling
algorithms, such as priorities of some requests over others, or the need to
process certain requests in a particular order. For this reason OSes may
elect to spoon-feed requests to the disk controller one at a time in certain
situations.
Disk Formatting
Before a disk can be used, it has to be low-level formatted, which means laying
down all of the headers and trailers marking the beginning and ends of each
sector. Included in the header and trailer are the linear sector numbers, and error-
correcting codes, ECC, which allow damaged sectors to not only be detected, but
in many cases for the damaged data to be recovered ( depending on the extent of
the damage. ) Sector sizes are traditionally 512 bytes, but may be larger,
particularly in larger drives.
ECC calculation is performed with every disk read or write, and if damage is
detected but the data is recoverable, then a soft error has occurred. Soft errors are
generally handled by the on-board disk controller, and never seen by the OS. ( See
below. )
Once the disk is low-level formatted, the next step is to partition the drive into one
or more separate partitions. This step must be completed even if the disk is to be
used as a single large partition, so that the partition table can be written to the
beginning of the disk.
After partitioning, then the filesystems must be logically formatted, which
involves laying down the master directory information ( FAT table or inode
structure ), initializing free lists, and creating at least the root directory of the
filesystem. ( Disk partitions which are to be used as raw devices are not logically
formatted. This saves the overhead and disk space of the filesystem structure, but
requires that the application program manage its own disk storage requirements. )
Boot Block
Bad Blocks
No disk can be manufactured to 100% perfection, and all physical objects wear
out over time. For these reasons all disks are shipped with a few bad blocks, and
additional blocks can be expected to go bad slowly over time. If a large number of
blocks go bad then the entire disk will need to be replaced, but a few here and
there can be handled through other means.
In the old days, bad blocks had to be checked for manually. Formatting of the disk
or running certain disk-analysis tools would identify bad blocks, and attempt to
read the data off of them one last time through repeated tries. Then the bad blocks
would be mapped out and taken out of future service. Sometimes the data could
be recovered, and sometimes it was lost forever. ( Disk analysis tools could be
either destructive or non-destructive. )
Modern disk controllers make much better use of the error-correcting codes, so
that bad blocks can be detected earlier and the data usually recovered. ( Recall
that blocks are tested with every write as well as with every read, so often errors
can be detected before the write operation is complete, and the data simply written
to a different sector instead. )
Note that re-mapping of sectors from their normal linear progression can throw
off the disk scheduling optimization of the OS, especially if the replacement
sector is physically far away from the sector it is replacing. For this reason most
disks normally keep a few spare sectors on each cylinder, as well as at least one
spare cylinder. Whenever possible a bad sector will be mapped to another sector
on the same cylinder, or at least a cylinder as close as possible. Sector slipping
may also be performed, in which all sectors between the bad sector and the
replacement sector are moved down by one, so that the linear progression of
sector numbers can be maintained.
If the data on a bad block cannot be recovered, then a hard error has occurred,
which requires replacing the file(s) from backups, or rebuilding them from
scratch.
Modern systems typically swap out pages as needed, rather than swapping out entire
processes. Hence the swapping system is part of the virtual memory management system.
Managing swap space is obviously an important task for modern OSes.
Swap-Space Use
Swap-Space Location
Swap space can be physically located in one of two locations:
Historically OSes swapped out entire processes as needed. Modern systems swap
out only individual pages, and only as needed. ( For example process code blocks
and other blocks that have not been changed since they were originally loaded are
normally just freed from the virtual memory system rather than copying them to
swap space, because it is faster to go find them again in the filesystem and read
them back in from there than to write them out to swap space and then read them
back. )
In the mapping system shown below for Linux systems, a map of swap space is
kept in memory, where each entry corresponds to a 4K block in the swap space.
Zeros indicate free slots and non-zeros refer to how many processes have a
mapping to that particular block ( >1 for shared pages only. )
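A minimal sketch of that swap-map idea follows: one counter per 4 KB swap slot, zero meaning free and a value greater than 1 meaning the slot is shared by several processes. The slot count and function names are assumptions for illustration, not Linux's actual implementation.

/* Sketch: per-slot reference counts for swap space. */
#include <stdio.h>

#define SWAP_SLOTS 1024

static unsigned short swap_map[SWAP_SLOTS];   /* all zero = all slots free */

static int swap_alloc(void)                    /* claim the first free slot */
{
    for (int i = 0; i < SWAP_SLOTS; i++)
        if (swap_map[i] == 0) { swap_map[i] = 1; return i; }
    return -1;                                 /* swap space exhausted */
}

static void swap_share(int slot)   { swap_map[slot]++; }              /* another mapping */
static void swap_release(int slot) { if (swap_map[slot]) swap_map[slot]--; }

int main(void)
{
    int s = swap_alloc();
    swap_share(s);                 /* e.g. a page shared after a fork   */
    swap_release(s);               /* one process drops its mapping     */
    printf("slot %d reference count: %u\n", s, swap_map[s]);   /* prints 1 */
    return 0;
}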
UNIT-V
Deadlocks
Figure 5.2 - Resource allocation graph with a deadlock
Figure 5.3 - Resource allocation graph with a cycle but no deadlock
5.3 Methods for Handling Deadlocks
Deadlocks can be prevented by preventing at least one of the four required conditions:
To prevent this condition processes must be prevented from holding one or more
resources while simultaneously waiting for one or more others. There are several
possibilities for this:
o Require that all processes request all resources at one time. This can be
wasteful of system resources if a process needs one resource early in its
execution and doesn't need some other resource until much later.
o Require that processes holding resources must release them before
requesting new resources, and then re-acquire the released resources along
with the new ones in a single new request. This can be a problem if a
process has partially completed an operation using a resource and then
fails to get it re-allocated after releasing it.
o Either of the methods described above can lead to starvation if a process
requires one or more popular resources.
5.4.3 No Preemption
5.4.4 Circular Wait
One way to prevent circular wait is to number all resources, and to require that
processes request resources only in strictly increasing ( or decreasing ) order.
In other words, in order to request resource Rj, a process must first release all Ri
such that i >= j.
One big challenge in this scheme is determining the relative ordering of the
different resources
The general idea behind deadlock avoidance is to prevent deadlocks from ever
happening, by making allocation decisions dynamically using advance knowledge of each
process's resource needs, so that the system never enters an unsafe state.
This requires more information about each process, AND tends to lead to low device
utilization. ( I.e. it is a conservative approach. )
In some algorithms the scheduler only needs to know the maximum number of each
resource that a process might potentially use. In more complex algorithms the scheduler
can also take advantage of the schedule of exactly what resources may be needed in what
order.
When a scheduler sees that starting a process or granting resource requests may lead to
future deadlocks, then that process is just not started or the request is not granted.
A resource allocation state is defined by the number of available and allocated resources,
and the maximum requirements of all processes in the system.
A state is safe if the system can allocate all resources requested by all processes (
up to their stated maximums ) without entering a deadlock state.
More formally, a state is safe if there exists a safe sequence of processes { P0, P1,
P2, ..., PN } such that all of the resource requests for Pi can be granted using the
resources currently allocated to Pi and all processes Pj where j < i. ( I.e. if all the
processes prior to Pi finish and free up their resources, then Pi will be able to
finish also, using the resources that they have freed up. )
If a safe sequence does not exist, then the system is in an unsafe state, which
MAY lead to deadlock. ( All safe states are deadlock free, but not all unsafe states
lead to deadlocks. )
For example, consider a system with 12 tape drives, allocated as follows. Is this a
safe state? What is the safe sequence?
What happens to the above table if process P2 requests and is granted one more
tape drive?
Key to the safe state approach is that when a request is made for resources, the
request is granted only if the resulting allocation state is a safe one.
5.5.2 Resource-Allocation Graph Algorithm
If resource categories have only single instances of their resources, then deadlock
states can be detected by cycles in the resource-allocation graphs.
In this case, unsafe states can be recognized and avoided by augmenting the
resource-allocation graph with claim edges, noted by dashed lines, which point
from a process to a resource that it may request in the future.
In order for this technique to work, all claim edges must be added to the graph for
any particular process before that process is allowed to request any resources.
( Alternatively, processes may only make requests for resources for which they
have already established claim edges, and claim edges cannot be added to any
process that is currently holding resources. )
When a process makes a request, the claim edge Pi->Rj is converted to a request
edge. Similarly when a resource is released, the assignment reverts back to a
claim edge.
This approach works by denying requests that would produce cycles in the
resource-allocation graph, taking claim edges into effect.
Consider for example what happens when process P2 requests resource R2:
The resulting resource-allocation graph would have a cycle in it, and so the
request cannot be granted.
For resource categories that contain more than one instance the resource-
allocation graph method does not work, and more complex ( and less efficient )
methods must be chosen.
The Banker's Algorithm gets its name because it is a method that bankers could
use to assure that when they lend out resources they will still be able to satisfy all
their clients. ( A banker won't loan out a little money to start building a house
unless they are assured that they will later be able to loan out the rest of the
money to finish the house. )
When a process starts up, it must state in advance the maximum allocation of
resources it may request, up to the amount available on the system.
When a request is made, the scheduler determines whether granting the request
would leave the system in a safe state. If not, then the process must wait until the
request can be granted safely.
The banker's algorithm relies on several key data structures: ( where n is the
number of processes and m is the number of resource categories. )
o Available[ m ] indicates how many resources are currently available of
each type.
o Max[ n ][ m ] indicates the maximum demand of each process of each
resource.
o Allocation[ n ][ m ] indicates the number of each resource category
allocated to each process.
o Need[ n ][ m ] indicates the remaining resources needed of each type for
each process. ( Note that Need[ i ][ j ] = Max[ i ][ j ] - Allocation[ i ][ j ]
for all i, j. )
For simplification of discussions, we make the following notations / observations:
o One row of the Need vector, Need[ i ], can be treated as a vector
corresponding to the needs of process i, and similarly for Allocation and
Max.
o A vector X is considered to be <= a vector Y if X[ i ] <= Y[ i ] for all i.
Now that we have a tool for determining if a particular state is safe or not,
we are now ready to look at the Banker's algorithm itself.
This algorithm determines if a new request is safe, and grants it only if it is
safe to do so.
When a request is made ( that does not exceed currently available
resources ), pretend it has been granted, and then see if the resulting state
is a safe one. If so, grant the request, and if not, deny the request, as
follows:
1. Let Request[ n ][ m ] indicate the number of resources of each type
currently requested by processes. If Request[ i ] > Need[ i ] for any
process i, raise an error condition.
2. If Request[ i ] > Available for any process i, then that process must
wait for resources to become available. Otherwise the process can
continue to step 3.
3. Check to see if the request can be granted safely, by pretending it
has been granted and then seeing if the resulting state is safe. If so,
grant the request, and if not, then the process must wait until its
request can be granted safely.The procedure for granting a request
( or pretending to for testing purposes ) is:
Available = Available - Request
Allocation = Allocation + Request
Need = Need - Request
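The sketch below puts the safety test and the three-step request check just described into C. The matrices are small illustrative values chosen so the code runs end to end; they are assumptions, not data from these notes.

/* Sketch of the Banker's algorithm: safety test plus the request-granting steps. */
#include <stdio.h>
#include <string.h>

#define N 3   /* processes           */
#define M 2   /* resource categories */

static int available[M]     = {3, 2};
static int max_claim[N][M]  = {{7, 4}, {3, 2}, {4, 3}};
static int allocation[N][M] = {{0, 1}, {2, 0}, {2, 1}};
static int need[N][M];      /* need = max_claim - allocation */

static void compute_need(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            need[i][j] = max_claim[i][j] - allocation[i][j];
}

/* Return 1 if the current state is safe ( a safe sequence exists ). */
static int state_is_safe(void)
{
    int work[M], finish[N] = {0};
    memcpy(work, available, sizeof work);

    for (int done = 0; done < N; ) {
        int progressed = 0;
        for (int i = 0; i < N; i++) {
            if (finish[i]) continue;
            int ok = 1;
            for (int j = 0; j < M; j++)
                if (need[i][j] > work[j]) { ok = 0; break; }
            if (ok) {                             /* pretend Pi runs to completion */
                for (int j = 0; j < M; j++)
                    work[j] += allocation[i][j];
                finish[i] = 1; done++; progressed = 1;
            }
        }
        if (!progressed) return 0;                /* no process can finish: unsafe */
    }
    return 1;
}

/* Grant request[] from process p only if the resulting state remains safe. */
static int request_resources(int p, const int request[M])
{
    for (int j = 0; j < M; j++)
        if (request[j] > need[p][j] || request[j] > available[j])
            return 0;                             /* error, or must wait           */

    for (int j = 0; j < M; j++) {                 /* tentatively grant the request */
        available[j]     -= request[j];
        allocation[p][j] += request[j];
        need[p][j]       -= request[j];
    }
    if (state_is_safe()) return 1;

    for (int j = 0; j < M; j++) {                 /* unsafe: roll the grant back   */
        available[j]     += request[j];
        allocation[p][j] -= request[j];
        need[p][j]       += request[j];
    }
    return 0;
}

int main(void)
{
    compute_need();
    int req[M] = {1, 0};
    printf("initial state safe: %s\n", state_is_safe() ? "yes" : "no");
    printf("grant (1,0) to P1:  %s\n", request_resources(1, req) ? "yes" : "no");
    return 0;
}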
5.5.3.3 An Illustrative Example
If deadlocks are not avoided, then another approach is to detect when they have occurred
and recover somehow.
In addition to the performance hit of constantly checking for deadlocks, a policy /
algorithm must be in place for recovering from deadlocks, and there is potential for lost
work when processes must be aborted or have their resources preempted.
5.6.1 Single Instance of Each Resource Type
If each resource category has a single instance, then we can use a variation of the
resource-allocation graph known as a wait-for graph.
A wait-for graph can be constructed from a resource-allocation graph by
eliminating the resources and collapsing the associated edges, as shown in the
figure below.
An arc from Pi to Pj in a wait-for graph indicates that process Pi is waiting for a
resource that process Pj is currently holding.
Figure 5.9 - (a) Resource allocation graph. (b) Corresponding wait-for graph
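For the single-instance case just described, detection reduces to finding a cycle in the wait-for graph. Below is a minimal depth-first-search sketch; the example graph is made up for illustration and is not the one in Figure 5.9.

/* Sketch: deadlock detection by cycle search in a wait-for graph. */
#include <stdio.h>

#define NPROC 4

/* wait_for[i][j] = 1 means Pi is waiting for a resource held by Pj. */
static int wait_for[NPROC][NPROC] = {
    /* P0 */ {0, 1, 0, 0},
    /* P1 */ {0, 0, 1, 0},
    /* P2 */ {1, 0, 0, 0},    /* P0 -> P1 -> P2 -> P0 : a cycle, hence a deadlock */
    /* P3 */ {0, 0, 0, 0},
};

static int visiting[NPROC], done[NPROC];

static int has_cycle_from(int p)                 /* depth-first search */
{
    visiting[p] = 1;
    for (int q = 0; q < NPROC; q++) {
        if (!wait_for[p][q]) continue;
        if (visiting[q]) return 1;               /* back edge: cycle found */
        if (!done[q] && has_cycle_from(q)) return 1;
    }
    visiting[p] = 0;
    done[p] = 1;
    return 0;
}

int main(void)
{
    for (int p = 0; p < NPROC; p++)
        if (!done[p] && has_cycle_from(p)) {
            printf("deadlock detected ( cycle reachable from P%d )\n", p);
            return 0;
        }
    printf("no deadlock\n");
    return 0;
}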
The detection algorithm outlined here is essentially the same as the Banker's
algorithm, with two subtle differences:
o In step 1, the Banker's Algorithm sets Finish[ i ] to false for all i. The
algorithm presented here sets Finish[ i ] to false only if Allocation[ i ] is
not zero. If the currently allocated resources for this process are zero, the
algorithm sets Finish[ i ] to true. This is essentially assuming that IF all of
the other processes can finish, then this process can finish also.
Furthermore, this algorithm is specifically looking for which processes are
involved in a deadlock situation, and a process that does not have any
resources allocated cannot be involved in a deadlock, and so can be
removed from any further consideration.
o Steps 2 and 3 are unchanged
o In step 4, the basic Banker's Algorithm says that if Finish[ i ] == true for
all i, that there is no deadlock. This algorithm is more specific, by stating
that if Finish[ i ] == false for any process Pi, then that process is
specifically involved in the deadlock which has been detected.
( Note: An alternative method was presented above, in which Finish held integers
instead of booleans. This vector would be initialized to all zeros, and then filled
with increasing integers as processes are detected which can finish. If any
processes are left at zero when the algorithm completes, then there is a deadlock,
and if not, then the integers in finish describe a safe sequence. To modify this
algorithm to match this section of the text, processes with allocation = zero could
be filled in with N, N - 1, N - 2, etc. in step 1, and any processes left with Finish =
0 in step 4 are the deadlocked processes. )
Consider, for example, the following state, and determine if it is currently
deadlocked:
Now suppose that process P2 makes a request for an additional instance of type C,
yielding the state shown below. Is the system now deadlocked?
5.6.3 Detection-Algorithm Usage
When a deadlock has been detected and processes must be terminated ( or have their
resources preempted ) to recover, the choice of which process(es) to target can be based
on factors such as:
1. Process priorities.
2. How long the process has been running, and how close it is to finishing.
3. How many and what type of resources is the process holding. ( Are they
easy to preempt and restore? )
4. How many more resources does the process need to complete.
5. How many processes will need to be terminated
6. Whether the process is interactive or batch.
7. Whether or not the process has made non-restorable changes to any
resource.
When preempting resources to relieve deadlock, there are three important issues
to be addressed:
1. Selecting a victim - Deciding which resources to preempt from which
processes involves many of the same decision criteria outlined above.
2. Rollback - Ideally one would like to roll back a preempted process to a
safe state prior to the point at which that resource was originally allocated
to the process. Unfortunately it can be difficult or impossible to determine
what such a safe state is, and so the only safe rollback is to roll back all the
way back to the beginning. ( I.e. abort the process and make it start over. )
3. Starvation - How do you guarantee that a process won't starve because its
resources are constantly being preempted? One option would be to use a
priority system, and increase the priority of a process every time its
resources get preempted. Eventually it should get a high enough priority
that it won't get preempted any more.
PROTECTION
The goals of protection include the following:
Obviously to prevent malicious misuse of the system by users or programs. See chapter
15 for a more thorough coverage of this goal.
To ensure that each shared resource is used only in accordance with system policies,
which may be set either by system designers or by system administrators.
To ensure that errant programs cause the minimal amount of damage possible.
Note that protection systems only provide the mechanisms for enforcing policies and
ensuring reliable systems. It is up to administrators and users to implement those
mechanisms effectively.
The principle of least privilege dictates that programs, users, and systems be given just
enough privileges to perform their tasks.
This ensures that failures do the least amount of harm, and limits the harm that can be
done deliberately.
For example, if a program needs special privileges to perform a task, it is better to make
it a SGID program with group ownership of "network" or "backup" or some other pseudo
group, rather than SUID with root ownership. This limits the amount of damage that can
occur if something goes wrong.
Typically each user is given their own account, and has only enough privilege to modify
their own files.
The root account should not be used for normal day to day activities - The System
Administrator should also have an ordinary account, and reserve use of the root account
for only those tasks which need the root privileges
In the MULTICS ring-protection scheme, rings are numbered from 0 to 7, with outer rings
having a subset of the privileges of the inner rings.
Each file is a memory segment, and each segment description includes an entry
that indicates the ring number associated with that segment, as well as read, write,
and execute privileges.
Each process runs in a ring, according to the current-ring-number, a counter
associated with each process.
A process operating in one ring can only access segments associated with higher
( farther out ) rings, and then only according to the access bits. Processes cannot
access segments associated with lower rings.
Domain switching is achieved by a process in one ring calling upon a process
operating in a lower ring, which is controlled by several factors stored with each
segment descriptor:
o An access bracket, defined by integers b1 <= b2.
o A limit b3 > b2
o A list of gates, identifying the entry points at which the segments may be
called.
If a process operating in ring i calls a segment whose bracket is such that b1 <= i
<= b2, then the call succeeds and the process remains in ring i.
Otherwise a trap to the OS occurs, and is handled as follows:
o If i < b1, then the call is allowed, because we are transferring to a
procedure with fewer privileges. However if any of the parameters being
passed are of segments below b1, then they must be copied to an area
accessible by the called procedure.
o If i > b2, then the call is allowed only if i <= b3 and the call is directed to
one of the entries on the list of gates.
Overall this approach is more complex and less efficient than other protection
schemes.
The model of protection that we have been discussing can be viewed as an access matrix,
in which columns represent different system resources and rows represent different
protection domains. Entries within the matrix indicate what access that domain has to that
resource.
Domain switching can be easily supported under this model, simply by providing
"switch" access to other domains:
The ability to copy rights is denoted by an asterisk, indicating that processes in that
domain have the right to copy that access within the same column, i.e. for the same
object. There are two important variations:
o If the asterisk is removed from the original access right, then the right is
transferred, rather than being copied. This may be termed a transfer right as
opposed to a copy right.
o If only the right and not the asterisk is copied, then the access right is added to the
new domain, but it may not be propagated further. That is the new domain does
not also receive the right to copy the access. This may be termed a limited copy
right, as shown in Figure 5.14 below:
Figure 5.14 - Access matrix with copy rights.
The owner right adds the privilege of adding new rights or removing existing ones:
Copy and owner rights only allow the modification of rights within a column. The
addition of control rights, which apply only to domain objects, allows a process operating
in one domain to affect the rights available in other domains. For example in the table
below, a process operating in domain D2 has the right to control any of the rights in
domain D4.
The simplest approach is one big global table with < domain, object, rights >
entries.
Unfortunately this table is very large ( even if sparse ) and so cannot be kept in
memory ( without invoking virtual memory techniques. )
There is also no good way to specify groupings - If everyone has access to some
resource, then it still needs a separate entry for every domain.
Each column of the table can be kept as a list of the access rights for that
particular object, discarding blank entries.
For efficiency a separate list of default access rights can also be kept, and checked
first.
In a similar fashion, each row of the table can be kept as a list of the capabilities
of that domain.
Capability lists are associated with each domain, but not directly accessible by the
domain or any user process.
Capability lists are themselves protected resources, distinguished from other data
in one of two ways:
o A tag, possibly hardware implemented, distinguishing this special type of
data. ( other types may be floats, pointers, booleans, etc. )
o The address space for a program may be split into multiple segments, at
least one of which is inaccessible by the program itself, and used by the
operating system for maintaining the process's access right capability list.
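To make the access-list representation above concrete, here is a minimal sketch of storing one column of the access matrix as a per-object access list with a default-rights check first. The domains, object, and rights are illustrative values only, not a real system's API.

/* Sketch: access matrix stored as per-object access lists. */
#include <stdio.h>

#define R_READ  1
#define R_WRITE 2
#define R_EXEC  4

struct ace { int domain; int rights; };                  /* one access-list entry */

struct object {
    const char *name;
    struct ace  acl[4];                                  /* the object's access list      */
    int         nacl;
    int         default_rights;                          /* checked first, for all domains */
};

static int allowed(const struct object *o, int domain, int right)
{
    if (o->default_rights & right) return 1;             /* default rights first          */
    for (int i = 0; i < o->nacl; i++)                    /* then the object's access list */
        if (o->acl[i].domain == domain && (o->acl[i].rights & right))
            return 1;
    return 0;
}

int main(void)
{
    struct object f1 = { "F1", { {1, R_READ | R_WRITE}, {2, R_READ} }, 2, 0 };

    printf("D1 write F1: %s\n", allowed(&f1, 1, R_WRITE) ? "granted" : "denied");
    printf("D2 write F1: %s\n", allowed(&f1, 2, R_WRITE) ? "granted" : "denied");
    return 0;
}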
5.12.4 A Lock-Key Mechanism
5.12.5 Comparison
o As systems have developed, protection systems have become more powerful, and
also more specific and specialized.
o To refine protection even further requires putting protection capabilities into the
hands of individual programmers, so that protection policies can be implemented
on the application level, i.e. to protect resources in ways that are known to the
specific applications but not to the more general operating system.
o Security. Security provided by the kernel offers better protection than that
provided by a compiler. The security of the compiler-based enforcement is
dependent upon the integrity of the compiler itself, as well as requiring
that files not be modified after they are compiled. The kernel is in a better
position to protect itself from modification, as well as protecting access to
specific files. Where hardware support of individual memory accesses is
available, the protection is stronger still.
o Flexibility. A kernel-based protection system is not flexible enough to provide
the specific protection needed by an individual programmer, though it may
provide support which the programmer can make use of. Compilers are
more easily changed and updated when necessary to change the protection
services offered or their implementation.
o Efficiency. The most efficient protection mechanism is one supported by
hardware and microcode. Insofar as software based protection is
concerned, compiler-based systems have the advantage that many checks
can be made off-line, at compile time, rather than during execution.
The concept of incorporating protection mechanisms into programming
languages is in its infancy, and still remains to be fully developed. However the
general goal is to provide mechanisms for three functions:
Android, Inc. was founded in Palo Alto, California in October 2003 by Andy Rubin (co-founder
of Danger), Rich Miner (co-founder of Wildfire Communications, Inc.), Nick Sears (once VP
at T-Mobile), and Chris White (headed design and interface development at WebTV) to develop,
in Rubin's words, "smarter mobile devices that are more aware of its owner's location and
preferences". The early intentions of the company were to develop an advanced operating
system for digital cameras, but when it was realized that the market for those devices was
not large enough, the company diverted its efforts to producing a smartphone operating
system to rival those of Symbian and Windows Mobile. Despite the past accomplishments
of the founders and early
employees, Android Inc. operated secretly, revealing only that it was working on software for
mobile phones.
Google acquired Android Inc. on August 17, 2005; key employees of Android Inc., including
Rubin, Miner, and White, stayed at the company after the acquisition. Not much was known
about Android Inc. at the time, but many assumed that Google was planning to enter the mobile
phone market with this move. At Google, the team led by Rubin developed a mobile device
platform powered by the Linux kernel. Google marketed the platform to handset makers
and carriers on the promise of providing a flexible, upgradable system. Google had lined up a
series of hardware component and software partners and signaled to carriers that it was open to
various degrees of cooperation on their part.
LINUX
Linux was originally developed as a free operating system for Intel x86-based personal
computers. It has since been ported to more computer hardware platforms than any other
operating system. It is a leading operating system on servers and other big iron systems such
as mainframe computers and supercomputers. As of June 2013, more than 95% of the
world's 500 fastest supercomputers run some variant of Linux, including all of the 44 fastest. Linux
also runs on embedded systems, which are devices whose operating system is typically built into
the firmware and is highly tailored to the system; this includes mobile phones, tablet computers,
network routers, facility automation controls, televisions and video game consoles. Android,
which is a widely used operating system for mobile devices, is built on top of the Linux kernel.
UNIT-II
1. a) What is a Process? Explain the process state transition diagram.
b) Describe the typical elements of Process Control Block (PCB).
2. a) Define Thread. What are the advantages and uses of threads?
b) What is the difference between a thread and a process?
3. a) What is Process Scheduling? Explain different types of process scheduler.
b) Explain context switching.
4. a) Discuss about various criteria used for short term scheduling.
b) Explain i) Process preemption ii) Dispatcher.
5. Explain the following CPU scheduling algorithms with an example
a) FCFS b) SJF c) Priority Based d) Round Robin Scheduling.
6. a) Explain i) Multilevel Queue Scheduling ii) Multilevel Feedback Queue Scheduling.
b) Define the following: i) Starvation ii) Aging.
7. Explain a) Multiprocessor Scheduling b) Real-Time Scheduling c) Thread Scheduling.
8. Explain Process scheduling algorithms in Linux and Windows.
9. Explain the following in detail:
a) Race condition b) Process synchronization c) Critical Section.
10. Explain the requirements to be satisfied by a solution to Critical Section problem.
11. Write the Peterson’s solution for the critical section problem and explain?
12. Explain the solution to critical section problem using locks and hardware instructions.
13. What is semaphore? Explain semaphore mechanism with an example.
14. Explain the solution using semaphore for the following problems
a) Producer-Consumer b) Readers-writers c) Dining Philosophers
15. a) What are Monitors? Give the monitors solution to Dining Philosophers problem.
b) Explain Process synchronization in Linux and Windows.
UNIT-III
1. Explain the following:
a) Logical & Physical Address space b) Swapping in memory management.
2. Explain the following
a) Internal fragmentation b) External fragmentation c) First Fit d) Best Fit e) Worst fit.
3. Explain in detail the paging technique of memory management. Give an example.
4. Explain about segmentation in memory management. Give an example.
5. Explain the following
a) Translation Look-aside Buffer (TLB) b) Hierarchical paging
c) Hashed page table d) Inverted Page table.
6. Define the following
a) Virtual Memory b) Demand Paging. c) Page Fault d) Thrashing.
7. Consider the following page reference string
1,2,3,4,2,1,5,6,2,1,2,3,7,6,3,2,1,2,3,6
How many page faults would occur for the following page replacement algorithms, assuming
three or five frames? Initially all frames are empty, so the first unique pages will also cost one
page fault.
LRU, FIFO, Optimal and Second chance Replacement algorithms.
8. Explain the following:
a) Be lady’s Anomaly b) Allocation of Frames.
UNIT-IV
1. a) What is a file? Discuss its attributes.
b) Explain the various types of operations that can be performed on files.
2. Explain the access methods a) Sequential b) Direct c) Indexed d) Indexed sequential
3. a) What are the various access rights that can be assigned to a particular user for a particular
file?
b) Explain the various types of files.
4. a) Explain file directory.
b) Explain the following directory structures i) Single Level ii) Two-Level iii)Tree Structured
iv) Acyclic graph v) General graph.
5. a) What do you mean by file system mounting? Explain.
b) Explain in detail File Sharing.
6. Explain in detail file protection mechanism.
7. a) Define file system . What are its design problems? Explain the layered file system
architecture.
b) Give an overview of file system implementation.
8. Explain a) Partitions and Mounting b) Virtual File System c) Linear list and hash table
directory implementation.
9. Explain in detail various disk allocation methods.
10. Explain various techniques implemented for free space management. Discuss with suitable
examples.
11. Explain in detail the following a) Disk Structure b) Disk Attachment c) Magnetic disks
d) Magnetic Tapes.
12. Explain the different disk scheduling algorithms with an example.
UNIT-V
1. a) Explain the necessary and sufficient conditions for a deadlock.
b) Explain the use of Resource Allocation Graph in dealing with deadlocks.
2. Explain in detail deadlock prevention.
3. Explain deadlock avoidance using Resource Allocation Graph with an example.
4. Explain the Banker’s algorithm for avoiding deadlocks.
5. Write an algorithm for detecting the presence of deadlock.
6. What are the approaches that can be used for recovering from deadlock?
7. Consider a system with 5 processes P0, P1, P2, P3, and P4 with 3 resources A,B, and C.
Resource A has 10, B has 5 and C has 7 instances. The maximum and allocation matrices are as
follows:
          MAX              ALLOCATION
        A  B  C            A  B  C
P0      7  5  3            0  1  0
P1      3  2  2            2  0  0
P2      9  0  2            3  0  2
P3      2  2  2            2  1  1
P4      4  3  3            0  0  2
18. Assignment Questions
Assignment-I OS Basics
1. Explain the following operating systems in detail.
i) Simple Batch Systems ii) Multiprogramming Systems
iii) Time shared systems iv) Parallel systems v) Real Time systems
2. a) Explain OS Structure.
b) Explain the dual mode of operation.
c) Write short notes on Special-purpose systems.
The processes are assumed to have arrived in the order P1, P2, P3, P4, P5, all at time 0.
a. Draw four Gantt charts that illustrate the execution of these processes using the
following scheduling algorithms: FCFS, SJF, non preemptive priority and RR
(quantum=1).
b. What is the turnaround time and waiting time of each process for each of these
scheduling algorithms?
c. Which of the algorithms results in the minimum average waiting time?
3. Consider the following set of processes, with the length of the CPU burst given in
milliseconds
Process Arrival Time Burst Time Priority
P1 1 7 3
P2 2 8 2
P3 3 6 5
P4 4 9 4
P5 5 4 1
Give the Gantt chart illustrating the execution of these processes using Shortest
Remaining Time First (SRTF) and preemptive priority scheduling. Find the average
waiting time and average turnaround time for each of these algorithms.
5. Explain in detail how semaphores and monitors are used to solve the following problems:
i) Producer-Consumer problem ii) Readers-Writers problem iii) Dining-Philosophers
problem
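As a concrete illustration of the first of these, the sketch below shows the classic bounded-buffer Producer-Consumer structure with two counting semaphores and a mutex (Python threading; the buffer size and item count are assumed example values, and this is one possible layout rather than the only correct one).

# Bounded-buffer Producer-Consumer with semaphores (illustrative sketch;
# buffer size and item count are assumed example values).
import threading
from collections import deque

N = 5                                   # buffer capacity
buffer = deque()
empty  = threading.Semaphore(N)         # counts free slots
full   = threading.Semaphore(0)         # counts filled slots
mutex  = threading.Lock()               # protects the buffer itself

def producer():
    for item in range(10):
        empty.acquire()                 # wait for a free slot
        with mutex:
            buffer.append(item)
        full.release()                  # signal one more filled slot

def consumer():
    for _ in range(10):
        full.acquire()                  # wait for an available item
        with mutex:
            item = buffer.popleft()
        empty.release()                 # signal one more free slot
        print("consumed", item)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads: t.start()
for t in threads: t.join()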
2. a. Explain the directory implementation through i) Linear List ii) Hash Table.
b. Compare various file allocation techniques.
c. Explain the following Free Space Management Techniques:
i) Bit Map ii) Linked List iii) Grouping
b. Suppose that a disk drive has 4,000 cylinders, numbered 0 to 3999. The drive is
currently serving a request at cylinder 143, and the previous request was at cylinder 120.
The queue of pending requests, in FIFO order is:
87, 2465, 918, 1784, 998, 509, 122, 750, 130
Starting from the current head position, what is the total distance (in cylinders) that
the disk arm moves to satisfy all the pending requests for each of the following
Disk Scheduling algorithms?
Assignment-V
1. a) Define Deadlock. Give an example.
b) Explain the necessary and sufficient conditions for deadlock.
c) Explain the use of Resource Allocation Graph in dealing with deadlocks. Give examples.
5. a) What is an access matrix? How can it be used for protection? Explain some implementation schemes.
b) What is the need for revocation of access rights? Discuss the various ways of implementing it.
UNIT-I
1. What is an operating system?
a) collection of programs that manages hardware resources
b) system service provider to the application programs
c) link to interface the hardware and application programs
d) all of the mentioned
2. To access the services of the operating system, the interface is provided by the
a) system calls b) API c) library d) assembly instructions
4. Which one of the following errors will be handled by the operating system?
a) power failure b) lack of paper in printer c) connection failure in the network d) all of the mentioned
5. The main function of the command interpreter is
a) to get and execute the next user-specified command
b) to provide the interface between the API and application program
c) to handle the files in operating system d) none of the mentioned
6. In an operating system, resource management can be done via
a) time division multiplexing b) space division multiplexing
c) both (a) and (b) d) none of the mentioned
8. The OS X has
a) monolithic kernel b) hybrid kernel
c) microkernel d) monolithic kernel with modules
9. The systems which allow only one process execution at a time are called
a) uniprogramming systems b) uniprocessing systems
c) unitasking systems d) none of the mentioned
10. In operating system, each process has its own
a) address space and global variables b) open files
c) pending alarms, signals and signal handlers d) all of the mentioned
11. A process can be terminated due to
a) normal exit b) fatal error
c) killed by another process d) all of the mentioned
12. What is the ready state of a process?
a) when process is scheduled to run after some execution
b) when process is unable to run until some task has been completed
c) when process is using the CPU d) none of the mentioned
13. What is interprocess communication?
a) communication within the process b) communication between two process
c) communication between two threads of same process d) none of the mentioned
14. A set of processes is deadlocked if
a) each process is blocked and will remain so forever b) each process is terminated
c) all processes are trying to kill each other d) none of the mentioned
15. A process stack does not contain
a) function parameters b) local variables c) return addresses d) PID of child process
16. Which system call returns the process identifier of a terminated child?
a) wait b) exit c) fork d) get
17. The address of the next instruction to be executed by the current process is provided by the
a) CPU registers b) program counter
c) process stack d) pipe
18. A Process Control Block (PCB) does not contain which of the following:
a) Code b) Stack c) Heap d) Data e) Program Counter f) Process State
g) I/O status information h) bootstrap program
19. The number of processes completed per unit time is known as __________.
a) Output b) Throughput c) Efficiency d) Capacity
20) The state of a process is defined by :
a) the final activity of the process b) the activity just executed by the process
c) the activity to next be executed by the process d) the current activity of the process
UNIT-II
35) The only state transition that is initiated by the user process itself is :
a) block b) wakeup c) dispatch d) None of these
36) In a time-sharing operating system, when the time slot given to a process is completed, the
process goes from the running state to the :
a) Blocked state b) Ready state c) Suspended state d) Terminated state
37) In a multi-programming environment :
a) the processor executes more than one process at a time
b) the programs are developed by more than one person
c) more than one process resides in the memory
d) a single user can execute many programs at the same time
38) Suppose that a process is in “Blocked” state waiting for some I/O service. When the service
is completed, it goes to the :
a) Running state b) Ready state c) Suspended state d) Terminated state
39) The context of a process in the PCB of a process does not contain :
a) the value of the CPU registers b) the process state
c) memory-management information d) context switch time
40) Which of the following need not necessarily be saved on a context switch between processes
? (GATE CS 2000)
a) General purpose registers b) Translation look-aside buffer
c) Program counter d) All of these
UNIT-III
41) The address generated by the CPU is referred to as :
a) physical address b) logical address c) Neither a nor b
42) The address loaded into the memory address register of the memory is referred to as :
a) physical address b) logical address c) Neither a nor b
43) The run time mapping from virtual to physical addresses is done by a hardware device called
the :
a) Virtual to physical mapper b) memory management unit
c) memory mapping unit d) None of these
44) The base register is also known as the :
a) basic register b) regular register c) relocation register d) delocation register
45) The size of a process is limited to the size of :
a) physical memory b) external storage c) secondary storage d) None of these
46) If execution time binding is being used, then a process ______ be swapped to a different
memory space.
a) has to be b) can never c) must d) may
47) The ________ consists of all processes whose memory images are in the backing store or in
memory and are ready to run.
a) wait queue b) ready queue c) CPU d) secondary storage
48) The _________ time in a swap out of a running process and swap in of a new process into
the memory is very high.
a) context – switch b) waiting c) execution d) All of these
49) The major part of swap time is _______ time.
a) waiting b) transfer c) execution d) None of these
50) Swapping _______ be done when a process has pending I/O, or has to execute I/O operations
only into operating system buffers.
a) must b) can c) must never d) maybe
51) Swap space is allocated :
a) as a chunk of disk b) separate from a file system c) into a file system d) All of these
52. The address of a page table in memory is pointed by
a) stack pointer b) page table base register c) page register d) program counter
ii) How many page faults does the FIFO page replacement algorithm produce ?
a) 10 b) 15 c) 11 d) 12
60) Applying the LRU page replacement to the following reference string :
1, 2, 4, 5, 2, 1, 2, 4
The main memory can accommodate 3 pages and it already has pages 1 and 2. Page 1 came in
before page 2.
How many page faults will occur ?
a) 2 b) 3 c) 4 d) 5
iii) For FIFO page replacement algorithms with 3 frames, the number of page faults is :
a) 16 b) 15 c) 14 d) 11
v) For Optimal page replacement algorithms with 3 frames, the number of page faults is :
a) 16 b) 15 c) 14 d) 11
vi) For Optimal page replacement algorithms with 5 frames, the number of page faults is :
a) 6 b) 7 c) 10 d) 9
UNIT-IV
64. ______ is a unique tag, usually a number, that identifies the file within the file system.
a) File identifier b) File name c) File type d) none of the mentioned
67. Which file is a sequence of bytes organized into blocks understandable by the system’s
linker?
a) object file b) source file c) executable file d) text file
70. Mapping of network file system protocol to local file system is done by
a) network file system b) local file system c) volume manager d) remote mirror.
71. Which one of the following explains the sequential file access method?
a) random access according to the given byte number b) read bytes one at a time, in order
c) read/write sequentially by record d) read/write randomly by record
73. In which type of allocation method does each file occupy a set of contiguous blocks on the disk?
a) contiguous allocation b) dynamic-storage allocation
c) linked allocation d) indexed allocation
75. Which protocol establishes the initial logical connection between a server and a client?
a) transmission control protocol b) user datagram protocol
c) mount protocol d) datagram congestion control protocol
78) The operating system keeps a small table containing information about all open files called :
a) system table b) open-file table c) file table d) directory table
81) The direct access method is based on a ______ model of a file, as _____ allow random
access to any file block.
a) magnetic tape, magnetic tapes b) tape, tapes c) disk, disks d) All of these
94) The heads of the magnetic disk are attached to a _____ that moves all the heads as a unit.
a) spindle b) disk arm c) track d) None of these
95) The set of tracks that are at one arm position make up a ___________.
a) magnetic disks b) electrical disks c) assemblies d) cylinders
96) The time taken to move the disk arm to the desired cylinder is called the :
a) positioning time b) random access time c) seek time d) rotational latency
97) The time taken for the desired sector to rotate to the disk head is called :
a) positioning time b) random access time c) seek time d) rotational latency
98) When the head damages the magnetic surface, it is known as _________.
a) disk crash b) head crash c) magnetic damage d) All of these
99) Consider a disk queue with requests for I/O to blocks on cylinders :
98, 183, 37, 122, 14, 124, 65, 67
i) Considering FCFS (first come, first served) scheduling, the total number of head movements is,
if the disk head is initially at 53 :
a) 600 b) 620 c) 630 d) 640
ii) Considering SSTF (shortest seek time first) scheduling, the total number of head movements
is, if the disk head is initially at 53 :
a) 224 b) 236 c) 245 d) 240
UNIT-V
100) In the domain structure, what is an access right equal to?
a) Access-right = object-name, rights-set b) Access-right = read-name, write-set
c) Access-right = read-name, execute-set d) Access-right = object-name, execute-set
104) What are the three additional operations to change the contents of the access-matrix ?
a) Copy b) Owner c) Deny d) control
105) Who can add new rights and remove some rights ?
a) Copy b) transfer c) limited copy d) owner
107) Which two rights allow a process to change the entries in a column ?
a) copy and transfer b) copy and owner c) owner and transfer d) deny and copy
124. Which one of the following is a visual (mathematical) way to determine the deadlock
occurrence?
a) resource allocation graph b) starvation graph c) inversion graph d) none
136) The disadvantage of a process being allocated all its resources before beginning its
execution is :
a) Low CPU utilization b) Low resource utilization
c) Very high resource utilization d) None of these
137) To ensure no preemption, if a process is holding some resources and requests another
resource that cannot be immediately allocated to it :
a) then the process waits for the resources to be allocated to it
b) the process keeps sending requests until the resource is allocated to it
c) the process resumes execution without the resource being allocated to it
d) then all resources currently being held are preempted
138) One way to ensure that the circular wait condition never holds is to :
a) impose a total ordering of all resource types and to determine whether one precedes another in
the ordering
b) to never let a process acquire resources that are held by other processes
c) to let a process wait for only one resource at a time
d) All of these
140) A deadlock avoidance algorithm dynamically examines the __________, to ensure that a
circular wait condition can never exist.
a) resource allocation state b) system storage state c) operating system d) resources
146) The resource allocation graph is not applicable to a resource allocation system :
a) with multiple instances of each resource type
b) with a single instance of each resource type c) Both a and b
147) The Banker’s algorithm is _____________ than the resource allocation graph algorithm.
a) less efficient b) more efficient c) None of these
148) The data structures available in the Banker’s algorithm are : (choose all that apply)
a) Available b) Need c) Allocation d) Maximum e) Minimum f) All of these
150) A system with 5 processes P0 through P4 and three resource types A, B, C has A with 10
instances, B with 5 instances, and C with 7 instances. At time t0, the following snapshot has been
taken :
Process    Allocation      Max          Available
           A  B  C         A  B  C      A  B  C
P0         0  1  0         7  5  3      3  3  2
P1         2  0  0         3  2  2
P2         3  0  2         9  0  2
P3         2  1  1         2  2  2
P4         0  0  2         4  3  3
The sequence leads the system to :
a) an unsafe state b) a safe state c) a protected state d) a deadlock
152) The wait-for graph is a deadlock detection algorithm that is applicable when :
a) all resources have a single instance b) all resources have multiple instances
c) both a and b
156) If deadlocks occur frequently, the detection algorithm must be invoked ________.
a) rarely b) frequently c) None of these
157) The disadvantage of invoking the detection algorithm for every request is :
a) overhead of the detection algorithm due to consumption of memory
b) excessive time consumed in the request to be allocated memory
c) considerable overhead in computation time d) All of these.
158) A deadlock eventually cripples system throughput and will cause the CPU utilization to
______.
a) increase b) drop c) stay still d) None of these
159) Every time a request for allocation cannot be granted immediately, the detection algorithm
is invoked. This will help identify : (choose all that apply)
a) the set of processes that have been deadlocked
b) the set of processes in the deadlock queue
c) the specific process that caused the deadlock
d) All of these
160) A computer system has 6 tape drives, with ‘n’ processes competing for them. Each process
may need 3 tape drives. The maximum value of ‘n’ for which the system is guaranteed to be
deadlock free is :
a) 2 b) 3 c) 4 d) 1
161) A system has 3 processes sharing 4 resources. If each process needs a maximum of 2 units
then, deadlock :
a) can never occur b) may occur c) has to occur d) None of these
162) ‘m’ processes share ‘n’ resources of the same type. The maximum need of each process
doesn’t exceed ‘n’ and the sum of all their maximum needs is always less than m+n. In this
setup, deadlock :
a) can never occur b) may occur c) has to occur d) None of these
164) The two ways of aborting processes and eliminating deadlocks are : (choose all that apply)
a) Abort all deadlocked processes b) Abort all processes
c) Abort one process at a time until the deadlock cycle is eliminated
d) All of these
165) On the occurrence of a deadlock, the processes that should be aborted are those whose termination :
a) is more time consuming b) incurs minimum cost
c) safety is not hampered d) All of these
166) The process to be aborted is chosen on the basis of the following factors : (choose all that
apply)
a) priority of the process b) process is interactive or batch
c) how long the process has computed d) how much longer before its completion
e) how many more resources the process needs before its completion
f) how many and what type of resources the process has used
g) how many resources are available in the system h) All of these
167) Cost factors of process termination include : (choose all that apply)
a) number of resources the deadlock process is holding
b) CPU utilization at the time of deadlock
c) amount of time a deadlocked process has thus far consumed during its execution
d) All of the above
168) If we preempt a resource from a process, the process cannot continue with its normal
execution and it must be :
a) aborted b) rolled back c) terminated d) queued
169) To _______ to a safe state, the system needs to keep more information about the states of
processes.
a) abort the process b) roll back the process c) queue the process d) None of these
170) If the resources are always preempted from the same process, __________ can occur.
a) deadlock b) system crash c) aging d) starvation.
171) The solution to starvation is :
a) the number of rollbacks must be included in the cost factor
b) the number of resources must be included in resource preemption
c) resource preemption be done instead d) All of these
172) What are the two capabilities defined in the CAP system ?
a) data capability b) address capability c) hardware capability d) software capability
20.Tutorial Problems
Tutorial Sheet-I (Unit-I)
2. Explain the following: a) Parallel systems b) Distributed Systems c) Real Time systems
1. Five processes arrive at time 0, in the order given, with the length of the CPU-burst time in
milliseconds, as shown below.
Processes Burst time
P1 8
P2 9
P3 5
P4 6
P5 4
a) Find the average waiting time, considering the following algorithms:
(i) FCFS (ii) SJF (iii) RR (time quantum = 4 milliseconds).
b) Which algorithm gives the minimum average waiting time?
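Since all five processes arrive at time 0, the averages can be checked with a few lines of code. The sketch below (Python, illustrative only) computes the average waiting time for FCFS and SJF directly from the completion order and simulates RR with the 4 ms quantum.

# Average waiting time for FCFS, SJF and RR (q = 4), all arrivals at time 0
# (illustrative sketch for the table above).
bursts = {"P1": 8, "P2": 9, "P3": 5, "P4": 6, "P5": 4}

def avg_wait(order):                     # waiting time = start time when all arrivals are 0
    t, total = 0, 0
    for p in order:
        total += t
        t += bursts[p]
    return total / len(order)

print("FCFS:", avg_wait(["P1", "P2", "P3", "P4", "P5"]))
print("SJF :", avg_wait(sorted(bursts, key=bursts.get)))

def rr_avg_wait(quantum=4):
    remaining, finish, t = dict(bursts), {}, 0
    queue = list(bursts)                 # ready queue, processes in arrival order
    while queue:
        p = queue.pop(0)
        run = min(quantum, remaining[p])
        t += run
        remaining[p] -= run
        if remaining[p]:
            queue.append(p)              # unfinished: back to the end of the queue
        else:
            finish[p] = t
    return sum(finish[p] - bursts[p] for p in bursts) / len(bursts)

print("RR  :", rr_avg_wait())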
2. What is a dispatcher? Explain short-term scheduling and long-term scheduling.
3. Explain First fit, Best fit and Worst fit algorithms with an example.
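A small sketch can make the three placement policies concrete. The Python code below (illustrative only; the hole sizes and request sizes are assumed example values) returns the index of the hole each strategy would pick; it shows only hole selection, not the subsequent splitting of the chosen hole.

# First-fit, best-fit and worst-fit hole selection (illustrative sketch;
# hole list and request sizes are assumed example values).
def first_fit(holes, size):
    return next((i for i, h in enumerate(holes) if h >= size), None)

def best_fit(holes, size):
    fits = [(h, i) for i, h in enumerate(holes) if h >= size]
    return min(fits)[1] if fits else None      # smallest hole that still fits

def worst_fit(holes, size):
    fits = [(h, i) for i, h in enumerate(holes) if h >= size]
    return max(fits)[1] if fits else None      # largest available hole

holes = [100, 500, 200, 300, 600]              # free partition sizes in KB (assumed)
for req in (212, 417, 112, 426):
    print(req, "KB -> first:", first_fit(holes, req),
          " best:", best_fit(holes, req),
          " worst:", worst_fit(holes, req))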
4. Suppose that a disk drive has 5,000 cylinders, numbered 0 to 4999. The drive is currently
serving a request at cylinder 143, and the previous request was at cylinder 125. The queue of
pending requests, in FIFO order is:
86, 1470, 913, 1774, 948, 1509, 1022, 1750, 130
Starting from the current head position, what is the total distance (in cylinders) that the
disk arm moves to satisfy all the pending requests for each of the following Disk Scheduling
algorithms? a) FCFS b) SSTF c) SCAN d) C-SCAN e) LOOK f) C-LOOK
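The total head movement for each policy can be cross-checked with a short script. The sketch below (Python, illustrative only) handles FCFS, SSTF, SCAN and LOOK for the queue above, taking the head to be at cylinder 143 and moving toward higher cylinder numbers (it has just come from 125); C-SCAN and C-LOOK can be obtained by wrapping the sweep instead of reversing it.

# Total head movement for the request queue above (illustrative sketch).
requests = [86, 1470, 913, 1774, 948, 1509, 1022, 1750, 130]
head, max_cyl = 143, 4999                 # head moving toward higher cylinders (125 -> 143)

def distance(order, start=head):
    pos, total = start, 0
    for r in order:
        total += abs(r - pos)
        pos = r
    return total

def sstf(reqs, start=head):
    pending, pos, order = list(reqs), start, []
    while pending:
        nxt = min(pending, key=lambda r: abs(r - pos))   # closest pending request
        order.append(nxt)
        pending.remove(nxt)
        pos = nxt
    return order

up   = sorted(r for r in requests if r >= head)
down = sorted((r for r in requests if r < head), reverse=True)

print("FCFS:", distance(requests))
print("SSTF:", distance(sstf(requests)))
print("SCAN:", distance(up + [max_cyl] + down))   # sweep to the last cylinder, then reverse
print("LOOK:", distance(up + down))               # reverse at the last request instead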
21.Known gaps
As the stated prerequisites are taught in the previous semesters, there are no known gaps.
22.Discussion topics
Unit-1:
1. Distributed Systems
2. Real-time Embedded System
3. Operating Systems Generation (SYSGEN)
4. Handheld Systems
5. Multimedia Systems
6. Computing Environments
a) Traditional Computing
b) Client server Computing
c) Peer-to-peer Computing
d) Web-based Computing
10. Threading issues
Unit-2:
1. Synchronization examples
2. Atomic transactions
3. Case study on UNIX
4. Case study on Linux
5. Case study on Windows
6. Classic problems of synchronization
Unit-3:
1. Page-Replacement Algorithms
2. Memory Management
3. Case study on UNIX
4. Case study on Linux
5. Case study on Windows
6. Paging Techniques
Unit-4:
1. Free-space management
2. Case study on UNIX
3. Case study on Linux
4. Case study on Windows
5. File system
6. Protection in file systems
7. Mass-storage structure
8. Disk scheduling
Unit-5:
1. Deadlock Detection
2. Deadlock Avoidance
3. Recovery from deadlock
4. Protection
5. Security
6. Case study on UNIX
7. Case study on Linux
8. Case study on Windows
9. Implementation of Access Matrix