Os Entire Notes
UNIT-2
UNIT-3
UNIT-4
UNIT-5
File Systems: Files, Directories, File system implementation. Secondary-Storage Structure: Overview of disk structure and attachment, Disk scheduling, RAID structure.
Introduction
An Operating System (OS) is an interface between a computer user and computer hardware. An
operating system is a software which performs all the basic tasks like file management, memory
management, process management, handling input and output, and controlling peripheral devices
such as disk drives and printers.
An operating system is software that enables applications to interact with a computer's hardware.
The software that contains the core components of the operating system is called the kernel.
The primary purposes of an Operating System are to enable applications ( software ) to interact
with a computer's hardware and to manage a system's hardware and software resources.
Some popular Operating Systems include Linux, Windows, VMS, OS/400, AIX, z/OS, etc. Today,
operating systems are found in almost every device, such as mobile phones, personal computers,
mainframe computers, automobiles, TVs, toys, etc.
Definitions
We can have a number of definitions of an Operating System. Let's go through a few of them:
An Operating System is the low-level software that supports a computer's basic functions, such as
scheduling tasks and controlling peripherals.
An operating system is a program that acts as an interface between the user and the computer
hardware and controls the execution of all kinds of programs.
An operating system (OS) is system software that manages computer hardware, software resources,
and provides common services for computer programs.
Functions of an Operating System:
a) Process management
• Allocating and deallocating the resources.
• Allocates resources such that the system doesn’t run out of resources.
• Offering mechanisms for process synchronization.
• Helps in process communication
b) Memory management
• Allocating/deallocating memory to store programs.
• Deciding the amount of memory that should be allocated to the program.
• Memory distribution while multiprocessing.
• Update the status in case memory is freed
• Keeps record of how much memory is used and how much is unused.
c) File Management
• Keeps track of location and status of files.
• Allocating and deallocating resources.
• Decides which resource to be assigned to which file.
• Creating file
• Editing a file
• Updating a file
• Deleting a file
d) Device management
• Allocating and deallocating devices to different processes.
• Keeps records of all the devices attached to the computer.
• Decides which device to be allocated to which process and for how much time.
e) Security & Privacy − By means of password and similar other techniques, it prevents
unauthorized access to programs and data.
f) Control over system performance − Recording delays between request for a service and
response from the system.
g) Job accounting − Keeping track of time and resources used by various jobs and users.
h) Error detecting − Production of dumps, traces, error messages, and other debugging and
error detecting aids.
• Memory management
• Process management
• Job scheduling
• Resource allocation strategies
• Swap space / virtual memory in physical memory
• Interrupt handling
• File system management
• Protection and security
• Inter-process communications
Figure 1.3 - Memory layout for a multiprogramming system
The interrupt-driven nature of modern operating systems requires that erroneous processes not be
able to disturb anything else.
1.4.1 Dual-Mode and Multimode Operation
• The concept of modes can be extended beyond two, requiring more than a single
mode bit
• CPUs that support virtualization use one of these extra bits to indicate when the
virtual machine manager, VMM, is in control of the system. The VMM has more
privileges than ordinary user programs, but not so many as the full kernel.
• System calls are typically implemented in the form of software interrupts, which
cause the hardware to transfer control to an interrupt handler that is part of the
operating system, switching the mode bit to kernel mode in the process. The interrupt
handler determines exactly which interrupt was generated, checks additional parameters
( generally passed through registers ) if appropriate, and then calls the appropriate
kernel service routine to handle the service requested by the system call.
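• As an illustration ( a minimal sketch assuming Linux and glibc; not from the text above ), the write( ) library wrapper and the generic syscall( ) interface both end up trapping into the kernel, which dispatches to the same service routine:
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <string.h>

    int main( void ) {
        const char *msg = "hello via wrapper\n";
        write( 1, msg, strlen( msg ) );               /* normal library wrapper around the system call */

        const char *raw = "hello via syscall()\n";
        syscall( SYS_write, 1, raw, strlen( raw ) );  /* explicit trap, identified by syscall number   */
        return 0;
    }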
• User programs' attempts to execute illegal instructions ( privileged or non-
existent instructions ), or to access forbidden memory areas, also generate software
interrupts, which are trapped by the interrupt handler and control is transferred to the
OS, which issues an appropriate error message, possibly dumps data to a log ( core )
file for later analysis, and then terminates the offending program.
1.4.2 Timer
• Before the kernel begins executing user code, a timer is set to generate an
interrupt.
• The timer interrupt handler reverts control back to the kernel.
• This assures that no user process can take over the system.
• Timer control is a privileged instruction, ( requiring kernel mode. )
1.5 Computing Environments
1.5.1 Traditional Computing
• Any computer or process on the network may provide services to any other which
requests it. There is no clear "leader" or overall organization.
• May employ a central "directory" server for looking up the location of resources,
or may use peer-to-peer searching to find resources.
• E.g. Skype uses a central server to locate a desired peer, and then further
communication is peer to peer.
• ( For more information on the Flourish conference held at UIC on the subject of
Free Libre and Open Source Software , visit http://www.flourishconf.com )
• Open-Source software is published ( sometimes sold ) with the source code, so
that anyone can see and optionally modify the code.
• Open-source SW is often developed and maintained by a small army of loosely
connected often unpaid programmers, each working towards the common good.
• Critics argue that open-source SW can be buggy, but proponents counter that
bugs are found and fixed quickly, since there are so many pairs of eyes inspecting all
the code.
• Open-source operating systems are a good resource for studying OS
development, since students can examine the source code and even change it and re-
compile the changes.
1.12.1 History
• The Linux kernel was developed by Linus Torvalds in Finland in 1991; combined with the
GNU system software it forms a complete open-source UNIX-like operating system.
• Many different distributions of Linux have evolved from Linus's original,
including RedHat, SUSE, Fedora, Debian, Slackware, and Ubuntu, each geared toward
a different group of end-users and operating environments.
• To run Linux on a Windows system using VMware, follow these steps:
1. Download the free "VMware Player" tool
from http://www.vmware.com/download/player and install it on your system
2. Choose a Linux version from among hundreds of virtual machine images
at http://www.vmware.com/appliances
3. Boot the virtual machine within VMware Player.
1.12.3 BSD UNIX
• UNIX was originally developed at AT&T Bell Labs, and the source code was made
available to computer science students at many universities, including the University of
California at Berkeley, UCB.
• UCB students developed UNIX further, and released their product as BSD UNIX
in both binary and source-code format.
• BSD UNIX is not open-source, however, because a license is still needed from
AT&T.
• In spite of various lawsuits, there are now several versions of BSD UNIX,
including FreeBSD, NetBSD, OpenBSD, and DragonFly BSD.
• The source code is located in /usr/src.
• The core of the Mac operating system is Darwin, derived from BSD UNIX, and
is available at http://developer.apple.com/opensource/index.html
1.13.4 Solaris
• Solaris is the UNIX operating system for computers from Sun Microsystems.
• Solaris was originally based on BSD UNIX, and has since migrated to AT&T
System V as its basis.
• Parts of Solaris are now open-source, and some are not because they are still
covered by AT&T copyrights.
• It is possible to change the open-source components of Solaris, re-compile them,
and then link them in with binary libraries of the copyrighted portions of Solaris.
• OpenSolaris is available from http://www.opensolaris.org/os/
• Solaris also allows viewing of the source code online, without having to
download and unpack the entire package.
• User Interfaces - Means by which users can issue commands to the system.
Depending on the system these may be a command-line interface ( e.g. sh, csh,
ksh, tcsh, etc. ), a GUI interface ( e.g. Windows, X-Windows, KDE, Gnome, etc.
), or a batch command system. The latter are generally older systems using
punch cards of job-control language, JCL, but may still be used today for
specialty systems designed for a single purpose.
• Program Execution - The OS must be able to load a program into RAM, run
the program, and terminate the program, either normally or abnormally.
• I/O Operations - The OS is responsible for transferring data to and from I/O
devices, including keyboards, terminals, printers, and storage devices.
• File-System Manipulation - In addition to raw data storage, the OS is also
responsible for maintaining directory and subdirectory structures, mapping file
names to specific blocks of data storage, and providing tools for navigating and
utilizing the file system.
• Communications - Inter-process communications, IPC, either between
processes running on the same processor, or between processes running on
separate processors or separate machines. May be implemented as either shared
memory or message passing, ( or some systems may offer both. )
• Error Detection - Both hardware and software errors must be detected and
handled appropriately, with a minimum of harmful repercussions. Some systems
may include complex error avoidance or recovery systems, including backups,
RAID drives, and other redundant systems. Debugging and diagnostic tools aid
users and administrators in tracing down the cause of problems.
• Resource Allocation - E.g. CPU cycles, main memory, storage space, and
peripheral devices. Some resources are managed with generic systems and others
with very carefully designed and specially tuned systems, customized for a
particular resource and operating environment.
• Accounting - Keeping track of system activity and resource usage, either for
billing purposes or for statistical record keeping that can be used to optimize
future performance.
• Protection and Security - Preventing harm to the system and to resources, either
through wayward internal processes or malicious outsiders. Authentication,
ownership, and restricted access are obvious parts of this system. Highly secure
systems may log all process activity down to excruciating detail, and security
regulations may dictate the storage of those records on permanent non-erasable
media for extended times in secure ( off-site ) facilities.
• The command interpreter ( CI ) gets and processes the next user request, and
launches the requested programs.
• In some systems the CI may be incorporated directly into the kernel.
• More commonly the CI is a separate program that launches once the user
logs in or otherwise accesses the system.
• UNIX, for example, provides the user with a choice of different shells,
which may either be configured to launch automatically at login, or
which may be changed on the fly. ( Each of these shells uses a different
configuration file of initial settings and commands that are executed
upon startup. )
• Different shells provide different functionality, in terms of certain
commands that are implemented directly by the shell without launching
any external programs. Most provide at least a rudimentary command
interpretation structure for use in shell script programming ( loops,
decision constructs, variables, etc. )
• An interesting distinction is the processing of wild card file naming and
I/O re-direction. On UNIX systems those details are handled by the
shell, and the program which is launched sees only a list of filenames
generated by the shell from the wild cards. On a DOS system, the wild
cards are passed along to the programs, which can interpret the wild
cards as the program sees fit.
Figure 2.2 - The Bourne shell command interpreter in Solaris 10
• System calls provide a means for user or application programs to call upon the
services of the operating system.
• Generally written in C or C++, although some are written in assembly for optimal
performance.
• Figure 2.5 illustrates the sequence of system calls required to copy a file:
Figure 2.5 - Example of how system calls are used.
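• Figure 2.5 is not reproduced here, but the same sequence can be sketched with the UNIX system-call API; the file names and buffer size below are illustrative assumptions:
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int main( void ) {
        char buf[ 4096 ];
        ssize_t n;
        int in  = open( "in.txt", O_RDONLY );                             /* acquire the input file   */
        int out = open( "out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644 );  /* create the output file   */
        if( in < 0 || out < 0 ) { perror( "open" ); return 1; }

        while( ( n = read( in, buf, sizeof( buf ) ) ) > 0 )               /* read until end of file   */
            write( out, buf, n );                                          /* write each block back out */

        close( in );                                                       /* release both files        */
        close( out );
        return 0;
    }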
• You can use "strace" to see more examples of the large number of system calls
invoked by a single simple command. Read the man page for strace, and try some
simple examples. ( strace mkdir temp, strace cd temp, strace date > t.t, strace cp
t.t t.2, etc. )
• Most programmers do not use the low-level system calls directly, but instead use
an "Application Programming Interface", API. For example, the read( ) call in the
API on UNIX-based systems has the prototype ssize_t read( int fd, void *buf, size_t count ).
The use of APIs instead of direct system calls provides for greater program portability
between different systems. The API then makes the appropriate system calls through
the system call interface, using a table lookup to access specific numbered system
calls, as shown in Figure 2.6:
Figure 2.6 - The handling of a user application invoking the open( ) system call
• Parameters are generally passed to system calls via registers, or less commonly,
by values pushed onto the stack. Large blocks of data are generally accessed
indirectly, through a memory address passed in a register or on the stack, as
shown in Figure 2.7:
System calls fall into six major categories, as outlined in Figure 2.8 and the following six subsections:
2.4.1 Process Control
• Process control system calls include end, abort, load, execute, create process,
terminate process, get/set process attributes, wait for time or event, signal event,
and allocate and free memory.
• Processes must be created, launched, monitored, paused, resumed, and
eventually stopped.
• When one process pauses or stops, then another must be launched or resumed
• When processes stop abnormally it may be necessary to provide core dumps
and/or other diagnostic or recovery tools.
• Compare DOS ( a single-tasking system ) with UNIX ( a multi-tasking system
).
o When a process is launched in DOS, the command interpreter first unloads
as much of itself as it can to free up memory, then loads the process and
transfers control to it. The interpreter does not resume until the process has
completed, as shown in Figure 2.9:
Figure 2.9 - MS-DOS execution. (a) At system startup. (b) Running a program.
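• On UNIX, by contrast, the shell keeps running while the child executes. A minimal sketch of the fork( ) / exec( ) / wait( ) sequence ( the program run here, /bin/ls, is just an illustrative choice ):
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main( void ) {
        pid_t pid = fork( );                       /* create a child process                 */
        if( pid == 0 ) {                           /* child: overlay itself with a new program */
            execlp( "/bin/ls", "ls", NULL );
            perror( "execlp" );                    /* only reached if the exec fails          */
            exit( 1 );
        } else if( pid > 0 ) {                     /* parent: wait for the child to finish    */
            int status;
            wait( &status );
            printf( "child %d completed\n", (int) pid );
        }
        return 0;
    }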
• File management system calls include create file, delete file, open, close, read,
write, reposition, get file attributes, and set file attributes.
• These operations may also be supported for directories as well as ordinary files.
• ( The actual directory structure may be implemented using ordinary files on the
file system, or through other means. )
• Device management system calls include request device, release device, read,
write, reposition, get/set device attributes, and logically attach or detach
devices.
• Devices may be physical ( e.g. disk drives ), or virtual / abstract ( e.g. files,
partitions, and RAM disks ).
• Some systems represent devices as special files in the file system, so that
accessing the "file" calls upon the appropriate device drivers in the OS. See
for example the /dev directory on any UNIX system.
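• As a small illustration ( /dev/urandom is just an example device file ), the same read( ) call used for ordinary files works on the device special file, and the kernel routes it to the appropriate driver:
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main( void ) {
        unsigned char buf[ 8 ];
        int fd = open( "/dev/urandom", O_RDONLY );   /* the "file" is really a device driver        */
        if( fd < 0 ) { perror( "open" ); return 1; }
        read( fd, buf, sizeof( buf ) );              /* same read( ) call used for ordinary files   */
        close( fd );
        for( int i = 0; i < 8; i++ )
            printf( "%02x ", buf[ i ] );             /* print the random bytes the driver returned  */
        printf( "\n" );
        return 0;
    }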
• Information maintenance system calls include calls to get/set the time, date,
system data, and process, file, or device attributes.
• Systems may also provide the ability to dump memory at any time, to single-step
programs ( pausing execution after each instruction ), and to trace the operation of
programs, all of which can help to debug them.
2.4.5 Communication
2.4.6 Protection
When DOS was originally written its developers had no idea how big and important it
would eventually become. It was written by a few programmers in a relatively short
amount of time, without the benefit of modern software engineering techniques, and
then gradually grew over time to exceed its original expectations. It does not break the
system into subsystems, and has no distinction between user and kernel modes,
allowing all programs direct access to the underlying hardware. ( Note that user versus
kernel mode was not supported by the 8088 chip set anyway, so that really wasn't an
option back then. )
The original UNIX OS used a simple layered approach, but almost all the OS was in
one big layer, not really breaking the OS down into layered subsystems:
Figure 2.12 - Traditional UNIX system structure
• Another approach is to break the OS into a number of smaller layers, each of which
rests on the layer below it, and relies solely on the services provided by the next
lower layer.
• This approach allows each layer to be developed and debugged independently, with
the assumption that all lower layers have already been debugged and are trusted to
deliver proper services.
• The problem is deciding what order in which to place the layers, as no layer can
call upon the services of any higher layer, and so many chicken-and-egg situations
may arise.
• Layered approaches can also be less efficient, as a request for service from a higher
layer has to filter through all lower layers before it reaches the HW, possibly with
significant processing at each step.
Figure 2.13 - A layered operating system
2.7.3 Microkernels
• The basic idea behind microkernels is to remove all non-essential services from the
kernel, and implement them as system applications instead, thereby making the
kernel as small and efficient as possible.
• Most microkernels provide basic process and memory management, and message
passing between other services, and not much more.
• Security and protection can be enhanced, as most services are performed in user
mode, not kernel mode.
• System expansion can also be easier, because it only involves adding more system
applications, not rebuilding a new kernel.
• Mach was the first and most widely known microkernel, and now forms a major
component of Mac OS X.
• Windows NT was originally a microkernel design, but suffered from performance problems
relative to Windows 95. NT 4.0 improved performance by moving more services
into the kernel, and now XP is back to being more monolithic.
• Another microkernel example is QNX, a real-time OS for embedded systems.
Figure 2.14 - Architecture of a typical microkernel
2.7.4 Modules
2.7.5.1 Mac OS X
• The Mac OS X architecture relies on the Mach microkernel for basic system
management services, and the BSD kernel for additional services. Application
services and dynamically loadable modules ( kernel extensions ) provide the rest of
the OS functionality:
2.7.5.2 iOS
• The iOS operating system was developed by Apple for iPhones and iPads. It runs
with less memory and computing power than Mac OS X, and supports a touchscreen
interface and graphics for small screens:
2.7.5.3 Android
• The Android OS was developed for Android smartphones and tablets by the Open
Handset Alliance, primarily Google.
• Android is an open-source OS, as opposed to iOS, which has led to its popularity.
• Android includes versions of Linux and a Java virtual machine both optimized for
small platforms.
• Android apps are developed using a special Java-for-Android development
environment.
Processes
References:
1. Abraham Silberschatz, Greg Gagne, and Peter Baer Galvin, "Operating System
Concepts, Ninth Edition ", Chapter 3
• Process memory is divided into four sections as shown in Figure 3.1 below:
o The text section comprises the compiled program code, read in from non-volatile
storage when the program is launched.
o The data section stores global and static variables, allocated and initialized prior to
executing main.
o The heap is used for dynamic memory allocation, and is managed via calls to new,
delete, malloc, free, etc.
o The stack is used for local variables. Space on the stack is reserved for local variables
when they are declared ( at function entrance or elsewhere, depending on the
language ), and the space is freed up when the variables go out of scope. Note that
the stack is also used for function return values, and the exact mechanisms of stack
management may be language specific.
o Note that the stack and the heap start at opposite ends of the process's free space and
grow towards each other. If they should ever meet, then either a stack overflow error
will occur, or else a call to new or malloc will fail due to insufficient memory
available.
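• A minimal sketch ( variable names are illustrative ) showing which section each kind of variable lives in:
    #include <stdlib.h>

    int initialized = 42;             /* data section: global and static variables            */
    static int counter;               /* data ( BSS ) section as well                         */

    int square( int x ) {             /* the compiled code itself lives in the text section   */
        int local = x * x;            /* stack: freed automatically when the function returns */
        counter++;
        return local;
    }

    int main( void ) {
        int *dynamic = malloc( 10 * sizeof( int ) );   /* heap: explicit dynamic allocation   */
        dynamic[ 0 ] = square( initialized );
        free( dynamic );                               /* heap space returned via free( )     */
        return 0;
    }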
• When processes are swapped out of memory and later restored, additional
information must also be stored and restored. Key among them are the program
counter and the value of all program registers.
For each process there is a Process Control Block, PCB, which stores the following
( types of ) process-specific information, as illustrated in Figure 3.1. ( Specific
details may vary from system to system. )
• Modern systems allow a single process to have multiple threads of execution, which
execute concurrently. Threads are covered extensively in the next chapter.
• The two main objectives of the process scheduling system are to keep the CPU busy
at all times and to deliver "acceptable" response times for all programs, particularly
for interactive ones.
• The process scheduler must meet these objectives by implementing suitable policies
for swapping processes in and out of the CPU.
• ( Note that these objectives can be conflicting. In particular, every time the system
steps in to swap processes it takes up time on the CPU to do so, which is thereby
"lost" from doing any useful productive work. )
3.2.2 Schedulers
Figure 3.11
3.3.2 Process Termination
• Processes may request their own termination by making the exit( ) system call,
typically returning an int. This int is passed along to the parent if it is doing a wait(
), and is typically zero on successful completion and some non-zero code in the event
of problems.
o child code:
    int exitCode;
    exit( exitCode );        // return exitCode; from main( ) has the same effect
o parent code:
    pid_t pid;
    int status;
    pid = wait( &status );
    // pid indicates which child exited; the exit code is in the low-order bits of status
    // macros such as WIFEXITED( ) and WEXITSTATUS( ) examine the other bits of status for why it stopped
• Processes may also be terminated by the system for a variety of reasons, including:
o The inability of the system to deliver necessary system resources.
o In response to a KILL command, or other unhandled process interrupt.
o A parent may kill its children if the task assigned to them is no longer needed.
o If the parent exits, the system may or may not allow the child to continue without a
parent. ( On UNIX systems, orphaned processes are generally inherited by init,
which then waits on ( reaps ) them when they terminate. The UNIX nohup command
allows a child to continue executing after its parent has exited. )
• When a process terminates, all of its system resources are freed up, open files flushed
and closed, etc. The process termination status and execution times are returned to
the parent if the parent is waiting for the child to terminate, or eventually returned to
init if the process becomes an orphan. ( Processes which are trying to terminate but
which cannot because their parent is not waiting for them are termed zombies. These
are eventually inherited by init as orphans and reaped. Note that modern UNIX
shells do not produce as many orphans and zombies as older systems used to. )
• Shared Memory is faster once it is set up, because no system calls are required and
access occurs at normal memory speeds. However it is more complicated to set up,
and doesn't work as well across multiple computers. Shared memory is generally
preferable when large amounts of information must be shared quickly on the same
computer.
• Message Passing requires system calls for every message transfer, and is therefore
slower, but it is simpler to set up and works well across multiple computers. Message
passing is generally preferable when the amount and/or frequency of data transfers
is small, or when multiple computers are involved.
• This is a classic example, in which one process is producing data and another process
is consuming the data. ( In this example the data is consumed in the order in which it is
produced, although that could vary. )
• The data is passed via an intermediary buffer, which may be either unbounded or
bounded. With a bounded buffer the producer may have to wait until there is space
available in the buffer, but with an unbounded buffer the producer will never need
to wait. The consumer may need to wait in either case until there is data available.
• This example uses shared memory and a circular queue. Note in the code below that
only the producer changes "in", and only the consumer changes "out", and that they
can never be accessing the same array location at the same time.
• First the following data is set up in the shared memory area:
    #define BUFFER_SIZE 10
    typedef struct {
        . . .
    } item;
    item buffer[ BUFFER_SIZE ];
    int in = 0;
    int out = 0;
• Then the producer process. Note that the buffer is full when "in" is one position behind "out" ( modulo BUFFER_SIZE ):
    item nextProduced;
    while( true ) {
        /* produce an item and store it in nextProduced */
        while( ( ( in + 1 ) % BUFFER_SIZE ) == out )
            ;   /* do nothing -- buffer is full */
        buffer[ in ] = nextProduced;
        in = ( in + 1 ) % BUFFER_SIZE;
    }
• Then the consumer process. Note that the buffer is empty when "in" is equal to "out":
    item nextConsumed;
    while( true ) {
        while( in == out )
            ;   /* do nothing -- buffer is empty */
        nextConsumed = buffer[ out ];
        out = ( out + 1 ) % BUFFER_SIZE;
        /* consume the item in nextConsumed */
    }
• Message passing systems must support at a minimum system calls for "send
message" and "receive message".
• A communication link must be established between the cooperating processes before
messages can be sent.
• There are three key issues to be resolved in message passing systems as further
explored in the next three subsections:
o Direct or indirect communication ( naming )
o Synchronous or asynchronous communication
o Automatic or explicit buffering.
3.4.2.1 Naming
• With direct communication the sender must know the name of the receiver to
which it wishes to send a message.
o There is a one-to-one link between every sender-receiver pair.
o For symmetric communication, the receiver must also know the specific name of
the sender from which it wishes to receive messages.
For asymmetric communications, this is not necessary.
• Indirect communication uses shared mailboxes, or ports.
o Multiple processes can share the same mailbox or boxes.
o Only one process can read any given message in a mailbox. Initially the process that
creates the mailbox is the owner, and is the only one allowed to read mail in the
mailbox, although this privilege may be transferred.
▪ ( Of course the process that reads the message can immediately turn around and
place an identical message back in the box for someone else to read, but that may
put it at the back end of a queue of messages. )
o The OS must provide system calls to create and delete mailboxes, and to send and
receive messages to/from mailboxes.
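• POSIX message queues are one concrete realization of mailbox-style indirect communication; a minimal sketch ( the queue name "/demo_box" and the sizes are illustrative; link with -lrt on Linux ):
    #include <mqueue.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    int main( void ) {
        struct mq_attr attr = { .mq_maxmsg = 10, .mq_msgsize = 64 };
        mqd_t mq = mq_open( "/demo_box", O_CREAT | O_RDWR, 0644, &attr );  /* create / open the mailbox */
        if( mq == ( mqd_t ) -1 ) { perror( "mq_open" ); return 1; }

        mq_send( mq, "hello", strlen( "hello" ) + 1, 0 );                  /* deposit a message          */

        char buf[ 64 ];
        mq_receive( mq, buf, sizeof( buf ), NULL );                        /* retrieve it                */
        printf( "received: %s\n", buf );

        mq_close( mq );
        mq_unlink( "/demo_box" );                                          /* delete the mailbox         */
        return 0;
    }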
3.4.2.2 Synchronization
3.4.2.3 Buffering
• Messages are passed via queues, which may have one of three capacity
configurations:
1. Zero capacity - Messages cannot be stored in the queue, so senders must block until
receivers accept the messages.
2. Bounded capacity- There is a certain pre-determined finite capacity in the queue.
Senders must block if the queue is full, until space becomes available in the queue,
but may be either blocking or non-blocking otherwise.
3. Unbounded capacity - The queue has a theoretical infinite capacity, so senders are
never forced to block.
Threads
References:
4.1 Overview
4.1.2 Benefits
• In the many-to-one model, many user-level threads are all mapped onto a
single kernel thread.
• Thread management is handled by the thread library in user space, which is
very efficient.
• However, if a blocking system call is made, then the entire process blocks,
even if the other user threads would otherwise be able to continue.
• Because a single kernel thread can operate only on a single CPU, the many-
to-one model does not allow individual processes to be split across multiple
CPUs.
• Green threads ( on Solaris ) and GNU Portable Threads implemented the many-to-
one model in the past, but few systems continue to do so today.
• The one-to-one model creates a separate kernel thread to handle each user
thread.
• The one-to-one model overcomes the problems listed above involving blocking
system calls and the splitting of processes across multiple CPUs.
• However the overhead of creating and managing a separate kernel thread for every
user thread is significant, and can slow down the system.
• Most implementations of this model place a limit on how many threads can
be created.
• Linux and Windows from 95 to XP implement the one-to-one model for
threads.
Figure 4.3 - One-to-one model
• Thread libraries provide programmers with an API for creating and managing
threads.
• Thread libraries may be implemented either in user space or in kernel space.
The former involves API functions implemented solely within user space,
with no kernel support. The latter involves system calls, and requires a kernel
with thread library support.
• There are three main thread libraries in use today:
1. POSIX Pthreads - may be provided as either a user or kernel library, as
an extension to the POSIX standard.
2. Win32 threads - provided as a kernel-level library on Windows
systems.
3. Java threads - Since Java generally runs on a Java Virtual Machine, the
implementation of threads is based upon whatever OS and hardware the
JVM is running on, i.e. either Pthreads or Win32 threads depending on
the system.
• The following sections will demonstrate the use of threads in all three systems
for calculating the sum of integers from 0 to N in a separate thread, and storing
the result in a variable "sum".
4.3.1 Pthreads
• The POSIX standard ( IEEE 1003.1c ) defines the specification for pThreads,
not the implementation.
• pThreads are available on Solaris, Linux, Mac OSX, Tru64, and via public
domain shareware for Windows.
• Global variables are shared amongst all threads.
• One thread can wait for the others to rejoin before continuing.
• pThreads begin execution in a specified function, in this example the runner(
) function:
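• The runner( ) code itself is not shown above; a minimal Pthreads sketch of the summation example ( N passed on the command line ) might look like this:
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    int sum;                                    /* shared by all threads in the process */

    void *runner( void *param ) {               /* the new thread begins execution here */
        int upper = atoi( (char *) param );
        sum = 0;
        for( int i = 1; i <= upper; i++ )
            sum += i;
        pthread_exit( 0 );
    }

    int main( int argc, char *argv[] ) {
        pthread_t tid;                           /* thread identifier  */
        pthread_attr_t attr;                     /* thread attributes  */

        if( argc != 2 ) { fprintf( stderr, "usage: a.out <integer>\n" ); return 1; }

        pthread_attr_init( &attr );              /* use default attributes            */
        pthread_create( &tid, &attr, runner, argv[ 1 ] );   /* create the worker      */
        pthread_join( tid, NULL );               /* wait for the worker to rejoin     */
        printf( "sum = %d\n", sum );
        return 0;
    }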
4.4.2 Windows Threads
• Similar to pThreads. Examine the code example to see the differences, which
are mostly syntactic & nomenclature:
4.4.3 Java Threads
• Q: If one thread forks, is the entire process copied, or is the new process
single-threaded?
• A: System dependent.
• A: If the new process execs right away, there is no need to copy all the other
threads. If it doesn't, then the entire process should be copied.
• A: Many versions of UNIX provide multiple versions of the fork call for this
purpose.
• Threads that are no longer needed may be cancelled by another thread in one
of two ways:
1. Asynchronous Cancellation cancels the thread immediately.
2. Deferred Cancellation sets a flag indicating the thread should cancel itself
when it is convenient. It is then up to the cancelled thread to check this flag
periodically and exit nicely when it sees the flag set.
• ( Shared ) resource allocation and inter-thread data transfers can be
problematic with asynchronous cancellation.
• Most data is shared among threads, and this is one of the major benefits of
using threads in the first place.
• However sometimes threads need thread-specific data also.
• Most major thread libraries ( pThreads, Win32, Java ) provide support for
thread-specific data, known as thread-local storage or TLS. Note that this is
more like static data than local variables, because it does not cease to exist
when the function ends.
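• A minimal sketch of thread-local storage using the C11 _Thread_local keyword ( compiler support assumed; names are illustrative ); each thread sees and updates its own copy of the counter:
    #include <pthread.h>
    #include <stdio.h>

    static _Thread_local int per_thread_counter = 0;   /* one copy per thread, persists across calls */

    void *worker( void *arg ) {
        for( int i = 0; i < 3; i++ )
            per_thread_counter++;                       /* updates this thread's copy only */
        printf( "thread %ld counter = %d\n", (long) (size_t) arg, per_thread_counter );
        return NULL;
    }

    int main( void ) {
        pthread_t t1, t2;
        pthread_create( &t1, NULL, worker, (void *) 1 );
        pthread_create( &t2, NULL, worker, (void *) 2 );
        pthread_join( t1, NULL );
        pthread_join( t2, NULL );                       /* each thread prints counter = 3  */
        return 0;
    }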
References:
1. Abraham Silberschatz, Greg Gagne, and Peter Baer Galvin, "Operating System Concepts,
Ninth Edition ", Chapter 6
• Almost all programs have some alternating cycle of CPU number crunching and waiting
for I/O of some kind. ( Even a simple fetch from memory takes a long time relative to CPU
speeds. )
• In a simple system running a single process, the time spent waiting for I/O is wasted, and
those CPU cycles are lost forever.
• A scheduling system allows one process to use the CPU while another is waiting for I/O,
thereby making full use of otherwise lost CPU cycles.
• The challenge is to make the overall system as "efficient" and "fair" as possible, subject to
varying and often dynamic conditions, and where "efficient" and "fair" are somewhat
subjective terms, often subject to shifting priority policies.
• Almost all processes alternate between two states in a continuing cycle, as shown in Figure
2.1 below :
o A CPU burst of performing calculations, and
o An I/O burst, waiting for data transfer in or out of the system.
• Whenever the CPU becomes idle, it is the job of the CPU Scheduler (the short-term
scheduler ) to select another process from the ready queue to run next.
• The storage structure for the ready queue and the algorithm used to select the next process
are not necessarily a FIFO queue. There are several alternatives to choose from, as well as
numerous adjustable parameters for each algorithm.
2.1.4 Dispatcher
• The dispatcher is the module that gives control of the CPU to the process selected by the
scheduler. This function involves:
o Switching context.
o Switching to user mode.
o Jumping to the proper location in the newly loaded program.
• The dispatcher needs to be as fast as possible, as it is run on every context switch. The time
consumed by the dispatcher is known as dispatch latency.
• There are several different criteria to consider when trying to select the "best" scheduling
algorithm for a particular situation and environment, including:
o CPU utilization - Ideally the CPU would be busy 100% of the time, so as to waste 0
CPU cycles. On a real system CPU usage should range from 40% ( lightly loaded ) to
80% ( heavily loaded. )
o Throughput - Number of processes completed per unit time. May range from 10 /
second to 1 / hour depending on the specific processes.
o Turnaround time - Time required for a particular process to complete, from
submission time to completion. ( Wall clock time. )
o Waiting time - How much time processes spend in the ready queue waiting their turn
to get on the CPU.
▪ ( Load average - The average number of processes sitting in the ready queue waiting
their turn to get into the CPU. Reported in 1-minute, 5-minute, and 15-minute averages
by "uptime" and "who". )
o Response time - The time taken in an interactive program from the issuance of a
command to the commencement of a response to that command.
• In general one wants to optimize the average value of a criterion ( maximize CPU utilization
and throughput, and minimize all the others. ) However sometimes one wants to do
something different, such as to minimize the maximum response time.
• Sometimes it is more desirable to minimize the variance of a criterion than its actual value,
i.e. users are more accepting of a consistent, predictable system than an inconsistent one,
even if it is a little bit slower.
• FCFS is very simple - Just a FIFO queue, like customers waiting in line at the bank or the
post office or at a copying machine.
• Unfortunately, however, FCFS can yield some very long average wait times, particularly
if the first process to get there takes a long time. For example, consider the following three
processes:
Process Burst Time
P1 24
P2 3
P3 3
• In the first Gantt chart below, process P1 arrives first. The average waiting time for the
three processes is ( 0 + 24 + 27 ) / 3 = 17.0 ms.
• In the second Gantt chart below, the same three processes have an average wait time of ( 0
+ 3 + 6 ) / 3 = 3.0 ms. The total run time for the three bursts is the same, but in the second
case two of the three finish much quicker, and the other process is only delayed by a short
amount.
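• A small sketch ( illustrative helper, not from the text ) that computes the FCFS average waiting time for the two orderings above, confirming the 17.0 ms and 3.0 ms figures:
    #include <stdio.h>

    /* Under FCFS each process waits for the sum of the bursts that run ahead of it
       ( all processes assumed to arrive at time 0 ). */
    double fcfs_avg_wait( const int burst[], int n ) {
        double total_wait = 0;
        int elapsed = 0;
        for( int i = 0; i < n; i++ ) {
            total_wait += elapsed;          /* this process waited while earlier bursts ran */
            elapsed += burst[ i ];
        }
        return total_wait / n;
    }

    int main( void ) {
        int order1[ ] = { 24, 3, 3 };       /* P1 first: ( 0 + 24 + 27 ) / 3 = 17.0 ms */
        int order2[ ] = { 3, 3, 24 };       /* P2, P3 first: ( 0 + 3 + 6 ) / 3 = 3.0 ms */
        printf( "%.1f ms\n", fcfs_avg_wait( order1, 3 ) );
        printf( "%.1f ms\n", fcfs_avg_wait( order2, 3 ) );
        return 0;
    }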
• The idea behind the SJF algorithm is to pick the quickest, smallest job that needs to be
done, get it out of the way first, and then pick the next smallest, quickest job to do next.
• ( Technically this algorithm picks a process based on the next shortest CPU burst, not the
overall process time. )
• For example, the Gantt chart below is based upon the following CPU burst times, ( and the
assumption that all jobs arrive at the same time. )
Process Burst Time
P1 6
P2 8
P3 7
P4 3
• In the case above the average wait time is ( 0 + 3 + 9 + 16 ) / 4 = 7.0 ms, ( as opposed to
10.25 ms for FCFS for the same processes. )
• SJF can be either preemptive or non-preemptive. Preemption occurs when a new process
arrives in the ready queue that has a predicted burst time shorter than the time remaining
in the process whose burst is currently on the CPU. Preemptive SJF is sometimes referred
to as shortest remaining time first scheduling.
• For example, the following Gantt chart is based upon the following data:
Process Arrival Time Burst Time
P1 0 8
P2 1 4
P3 2 9
P4 3 5
• For this example the average waiting time under preemptive SJF is
( ( 10 - 1 ) + ( 1 - 1 ) + ( 17 - 2 ) + ( 5 - 3 ) ) / 4 = 26 / 4 = 6.5 ms.
• Priority scheduling is a more general case of SJF, in which each job is assigned a priority
and the job with the highest priority ( the lowest priority number ) gets the CPU first. For
example, consider the following burst times and priorities:
Process Burst Time Priority
P1 10 3
P2 1 1
P3 2 4
P4 1 5
P5 5 2
• Running the jobs in priority order P2, P5, P1, P3, P4 yields an average waiting time of
( 6 + 0 + 16 + 18 + 1 ) / 5 = 8.2 ms.
• Priorities can be assigned either internally or externally. Internal priorities are assigned by
the OS using criteria such as average burst time, ratio of CPU to I/O activity, system
resource use, and other factors available to the kernel. External priorities are assigned by
users, based on the importance of the job, fees paid, politics, etc.
• Priority scheduling can be either preemptive or non-preemptive.
• Priority scheduling can suffer from a major problem known as indefinite blocking,
or starvation, in which a low-priority task can wait forever because there are always some
other jobs around that have higher priority.
o If this problem is allowed to occur, then processes will either run eventually when
the system load lightens ( at say 2:00 a.m. ), or will eventually get lost when the
system is shut down or crashes. ( There are rumors of jobs that have been stuck for
years. )
o One common solution to this problem is aging, in which priorities of jobs increase
the longer they wait. Under this scheme a low-priority job will eventually get its
priority raised high enough that it gets run.
2.3.4 Round Robin Scheduling
• Round robin scheduling is similar to FCFS scheduling, except that each CPU burst is
assigned a time limit called the time quantum.
• When a process is given the CPU, a timer is set for whatever value has been set for a time
quantum.
o If the process finishes its burst before the time quantum timer expires, then it is
swapped out of the CPU just like the normal FCFS algorithm.
o If the timer goes off first, then the process is swapped out of the CPU and moved
to the back end of the ready queue.
• The ready queue is maintained as a circular queue, so when all processes have had a turn,
then the scheduler gives the first process another turn, and so on.
• RR scheduling can give the effect of all processes sharing the CPU equally, although the
average wait time can be longer than with other scheduling algorithms. In the following
example ( using a time quantum of 4 ms ) the average wait time is 5.66 ms.
Process Burst Time
P1 24
P2 3
P3 3
• The performance of RR is sensitive to the time quantum selected. If the quantum is large
enough, then RR reduces to the FCFS algorithm; if it is very small, then each process gets
1/nth of the processor time and they share the CPU equally.
• BUT, a real system invokes overhead for every context switch, and the smaller the time
quantum the more context switches there are. ( See Figure 6.4 below. ) Most modern
systems use time quantum between 10 and 100 milliseconds, and context switch times on
the order of 10 microseconds, so the overhead is small relative to the time quantum.
Figure 6.4 - The way in which a smaller time quantum increases context switches.
• Turnaround time also varies with the quantum, in a non-obvious manner.
• In general, turnaround time is minimized if most processes finish their next cpu burst within
one time quantum. For example, with three processes of 10 ms bursts each, the average
turnaround time for 1 ms quantum is 29, and for 10 ms quantum it reduces to 20. However,
if it is made too large, then RR just degenerates to FCFS. A rule of thumb is that 80% of
CPU bursts should be smaller than the time quantum.
• When processes can be readily categorized, then multiple separate queues can be
established, each implementing whatever scheduling algorithm is most appropriate for that
type of job, and/or with different parametric adjustments.
• Scheduling must also be done between queues, that is scheduling one queue to get time
relative to other queues. Two common options are strict priority ( no job in a lower priority
queue runs until all higher priority queues are empty ) and round-robin ( each queue gets a
time slice in turn, possibly of different sizes. )
• Note that under this algorithm jobs cannot switch from queue to queue - Once they are
assigned a queue, that is their queue until they finish.
Figure 6.5 - Multilevel queue scheduling
• A solution to the critical section problem must satisfy the following three conditions:
1. Mutual Exclusion - Only one process at a time can be executing in their critical
section.
2. Progress - If no process is currently executing in their critical section, and one or
more processes want to execute their critical section, then only the processes not in
their remainder sections can participate in the decision, and the decision cannot be
postponed indefinitely. ( I.e. processes cannot be blocked forever waiting to get into
their critical sections. )
3. Bounded Waiting - There exists a limit as to how many other processes can get
into their critical sections after a process requests entry into their critical section
and before that request is granted. ( I.e. a process requesting entry into their critical
section will get a turn eventually, and there is a limit as to how many other processes
get to go first. )
• We assume that all processes proceed at a non-zero speed, but no assumptions can be made
regarding the relative speed of one process versus another.
• Kernel processes can also be subject to race conditions, which can be especially
problematic when updating commonly shared kernel data structures such as open file tables
or virtual memory management. Accordingly kernels can take on one of two forms:
o Non-preemptive kernels do not allow processes to be interrupted while in kernel
mode. This eliminates the possibility of kernel-mode race conditions, but requires
kernel mode operations to complete very quickly, and can be problematic for real-
time systems, because timing cannot be guaranteed.
o Preemptive kernels allow for real-time operations, but must be carefully written to
avoid race conditions. This can be especially tricky on SMP systems, in which
multiple kernel processes may be running simultaneously on different processors.
2.6 Semaphores
• A more robust alternative to simple mutexes is to use semaphores, which are integer
variables for which only two ( atomic ) operations are defined, the wait and signal
operations, as shown in the following figure.
• Note that not only must the variable-changing steps ( S-- and S++ ) be indivisible, it is also
necessary that for the wait operation when the test proves false that there be no interruptions
before S gets decremented. It IS okay, however, for the busy loop to be interrupted when
the test is true, which prevents the system from hanging forever.
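• The figure referred to above is not reproduced; a minimal C-like sketch of the classical busy-wait definitions ( function and type names are illustrative, and the test-and-decrement must in reality be performed atomically ):
    typedef struct { volatile int value; } semaphore;

    void sem_wait_busy( semaphore *S ) {       /* the wait( ) / P( ) operation */
        while( S->value <= 0 )
            ;                                  /* busy wait ( spin ) until the value is positive */
        S->value--;
    }

    void sem_signal_busy( semaphore *S ) {     /* the signal( ) / V( ) operation */
        S->value++;
    }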
2.6.1 Semaphore Usage
o Counting semaphores can take on any integer value, and are usually used to count
the number remaining of some limited resource. The counter is initialized to the
number of such resources available in the system, and whenever the counting
semaphore is greater than zero, then a process can enter a critical section and use
one of the resources. When the counter gets to zero ( or negative in some
implementations ), then the process blocks until another process frees up a resource
and increments the counting semaphore with a signal call. ( The binary semaphore
can be seen as just a special case where the number of resources initially available
is just one. )
o Semaphores can also be used to synchronize certain operations between processes.
For example, suppose it is important that process P1 execute statement S1 before
process P2 executes statement S2.
▪ First we create a semaphore named synch that is shared by the two
processes, and initialize it to zero.
▪ Then in process P1 we insert the code:
S1;
signal( synch );
▪ Then in process P2 we insert the code:
wait( synch );
S2;
▪ Because synch was initialized to 0, process P2 will block on the wait until
after P1 executes the call to signal.
• The big problem with semaphores as described above is the busy loop in the wait call,
which consumes CPU cycles without doing any useful work. This type of lock is known
as a spinlock, because the lock just sits there and spins while it waits. While this is
generally a bad thing, it does have the advantage of not invoking context switches, and so
it is sometimes used in multi-processing systems when the wait time is expected to be short
- One thread spins on one processor while another completes their critical section on
another processor.
• An alternative approach is to block a process when it is forced to wait for an available
semaphore, and swap it out of the CPU. In this implementation each semaphore needs to
maintain a list of processes that are blocked waiting for it, so that one of the processes can
be woken up and swapped back in when the semaphore becomes available. ( Whether it
gets swapped back into the CPU immediately or whether it needs to hang out in the ready
queue for a while is a scheduling problem. )
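• POSIX semaphores ( sem_init, sem_wait, sem_post ) take this blocking approach; a minimal sketch protecting a shared counter between two threads ( names and counts are illustrative ):
    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    sem_t mutex;                                /* binary semaphore guarding the counter  */
    int counter = 0;

    void *worker( void *arg ) {
        for( int i = 0; i < 100000; i++ ) {
            sem_wait( &mutex );                 /* blocks ( does not spin ) if unavailable */
            counter++;                          /* critical section                        */
            sem_post( &mutex );                 /* signal: wake one waiter, if any         */
        }
        return NULL;
    }

    int main( void ) {
        pthread_t t1, t2;
        sem_init( &mutex, 0, 1 );               /* initial value 1 gives mutual exclusion  */
        pthread_create( &t1, NULL, worker, NULL );
        pthread_create( &t2, NULL, worker, NULL );
        pthread_join( t1, NULL );
        pthread_join( t2, NULL );
        printf( "counter = %d\n", counter );    /* 200000 with proper synchronization      */
        sem_destroy( &mutex );
        return 0;
    }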
• One important problem that can arise when using semaphores to block processes waiting
for a limited resource is the problem of deadlocks, which occur when multiple processes
are blocked, each waiting for a resource that can only be freed by one of the other ( blocked
) processes, as illustrated in the following example. ( Deadlocks are covered more
completely in chapter 7. )
• Another problem to consider is that of starvation, in which one or more processes gets
blocked forever, and never get a chance to take their turn in the critical section. For
example, in the semaphores above, we did not specify the algorithms for adding processes
to the waiting queue in the semaphore in the wait( ) call, or selecting one to be removed
from the queue in the signal( ) call. If the method chosen is a FIFO queue, then every
process will eventually get their turn, but if a LIFO queue is implemented instead, then the
first process to start waiting could starve.
2.7 Monitors
• Semaphores can be very useful for solving concurrency problems, but only if
programmers use them properly. If even one process fails to abide by the proper use of
semaphores, either accidentally or deliberately, then the whole system breaks down. ( And
since concurrency problems are by definition rare events, the problem code may easily go
unnoticed and/or be heinous to debug. )
• For this reason a higher-level language construct has been developed, called monitors.
• A monitor is essentially a class, in which all data is private, and with the special restriction
that only one method within any given monitor object may be active at the same time. An
additional restriction is that monitor methods may only access the shared data within the
monitor and any data passed to them as parameters. I.e. they cannot access any data external
to the monitor.
Figure 5.15 - Syntax of a monitor.
• Figure 5.16 shows a schematic of a monitor, with an entry queue of processes waiting their
turn to execute monitor operations ( methods. )
Figure 5.16 - Schematic view of a monitor
• In order to fully realize the potential of monitors, we need to introduce one additional new
data type, known as a condition.
o A variable of type condition has only two legal operations, wait and signal. I.e. if
X was defined as type condition, then legal operations would be X.wait( ) and
X.signal( )
o The wait operation blocks a process until some other process calls signal, and adds
the blocked process onto a list associated with that condition.
o The signal process does nothing if there are no processes waiting on that condition.
Otherwise it wakes up exactly one process from the condition's list of waiting
processes. ( Contrast this with counting semaphores, which always affect the
semaphore on a signal call. )
• Figure 5.17 below illustrates a monitor that includes condition variables within its data
space. Note that the condition variables, along with the list of processes currently waiting
for the conditions, are in the data space of the monitor - The processes on these lists are not
"in" the monitor, in the sense that they are not executing any code in the monitor.
Figure 5.17 - Monitor with condition variables
• But now there is a potential problem - If process P within the monitor issues a signal that
would wake up process Q also within the monitor, then there would be two processes
running simultaneously within the monitor, violating the exclusion requirement.
Accordingly there are two possible solutions to this dilemma:
Signal and wait - When process P issues the signal to wake up process Q, P then waits, either for
Q to leave the monitor or on some other condition.
Signal and continue - When P issues the signal, Q waits, either for P to exit the monitor or for
some other condition.
There are arguments for and against either choice. Concurrent Pascal offers a third alternative -
The signal call causes the signaling process to immediately exit the monitor, so that the waiting
process can then wake up and proceed.
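• C has no built-in monitor construct, but a mutex plus a condition variable gives roughly the signal-and-continue behavior described above; a minimal sketch ( all names are illustrative ):
    #include <pthread.h>

    /* A hand-built "monitor": one lock serializes all entry, and the condition
       variable plays the role of a monitor condition ( signal-and-continue ). */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  nonzero = PTHREAD_COND_INITIALIZER;
    static int count = 0;

    void increment( void ) {
        pthread_mutex_lock( &lock );        /* enter the monitor                              */
        count++;
        pthread_cond_signal( &nonzero );    /* wake one waiter, if any; the signaller continues */
        pthread_mutex_unlock( &lock );      /* leave the monitor                              */
    }

    void decrement( void ) {
        pthread_mutex_lock( &lock );
        while( count == 0 )                 /* x.wait( ): release the lock and block          */
            pthread_cond_wait( &nonzero, &lock );
        count--;
        pthread_mutex_unlock( &lock );
    }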
2.8 Classic Problems of Synchronization
The following classic problems are used to test virtually every new proposed synchronization
algorithm.
• One possible solution, as shown in the following code section, is to use a set of five
semaphores ( chopsticks[ 5 ] ), and to have each hungry philosopher first wait on their left
chopstick ( chopsticks[ i ] ), and then wait on their right chopstick ( chopsticks[ ( i + 1 ) %
5])
• But suppose that all five philosophers get hungry at the same time, and each starts by
picking up their left chopstick. They then look for their right chopstick, but because it is
unavailable, they wait for it, forever, and eventually all the philosophers starve due to the
resulting deadlock.
Figure 5.14 - The structure of philosopher i.
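• Figure 5.14 is not reproduced here; in outline, using the semaphore wait( ) and signal( ) operations, the structure of philosopher i is:
    /* Prone to deadlock if all five philosophers pick up their left chopstick at once. */
    do {
        wait( chopstick[ i ] );                    /* pick up left chopstick   */
        wait( chopstick[ ( i + 1 ) % 5 ] );        /* pick up right chopstick  */

        /* . . . eat for a while . . . */

        signal( chopstick[ i ] );                  /* put down left chopstick  */
        signal( chopstick[ ( i + 1 ) % 5 ] );      /* put down right chopstick */

        /* . . . think for a while . . . */
    } while( true );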
Memory Management
The term Memory can be defined as a collection of data in a specific format. It is used to store
instructions and process data.
The main memory is central to the operation of a modern computer. Main Memory is a large
array of words or bytes, ranging in size from hundreds of thousands to billions. Main memory is
a repository of rapidly available information shared by the CPU and I/O devices. Main memory
is the place where programs and information are kept when the processor is effectively utilizing
them. Main memory is associated with the processor, so moving instructions and information
into and out of the processor is extremely fast. Main memory is also known as RAM ( Random
Access Memory ). This memory is volatile: RAM loses its data when a power
interruption occurs.
In a multiprogramming computer, the operating system resides in a part of memory and the rest
is used by multiple processes. The task of subdividing the memory among different processes is
called memory management. Memory management is a method in the operating system to
manage operations between main memory and disk during process execution. The main aim of
memory management is to achieve efficient utilization of memory.
3.2 Swapping :
When a process is executed it must reside in main memory. Swapping is the act of temporarily
moving a process out of main memory into secondary storage ( which is slow compared to main
memory ), and later bringing it back into main memory for continued execution. Swapping allows
more processes to be run than can fit into memory at one time. The major part of the swapping
cost is transfer time, and the total transfer time is directly proportional to the amount of memory
swapped. Swapping is also known as roll-out, roll-in, because if a higher-priority process arrives
and wants service, the memory manager can swap out a lower-priority process and then load and
execute the higher-priority process. After the higher-priority work finishes, the lower-priority
process is swapped back into memory and continues its execution.
This is the simplest memory management approach: the memory is divided into two sections:
• one part for the operating system
• second part for user programs
3.3 Contiguous Memory Allocation :
The main memory should oblige both the operating system and the different client
processes. Therefore, the allocation of memory becomes an important task in the operating
system. The memory is usually divided into two partitions: one for the resident operating
system and one for the user processes. We normally need several user processes to reside in
memory simultaneously. Therefore, we need to consider how to allocate available memory
to the processes that are in the input queue waiting to be brought into memory. In contiguous
memory allocation, each process is contained in a single contiguous segment of memory.
To gain proper memory utilization, memory must be allocated in an efficient manner.
One of the simplest methods for allocating memory is to divide memory into several fixed-
sized partitions and each partition contains exactly one process. Thus, the degree of
multiprogramming is obtained by the number of partitions.
Fixed ( multiple ) partition allocation: In this method, memory is divided into several fixed-sized
partitions; a process is selected from the input queue and loaded into a free partition. When the
process terminates, the partition becomes available for other processes.
Variable ( dynamic ) partition allocation: In this method, the operating system maintains a table that
indicates which parts of memory are available and which are occupied by processes. Initially,
all memory is available for user processes and is considered one large block of available
memory. This available memory is known as a “Hole”. When the process arrives and needs
memory, we search for a hole that is large enough to store this process. If the requirement is
fulfilled then we allocate memory to process, otherwise keeping the rest available to satisfy
future requests. While allocating a memory sometimes dynamic storage allocation problems
occur, which concerns how to satisfy a request of size n from a list of free holes. There are
some solutions to this problem:
First fit:-
In first fit, the first free hole that is large enough to satisfy the process's requirement is allocated.
Here, in this diagram the 40 KB memory block is the first available free hole that can store
process A ( size 25 KB ), because the first two blocks did not have sufficient memory space.
Best fit:-
In best fit, we allocate the smallest hole that is big enough to satisfy the process's requirements.
For this, we must search the entire list, unless the list is ordered by size.
Here in this example, first, we traverse the complete list and find the last hole 25KB is the
best suitable hole for Process A(size 25KB).
In this method memory utilization is maximum as compared to other memory allocation
techniques.
Worst fit:-
In worst fit, we allocate the largest available hole to the process. This method
produces the largest leftover hole.
Here in this example, Process A (Size 25 KB) is allocated to the largest available memory
block which is 60KB. Inefficient memory utilization is a major issue in the worst fit.
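• A small sketch ( illustrative, not from the text ) that picks a hole index under each strategy from a list of free-hole sizes; with the hole sizes below, a 25 KB request lands in the 40 KB, 25 KB, and 60 KB holes respectively:
    #include <stdio.h>

    /* Return the index of the hole chosen for a request, or -1 if none fits.
       strategy: 0 = first fit, 1 = best fit, 2 = worst fit.                  */
    int choose_hole( const int hole[], int n, int request, int strategy ) {
        int chosen = -1;
        for( int i = 0; i < n; i++ ) {
            if( hole[ i ] < request )
                continue;                                            /* too small for the request      */
            if( strategy == 0 )
                return i;                                            /* first fit: take it immediately */
            if( chosen == -1 ||
                ( strategy == 1 && hole[ i ] < hole[ chosen ] ) ||   /* best fit: smallest adequate    */
                ( strategy == 2 && hole[ i ] > hole[ chosen ] ) )    /* worst fit: largest hole        */
                chosen = i;
        }
        return chosen;
    }

    int main( void ) {
        int holes[ ] = { 10, 20, 40, 60, 25 };                       /* free hole sizes in KB          */
        printf( "first fit: hole %d\n", choose_hole( holes, 5, 25, 0 ) );   /* index 2, the 40 KB hole */
        printf( "best  fit: hole %d\n", choose_hole( holes, 5, 25, 1 ) );   /* index 4, the 25 KB hole */
        printf( "worst fit: hole %d\n", choose_hole( holes, 5, 25, 2 ) );   /* index 3, the 60 KB hole */
        return 0;
    }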
3.3.2 Fragmentation
All the memory allocation strategies suffer from external fragmentation, though first and
best fits experience the problems more so than worst fit. External fragmentation means that
the available memory is broken up into lots of little pieces, none of which is big enough to
satisfy the next memory requirement, although the sum total could.
The amount of memory lost to fragmentation may vary with algorithm, usage patterns, and
some design decisions such as which end of a hole to allocate and which end to save on the
free list.
Statistical analysis of first fit, for example, shows that for N blocks of allocated memory,
another 0.5 N will be lost to fragmentation.
Internal fragmentation also occurs, with all memory allocation strategies. This is caused by
the fact that memory is allocated in blocks of a fixed size, whereas the actual memory needed
will rarely be that exact size. For a random distribution of memory requests, on the average
1/2 block will be wasted per memory request, because on the average the last allocated block
will be only half full.
Note that the same effect happens with hard drives, and that modern hardware gives
us increasingly larger drives and memory at the expense of ever larger block sizes,
which translates to more memory lost to internal fragmentation.
Some systems use variable size blocks to minimize losses due to internal
fragmentation.
If the programs in memory are relocatable, ( using execution-time address binding ), then the
external fragmentation problem can be reduced via compaction, i.e. moving all processes
down to one end of physical memory. This only involves updating the relocation register for
each process, as all internal work is done using logical addresses.
Another solution as we will see in upcoming sections is to allow processes to use non-
contiguous blocks of physical memory, with a separate relocation register for each block.
3.4 Paging
• The basic idea behind paging is to divide physical memory into a number of equal sized
blocks called frames, and to divide a program's logical memory space into blocks of the
same size called pages.
• Any page ( from any process ) can be placed into any available frame.
• The page table is used to look up what frame a particular page is stored in at the moment.
In the following example, for instance, page 2 of the program's logical memory is currently
stored in frame 3 of physical memory:
- Paging hardware
• ( DOS used to use an addressing scheme with 16 bit frame numbers and 16-bit offsets, on
hardware that only supported 24-bit hardware addresses. The result was a resolution of
starting frame addresses finer than the size of a single frame, and multiple frame-offset
combinations that mapped to the same physical hardware address. )
• Consider the following micro example, in which a process has 16 bytes of logical memory,
mapped in 4 byte pages into 32 bytes of physical memory. ( Presumably some other
processes would be consuming the remaining 16 bytes of physical memory. )
Paging example for a 32-byte memory with 4-byte pages
• Note that paging is like having a table of relocation registers, one for each page of the
logical memory.
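The translation a paging MMU performs can be sketched in a few lines of C. The page size matches the 4-byte micro example above, while the page-table contents below are illustrative values ( chosen so that page 2 maps to frame 3, as in the earlier figure ); a real MMU does this lookup in hardware.

#include <stdio.h>

#define PAGE_SIZE 4                   /* 4-byte pages, as in the micro example */

/* Hypothetical page table for a 16-byte logical space (4 pages).
 * page_table[p] holds the physical frame that logical page p lives in. */
static const unsigned page_table[4] = { 5, 6, 3, 2 };

static unsigned translate(unsigned logical) {
    unsigned page   = logical / PAGE_SIZE;   /* high-order bits: page number */
    unsigned offset = logical % PAGE_SIZE;   /* low-order bits: page offset  */
    unsigned frame  = page_table[page];      /* page-table lookup            */
    return frame * PAGE_SIZE + offset;       /* physical address             */
}

int main(void) {
    for (unsigned addr = 0; addr < 16; addr++)
        printf("logical %2u -> physical %2u\n", addr, translate(addr));
    return 0;
}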
• There is no external fragmentation with paging. All blocks of physical memory are used,
and there are no gaps in between and no problems with finding the right sized hole for a
particular chunk of memory.
• There is, however, internal fragmentation. Memory is allocated in chunks the size of a
page, and on the average, the last page will only be half full, wasting on the average half a
page of memory per process. ( Possibly more, if processes keep their code and data in
separate pages. )
• Larger page sizes waste more memory, but are more efficient in terms of overhead. Modern
trends have been to increase page sizes, and some systems even have multiple size pages
to try and make the best of both worlds.
• Page table entries ( frame numbers ) are typically 32 bit numbers, allowing access to 2^32
physical page frames. If those frames are 4 KB in size each, that translates to 16 TB of
addressable physical memory. ( 32 + 12 = 44 bits of physical address space. )
• When a process requests memory ( e.g. when its code is loaded in from disk ), free frames
are allocated from a free-frame list, and inserted into that process's page table.
• Processes are blocked from accessing anyone else's memory because all of their memory
requests are mapped through their page table. There is no way for them to generate an
address that maps into any other process's memory space.
• The operating system must keep track of each individual process's page table, updating it
whenever the process's pages get moved in and out of memory, and applying the correct
page table when processing system calls for a particular process. This all increases the
overhead involved when swapping processes in and out of the CPU. ( The currently active
page table must be updated to reflect the process that is currently running. )
• Most modern computer systems support logical address spaces of 2^32 to 2^64.
• With a 2^32 address space and 4K ( 2^12 ) page sizes, this leaves 2^20 entries in the page
table. At 4 bytes per entry, this amounts to a 4 MB page table, which is too large to
reasonably keep in contiguous memory. ( And to swap in and out of memory with each
process switch. ) Note that with 4K pages, this would take 1024 pages just to hold the page
table!
• One option is to use a two-tier paging system, i.e. to page the page table.
• For example, the 20 bits described above could be broken down into two 10-bit page
numbers. The first identifies an entry in the outer page table, which identifies where in
memory to find one page of an inner page table. The second 10 bits finds a specific entry
in that inner page table, which in turn identifies a particular frame in physical memory.
( The remaining 12 bits of the 32 bit logical address are the offset within the 4K frame. )
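A minimal sketch of how a 32-bit logical address is broken apart under this 10 + 10 + 12 bit two-level scheme; the example address is arbitrary, and only shifts and masks are involved.

#include <stdio.h>

/* Split a 32-bit logical address into an outer page-table index (10 bits),
 * an inner page-table index (10 bits), and an offset (12 bits). */
int main(void) {
    unsigned int addr = 0x12345678;              /* arbitrary example address */

    unsigned int offset = addr & 0xFFF;          /* low 12 bits  */
    unsigned int inner  = (addr >> 12) & 0x3FF;  /* next 10 bits */
    unsigned int outer  = (addr >> 22) & 0x3FF;  /* top 10 bits  */

    printf("outer page-table index: %u\n", outer);
    printf("inner page-table index: %u\n", inner);
    printf("offset within 4K frame: %u\n", offset);
    return 0;
}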
• VAX Architecture divides 32-bit addresses into 4 equal sized sections, and each page is
512 bytes, yielding an address form of:
• With a 64-bit logical address space and 4K pages, there are 52 bits worth of page numbers,
which is still too many even for two-level paging. One could increase the paging level, but
with 10-bit page tables it would take 7 levels of indirection, which would be prohibitively
slow memory access. So some other approach must be used.
• One common data structure for accessing data that is sparsely distributed over a broad
range of possible values is with hash tables. Figure 8.16 below illustrates a hashed page
table using chain-and-bucket hashing:
Hashed page table
• Another approach is to use an inverted page table. Instead of a table listing all of the pages
for a particular process, an inverted page table lists all of the pages currently loaded in
memory, for all processes. ( I.e. there is one entry per frame instead of one entry per page. )
• Access to an inverted page table can be slow, as it may be necessary to search the entire
table in order to find the desired page ( or to discover that it is not there. ) Hashing the table
can help speedup the search process.
• Inverted page tables prohibit the normal method of implementing shared memory, which
is to map multiple logical pages to a common physical frame. ( Because each frame is now
mapped to one and only one process. )
Inverted page table
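A minimal sketch of an inverted-table lookup, assuming one entry per frame that records which ( pid, page ) pair currently occupies it; the table contents are made-up. The linear search shown is exactly the cost that hashing the table is meant to avoid.

#include <stdio.h>

#define NFRAMES 8

/* One entry per physical frame: which (pid, page) currently occupies it.
 * Contents are made-up values used only for illustration. */
struct ipt_entry { int pid; int page; };
static struct ipt_entry ipt[NFRAMES] = {
    {1, 0}, {2, 3}, {1, 2}, {3, 1}, {2, 0}, {1, 5}, {3, 4}, {2, 7},
};

/* Return the frame holding (pid, page), or -1 if the page is not resident. */
static int ipt_lookup(int pid, int page) {
    for (int f = 0; f < NFRAMES; f++)
        if (ipt[f].pid == pid && ipt[f].page == page)
            return f;
    return -1;                                   /* not resident: page fault */
}

int main(void) {
    printf("pid 1, page 2 -> frame %d\n", ipt_lookup(1, 2));   /* prints 2  */
    printf("pid 3, page 9 -> frame %d\n", ipt_lookup(3, 9));   /* prints -1 */
    return 0;
}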
3.5 Segmentation
• Most users ( programmers ) do not think of their programs as existing in one continuous
linear address space.
• Rather they tend to think of their memory in multiple segments, each dedicated to a
particular use, such as code, data, the stack, the heap, etc.
• Memory segmentation supports this view by providing addresses with a segment number
( mapped to a segment base address ) and an offset from the beginning of that segment.
• For example, a C compiler might generate 5 segments for the user code, library code, global
( static ) variables, the stack, and the heap, as shown in Figure
Programmer's view of a program.
• The basic idea behind demand paging is that when a process is swapped in, its
pages are not swapped in all at once. Rather they are swapped in only when
the process needs them. ( on demand. ) This is termed a lazy swapper,
although a pager is a more accurate term.
• In other words, when a process is swapped in, the pager only loads into memory those
pages that it expects the process to need ( right away. )
• Pages that are not loaded into memory are marked as invalid in the page
table, using the invalid bit. ( The rest of the page table entry may either be
blank or contain information about where to find the swapped-out page on
the hard drive. )
• If the process only ever accesses pages that are loaded in memory ( memory
resident pages ), then the process runs exactly as if all the pages were loaded
in to memory.
Figure 3.1 - Page table when some pages are not in main memory.
• On the other hand, if a page is needed that was not originally loaded up, then
a page fault trap is generated, which must be handled in a series of steps:
1. The memory address requested is first checked, to make sure it was a
valid memory request.
2. If the reference was invalid, the process is terminated. Otherwise, the
page must be paged in.
3. A free frame is located, possibly from a free-frame list.
4. A disk operation is scheduled to bring in the necessary page from disk. (
This will usually block the process on an I/O wait, allowing some other
process to use the CPU in the meantime. )
5. When the I/O operation is complete, the process's page table is
updated with the new frame number, and the invalid bit is changed to
indicate that this is now a valid page reference.
6. The instruction that caused the page fault must now be restarted from
the beginning, ( as soon as this process gets another turn on the CPU. )
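The six steps can be summarized as a toy, in-memory simulation; the page table, free-frame list, and "disk read" below are all made-up structures used only to show the control flow, not a real kernel interface.

#include <stdio.h>

#define NPAGES  8
#define NFRAMES 4

struct pte { int frame; int valid; };
static struct pte page_table[NPAGES];            /* all entries invalid at start */
static int free_frames[NFRAMES] = { 0, 1, 2, 3 };
static int nfree = NFRAMES;

static void page_fault(int page) {
    if (page < 0 || page >= NPAGES) {            /* steps 1-2: validity check    */
        printf("invalid reference: terminate process\n");
        return;
    }
    if (nfree == 0) {                            /* would need page replacement  */
        printf("no free frame: page replacement needed\n");
        return;
    }
    int frame = free_frames[--nfree];            /* step 3: take a free frame    */
    printf("reading page %d from disk into frame %d\n", page, frame);  /* step 4 */
    page_table[page].frame = frame;              /* step 5: update page table    */
    page_table[page].valid = 1;
    printf("restart the faulting instruction\n");        /* step 6 */
}

int main(void) {
    int refs[] = { 0, 2, 0, 5, 9 };              /* sample references (9 is invalid) */
    for (int i = 0; i < 5; i++) {
        int p = refs[i];
        if (p >= 0 && p < NPAGES && page_table[p].valid)
            printf("page %d already resident in frame %d\n", p, page_table[p].frame);
        else
            page_fault(p);
    }
    return 0;
}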
• In an extreme case, NO pages are swapped in for a process until they are
requested by page faults. This is known as pure demand paging.
• In theory each instruction could generate multiple page faults
• The hardware necessary to support virtual memory is the same as for paging
and swapping: A page table and secondary memory.
• A crucial part of the process is that the instruction must be restarted from
scratch once the desired page has been made available in memory. For most
simple instructions this is not a major difficulty. However there are some
architectures that allow a single instruction to modify a fairly large block of
data, ( which may span a page boundary ), and if some of the data gets
modified before the page fault occurs, this could cause problems. One
solution is to access both ends of the block before executing the instruction,
guaranteeing that the necessary pages get paged in before the instruction
begins.
• Obviously there is some slowdown and performance hit whenever a page fault
occurs and the system has to go get it from memory, but just how big a hit is it
exactly?
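The usual way to quantify the hit is the effective access time: if p is the probability of a page fault, ma the memory-access time, and the fault-service time is dominated by the disk, then EAT = ( 1 - p ) * ma + p * fault_time. The figures below ( 200 ns memory access, 8 ms fault service ) are assumed round numbers, used only to show how quickly even a tiny fault rate dominates.

#include <stdio.h>

int main(void) {
    /* Assumed round figures: 200 ns memory access, 8 ms page-fault service. */
    double ma    = 200.0;            /* ns */
    double fault = 8000000.0;        /* ns ( 8 milliseconds ) */

    for (int i = 0; i <= 4; i++) {
        double p   = i * 0.0005;                     /* fault probability      */
        double eat = (1.0 - p) * ma + p * fault;     /* effective access time  */
        printf("p = %.4f  ->  EAT = %9.1f ns\n", p, eat);
    }
    return 0;
}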
• A subtlety is that swap space is faster to access than the regular file system,
because it does not have to go through the whole directory structure. For this
reason some systems will transfer an entire process from the file system to
swap space before starting up the process, so that future paging all occurs
from the ( relatively ) faster swap space.
• Some systems use demand paging directly from the file system for binary code
( which never changes and hence does not have to be stored on a page
operation ), and to reserve the swap space for data segments that must be
stored. This approach is used by both Solaris and BSD Unix.
• In order to make the most use of virtual memory, we load several processes
into memory at the same time. Since we only load the pages that are actually
needed by each process at any given time, there is room to load many more
processes than if we had to load in the entire process.
• However memory is also needed for other purposes ( such as I/O buffering ),
and what happens if some process suddenly decides it needs more pages and
there aren't any free frames available? There are several possible solutions to
consider:
1. Adjust the memory used by I/O buffering, etc., to free up some frames
for user processes. The decision of how to allocate memory for I/O
versus user processes is a complex one, yielding different policies on
different systems. ( Some allocate a fixed amount for I/O, and others let
the I/O system contend for memory along with everything else. )
2. Put the process requesting more pages into a wait queue until some
free frames become available.
3. Swap some process out of memory completely, freeing up its page
frames.
4. Find some page in memory that isn't being used right now, and swap
that page only out to disk, freeing up a frame that can be allocated to
the process requesting it. This is known as page replacement, and is the
most common solution. There are many different algorithms for page
replacement, which is the subject of the remainder of this section.
• Note that page replacement adds an extra disk write ( to save the victim page ) to the
page-fault handling, effectively doubling the time required to process a page fault. This can be
alleviated somewhat by assigning a modify bit, or dirty bit to each page,
indicating whether or not it has been changed since it was last loaded in from
disk. If the dirty bit has not been set, then the page is unchanged, and does
not need to be written out to disk. Otherwise the page write is required. It
should come as no surprise that many page replacement strategies specifically
look for pages that do not have their dirty bit set, and preferentially select
clean pages as victim pages. It should also be obvious that unmodifiable code
pages never get their dirty bits set.
• There are two major requirements to implement a successful demand paging
system. We must develop a frame-allocation algorithm and a page-
replacement algorithm. The former centers around how many frames are
allocated to each process ( and to other needs ), and the latter deals with how
to select a page for replacement when there are no free frames available.
• The overall goal in selecting and tuning these algorithms is to generate the
fewest number of overall page faults. Because disk access is so slow relative to
memory access, even slight improvements to these algorithms can yield large
improvements in overall system performance.
• Algorithms are evaluated using a given string of memory accesses known as
a reference string, which can be generated in one of ( at least ) three common
ways:
1. Randomly generated, either evenly distributed or with some
distribution curve based on observed system behavior. This is the
fastest and easiest approach, but may not reflect real performance well,
as it ignores locality of reference.
2. Specifically designed sequences. These are useful for illustrating the
properties of comparative algorithms in published papers and
textbooks, ( and also for homework and exam problems. :-) )
3. Recorded memory references from a live system. This may be the best
approach, but the amount of data collected can be enormous, on the
order of a million addresses per second. The volume of collected data
can be reduced by making two important observations:
1. Only the page number that was accessed is relevant. The offset
within that page does not affect paging operations.
2. Successive accesses within the same page can be treated as a
single page request, because all requests after the first are
guaranteed to be page hits
• Although FIFO is simple and easy, it is not always optimal, or even efficient.
• The prediction behind LRU, the Least Recently Used, algorithm is that the
page that has not been used in the longest time is the one that will not be
used again in the near future. ( Note the distinction between FIFO and LRU:
The former looks at the oldest load time, and the latter looks at the
oldest use time. )
• Some view LRU as analogous to OPT, except looking backwards in time instead
of forwards. ( OPT has the interesting property that for any reference string S
and its reverse R, OPT will generate the same number of page faults for S and
for R. It turns out that LRU has this same property. )
• Figure 3.9 illustrates LRU for our sample string, yielding 12 page faults, ( as
compared to 15 for FIFO and 9 for OPT. )
Figure 3.9 - LRU page-replacement algorithm.
• LRU is considered a good replacement policy, and is often used. The problem
is how exactly to implement it. There are two simple approaches commonly
used:
1. Counters. Every memory access increments a counter, and the current
value of this counter is stored in the page table entry for that page.
Then finding the LRU page involves simply searching the table for the
page with the smallest counter value. Note that overflowing of the
counter must be considered.
2. Stack. Another approach is to use a stack, and whenever a page is
accessed, pull that page from the middle of the stack and place it on the
top. The LRU page will always be at the bottom of the stack. Because
this requires removing objects from the middle of the stack, a doubly
linked list is the recommended data structure.
• Note that both implementations of LRU require hardware support, either for
incrementing the counter or for managing the stack, as these operations must
be performed for every memory access.
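A sketch of the counter-based approach, run over the same reference string used in the FIFO / LRU / OPT comparison above with 3 frames; every access stamps the frame with a logical clock, and the victim is the frame with the smallest stamp. For this string it reports 12 faults, matching the figure quoted above.

#include <stdio.h>

#define NFRAMES 3

int main(void) {
    /* The reference string used in the FIFO / LRU / OPT comparison above. */
    int refs[] = { 7,0,1,2,0,3,0,4,2,3,0,3,2,1,2,0,1,7,0,1 };
    int nrefs = sizeof refs / sizeof refs[0];

    int frame_page[NFRAMES];             /* which page each frame holds     */
    long last_used[NFRAMES];             /* clock value at last access      */
    int used = 0, faults = 0;
    long clock = 0;                      /* incremented on every access     */

    for (int i = 0; i < nrefs; i++) {
        int page = refs[i], hit = -1;
        clock++;
        for (int f = 0; f < used; f++)
            if (frame_page[f] == page) { hit = f; break; }
        if (hit >= 0) {                  /* page already resident           */
            last_used[hit] = clock;
            continue;
        }
        faults++;
        int victim;
        if (used < NFRAMES) {            /* still a free frame              */
            victim = used++;
        } else {                         /* evict frame with smallest stamp */
            victim = 0;
            for (int f = 1; f < NFRAMES; f++)
                if (last_used[f] < last_used[victim]) victim = f;
        }
        frame_page[victim] = page;
        last_used[victim] = clock;
    }
    printf("LRU page faults: %d\n", faults);   /* prints 12 for this string */
    return 0;
}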
3.6 Thrashing
• The working set model is based on the concept of locality, and defines
a working set window, of length delta. Whatever pages are included in the
most recent delta page references are said to be in the process's working-set
window, and comprise its current working set, as illustrated in Figure 3.11:
Figure 3.11 - Working-set model.
• The selection of delta is critical to the success of the working set model - If it is
too small then it does not encompass all of the pages of the current locality,
and if it is too large, then it encompasses pages that are no longer being
frequently accessed.
• The total demand, D, is the sum of the sizes of the working sets for all
processes. If D exceeds the total number of available frames, then at least one
process is thrashing, because there are not enough frames available to satisfy
its minimum working set. If D is significantly less than the currently available
frames, then additional processes can be launched.
• The hard part of the working-set model is keeping track of what pages are in
the current working set, since every reference adds one to the set and
removes one older page. An approximation can be made using reference bits
and a timer that goes off after a set interval of memory references.
• Note that there is a direct relationship between the page-fault rate and the
working-set, as a process moves from one locality to another.
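Computing a working set is just collecting the distinct pages among the delta most recent references. A small sketch with a made-up reference string and an assumed delta of 5:

#include <stdio.h>

/* Print the working set at position `t` of the reference string,
 * i.e. the distinct pages among the delta most recent references. */
static void working_set(const int *refs, int t, int delta) {
    int seen[64] = { 0 };                    /* assumes page numbers < 64 */
    int start = t - delta + 1;
    if (start < 0) start = 0;

    printf("working set at t=%d (delta=%d): {", t, delta);
    for (int i = start; i <= t; i++) {
        if (!seen[refs[i]]) {
            seen[refs[i]] = 1;
            printf(" %d", refs[i]);
        }
    }
    printf(" }\n");
}

int main(void) {
    int refs[] = { 1, 2, 1, 5, 7, 7, 7, 7, 5, 1 };   /* made-up reference string */
    working_set(refs, 9, 5);     /* distinct pages among refs[5..9]: 7, 5, 1 */
    return 0;
}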
Deadlocks
Consider two processes, P0 and P1, each of which needs resources A and B but requests
them in opposite order:
    P0: wait(A); wait(B);
    P1: wait(B); wait(A);
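The same pattern, written with POSIX mutexes; if each thread acquires its first lock before either acquires its second, both block forever ( which is the point, so run it expecting a possible hang ). The thread and lock names are just for illustration; on Linux, compile with gcc -pthread.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;

static void *p0(void *arg) {
    (void)arg;
    pthread_mutex_lock(&A);           /* P0: wait(A) */
    pthread_mutex_lock(&B);           /* P0: wait(B) -- may block forever */
    printf("P0 got both locks\n");
    pthread_mutex_unlock(&B);
    pthread_mutex_unlock(&A);
    return NULL;
}

static void *p1(void *arg) {
    (void)arg;
    pthread_mutex_lock(&B);           /* P1: wait(B) */
    pthread_mutex_lock(&A);           /* P1: wait(A) -- may block forever */
    printf("P1 got both locks\n");
    pthread_mutex_unlock(&A);
    pthread_mutex_unlock(&B);
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, p0, NULL);
    pthread_create(&t1, NULL, p1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}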
4.1Resources
• For the purposes of deadlock discussion, a system can be modeled as a collection of limited
resources, which can be partitioned into different categories, to be allocated to a number of
processes, each having different needs.
• Resource categories may include memory, printers, CPUs, open files, tape drives, CD-
ROMS, etc.
• By definition, all the resources within a category are equivalent, and a request of this
category can be equally satisfied by any one of the resources in that category. If this is not
the case ( i.e. if there is some difference between the resources within a category ), then
that category needs to be further divided into separate categories. For example, "printers"
may need to be separated into "laser printers" and "color inkjet printers".
• Some categories may have a single resource.
• In normal operation a process must request a resource before using it, and release it when
it is done, in the following sequence:
1. Request - If the request cannot be immediately granted, then the process must wait
until the resource(s) it needs become available. For example the system calls
open( ), malloc( ), new( ), and request( ).
2. Use - The process uses the resource, e.g. prints to the printer or reads from the file.
3. Release - The process relinquishes the resource, so that it becomes available for
other processes. For example, close( ), free( ), delete( ), and release( ).
• For all kernel-managed resources, the kernel keeps track of what resources are free and
which are allocated, to which process they are allocated, and a queue of processes waiting
for this resource to become available. Application-managed resources can be controlled
using mutexes or wait( ) and signal( ) calls, ( i.e. binary or counting semaphores. )
• A set of processes is deadlocked when every process in the set is waiting for a resource
that is currently allocated to another process in the set ( and which can only be released
when that other waiting process makes progress. )
4.2 Necessary Conditions
• There are four conditions that are necessary to achieve deadlock:
1. Mutual Exclusion - At least one resource must be held in a non-sharable
mode; If any other process requests this resource, then that process must
wait for the resource to be released.
2. Hold and Wait - A process must be simultaneously holding at least one
resource and waiting for at least one resource that is currently being held by
some other process.
3. No preemption - Once a process is holding a resource ( i.e. once its request
has been granted ), then that resource cannot be taken away from that
process until the process voluntarily releases it.
4. Circular Wait - A set of processes { P0, P1, P2, . . ., PN } must exist such
that every P[ i ] is waiting for P[ ( i + 1 ) % ( N + 1 ) ]. ( Note that this
condition implies the hold-and-wait condition, but it is easier to deal with
the conditions if the four are considered separately. )
4.3 Methods for Handling Deadlocks
• Generally speaking there are three ways of handling deadlocks:
1. Deadlock prevention or avoidance - Do not allow the system to get into a
deadlocked state.
2. Deadlock detection and recovery - Abort a process or preempt some resources when
deadlocks are detected.
3. Ignore the problem altogether - If deadlocks only occur once a year or so, it may
be better to simply let them happen and reboot as necessary than to incur the
constant overhead and system performance penalties associated with deadlock
prevention or detection. This is the approach that both Windows and UNIX take.
• In order to avoid deadlocks, the system must have additional information about all
processes. In particular, the system must know what resources a process will or may request
in the future. ( Ranging from a simple worst-case maximum to a complete resource request
and release plan for each process, depending on the particular algorithm. )
• Deadlock detection is fairly straightforward, but deadlock recovery requires either aborting
processes or preempting resources, neither of which is an attractive alternative.
• If deadlocks are neither prevented nor detected, then when a deadlock occurs the system
will gradually slow down, as more and more processes become stuck waiting for resources
currently held by the deadlock and by other waiting processes. Unfortunately this
slowdown can be indistinguishable from a general system slowdown when a real-time
process has heavy computing needs.
4.4 Deadlock Prevention
• Deadlocks can be prevented by preventing at least one of the four required conditions:
4.4.1 Mutual Exclusion
• Shared resources such as read-only files do not lead to deadlocks.
• Unfortunately some resources, such as printers and tape drives, require exclusive
access by a single process.
4.4.2 Hold and Wait
• To prevent this condition processes must be prevented from holding one or more
resources while simultaneously waiting for one or more others. There are several
possibilities for this:
o Require that all processes request all resources at one time. This can be
wasteful of system resources if a process needs one resource early in its
execution and doesn't need some other resource until much later.
o Require that processes holding resources must release them before
requesting new resources, and then re-acquire the released resources along
with the new ones in a single new request. This can be a problem if a process
has partially completed an operation using a resource and then fails to get it
re-allocated after releasing it.
o Either of the methods described above can lead to starvation if a process
requires one or more popular resources.
4.4.3 No Preemption
• Preemption of process resource allocations can prevent this condition of deadlocks,
when it is possible.
o One approach is that if a process is forced to wait when requesting a new
resource, then all other resources previously held by this process are
implicitly released, ( preempted ), forcing this process to re-acquire the old
resources along with the new resources in a single request, similar to the
previous discussion.
o Another approach is that when a resource is requested and not available,
then the system looks to see what other processes currently have those
resources and are themselves blocked waiting for some other resource. If
such a process is found, then some of their resources may get preempted
and added to the list of resources for which the process is waiting.
o Either of these approaches may be applicable for resources whose states are
easily saved and restored, such as registers and memory, but are generally
not applicable to other devices such as printers and tape drives.
4.4.4 Circular Wait
• One way to avoid circular wait is to number all resources, and to require that
processes request resources only in strictly increasing ( or decreasing ) order.
• In other words, in order to request resource Rj, a process must first release all Ri
such that i >= j.
• One big challenge in this scheme is determining the relative ordering of the
different resources
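In code, the ordering rule simply means every thread takes its locks in the same global order. A minimal sketch, assigning arbitrary ranks to two mutexes:

#include <pthread.h>

/* Circular wait cannot occur if every thread acquires locks in increasing
 * "rank" order.  Here A is given rank 1 and B rank 2 (an arbitrary choice),
 * so every thread takes A before B.  Compile with -pthread. */
static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;   /* rank 1 */
static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;   /* rank 2 */

static void do_work_in_order(void) {
    pthread_mutex_lock(&A);       /* lower rank first          */
    pthread_mutex_lock(&B);       /* higher rank next          */
    /* ... use both resources ... */
    pthread_mutex_unlock(&B);     /* release in reverse order  */
    pthread_mutex_unlock(&A);
}

int main(void) {
    do_work_in_order();           /* every thread follows the same ordering */
    return 0;
}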
4.5 Deadlock Avoidance
• The general idea behind deadlock avoidance is to prevent deadlocks from ever happening,
by preventing at least one of the aforementioned conditions.
• This requires more information about each process, AND tends to lead to low device
utilization. ( I.e. it is a conservative approach. )
• In some algorithms the scheduler only needs to know the maximum number of each
resource that a process might potentially use. In more complex algorithms the scheduler
can also take advantage of the schedule of exactly what resources may be needed in what
order.
• When a scheduler sees that starting a process or granting resource requests may lead to
future deadlocks, then that process is just not started or the request is not granted.
• A resource allocation state is defined by the number of available and allocated resources,
and the maximum requirements of all processes in the system.
4.5.1 Safe State
• A state is safe if the system can allocate all resources requested by all processes (
up to their stated maximums ) without entering a deadlock state.
• More formally, a state is safe if there exists a safe sequence of processes { P0, P1,
P2, ..., PN } such that all of the resource requests for Pi can be granted using the
resources currently allocated to Pi and all processes Pj where j < i. ( I.e. if all the
processes prior to Pi finish and free up their resources, then Pi will be able to finish
also, using the resources that they have freed up. )
• If a safe sequence does not exist, then the system is in an unsafe state,
which MAY lead to deadlock. ( All safe states are deadlock free, but not all unsafe
states lead to deadlocks. )
• For example, consider a system with 12 tape drives, allocated as follows. Is this a
safe state? What is the safe sequence?
Process   Maximum Needs   Currently Allocated
P0        10              5
P1        4               2
P2        9               2
• What happens to the above table if process P2 requests and is granted one more
tape drive?
• Key to the safe state approach is that when a request is made for resources, the
request is granted only if the resulting allocation state is a safe one.
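A sketch of the safety check for the table above: 12 drives total, so 9 are allocated and 3 are free. The check repeatedly finds a process whose remaining need fits within the free drives, lets it "run to completion", and reclaims its allocation; for this table it prints the safe sequence P1, P0, P2.

#include <stdio.h>

#define NPROC 3

int main(void) {
    int max_need[NPROC] = { 10, 4, 9 };    /* P0, P1, P2 maximum needs   */
    int alloc[NPROC]    = {  5, 2, 2 };    /* currently allocated drives */
    int finished[NPROC] = {  0, 0, 0 };
    int total = 12;
    int free_drives = total - (alloc[0] + alloc[1] + alloc[2]);   /* = 3 */

    printf("safe sequence:");
    for (int done = 0; done < NPROC; ) {
        int progress = 0;
        for (int p = 0; p < NPROC; p++) {
            int need = max_need[p] - alloc[p];         /* drives still needed */
            if (!finished[p] && need <= free_drives) {
                free_drives += alloc[p];               /* runs, then releases all */
                finished[p] = 1;
                printf(" P%d", p);
                done++;
                progress = 1;
            }
        }
        if (!progress) { printf(" (unsafe state!)"); break; }
    }
    printf("\n");
    return 0;
}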
4.5.2 Resource-Allocation Graph Algorithm
• If resource categories have only single instances of their resources, then deadlock
states can be detected by cycles in the resource-allocation graphs.
• In this case, unsafe states can be recognized and avoided by augmenting the
resource-allocation graph with claim edges, noted by dashed lines, which point
from a process to a resource that it may request in the future.
• In order for this technique to work, all claim edges must be added to the graph for
any particular process before that process is allowed to request any resources.
( Alternatively, processes may only make requests for resources for which they
have already established claim edges, and claim edges cannot be added to any
process that is currently holding resources. )
• When a process makes a request, the claim edge Pi->Rj is converted to a request
edge. Similarly when a resource is released, the assignment reverts back to a claim
edge.
• This approach works by denying requests that would produce cycles in the
resource-allocation graph, taking claim edges into effect.
• Consider for example what happens when process P2 requests resource R2:
• Windows ( and some other systems ) use special file extensions to indicate
the type of each file:
• Macintosh stores a creator attribute for each file, according to the program that first
created it with the create( ) system call.
• UNIX stores magic numbers at the beginning of certain files. ( Experiment with the
"file" command, especially in directories such as /bin and /dev )
• Disk files are accessed in units of physical blocks, typically 512 bytes or some
power-of-two multiple thereof. ( Larger physical disks use larger block sizes, to
keep the range of block numbers within the range of a 32-bit integer. )
• Internally files are organized in units of logical units, which may be as small as a
single byte, or may be a larger size corresponding to some data record or structure
size.
• The number of logical units which fit into one physical block determines
its packing, and has an impact on the amount of internal fragmentation ( wasted
space ) that occurs.
• As a general rule, half a physical block is wasted for each file, and the larger the
block sizes the more space is lost to internal fragmentation.
• A sequential access file emulates magnetic tape operation, and generally supports
a few operations:
o read next - read a record and advance the tape to the next position.
o write next - write a record and advance the tape to the next position.
o rewind
o skip n records - May or may not be supported. N may be limited to positive
numbers, or may be limited to +/- 1.
• Jump to any record and read that record. Operations supported include:
o read n - read record number n. ( Note an argument is now required. )
o write n - write record number n. ( Note an argument is now required. )
o jump to record n - could be 0 or the end of file.
o Query current record - used to return back to this record later.
o Sequential access can be easily emulated using direct access. The inverse is
complicated and inefficient.
• An indexed access scheme can be easily built on top of a direct access system. Very
large files may require a multi-tiered indexing scheme, i.e. indexes of indexes.
• An obvious extension to the two-tiered directory structure, and the one with which
we are all most familiar.
• Each user / process has the concept of a current directory from which all ( relative )
searches take place.
• Files may be accessed using either absolute pathnames ( relative to the root of the
tree ) or relative pathnames ( relative to the current directory. )
• Directories are stored the same as any other file in the system, except there is a bit
that identifies them as directories, and they have some special structure that the OS
understands.
• One question for consideration is whether or not to allow the removal of directories
that are not empty - Windows requires that directories be emptied first, and UNIX
provides an option for deleting entire sub-trees.
Figure 5.8 - Tree-structured directory structure.
• When the same files need to be accessed in more than one place in the directory
structure ( e.g. because they are being shared by more than one user / process ), it
can be useful to provide an acyclic-graph structure. ( Note the directed arcs from
parent to child. )
• UNIX provides two types of links for implementing the acyclic-graph structure. (
See "man ln" for more details. )
o A hard link ( usually just called a link ) involves multiple directory entries
that both refer to the same file. Hard links are only valid for ordinary files
in the same filesystem.
o A symbolic link involves a special file, containing information about
where to find the linked file. Symbolic links may be used to link directories
and/or files in other filesystems, as well as ordinary files in the current
filesystem.
• Windows only supports symbolic links, termed shortcuts.
• Hard links require a reference count, or link count for each file, keeping track of
how many directory entries are currently referring to this file. Whenever one of the
references is removed the link count is reduced, and when it reaches zero, the disk
space can be reclaimed.
• For symbolic links there is some question as to what to do with the symbolic links
when the original file is moved or deleted:
o One option is to find all the symbolic links and adjust them also.
o Another is to leave the symbolic links dangling, and discover that they are
no longer valid the next time they are used.
o What if the original file is removed, and replaced with another file having
the same name before the symbolic link is next used?
Figure 5.9 - Acyclic-graph directory structure.
• If cycles are allowed in the graphs, then several problems can arise:
o Search algorithms can go into infinite loops. One solution is to not follow
links in search algorithms. ( Or not to follow symbolic links, and to only
allow symbolic links to refer to directories. )
o Sub-trees can become disconnected from the rest of the tree and still not
have their reference counts reduced to zero. Periodic garbage collection is
required to detect and resolve this problem. ( chkdsk in DOS and fsck in
UNIX search for these problems, among others, even though cycles are not
supposed to be allowed in either system. Disconnected disk blocks that are
not marked as free are added back to the file systems with made-up file
names, and can usually be safely deleted. )
Figure 5.10 - General graph directory.
5.4.1 Overview
• Physical disks are commonly divided into smaller units called partitions. They can
also be combined into larger units, but that is most commonly done for RAID
installations and is left for later chapters.
• Partitions can either be used as raw devices ( with no structure imposed upon them
), or they can be formatted to hold a filesystem ( i.e. populated with FCBs and initial
directory structures as appropriate. ) Raw partitions are generally used for swap
space, and may also be used for certain programs such as databases that choose to
manage their own disk storage system. Partitions containing filesystems can
generally only be accessed using the file system structure by ordinary users, but can
often be accessed as a raw device also by root.
• The boot block is accessed as part of a raw partition, by the boot program prior to
any operating system being loaded. Modern boot programs understand multiple
OSes and filesystem formats, and can give the user a choice of which of several
available systems to boot.
• The root partition contains the OS kernel and at least the key portions of the OS
needed to complete the boot process. At boot time the root partition is mounted,
and control is transferred from the boot program to the kernel found there. ( Older
systems required that the root partition lie completely within the first 1024 cylinders
of the disk, because that was as far as the boot program could reach. Once the kernel
had control, then it could access partitions beyond the 1024 cylinder boundary. )
• Continuing with the boot process, additional filesystems get mounted, adding their
information into the appropriate mount table structure. As a part of the mounting
process the file systems may be checked for errors or inconsistencies, either because
they are flagged as not having been closed properly the last time they were used, or
just for general principles. Filesystems may be mounted either automatically or
manually. In UNIX a mount point is indicated by setting a flag in the in-memory
copy of the inode, so all future references to that inode get re-directed to the root
directory of the mounted filesystem.
• In operation the disk rotates at high speed, such as 7200 rpm ( 120 revolutions per
second. ) The rate at which data can be transferred from the disk to the computer is
composed of several steps:
o The positioning time, a.k.a. the seek time or random access time is the time
required to move the heads from one cylinder to another, and for the heads
to settle down after the move. This is typically the slowest step in the
process and the predominant bottleneck to overall transfer rates.
o The rotational latency is the amount of time required for the desired sector
to rotate around and come under the read-write head.This can range
anywhere from zero to one full revolution, and on the average will equal
one-half revolution. This is another physical step and is usually the second
slowest step behind seek time. ( For a disk rotating at 7200 rpm, the average
rotational latency would be 1/2 revolution / 120 revolutions per second, or
just over 4 milliseconds, a long time by computer standards. )
o The transfer rate, which is the time required to move the data electronically
from the disk to the computer. ( Some authors may also use the term transfer
rate to refer to the overall transfer rate, including seek time and rotational
latency as well as the electronic data transfer rate. )
• Disk heads "fly" over the surface on a very thin cushion of air. If they should
accidentally contact the disk, then a head crash occurs, which may or may not
permanently damage the disk or even destroy it completely. For this reason it is
normal to park the disk heads when turning a computer off, which means to move
the heads off the disk or to an area of the disk where there is no data stored.
• Floppy disks are normally removable. Hard drives can also be removable, and some
are even hot-swappable, meaning they can be removed while the computer is
running, and a new hard drive inserted in their place.
• Disk drives are connected to the computer via a cable known as the I/O Bus. Some
of the common interface formats include Enhanced Integrated Drive Electronics,
EIDE; Advanced Technology Attachment, ATA; Serial ATA, SATA; Universal
Serial Bus, USB; Fiber Channel, FC; and Small Computer Systems Interface, SCSI.
• The host controller is at the computer end of the I/O bus, and the disk controller is
built into the disk itself. The CPU issues commands to the host controller via I/O
ports. Data is transferred between the magnetic surface and onboard cache by the
disk controller, and then the data is transferred from that cache to the host controller
and the motherboard memory at electronic speeds.
• As technologies improve and economics change, old technologies are often used in
different ways. One example of this is the increasing use of solid state disks, or
SSDs.
• SSDs use memory technology as a small fast hard disk. Specific implementations
may use either flash memory or DRAM chips protected by a battery to sustain the
information through power cycles.
• Because SSDs have no moving parts they are much faster than traditional hard
drives, and certain problems such as the scheduling of disk accesses simply do not
apply.
• However SSDs also have their weaknesses: They are more expensive than hard
drives, generally not as large, and may have shorter life spans.
• SSDs are especially useful as a high-speed cache of hard-disk information that must
be accessed quickly. One example is to store filesystem meta-data, e.g. directory
and inode information, that must be accessed quickly and often. Another variation
is a boot disk containing the OS and some application executables, but no vital user
data. SSDs are also used in laptops to make them smaller, faster, and lighter.
• Because SSDs are so much faster than traditional hard disks, the throughput of the
bus can become a limiting factor, causing some SSDs to be connected directly to
the system PCI bus for example.
• Magnetic tapes were once used for common secondary storage before the days of
hard disk drives, but today are used primarily for backups.
• Accessing a particular spot on a magnetic tape can be slow, but once reading or
writing commences, access speeds are comparable to disk drives.
• Capacities of tape drives can range from 20 to 200 GB, and compression can double
that capacity.
• The traditional head-sector-cylinder, HSC numbers are mapped to linear block addresses
by numbering the first sector on the first head on the outermost track as sector 0.
Numbering proceeds with the rest of the sectors on that same track, and then the rest of the
tracks on the same cylinder before proceeding through the rest of the cylinders to the center
of the disk. In modern practice these linear block addresses are used in place of the HSC
numbers for a variety of reasons:
1. The linear length of tracks near the outer edge of the disk is much longer than for
those tracks located near the center, and therefore it is possible to squeeze many
more sectors onto outer tracks than onto inner ones.
2. All disks have some bad sectors, and therefore disks maintain a few spare sectors
that can be used in place of the bad ones. The mapping of spare sectors to bad
sectors is managed internally by the disk controller.
3. Modern hard drives can have thousands of cylinders, and hundreds of sectors per
track on their outermost tracks. These numbers exceed the range of HSC numbers
for many ( older ) operating systems, and therefore disks can be configured for any
convenient combination of HSC values that falls within the total number of sectors
physically on the drive.
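The forward mapping follows directly from that numbering order. A minimal sketch, using made-up geometry figures ( 16 heads, 63 sectors per track ) and the convention that sectors are numbered from 1 on real drives:

#include <stdio.h>

/* Map a (cylinder, head, sector) triple to a linear block address,
 * numbering block 0 as the first sector of the first head on the
 * outermost cylinder.  Geometry constants are example values only. */
#define HEADS             16
#define SECTORS_PER_TRACK 63

static long hsc_to_lba(long c, long h, long s) {
    return (c * HEADS + h) * SECTORS_PER_TRACK + (s - 1);
}

int main(void) {
    printf("C=0 H=0 S=1 -> LBA %ld\n", hsc_to_lba(0, 0, 1));   /* 0    */
    printf("C=1 H=0 S=1 -> LBA %ld\n", hsc_to_lba(1, 0, 1));   /* 1008 */
    return 0;
}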
• There is a limit to how closely packed individual bits can be placed on a physical medium,
but that limit keeps increasing as technological advances are made.
• Modern disks pack many more sectors into outer cylinders than inner ones, using one of
two approaches:
o With Constant Linear Velocity, CLV, the density of bits is uniform from cylinder
to cylinder. Because there are more sectors in outer cylinders, the disk spins slower
when reading those cylinders, causing the rate of bits passing under the read-write
head to remain constant. This is the approach used by modern CDs and DVDs.
o With Constant Angular Velocity, CAV, the disk rotates at a constant angular speed,
with the bit density decreasing on outer cylinders. ( These disks would have a
constant number of sectors per track on all cylinders. )
Disk drives can be attached either directly to a particular host ( a local disk ) or to a network.
• First-Come First-Serve is simple and intrinsically fair, but not very efficient.
Consider in the following sequence the wild swing from cylinder 122 to 14 and then
back to 124:
Figure 5.17 - FCFS disk scheduling.
• Shortest Seek Time First scheduling is more efficient, but may lead to starvation
if a constant stream of requests arrives for the same general area of the disk.
• SSTF reduces the total head movement to 236 cylinders, down from 640 required
for the same set of requests under FCFS. Note, however that the distance could be
reduced still further to 208 by starting with 37 and then 14 first before processing
the rest of the requests.
Figure 5.18 - SSTF disk scheduling.
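The totals quoted above can be reproduced with the request queue these figures appear to use ( head at cylinder 53; pending requests 98, 183, 37, 122, 14, 124, 65, 67 ). The sketch below computes total head movement for FCFS and SSTF and prints 640 and 236 cylinders respectively.

#include <stdio.h>
#include <stdlib.h>

#define NREQ 8

static int total_fcfs(int head, const int *q) {
    int moves = 0;
    for (int i = 0; i < NREQ; i++) {     /* serve requests in arrival order */
        moves += abs(q[i] - head);
        head = q[i];
    }
    return moves;
}

static int total_sstf(int head, const int *q) {
    int pending[NREQ], done[NREQ] = { 0 }, moves = 0;
    for (int i = 0; i < NREQ; i++) pending[i] = q[i];

    for (int served = 0; served < NREQ; served++) {
        int best = -1;
        for (int i = 0; i < NREQ; i++)   /* pick the closest pending request */
            if (!done[i] &&
                (best < 0 || abs(pending[i] - head) < abs(pending[best] - head)))
                best = i;
        moves += abs(pending[best] - head);
        head = pending[best];
        done[best] = 1;
    }
    return moves;
}

int main(void) {
    int queue[NREQ] = { 98, 183, 37, 122, 14, 124, 65, 67 };
    int head = 53;
    printf("FCFS total head movement: %d cylinders\n", total_fcfs(head, queue)); /* 640 */
    printf("SSTF total head movement: %d cylinders\n", total_sstf(head, queue)); /* 236 */
    return 0;
}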
• The SCAN algorithm, a.k.a. the elevator algorithm moves back and forth from one
end of the disk to the other, similarly to an elevator processing requests in a tall
building.
Figure 5.19 - SCAN disk scheduling.
• Under the SCAN algorithm, if a request arrives just ahead of the moving head then
it will be processed right away, but if it arrives just after the head has passed, then
it will have to wait for the head to pass going the other way on the return trip. This
leads to a fairly wide variation in access times which can be improved upon.
• Consider, for example, when the head reaches the high end of the disk: Requests
with high cylinder numbers just missed the passing head, which means they are all
fairly recent requests, whereas requests with low numbers may have been waiting
for a much longer time. Making the return scan from high to low then ends up
accessing recent requests first and making older requests wait that much longer.
• LOOK scheduling improves upon SCAN by looking ahead at the queue of pending
requests, and not moving the heads any farther towards the end of the disk than is
necessary. The following diagram illustrates the circular form of LOOK:
Figure 5.21 - C-LOOK disk scheduling.
• With very low loads all algorithms are equal, since there will normally only be one
request to process at a time.
• For slightly larger loads, SSTF offers better performance than FCFS, but may lead
to starvation when loads become heavy enough.
• For busier systems, SCAN and LOOK algorithms eliminate starvation problems.
• The actual optimal algorithm may be something even more complex than those
discussed here, but the incremental improvements are generally not worth the
additional overhead.
• Some improvement to overall filesystem access times can be made by intelligent
placement of directory and/or inode information. If those structures are placed in
the middle of the disk instead of at the beginning of the disk, then the maximum
distance from those structures to data blocks is reduced to only one-half of the disk
size. If those structures can be further distributed and furthermore have their data
blocks stored as close as possible to the corresponding directory structures, then
that reduces still further the overall time to find the disk block numbers and then
access the corresponding data blocks.
• On modern disks the rotational latency can be almost as significant as the seek time,
however it is not within the OSes control to account for that, because modern disks
do not reveal their internal sector mapping schemes, ( particularly when bad blocks
have been remapped to spare sectors. )
o Some disk manufacturers provide for disk scheduling algorithms directly
on their disk controllers, ( which do know the actual geometry of the disk
as well as any remapping ), so that if a series of requests are sent from the
computer to the controller then those requests can be processed in an
optimal order.
o Unfortunately there are some considerations that the OS must take into
account that are beyond the abilities of the on-board disk-scheduling
algorithms, such as priorities of some requests over others, or the need to
process certain requests in a particular order. For this reason OSes may elect
to spoon-feed requests to the disk controller one at a time in certain
situations.
• Before a disk can be used, it has to be low-level formatted, which means laying
down all of the headers and trailers marking the beginning and ends of each sector.
Included in the header and trailer are the linear sector numbers, and error-
correcting codes, ECC, which allow damaged sectors to not only be detected, but
in many cases for the damaged data to be recovered ( depending on the extent of
the damage. ) Sector sizes are traditionally 512 bytes, but may be larger, particularly
in larger drives.
• ECC calculation is performed with every disk read or write, and if damage is
detected but the data is recoverable, then a soft error has occurred. Soft errors are
generally handled by the on-board disk controller, and never seen by the OS. ( See
below. )
• Once the disk is low-level formatted, the next step is to partition the drive into one
or more separate partitions. This step must be completed even if the disk is to be
used as a single large partition, so that the partition table can be written to the
beginning of the disk.
• After partitioning, then the filesystems must be logically formatted, which involves
laying down the master directory information ( FAT table or inode structure ),
initializing free lists, and creating at least the root directory of the filesystem. ( Disk
partitions which are to be used as raw devices are not logically formatted. This
saves the overhead and disk space of the filesystem structure, but requires that the
application program manage its own disk storage requirements. )
• No disk can be manufactured to 100% perfection, and all physical objects wear out
over time. For these reasons all disks are shipped with a few bad blocks, and
additional blocks can be expected to go bad slowly over time. If a large number of
blocks go bad then the entire disk will need to be replaced, but a few here and there
can be handled through other means.
• In the old days, bad blocks had to be checked for manually. Formatting of the disk
or running certain disk-analysis tools would identify bad blocks, and attempt to read
the data off of them one last time through repeated tries. Then the bad blocks would
be mapped out and taken out of future service. Sometimes the data could be
recovered, and sometimes it was lost forever. ( Disk analysis tools could be either
destructive or non-destructive. )
• Modern disk controllers make much better use of the error-correcting codes, so that
bad blocks can be detected earlier and the data usually recovered. ( Recall that
blocks are tested with every write as well as with every read, so often errors can be
detected before the write operation is complete, and the data simply written to a
different sector instead. )
• Note that re-mapping of sectors from their normal linear progression can throw off
the disk scheduling optimization of the OS, especially if the replacement sector is
physically far away from the sector it is replacing. For this reason most disks
normally keep a few spare sectors on each cylinder, as well as at least one spare
cylinder. Whenever possible a bad sector will be mapped to another sector on the
same cylinder, or at least a cylinder as close as possible. Sector slipping may also
be performed, in which all sectors between the bad sector and the replacement
sector are moved down by one, so that the linear progression of sector numbers can
be maintained.
• If the data on a bad block cannot be recovered, then a hard error has occurred,
which requires replacing the file(s) from backups, or rebuilding them from scratch.
• Modern systems typically swap out pages as needed, rather than swapping out entire
processes. Hence the swapping system is part of the virtual memory management system.
• Managing swap space is obviously an important task for modern OSes.
• The general idea behind RAID is to employ a group of hard drives together with some form
of duplication, either to increase reliability or to speed up operations, ( or sometimes both. )
• RAID originally stood for Redundant Array of Inexpensive Disks, and was designed to
use a bunch of cheap small disks in place of one or two larger more expensive ones. Today
RAID systems employ large possibly expensive disks as their components, switching the
definition to Independent disks.
• The more disks a system has, the greater the likelihood that one of them will go bad
at any given time. Hence increasing disks on a system actually decreases the Mean
Time To Failure, MTTF of the system.
• If, however, the same data was copied onto multiple disks, then the data would not
be lost unless both ( or all ) copies of the data were damaged simultaneously, which
is a MUCH lower probability than for a single disk going bad. More specifically,
the second disk would have to go bad before the first disk was repaired, which
brings the Mean Time To Repair into play. For example if two disks were
involved, each with a MTTF of 50,000 hours and a MTTR of 5 hours, then
the Mean Time to Data Loss would be 500 * 5^6 hours, or 57,000 years!
• This is the basic idea behind disk mirroring, in which a system contains identical
data on two or more disks.
o Note that a power failure during a write operation could cause both disks to
contain corrupt data, if both disks were writing simultaneously at the time
of the power failure. One solution is to write to the two disks in series, so
that they will not both become corrupted ( at least not in the same way ) by
a power failure. An alternate solution involves non-volatile RAM as a
write cache, which is not lost in the event of a power failure and which is
protected by error-correcting codes.
• There are also two RAID levels which combine RAID levels 0 and 1 ( striping and
mirroring ) in different combinations, designed to provide both performance and
reliability at the expense of increased cost.
o RAID level 0 + 1 disks are first striped, and then the striped disks mirrored
to another set. This level generally provides better performance than RAID
level 5.
o RAID level 1 + 0 mirrors disks in pairs, and then stripes the mirrored pairs.
The storage capacity, performance, etc. are all the same, but there is an
advantage to this approach in the event of multiple disk failures, as
illustrated below:
▪ In diagram (a) below, the 8 disks have been divided into two sets of
four, each of which is striped, and then one stripe set is used to
mirror the other set.
▪ If a single disk fails, it wipes out the entire stripe set, but the
system can keep on functioning using the remaining set.
▪ However if a second disk from the other stripe set now fails,
then the entire system is lost, as a result of two disk failures.
▪ In diagram (b), the same 8 disks are divided into four sets of two,
each of which is mirrored, and then the file system is striped across
the four sets of mirrored disks.
▪ If a single disk fails, then that mirror set is reduced to a single
disk, but the system rolls on, and the other three mirror sets
continue mirroring.
▪ Now if a second disk fails, ( that is not the mirror of the
already failed disk ), then another one of the mirror sets is
reduced to a single disk, but the system can continue without
data loss.
▪ In fact the second arrangement could handle as many as four
simultaneously failed disks, as long as no two of them were
from the same mirror pair.
• Trade-offs in selecting the optimal RAID level for a particular application include
cost, volume of data, need for reliability, need for performance, and rebuild time,
the latter of which can affect the likelihood that a second disk will fail while the
first failed disk is being rebuilt.
• Other decisions include how many disks are involved in a RAID set and how many
disks to protect with a single parity bit. More disks in the set increases performance
but increases cost. Protecting more disks per parity bit saves cost, but increases the
likelihood that a second disk will fail before the first bad disk is repaired.
5.11.5 Extensions
• RAID concepts have been extended to tape drives ( e.g. striping tapes for faster
backups or parity checking tapes for reliability ), and for broadcasting of data.
• RAID protects against physical errors, but not against any number of bugs or other
errors that could write erroneous data.
• ZFS adds an extra level of protection by including data block checksums in all
inodes along with the pointers to the data blocks. If data are mirrored and one copy
has the correct checksum and the other does not, then the data with the bad
checksum will be replaced with a copy of the data with the good checksum. This
increases reliability greatly over RAID alone, at a cost of a performance hit that is
acceptable because ZFS is so fast to begin with.
• Another problem with traditional filesystems is that the sizes are fixed, and
relatively difficult to change. Where RAID sets are involved it becomes even harder
to adjust filesystem sizes, because a filesystem cannot span across multiple
volumes.
• ZFS solves these problems by pooling RAID sets, and by dynamically allocating
space to filesystems as needed. Filesystem sizes can be limited by quotas, and space
can also be reserved to guarantee that a filesystem will be able to grow later, but
these parameters can be changed at any time by the filesystem's owner. Otherwise
filesystems grow and shrink dynamically as needed.
Figure 5.26 - (a) Traditional volumes and file systems. (b) a ZFS pool and file systems.