AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
IAF-AVADI, CHENNAI - 55
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIT I
An OS is a program that acts as an intermediary between a user of a computer and the computer hardware.
• A computer system can be divided into four components:
o Hardware – provides the basic computing resources ( CPU, memory, I/O devices )
o Operating system – controls and coordinates the use of the hardware among the applications and users
o Application programs – define the ways in which the system resources are used to solve the computing problems of the users
o Users
• A shared computer such as a mainframe or minicomputer must keep all users happy
• Users of dedicated systems such as workstations have dedicated resources but frequently use shared resources from servers
• Handheld computers are resource poor, optimized for usability and battery life
• Some computers have little or no user interface, such as embedded computers in devices
and automobiles
• OS is a resource allocator
o Decides between conflicting requests for efficient and fair resource use
• OS is a control program – it controls the execution of programs to prevent errors and improper use of the computer
• "The one program running at all times on the computer" is the kernel; everything else is either a system program ( shipped with the operating system ) or an application program.
COMPUTER STARTUP
• One or more CPUs, device controllers connect through common bus providing access to
shared memory
• Device controller informs CPU that it has finished its operation by causing an
interrupt
• Interrupt transfers control to the interrupt service routine generally, through the
interrupt vector, which contains the addresses of all the service routines
INTERRUPT HANDLING
• The OS preserves the state of the CPU by storing registers and the program counter
o The hardware determines which type of interrupt has occurred, either by polling or via a vectored interrupt system
▪ With polling, the interrupt handler polls ( sends a signal out to ) each device to determine which one made the request
• Separate segments of code determine what action should be taken for each type of
interrupt
INTERRUPT TIMELINE
I/O STRUCTURE
o After I/O starts, control returns to user program without waiting for I/O
completion
o System call – request to the OS to allow user to wait for I/O completion (polling
periodically to check busy/done)
o Device-status table contains entry for each I/O device indicating its type,
address, and state
STORAGE HIERARCHY
The basic unit of computer storage is the bit. A bit can contain one of two values, 0 and 1. All
other storage in a computer is based on collections of bits. Given enough bits, it is amazing how
many things a computer can represent: numbers, letters, images, movies, sounds, documents, and
programs, to name a few. A byte is 8 bits, and on most computers it is the smallest convenient
chunk of storage. For example, most computers don’t have an instruction to move a bit but do
have one to move a byte. A less common term is word, which is a given computer architecture’s
native unit of data. A word is made up of one or more bytes. For example, a computer that has
64-bit registers and 64-bit memory addressing typically has 64-bit (8-byte) words. A computer
executes many operations in its native word size rather than a byte at a time.
Computer storage, along with most computer throughput, is generally measured and manipulated
in bytes and collections of bytes.
STORAGE STRUCTURE
• Main memory – only large storage media that the CPU can access directly
o Random access
o Typically volatile
• Secondary storage – extension of main memory that provides large nonvolatile storage
capacity
• Hard disks – rigid metal or glass platters covered with magnetic recording material
o Disk surface is logically divided into tracks, which are subdivided into sectors
o The disk controller determines the logical interaction between the device and the
computer
o Storage devices vary in their underlying technology, speed, and volatility
• Caching – copying information in use into a faster storage system – is an important principle, performed at many levels:
o in hardware,
o in the operating system,
o and in software
o It improves efficiency because recently used data is likely to be needed again
DIRECT MEMORY ACCESS STRUCTURE
• Typically used for I/O devices that generate data in blocks, or generate data fast
• The device controller transfers blocks of data from buffer storage directly to main memory without CPU intervention
• Only one interrupt is generated per block, rather than one interrupt per byte
TYPES OF SYSTEMS
• Multiprocessor systems offer two main advantages:
1. Increased throughput
2. Economy of scale
o Two types: asymmetric and symmetric multiprocessing
• Multicore systems place multiple computing cores on a single chip
• Clustered systems link multiple machines together; symmetric clustering has multiple nodes running applications, monitoring each other
DUAL-MODE OPERATION
• Dual-mode operation ( user mode and kernel mode ) provides the ability to distinguish when the system is running user code or kernel code
• A system call changes the mode to kernel; return from the call resets it to user
• A timer allows the operating system to regain control, or to terminate a program that exceeds its allotted time
PROCESS MANAGEMENT
o A process needs resources to accomplish its task: CPU time, memory, I/O, files, and initialization data
• Typically system has many processes, some user, some operating system running
concurrently on one or more CPUs
ACTIVITIES
• The operating system is responsible for:
o Creating and deleting both user and system processes
o Suspending and resuming processes
o Providing mechanisms for process synchronization
o Providing mechanisms for process communication
o Providing mechanisms for deadlock handling
MEMORY MANAGEMENT
• All (or part) of the data that is needed by the program must be in memory.
o Keeping track of which parts of memory are currently being used and by whom
o Deciding which processes (or parts thereof) and data to move into and out of
memory
STORAGE MANAGEMENT
▪ File-System management
o Files usually organized into directories
o OS activities include creating and deleting files and directories, manipulating files and directories, mapping files onto secondary storage, and backing files up onto stable ( non-volatile ) storage media
▪ Mass-storage management
o The entire speed of computer operation hinges on the disk subsystem and its algorithms
o OS activities:
• Free-space management
• Storage allocation
• Disk scheduling
• Multitasking environments must be careful to use most recent value, no matter where it is
stored in the storage hierarchy
• Multiprocessor environment must provide cache coherency in hardware such that all
CPUs have the most recent value in their cache
I/O SUBSYSTEM
• One purpose of the OS is to hide the peculiarities of hardware devices from the user
PROTECTION AND SECURITY
• Systems generally first distinguish among users, to determine who can do what
COMPUTING ENVIRONMENTS
TRADITIONAL
• But blurred as most systems interconnect with others (i.e., the Internet)
• Networking becoming ubiquitous – even home systems use firewalls to protect home
computers from Internet attacks
MOBILE
• Handheld smartphones and tablets; like other handheld systems they are optimized for usability and battery life
DISTRIBUTED AND CLIENT-SERVER
• Distributed computing – a collection of separate, possibly heterogeneous, systems networked together
• Client-Server computing – server systems respond to requests generated by client systems
o A file-server system provides an interface for clients to store and retrieve files
PEER-TO-PEER
• Another distributed model that does not distinguish clients and servers – all nodes are considered peers
o Nodes broadcast requests for service and respond to requests for service via a discovery protocol
o Examples include Napster and Gnutella, and Voice over IP ( VoIP ) such as Skype
Virtualization
• Example: Parallels for OS X, running Windows and/or Linux guests and their applications
• Example: VMware ESX, installed directly on hardware; it runs when the hardware boots, provides services to applications, and runs guest operating systems
• Use cases include running multiple operating systems on one machine for development, testing, and quality assurance
OPEN-SOURCE OPERATING SYSTEMS
• Operating systems made available in source-code format rather than just as closed-source binaries
• Counter to the copy protection and Digital Rights Management ( DRM ) movement
• Started by the Free Software Foundation ( FSF ), which has the "copyleft" GNU General Public License ( GPL )
• Examples include GNU/Linux and BSD UNIX ( including the core of Mac OS X ), and many more
• Can be studied using a VMM such as VMware Player ( free on Windows ) or VirtualBox ( open source and free on many platforms - http://www.virtualbox.com )
• System Calls
• System Programs
• Operating System Design and Implementation
• Operating System Structure
• System Boot
• User services:
o User interface
o Program execution - Loading a program into memory and running it, end
execution, either normally or abnormally (indicating error)
o I/O operations - A running program may require I/O, which may involve a file
or an I/O device
o Error detection – the OS needs to be constantly aware of possible errors
▪ Errors may occur in the CPU and memory hardware, in I/O devices, or in the user program
▪ For each type of error, the OS should take the appropriate action to ensure correct and consistent computing
System services:
o For ensuring the efficient operation of the system itself via resource sharing
o Accounting - To keep track of which users use how much and what kinds of
computer resources
• CLI or command interpreter allows direct command entry
• Typically, a number associated with each system call
• The system call interface invokes the intended system call in OS kernel and returns status
of the system call and any return values
• The caller need know nothing about how the system call is implemented
o It just needs to obey the API and understand what the OS will do as a result of the call
o Three general methods are used to pass parameters to the OS:
▪ Simplest: pass the parameters in registers
▪ Parameters stored in a block, or table, in memory, with the address of the block passed as a parameter in a register
▪ Parameters placed, or pushed, onto the stack by the program and popped off the stack by the operating system
o The block and stack methods do not limit the number or length of parameters being passed ( a small sketch of invoking a system call follows below )
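( As a small illustration of the API / system-call relationship: the sketch below writes to the screen once through the normal write( ) API and once by system-call number via the generic syscall( ) wrapper. This is a Linux-specific sketch; the wrapper and the SYS_write number are platform assumptions. )

/* syscall_demo.c - invoking a system call through the API and by number ( Linux sketch ) */
#define _GNU_SOURCE
#include <unistd.h>        /* write( ), syscall( )  */
#include <sys/syscall.h>   /* SYS_write constant    */
#include <string.h>

int main( void ) {
    const char *msg = "hello via the write( ) API\n";
    /* Normal route: the library wrapper packages the parameters and traps into the kernel. */
    write( STDOUT_FILENO, msg, strlen( msg ) );

    const char *raw = "hello via syscall( SYS_write, ... )\n";
    /* Same system call invoked by its number; the caller never sees how the kernel
       implements it, only the interface. */
    syscall( SYS_write, STDOUT_FILENO, raw, strlen( raw ) );
    return 0;
}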
TYPES OF SYSTEM CALLS
• Process control
o end, abort
o load, execute
• File management
• Device management
o request device, release device
o read, write, reposition
• Information maintenance
• Communications
• Protection
SYSTEM PROGRAMS
• Most users' view of the operating system is defined by system programs, not the actual system calls
o File manipulation
o Communications
o Background services
o Application programs
o Some of them are simply user interfaces to system calls; others are considerably
more complex
o File management - Create, delete, copy, rename, print, dump, list, and generally
manipulate files and directories
• Status information
o Some ask the system for info - date, time, amount of available memory, disk
space, number of users
o Typically, these programs format and print the output to the terminal or other
output devices
• File modification
• Program loading and execution – absolute loaders, relocatable loaders, linkage editors, and overlay loaders; debugging systems for higher-level and machine language
• Communications
o Allow users to send messages to one another's screens, browse web pages, send electronic-mail messages, log in remotely, and transfer files from one machine to another
• Background Services
o Provide facilities like disk checking, process scheduling, error logging, printing
• Application programs
o Run by users
• Design and implementation of an OS is not "solvable", but some approaches have proven successful
o User goals – the operating system should be convenient to use, easy to learn, reliable, safe, and fast
o System goals – the operating system should be easy to design, implement, and maintain, as well as flexible, reliable, error-free, and efficient
• Implementation has seen much variation over time
o Early operating systems were written in assembly language; now mostly C and C++
o Usually a mix of languages: lowest levels in assembly, main body in C, and systems programs in C, C++, and scripting languages like PERL, Python, and shell scripts
o More high-level language code is easier to port to other hardware, but slower
o Each module is responsible for one ( or several ) aspects of the desired functionality
o Advantages:
o Disadvantages:
• Layered OSes
• Microkernel OSes
SIMPLE STRUCTURE
• MS-DOS was created to provide the most functionality in the least space
o But its interfaces and levels of functionality are not well separated
o Systems programs
o Kernel
• UNIX kernel
o The kernel consists of everything below the system-call interface and above the physical hardware
o It provides the file system, CPU scheduling, memory management, and other operating-system functions – a large number of functions for one level, making it rather monolithic
• Layers are selected such that each uses functions (operations) and services of only lower-
level layers
• Advantages:
• Disadvantages:
• Failure of an application can generate core dump file capturing memory of the process
• Operating system failure can generate crash dump file containing kernel memory
• Kernighan's Law: "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."
PERFORMANCE TUNING
OS SYSGEN
• SYSGEN obtains information about the specific configuration of the hardware system: CPU, memory size, attached devices, etc.
SYSTEM BOOT
• How is the OS loaded? When the system is powered up, execution starts at a fixed memory location
• The bootstrap loader, stored in ROM or EEPROM firmware, locates the kernel, loads it into memory, and starts it
UNIT II
• Process memory is divided into four sections as shown in Figure 3.1 below:
o The text section comprises the compiled program code, read in from
non-volatile storage when the program is launched.
o The data section stores global and static variables, allocated and
initialized prior to executing main.
o The heap is used for dynamic memory allocation, and is managed
via calls to new, delete, malloc, free, etc.
o The stack is used for local variables. Space on the stack is reserved
for local variables when they are declared ( at function entrance or
elsewhere, depending on the language ), and the space is freed up
when the variables go out of scope. Note that the stack is also used
for function return values, and the exact mechanisms of stack
management may be language specific.
o Note that the stack and the heap start at opposite ends of the
process's free space and grow towards each other. If they should
ever meet, then either a stack overflow error will occur, or else a call
to new or malloc will fail due to insufficient memory available.
• When processes are swapped out of memory and later restored, additional
information must also be stored and restored. Key among them are the
program counter and the value of all program registers.
Figure 3.1 - A process in memory
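( A minimal C sketch showing which of the four sections each kind of variable lives in; the names are illustrative only. )

#include <stdlib.h>

int global_counter = 42;          /* data section: global, initialized before main      */
static int call_count;            /* data ( BSS ) section: static, zero-initialized     */

int square( int x ) {             /* the function's machine code sits in the text section */
    int result = x * x;           /* stack: local variable, freed when square( ) returns  */
    call_count++;
    return result;
}

int main( void ) {
    int *table = malloc( 10 * sizeof( int ) );   /* heap: dynamic allocation */
    if ( table == NULL )
        return 1;
    table[ 0 ] = square( global_counter );
    free( table );                               /* heap space returned      */
    return 0;
}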
Figure 3.2 - Diagram of process state
For each process there is a Process Control Block ( PCB ), which stores the following ( types of ) process-specific information. ( Specific details may vary from system to system. )
• Process state – running, waiting, ready, etc.
• Process ID ( and parent process ID )
• CPU registers and program counter – saved and restored when swapping the process in and out of the CPU
• CPU-scheduling information – such as priority and pointers to scheduling queues
• Memory-management information – e.g. page or segment tables
• Accounting information – CPU time used, account numbers, limits, etc.
• I/O status information – devices allocated, list of open files, etc.
Figure 3.4 - Diagram showing CPU switch from process to process
• The two main objectives of the process scheduling system are to keep the CPU
busy at all times and to deliver "acceptable" response times for all programs,
particularly for interactive ones.
• The process scheduler must meet these objectives by implementing suitable
policies for swapping processes in and out of the CPU.
• ( Note that these objectives can be conflicting. In particular, every time the
system steps in to swap processes it takes up time on the CPU to do so, which is
thereby "lost" from doing any useful productive work. )
Figure 3.5 - The ready queue and various I/O device queues
3.2.2 Schedulers
• The long-term scheduler ( job scheduler ) selects which processes should be brought into the ready queue.
• The short-term scheduler ( CPU scheduler ) selects which process should be executed next and allocates the CPU to it.
• A medium-term scheduler may swap processes out of memory and back in later to reduce the degree of multiprogramming.
Figure 3.6 - Queueing-diagram representation of process scheduling
3.3 Operations on Processes
• Processes may create other processes through appropriate system calls, such
as fork or spawn. The process which does the creating is termed the parent of
the other process, which is termed its child.
• Each process is given an integer identifier, termed its process identifier, or PID.
The parent PID ( PPID ) is also stored for each process.
• On typical UNIX systems the process scheduler is termed sched, and is given PID
0. The first thing it does at system startup time is to launch init, which gives that
process PID 1. Init then launches all system daemons and user logins, and
becomes the ultimate parent of all other processes. Figure 3.9 shows a typical
process tree for a Linux system, and other systems will have similar though not
identical trees:
Figure 3.9 Creating a separate process using the UNIX fork( ) system call.
Figure 3.10 - Process creation using the fork( ) system call
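( A minimal sketch of the fork( ) / exec( ) / wait( ) pattern these figures refer to; error handling is abbreviated. )

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main( void ) {
    pid_t pid = fork( );               /* duplicate the calling process           */
    if ( pid < 0 ) {                   /* fork failed                             */
        perror( "fork" );
        return 1;
    }
    else if ( pid == 0 ) {             /* child: replace its image with a program */
        execlp( "/bin/ls", "ls", NULL );
        perror( "execlp" );            /* reached only if the exec fails          */
        exit( 1 );
    }
    else {                             /* parent: wait for the child to finish    */
        wait( NULL );
        printf( "Child %d complete\n", (int) pid );
    }
    return 0;
}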
• Figure 3.11 shows the more complicated process for Windows, which must provide all of the parameter information for the new process as part of the process-creation call.
Figure 3.11 - Creating a separate process using the Windows API
• Processes may request their own termination by making the exit( ) system call, typically returning an int. This int is passed along to the parent if it is doing a wait( ), and is typically zero on successful completion and some non-zero code in the event of problems.
o child code:

int exitCode;
exit( exitCode );    /* return exitCode; has the same effect when executed from main( ) */

o parent code:

pid_t pid;
int status;
pid = wait( &status );
/* pid indicates which child exited; the exit code is in the low-order bits of status */
/* macros can test the high-order bits of status for why it stopped */
• Cooperating processes require some type of inter-process communication, which
is most commonly one of two types: Shared Memory systems or Message
Passing systems. Figure 3.12 illustrates the difference between the two systems:
Figure 3.12 - Communications models: (a) Message passing. (b) Shared memory.
• Shared Memory is faster once it is set up, because no system calls are required
and access occurs at normal memory speeds. However it is more complicated to
set up, and doesn't work as well across multiple computers. Shared memory is
generally preferable when large amounts of information must be shared quickly
on the same computer.
• Message Passing requires system calls for every message transfer, and is
therefore slower, but it is simpler to set up and works well across multiple
computers. Message passing is generally preferable when the amount and/or
frequency of data transfers is small, or when multiple computers are involved.
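( Mailboxes and message formats are OS-specific; as a minimal, concrete illustration of message passing through the kernel, the sketch below sends one message from parent to child over a UNIX pipe, so each transfer is a system call. )

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main( void ) {
    int fd[ 2 ];                       /* fd[ 0 ] = read end, fd[ 1 ] = write end */
    if ( pipe( fd ) == -1 )
        return 1;

    if ( fork( ) == 0 ) {              /* child acts as the receiver              */
        char buf[ 64 ];
        close( fd[ 1 ] );
        ssize_t n = read( fd[ 0 ], buf, sizeof( buf ) - 1 );   /* "receive message" */
        if ( n > 0 ) {
            buf[ n ] = '\0';
            printf( "child received: %s\n", buf );
        }
        close( fd[ 0 ] );
        return 0;
    }

    const char *msg = "hello from the parent";                  /* "send message"  */
    close( fd[ 0 ] );
    write( fd[ 1 ], msg, strlen( msg ) );
    close( fd[ 1 ] );
    wait( NULL );
    return 0;
}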
• This is a classic example, in which one process is producing data and another process is consuming the data. ( In this example the data is consumed in the order in which it is produced, although that could vary. )
• The data is passed via an intermediary buffer, which may be either unbounded
or bounded. With a bounded buffer the producer may have to wait until there is
space available in the buffer, but with an unbounded buffer the producer will
never need to wait. The consumer may need to wait in either case until there is
data available.
• This example uses shared memory and a circular queue. Note in the code below
that only the producer changes "in", and only the consumer changes "out", and
that they can never be accessing the same array location at the same time.
• First the following data is set up in the shared memory area:
#define BUFFER_SIZE 10
typedef struct {
. . .
} item;
item buffer[ BUFFER_SIZE ];
int in = 0;
int out = 0;
• Then the producer process. Note that the buffer is full when "in" is one less than
"out" in a circular sense:
item nextProduced;

while( true ) {
    /* Produce an item and store it in nextProduced */
    nextProduced = makeNewItem( . . . );
    /* Wait while the buffer is full ( "in" is one less than "out", circularly ) */
    while( ( ( in + 1 ) % BUFFER_SIZE ) == out )
        ; /* Do nothing */
    buffer[ in ] = nextProduced;
    in = ( in + 1 ) % BUFFER_SIZE;
}
• Then the consumer process. Note that the buffer is empty when "in" is equal to
"out":
• Message passing systems must support at a minimum system calls for "send
message" and "receive message".
• A communication link must be established between the cooperating processes
before messages can be sent.
• There are three key issues to be resolved in message passing systems as further
explored in the next three subsections:
o Direct or indirect communication ( naming )
o Synchronous or asynchronous communication
o Automatic or explicit buffering.
3.4.2.1 Naming
• With direct communication the sender must know the name of the receiver to
which it wishes to send a message.
o There is a one-to-one link between every sender-receiver pair.
o For symmetric communication, the receiver must also know the specific
name of the sender from which it wishes to receive messages.
For asymmetric communications, this is not necessary.
• Indirect communication uses shared mailboxes, or ports.
o Multiple processes can share the same mailbox or boxes.
o Only one process can read any given message in a mailbox. Initially the
process that creates the mailbox is the owner, and is the only one allowed
to read mail in the mailbox, although this privilege may be transferred.
▪ ( Of course the process that reads the message can immediately
turn around and place an identical message back in the box for
someone else to read, but that may put it at the back end of a
queue of messages. )
o The OS must provide system calls to create and delete mailboxes, and to
send and receive messages to/from mailboxes.
3.4.2.2 Synchronization
• Either the sending or the receiving of messages may be blocking ( synchronous ) or non-blocking ( asynchronous ), giving four possible combinations of send( ) and receive( ).
3.4.2.3 Buffering
• Messages are passed via queues, which may have one of three capacity
configurations:
1. Zero capacity - Messages cannot be stored in the queue, so senders must
block until receivers accept the messages.
2. Bounded capacity- There is a certain pre-determined finite capacity in the
queue. Senders must block if the queue is full, until space becomes
available in the queue, but may be either blocking or non-blocking
otherwise.
3. Unbounded capacity - The queue has a theoretical infinite capacity, so
senders are never forced to block.
CPU Scheduling
• Almost all programs have some alternating cycle of CPU number crunching and
waiting for I/O of some kind. ( Even a simple fetch from memory takes a long
time relative to CPU speeds. )
• In a simple system running a single process, the time spent waiting for I/O is
wasted, and those CPU cycles are lost forever.
• A scheduling system allows one process to use the CPU while another is waiting
for I/O, thereby making full use of otherwise lost CPU cycles.
• The challenge is to make the overall system as "efficient" and "fair" as possible,
subject to varying and often dynamic conditions, and where "efficient" and "fair"
are somewhat subjective terms, often subject to shifting priority policies.
• CPU bursts vary from process to process, and from program to program,
but an extensive study shows frequency patterns similar to that shown in
Figure 6.2:
Figure 6.2 - Histogram of CPU-burst durations.
• Whenever the CPU becomes idle, it is the job of the CPU Scheduler ( a.k.a.
the short-term scheduler ) to select another process from the ready queue to
run next.
• The storage structure for the ready queue and the algorithm used to select
the next process are not necessarily a FIFO queue. There are several
alternatives to choose from, as well as numerous adjustable parameters for
each algorithm, which is the basic subject of this entire chapter.
6.1.4 Dispatcher
• The dispatcher is the module that gives control of the CPU to the process
selected by the scheduler. This function involves:
o Switching context.
o Switching to user mode.
o Jumping to the proper location in the newly loaded program.
• The dispatcher needs to be as fast as possible, as it is run on every context
switch. The time consumed by the dispatcher is known as dispatch
latency.
• There are several different criteria to consider when trying to select the "best"
scheduling algorithm for a particular situation and environment, including:
o CPU utilization - Ideally the CPU would be busy 100% of the time, so as
to waste 0 CPU cycles. On a real system CPU usage should range from
40% ( lightly loaded ) to 90% ( heavily loaded. )
o Throughput - Number of processes completed per unit time. May range
from 10 / second to 1 / hour depending on the specific processes.
o Turnaround time - Time required for a particular process to complete,
from submission time to completion. ( Wall clock time. )
o Waiting time - How much time processes spend in the ready queue
waiting their turn to get on the CPU.
▪ ( Load average - The average number of processes sitting in the
ready queue waiting their turn to get into the CPU. Reported in 1-
minute, 5-minute, and 15-minute averages by "uptime" and "who". )
o Response time - The time taken in an interactive program from the
issuance of a command to the commence of a response to that command.
• In general one wants to optimize the average value of a criterion ( maximize CPU utilization and throughput, and minimize all the others ). However, sometimes one wants to do something different, such as to minimize the maximum response time.
• Sometimes it is more desirable to minimize the variance of a criterion than its actual value, i.e. users are more accepting of a consistent, predictable system than an inconsistent one, even if it is a little bit slower.
The following subsections will explain several common scheduling strategies, looking at
only a single CPU burst each for a small number of processes. Obviously real systems
have to deal with a lot more simultaneous processes executing their CPU-I/O burst
cycles.
• FCFS is very simple - Just a FIFO queue, like customers waiting in line at
the bank or the post office or at a copying machine.
• Unfortunately, however, FCFS can yield some very long average wait
times, particularly if the first process to get there takes a long time. For
example, consider three processes P1, P2, and P3 with CPU burst times of 24, 3, and 3 milliseconds respectively:
• If process P1 arrives and runs first ( P1: 0-24, P2: 24-27, P3: 27-30 ), the average waiting time for the three processes is ( 0 + 24 + 27 ) / 3 = 17.0 ms.
• If instead the same three processes run in the order P2, P3, P1 ( P2: 0-3, P3: 3-6, P1: 6-30 ), the average wait time is ( 0 + 3 + 6 ) / 3 = 3.0 ms. The total run time for the three bursts is the same, but in the second case two of the three finish much quicker, and the other process is only delayed by a short amount.
• FCFS can also block the system in a busy dynamic system in another way,
known as the convoy effect. When one CPU intensive process blocks the
CPU, a number of I/O intensive processes can get backed up behind it,
leaving the I/O devices idle. When the CPU hog finally relinquishes the
CPU, then the I/O processes pass through the CPU quickly, leaving the
CPU idle while everyone queues up for I/O, and then the cycle repeats
itself when the CPU intensive process gets back to the ready queue.
• The idea behind the SJF algorithm is to pick the quickest fastest little job
that needs to be done, get it out of the way first, and then pick the next
smallest fastest job to do next.
• ( Technically this algorithm picks a process based on the next shortest CPU
burst, not the overall process time. )
• For example, the Gantt chart below is based upon the following CPU burst
times, ( and the assumption that all jobs arrive at the same time. )
• Priorities can be assigned either internally or externally. Internal priorities
are assigned by the OS using criteria such as average burst time, ratio of
CPU to I/O activity, system resource use, and other factors available to the
kernel. External priorities are assigned by users, based on the importance of
the job, fees paid, politics, etc.
• Priority scheduling can be either preemptive or non-preemptive.
• Priority scheduling can suffer from a major problem known as indefinite
blocking, or starvation, in which a low-priority task can wait forever
because there are always some other jobs around that have higher priority.
o If this problem is allowed to occur, then processes will either run
eventually when the system load lightens ( at say 2:00 a.m. ), or will
eventually get lost when the system is shut down or crashes. ( There
are rumors of jobs that have been stuck for years. )
o One common solution to this problem is aging, in which priorities of
jobs increase the longer they wait. Under this scheme a low-priority
job will eventually get its priority raised high enough that it gets run.
• The performance of RR is sensitive to the time quantum selected. If the quantum is large enough, then RR reduces to the FCFS algorithm; if it is very small, then each of the n processes effectively gets 1/nth of the processor time and they share the CPU equally.
• BUT, a real system invokes overhead for every context switch, and the
smaller the time quantum the more context switches there are. ( See Figure
6.4 below. ) Most modern systems use time quantum between 10 and 100
milliseconds, and context switch times on the order of 10 microseconds, so
the overhead is small relative to the time quantum.
Figure 6.4 - The way in which a smaller time quantum increases context switches.
• Turnaround time also varies with the time quantum, in a non-apparent manner. Consider, for example, the processes shown in Figure 6.5:
Figure 6.5 - The way in which turnaround time varies with the time quantum.
Figure 6.6 - Multilevel queue scheduling
Figure 6.7 - Multilevel feedback queues.
• There are two types of threads to be managed in a modern system: User threads and
kernel threads.
• User threads are supported above the kernel, without kernel support. These are the
threads that application programmers would put into their programs.
• Kernel threads are supported within the kernel of the OS itself. All modern OSes
support kernel level threads, allowing the kernel to perform multiple simultaneous
tasks and/or to service multiple kernel system calls simultaneously.
• In a specific implementation, the user threads must be mapped to kernel threads,
using one of the following strategies.
• In the many-to-one model, many user-level threads are all mapped onto a
single kernel thread.
• Thread management is handled by the thread library in user space, which is
very efficient.
• However, if a blocking system call is made, then the entire process blocks,
even if the other user threads would otherwise be able to continue.
• Because a single kernel thread can operate only on a single CPU, the many-
to-one model does not allow individual processes to be split across multiple
CPUs.
• Green threads for Solaris and GNU Portable Threads implemented the many-to-one model in the past, but few systems continue to do so today.
Figure 4.5 - Many-to-one model
• The one-to-one model creates a separate kernel thread to handle each user
thread.
• One-to-one model overcomes the problems listed above involving blocking
system calls and the splitting of processes across multiple CPUs.
• However, the one-to-one model involves more overhead, since every user thread requires a corresponding kernel thread, which can slow the system down.
• Most implementations of this model place a limit on how many threads can
be created.
• Linux and Windows from 95 to XP implement the one-to-one model for
threads.
• In the many-to-many model, many user-level threads are multiplexed onto a smaller or equal number of kernel threads. Individual processes may be allocated variable numbers of kernel threads, depending on the number of CPUs present and other factors.
• Q: If one thread forks, is the entire process copied, or is the new process
single-threaded?
• A: System dependent.
• A: If the new process execs right away, there is no need to copy all the other
threads. If it doesn't, then the entire process should be copied.
• A: Many versions of UNIX provide multiple versions of the fork call for this
purpose.
• Threads that are no longer needed may be cancelled by another thread in one
of two ways:
1. Asynchronous Cancellation cancels the thread immediately.
2. Deferred Cancellation sets a flag indicating the thread should cancel
itself when it is convenient. It is then up to the cancelled thread to
check this flag periodically and exit nicely when it sees the flag set.
• ( Shared ) resource allocation and inter-thread data transfers can be
problematic with asynchronous cancellation.
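( A minimal Pthreads sketch of deferred cancellation, the safer of the two approaches; pthread_testcancel( ) marks a point at which the thread agrees to be cancelled. )

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

void *worker( void *arg ) {
    int old;
    /* Deferred cancellation is the Pthreads default; make it explicit. */
    pthread_setcanceltype( PTHREAD_CANCEL_DEFERRED, &old );
    while ( 1 ) {
        /* ... do one unit of work ... */
        pthread_testcancel( );   /* safe cancellation point                     */
        sleep( 1 );              /* ( sleep( ) is itself a cancellation point ) */
    }
    return NULL;
}

int main( void ) {
    pthread_t tid;
    pthread_create( &tid, NULL, worker, NULL );
    sleep( 3 );
    pthread_cancel( tid );       /* request cancellation                        */
    pthread_join( tid, NULL );   /* wait for the thread to actually exit        */
    printf( "worker cancelled\n" );
    return 0;
}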
• Most data is shared among threads, and this is one of the major benefits of
using threads in the first place.
• However sometimes threads need thread-specific data also.
• Most major thread libraries ( pThreads, Win32, Java ) provide support for
thread-specific data, known as thread-local storage or TLS. Note that this
is more like static data than local variables, because it does not cease to exist
when the function ends.
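( A minimal sketch of thread-local storage using the GCC/Clang __thread storage-class keyword; pthread_key_create( ) is the portable POSIX alternative. )

#include <pthread.h>
#include <stdio.h>

__thread int tls_counter = 0;        /* each thread gets its own copy */

void *worker( void *arg ) {
    for ( int i = 0; i < 3; i++ )
        tls_counter++;               /* updates this thread's copy only */
    printf( "thread %ld: tls_counter = %d\n", (long) arg, tls_counter );
    return NULL;
}

int main( void ) {
    pthread_t t1, t2;
    pthread_create( &t1, NULL, worker, (void *) 1L );
    pthread_create( &t2, NULL, worker, (void *) 2L );
    pthread_join( t1, NULL );
    pthread_join( t2, NULL );
    printf( "main: tls_counter = %d\n", tls_counter );   /* still 0 in main */
    return 0;
}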
• Many implementations of threads provide a virtual processor as an interface
between the user thread and the kernel thread, particularly for the many-to-
many or two-tier models.
• This virtual processor is known as a "Lightweight Process", LWP.
o There is a one-to-one correspondence between LWPs and kernel
threads.
o The number of kernel threads available, ( and hence the number of
LWPs ) may change dynamically.
o The application ( user level thread library ) maps user threads onto
available LWPs.
o kernel threads are scheduled onto the real processor(s) by the OS.
o The kernel communicates to the user-level thread library when certain
events occur ( such as a thread about to block ) via an upcall, which
is handled in the thread library by an upcall handler. The upcall also
provides a new LWP for the upcall handler to run on, which it can
then use to reschedule the user thread that is about to become blocked.
The OS will also issue upcalls when a thread becomes unblocked, so
the thread library can make appropriate adjustments.
• If the kernel thread blocks, then the LWP blocks, which blocks the user
thread.
• Ideally there should be at least as many LWPs available as there could be
concurrently blocked kernel threads. Otherwise if all LWPs are blocked, then
user threads will have to wait for one to become available.
• A solution to the critical section problem must satisfy the following three
conditions:
1. Mutual Exclusion - Only one process at a time can be executing in their
critical section.
2. Progress - If no process is currently executing in their critical section, and
one or more processes want to execute their critical section, then only the
processes not in their remainder sections can participate in the decision, and
the decision cannot be postponed indefinitely. ( I.e. processes cannot be
blocked forever waiting to get into their critical sections. )
3. Bounded Waiting - There exists a limit as to how many other processes can
get into their critical sections after a process requests entry into their critical
section and before that request is granted. ( I.e. a process requesting entry
into their critical section will get a turn eventually, and there is a limit as to
how many other processes get to go first. )
• We assume that all processes proceed at a non-zero speed, but no assumptions can
be made regarding the relative speed of one process versus another.
• Kernel processes can also be subject to race conditions, which can be especially
problematic when updating commonly shared kernel data structures such as open
file tables or virtual memory management. Accordingly kernels can take on one of
two forms:
1. Non-preemptive kernels do not allow processes to be interrupted while in
kernel mode. This eliminates the possibility of kernel-mode race conditions,
but requires kernel mode operations to complete very quickly, and can be
problematic for real-time systems, because timing cannot be guaranteed.
2. Preemptive kernels allow for real-time operations, but must be carefully
written to avoid race conditions. This can be especially tricky on SMP
systems, in which multiple kernel processes may be running simultaneously
on different processors.
• Non-preemptive kernels include Windows XP, 2000, traditional UNIX, and Linux
prior to 2.6; Preemptive kernels include Linux 2.6 and later, and some commercial
UNIXes such as Solaris and IRIX.
• To generalize the solution(s) expressed above, each process when entering their
critical section must set some sort of lock, to prevent other processes from entering
their critical sections simultaneously, and must release the lock when exiting their
critical section, to allow other processes to proceed. Obviously it must be possible
to attain the lock only when no other process has already set a lock. Specific
implementations of this general procedure can get quite complicated, and may
include hardware solutions as outlined in this section.
• One simple solution to the critical section problem is to simply prevent a process
from being interrupted while in their critical section, which is the approach taken
by non preemptive kernels. Unfortunately this does not work well in multiprocessor
environments, due to the difficulties in disabling and the re-enabling interrupts on
all processors. There is also a question as to how this approach affects timing if the
clock interrupt is disabled.
• Another approach is for hardware to provide certain atomic operations. These
operations are guaranteed to operate as a single instruction, without interruption.
One such operation is the "Test and Set", which simultaneously sets a boolean lock
variable and returns its previous value, as shown in Figures 5.3 and 5.4:
Figures 5.3 and 5.4 illustrate "test_and_set( )" function
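( A sketch of test_and_set( ) and the simple spinlock built from it; on real hardware the body of test_and_set( ) executes as one atomic instruction. )

/* Executed atomically by the hardware; shown in C for clarity. */
int test_and_set( int *target ) {
    int rv = *target;     /* remember the old value        */
    *target = 1;          /* set the lock unconditionally  */
    return rv;            /* return the old value          */
}

int lock = 0;             /* shared: 0 = free, 1 = held    */

void process( void ) {
    while ( 1 ) {
        while ( test_and_set( &lock ) )
            ;             /* busy-wait until the returned ( old ) value is 0 */

        /* critical section */

        lock = 0;         /* release the lock */

        /* remainder section */
    }
}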
• The above examples satisfy the mutual exclusion requirement, but unfortunately do
not guarantee bounded waiting. If there are multiple processes trying to get into
their critical sections, there is no guarantee of what order they will enter, and any
one process could have the bad luck to wait forever until they got their turn in the
critical section. ( Since there is no guarantee as to the relative rates of the processes,
a very fast process could theoretically release the lock, whip through their
remainder section, and re-lock the lock before a slower process got a chance. As
more and more processes are involved vying for the same resource, the odds of a
slow process getting locked out completely increase. )
• Figure 5.7 illustrates a solution using test-and-set that does satisfy this requirement,
using two shared data structures, boolean lock and boolean waiting[ N ], where N
is the number of processes in contention for critical sections:
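( A sketch of the Figure 5.7 algorithm for process i of N, using the test_and_set( ) from the previous sketch; lock and waiting[ ] are shared and start at 0. )

#define N 5                           /* number of contending processes        */
int waiting[ N ];                     /* shared, all initially 0 ( false )     */
int test_and_set( int *target );      /* atomic, as in the previous sketch     */
extern int lock;                      /* shared lock, initially 0 ( unlocked ) */

void process( int i ) {
    while ( 1 ) {
        int key;
        waiting[ i ] = 1;
        key = 1;
        while ( waiting[ i ] && key )
            key = test_and_set( &lock );   /* spin until the lock is obtained or handed over */
        waiting[ i ] = 0;

        /* critical section */

        /* Look, in order, for the next process that is waiting */
        int j = ( i + 1 ) % N;
        while ( j != i && !waiting[ j ] )
            j = ( j + 1 ) % N;

        if ( j == i )
            lock = 0;                 /* nobody waiting: release the lock            */
        else
            waiting[ j ] = 0;         /* hand the critical section directly to Pj    */

        /* remainder section */
    }
}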
• The key feature of the above algorithm is that a process blocks on the AND of the
critical section being locked and that this process is in the waiting state. When
exiting a critical section, the exiting process does not just unlock the critical section
and let the other processes have a free-for-all trying to get in. Rather it first looks
in an orderly progression ( starting with the next process on the list ) for a process
that has been waiting, and if it finds one, then it releases that particular process from
its waiting state, without unlocking the critical section, thereby allowing a specific
process into the critical section while continuing to block all the others. Only if
there are no other processes currently waiting is the general lock removed, allowing
the next process to come along access to the critical section.
• Unfortunately, hardware level locks are especially difficult to implement in multi-
processor architectures. Discussion of such issues is left to books on advanced
computer architecture.
6.6 Semaphores
• A more robust alternative to simple mutexes is to use semaphores, which are
integer variables for which only two ( atomic ) operations are defined, the wait and
signal operations, as shown in the following figure.
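( The busy-waiting definition the text refers to, sketched in C; both bodies must execute atomically. )

/* Both operations must be executed atomically. */
void wait( int *S ) {
    while ( *S <= 0 )
        ;            /* busy wait until the semaphore is positive */
    ( *S )--;        /* S-- : take one unit of the resource       */
}

void signal( int *S ) {
    ( *S )++;        /* S++ : return one unit of the resource     */
}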
• Note that not only must the variable-changing steps ( S-- and S++ ) be indivisible,
it is also necessary that for the wait operation when the test proves false that there
be no interruptions before S gets decremented. It IS okay, however, for the busy
loop to be interrupted when the test is true, which prevents the system from hanging
forever.
o Counting semaphores can take on any integer value, and are usually
used to count the number remaining of some limited resource. The
counter is initialized to the number of such resources available in the
system, and whenever the counting semaphore is greater than zero,
then a process can enter a critical section and use one of the resources.
When the counter gets to zero ( or negative in some implementations
), then the process blocks until another process frees up a resource and
increments the counting semaphore with a signal call. ( The binary
semaphore can be seen as just a special case where the number of
resources initially available is just one. )
o Semaphores can also be used to synchronize certain operations
between processes. For example, suppose it is important that process
P1 execute statement S1 before process P2 executes statement S2.
▪ First we create a semaphore named synch that is shared by the
two processes, and initialize it to zero.
▪ Then in process P1 we insert the code:
S1;
signal( synch );
▪ And in process P2 we insert the code:
wait( synch );
S2;
• The big problem with semaphores as described above is the busy loop in the
wait call, which consumes CPU cycles without doing any useful work. This
type of lock is known as a spinlock, because the lock just sits there and spins
while it waits. While this is generally a bad thing, it does have the advantage
of not invoking context switches, and so it is sometimes used in multi-
processing systems when the wait time is expected to be short - One thread
spins on one processor while another completes their critical section on
another processor.
• An alternative approach is to block a process when it is forced to wait for an
available semaphore, and swap it out of the CPU. In this implementation
each semaphore needs to maintain a list of processes that are blocked waiting
for it, so that one of the processes can be woken up and swapped back in
when the semaphore becomes available. ( Whether it gets swapped back into
the CPU immediately or whether it needs to hang out in the ready queue for
a while is a scheduling problem. )
• The new definition of a semaphore and the corresponding wait and signal
operations are shown as follows:
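( A sketch of this blocking definition; block( ), wakeup( ), and the list helpers are assumed to be supplied by the kernel. )

typedef struct {
    int value;
    struct process *list;             /* queue of processes blocked on this semaphore */
} semaphore;

void wait( semaphore *S ) {
    S->value--;                        /* decrement first ...                         */
    if ( S->value < 0 ) {              /* ... so a negative value counts the          */
        add_to_list( &S->list );       /*     number of waiting processes             */
        block( );                      /* suspend the calling process                 */
    }
}

void signal( semaphore *S ) {
    S->value++;
    if ( S->value <= 0 ) {             /* someone is still waiting                    */
        struct process *P = remove_from_list( &S->list );
        wakeup( P );                   /* move P back to the ready queue              */
    }
}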
• Note that in this implementation the value of the semaphore can actually
become negative, in which case its magnitude is the number of processes
waiting for that semaphore. This is a result of decrementing the counter
before checking its value.
• Key to the success of semaphores is that the wait and signal operations be
atomic, that is no other process can execute a wait or signal on the same
semaphore at the same time. ( Other processes could be allowed to do other
things, including working with other semaphores, they just can't have access
to this semaphore. ) On single processors this can be implemented by
disabling interrupts during the execution of wait and signal; Multiprocessor
systems have to use more complex methods, including the use of
spinlocking.
• One important problem that can arise when using semaphores to block
processes waiting for a limited resource is the problem of deadlocks, which
occur when multiple processes are blocked, each waiting for a resource that
can only be freed by one of the other ( blocked ) processes, as illustrated in
the following example. ( Deadlocks are covered more completely in chapter
7. )
• Another problem to consider is that of starvation, in which one or more
processes gets blocked forever, and never get a chance to take their turn in
the critical section. For example, in the semaphores above, we did not specify
the algorithms for adding processes to the waiting queue in the semaphore
in the wait( ) call, or selecting one to be removed from the queue in the
signal( ) call. If the method chosen is a FIFO queue, then every process will
eventually get their turn, but if a LIFO queue is implemented instead, then
the first process to start waiting could starve.
The following classic problems are used to test virtually every new proposed
synchronization algorithm.
• In the readers-writers problem, some processes only read the shared data ( readers ) while others may modify it ( writers ). Any number of readers may access the data simultaneously, but when a writer accesses the data, it needs exclusive access.
• There are several variations to the readers-writers problem, most centered
around relative priorities of readers versus writers.
o The first readers-writers problem gives priority to readers. In this
problem, if a reader wants access to the data, and there is not already
a writer accessing it, then access is granted to the reader. A solution
to this problem can lead to starvation of the writers, as there could
always be more readers coming along to access the data. ( A steady
stream of readers will jump ahead of waiting writers as long as there
is currently already another reader accessing the data, because the
writer is forced to wait until the data is idle, which may never happen
if there are enough readers. )
o The second readers-writers problem gives priority to the writers. In
this problem, when a writer wants access to the data it jumps to the
head of the queue - All waiting readers are blocked, and the writer
gets access to the data as soon as it becomes available. In this solution
the readers may be starved by a steady stream of writers.
• The following code is an example of the first readers-writers problem, and
involves an important counter and two binary semaphores:
o readcount is used by the reader processes, to count the number of
readers currently accessing the data.
o mutex is a semaphore used only by the readers for controlled access
to readcount.
o rw_mutex is a semaphore used to block and release the writers. The
first reader to access the data will set this lock and the last reader to
exit will release it; The remaining readers do not touch rw_mutex. (
Eighth edition called this variable wrt. )
o Note that the first reader to come along will block on rw_mutex if
there is currently a writer accessing the data, and that all following
readers will only block on mutex for their turn to
increment readcount.
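( A sketch of the first readers-writers solution described above; wait( ) and signal( ) are the semaphore operations, and the initial values are given in comments. )

int readcount = 0;           /* number of readers currently accessing the data     */
semaphore mutex;             /* initialized to 1; protects readcount               */
semaphore rw_mutex;          /* initialized to 1; held by writers, and by readers as a group */

void writer( void ) {
    while ( 1 ) {
        wait( &rw_mutex );
        /* ... writing is performed ... */
        signal( &rw_mutex );
    }
}

void reader( void ) {
    while ( 1 ) {
        wait( &mutex );
        readcount++;
        if ( readcount == 1 )        /* first reader locks out the writers          */
            wait( &rw_mutex );
        signal( &mutex );

        /* ... reading is performed ... */

        wait( &mutex );
        readcount--;
        if ( readcount == 0 )        /* last reader lets the writers back in        */
            signal( &rw_mutex );
        signal( &mutex );
    }
}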
• Some hardware implementations provide specific reader-writer locks, which
are accessed using an argument specifying whether access is requested for
reading or writing. The use of reader-writer locks is beneficial for situation
in which: (1) processes can be easily identified as either readers or writers,
and (2) there are significantly more readers than writers, making the
additional overhead of the reader-writer lock pay off in terms of increased
concurrency of the readers.
• In the dining-philosophers problem, five philosophers sit around a circular table with a single chopstick between each pair of neighbors; a philosopher needs both adjacent chopsticks in order to eat.
o When a philosopher thinks, both chopsticks are put down in their original locations.
• One possible solution, as shown in the following code section, is to use a set
of five semaphores ( chopsticks[ 5 ] ), and to have each hungry philosopher
first wait on their left chopstick ( chopsticks[ i ] ), and then wait on their right
chopstick ( chopsticks[ ( i + 1 ) % 5 ] )
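( A sketch of the five-semaphore approach just described; each element of chopsticks[ 5 ] is a semaphore initialized to 1. )

semaphore chopsticks[ 5 ];    /* one per chopstick, each initialized to 1 */

void philosopher( int i ) {
    while ( 1 ) {
        /* think */

        wait( &chopsticks[ i ] );                 /* pick up left chopstick  */
        wait( &chopsticks[ ( i + 1 ) % 5 ] );     /* pick up right chopstick */

        /* eat */

        signal( &chopsticks[ i ] );               /* put down left           */
        signal( &chopsticks[ ( i + 1 ) % 5 ] );   /* put down right          */
    }
}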
• But suppose that all five philosophers get hungry at the same time, and each
starts by picking up their left chopstick. They then look for their right
chopstick, but because it is unavailable, they wait for it, forever, and
eventually all the philosophers starve due to the resulting deadlock.
5.8 Monitors
• Semaphores can be very useful for solving concurrency problems, but only if
programmers use them properly. If even one process fails to abide by the proper
use of semaphores, either accidentally or deliberately, then the whole system breaks
down. ( And since concurrency problems are by definition rare events, the problem
code may easily go unnoticed and/or be heinous to debug. )
• For this reason a higher-level language construct has been developed,
called monitors.
• A monitor is essentially a class, in which all data is private, and with the
special restriction that only one method within any given monitor object may
be active at the same time. An additional restriction is that monitor methods
may only access the shared data within the monitor and any data passed to
them as parameters. I.e. they cannot access any data external to the monitor.
Figure 5.17 - Monitor with condition variables
• But now there is a potential problem - If process P within the monitor issues
a signal that would wake up process Q also within the monitor, then there
would be two processes running simultaneously within the monitor,
violating the exclusion requirement. Accordingly there are two possible
solutions to this dilemma:
Signal and wait - When process P issues the signal to wake up process Q, P then waits,
either for Q to leave the monitor or on some other condition.
Signal and continue - When P issues the signal, Q waits, either for P to exit the monitor
or for some other condition.
There are arguments for and against either choice. Concurrent Pascal offers a third
alternative - The signal call causes the signaling process to immediately exit the monitor,
so that the waiting process can then wake up and proceed.
• Java and C# ( C sharp ) offer monitors built in to the language. Erlang offers similar but different constructs.
• This solution to the dining philosophers uses monitors, and the restriction
that a philosopher may only pick up chopsticks when both are available.
There are also two key data structures in use in this solution:
1. enum { THINKING, HUNGRY, EATING } state[ 5 ]; A
philosopher may only set their state to eating when neither of their
adjacent neighbors is eating. ( state[ ( i + 1 ) % 5 ] != EATING &&
state[ ( i + 4 ) % 5 ] != EATING ).
2. condition self[ 5 ]; This condition is used to delay a hungry
philosopher who is unable to acquire chopsticks.
• In the following solution philosophers share a monitor, DiningPhilosophers,
and eat using the following sequence of operations:
1. DiningPhilosophers.pickup( ) - Acquires chopsticks, which may
block the process.
2. eat
3. DiningPhilosophers.putdown( ) - Releases the chopsticks.
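( Monitors are not built into C; the following minimal sketch emulates the monitor described above, using a Pthreads mutex as the monitor lock and condition variables for self[ i ]. )

#include <pthread.h>

enum { THINKING, HUNGRY, EATING };
static int state[ 5 ];
static pthread_mutex_t monitor = PTHREAD_MUTEX_INITIALIZER;      /* one-function-at-a-time rule */
static pthread_cond_t self[ 5 ] = { PTHREAD_COND_INITIALIZER, PTHREAD_COND_INITIALIZER,
                                    PTHREAD_COND_INITIALIZER, PTHREAD_COND_INITIALIZER,
                                    PTHREAD_COND_INITIALIZER };

static void test( int i ) {               /* may philosopher i start eating? */
    if ( state[ ( i + 4 ) % 5 ] != EATING && state[ i ] == HUNGRY &&
         state[ ( i + 1 ) % 5 ] != EATING ) {
        state[ i ] = EATING;
        pthread_cond_signal( &self[ i ] );            /* self[ i ].signal( ) */
    }
}

void pickup( int i ) {
    pthread_mutex_lock( &monitor );                   /* enter the monitor   */
    state[ i ] = HUNGRY;
    test( i );
    while ( state[ i ] != EATING )
        pthread_cond_wait( &self[ i ], &monitor );    /* self[ i ].wait( )   */
    pthread_mutex_unlock( &monitor );
}

void putdown( int i ) {
    pthread_mutex_lock( &monitor );
    state[ i ] = THINKING;
    test( ( i + 4 ) % 5 );                            /* check left neighbor  */
    test( ( i + 1 ) % 5 );                            /* check right neighbor */
    pthread_mutex_unlock( &monitor );
}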
• A monitor can itself be implemented using semaphores: a mutex semaphore guards entry to the monitor, each condition variable gets its own semaphore and counter, and a signaling process waits on a "next" semaphore ( with processes waiting on the next queue ) until the signaled process leaves the monitor or waits again. Externally accessible monitor functions are then implemented by acquiring mutex on entry and signaling next ( or mutex ) on exit.
• When there are multiple processes waiting on the same condition within a
monitor, how does one decide which one to wake up in response to a signal
on that condition? One obvious approach is FCFS, and this may be suitable
in many cases.
• Another alternative is to assign ( integer ) priorities, and to wake up the
process with the smallest ( best ) priority.
• Figure 5.19 illustrates the use of such a condition within a monitor used for
resource allocation. Processes wishing to access this resource must specify
the time they expect to use it using the acquire( time ) method, and must call
the release( ) method when they are done with the resource.
Deadlocks
• One approach to handling deadlocks is simply to ignore the problem: if deadlocks occur only rarely, it may be cheaper to let them happen and reboot as necessary than to incur the constant overhead and performance penalties associated with deadlock prevention or detection. This is the approach that both Windows and UNIX take.
• In order to avoid deadlocks, the system must have additional information about all
processes. In particular, the system must know what resources a process will or may
request in the future. ( Ranging from a simple worst-case maximum to a complete
resource request and release plan for each process, depending on the particular
algorithm. )
• Deadlock detection is fairly straightforward, but deadlock recovery requires either
aborting processes or preempting resources, neither of which is an attractive
alternative.
• If deadlocks are neither prevented nor detected, then when a deadlock occurs the
system will gradually slow down, as more and more processes become stuck
waiting for resources currently held by the deadlock and by other waiting processes.
Unfortunately this slowdown can be indistinguishable from a general system
slowdown when a real-time process has heavy computing needs.
7.4.3 No Preemption
• Preemption of resources can prevent this condition when it is possible: if a process holding some resources requests another resource that cannot be immediately granted, then all of the resources it currently holds are released, and it must request them again later along with the new resource.
7.4.4 Circular Wait
• One way to avoid circular wait is to number all resources, and to require that
processes request resources only in strictly increasing ( or decreasing ) order.
• In other words, in order to request resource Rj, a process must first release
all Ri such that i >= j.
• One big challenge in this scheme is determining the relative ordering of the
different resources
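( A minimal Pthreads sketch of the resource-ordering idea: both tasks acquire the lower-numbered mutex first, so a circular wait cannot arise. )

#include <pthread.h>

/* Resource ordering: R1 < R2, so every thread must lock r1 before r2. */
static pthread_mutex_t r1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t r2 = PTHREAD_MUTEX_INITIALIZER;

void *task_a( void *arg ) {
    pthread_mutex_lock( &r1 );     /* lower-numbered resource first */
    pthread_mutex_lock( &r2 );
    /* ... use both resources ... */
    pthread_mutex_unlock( &r2 );
    pthread_mutex_unlock( &r1 );
    return NULL;
}

void *task_b( void *arg ) {
    /* Even if this task "wants" r2 first, it must still honor the ordering
       and request r1 before r2 - eliminating the possibility of circular wait. */
    pthread_mutex_lock( &r1 );
    pthread_mutex_lock( &r2 );
    /* ... use both resources ... */
    pthread_mutex_unlock( &r2 );
    pthread_mutex_unlock( &r1 );
    return NULL;
}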
7.5 Deadlock Avoidance
• The general idea behind deadlock avoidance is to keep the system from ever entering an unsafe state, using advance information about the resources each process may request.
• This requires more information about each process, AND tends to lead to low
device utilization. ( I.e. it is a conservative approach. )
• In some algorithms the scheduler only needs to know the maximum number of each
resource that a process might potentially use. In more complex algorithms the
scheduler can also take advantage of the schedule of exactly what resources may
be needed in what order.
• When a scheduler sees that starting a process or granting resource requests may
lead to future deadlocks, then that process is just not started or the request is not
granted.
• A resource allocation state is defined by the number of available and allocated
resources, and the maximum requirements of all processes in the system.
• A state is safe if the system can allocate all resources requested by all
processes ( up to their stated maximums ) without entering a deadlock state.
• More formally, a state is safe if there exists a safe sequence of processes
{ P0, P1, P2, ..., PN } such that all of the resource requests for Pi can be
granted using the resources currently allocated to Pi and all processes Pj
where j < i. ( I.e. if all the processes prior to Pi finish and free up their
resources, then Pi will be able to finish also, using the resources that they
have freed up. )
• If a safe sequence does not exist, then the system is in an unsafe state,
which MAY lead to deadlock. ( All safe states are deadlock free, but not all
unsafe states lead to deadlocks. )
• What happens to the above table if process P2 requests and is granted one
more tape drive?
• Key to the safe state approach is that when a request is made for resources,
the request is granted only if the resulting allocation state is a safe one.
7.5.2 Resource-Allocation-Graph Algorithm
• In this algorithm, each process declares in advance a claim edge for every resource it might request. Processes may only make requests for resources for which they have already established claim edges, and claim edges cannot be added to any process that is currently holding resources.
• When a process makes a request, the claim edge Pi->Rj is converted to a
request edge. Similarly when a resource is released, the assignment reverts
back to a claim edge.
• This approach works by denying requests that would produce cycles in the
resource-allocation graph, taking claim edges into effect.
• Consider for example what happens when process P2 requests resource R2:
• The resulting resource-allocation graph would have a cycle in it, and so the
request cannot be granted.
• For resource categories that contain more than one instance the resource-
allocation graph method does not work, and more complex ( and less
efficient ) methods must be chosen.
• The Banker's Algorithm gets its name because it is a method that bankers
could use to assure that when they lend out resources they will still be able
to satisfy all their clients. ( A banker won't loan out a little money to start
building a house unless they are assured that they will later be able to loan
out the rest of the money to finish the house. )
• When a process starts up, it must state in advance the maximum allocation
of resources it may request, up to the amount available on the system.
• When a request is made, the scheduler determines whether granting the
request would leave the system in a safe state. If not, then the process must
wait until the request can be granted safely.
• The banker's algorithm relies on several key data structures: ( where n is the
number of processes and m is the number of resource categories. )
o Available[ m ] indicates how many resources are currently available
of each type.
o Max[ n ][ m ] indicates the maximum demand of each process of each
resource.
o Allocation[ n ][ m ] indicates the number of each resource category
allocated to each process.
o Need[ n ][ m ] indicates the remaining resources needed of each type
for each process. ( Note that Need[ i ][ j ] = Max[ i ][ j ] -
Allocation[ i ][ j ] for all i, j. )
• For simplification of discussions, we make the following notations /
observations:
o One row of the Need vector, Need[ i ], can be treated as a vector
corresponding to the needs of process i, and similarly for Allocation
and Max.
o A vector X is considered to be <= a vector Y if X[ i ] <= Y[ i ] for all
i.
7.5.3.2 Resource-Request Algorithm ( The Bankers Algorithm )
• What about requests of ( 3, 3,0 ) by P4? or ( 0, 2, 0 ) by P0? Can these
be safely granted? Why or why not?
• If deadlocks are not avoided, then another approach is to detect when they have
occurred and recover somehow.
• In addition to the performance hit of constantly checking for deadlocks, a policy /
algorithm must be in place for recovering from deadlocks, and there is potential for
lost work when processes must be aborted or have their resources preempted.
• If each resource category has a single instance, then we can use a variation
of the resource-allocation graph known as a wait-for graph.
• A wait-for graph can be constructed from a resource-allocation graph by
eliminating the resources and collapsing the associated edges, as shown in
the figure below.
• An arc from Pi to Pj in a wait-for graph indicates that process Pi is waiting
for a resource that process Pj is currently holding.
Figure 7.9 - (a) Resource allocation graph. (b) Corresponding wait-for graph
• This algorithm must maintain the wait-for graph, and periodically search it
for cycles.
• The detection algorithm outlined here is essentially the same as the Banker's
algorithm, with two subtle differences:
o In step 1, the Banker's Algorithm sets Finish[ i ] to false for all i. The
algorithm presented here sets Finish[ i ] to false only if Allocation[ i ]
is not zero. If the currently allocated resources for this process are
zero, the algorithm sets Finish[ i ] to true. This is essentially assuming
that IF all of the other processes can finish, then this process can finish
also. Furthermore, this algorithm is specifically looking for which
processes are involved in a deadlock situation, and a process that does
not have any resources allocated cannot be involved in a deadlock,
and so can be removed from any further consideration.
o Steps 2 and 3 are unchanged
o In step 4, the basic Banker's Algorithm says that if Finish[ i ] == true
for all i, that there is no deadlock. This algorithm is more specific, by
stating that if Finish[ i ] == false for any process Pi, then that process
is specifically involved in the deadlock which has been detected.
• ( Note: An alternative method was presented above, in which Finish held
integers instead of booleans. This vector would be initialized to all zeros,
and then filled with increasing integers as processes are detected which can
finish. If any processes are left at zero when the algorithm completes, then
there is a deadlock, and if not, then the integers in finish describe a safe
sequence. To modify this algorithm to match this section of the text,
processes with allocation = zero could be filled in with N, N - 1, N - 2, etc.
in step 1, and any processes left with Finish = 0 in step 4 are the deadlocked
processes. )
• Consider, for example, the following state, and determine if it is currently
deadlocked:
7.6.3 Detection-Algorithm Usage
• Now that we have a tool for determining if a particular state is safe or
not, we are now ready to look at the Banker's algorithm itself.
• This algorithm determines if a new request is safe, and grants it only
if it is safe to do so.
• When a request is made ( that does not exceed currently available
resources ), pretend it has been granted, and then see if the resulting
state is a safe one. If so, grant the request, and if not, deny the request,
as follows:
1. Let Request[ n ][ m ] indicate the number of resources of each
type currently requested by processes. If Request[ i ] > Need[ i
] for any process i, raise an error condition.
2. If Request[ i ] > Available for any process i, then that process
must wait for resources to become available. Otherwise the
process can continue to step 3.
3. Check to see if the request can be granted safely, by pretending
it has been granted and then seeing if the resulting state is safe.
If so, grant the request, and if not, then the process must wait
until its request can be granted safely. The procedure for
granting a request ( or pretending to for testing purposes ) is
sketched in code after this list:
▪ Available = Available - Request
▪ Allocation = Allocation + Request
▪ Need = Need - Request
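The following is a minimal C sketch of the request-granting procedure above, together with the safety check it relies on. The array names and the sizes N and M are assumptions for illustration; error handling is reduced to boolean results:

    #include <stdbool.h>

    #define N 5   /* number of processes      (assumed) */
    #define M 3   /* number of resource types (assumed) */

    /* Safety algorithm: can every process still finish from this state? */
    static bool is_safe(int available[M], int allocation[N][M], int need[N][M])
    {
        int  work[M];
        bool finish[N] = { false };

        for (int j = 0; j < M; j++) work[j] = available[j];

        for (int done = 0; done < N; ) {
            bool progress = false;
            for (int i = 0; i < N; i++) {
                if (finish[i]) continue;
                bool ok = true;
                for (int j = 0; j < M; j++)
                    if (need[i][j] > work[j]) ok = false;
                if (ok) {                    /* process i can run to completion */
                    for (int j = 0; j < M; j++) work[j] += allocation[i][j];
                    finish[i] = true;
                    done++;
                    progress = true;
                }
            }
            if (!progress) return false;     /* nobody else can finish: unsafe  */
        }
        return true;
    }

    /* Resource-request algorithm: grant the request only if the result is safe. */
    bool request_resources(int i, int request[M], int available[M],
                           int allocation[N][M], int need[N][M])
    {
        for (int j = 0; j < M; j++)
            if (request[j] > need[i][j]) return false;    /* error: exceeds claim */
        for (int j = 0; j < M; j++)
            if (request[j] > available[j]) return false;  /* must wait            */

        /* Pretend the request has been granted ... */
        for (int j = 0; j < M; j++) {
            available[j]     -= request[j];
            allocation[i][j] += request[j];
            need[i][j]       -= request[j];
        }

        if (is_safe(available, allocation, need)) return true;   /* grant it */

        /* ... otherwise roll the state back and make the process wait. */
        for (int j = 0; j < M; j++) {
            available[j]     += request[j];
            allocation[i][j] -= request[j];
            need[i][j]       += request[j];
        }
        return false;
    }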
When should the detection algorithm be invoked? Two obvious approaches are:
1. Do deadlock detection after every resource allocation which cannot
be immediately granted. This has the advantage of detecting the
deadlock right away, while the minimum number of processes are
involved in the deadlock. ( One might consider that the process whose
request triggered the deadlock condition is the "cause" of the
deadlock, but realistically all of the processes in the cycle are equally
responsible for the resulting deadlock. ) The down side of this
approach is the extensive overhead and performance hit caused by
checking for deadlocks so frequently.
2. Do deadlock detection only when there is some clue that a deadlock
may have occurred, such as when CPU utilization reduces to 40% or
some other magic number. The advantage is that deadlock detection
is done much less frequently, but the down side is that it becomes
impossible to detect the processes involved in the original deadlock,
and so deadlock recovery can be more complicated and damaging to
more processes.
3. ( As I write this, a third alternative comes to mind: Keep a historical
log of resource allocations, since that last known time of no deadlocks.
Do deadlock checks periodically ( once an hour or when CPU usage
is low?), and then use the historical log to trace through and determine
when the deadlock occurred and what processes caused the initial
deadlock. Unfortunately I'm not certain that breaking the original
deadlock would then free up the resulting log jam. )
UNIT III
8.2 Swapping
Figure 8.5 - Swapping of two processes using a disk as a backing store
• The system shown in Figure 8.6 below allows protection against user
programs accessing areas that they should not, allows programs to be
relocated to different memory starting addresses as needed, and allows the
memory space devoted to the OS to grow or shrink dynamically as needs
change.
2. Best fit - Allocate the smallest hole that is big enough to satisfy the
request. This saves large holes for other process requests that may
need them later, but the resulting unused portions of holes may be too
small to be of any use, and will therefore be wasted. Keeping the free
list sorted can speed up the process of finding the right hole.
3. Worst fit - Allocate the largest hole available, thereby increasing the
likelihood that the remaining portion will be usable for satisfying
future requests.
• Simulations show that either first or best fit are better than worst fit in terms
of both time and storage utilization. First and best fits are about equal in
terms of storage utilization, but first fit is faster.
8.3.3. Fragmentation
8.4 Segmentation
8.4.1 Basic Method
Figure 8.9 - Example of segmentation
8.5 Paging
• The basic idea behind paging is to divide physical memory into a number of equal-
sized blocks called frames, and to divide a program's logical memory space into
blocks of the same size called pages.
• Any page ( from any process ) can be placed into any available frame.
• The page table is used to look up what frame a particular page is stored in at the
moment. In the following example, for instance, page 2 of the program's logical
memory is currently stored in frame 3 of physical memory:
Figure 8.10 - Paging hardware
• A logical address consists of two parts: A page number in which the address resides,
and an offset from the beginning of that page. ( The number of bits in the page
number limits how many pages a single process can address. The number of bits in
the offset determines the maximum size of each page, and should correspond to the
system frame size. )
• The page table maps the page number to a frame number, to yield a physical address
which also has two parts: The frame number and the offset within that frame. The
number of bits in the frame number determines how many frames the system can
address, and the number of bits in the offset determines the size of each frame.
• Page numbers, frame numbers, and frame sizes are determined by the architecture,
but are typically powers of two, allowing addresses to be split at a certain number
of bits. For example, if the logical address size is 2^m and the page size is 2^n, then
the high-order m-n bits of a logical address designate the page number and the
remaining n bits represent the offset.
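As a minimal illustration, the split can be done with a shift and a mask. The 4 KB page size ( n = 12 ) and the example address below are assumptions chosen purely for illustration:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 12                      /* 4 KB pages, an assumption   */
    #define PAGE_SIZE (1u << PAGE_BITS)

    int main(void)
    {
        uint32_t logical = 0x0001A2B4;        /* arbitrary example address   */
        uint32_t page    = logical >> PAGE_BITS;          /* high-order bits */
        uint32_t offset  = logical &  (PAGE_SIZE - 1);    /* low-order bits  */

        /* The page table (not shown) would map 'page' to a frame number f;
           the physical address is then ( f << PAGE_BITS ) | offset.        */
        printf("page = %u, offset = %u\n", page, offset);
        return 0;
    }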
• Note also that the number of bits in the page number and the number of bits in the
frame number do not have to be identical. The former determines the address range
of the logical address space, and the latter relates to the physical address space.
• ( DOS used to use an addressing scheme with 16-bit segment numbers and 16-bit
offsets, on hardware that only supported 20-bit physical addresses. The result was
a resolution of starting segment addresses finer than the size of a single segment,
and multiple segment-offset combinations that mapped to the same physical
hardware address. )
• Consider the following micro example, in which a process has 16 bytes of logical
memory, mapped in 4 byte pages into 32 bytes of physical memory. ( Presumably
some other processes would be consuming the remaining 16 bytes of physical
memory. )
Figure 8.12 - Paging example for a 32-byte memory with 4-byte pages
• Note that paging is like having a table of relocation registers, one for each page of
the logical memory.
• There is no external fragmentation with paging. All blocks of physical memory are
used, and there are no gaps in between and no problems with finding the right sized
hole for a particular chunk of memory.
• There is, however, internal fragmentation. Memory is allocated in chunks the size
of a page, and on the average, the last page will only be half full, wasting on the
average half a page of memory per process. ( Possibly more, if processes keep their
code and data in separate pages. )
• Larger page sizes waste more memory, but are more efficient in terms of overhead.
Modern trends have been to increase page sizes, and some systems even have
multiple size pages to try and make the best of both worlds.
• Page table entries ( frame numbers ) are typically 32 bit numbers, allowing access
to 2^32 physical page frames. If those frames are 4 KB in size each, that translates
to 16 TB of addressable physical memory. ( 32 + 12 = 44 bits of physical address
space. )
• When a process requests memory ( e.g. when its code is loaded in from disk ), free
frames are allocated from a free-frame list, and inserted into that process's page
table.
• Processes are blocked from accessing anyone else's memory because all of their
memory requests are mapped through their page table. There is no way for them to
generate an address that maps into any other process's memory space.
• The operating system must keep track of each individual process's page table,
updating it whenever the process's pages get moved in and out of memory, and
applying the correct page table when processing system calls for a particular
process. This all increases the overhead involved when swapping processes in and
out of the CPU. ( The currently active page table must be updated to reflect the
process that is currently running. )
Figure 8.13 - Free frames (a) before allocation and (b) after allocation
• Page lookups must be done for every memory reference, and whenever a
process gets swapped in or out of the CPU, its page table must be swapped
in and out too, along with the instruction registers, etc. It is therefore
appropriate to provide hardware support for this operation, in order to make
it as fast as possible and to make process switches as fast as possible also.
• One option is to use a set of registers for the page table. For example, the
DEC PDP-11 uses 16-bit addressing and 8 KB pages, resulting in only 8
pages per process. ( It takes 13 bits to address 8 KB of offset, leaving only 3
bits to define a page number. )
• An alternate option is to store the page table in main memory, and to use a
single register ( called the page-table base register, PTBR ) to record where
in memory the page table is located.
o Process switching is fast, because only the single register needs to be
changed.
o However, memory access is now twice as slow, because every memory
access requires two physical memory accesses - one to fetch the frame
number from the page table in memory, and then another one to access
the desired memory location.
o The solution to this problem is to use a very special high-speed
memory device called the translation look-aside buffer, TLB.
▪ The benefit of the TLB is that it can search an entire table for a
key value in parallel, and if it is found anywhere in the table,
then the corresponding lookup value is returned.
Figure 8.14 - Paging hardware with TLB
• The percentage of times that the desired page number is found in the TLB is
called the hit ratio. For example, suppose it takes 20 nanoseconds to search
the TLB and 100 nanoseconds to access memory. A TLB hit then takes 120
nanoseconds total, while a TLB miss takes 220 nanoseconds ( 20 to search the
TLB, 100 to fetch the frame number from the page table in memory, and
another 100 to access the desired byte ). With an 80% hit ratio the average
access time is 0.80 * 120 + 0.20 * 220 = 140 nanoseconds, for a 40% slowdown
to get the frame number. A 98% hit rate would yield 122 nanoseconds average
access time ( you should verify this ), for a 22% slowdown.
• If the TLB search time is ignored ( because it overlaps with other work ), then
with an 80% hit ratio the average access time is 0.80 * 100 + 0.20 * 200 = 120
nanoseconds, for a 20% slowdown to get the frame number. A 99% hit rate
would yield 101 nanoseconds average access time ( you should verify this ),
for a 1% slowdown.
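The effective-access-time arithmetic above can be checked with a few lines of C; the 20 ns TLB search time, 100 ns memory access time, and the hit ratios are the figures assumed in the example:

    #include <stdio.h>

    /* Effective access time with a TLB: hit  -> tlb + mem,
                                         miss -> tlb + mem (page table) + mem (data) */
    static double eat(double hit_ratio, double tlb_ns, double mem_ns)
    {
        return hit_ratio * (tlb_ns + mem_ns)
             + (1.0 - hit_ratio) * (tlb_ns + 2.0 * mem_ns);
    }

    int main(void)
    {
        printf("80%% hits: %.0f ns\n", eat(0.80, 20.0, 100.0));                  /* 140 */
        printf("98%% hits: %.0f ns\n", eat(0.98, 20.0, 100.0));                  /* 122 */
        printf("99%% hits, TLB time ignored: %.0f ns\n", eat(0.99, 0.0, 100.0)); /* 101 */
        return 0;
    }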
8.5.3 Protection
• The page table can also help to protect processes from accessing memory
that they shouldn't, or their own memory in ways that they shouldn't.
• A bit or bits can be added to the page table to classify a page as read-write,
read-only, read-write-execute, or some combination of these sorts of things.
Then each memory reference can be checked to ensure it is accessing the
memory in the appropriate mode.
• Valid / invalid bits can be added to "mask off" entries in the page table that
are not in use by the current process, as shown by example in Figure 8.15
below.
• Note that the valid / invalid bits described above cannot block all illegal
memory accesses, due to the internal fragmentation. ( Areas of memory in
the last page that are not entirely filled by the process, and may contain data
left over by whoever used that frame last. )
• Many processes do not use all of the page table available to them,
particularly in modern systems with very large potential page tables. Rather
than waste memory by creating a full-size page table for every process, some
systems use a page-table length register, PTLR, to specify the length of the
page table.
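A hypothetical page-table entry type collecting the protection and status bits discussed above; the field widths are illustrative assumptions, not those of any particular architecture:

    /* One page-table entry: a frame number plus protection / status bits. */
    typedef struct {
        unsigned int frame    : 20;   /* physical frame number                    */
        unsigned int valid    : 1;    /* 1 = entry is in use by this process      */
        unsigned int read     : 1;    /* access-mode bits checked on every access */
        unsigned int write    : 1;
        unsigned int execute  : 1;
        unsigned int dirty    : 1;    /* set when the page has been written to    */
        unsigned int accessed : 1;    /* reference bit, used by replacement code  */
    } pte_t;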
Figure 8.15 - Valid (v) or invalid (i) bit in page table
• Paging systems can make it very easy to share blocks of memory, by simply
having entries in multiple page tables refer to the same physical frames. This
may be done with either code or data.
• If code is reentrant, that means that it does not write to or change the code
in any way ( it is non self-modifying ), and it is therefore safe to re-enter it.
More importantly, it means the code can be shared by multiple processes, so
long as each has their own copy of the data and registers, including the
instruction register.
• In the example given below, three different users are running the editor
simultaneously, but the code is only loaded into memory ( in the page
frames ) one time.
• Some systems also implement shared memory in this fashion.
Figure 8.16 - Sharing of code in a paging environment
Figure 8.18 - Address translation for a two-level 32-bit paging architecture
• The VAX architecture divides 32-bit addresses into 4 equal-sized sections, and
each page is 512 bytes, yielding an address form of a 2-bit section number, a
21-bit page number, and a 9-bit offset.
• With a 64-bit logical address space and 4K pages, there are 52 bits worth of
page numbers, which is still too many even for two-level paging. One could
increase the paging level, but with 10-bit page tables it would take 7 levels
of indirection, which would be prohibitively slow memory access. So some
other approach must be used.
Figure 8.20 - Inverted page table
Consider as a final example a modern 64-bit CPU and operating system that are tightly
integrated to provide low-overhead virtual memory. Solaris running on the SPARC CPU
is a fully 64-bit operating system and as such has to solve the problem of virtual memory
without using up all of its physical memory by keeping multiple levels of page tables. Its
approach is a bit complex but solves the problem efficiently using hashed page tables.
There are two hash tables—one for the kernel and one for all user processes. Each maps
memory addresses from virtual to physical memory. Each hash-table entry represents a
contiguous area of mapped virtual memory, which is more efficient than having a separate
hash-table entry for each page. Each entry has a base address and a span indicating the
number of pages the entry represents. Virtual-to-physical translation would take too long
if each address required searching through a hash table, so the CPU implements a TLB
that holds translation table entries (TTEs) for fast hardware lookups. A cache of these
TTEs resides in a translation storage buffer (TSB), which includes an entry per recently
accessed page. When a virtual address reference occurs, the hardware searches the TLB
for a translation. If none is found, the hardware walks through the in-memory TSB looking
for the TTE that corresponds to the virtual address that caused the lookup. This TLB walk
functionality is found on many modern CPUs. If a match is found in the TSB, the CPU
copies the TSB entry into the TLB, and the memory translation completes. If no match is
found in the TSB, the kernel is interrupted to search the hash table. The kernel then creates
a TTE from the appropriate hash table and stores it in the TSB for automatic loading into
the TLB by the CPU memory-management unit. Finally, the interrupt handler returns
control to the MMU, which completes the address translation and retrieves the requested
byte or word from main memory.
• The basic idea behind demand paging is that when a process is swapped in, its
pages are not swapped in all at once. Rather they are swapped in only when the
process needs them. ( on demand. ) This is termed a lazy swapper, although
a pager is a more accurate term.
• The basic idea behind demand paging is that when a process is swapped in, the
pager only loads into memory those pages that it expects the process to need ( right
away. )
• Pages that are not loaded into memory are marked as invalid in the page
table, using the invalid bit. ( The rest of the page table entry may either be
blank or contain information about where to find the swapped-out page on
the hard drive. )
• If the process only ever accesses pages that are loaded in memory ( memory
resident pages ), then the process runs exactly as if all the pages were loaded
in to memory.
Figure 9.5 - Page table when some pages are not in main memory.
• On the other hand, if a page is needed that was not originally loaded up, then
a page fault trap is generated, which must be handled in a series of steps ( a
toy simulation follows the list ):
1. The memory address requested is first checked, to make sure it was a
valid memory request.
2. If the reference was invalid, the process is terminated. Otherwise, the
page must be paged in.
3. A free frame is located, possibly from a free-frame list.
4. A disk operation is scheduled to bring in the necessary page from disk.
( This will usually block the process on an I/O wait, allowing some
other process to use the CPU in the meantime. )
5. When the I/O operation is complete, the process's page table is
updated with the new frame number, and the invalid bit is changed to
indicate that this is now a valid page reference.
6. The instruction that caused the page fault must now be restarted from
the beginning, ( as soon as this process gets another turn on the CPU.
)
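Below is a toy, self-contained C simulation of those steps, with no page replacement, trivially small sizes, and all names invented purely for illustration:

    #include <stdio.h>
    #include <string.h>

    #define NPAGES  8
    #define NFRAMES 4
    #define PAGESZ  256

    /* A page table with valid entries, a "backing store", and a fault
       handler that loads pages only when they are first touched.        */
    static int  page_table[NPAGES];          /* frame number, -1 = invalid */
    static char frames[NFRAMES][PAGESZ];     /* physical memory            */
    static char backing_store[NPAGES][PAGESZ];
    static int  next_free = 0;               /* trivial free-frame list    */
    static int  faults = 0;

    static char read_byte(int page, int offset)
    {
        if (page_table[page] < 0) {                   /* page fault trap   */
            faults++;
            int frame = next_free++;                  /* step 3            */
            memcpy(frames[frame], backing_store[page], PAGESZ);  /* step 4 */
            page_table[page] = frame;                 /* step 5            */
            /* step 6: the access below is the "restarted" instruction     */
        }
        return frames[page_table[page]][offset];
    }

    int main(void)
    {
        memset(page_table, -1, sizeof page_table);
        for (int p = 0; p < NPAGES; p++) backing_store[p][0] = 'A' + p;

        read_byte(2, 0); read_byte(2, 10); read_byte(5, 0);   /* 2 faults */
        printf("page faults: %d\n", faults);
        return 0;
    }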
• In an extreme case, NO pages are swapped in for a process until they are
requested by page faults. This is known as pure demand paging.
• In theory each instruction could generate multiple page faults. In practice
this is very rare, due to locality of reference, covered in section 9.6.1.
• The hardware necessary to support virtual memory is the same as for paging
and swapping: A page table and secondary memory. ( Swap space, whose
allocation is discussed in chapter 12. )
• A crucial part of the process is that the instruction must be restarted from
scratch once the desired page has been made available in memory. For most
simple instructions this is not a major difficulty. However there are some
architectures that allow a single instruction to modify a fairly large block of
data, ( which may span a page boundary ), and if some of the data gets
modified before the page fault occurs, this could cause problems. One
solution is to access both ends of the block before executing the instruction,
guaranteeing that the necessary pages get paged in before the instruction
begins.
Assuming a memory access time of 200 nanoseconds and an average page-fault service
time of 8 milliseconds ( 8,000,000 nanoseconds ), the effective access time with a page-fault
rate of p is:
( 1 - p ) * 200 + p * 8,000,000
= 200 + 7,999,800 * p
which clearly depends heavily on p! Even if only one access in 1000 causes a page fault,
the effective access time rises from 200 nanoseconds to 8.2 microseconds, a slowdown
of a factor of 40. In order to keep the slowdown less than 10%, the page fault rate
must be less than 0.0000025, or one in 399,990 accesses.
• A subtlety is that swap space is faster to access than the regular file system,
because it does not have to go through the whole directory structure. For this
reason some systems will transfer an entire process from the file system to
swap space before starting up the process, so that future paging all occurs
from the ( relatively ) faster swap space.
• Some systems use demand paging directly from the file system for binary
code ( which never changes, and hence never needs to be written back out
when its pages are reclaimed ), and reserve the swap space for data segments
that must be stored. This approach is used by both Solaris and BSD Unix.
9.3 Copy-on-Write
• The idea behind a copy-on-write fork is that the pages for a parent process do not
have to be actually copied for the child until one or the other of the processes
changes the page. They can be simply shared between the two processes in the
meantime, with a bit set that the page needs to be copied if it ever gets written to.
This is a reasonable approach, since the child process usually issues an exec( )
system call immediately after the fork.
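A small C illustration of the fork-then-exec pattern that copy-on-write is designed for; because the child writes to almost nothing before calling exec( ), very few pages ever need to be copied ( error handling omitted for brevity ):

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();              /* parent and child now share pages  */
                                         /* marked copy-on-write              */
        if (pid == 0) {
            /* The child immediately replaces its image; the shared pages are
               never written, so they are never actually copied.              */
            execlp("ls", "ls", "-l", (char *)NULL);
            _exit(1);                    /* only reached if exec fails        */
        }
        waitpid(pid, NULL, 0);           /* parent keeps its original pages   */
        return 0;
    }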
Figure 9.7 - Before process 1 modifies page C.
• Obviously only pages that can be modified even need to be labeled as copy-on-
write. Code segments can simply be shared.
• Pages used to satisfy copy-on-write duplications are typically allocated using zero-
fill-on-demand, meaning that their previous contents are zeroed out before the copy
proceeds.
• Some systems provide an alternative to the fork( ) system call called a virtual
memory fork, vfork( ). In this case the parent is suspended, and the child uses the
parent's memory pages. This is very fast for process creation, but requires that the
child not modify any of the shared memory pages before performing the exec( )
system call. ( In essence this addresses the question of which process executes first
after a call to fork, the parent or the child. With vfork, the parent is suspended,
allowing the child to execute first until it calls exec( ), sharing pages with the parent
in the meantime. )
• In order to make the most use of virtual memory, we load several processes into
memory at the same time. Since we only load the pages that are actually needed by
each process at any given time, there is room to load many more processes than if
we had to load in the entire process.
• However memory is also needed for other purposes ( such as I/O buffering ), and
what happens if some process suddenly decides it needs more pages and there aren't
any free frames available? There are several possible solutions to consider:
1. Adjust the memory used by I/O buffering, etc., to free up some frames for
user processes. The decision of how to allocate memory for I/O versus user
processes is a complex one, yielding different policies on different systems.
( Some allocate a fixed amount for I/O, and others let the I/O system contend
for memory along with everything else. )
2. Put the process requesting more pages into a wait queue until some free
frames become available.
3. Swap some process out of memory completely, freeing up its page frames.
4. Find some page in memory that isn't being used right now, and swap that
page only out to disk, freeing up a frame that can be allocated to the process
requesting it. This is known as page replacement, and is the most common
solution. There are many different algorithms for page replacement, which
is the subject of the remainder of this section.
• The previously discussed page-fault processing assumed that there would be
free frames available on the free-frame list. Now the page-fault handling
must be modified to free up a frame if necessary, as follows:
1. Find the location of the desired page on the disk, either in swap space
or in the file system.
2. Find a free frame:
a. If there is a free frame, use it.
b. If there is no free frame, use a page-replacement algorithm to
select an existing frame to be replaced, known as the victim
frame.
c. Write the victim frame to disk. Change all related page tables
to indicate that this page is no longer in memory.
3. Read in the desired page and store it in the frame. Adjust all related
page and frame tables to indicate the change.
4. Restart the process that was waiting for this page.
• Note that step 3c adds an extra disk write to the page-fault handling,
effectively doubling the time required to process a page fault. This can be
alleviated somewhat by assigning a modify bit, or dirty bit to each page,
indicating whether or not it has been changed since it was last loaded in from
disk. If the dirty bit has not been set, then the page is unchanged, and does
not need to be written out to disk. Otherwise the page write is required. It
should come as no surprise that many page replacement strategies
specifically look for pages that do not have their dirty bit set, and
preferentially select clean pages as victim pages. It should also be obvious
that unmodifiable code pages never get their dirty bits set.
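Below is a minimal sketch of the dirty-bit optimization: only victims that have been modified cost the extra disk write. The structures and counters are invented for illustration:

    #include <stdbool.h>
    #include <stdio.h>

    #define NFRAMES 4

    /* Per-frame bookkeeping for the dirty-bit optimization described above. */
    struct frame {
        int  page;    /* which page currently occupies this frame */
        bool dirty;   /* has it been written since it was loaded?  */
    };

    static int disk_writes = 0;

    /* Reclaim a frame chosen by some replacement policy.
       Only dirty victims cost an extra disk write.                          */
    static void evict(struct frame *f)
    {
        if (f->dirty) {
            disk_writes++;            /* write the victim page back to disk  */
            f->dirty = false;
        }
        f->page = -1;                 /* frame is now free for the new page  */
    }

    int main(void)
    {
        struct frame frames[NFRAMES] = {
            { 7, true }, { 3, false }, { 9, true }, { 1, false }
        };
        for (int i = 0; i < NFRAMES; i++) evict(&frames[i]);
        printf("disk writes needed: %d of %d evictions\n", disk_writes, NFRAMES);
        return 0;
    }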
• There are two major requirements to implement a successful demand paging
system. We must develop a frame-allocation algorithm and a page-
replacement algorithm. The former centers around how many frames are
allocated to each process ( and to other needs ), and the latter deals with how
to select a page for replacement when there are no free frames available.
• The overall goal in selecting and tuning these algorithms is to generate the
fewest number of overall page faults. Because disk access is so slow relative
to memory access, even slight improvements to these algorithms can yield
large improvements in overall system performance.
• Algorithms are evaluated using a given string of memory accesses known as
a reference string, which can be generated in one of ( at least ) three
common ways:
1. Randomly generated, either evenly distributed or with some
distribution curve based on observed system behavior. This is the
fastest and easiest approach, but may not reflect real performance
well, as it ignores locality of reference.
2. Specifically designed sequences. These are useful for illustrating the
properties of comparative algorithms in published papers and
textbooks, ( and also for homework and exam problems. :-) )
3. Recorded memory references from a live system. This may be the best
approach, but the amount of data collected can be enormous, on the
order of a million addresses per second. The volume of collected data
can be reduced by making two important observations:
1. Only the page number that was accessed is relevant. The offset
within that page does not affect paging operations.
2. Successive accesses within the same page can be treated as a
single page request, because all requests after the first are
guaranteed to be page hits. ( Since there are no intervening
requests for other pages that could remove this page from the
page table. )
▪ So for example, if pages were of size 100 bytes, then the
sequence of address requests ( 0100, 0432, 0101, 0612, 0634,
0688, 0132, 0038, 0420 ) would reduce to page requests ( 1, 4,
1, 6, 1, 0, 4 )
• As the number of available frames increases, the number of page faults
should decrease, as shown in Figure 9.11:
Figure 9.11 - Graph of page faults versus number of frames.
• Although FIFO is simple and easy, it is not always optimal, or even efficient.
• An interesting effect that can occur with FIFO is Belady's anomaly, in which
increasing the number of frames available can actually increase the number
of page faults that occur! Consider, for example, the following chart based
on the page sequence ( 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 ) and a varying number
of available frames. Obviously the maximum number of faults is 12 ( every
request generates a fault ), and the minimum number is 5 ( each page loaded
only once ), but in between there are some interesting results:
Figure 9.13 - Page-fault curve for FIFO replacement on a reference string.
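The anomaly is easy to reproduce with a short C program; this sketch counts FIFO faults for the reference string above with 3 and then 4 frames ( expect 9 and 10 faults, respectively ):

    #include <stdio.h>

    /* Count page faults for FIFO replacement on a reference string. */
    static int fifo_faults(const int *refs, int n, int nframes)
    {
        int frames[16], next = 0, faults = 0;
        for (int i = 0; i < nframes; i++) frames[i] = -1;

        for (int r = 0; r < n; r++) {
            int hit = 0;
            for (int i = 0; i < nframes; i++)
                if (frames[i] == refs[r]) hit = 1;
            if (!hit) {
                frames[next] = refs[r];          /* overwrite the oldest resident */
                next = (next + 1) % nframes;
                faults++;
            }
        }
        return faults;
    }

    int main(void)
    {
        int refs[] = { 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 };
        int n = sizeof refs / sizeof refs[0];
        printf("3 frames: %d faults\n", fifo_faults(refs, n, 3));   /* 9  */
        printf("4 frames: %d faults\n", fifo_faults(refs, n, 4));   /* 10 */
        return 0;
    }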
• The discovery of Belady's anomaly led to the search for an optimal page-
replacement algorithm, which is the algorithm that yields the lowest possible
page-fault rate and does not suffer from Belady's anomaly.
• Such an algorithm does exist, and is called OPT or MIN. This algorithm is
simply "Replace the page that will not be used for the longest time in the
future."
• For example, Figure 9.14 shows that by applying OPT to the same reference
string used for the FIFO example, the minimum number of possible page
faults is 9. Since 6 of the page-faults are unavoidable ( the first reference to
each new page ), FIFO can be shown to require 3 times as many ( extra )
page faults as the optimal algorithm. ( Note: The book claims that only the
first three page faults are required by all algorithms, indicating that FIFO is
only twice as bad as OPT. )
• Unfortunately OPT cannot be implemented in practice, because it requires
foretelling the future, but it makes a nice benchmark for the comparison and
evaluation of real proposed new algorithms.
• In practice most page-replacement algorithms try to approximate OPT by
predicting ( estimating ) in one fashion or another what page will not be used
for the longest period of time. The basis of FIFO is the prediction that the
page that was brought in the longest time ago is the one that will not be
needed again for the longest future time, but as we shall see, there are many
other prediction methods, all striving to match the performance of OPT.
Figure 9.14 - Optimal page-replacement algorithm
• The prediction behind LRU, the Least Recently Used, algorithm is that the
page that has not been used in the longest time is the one that will not be
used again in the near future. ( Note the distinction between FIFO and LRU:
The former looks at the oldest load time, and the latter looks at the
oldest use time. )
• Some view LRU as analogous to OPT, except looking backwards in time
instead of forwards. ( OPT has the interesting property that for any reference
string S and its reverse R, OPT will generate the same number of page faults
for S and for R. It turns out that LRU has this same property. )
• Figure 9.15 illustrates LRU for our sample string, yielding 12 page faults,
( as compared to 15 for FIFO and 9 for OPT. )
Figure 9.16 - Use of a stack to record the most recent page references.
9.4.5.1 Additional-Reference-Bits Algorithm
• Finer grain is possible by storing the most recent 8 reference bits for
each page in an 8-bit byte in the page table entry, which is interpreted
as an unsigned int.
o At periodic intervals ( clock interrupts ), the OS takes over, and
right-shifts each of the reference bytes by one bit.
o The high-order ( leftmost ) bit is then filled in with the current
value of the reference bit, and the reference bits are cleared.
o At any given time, the page with the smallest value for the
reference byte is the LRU page.
• Obviously the specific number of bits used and the frequency with
which the reference byte is updated are adjustable, and are tuned to
give the fastest performance on a given hardware platform.
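A minimal sketch of the aging step described above, run at each clock interrupt; the table size and the names are assumptions for illustration:

    #include <stdint.h>

    #define NPAGES 1024                       /* assumed table size            */

    static uint8_t history[NPAGES];           /* 8 most recent reference bits  */
    static uint8_t referenced[NPAGES];        /* hardware reference bit (0/1)  */

    /* Called on every clock interrupt: shift in the current reference bit.   */
    void age_reference_bits(void)
    {
        for (int i = 0; i < NPAGES; i++) {
            history[i] = (uint8_t)((history[i] >> 1) | (referenced[i] << 7));
            referenced[i] = 0;                /* clear for the next interval   */
        }
    }

    /* The approximate LRU victim is the page with the smallest history value. */
    int pick_lru_victim(void)
    {
        int victim = 0;
        for (int i = 1; i < NPAGES; i++)
            if (history[i] < history[victim]) victim = i;
        return victim;
    }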
Figure 9.17 - Second-chance ( clock ) page-replacement algorithm.
There are a number of page-buffering algorithms that can be used in conjunction with the
afore-mentioned algorithms, to improve overall performance and sometimes make up for
inherent weaknesses in the hardware and/or the underlying page-replacement algorithms:
We said earlier that there were two important tasks in virtual memory management: a
page-replacement strategy and a frame-allocation strategy. This section covers the second
part of that pair.
• The above arguments all assume that all memory is equivalent, or at least
has equivalent access times.
• This may not be the case in multiple-processor systems, especially where
each CPU is physically located on a separate circuit board which also holds
some portion of the overall system memory.
• In these latter systems, CPUs can access memory that is physically located
on the same board much faster than the memory on the other boards.
• The basic solution is akin to processor affinity - At the same time that we try
to schedule processes on the same CPU to minimize cache misses, we also
try to allocate memory for those processes on the same boards, to minimize
access times.
• The presence of threads complicates the picture, especially when the threads
get loaded onto different processors.
• Solaris uses an lgroup as a solution, in a hierarchical fashion based on
relative latency. For example, all processors and RAM on a single board
would probably be in the same lgroup. Memory assignments are made within
the same lgroup if possible, or to the next nearest lgroup otherwise. ( Where
"nearest" is defined as having the lowest access time. )
9.6 Thrashing
• If a process cannot maintain its minimum required number of frames, then it must
be swapped out, freeing up frames for other processes. This is an intermediate level
of CPU scheduling.
• But what about a process that can keep its minimum, but cannot keep all of the
frames that it is currently using on a regular basis? In this case it is forced to page
out pages that it will need again in the very near future, leading to large numbers of
page faults.
• A process that is spending more time paging than executing is said to be thrashing.
• Early process scheduling schemes would control the level of
multiprogramming allowed based on CPU utilization, adding in more
processes when CPU utilization was low.
• The problem is that when memory filled up and processes started spending
lots of time waiting for their pages to page in, then CPU utilization would
lower, causing the schedule to add in even more processes and exacerbating
the problem! Eventually the system would essentially grind to a halt.
• Local page replacement policies can prevent one thrashing process from
taking pages away from other processes, but it still tends to clog up the I/O
queue, thereby slowing down any other process that needs to do even a little
bit of paging ( or any other I/O for that matter. )
Figure 9.19 - Locality in a memory-reference pattern.
• The working set model is based on the concept of locality, and defines
a working-set window of length delta. Whatever pages are included in the
most recent delta page references are said to be in the process's working-set
window, and comprise its current working set, as illustrated in Figure 9.20:
Figure 9.20 - Working-set model.
• The selection of delta is critical to the success of the working set model - If
it is too small then it does not encompass all of the pages of the current
locality, and if it is too large, then it encompasses pages that are no longer
being frequently accessed.
• The total demand, D, is the sum of the sizes of the working sets for all
processes. If D exceeds the total number of available frames, then at least
one process is thrashing, because there are not enough frames available to
satisfy its minimum working set. If D is significantly less than the currently
available frames, then additional processes can be launched.
• The hard part of the working-set model is keeping track of what pages are in
the current working set, since every reference adds one to the set and
removes one older page. An approximation can be made using reference bits
and a timer that goes off after a set interval of memory references:
o For example, suppose that we set the timer to go off after every 5000
references ( by any process ), and we can store two additional
historical reference bits in addition to the current reference bit.
o Every time the timer goes off, the current reference bit is copied to
one of the two historical bits, and then cleared.
o If any of the three bits is set, then that page was referenced within the
last 15,000 references, and is considered to be in that process's
working set.
o Finer resolution can be achieved with more historical bits and a more
frequent timer, at the expense of greater overhead.
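A sketch of that timer-based approximation, using the 5,000-reference interval and two historical bits from the example; all names and sizes are assumptions for illustration:

    #include <stdbool.h>

    #define NPAGES 1024                        /* assumed per-process table size */

    static unsigned char ref_now[NPAGES];      /* hardware reference bit         */
    static unsigned char ref_hist[NPAGES][2];  /* two historical copies          */

    /* Invoked by the timer after every 5,000 memory references. */
    void working_set_tick(void)
    {
        for (int i = 0; i < NPAGES; i++) {
            ref_hist[i][1] = ref_hist[i][0];   /* age the historical bits        */
            ref_hist[i][0] = ref_now[i];
            ref_now[i] = 0;                    /* clear the current bit          */
        }
    }

    /* A page is treated as part of the working set if it was referenced
       within roughly the last 15,000 references.                               */
    bool in_working_set(int page)
    {
        return ref_now[page] || ref_hist[page][0] || ref_hist[page][1];
    }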
Figure 9.21 - Page-fault frequency.
• Note that there is a direct relationship between the page-fault rate and the
working-set, as a process moves from one locality to another:
UNIT IV
Mass-Storage Structure
Figure 10.1 - Moving-head disk mechanism.
• In operation the disk rotates at high speed, such as 7200 rpm ( 120 revolutions
per second. ) The rate at which data can be transferred from the disk to the
computer is composed of several steps:
o The positioning time, a.k.a. the seek time or random access time is the
time required to move the heads from one cylinder to another, and for
the heads to settle down after the move. This is typically the slowest step
in the process and the predominant bottleneck to overall transfer rates.
o The rotational latency is the amount of time required for the desired
sector to rotate around and come under the read-write head. This can
range anywhere from zero to one full revolution, and on the average will
equal one-half revolution. This is another physical step and is usually the
second slowest step behind seek time. ( For a disk rotating at 7200 rpm,
the average rotational latency would be 1/2 revolution / 120 revolutions
per second, or just over 4 milliseconds, a long time by computer
standards. )
o The transfer rate, which is the time required to move the data
electronically from the disk to the computer. ( Some authors may also use
the term transfer rate to refer to the overall transfer rate, including seek
time and rotational latency as well as the electronic data transfer rate. )
• Disk heads "fly" over the surface on a very thin cushion of air. If they should
accidentally contact the disk, then a head crash occurs, which may or may not
permanently damage the disk or even destroy it completely. For this reason it is
normal to park the disk heads when turning a computer off, which means to
move the heads off the disk or to an area of the disk where there is no data
stored.
• Floppy disks are normally removable. Hard drives can also be removable, and
some are even hot-swappable, meaning they can be removed while the
computer is running, and a new hard drive inserted in their place.
• Disk drives are connected to the computer via a cable known as the I/O
bus. Some of the common interface formats include Enhanced Integrated Drive
Electronics, EIDE; Advanced Technology Attachment, ATA; Serial ATA, SATA;
Universal Serial Bus, USB; Fibre Channel, FC; and Small Computer Systems
Interface, SCSI.
• The host controller is at the computer end of the I/O bus, and the disk
controller is built into the disk itself. The CPU issues commands to the host
controller via I/O ports. Data is transferred between the magnetic surface and
onboard cache by the disk controller, and then the data is transferred from that
cache to the host controller and the motherboard memory at electronic speeds.
• As technologies improve and economics change, old technologies are often used
in different ways. One example of this is the increasing use of solid-state disks,
or SSDs.
• SSDs use memory technology as a small fast hard disk. Specific implementations
may use either flash memory or DRAM chips protected by a battery to sustain
the information through power cycles.
• Because SSDs have no moving parts they are much faster than traditional hard
drives, and certain problems such as the scheduling of disk accesses simply do
not apply.
• However SSDs also have their weaknesses: They are more expensive than hard
drives, generally not as large, and may have shorter life spans.
• SSDs are especially useful as a high-speed cache of hard-disk information that
must be accessed quickly. One example is to store filesystem meta-data, e.g.
directory and inode information, that must be accessed quickly and often.
Another variation is a boot disk containing the OS and some application
executables, but no vital user data. SSDs are also used in laptops to make them
smaller, faster, and lighter.
• Because SSDs are so much faster than traditional hard disks, the throughput of
the bus can become a limiting factor, causing some SSDs to be connected
directly to the system PCI bus for example.
• Magnetic tapes were once used for common secondary storage before the days
of hard disk drives, but today are used primarily for backups.
• Accessing a particular spot on a magnetic tape can be slow, but once reading or
writing commences, access speeds are comparable to disk drives.
• Capacities of tape drives can range from 20 to 200 GB, and compression can
double that capacity.
Disk drives can be attached either directly to a particular host ( a local disk ) or to a
network.
10.3.1 Host-Attached Storage
Figure 10.2 - Network-attached storage.
• Bandwidth is measured by the amount of data transferred divided by the total
amount of time from the first request being made to the last transfer being
completed, ( for a series of disk requests. )
• Both bandwidth and access time can be improved by processing requests in a
good order.
• Disk requests include the disk address, memory address, number of sectors to
transfer, and whether the request is for reading or writing.
• First-Come First-Serve is simple and intrinsically fair, but not very efficient.
Consider in the following sequence the wild swing from cylinder 122 to 14 and
then back to 124:
• Shortest Seek Time First scheduling is more efficient, but may lead to starvation
if a constant stream of requests arrives for the same general area of the disk.
• SSTF reduces the total head movement to 236 cylinders, down from 640
required for the same set of requests under FCFS. Note, however that the
distance could be reduced still further to 208 by starting with 37 and then 14
first before processing the rest of the requests.
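The 640 and 236 cylinder totals can be verified with a short C program; the request queue 98, 183, 37, 122, 14, 124, 65, 67 and the starting head position of 53 are assumed here to be the textbook example that those totals refer to:

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int fcfs(int start, const int *q, int n)
    {
        int total = 0, pos = start;
        for (int i = 0; i < n; i++) { total += abs(q[i] - pos); pos = q[i]; }
        return total;
    }

    static int sstf(int start, const int *q, int n)
    {
        int total = 0, pos = start;
        bool done[32] = { false };
        for (int served = 0; served < n; served++) {
            int best = -1;
            for (int i = 0; i < n; i++)
                if (!done[i] && (best < 0 || abs(q[i] - pos) < abs(q[best] - pos)))
                    best = i;                      /* closest pending request  */
            total += abs(q[best] - pos);
            pos = q[best];
            done[best] = true;
        }
        return total;
    }

    int main(void)
    {
        int q[] = { 98, 183, 37, 122, 14, 124, 65, 67 };
        int n = sizeof q / sizeof q[0];
        printf("FCFS: %d cylinders\n", fcfs(53, q, n));   /* 640 */
        printf("SSTF: %d cylinders\n", sstf(53, q, n));   /* 236 */
        return 0;
    }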
Figure 10.5 - SSTF disk scheduling.
• The SCAN algorithm, a.k.a. the elevator algorithm moves back and forth from
one end of the disk to the other, similarly to an elevator processing requests in a
tall building.
• Under the SCAN algorithm, If a request arrives just ahead of the moving head
then it will be processed right away, but if it arrives just after the head has
passed, then it will have to wait for the head to pass going the other way on the
return trip. This leads to a fairly wide variation in access times which can be
improved upon.
• Consider, for example, when the head reaches the high end of the disk: Requests
with high cylinder numbers just missed the passing head, which means they are
all fairly recent requests, whereas requests with low numbers may have been
waiting for a much longer time. Making the return scan from high to low then
ends up accessing recent requests first and making older requests wait that
much longer.
• LOOK scheduling improves upon SCAN by looking ahead at the queue of pending
requests, and not moving the heads any farther towards the end of the disk than
is necessary. The following diagram illustrates the circular form of LOOK:
Figure 10.8 - C-LOOK disk scheduling.
• With very low loads all algorithms are equal, since there will normally only be
one request to process at a time.
• For slightly larger loads, SSTF offers better performance than FCFS, but may lead
to starvation when loads become heavy enough.
• For busier systems, SCAN and LOOK algorithms eliminate starvation problems.
• The actual optimal algorithm may be something even more complex than those
discussed here, but the incremental improvements are generally not worth the
additional overhead.
• Some improvement to overall filesystem access times can be made by intelligent
placement of directory and/or inode information. If those structures are placed
in the middle of the disk instead of at the beginning of the disk, then the
maximum distance from those structures to data blocks is reduced to only one-
half of the disk size. If those structures can be further distributed and
furthermore have their data blocks stored as close as possible to the
corresponding directory structures, then that reduces still further the overall
time to find the disk block numbers and then access the corresponding data
blocks.
• On modern disks the rotational latency can be almost as significant as the seek
time, however it is not within the OSes control to account for that, because
modern disks do not reveal their internal sector mapping schemes, ( particularly
when bad blocks have been remapped to spare sectors. )
o Some disk manufacturers provide for disk scheduling algorithms directly
on their disk controllers, ( which do know the actual geometry of the disk
as well as any remapping ), so that if a series of requests are sent from the
computer to the controller then those requests can be processed in an
optimal order.
o Unfortunately there are some considerations that the OS must take into
account that are beyond the abilities of the on-board disk-scheduling
algorithms, such as priorities of some requests over others, or the need to
process certain requests in a particular order. For this reason OSes may
elect to spoon-feed requests to the disk controller one at a time in certain
situations.
• The first sector on the hard drive is known as the Master Boot Record,
MBR, and contains a very small amount of code in addition to the partition
table. The partition table documents how the disk is partitioned into logical
disks, and indicates specifically which partition is
the active or boot partition.
• The boot program then looks to the active partition to find an operating
system, possibly loading up a slightly larger / more advanced boot program
along the way.
• In a dual-boot ( or larger multi-boot ) system, the user may be given a
choice of which operating system to boot, with a default action to be taken
in the event of no response within some time frame.
• Once the kernel is found by the boot program, it is loaded into memory and
then control is transferred over to the OS. The kernel will normally
continue the boot process by initializing all important kernel data
structures, launching important system services ( e.g. network daemons,
sched, init, etc. ), and finally providing one or more login prompts. Boot
options at this stage may include single-user, a.k.a. maintenance or safe
modes, in which very few system services
are started - These modes are designed for system administrators to repair
problems or otherwise maintain the system.
• Modern systems typically swap out pages as needed, rather than swapping out
entire processes. Hence the swapping system is part of the virtual memory
management system.
• Managing swap space is obviously an important task for modern OSes.
• The general idea behind RAID is to employ a group of hard drives together with
some form of duplication, either to increase reliability or to speed up operations,
( or sometimes both. )
• RAID originally stood for Redundant Array of Inexpensive Disks, and was
designed to use a bunch of cheap small disks in place of one or two larger more
expensive ones. Today RAID systems employ large possibly expensive disks as
their components, switching the definition to Independent disks.
10.7.1 Improvement of Reliability via Redundancy
• The more disks a system has, the greater the likelihood that one of them
will go bad at any given time. Hence increasing disks on a system
actually decreases the Mean Time To Failure, MTTF of the system.
• If, however, the same data was copied onto multiple disks, then the data
would not be lost unless both ( or all ) copies of the data were damaged
simultaneously, which is a MUCH lower probability than for a single disk
going bad. More specifically, the second disk would have to go bad before
the first disk was repaired, which brings the Mean Time To Repair into
play. For example, if two disks were involved, each with a MTTF of
100,000 hours and a MTTR of 10 hours, then the Mean Time To Data
Loss would be MTTF^2 / ( 2 * MTTR ) = 100,000^2 / ( 2 * 10 ) =
500 * 10^6 hours, or 57,000 years!
• This is the basic idea behind disk mirroring, in which a system contains
identical data on two or more disks.
o Note that a power failure during a write operation could cause both
disks to contain corrupt data, if both disks were writing
simultaneously at the time of the power failure. One solution is to
write to the two disks in series, so that they will not both become
corrupted ( at least not in the same way ) by a power failure. An
alternative solution involves non-volatile RAM as a write cache,
which is not lost in the event of a power failure and which is
protected by error-correcting codes.
File Concept
11.1.2 File Operations
Figure 11.2 - File-locking example in Java.
11.1.3 File Types
• Macintosh stores a creator attribute for each file, according to the program
that first created it with the create( ) system call.
• UNIX stores magic numbers at the beginning of certain files. ( Experiment
with the "file" command, especially in directories such as /bin and /dev )
• Some files contain an internal structure, which may or may not be known
to the OS.
• For the OS to support particular file formats increases the size and
complexity of the OS.
• UNIX treats all files as sequences of bytes, with no further consideration of
the internal structure. ( With the exception of executable binary programs,
which it must know how to load and find the first executable statement, etc.
)
• Macintosh files have two forks - a resource fork, and a data fork. The
resource fork contains information relating to the UI, such as icons and
button images, and can be modified independently of the data fork, which
contains the code or data as appropriate.
• Disk files are accessed in units of physical blocks, typically 512 bytes or
some power-of-two multiple thereof. ( Larger physical disks use larger
block sizes, to keep the range of block numbers within the range of a 32-bit
integer. )
• Internally files are organized in units of logical units, which may be as
small as a single byte, or may be a larger size corresponding to some data
record or structure size.
• The number of logical units which fit into one physical block determines
its packing, and has an impact on the amount of internal fragmentation
( wasted space ) that occurs.
• As a general rule, half a physical block is wasted for each file, and the
larger the block sizes the more space is lost to internal fragmentation.
11.3.2 Directory Overview
Figure 11.9 - Single-level directory.
• An obvious extension to the two-tiered directory structure, and the one with
which we are all most familiar.
• Each user / process has the concept of a current directory from which all
( relative ) searches take place.
• Files may be accessed using either absolute pathnames ( relative to the root
of the tree ) or relative pathnames ( relative to the current directory. )
• Directories are stored the same as any other file in the system, except there
is a bit that identifies them as directories, and they have some special
structure that the OS understands.
• One question for consideration is whether or not to allow the removal of
directories that are not empty - Windows requires that directories be
emptied first, and UNIX provides an option for deleting entire sub-trees.
• When the same files need to be accessed in more than one place in the
directory structure ( e.g. because they are being shared by more than one
user / process ), it can be useful to provide an acyclic-graph structure.
( Note the directed arcs from parent to child. )
• UNIX provides two types of links for implementing the acyclic-graph
structure. ( See "man ln" for more details. )
o A hard link ( usually just called a link ) involves multiple directory
entries that both refer to the same file. Hard links are only valid for
ordinary files in the same filesystem.
o A symbolic link involves a special file, containing information
about where to find the linked file. Symbolic links may be used to
link directories and/or files in other filesystems, as well as ordinary
files in the current filesystem.
• Windows only supports symbolic links, termed shortcuts.
• Hard links require a reference count, or link count for each file, keeping
track of how many directory entries are currently referring to this file.
Whenever one of the references is removed the link count is reduced, and
when it reaches zero, the disk space can be reclaimed.
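The link count can be observed directly through the POSIX link( ), symlink( ), and stat( ) calls; a small sketch ( the file names are arbitrary and error handling is omitted ):

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *f = fopen("original.txt", "w");     /* create a file: count = 1 */
        if (f) fclose(f);

        link("original.txt", "hardlink.txt");     /* second directory entry   */
        symlink("original.txt", "symlink.txt");   /* separate special file    */

        struct stat st;
        if (stat("original.txt", &st) == 0)
            printf("link count: %ld\n", (long)st.st_nlink);   /* prints 2     */

        unlink("hardlink.txt");                   /* count drops back to 1    */
        unlink("symlink.txt");
        unlink("original.txt");                   /* count 0: space reclaimed */
        return 0;
    }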
• For symbolic links there is some question as to what to do with the
symbolic links when the original file is moved or deleted:
o One option is to find all the symbolic links and adjust them also.
o Another is to leave the symbolic links dangling, and discover that
they are no longer valid the next time they are used.
o What if the original file is removed, and replaced with another file
having the same name before the symbolic link is next used?
• If cycles are allowed in the graphs, then several problems can arise:
o Search algorithms can go into infinite loops. One solution is to not
follow links in search algorithms. ( Or not to follow symbolic links,
and to only allow symbolic links to refer to directories. )
o Sub-trees can become disconnected from the rest of the tree and still
not have their reference counts reduced to zero. Periodic garbage
collection is required to detect and resolve this problem. ( chkdsk in
DOS and fsck in UNIX search for these problems, among others,
even though cycles are not supposed to be allowed in either system.
Disconnected disk blocks that are not marked as free are added back
to the file systems with made-up file names, and can usually be
safely deleted. )
Figure 11.13 - General graph directory.
• The basic idea behind mounting file systems is to combine multiple file systems
into one large tree structure.
• The mount command is given a filesystem to mount and a mount point ( directory
) on which to attach it.
• Once a file system is mounted onto a mount point, any further references to that
directory actually refer to the root of the mounted file system.
• Any files ( or sub-directories ) that had been stored in the mount point directory
prior to mounting the new filesystem are now hidden by the mounted filesystem,
and are no longer available. For this reason some systems only allow mounting
onto empty directories.
• Filesystems can only be mounted by root, unless root has previously configured
certain filesystems to be mountable onto certain pre-determined mount points. (
E.g. root may allow users to mount floppy filesystems to /mnt or something like
it. ) Anyone can run the mount command to see what filesystems are currently
mounted.
• Filesystems may be mounted read-only, or have other restrictions imposed.
Figure 11.14 - File system. (a) Existing system. (b) Unmounted volume.
• More recent Windows systems allow filesystems to be mounted to any directory
in the filesystem, much like UNIX.
• The advent of the Internet introduces issues for accessing files stored on
remote computers
o The original method was ftp, allowing individual files to be
transported across systems as needed. Ftp can be either account and
password controlled, or anonymous, not requiring any user name or
password.
o Various forms of distributed file systems allow remote file systems
to be mounted onto a local directory structure, and accessed using
normal file access commands. ( The actual files are still transported
across the network as needed, possibly using ftp as the underlying
transport mechanism. )
o The WWW has made it easy once again to access files on remote
systems without mounting their filesystems, generally using
( anonymous ) ftp as the underlying file transport mechanism.
o Servers commonly restrict mount permission to certain trusted
systems only. Spoofing ( a computer pretending to be a
different computer ) is a potential security risk.
o Servers may restrict remote access to read-only.
o Servers restrict which filesystems may be remotely mounted.
Generally the information within those subsystems is limited,
relatively public, and protected by frequent backups.
• The NFS ( Network File System ) is a classic example of such a
system.
11.6 Protection
• Files must be kept safe for reliability ( against accidental damage ), and protection
( against deliberate malicious access. ) The former is usually managed with
backup copies. This section discusses the latter.
• One simple protection scheme is to remove all access to a file. However this
makes the file unusable, so some sort of controlled access must be arranged.
• In addition there are some special bits that can also be applied:
o The set user ID ( SUID ) bit and/or the set group ID ( SGID ) bits
applied to executable files temporarily change the identity of
whoever runs the program to match that of the owner / group of the
executable program. This allows users running specific programs to
have access to files ( while running that program ) which they
would normally be unable to access. Setting of these two bits is
usually restricted to root, and must be done with caution, as it
introduces a potential security hole.
o The sticky bit on a directory modifies write permission, allowing
users to only delete files for which they are the owner. This allows
everyone to create files in /tmp, for example, but to only delete files
which they have created, and not anyone else's.
o The SUID, SGID, and sticky bits are indicated in the execute-permission
positions for the user, group, and others, respectively, in a long listing.
A lower-case letter ( s, s, t ) means the corresponding execute permission
is also given; an upper-case letter ( S, S, T ) means the corresponding
execute permission is NOT given.
o The numeric form of chmod is needed to set these advanced bits.
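• A minimal sketch of setting these bits from C with chmod( ), using the symbolic mode constants from <sys/stat.h>; the paths are illustrative, and the calls normally require ownership of the files or root privileges:

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    /* Set-UID executable: mode 4755 ( rwsr-xr-x ) */
    mode_t suid_mode = S_ISUID | S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH;
    if (chmod("/usr/local/bin/myprog", suid_mode) != 0)
        perror("chmod myprog");

    /* Sticky bit on a world-writable directory: mode 1777 ( rwxrwxrwt ) */
    if (chmod("/srv/shared", S_ISVTX | S_IRWXU | S_IRWXG | S_IRWXO) != 0)
        perror("chmod shared directory");

    return 0;
}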
Figure 11.16 - Windows 7 access-control list management.
12.1 File-System Structure
• Hard disks have two important properties that make them suitable for secondary
storage of files in file systems: (1) Blocks of data can be rewritten in place, and
(2) they are direct access, allowing any block of data to be accessed with only (
relatively ) minor movements of the disk heads and rotational latency. ( See
Chapter 12 )
• Disks are usually accessed in physical blocks, rather than a byte at a time. Block
sizes may range from 512 bytes to 4K or larger.
• File systems organize storage on disk drives, and can be viewed as a layered
design:
o At the lowest layer are the physical devices, consisting of the magnetic
media, motors & controls, and the electronics connected to them and
controlling them. Modern disks put more and more of the electronic
controls directly on the disk drive itself, leaving relatively little work for
the disk controller card to perform.
o I/O Control consists of device drivers, special software programs ( often
written in assembly ) which communicate with the devices by reading and
writing special codes directly to and from memory addresses
corresponding to the controller card's registers. Each controller card
( device ) on a system has a different set of addresses ( registers,
a.k.a. ports ) that it listens to, and a unique set of command codes and
results codes that it understands.
o The basic file system level works directly with the device drivers in terms
of retrieving and storing raw blocks of data, without any consideration for
what is in each block. Depending on the system, blocks may be referred to
with a single block number, ( e.g. block # 234234 ), or with head-sector-
cylinder combinations.
o The file organization module knows about files and their logical blocks,
and how they map to physical blocks on the disk. In addition to translating
from logical to physical blocks, the file organization module also maintains
the list of free blocks, and allocates free blocks to files as needed.
o The logical file system deals with all of the meta data associated with a
file ( UID, GID, mode, dates, etc ), i.e. everything about the file except the
data itself. This level manages the directory structure and the mapping of
file names to file control blocks, FCBs, which contain all of the meta data
as well as block number information for finding the data on the disk.
• The layered approach to file systems means that much of the code can be used
uniformly for a wide variety of different file systems, and only certain layers
need to be filesystem specific. Common file systems in use include the UNIX file
system, UFS, the Berkeley Fast File System, FFS, the Windows file systems FAT,
FAT32, and NTFS, the CD-ROM format ISO 9660, and for Linux the extended file
systems ext2 and ext3 ( among some 40 others supported. )
Figure 12.1 - Layered file system.
12.2.1 Overview
Figure 12.2 - A typical file-control block.
Figure 12.3 - In-memory file-system structures. (a) File open. (b) File read.
• Physical disks are commonly divided into smaller units called partitions. They can
also be combined into larger units, but that is most commonly done for RAID
installations and is left for later chapters.
• Partitions can either be used as raw devices ( with no structure imposed upon
them ), or they can be formatted to hold a filesystem ( i.e. populated with FCBs
and initial directory structures as appropriate. ) Raw partitions are generally used
for swap space, and may also be used for certain programs such as databases
that choose to manage their own disk storage system. Partitions containing
filesystems can generally only be accessed using the file system structure by
ordinary users, but can often be accessed as a raw device also by root.
• The boot block is accessed as part of a raw partition, by the boot program prior
to any operating system being loaded. Modern boot programs understand
multiple OSes and filesystem formats, and can give the user a choice of which of
several available systems to boot.
• The root partition contains the OS kernel and at least the key portions of the OS
needed to complete the boot process. At boot time the root partition is
mounted, and control is transferred from the boot program to the kernel found
there. ( Older systems required that the root partition lie completely within the
first 1024 cylinders of the disk, because that was as far as the boot program
could reach. Once the kernel had control, then it could access partitions beyond
the 1024 cylinder boundary. )
• Continuing with the boot process, additional filesystems get mounted, adding
their information into the appropriate mount table structure. As a part of the
mounting process the file systems may be checked for errors or inconsistencies,
either because they are flagged as not having been closed properly the last time
they were used, or just as a general principle. Filesystems may be mounted
either automatically or manually. In UNIX a mount point is indicated by setting a
flag in the in-memory copy of the inode, so all future references to that inode
get re-directed to the root directory of the mounted filesystem.
Figure 12.4 - Schematic view of a virtual file system.
• A linear list is the simplest and easiest directory structure to set up, but it does
have some drawbacks.
• Finding a file ( or verifying one does not already exist upon creation ) requires a
linear search.
• Deletions can be done by moving all entries, flagging an entry as deleted, or by
moving the last entry into the newly vacant position.
• Sorting the list makes searches faster, at the expense of more complex insertions
and deletions.
• A linked list makes insertions and deletions into a sorted list easier, with
overhead for the links.
• More complex data structures, such as B-trees, could also be considered.
• A hash table can also be used to speed up searches.
• Hash tables are generally implemented in addition to a linear or other structure.
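• The following is a minimal sketch of a linear-list directory and its linear search; the entry layout and sizes are illustrative, not those of any real file system:

#include <string.h>

#define MAX_NAME    32
#define MAX_ENTRIES 128

struct dir_entry {
    char name[MAX_NAME];   /* file name                       */
    int  fcb_index;        /* index of the file's FCB / inode */
    int  in_use;           /* 0 = free slot, 1 = occupied     */
};

struct directory {
    struct dir_entry entries[MAX_ENTRIES];
};

/* Linear search: O(n) in the number of directory entries. */
int dir_lookup(const struct directory *d, const char *name)
{
    for (int i = 0; i < MAX_ENTRIES; i++)
        if (d->entries[i].in_use && strcmp(d->entries[i].name, name) == 0)
            return d->entries[i].fcb_index;
    return -1;    /* not found */
}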
• There are three major methods of storing files on disks: contiguous, linked, and
indexed.
• Contiguous Allocation requires that all blocks of a file be kept together contiguously.
• Performance is very fast, because reading successive blocks of the same file generally requires no
movement of the disk heads, or at most one small step to the next adjacent cylinder.
• Storage allocation involves the same issues discussed earlier for the allocation of contiguous blocks
of memory ( first fit, best fit, fragmentation problems, etc. ) The distinction is that the high time penalty
required for moving the disk heads from spot to spot may now justify the benefits of keeping files
contiguously when possible.
• ( Even file systems that do not by default store files contiguously can benefit from certain utilities that
compact the disk and make all files contiguous in the process. )
• Problems can arise when files grow, or if the exact size of a file is unknown at creation time:
o Over-estimation of the file's final size increases external fragmentation and wastes disk space.
o Under-estimation may require that a file be moved or a process aborted if the file grows
beyond its originally allocated space.
o If a file grows slowly over a long time period and the total final space must be allocated
initially, then a lot of space becomes unusable before the file fills the space.
• A variation is to allocate file space in large contiguous chunks, called extents. When a file outgrows
its original extent, then an additional one is allocated. ( For example an extent may be the size of a
complete track or even cylinder, aligned on an appropriate track or cylinder boundary. ) The high-
performance Veritas file system uses extents to optimize performance.
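• Because a contiguously allocated file is described entirely by its starting block and length, mapping a logical block to a physical block is a single addition, as in this minimal sketch ( the types and field names are illustrative ):

struct contig_file {
    int start;      /* first physical block of the file */
    int length;     /* number of blocks allocated       */
};

/* Map a logical block number within the file to a physical block number;
   returns -1 if the logical block lies outside the allocation. */
int contig_block(const struct contig_file *f, int logical_block)
{
    if (logical_block < 0 || logical_block >= f->length)
        return -1;
    return f->start + logical_block;
}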
Figure 12.5 - Contiguous allocation of disk space.
12.4.2 Linked Allocation
• Disk files can be stored as linked lists, with the expense of the storage space consumed by each link.
( E.g. a 512-byte block may hold only 508 bytes of data, with 4 bytes used for the pointer to the next block. )
• Linked allocation involves no external fragmentation, does not require pre-known file sizes, and
allows files to grow dynamically at any time.
• Unfortunately linked allocation is only efficient for sequential access files, as random access requires
starting at the beginning of the list for each new location access.
• Allocating clusters of blocks reduces the space wasted by pointers, at the cost of internal
fragmentation.
• Another big problem with linked allocation is reliability if a pointer is lost or damaged. Doubly linked
lists provide some protection, at the cost of additional overhead and wasted space.
Figure 12.6 - Linked allocation of disk space.
• The File Allocation Table, FAT, used by DOS is a variation of linked allocation, where all the links
are stored in a separate table at the beginning of the disk. The benefit of this approach is that the FAT
table can be cached in memory, greatly improving random access speeds.
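• A minimal sketch of following a FAT chain to locate a given logical block of a file; the in-memory fat[ ] array and the FAT_EOF marker are illustrative simplifications of the real on-disk formats:

#define FAT_EOF (-1)

/* fat[i] holds the number of the block that follows block i in its file,
   or FAT_EOF if block i is the last block of its file. */
int fat_block(const int fat[], int first_block, int logical_block)
{
    int b = first_block;
    while (logical_block-- > 0 && b != FAT_EOF)
        b = fat[b];            /* follow the chain, one link per block */
    return b;                  /* FAT_EOF if the file is too short     */
}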
Figure 12.7 File-allocation table.
12.4.3 Indexed Allocation
• Indexed Allocation combines all of the indexes for accessing each file into a common block ( for that
file ), as opposed to spreading them all over the disk or storing them in a FAT table.
Figure 12.8 - Indexed allocation of disk space.
• Some disk space is wasted ( relative to linked lists or FAT tables ) because an entire index block must
be allocated for each file, regardless of how many data blocks the file contains. This leads to questions
of how big the index block should be, and how it should be implemented. There are several
approaches:
o Linked Scheme - An index block is one disk block, which can be read and written in a single
disk operation. The first index block contains some header information, the first N block
addresses, and if necessary a pointer to additional linked index blocks.
o Multi-Level Index - The first index block contains a set of pointers to secondary index blocks,
which in turn contain pointers to the actual data blocks.
o Combined Scheme - This is the scheme used in UNIX inodes, in which the first 12 or so data
block pointers are stored directly in the inode, and then singly, doubly, and triply indirect
pointers provide access to more data blocks as needed. ( See below. ) The advantage of this
scheme is that for small files ( which many are ), the data blocks are readily accessible ( up to
48K with 4K block sizes ); files up to about 4144K ( using 4K blocks ) are accessible with
only a single indirect block ( which can be cached ), and huge files are still accessible using a
relatively small number of disk accesses ( larger in theory than can be addressed by a 32-bit
address, which is why some systems have moved to 64-bit file pointers. )
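• A minimal sketch of the logical-block arithmetic behind those numbers, assuming 12 direct pointers, 4 KB blocks, and 4-byte block pointers ( so each indirect block holds 1024 pointers ):

#include <stdio.h>

#define NDIRECT        12
#define PTRS_PER_BLOCK 1024            /* 4096-byte block / 4-byte pointers */

void classify(long logical_block)
{
    if (logical_block < NDIRECT)
        printf("block %ld: direct pointer ( first 48K )\n", logical_block);
    else if (logical_block < NDIRECT + PTRS_PER_BLOCK)
        printf("block %ld: single indirect ( up to about 4144K )\n", logical_block);
    else if (logical_block < NDIRECT + PTRS_PER_BLOCK
                           + (long)PTRS_PER_BLOCK * PTRS_PER_BLOCK)
        printf("block %ld: double indirect\n", logical_block);
    else
        printf("block %ld: triple indirect\n", logical_block);
}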
Figure 12.9 - The UNIX inode.
12.4.4 Performance
• The optimal allocation method is different for sequential access files than for random access files, and
is also different for small files than for large files.
• Some systems support more than one allocation method, which may require specifying how the file is
to be used ( sequential or random access ) at the time it is allocated. Such systems also provide
conversion utilities.
• Some systems have been known to use contiguous access for small files, and automatically switch to
an indexed scheme when file sizes surpass a certain threshold.
• And of course some systems adjust their allocation schemes ( e.g. block sizes ) to best match the
characteristics of the hardware for optimum performance.
• One simple approach is to use a bit vector, in which each bit represents a
disk block, set to 1 if free or 0 if allocated.
• Fast algorithms exist for quickly finding contiguous blocks of a given size
• The down side is that, for example, a 40-GB disk with 1-KB blocks requires over
5 MB just to store the bitmap.
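• A minimal sketch of a free-space bitmap and a first-free-block search, using the same convention as above ( 1 = free, 0 = allocated ); the sizes are illustrative:

#define NBLOCKS 8192

static unsigned char bitmap[NBLOCKS / 8];    /* one bit per disk block */

/* Return the number of the first free block, or -1 if none are free. */
int first_free_block(void)
{
    for (int i = 0; i < NBLOCKS / 8; i++)
        if (bitmap[i] != 0)                  /* at least one free bit here */
            for (int b = 0; b < 8; b++)
                if (bitmap[i] & (1u << b))
                    return i * 8 + b;
    return -1;
}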
• A linked list can also be used to keep track of all free blocks.
• Traversing the list and/or finding a contiguous block of a given size are not
easy, but fortunately are not frequently needed operations. Generally the
system just adds and removes single blocks from the beginning of the list.
• The FAT table keeps track of the free list as just one more linked list on the
table.
12.5.3 Grouping
• A variation on the free-list approach stores the addresses of n free blocks in
the first free block. The first n-1 of these are actually free; the last one
contains the addresses of the next n free blocks, and so on, so that a large
number of free blocks can be found quickly.
12.5.4 Counting
• When there are multiple contiguous blocks of free space then the system
can keep track of the starting address of the group and the number of
contiguous free blocks. As long as the average length of a contiguous
group of free blocks is greater than two this offers a savings in space
needed for the free list. ( Similar to compression techniques used for
graphics images when a group of pixels all the same color is encountered. )
• Sun's ZFS file system was designed for HUGE numbers and sizes of files,
directories, and even file systems.
• The resulting data structures could be VERY inefficient if not implemented
carefully. For example, freeing up a 1 GB file on a 1 TB file system could
involve updating thousands of blocks of free list bit maps if the file was
spread across the disk.
• ZFS uses a combination of techniques, starting with dividing the disk up
into ( hundreds of ) metaslabs of a manageable size, each having their own
space map.
• Free blocks are managed using the counting technique, but rather than
write the information to a table, it is recorded in a log-structured
transaction record. Adjacent free blocks are also coalesced into a larger
single free block.
• An in-memory space map is constructed using a balanced tree data
structure, constructed from the log data.
• The combination of the in-memory tree and the on-disk log provides
very fast and efficient management of these very large files and free
blocks.
I/O Hardware
• One way of communicating with devices is through registers associated with each
port. Registers may be one to four bytes in size, and may typically include ( a
subset of ) the following four:
1. The data-in register is read by the host to get input from the device.
2. The data-out register is written by the host to send output.
3. The status register has bits read by the host to ascertain the status of the
device, such as idle, ready for input, busy, error, transaction complete, etc.
4. The control register has bits written by the host to issue commands or to
change settings of the device such as parity checking, word length, or full-
versus half-duplex operation.
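• As a minimal sketch, the four registers of a hypothetical memory-mapped controller might be described in C as follows; the layout, bit assignments, and names are invented for illustration ( a real device's datasheet defines them ):

#include <stdint.h>

struct io_port {
    volatile uint8_t data_in;    /* read by the host to get input          */
    volatile uint8_t data_out;   /* written by the host to send output     */
    volatile uint8_t status;     /* busy, error, transaction complete, ... */
    volatile uint8_t control;    /* command bits and device settings       */
};

#define STATUS_BUSY      0x01    /* illustrative bit assignments */
#define CONTROL_WRITE    0x01
#define CONTROL_CMD_RDY  0x02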
• Figure 13.2 shows some of the most common I/O port address ranges.
Figure 13.2 - Device I/O port locations on PCs ( partial ).
13.2.1 Polling
( The full handshake for writing a byte proceeds as follows: )
1. The host repeatedly reads the busy bit in the status register until it
becomes clear.
2. The host sets the write bit in the command register and writes a byte
into the data-out register.
3. The host sets the command-ready bit.
4. When the device controller notices the command-ready bit, it sets the
busy bit.
5. Then the device controller reads the command register, sees the
write bit set, reads the byte of data from the data-out register, and
outputs the byte of data.
6. The device controller then clears the error bit in the status register,
the command-ready bit, and finally clears the busy bit, signaling the
completion of the operation.
• Polling can be very fast and efficient, if both the device and the controller
are fast and if there is significant data to transfer. It becomes inefficient,
however, if the host must wait a long time in the busy loop waiting for the
device, or if frequent checks need to be made for data that is infrequently
there.
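• A minimal sketch of the polling ( programmed I/O ) handshake above, reusing the hypothetical io_port structure and bit definitions sketched earlier ( all names are invented for illustration ):

#include <stdint.h>

void polled_write_byte(struct io_port *dev, uint8_t byte)
{
    while (dev->status & STATUS_BUSY)
        ;                                /* step 1: wait for the busy bit to clear     */

    dev->control |= CONTROL_WRITE;       /* step 2: set the write bit ...              */
    dev->data_out = byte;                /*         ... and place the byte in data-out */
    dev->control |= CONTROL_CMD_RDY;     /* step 3: set command-ready                  */

    while (dev->status & STATUS_BUSY)
        ;                                /* steps 4-6 happen in the controller;        */
                                         /* busy-wait until it clears busy again       */
}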
13.2.2 Interrupts
• Interrupts allow devices to notify the CPU when they have data to transfer
or when an operation is complete, allowing the CPU to perform other
duties when no I/O transfers need its immediate attention.
• The CPU has an interrupt-request line that is sensed after every
instruction.
o A device's controller raises an interrupt by asserting a signal on the
interrupt request line.
o The CPU then performs a state save, and transfers control to
the interrupt handler routine at a fixed address in memory. ( The
CPU catches the interrupt and dispatches the interrupt handler. )
o The interrupt handler determines the cause of the interrupt, performs
the necessary processing, performs a state restore, and executes
a return-from-interrupt instruction to return the CPU to the interrupted processing.
( The interrupt handler clears the interrupt by servicing the device. )
▪ ( Note that the state restored does not need to be the same
state as the one that was saved when the interrupt went off.
See below for an example involving time-slicing. )
• Figure 13.3 illustrates the interrupt-driven I/O procedure:
Figure 13.3 - Interrupt-driven I/O cycle.
• The above description is adequate for simple interrupt-driven I/O, but there
are three needs in modern computing which complicate the picture:
1. The need to defer interrupt handling during critical processing,
2. The need to determine which interrupt handler to invoke, without
having to poll all devices to see which one needs attention, and
3. The need for multi-level interrupts, so the system can differentiate
between high- and low-priority interrupts for proper response.
• These issues are handled in modern computer architectures with interrupt-
controller hardware.
o Most CPUs now have two interrupt-request lines: One that is non-
maskable for critical error conditions and one that is maskable, that
the CPU can temporarily ignore during critical processing.
o The interrupt mechanism accepts an address, which is usually one of
a small set of numbers for an offset into a table called the interrupt
vector. This table ( often located in low physical memory ) holds
the addresses of routines prepared to process specific interrupts.
o The number of possible interrupt handlers still exceeds the range of
defined interrupt numbers, so multiple handlers can be interrupt-
chained. Effectively the addresses held in the interrupt vector are
the head pointers for linked lists of interrupt handlers, as sketched
in the code after this list.
o Figure 13.4 shows the Intel Pentium interrupt vector. Interrupts 0 to
31 are non-maskable and reserved for serious hardware and other
errors. Maskable interrupts, including normal device I/O interrupts
begin at interrupt 32.
o Modern interrupt hardware also supports interrupt priority levels,
allowing systems to mask off only lower-priority interrupts while
servicing a high-priority interrupt, or conversely to allow a high-
priority signal to interrupt the processing of a low-priority one.
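• A minimal sketch of interrupt chaining: each vector entry heads a linked list of handlers, and the dispatcher calls them in turn until one claims the interrupt ( all names are illustrative ):

#include <stddef.h>

#define NVECTORS 256

struct handler {
    int (*service)(void);          /* returns 1 if its device raised the interrupt */
    struct handler *next;
};

static struct handler *vector[NVECTORS];   /* one chain per interrupt number */

void dispatch(int irq)
{
    for (struct handler *h = vector[irq]; h != NULL; h = h->next)
        if (h->service())
            return;                /* first handler that claims the interrupt wins */
    /* no handler claimed it: spurious or misconfigured interrupt */
}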
• At boot time the system determines which devices are present, and loads
the appropriate handler addresses into the interrupt table.
• During operation, devices signal errors or the completion of commands via
interrupts.
• Exceptions, such as dividing by zero, invalid memory accesses, or attempts
to execute privileged ( kernel-mode ) instructions can be signaled via interrupts.
• Time slicing and context switches can also be implemented using the
interrupt mechanism.
o The scheduler sets a hardware timer before transferring control over
to a user process.
o When the timer raises the interrupt request line, the CPU performs a
state-save, and transfers control over to the proper interrupt handler,
which in turn runs the scheduler.
o The scheduler does a state-restore of a different process before
resetting the timer and issuing the return-from-interrupt instruction.
• A similar example involves the paging system for virtual memory - A page
fault causes an interrupt, which in turn issues an I/O request and a context
switch as described above, moving the interrupted process into the wait
queue and selecting a different process to run. When the I/O request has
completed ( i.e. when the requested page has been loaded up into physical
memory ), then the device interrupts, and the interrupt handler moves the
process from the wait queue into the ready queue, ( or depending on
scheduling algorithms and policies, may go ahead and context switch it
back onto the CPU. )
• System calls are implemented via software interrupts, a.k.a. traps. When a
( library ) program needs work performed in kernel mode, it sets command
information and possibly data addresses in certain registers, and then raises
a software interrupt. ( E.g. 21 hex in DOS. ) The system does a state save
and then calls on the proper interrupt handler to process the request in
kernel mode. Software interrupts generally have low priority, as they are
not as urgent as devices with limited buffering space.
• Interrupts are also used to control kernel operations, and to schedule
activities for optimal performance. For example, the completion of a disk
read operation involves two interrupts:
o A high-priority interrupt acknowledges the device completion, and
issues the next disk request so that the hardware does not sit idle.
o A lower-priority interrupt transfers the data from the kernel memory
space to the user space, and then transfers the process from the
waiting queue to the ready queue.
• The Solaris OS uses a multi-threaded kernel and priority threads to assign
different threads to different interrupt handlers. This allows for the
"simultaneous" handling of multiple interrupts, and the assurance that high-
priority interrupts will take precedence over low-priority ones and over
user processes.
• For devices that transfer large quantities of data ( such as disk controllers ),
it is wasteful to tie up the CPU transferring data in and out of registers one
byte at a time.
• Instead this work can be off-loaded to a special processor, known as
the Direct Memory Access, DMA, Controller.
• The host issues a command to the DMA controller, indicating the location
where the data is located, the location where the data is to be transferred to,
and the number of bytes of data to transfer. The DMA controller handles
the data transfer, and then interrupts the CPU when the transfer is
complete.
• A simple DMA controller is a standard component in modern PCs, and
many bus-mastering I/O cards contain their own DMA hardware.
• Handshaking between DMA controllers and their devices is accomplished
through two wires called the DMA-request and DMA-acknowledge wires.
• While the DMA transfer is going on the CPU does not have access to the
PCI bus ( including main memory ), but it does have access to its internal
registers and primary and secondary caches.
• DMA can be done in terms of either physical addresses or virtual addresses
that are mapped to physical addresses. The latter approach is known
as Direct Virtual Memory Access, DVMA, and allows direct data transfer
from one memory-mapped device to another without using the main
memory chips.
• Direct DMA access by user processes can speed up operations, but is
generally forbidden by modern systems for security and protection reasons.
( I.e. DMA is a kernel-mode operation. )
• Figure 13.5 below illustrates the DMA process.
13.2.4 I/O Hardware Summary
Figure 13.7 - Characteristics of I/O devices.
• Most devices can be characterized as either block I/O, character I/O, memory
mapped file access, or network sockets. A few devices are special, such as time-
of-day clock and the system timer.
• Most OSes also have an escape, or back door, which allows applications to send
commands directly to device drivers if needed. In UNIX this is the ioctl( ) system
call ( I/O Control ). Ioctl( ) takes three arguments - The file descriptor for the
device driver being accessed, an integer indicating the desired function to be
performed, and an address used for communicating or transferring additional
information.
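• A minimal sketch of an ioctl( ) call: asking the terminal driver for the current window size with the TIOCGWINSZ request, which is available on typical UNIX / Linux systems:

#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    struct winsize ws;

    /* arguments: descriptor, request code, address for additional information */
    if (ioctl(STDOUT_FILENO, TIOCGWINSZ, &ws) != 0) {
        perror("ioctl");
        return 1;
    }
    printf("terminal is %d rows x %d columns\n", ws.ws_row, ws.ws_col);
    return 0;
}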
• Block devices are accessed a block at a time, and are indicated by a "b" as
the first character in a long listing on UNIX systems. Operations supported
include read( ), write( ), and seek( ).
o Accessing blocks on a hard drive directly ( without going through
the filesystem structure ) is called raw I/O, and can speed up certain
operations by bypassing the buffering and locking normally
conducted by the OS. ( It then becomes the application's
responsibility to manage those issues. )
o A new alternative is direct I/O, which uses the normal filesystem
access, but which disables buffering and locking operations.
• Memory-mapped file I/O can be layered on top of block-device drivers.
o Rather than reading in the entire file, it is mapped to a range of
memory addresses, and then paged into memory as needed using the
virtual memory system.
o Access to the file is then accomplished through normal memory
accesses, rather than through read( ) and write( ) system calls. This
approach is commonly used for executable program code.
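• A minimal sketch of memory-mapped file I/O with mmap( ), assuming a non-empty file named example.txt ( the name is illustrative ):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("example.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* Map the whole file; pages are brought in on demand by the VM system. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("first byte: %c\n", p[0]);    /* an ordinary memory access */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}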
• Character devices are accessed one byte at a time, and are indicated by a
"c" in UNIX long listings. Supported operations include get( ) and put( ),
with more advanced functionality such as reading an entire line supported
by higher-level library routines.
• Because network access is inherently different from local disk access, most
systems provide a separate interface for network devices.
• One common and popular interface is the socket interface, which acts like a
cable or pipeline connecting two networked entities. Data can be put into
the socket at one end, and read out sequentially at the other end. Sockets
are normally full-duplex, allowing for bi-directional data transfer.
• The select( ) system call allows servers ( or other applications ) to identify
sockets which have data waiting, without having to poll all available
sockets.
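• A minimal sketch of select( ) waiting on two sockets ( assumed to have been created elsewhere; sock1 and sock2 are illustrative descriptors ) with a five-second timeout:

#include <stdio.h>
#include <sys/select.h>

void wait_for_data(int sock1, int sock2)
{
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(sock1, &readfds);
    FD_SET(sock2, &readfds);

    struct timeval tv = { 5, 0 };        /* 5 seconds, 0 microseconds */
    int maxfd = (sock1 > sock2 ? sock1 : sock2) + 1;

    int ready = select(maxfd, &readfds, NULL, NULL, &tv);
    if (ready < 0)
        perror("select");
    else if (ready == 0)
        printf("timed out; no data waiting\n");
    else {
        if (FD_ISSET(sock1, &readfds)) printf("sock1 has data\n");
        if (FD_ISSET(sock2, &readfds)) printf("sock2 has data\n");
    }
}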
Figure 13.8 - Two I/O methods: (a) synchronous and (b) asynchronous.
• On systems with many devices, separate request queues are often kept for
each device.
13.4.2 Buffering
• Buffering is done for three major reasons: to cope with speed mismatches between
the producer and consumer of a data stream, to adapt between devices with
different data-transfer sizes, and to support copy semantics for application I/O.
Figure 13.10 - Sun Enterprise 6000 device-transfer rates ( logarithmic ).
13.4.3 Caching
• A cache is a region of fast memory holding copies of data; access to the cached
copy is faster than access to the original.
13.4.4 Spooling and Device Reservation
• A spool is a buffer that holds output for a device, such as a printer, that
cannot accept interleaved data streams. Each application's print output is
first copied to its own spool file; the application then sees that print job
as complete, and the print scheduler sends each file to the appropriate
printer one at a time.
• Support is provided for viewing the spool queues, removing jobs from the
queues, moving jobs from one queue to another queue, and in some cases
changing the priorities of jobs in the queues.
• Spool queues can be general ( any laser printer ) or specific ( printer
number 42. )
• OSes can also provide support for processes to request / get exclusive
access to a particular device, and/or to wait until a device becomes
available.
• I/O requests can fail for many reasons, either transient ( buffers overflow )
or permanent ( disk crash ).
• I/O requests usually return an error bit ( or more ) indicating the problem.
UNIX systems also set the global variable errno to one of a hundred or so
well-defined values to indicate the specific error that has occurred. ( See
errno.h for a complete listing, or man errno. )
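• A minimal sketch of checking errno after a failed I/O request ( the path is illustrative ):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (open("/no/such/file", O_RDONLY) < 0) {
        /* errno was set by the failed system call */
        printf("open failed: errno = %d ( %s )\n", errno, strerror(errno));
        return 1;
    }
    return 0;
}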
• Some devices, such as SCSI devices, are capable of providing much more
detailed information about errors, and even keep an on-board error log that
can be requested by the host.
Figure 13.11 - Use of a system call to perform I/O.
Figure 13.12 - UNIX I/O kernel structure.
UNIT V
Overview The fundamental idea behind a virtual machine is to abstract the hardware of a single
computer (the CPU, memory, disk drives, network interface cards, and so forth) into several different
execution environments, thereby creating the illusion that each separate environment is running on
its own private computer. This concept may seem similar to the layered approach of operating system
implementation (see Section 2.8.2), and in some ways it is. In the case of virtualization, there is a
layer that creates a virtual system on which operating systems or applications can run. Virtual
machine implementations involve several components. At the base is the host, the underlying
hardware system that runs the virtual machines. The virtual machine manager (VMM) (also known
as a hypervisor) creates and runs virtual machines by providing an interface that is identical to the
host (except in the case of paravirtualization, discussed later). Each guest process is provided with a
virtual copy of the host (Figure 18.1). Usually, the guest process is in fact an operating system. A
single physical machine can thus run multiple operating systems concurrently, each in its own virtual
machine. Take a moment to note that with virtualization, the definition of “operating system” once
again blurs. For example, consider VMM software such as VMware ESX. This virtualization software
is installed on the hardware, runs when the hardware boots, and provides services to applications.
The services include traditional ones, such as scheduling and memory management, along with new
types, such as migration of applications between systems. Furthermore, the applications are, in fact,
guest operating systems. Is the VMware ESX VMM an operating system that, in turn, runs other
operating systems? Certainly it acts like an operating system. For clarity, however, we call the
component that provides virtual environments a VMM.
The implementation of VMMs varies greatly. Options include the following: • Hardware-based
solutions that provide support for virtual machine creation and management via firmware. These
VMMs, which are commonly found in mainframe and large to midsized servers, are generally known
as type 0 hypervisors. IBM LPARs and Oracle LDOMs are examples.
Type 0 Hypervisor Type 0 hypervisors have existed for many years under many names, including
“partitions” and “domains.” They are a hardware feature, and that brings its own positives and
negatives. Operating systems need do nothing special to take advantage of their features. The VMM
itself is encoded in the firmware and loaded at boot time. In turn, it loads the guest images to run in
each partition. The feature set of a type 0 hypervisor tends to be smaller than those of the other types
because it is implemented in hardware. For example, a system might be split into four virtual systems,
each with dedicated CPUs, memory, and I/O devices. Each guest believes that it has dedicated
hardware because it does, simplifying many implementation details.
I/O presents some difficulty, because it is not easy to dedicate I/O devices to guests if there are not
enough. What if a system has two Ethernet ports and more than two guests, for example? Either all
guests must get their own I/O devices, or the system must provide I/O device sharing. In these cases,
the hypervisor manages shared access or grants all devices to a control partition.
Type 1 Hypervisor
Type 1 hypervisors are commonly found in company data centers and are, in a sense, becoming “the
data-center operating system.” They are special-purpose operating systems that run natively on the
hardware, but rather than providing system calls and other interfaces for running programs, they
create, run, and manage guest operating systems. In addition to running on standard hardware, they
can run on type 0 hypervisors, but not on other type 1 hypervisors. Whatever the platform, guests
generally do not know they are running on anything but the native hardware. Type 1 hypervisors run
in kernel mode, taking advantage of hardware protection. Where the host CPU allows, they use
multiple modes to give guest operating systems their own control and improved performance. They
implement device drivers for the hardware they run on, since no other component could do so.
Because they are operating systems, they must also provide CPU scheduling, memory management,
I/O management, protection, and even security. Frequently, they provide APIs, but those APIs
support applications in guests or external applications that supply features like backups, monitoring,
and security. Many type 1 hypervisors are closed-source commercial offerings, such as VMware
ESX, while some are open source or hybrids of open and closed source, such as Citrix XenServer and
its open Xen counterpart.
Type 2 Hypervisor
This type of VMM is simply another process run and managed by the host, and even the host does
not know that virtualization is happening within the VMM. Type 2 hypervisors have limits not
associated with some of the other types. For example, a user needs administrative privileges to access
many of the hardware assistance features of modern CPUs. If the VMM is being run by a standard
user without additional privileges, the VMM cannot take advantage of these features. Due to this
limitation, as well as the extra overhead of running a general-purpose operating system as well as
guest operating systems, type 2 hypervisors tend to have poorer overall performance than type 0 or
type 1. As is often the case, the limitations of type 2 hypervisors also provide some benefits. They
run on a variety of general-purpose operating systems, and running them requires no changes to the
host operating system. A student can use a type 2 hypervisor, for example, to test a non-native
operating system without replacing the native operating system. In fact, on an Apple laptop, a student
could have versions of Windows, Linux, Unix, and less common operating systems all available for
learning and experimentation.
Paravirtualization
The Xen VMM became the leader in paravirtualization by implementing several
techniques to optimize the performance of guests as well as of the host system. For example,
as mentioned earlier, some VMMs present virtual devices to guests that appear to be real
devices. Instead of taking that approach, the Xen VMM presented clean and simple device
abstractions that allow efficient I/O as well as good I/O-related communication between the
guest and the VMM. For memory management, Xen did not implement nested page tables.
Rather, each guest had its own set of page tables, set to read-only. Xen required the guest to
use a specific mechanism, a hypercall from the guest to the hypervisor VMM, when a page-
table change was needed. This meant that the guest operating system’s kernel code must have
been changed from the default code to these Xen-specific methods. To optimize performance,
Xen allowed the guest to queue up multiple page-table changes asynchronously via hypercalls
and then check to ensure that the changes were complete before continuing operation.
Programming-Environment Virtualization
Another kind of virtualization, based on a different execution model, is the virtualization of
programming environments. Here, a programming language is designed to run within a custom-
built virtualized environment. For example, Oracle’s Java has many features that depend on its
running in the Java virtual machine (JVM), including specific methods for security and memory
management. If we define virtualization as including only duplication of hardware, this is not really
virtualization at all. But we need not limit ourselves to that definition. Instead, we can define a
virtual environment, based on APIs, that provides a set of features we want to have available for a
particular language and programs written in that language. Java programs run within the JVM
environment, and the JVM is compiled to be a native program on systems on which it runs. This
arrangement means that Java programs are written once and then can run on any system (including
all of the major operating systems) on which a JVM is available. The same can be said of interpreted
languages, which run inside programs that read each instruction and interpret it into native
operations.
Emulation
Virtualization is probably the most common method for running applications designed for one
operating system on a different operating system, but on the same CPU. This method works
relatively efficiently because the applications were compiled for the instruction set that the target
system uses. But what if an application or operating system needs to run on a different CPU? Here,
it is necessary to translate all of the source CPU’s instructions so that they are turned into the
equivalent instructions of the target CPU. Such an environment is no longer virtualized but rather
is fully emulated. Emulation is useful when the host system has one system architecture and the
guest system was compiled for a different architecture. For example, suppose a company has
replaced its outdated computer system with a new system but would like to continue to run certain
important programs that were compiled for the old system. The programs could be run in an
emulator that translates each of the outdated system’s instructions into the native instruction set of
the new system. Emulation can increase the life of programs and allow us to explore old
architectures without having an actual old machine.
Application Containment
The goal of virtualization in some instances is to provide a method to segregate
applications, manage their performance and resource use, and create an easy way to start, stop,
move, and manage them. In such cases, perhaps full-fledged virtualization is not needed. If the
applications are all compiled for the same operating system, then we do not need complete
virtualization to provide these features. We can instead use application containment. Consider
one example of application containment. Starting with version 10, Oracle Solaris has included
containers, or zones, that create a virtual layer between the operating system and the
applications. In this system, only one kernel is installed, and the hardware is not virtualized.
Rather, the operating system and its devices are virtualized, providing processes within a zone
with the impression that they are the only processes on the system. One or more containers can
be created, and each can have its own applications, network stacks, network address and ports,
user accounts, and so on. CPU and memory resources can be divided among the zones and the
system-wide processes. Each zone, in fact, can run its own scheduler to optimize the
performance of its applications on the allotted resources. Figure 18.7 shows a Solaris 10 system
with two containers and the standard “global” user space.
1. Recall that a guest believes it controls memory allocation via its page-table management, whereas
in reality the VMM maintains a nested page table that translates the guest page table to the real page
table. The VMM can use this extra level of indirection to optimize the guest’s use of memory
without the guest’s knowledge or help. One approach is to provide double paging. Here, the VMM
has its own page-replacement algorithms and loads pages into a backing store that the guest believes
is physical memory. Of course, the VMM knows less about the guest’s memory access patterns than
the guest does, so its paging is less efficient, creating performance problems. VMMs do use this
method when other methods are not available or are not providing enough free memory. However,
it is not the preferred approach.
2. A common solution is for the VMM to install in each guest a pseudo-device driver or kernel
module that the VMM controls. (A pseudo-device driver uses device-driver interfaces, appearing to
the kernel to be a device driver, but does not actually control a device. Rather, it is an easy way to
add kernel-mode code without directly modifying the kernel.) This balloon memory manager
communicates with the VMM and is told to allocate or deallocate memory. If told to allocate, it
allocates memory and tells the operating system to pin the allocated pages into physical memory.
3. Another common method for reducing memory pressure is for the VMM to determine if the same
page has been loaded more than once. If this is the case, the VMM reduces the number of copies of
the page to one and maps the other users of the page to that one copy. VMware, for example,
randomly samples guest memory and creates a hash for each page sampled. That hash value is a
“thumbprint” of the page. The hash of every page examined is compared with other hashes stored
in a hash table. If there is a match, the pages are compared byte by byte to see if they really are
identical. If they are, one page is freed, and its logical address is mapped to the other’s physical
address.
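A minimal sketch of the idea in point 3, content-based page sharing: hash each candidate page as a cheap filter, then compare byte by byte before sharing. The page size, the FNV-1a hash, and the function names are illustrative and are not VMware's actual implementation.

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

/* 64-bit FNV-1a hash of a page: the cheap "thumbprint". */
static uint64_t page_hash(const uint8_t *page)
{
    uint64_t h = 1469598103934665603ULL;
    for (int i = 0; i < PAGE_SIZE; i++) {
        h ^= page[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Two pages may be shared only if the hashes match AND the contents are
   verified byte by byte, since hashes alone can collide. */
static int pages_identical(const uint8_t *a, const uint8_t *b)
{
    return page_hash(a) == page_hash(b) && memcmp(a, b, PAGE_SIZE) == 0;
}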
I/O
In the area of I/O, hypervisors have some leeway and can be less concerned with how they represent
the underlying hardware to their guests. Because of the wide variation in I/O devices, operating
systems are used to dealing with varying and flexible I/O mechanisms. For example, an operating
system’s device-driver mechanism provides a uniform interface to the operating system whatever
the I/O device. Device-driver interfaces are designed to allow third-party hardware manufacturers
to provide device drivers connecting their devices to the operating system. Usually, device drivers
can be dynamically loaded and unloaded. Virtualization takes advantage of this built-in flexibility
by providing specific virtualized devices to guest operating systems.
Storage Management
Type 1 hypervisors store the guest root disk (and configuration information) in one or more files in
the file systems provided by the VMM. Type 2 hypervisors store the same information in the host
operating system’s file systems. In essence, a disk image, containing all of the contents of the root
disk of the guest, is contained in one file in the VMM. Aside from the potential performance problems
that this causes, this is a clever solution, because it simplifies copying and moving guests. If the
administrator wants a duplicate of the guest (for testing, for example), she simply copies the
associated disk image of the guest and tells the VMM about the new copy. Booting the new virtual
machine brings up an identical guest. Moving a virtual machine from one system to another that
runs the same VMM is as simple as halting the guest, copying the image to the other system, and
starting the guest there.
Live Migration
1. The source VMM establishes a connection with the target VMM and confirms that it is allowed
to send a guest.
2. The target creates a new guest by creating a new VCPU, new nested page table, and other state
storage.
3. The source sends all read-only memory pages to the target.
4. The source sends all read–write pages to the target, marking them as clean.
5. The source repeats step 4, because during that step some pages were probably modified by the
guest and are now dirty. These pages need to be sent again and marked again as clean.
6. When the cycle of steps 4 and 5 becomes very short, the source VMM freezes the guest, sends
the VCPU’s final state, other state details, and the final dirty pages, and tells the target to start
running the guest. Once the target acknowledges that the guest is running, the source terminates the
guest.
iOS
iOS is a mobile operating system designed by Apple to run its smartphone, the iPhone, as well as
its tablet computer, the iPad. iOS is structured on the Mac OS X operating system, with added
functionality pertinent to mobile devices, but does not directly run Mac OS X applications. The
structure of iOS appears in Figure 2.17. Cocoa Touch is an API for Objective-C that provides
several frameworks for developing applications that run on iOS devices. The fundamental
difference between Cocoa, mentioned earlier, and Cocoa Touch is that the latter provides support
for hardware features unique to mobile devices, such as touch screens. The media services layer
provides services for graphics, audio, and video.
Android
The Android operating system was designed by the Open Handset Alliance (led primarily by
Google) and was developed for Android smartphones and tablet computers. Whereas iOS is
designed to run on Apple mobile devices and is close-sourced, Android runs on a variety of mobile
platforms and is open-sourced, partly explaining its rapid rise in popularity. The structure of
Android appears in Figure 2.18. Android is similar to iOS in that it is a layered stack of software
that provides a rich set of frameworks for developing mobile applications. At the bottom of this
software stack is the Linux kernel, although it has been modified by Google and is currently outside
the normal distribution of Linux releases.
Linux is used primarily for process, memory, and device-driver support for hardware and has been
expanded to include power management. The Android runtime environment includes a core set of
libraries as well as the Dalvik virtual machine. Software designers for Android devices develop
applications in the Java language. However, rather than using the standard Java API, Google has
designed a separate Android API for Java development. The Java class files are first compiled to
Java bytecode and then translated into an executable file that runs on the Dalvik virtual machine.
The Dalvik virtual machine was designed for Android and is optimized for mobile devices with
limited memory and CPU processing capabilities.