OS Entire Notes

UNIT-1

Operating Systems Overview: Operating system functions, Operating system structure, Operating system operations, Computing environments, Open-Source Operating Systems.
System Structures: Operating System Services, User and Operating-System Interface, System calls, Types of System Calls, System programs, Operating system structure.

UNIT-2

Process Concept: Process scheduling, Operations on processes, Inter-process communication.

Multithreaded Programming: Multithreading models, Thread libraries, Threading issues.

Process Scheduling: Basic concepts, Scheduling criteria, Scheduling algorithms,

Inter-process Communication: Critical Regions, Mutual exclusion with busy waiting, Semaphores, Monitors, Message passing, Classical IPC Problems – Dining philosophers problem, Readers and writers problem.

UNIT-3

Memory-Management Strategies: Introduction, Swapping, Contiguous memory allocation, Paging, Segmentation.

Virtual Memory Management: Introduction, Demand paging, Page replacement, Frame allocation, Thrashing.

UNIT-4

Deadlocks: Resources, Conditions for deadlocks, Deadlock detection and recovery, Deadlock avoidance, Deadlock prevention.

UNIT-5
File Systems: Files, Directories, File system implementation.
Secondary-Storage Structure: Overview of disk structure and attachment, Disk scheduling, RAID structure.

Introduction
An Operating System (OS) is an interface between a computer user and computer hardware. An
operating system is software that performs all the basic tasks like file management, memory
management, process management, handling input and output, and controlling peripheral devices
such as disk drives and printers.
An operating system is software that enables applications to interact with a computer's hardware.
The software that contains the core components of the operating system is called the kernel.
The primary purposes of an Operating System are to enable applications ( software ) to interact
with a computer's hardware and to manage a system's hardware and software resources.
Some popular Operating Systems include Linux, Windows, VMS, OS/400, AIX, z/OS, etc. Today,
operating systems are found in almost every device, including mobile phones, personal computers,
mainframe computers, automobiles, TVs, toys, etc.

Definitions
We can have a number of definitions of an Operating System. Let's go through a few of them:
An Operating System is the low-level software that supports a computer's basic functions, such as
scheduling tasks and controlling peripherals.

An operating system is a program that acts as an interface between the user and the computer
hardware and controls the execution of all kinds of programs.

An operating system (OS) is system software that manages computer hardware and software
resources and provides common services for computer programs.

1.1 What Operating Systems Do - For Users, For Applications, etc.


Figure 1.1 - Abstract view of the components of a computer system

• Computer = HW + OS + Apps + Users


• OS serves as interface between HW and ( Apps & Users )
• OS provides services for Apps & Users
• OS manages resources ( Government model, it doesn't produce anything. )
• Debates about what is included in the OS - Just the kernel, or everything the
vendor ships?
( Consider the distinction between system applications and 3rd party or user apps. )
1.2 Computer-System Organization - What are all the parts, and how do they fit
together?
Figure 1.2 - A modern computer system

1.2 Functions of Operating System

a) Process management
• Allocating and deallocating the resources.
• Allocates resources such that the system doesn’t run out of resources.
• Offering mechanisms for process synchronization.
• Helps in process communication
b) Memory management
• Allocating/deallocating memory to store programs.
• Deciding the amount of memory that should be allocated to the program.
• Memory distribution while multiprocessing.
• Update the status in case memory is freed
• Keeps record of how much memory is used and how much is unused.
c) File Management
• Keeps track of location and status of files.
• Allocating and deallocating resources.
• Decides which resource to be assigned to which file.
• Creating file
• Editing a file
• Updating a file
• Deleting a file
d) Device management
• Allocating and deallocating devices to different processes.
• Keeps records of all the devices attached to the computer.
• Decides which device to be allocated to which process and for how much time.
e) Security & Privacy − By means of password and similar other techniques, it prevents
unauthorized access to programs and data.
f) Control over system performance − Recording delays between request for a service and
response from the system.
g) Job accounting − Keeping track of time and resources used by various jobs and users.
h) Error detecting − Production of dumps, traces, error messages, and other debugging and
error detecting aids.

1.3 Operating-System Structure

A time-sharing ( multi-user multi-tasking ) OS requires:

• Memory management
• Process management
• Job scheduling
• Resource allocation strategies
• Swap space / virtual memory in physical memory
• Interrupt handling
• File system management
• Protection and security
• Inter-process communications
Figure 1.3 - Memory layout for a multiprogramming system

1.4 Operating-System Operations

The interrupt-driven nature of modern OSes requires that erroneous processes not be able
to disturb anything else.
1.4.1 Dual-Mode and Multimode Operation

• User mode when executing harmless code in user applications


• Kernel mode ( a.k.a. system mode, supervisor mode, privileged mode ) when
executing potentially dangerous code in the system kernel.
• Certain machine instructions ( privileged instructions ) can only be executed in
kernel mode.
• Kernel mode can only be entered by making system calls. User code cannot flip
the mode switch.
• Modern computers support dual-mode operation in hardware, and therefore most
modern OSes support dual-mode operation.
Figure 1.4 - Transition from user to kernel mode

• The concept of modes can be extended beyond two, requiring more than a single
mode bit
• CPUs that support virtualization use one of these extra bits to indicate when the
virtual machine manager, VMM, is in control of the system. The VMM has more
privileges than ordinary user programs, but not so many as the full kernel.
• System calls are typically implemented in the form of software interrupts, which
cause the hardware to transfer control to an appropriate interrupt handler that is part
of the operating system, switching the mode bit to kernel mode in the process. The
interrupt handler checks exactly which interrupt was generated, checks additional
parameters ( generally passed through registers ) if appropriate, and then calls the
appropriate kernel service routine to handle the service requested by the system call.
• User programs' attempts to execute illegal instructions ( privileged or non-
existent instructions ), or to access forbidden memory areas, also generate software
interrupts, which are trapped by the interrupt handler and control is transferred to the
OS, which issues an appropriate error message, possibly dumps data to a log ( core )
file for later analysis, and then terminates the offending program.
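
To make the trap mechanism concrete, the short program below ( an added illustration, not part
of the original notes ) issues a write system call directly. It assumes an x86-64 Linux system
and GCC inline assembly: the system-call number goes in RAX, the arguments in RDI, RSI, and
RDX, and the syscall instruction switches the CPU into kernel mode. Ordinarily the C library's
write( ) wrapper hides this detail.

/* Issuing a raw write system call on x86-64 Linux ( illustration only ). */
#include <string.h>

int main( void ) {
    const char *msg = "hello via a raw system call\n";
    long ret;

    __asm__ volatile (
        "syscall"                      /* trap into the kernel               */
        : "=a" (ret)                   /* return value comes back in RAX     */
        : "a" (1L),                    /* system-call number 1 = write       */
          "D" (1L),                    /* RDI: file descriptor 1 ( stdout )  */
          "S" (msg),                   /* RSI: buffer address                */
          "d" ((long) strlen( msg ))   /* RDX: byte count                    */
        : "rcx", "r11", "memory" );    /* registers clobbered by syscall     */

    return 0;
}
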
1.4.2 Timer

• Before the kernel begins executing user code, a timer is set to generate an
interrupt.
• The timer interrupt handler reverts control back to the kernel.
• This assures that no user process can take over the system.
• Timer control is a privileged instruction, ( requiring kernel mode. ) This prevents
user programs from disabling the timer and monopolizing the CPU.
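
The sketch below is a user-space analogue ( added here for illustration, not kernel code ) of
the same idea, assuming a POSIX system: an interval timer set with setitimer( ) delivers
SIGALRM while a runaway loop is spinning, and the handler takes control back, just as the
hardware timer interrupt returns control to the kernel.

/* User-space analogue of the kernel timer ( illustration only ). */
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <stdlib.h>
#include <unistd.h>

static void on_timer( int sig ) {
    (void) sig;
    const char msg[] = "timer interrupt: taking control back\n";
    write( STDOUT_FILENO, msg, sizeof( msg ) - 1 );  /* async-signal-safe */
    _exit( 0 );                                      /* stop the runaway loop */
}

int main( void ) {
    struct sigaction sa;
    memset( &sa, 0, sizeof( sa ) );
    sa.sa_handler = on_timer;
    sigaction( SIGALRM, &sa, NULL );                 /* install the handler */

    struct itimerval itv;
    memset( &itv, 0, sizeof( itv ) );
    itv.it_value.tv_sec = 2;                         /* fire once, in two seconds */
    setitimer( ITIMER_REAL, &itv, NULL );

    for( ;; )                                        /* a process that never yields */
        ;
}
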
1.5 Computing Environments
1.5.1 Traditional Computing

1.5.2 Distributed Systems

• Distributed Systems consist of multiple, possibly heterogeneous, computers
connected together via a network and cooperating in some way, form, or fashion.
• Networks may range from small tight LANs to broad reaching WANs.
o WAN = Wide Area Network, such as an international corporation
o MAN =Metropolitan Area Network, covering a region the size of a city for
example.
o LAN =Local Area Network, typical of a home, business, single-site corporation,
or university campus.
o PAN = Personal Area Network, such as the bluetooth connection between your
PC, phone, headset, car, etc.
• Network access speeds, throughputs, reliabilities, are all important issues.
• OS view of the network may range from just a special form of file access to
complex well-coordinated network operating systems.
• Shared resources may include files, CPU cycles, RAM, printers, and other
resources.
1.5.3 Client-Server Computing

• A designated server provides services ( HW or SW ) to other systems, which act
as clients. ( Technically clients and servers are processes, not HW, and may co-exist on
the same physical computer. )
• A process may act as both client and server of either the same or different
resources.
• Served resources may include disk space, CPU cycles, time of day, IP name
information, graphical displays ( X Servers ), or other resources.

Figure 1.5 - General structure of a client-server system


1.5.4 Peer-to-Peer Computing

• Any computer or process on the network may provide services to any other which
requests it. There is no clear "leader" or overall organization.
• May employ a central "directory" server for looking up the location of resources,
or may use peer-to-peer searching to find resources.
• E.g. Skype uses a central server to locate a desired peer, and then further
communication is peer to peer.

Figure 1.6 - Peer-to-peer system with no centralized service


1.5.5 Real-Time Embedded Systems

• Embedded into devices such as automobiles, climate control systems, process
control, and even toasters and refrigerators.
• May involve specialized chips, or generic CPUs applied to a particular task. (
Consider the current price of 80286 or even 8086 or 8088 chips, which are still plenty
powerful enough for simple electronic devices such as kids toys. )
• Process control devices require real-time ( interrupt driven ) OSes. Response
time can be critical for many such devices.
1.12 Open-Source Operating Systems

• ( For more information on the Flourish conference held at UIC on the subject of
Free Libre and Open Source Software, visit http://www.flourishconf.com )
• Open-Source software is published ( sometimes sold ) with the source code, so
that anyone can see and optionally modify the code.
• Open-source SW is often developed and maintained by a small army of loosely
connected often unpaid programmers, each working towards the common good.
• Critics argue that open-source SW can be buggy, but proponents counter that
bugs are found and fixed quickly, since there are so many pairs of eyes inspecting all
the code.
• Open-source operating systems are a good resource for studying OS
development, since students can examine the source code and even change it and re-
compile the changes.
1.12.1 History

• At one time ( 1950s ) a lot of code was open-source.


• Later, companies tried to protect the privacy of their code, particularly sensitive
issues such as copyright protection algorithms.
• In 1983 Richard Stallman started the GNU project to produce an open-source
UNIX.
• He later published the GNU Manifesto, arguing that ALL software should be
open-source, and founded the Free Software Foundation to promote open-source
development.
• FSF and GNU use the GNU General Public License, which essentially states that
all users of the software have full rights to copy and change the SW however they wish,
so long as anything they distribute further carries the same license agreement. (
Copylefting )
1.12.2 Linux

• The Linux kernel was developed by Linus Torvalds in Finland in 1991; combined with
the GNU tools it formed the first complete open-source UNIX-like operating system.
• Many different distributions of Linux have evolved from Linus's original,
including RedHat, SUSE, Fedora, Debian, Slackware, and Ubuntu, each geared toward
a different group of end-users and operating environments.
• To run Linux on a Windows system using VMware, follow these steps:
1. Download the free "VMware Player" tool
from http://www.vmware.com/download/player and install it on your system
2. Choose a Linux version from among hundreds of virtual machine images
at http://www.vmware.com/appliances
3. Boot the virtual machine within VMware Player.
1.12.3 BSD UNIX

• UNIX was originally developed at AT&T Bell Labs, and the source code was made
available to computer science students at many universities, including the University of
California at Berkeley, UCB.
• UCB students developed UNIX further, and released their product as BSD UNIX
in both binary and source-code format.
• BSD UNIX was not originally open-source, because a license was still needed from
AT&T.
• In spite of various lawsuits, there are now several freely available versions of BSD
UNIX, including FreeBSD, NetBSD, OpenBSD, and DragonFly BSD.
• The source code is located in /usr/src.
• The core of the Mac operating system is Darwin, derived from BSD UNIX, and
is available at http://developer.apple.com/opensource/index.html
1.12.4 Solaris

• Solaris is the UNIX operating system for computers from Sun Microsystems.
• Solaris was originally based on BSD UNIX, and has since migrated to AT&T
System V as its basis.
• Parts of Solaris are now open-source, and some are not because they are still
covered by AT&T copyrights.
• It is possible to change the open-source components of Solaris, re-compile them,
and then link them in with binary libraries of the copyrighted portions of Solaris.
• OpenSolaris is available from http://www.opensolaris.org/os/
• Solaris also allows viewing of the source code online, without having to
download and unpack the entire package.

2.1 Operating-System Services

Figure 2.1 - A view of operating system services


OSes provide environments in which programs run, and services for the users of the
system, including:

• User Interfaces - Means by which users can issue commands to the system.
Depending on the system these may be a command-line interface ( e.g. sh, csh,
ksh, tcsh, etc. ), a GUI interface ( e.g. Windows, X-Windows, KDE, Gnome, etc.
), or a batch command system. The latter are generally older systems using
punch cards and job-control language ( JCL ), but may still be used today for
specialty systems designed for a single purpose.
• Program Execution - The OS must be able to load a program into RAM, run
the program, and terminate the program, either normally or abnormally.
• I/O Operations - The OS is responsible for transferring data to and from I/O
devices, including keyboards, terminals, printers, and storage devices.
• File-System Manipulation - In addition to raw data storage, the OS is also
responsible for maintaining directory and subdirectory structures, mapping file
names to specific blocks of data storage, and providing tools for navigating and
utilizing the file system.
• Communications - Inter-process communications, IPC, either between
processes running on the same processor, or between processes running on
separate processors or separate machines. May be implemented as either shared
memory or message passing, ( or some systems may offer both. )
• Error Detection - Both hardware and software errors must be detected and
handled appropriately, with a minimum of harmful repercussions. Some systems
may include complex error avoidance or recovery systems, including backups,
RAID drives, and other redundant systems. Debugging and diagnostic tools aid
users and administrators in tracing down the cause of problems.

Other systems aid in the efficient operation of the OS itself:

• Resource Allocation - E.g. CPU cycles, main memory, storage space, and
peripheral devices. Some resources are managed with generic systems and others
with very carefully designed and specially tuned systems, customized for a
particular resource and operating environment.
• Accounting - Keeping track of system activity and resource usage, either for
billing purposes or for statistical record keeping that can be used to optimize
future performance.
• Protection and Security - Preventing harm to the system and to resources, either
through wayward internal processes or malicious outsiders. Authentication,
ownership, and restricted access are obvious parts of this system. Highly secure
systems may log all process activity down to excruciating detail, and security
regulations may dictate the storage of those records on permanent non-erasable
media for extended times in secure ( off-site ) facilities.

2.2 User Operating-System Interface

2.2.1 Command Interpreter

• Gets and processes the next user request, and launches the requested
programs.
• In some systems the CI may be incorporated directly into the kernel.
• More commonly the CI is a separate program that launches once the user
logs in or otherwise accesses the system.
• UNIX, for example, provides the user with a choice of different shells,
which may either be configured to launch automatically at login, or
which may be changed on the fly. ( Each of these shells uses a different
configuration file of initial settings and commands that are executed
upon startup. )
• Different shells provide different functionality, in terms of certain
commands that are implemented directly by the shell without launching
any external programs. Most provide at least a rudimentary command
interpretation structure for use in shell script programming ( loops,
decision constructs, variables, etc. )
• An interesting distinction is the processing of wild card file naming and
I/O re-direction. On UNIX systems those details are handled by the
shell, and the program which is launched sees only a list of filenames
generated by the shell from the wild cards. On a DOS system, the wild
cards are passed along to the programs, which can interpret the wild
cards as the program sees fit.
Figure 2.2 - The Bourne shell command interpreter in Solaris 10

2.2.2 Graphical User Interface, GUI

• Generally implemented as a desktop metaphor, with file folders, trash


cans, and resource icons.
• Icons represent some item on the system, and respond accordingly when
the icon is activated.
• First developed in the early 1970's at Xerox PARC research facility.
• In some systems the GUI is just a front end for activating a traditional
command line interpreter running in the background. In others the GUI
is a true graphical shell in its own right.
• Mac has traditionally provided ONLY the GUI interface. With the
advent of OSX ( based partially on UNIX ), a command line interface
has also become available.
• Because mice and keyboards are impractical for small mobile devices,
these normally use a touch-screen interface today, that responds to
various patterns of swipes or "gestures". When these first came out they
often had a physical keyboard and/or a trackball of some kind built in,
but today a virtual keyboard is more commonly implemented on the
touch screen.

Figure 2.3 - The iPad touchscreen

2.3 System Calls

• System calls provide a means for user or application programs to call upon the
services of the operating system.
• Generally written in C or C++, although some are written in assembly for optimal
performance.
• Figure 2.5 illustrates the sequence of system calls required to copy a file:
Figure 2.5 - Example of how system calls are used.

• You can use "strace" to see more examples of the large number of system calls
invoked by a single simple command. Read the man page for strace, and try some
simple examples. ( strace mkdir temp, strace cd temp, strace date > t.t, strace cp
t.t t.2, etc. )
• Most programmers do not use the low-level system calls directly, but instead use
an "Application Programming Interface", API. The following sketch shows the
read( ) call available in the API on UNIX-based systems:
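
The sidebar itself is not reproduced in these notes; the sketch below reconstructs it from the
standard POSIX interface ( not the book's exact figure ), showing the read( ) prototype and a
short copy loop of the kind Figure 2.5 describes. The file names are hypothetical.

/*     #include <unistd.h>
 *     ssize_t read( int fd, void *buf, size_t count );
 *
 * read( ) attempts to read up to count bytes from file descriptor fd into buf,
 * returning the number of bytes actually read, 0 at end-of-file, or -1 on error.
 */
#include <fcntl.h>
#include <unistd.h>

int main( void ) {
    char buf[ 4096 ];
    ssize_t n;

    int in  = open( "in.txt", O_RDONLY );                            /* system call */
    int out = open( "out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644 ); /* system call */
    if( in < 0 || out < 0 )
        return 1;

    while( ( n = read( in, buf, sizeof( buf ) ) ) > 0 )   /* read a block of data */
        write( out, buf, n );                             /* write it to the copy */

    close( in );
    close( out );
    return 0;
}
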
The use of APIs instead of direct system calls provides for greater program portability
between different systems. The API then makes the appropriate system calls through
the system call interface, using a table lookup to access specific numbered system
calls, as shown in Figure 2.6:
Figure 2.6 - The handling of a user application invoking the open( ) system call

• Parameters are generally passed to system calls via registers, or less commonly,
by values pushed onto the stack. Large blocks of data are generally accessed
indirectly, through a memory address passed in a register or on the stack, as
shown in Figure 2.7:

Figure 2.7 - Passing of parameters as a table

2.4 Types of System Calls

Six major categories, as outlined in Figure 2.8 and the following six subsections:
2.4.1 Process Control

• Process control system calls include end, abort, load, execute, create process,
terminate process, get/set process attributes, wait for time or event, signal event,
and allocate and free memory.
• Processes must be created, launched, monitored, paused, resumed, and
eventually stopped.
• When one process pauses or stops, then another must be launched or resumed
• When processes stop abnormally it may be necessary to provide core dumps
and/or other diagnostic or recovery tools.
• Compare DOS ( a single-tasking system ) with UNIX ( a multi-tasking system
).
o When a process is launched in DOS, the command interpreter first unloads
as much of itself as it can to free up memory, then loads the process and
transfers control to it. The interpreter does not resume until the process has
completed, as shown in Figure 2.9:

Figure 2.9 - MS-DOS execution. (a) At system startup. (b) Running a program.

o Because UNIX is a multi-tasking system, the command interpreter remains
completely resident when executing a process, as shown in Figure 2.10 below.
▪ The user can switch back to the command interpreter at any time, and can place the
running process in the background even if it was not originally launched as a
background process.
▪ In order to do this, the command interpreter first executes a "fork" system call,
which creates a second process which is an exact duplicate ( clone ) of the original
command interpreter. The original process is known as the parent, and the cloned
process is known as the child, with its own unique process ID and parent ID.
▪ The child process then executes an "exec" system call, which replaces its code with
that of the desired process.
▪ The parent ( command interpreter ) normally waits for the child to complete before
issuing a new command prompt, but in some cases it can also issue a new prompt
right away, without waiting for the child process to complete. ( The child is then
said to be running "in the background", or "as a background process". )
Figure 2.10 - FreeBSD running multiple programs

2.4.2 File Management

• File management system calls include create file, delete file, open, close, read,
write, reposition, get file attributes, and set file attributes.
• These operations may also be supported for directories as well as ordinary files.
• ( The actual directory structure may be implemented using ordinary files on the
file system, or through other means. )
2.4.3 Device Management

• Device management system calls include request device, release device, read,
write, reposition, get/set device attributes, and logically attach or detach
devices.
• Devices may be physical ( e.g. disk drives ), or virtual / abstract ( e.g. files,
partitions, and RAM disks ).
• Some systems represent devices as special files in the file system, so that
accessing the "file" calls upon the appropriate device drivers in the OS. See
for example the /dev directory on any UNIX system.

2.4.4 Information Maintenance

• Information maintenance system calls include calls to get/set the time, date,
system data, and process, file, or device attributes.
• Systems may also provide the ability to dump memory at any time, single step
programs pausing execution after each instruction, and tracing the operation of
programs, all of which can help to debug programs.

2.4.5 Communication

• Communication system calls include create/delete communication connection,
send/receive messages, transfer status information, and attach/detach remote
devices.
• The message passing model must support calls to:
o Identify a remote process and/or host with which to communicate.
o Establish a connection between the two processes.
o Open and close the connection as needed.
o Transmit messages along the connection.
o Wait for incoming messages, in either a blocking or non-blocking state.
o Delete the connection when no longer needed.
• The shared memory model must support calls to:
o Create and access memory that is shared amongst processes ( and threads. )
o Provide locking mechanisms restricting simultaneous access.
o Free up shared memory and/or dynamically allocate it as needed.
• Message passing is simpler and easier, ( particularly for inter-computer
communications ), and is generally appropriate for small amounts of data.
• Shared memory is faster, and is generally the better approach where large
amounts of data are to be shared, ( particularly when most processes are reading
the data rather than writing it, or at least when only one or a small number of
processes need to change any given data item. )

2.4.6 Protection

• Protection provides mechanisms for controlling which users / processes have
access to which system resources.
• System calls allow the access mechanisms to be adjusted as needed, and for non-
privileged users to be granted elevated access permissions under carefully
controlled temporary circumstances.
• Once only of concern on multi-user systems, protection is now important on all
systems, in the age of ubiquitous network connectivity.
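
As a small, hedged illustration of such a call on a POSIX system, chmod( ) below adjusts which
users may access a file; the file name is hypothetical, and other systems expose equivalent
protection calls.

/* Adjusting file access permissions with the chmod( ) system call. */
#include <stdio.h>
#include <sys/stat.h>

int main( void ) {
    /* Owner may read and write; group and others may only read ( mode 0644 ). */
    if( chmod( "report.txt", S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH ) != 0 ) {
        perror( "chmod" );
        return 1;
    }
    return 0;
}
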
2.5 System Programs

• System programs provide OS functionality through separate applications,
which are not part of the kernel or command interpreters. They are also known
as system utilities or system applications.
• Most systems also ship with useful applications such as calculators and simple
editors, ( e.g. Notepad ). Some debate arises as to the border between system
and non-system applications.
• System programs may be divided into these categories:
o File management - programs to create, delete, copy, rename, print, list,
and generally manipulate files and directories.
o Status information - Utilities to check on the date, time, number of
users, processes running, data logging, etc. System registries are used to
store and recall configuration information for particular applications.
o File modification - e.g. text editors and other tools which can change
file contents.
o Programming-language support - E.g. Compilers, linkers, debuggers,
profilers, assemblers, library archive management, interpreters for
common languages, and support for make.
o Program loading and execution - loaders, dynamic loaders, overlay
loaders, etc., as well as interactive debuggers.
o Communications - Programs for providing connectivity between
processes and users, including mail, web browsers, remote logins, file
transfers, and remote command execution.
o Background services - System daemons are commonly started when the
system is booted, and run for as long as the system is running, handling
necessary services. Examples include network daemons, print servers,
process schedulers, and system error monitoring services.
• Most operating systems today also come complete with a set of application
programs to provide additional services, such as copying files or checking the
time and date.
• Most users' view of the system is determined by their command interpreter
and the application programs. Most never make system calls, even through the
API, ( with the exception of simple ( file ) I/O in user-written programs. )

2.7 Operating-System Structure


For efficient performance and implementation an OS should be partitioned into separate
subsystems, each with carefully defined tasks, inputs, outputs, and performance
characteristics. These subsystems can then be arranged in various architectural
configurations:

2.7.1 Simple Structure

When DOS was originally written its developers had no idea how big and important it
would eventually become. It was written by a few programmers in a relatively short
amount of time, without the benefit of modern software engineering techniques, and
then gradually grew over time to exceed its original expectations. It does not break the
system into subsystems, and has no distinction between user and kernel modes,
allowing all programs direct access to the underlying hardware. ( Note that user versus
kernel mode was not supported by the 8088 chip set anyway, so that really wasn't an
option back then. )

Figure 2.11 - MS-DOS layer structure

The original UNIX OS used a simple layered approach, but almost all the OS was in
one big layer, not really breaking the OS down into layered subsystems:
Figure 2.12 - Traditional UNIX system structure

2.7.2 Layered Approach

• Another approach is to break the OS into a number of smaller layers, each of which
rests on the layer below it, and relies solely on the services provided by the next
lower layer.
• This approach allows each layer to be developed and debugged independently, with
the assumption that all lower layers have already been debugged and are trusted to
deliver proper services.
• The problem is deciding in what order to place the layers, as no layer can
call upon the services of any higher layer, and so many chicken-and-egg situations
may arise.
• Layered approaches can also be less efficient, as a request for service from a higher
layer has to filter through all lower layers before it reaches the HW, possibly with
significant processing at each step.
Figure 2.13 - A layered operating system

2.7.3 Microkernels

• The basic idea behind microkernels is to remove all non-essential services from the
kernel, and implement them as system applications instead, thereby making the
kernel as small and efficient as possible.
• Most microkernels provide basic process and memory management, and message
passing between other services, and not much more.
• Security and protection can be enhanced, as most services are performed in user
mode, not kernel mode.
• System expansion can also be easier, because it only involves adding more system
applications, not rebuilding a new kernel.
• Mach was the first and most widely known microkernel, and now forms a major
component of Mac OSX.
• Windows NT was originally a microkernel design, but suffered from performance
problems relative to Windows 95. NT 4.0 improved performance by moving more
services into the kernel, and now XP is back to being more monolithic.
• Another microkernel example is QNX, a real-time OS for embedded systems.
Figure 2.14 - Architecture of a typical microkernel

2.7.4 Modules

• Modern OS development is object-oriented, with a relatively small core kernel and
a set of modules which can be linked in dynamically. See for example the Solaris
structure, as shown in Figure 2.15 below.
• Modules are similar to layers in that each subsystem has clearly defined tasks and
interfaces, but any module is free to contact any other module, eliminating the
problems of going through multiple intermediary layers, as well as the chicken-and-
egg problems.
• The kernel is relatively small in this architecture, similar to microkernels, but the
kernel does not have to implement message passing since modules are free to
contact each other directly.

Figure 2.15 - Solaris loadable modules

2.7.5 Hybrid Systems


• Most OSes today do not strictly adhere to one architecture, but are hybrids of
several.

2.7.5.1 Mac OS X

• The Mac OS X architecture relies on the Mach microkernel for basic system
management services, and the BSD kernel for additional services. Application
services and dynamically loadable modules ( kernel extensions ) provide the rest of
the OS functionality:

Figure 2.16 - The Mac OS X structure

2.7.5.2 iOS

• The iOS operating system was developed by Apple for iPhones and iPads. It runs
with less memory and computing power than Mac OS X, and supports a
touchscreen interface and graphics for small screens:

Figure 2.17 - Architecture of Apple's iOS.

2.7.5.3 Android
• The Android OS was developed for Android smartphones and tablets by the Open
Handset Alliance, primarily Google.
• Android is an open-source OS, as opposed to iOS, which has led to its popularity.
• Android includes versions of Linux and a Java virtual machine both optimized for
small platforms.
• Android apps are developed using a special Java-for-Android development
environment.

Figure 2.18 - Architecture of Google's Android

Processes
References:

1. Abraham Silberschatz, Greg Gagne, and Peter Baer Galvin, "Operating System
Concepts, Ninth Edition ", Chapter 3

3.1 Process Concept

• A process is an instance of a program in execution.


• Batch systems work in terms of "jobs". Many modern process concepts are still
expressed in terms of jobs, ( e.g. job scheduling ), and the two terms are often used
interchangeably.

3.1.1 The Process

• Process memory is divided into four sections as shown in Figure 3.1 below:
o The text section comprises the compiled program code, read in from non-volatile
storage when the program is launched.
o The data section stores global and static variables, allocated and initialized prior to
executing main.
o The heap is used for dynamic memory allocation, and is managed via calls to new,
delete, malloc, free, etc.
o The stack is used for local variables. Space on the stack is reserved for local variables
when they are declared ( at function entrance or elsewhere, depending on the
language ), and the space is freed up when the variables go out of scope. Note that
the stack is also used for function return values, and the exact mechanisms of stack
management may be language specific.
o Note that the stack and the heap start at opposite ends of the process's free space and
grow towards each other. If they should ever meet, then either a stack overflow error
will occur, or else a call to new or malloc will fail due to insufficient memory
available.
• When processes are swapped out of memory and later restored, additional
information must also be stored and restored. Key among them are the program
counter and the value of all program registers.

Figure 3.1 - A process in memory
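
The short program below ( an added illustration, not part of the original notes ) shows where
each of the four sections appears in practice; printing the addresses makes the layout of
Figure 3.1 concrete on a typical system.

/* Where program objects live: text, data, heap, and stack. */
#include <stdio.h>
#include <stdlib.h>

int global_count = 42;                 /* data section: global / static variable */

int square( int x ) {                  /* the compiled code itself: text section */
    int result = x * x;                /* stack: local variable, freed on return */
    return result;
}

int main( void ) {
    int local = square( global_count );            /* stack                     */
    int *dynamic = malloc( 10 * sizeof( int ) );   /* heap: dynamic allocation  */

    printf( "text:  %p\n", (void *) &square );
    printf( "data:  %p\n", (void *) &global_count );
    printf( "stack: %p\n", (void *) &local );
    printf( "heap:  %p\n", (void *) dynamic );

    free( dynamic );
    return 0;
}
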

3.1.2 Process State

• Processes may be in one of 5 states, as shown in Figure 3.2 below.


o New - The process is in the stage of being created.
o Ready - The process has all the resources available that it needs to run, but the CPU
is not currently working on this process's instructions.
o Running - The CPU is working on this process's instructions.
o Waiting - The process cannot run at the moment, because it is waiting for some
resource to become available or for some event to occur. For example the process
may be waiting for keyboard input, disk access request, inter-process messages, a
timer to go off, or a child process to finish.
o Terminated - The process has completed.
• The load average reported by the "w" command indicates the average number of
processes in the "Ready" state over the last 1, 5, and 15 minutes, i.e. processes which
have everything they need to run but cannot because the CPU is busy doing
something else.
• Some systems may have other states besides the ones listed here.

Figure 3.2 - Diagram of process state

3.1.3 Process Control Block

For each process there is a Process Control Block, PCB, which stores the following
( types of ) process-specific information, as illustrated in Figure 3.1. ( Specific
details may vary from system to system. )

• Process State - Running, waiting, etc., as discussed above.


• Process ID, and parent process ID.
• CPU registers and Program Counter - These need to be saved and restored when
swapping processes in and out of the CPU.
• CPU-Scheduling information - Such as priority information and pointers to
scheduling queues.
• Memory-Management information - E.g. page tables or segment tables.
• Accounting information - user and kernel CPU time consumed, account numbers,
limits, etc.
• I/O Status information - Devices allocated, open file tables, etc.
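
The hypothetical structure below ( an added sketch, not taken from any real kernel ) simply
gathers these fields into one C struct; real PCBs, such as Linux's task_struct, carry far more
information.

/* A hypothetical Process Control Block layout ( illustration only ). */
#include <stdint.h>

enum proc_state { NEW, READY, RUNNING, WAITING, TERMINATED };

struct pcb {
    enum proc_state state;             /* process state                      */
    int             pid;               /* process ID                         */
    int             ppid;              /* parent process ID                  */
    uint64_t        program_counter;   /* saved program counter              */
    uint64_t        registers[ 16 ];   /* saved CPU registers                */
    int             priority;          /* CPU-scheduling information         */
    void           *page_table;        /* memory-management information      */
    uint64_t        user_time;         /* accounting: user CPU time consumed */
    uint64_t        kernel_time;       /* accounting: kernel CPU time        */
    int             open_files[ 32 ];  /* I/O status: open file table        */
};
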

Figure 3.3 - Process control block ( PCB )

Figure 3.4 - Diagram showing CPU switch from process to process


3.1.4 Threads

• Modern systems allow a single process to have multiple threads of execution, which
execute concurrently. Threads are covered extensively in the next chapter.

3.2 Process Scheduling

• The two main objectives of the process scheduling system are to keep the CPU busy
at all times and to deliver "acceptable" response times for all programs, particularly
for interactive ones.
• The process scheduler must meet these objectives by implementing suitable policies
for swapping processes in and out of the CPU.
• ( Note that these objectives can be conflicting. In particular, every time the system
steps in to swap processes it takes up time on the CPU to do so, which is thereby
"lost" from doing any useful productive work. )

3.2.1 Scheduling Queues

• All processes are stored in the job queue.


• Processes in the Ready state are placed in the ready queue.
• Processes waiting for a device to become available or to deliver data are placed
in device queues. There is generally a separate device queue for each device.
• Other queues may also be created and used as needed.
Figure 3.5 - The ready queue and various I/O device queues

3.2.2 Schedulers

• A long-term scheduler is typical of a batch system or a very heavily loaded system.
It runs infrequently, ( such as when one process ends and another must be selected to be
loaded in from disk in its place ), and can afford to take the time to implement intelligent
and advanced scheduling algorithms.
• The short-term scheduler, or CPU Scheduler, runs very frequently, on the order of
every 100 milliseconds, and must very quickly swap one process out of the CPU and swap
in another one.
• Some systems also employ a medium-term scheduler. When system loads get high,
this scheduler will swap one or more processes out of the ready queue system for a
few seconds, in order to allow smaller faster jobs to finish up quickly and clear the
system. See the differences in Figures 3.7 and 3.8 below.
• An efficient scheduling system will select a good process mix of CPU-
bound processes and I/O bound processes.

Figure 3.6 - Queueing-diagram representation of process scheduling

Figure 3.7 - Addition of a medium-term scheduling to the queueing diagram


3.2.3 Context Switch

• Whenever an interrupt arrives, the CPU must do a state-save of the currently
running process, then switch into kernel mode to handle the interrupt, and then do
a state-restore of the interrupted process.
• Similarly, a context switch occurs when the time slice for one process has expired
and a new process is to be loaded from the ready queue. This will be instigated by a
timer interrupt, which will then cause the current process's state to be saved and the
new process's state to be restored.
• Saving and restoring states involves saving and restoring all of the registers and
program counter(s), as well as the process control blocks described above.
• Context switching happens VERY VERY frequently, and the overhead of doing the
switching is just lost CPU time, so context switches ( state saves & restores ) need
to be as fast as possible. Some hardware has special provisions for speeding this up,
such as a single machine instruction for saving or restoring all registers at once.

3.3 Operations on Processes

3.3.1 Process Creation


• Processes may create other processes through appropriate system calls, such
as fork or spawn. The process which does the creating is termed the parent of the
other process, which is termed its child.
• Each process is given an integer identifier, termed its process identifier, or PID.
The parent PID ( PPID ) is also stored for each process.
• On typical UNIX systems the process scheduler is termed sched, and is given PID
0. The first thing it does at system startup time is to launch init, which gives that
process PID 1. Init then launches all system daemons and user logins, and becomes
the ultimate parent of all other processes. Figure 3.9 shows a typical process tree for
a Linux system, and other systems will have similar though not identical trees:

Figure 3.8 - A tree of processes on a typical Linux system

• Depending on system implementation, a child process may share some resources
with its parent. Child processes may or may not be limited to a subset of the resources
originally allocated to the parent, preventing runaway children from consuming all of
a certain system resource.
• There are two options for the parent process after creating the child:
1. Wait for the child process to terminate before proceeding. The parent makes a wait(
) system call, for either a specific child or for any child, which causes the parent
process to block until the wait( ) returns. UNIX shells normally wait for their
children to complete before issuing a new prompt.
2. Run concurrently with the child, continuing to process without waiting. This is the
operation seen when a UNIX shell runs a process as a background task. It is also
possible for the parent to run for a while, and then wait for the child later, which
might occur in a sort of a parallel processing operation. ( E.g. the parent may fork
off a number of children without waiting for any of them, then do a little work of its
own, and then wait for the children. )
• Two possibilities for the address space of the child relative to the parent:
1. The child may be an exact duplicate of the parent, sharing the same program and
data segments in memory. Each will have their own PCB, including program
counter, registers, and PID. This is the behavior of the fork system call in UNIX.
2. The child process may have a new program loaded into its address space, with all
new code and data segments. This is the behavior of the spawn system calls in
Windows. UNIX systems implement this as a second step, using the exec system
call.
• Figures 3.9 and 3.10 below show the fork and exec process on a UNIX system.
Note that the fork system call returns in both processes: it returns zero to the child
process and the child's ( non-zero ) PID to the parent, so the return value indicates
which process is which. Process IDs can be looked up at any time for the current
process or its direct parent using the getpid( ) and getppid( ) system calls respectively.
Figure 3.9 Creating a separate process using the UNIX fork( ) system call.
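
The figure itself is not reproduced in these notes; the sketch below reconstructs the example
along the same lines, using the standard fork( ), execlp( ), and wait( ) calls.

/* Creating a separate process with fork( ), replacing its image with exec,
 * and having the parent wait for it to finish. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main( void ) {
    pid_t pid = fork( );                          /* create a duplicate ( child ) process */

    if( pid < 0 ) {                               /* fork failed */
        fprintf( stderr, "Fork failed\n" );
        return 1;
    }
    else if( pid == 0 ) {                         /* child: fork returned 0 */
        execlp( "/bin/ls", "ls", (char *) NULL ); /* replace the child's code with ls */
        _exit( 1 );                               /* only reached if exec failed */
    }
    else {                                        /* parent: fork returned the child's PID */
        wait( NULL );                             /* block until the child completes */
        printf( "Child %d complete\n", (int) pid );
    }
    return 0;
}
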

Figure 3.10 - Process creation using the fork( ) system call

• Related man pages:


o fork( 2 )
o exec( 3 )
o wait( 2 )
Figure 3.12 shows the more complicated process for Windows, which must provide
all of the parameter information for the new process as part of the forking process.

Figure 3.11
3.3.2 Process Termination

• Processes may request their own termination by making the exit( ) system call,
typically returning an int. This int is passed along to the parent if it is doing a wait(
), and is typically zero on successful completion and some non-zero code in the event
of problems.
o child code:

      int exitCode = 0;
      exit( exitCode );   /* "return exitCode;" has the same effect when executed from main( ) */

o parent code:

      pid_t pid;
      int status;
      pid = wait( &status );
      /* pid indicates which child exited; the exit code is in the low-order bits of status,
         e.g. WEXITSTATUS( status ). Other macros ( WIFEXITED, WIFSIGNALED ) test the
         high-order bits of status for why it stopped. */

• Processes may also be terminated by the system for a variety of reasons, including:
o The inability of the system to deliver necessary system resources.
o In response to a KILL command, or other unhandled process interrupt.
o A parent may kill its children if the task assigned to them is no longer needed.
o If the parent exits, the system may or may not allow the child to continue without a
parent. ( On UNIX systems, orphaned processes are generally inherited by init,
which then proceeds to kill them. The UNIX nohup command allows a child to
continue executing after its parent has exited. )
• When a process terminates, all of its system resources are freed up, open files flushed
and closed, etc. The process termination status and execution times are returned to
the parent if the parent is waiting for the child to terminate, or eventually returned to
init if the process becomes an orphan. ( Processes which are trying to terminate but
which cannot because their parent is not waiting for them are termed zombies. These
are eventually inherited by init as orphans and killed off. Note that modern UNIX
shells do not produce as many orphans and zombies as older systems used to. )

3.4 Interprocess Communication

• Independent Processes operating concurrently on a system are those that can
neither affect other processes nor be affected by other processes.
• Cooperating Processes are those that can affect or be affected by other processes.
There are several reasons why cooperating processes are allowed:
o Information Sharing - There may be several processes which need access to the same
file for example. ( e.g. pipelines. )
o Computation speedup - Often a problem can be solved faster if it can be broken
down into sub-tasks to be solved simultaneously ( particularly when multiple
processors are involved. )
o Modularity - The most efficient architecture may be to break a system down into
cooperating modules. ( E.g. databases with a client-server architecture. )
o Convenience - Even a single user may be multi-tasking, such as editing, compiling,
printing, and running the same code in different windows.

• Cooperating processes require some type of inter-process communication, which is
most commonly one of two types: Shared Memory systems or Message Passing
systems. Figure 3.12 below illustrates the difference between the two:

Figure 3.12 - Communications models: (a) Message passing. (b) Shared memory.

• Shared Memory is faster once it is set up, because no system calls are required and
access occurs at normal memory speeds. However it is more complicated to set up,
and doesn't work as well across multiple computers. Shared memory is generally
preferable when large amounts of information must be shared quickly on the same
computer.
• Message Passing requires system calls for every message transfer, and is therefore
slower, but it is simpler to set up and works well across multiple computers. Message
passing is generally preferable when the amount and/or frequency of data transfers
is small, or when multiple computers are involved.

3.4.1 Shared-Memory Systems

• In general the memory to be shared in a shared-memory system is initially within
the address space of a particular process, which needs to make system calls in order
to make that memory publicly available to one or more other processes.
• Other processes which wish to use the shared memory must then make their own
system calls to attach the shared memory area onto their address space.
• Generally a few messages must be passed back and forth between the cooperating
processes first in order to set up and coordinate the shared memory access.
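
A minimal sketch of those system calls is shown below, assuming the POSIX shared-memory API
( shm_open / ftruncate / mmap ); the object name "/demo_shm" is hypothetical, and the
producer-consumer buffer of the next example could be placed in such a region.

/* Making a region of memory shareable between processes ( POSIX sketch ).
 * Other processes map the same name to attach the region.  Link with -lrt
 * on older Linux systems. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main( void ) {
    const size_t size = 4096;

    /* Create ( or open ) a named shared-memory object and size it. */
    int fd = shm_open( "/demo_shm", O_CREAT | O_RDWR, 0666 );
    if( fd < 0 || ftruncate( fd, size ) < 0 ) {
        perror( "shm_open/ftruncate" );
        return 1;
    }

    /* Map the object into this process's address space. */
    char *region = mmap( NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );
    if( region == MAP_FAILED ) {
        perror( "mmap" );
        return 1;
    }

    strcpy( region, "hello from the producer" );   /* now visible to other processes */

    munmap( region, size );
    close( fd );
    /* shm_unlink( "/demo_shm" ) would remove the object when no longer needed. */
    return 0;
}
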

Producer-Consumer Example Using Shared Memory

• This is a classic example, in which one process is producing data and another process
is consuming the data. ( In this example the data is consumed in the order in which it is
produced, although that could vary. )
• The data is passed via an intermediary buffer, which may be either unbounded or
bounded. With a bounded buffer the producer may have to wait until there is space
available in the buffer, but with an unbounded buffer the producer will never need
to wait. The consumer may need to wait in either case until there is data available.
• This example uses shared memory and a circular queue. Note in the code below that
only the producer changes "in", and only the consumer changes "out", and that they
can never be accessing the same array location at the same time.
• First the following data is set up in the shared memory area:

#define BUFFER_SIZE 10

typedef struct {
...
} item;

item buffer[ BUFFER_SIZE ];


int in = 0;
int out = 0;
• Then the producer process. Note that the buffer is full when "in" is one less than
"out" in a circular sense:

// Code from Figure 3.13

item nextProduced;
while( true ) {

/* Produce an item and store it in nextProduced */


nextProduced = makeNewItem( . . . );

/* Wait for space to become available */


while( ( ( in + 1 ) % BUFFER_SIZE ) == out )
; /* Do nothing */

/* And then store the item and repeat the loop. */


buffer[ in ] = nextProduced;
in = ( in + 1 ) % BUFFER_SIZE;
}

• Then the consumer process. Note that the buffer is empty when "in" is equal to "out":

// Code from Figure 3.14

item nextConsumed;

while( true ) {

/* Wait for an item to become available */


while( in == out )
; /* Do nothing */

/* Get the next available item */


nextConsumed = buffer[ out ];
out = ( out + 1 ) % BUFFER_SIZE;

/* Consume the item in nextConsumed


( Do something with it ) */
}

3.4.2 Message-Passing Systems

• Message passing systems must support at a minimum system calls for "send
message" and "receive message".
• A communication link must be established between the cooperating processes before
messages can be sent.
• There are three key issues to be resolved in message passing systems as further
explored in the next three subsections:
o Direct or indirect communication ( naming )
o Synchronous or asynchronous communication
o Automatic or explicit buffering.

3.4.2.1 Naming

• With direct communication the sender must know the name of the receiver to
which it wishes to send a message.
o There is a one-to-one link between every sender-receiver pair.
o For symmetric communication, the receiver must also know the specific name of
the sender from which it wishes to receive messages.
For asymmetric communications, this is not necessary.
• Indirect communication uses shared mailboxes, or ports.
o Multiple processes can share the same mailbox or boxes.
o Only one process can read any given message in a mailbox. Initially the process that
creates the mailbox is the owner, and is the only one allowed to read mail in the
mailbox, although this privilege may be transferred.
▪ ( Of course the process that reads the message can immediately turn around and
place an identical message back in the box for someone else to read, but that may
put it at the back end of a queue of messages. )
o The OS must provide system calls to create and delete mailboxes, and to send and
receive messages to/from mailboxes.

3.4.2.2 Synchronization

• Either the sending or receiving of messages ( or neither or both ) may be
either blocking or non-blocking.

3.4.2.3 Buffering
• Messages are passed via queues, which may have one of three capacity
configurations:
1. Zero capacity - Messages cannot be stored in the queue, so senders must block until
receivers accept the messages.
2. Bounded capacity- There is a certain pre-determined finite capacity in the queue.
Senders must block if the queue is full, until space becomes available in the queue,
but may be either blocking or non-blocking otherwise.
3. Unbounded capacity - The queue has a theoretical infinite capacity, so senders are
never forced to block.
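
The sketch below illustrates blocking send and receive on a bounded-capacity queue, assuming
the POSIX message-queue API ( mq_open / mq_send / mq_receive ); the queue name "/demo_mq" and
its attributes are hypothetical.

/* Message passing over a bounded-capacity queue ( POSIX sketch ).
 * mq_send( ) blocks when the queue is full and mq_receive( ) blocks when it
 * is empty, matching the "bounded capacity" case above.  Link with -lrt. */
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <string.h>

int main( void ) {
    struct mq_attr attr;
    memset( &attr, 0, sizeof( attr ) );
    attr.mq_maxmsg  = 10;        /* bounded capacity: at most 10 queued messages */
    attr.mq_msgsize = 128;       /* maximum size of each message in bytes        */

    mqd_t mq = mq_open( "/demo_mq", O_CREAT | O_RDWR, 0666, &attr );
    if( mq == (mqd_t) -1 ) {
        perror( "mq_open" );
        return 1;
    }

    const char *msg = "hello";
    mq_send( mq, msg, strlen( msg ) + 1, 0 );                 /* blocks if queue is full */

    char buf[ 128 ];
    ssize_t n = mq_receive( mq, buf, sizeof( buf ), NULL );   /* blocks if queue is empty */
    if( n >= 0 )
        printf( "received: %s\n", buf );

    mq_close( mq );
    mq_unlink( "/demo_mq" );     /* remove the queue name when no longer needed */
    return 0;
}
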
Threads
References:

1. Abraham Silberschatz, Greg Gagne, and Peter Baer Galvin, "Operating System
Concepts, Ninth Edition", Chapter 4

4.1 Overview

• A thread is a basic unit of CPU utilization, consisting of a program counter,
a stack, and a set of registers, ( and a thread ID. )
• Traditional ( heavyweight ) processes have a single thread of control - There
is one program counter, and one sequence of instructions that can be carried
out at any given time.
• As shown in Figure 4.1, multi-threaded applications have multiple threads
within a single process, each having their own program counter, stack and set
of registers, but sharing common code, data, and certain structures such as
open files.
Figure 4.1 - Single-threaded and multithreaded processes

4.1.2 Benefits

• There are four major categories of benefits to multi-threading:


1. Responsiveness - One thread may provide rapid response while other threads
are blocked or slowed down doing intensive calculations.
2. Resource sharing - By default threads share common code, data, and other
resources, which allows multiple tasks to be performed simultaneously in a
single address space.
3. Economy - Creating and managing threads ( and context switches between
them ) is much faster than performing the same tasks for processes.
4. Scalability, i.e. Utilization of multiprocessor architectures - A single threaded
process can only run on one CPU, no matter how many may be available,
whereas the execution of a multi-threaded application may be split amongst
available processors. ( Note that single threaded processes can still benefit from
multi-processor architectures when there are multiple processes contending for
the CPU, i.e. when the load average is above some certain threshold. )

4.2 Multithreading Models

• There are two types of threads to be managed in a modern system: User
threads and kernel threads.
• User threads are supported above the kernel, without kernel support. These
are the threads that application programmers would put into their programs.
• Kernel threads are supported within the kernel of the OS itself. All modern
OSes support kernel level threads, allowing the kernel to perform multiple
simultaneous tasks and/or to service multiple kernel system calls
simultaneously.
• In a specific implementation, the user threads must be mapped to kernel
threads, using one of the following strategies.

4.2.1 Many-To-One Model

• In the many-to-one model, many user-level threads are all mapped onto a
single kernel thread.
• Thread management is handled by the thread library in user space, which is
very efficient.
• However, if a blocking system call is made, then the entire process blocks,
even if the other user threads would otherwise be able to continue.
• Because a single kernel thread can operate only on a single CPU, the many-
to-one model does not allow individual processes to be split across multiple
CPUs.
• Green threads on Solaris and GNU Portable Threads implemented the many-to-
one model in the past, but few systems continue to do so today.

Figure 4.2 - Many-to-one model

4.2.2 One-To-One Model

• The one-to-one model creates a separate kernel thread to handle each user
thread.
• One-to-one model overcomes the problems listed above involving blocking
system calls and the splitting of processes across multiple CPUs.
• However the overhead of managing the one-to-one model is more significant,
involving more overhead and slowing down the system.
• Most implementations of this model place a limit on how many threads can
be created.
• Linux and Windows from 95 to XP implement the one-to-one model for
threads.
Figure 4.3 - One-to-one model

4.2.3 Many-To-Many Model

• The many-to-many model multiplexes any number of user threads onto an
equal or smaller number of kernel threads, combining the best features of the
one-to-one and many-to-one models.
• Users have no restrictions on the number of threads created.
• Blocking kernel system calls do not block the entire process.
• Processes can be split across multiple processors.
• Individual processes may be allocated variable numbers of kernel threads,
depending on the number of CPUs present and other factors.

Figure 4.4 - Many-to-many model


• One popular variation of the many-to-many model is the two-tier model,
which allows either many-to-many or one-to-one operation.
• IRIX, HP-UX, and Tru64 UNIX use the two-tier model, as did Solaris prior
to Solaris 9.

Figure 4.5 - Two-level model

4.3 Thread Libraries

• Thread libraries provide programmers with an API for creating and managing
threads.
• Thread libraries may be implemented either in user space or in kernel space.
The former involves API functions implemented solely within user space,
with no kernel support. The latter involves system calls, and requires a kernel
with thread library support.
• There are three main thread libraries in use today:
1. POSIX Pthreads - may be provided as either a user or kernel library, as
an extension to the POSIX standard.
2. Win32 threads - provided as a kernel-level library on Windows
systems.
3. Java threads - Since Java generally runs on a Java Virtual Machine, the
implementation of threads is based upon whatever OS and hardware the
JVM is running on, i.e. either Pthreads or Win32 threads depending on
the system.
• The following sections will demonstrate the use of threads in all three systems
for calculating the sum of integers from 0 to N in a separate thread, and storing
the result in a variable "sum".

4.3.1 Pthreads

• The POSIX standard ( IEEE 1003.1c ) defines the specification for pThreads,
not the implementation.
• pThreads are available on Solaris, Linux, Mac OSX, Tru64, and via public
domain shareware for Windows.
• Global variables are shared amongst all threads.
• One thread can wait for the others to rejoin before continuing.
• pThreads begin execution in a specified function, in this example the runner(
) function:
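The program itself is not reproduced in these notes; the following is a minimal sketch in the spirit of the textbook example, summing the integers from 0 to N in a separate runner( ) thread ( the variable names and command-line handling are assumptions of this sketch ):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int sum;   /* shared by all threads of this process */

/* The new thread begins control in this function. */
void *runner( void *param )
{
    int i, upper = atoi( (char *) param );
    sum = 0;
    for( i = 1; i <= upper; i++ )
        sum += i;
    pthread_exit( 0 );
}

int main( int argc, char *argv[ ] )
{
    pthread_t tid;        /* the thread identifier          */
    pthread_attr_t attr;  /* set of thread attributes       */

    if( argc != 2 ) {
        fprintf( stderr, "usage: a.out <integer value>\n" );
        return -1;
    }

    pthread_attr_init( &attr );                        /* get the default attributes   */
    pthread_create( &tid, &attr, runner, argv[ 1 ] );  /* create the thread            */
    pthread_join( tid, NULL );                         /* wait for the thread to exit  */
    printf( "sum = %d\n", sum );
    return 0;
}

Compile with the pthread library ( e.g. gcc prog.c -lpthread ). Note how the global variable sum is visible to both threads, and how pthread_join( ) lets main( ) wait for the runner thread to finish before printing.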
4.3.2 Windows Threads

• Similar to pThreads. Examine the code example to see the differences, which
are mostly syntactic & nomenclature:
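The Windows version is not reproduced here either; a rough sketch of the same summation program using the Win32 calls CreateThread( ), WaitForSingleObject( ) and CloseHandle( ) follows ( the function and variable names, and the hard-coded upper bound, are assumptions of this sketch ):

#include <windows.h>
#include <stdio.h>

DWORD Sum;   /* data shared by the threads of this process */

/* The thread runs in this function. */
DWORD WINAPI Summation( LPVOID Param )
{
    DWORD i, Upper = *(DWORD *) Param;
    for( i = 1; i <= Upper; i++ )
        Sum += i;
    return 0;
}

int main( void )
{
    DWORD ThreadId;
    HANDLE ThreadHandle;
    DWORD Param = 10;   /* sum the integers 0..10 in this sketch */

    /* Create the thread: default security attributes, default stack size,
       starting function Summation, and Param as its argument. */
    ThreadHandle = CreateThread( NULL, 0, Summation, &Param, 0, &ThreadId );

    if( ThreadHandle != NULL ) {
        WaitForSingleObject( ThreadHandle, INFINITE );  /* wait for the thread to finish */
        CloseHandle( ThreadHandle );                    /* release the thread handle     */
        printf( "sum = %lu\n", Sum );
    }
    return 0;
}

As the notes say, the structure mirrors the Pthreads version: create, wait for completion, then read the shared result.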
4.3.3 Java Threads

• ALL Java programs use Threads - even "common" single-threaded ones.


• The creation of new Threads requires Objects that implement the Runnable
Interface, which means they contain a method "public void run( )" . Any
descendant of the Thread class will naturally contain such a method. ( In
practice the run( ) method must be overridden / provided for the thread to have
any practical functionality. )
• Creating a Thread Object does not start the thread running - To do that the
program must call the Thread's "start( )" method. Start( ) allocates and
initializes memory for the Thread, and then calls the run( ) method. (
Programmers do not call run( ) directly. )
• Because Java does not support global variables, Threads must be passed a
reference to a shared Object in order to share data, in this example the "Sum"
Object.
• Note that the JVM runs on top of a native OS, and that the JVM specification
does not specify what model to use for mapping Java threads to kernel threads.
This decision is JVM implementation dependent, and may be one-to-one, many-to-many, or many-to-one. ( On a UNIX system the JVM normally uses Pthreads, and on a Windows system it normally uses Windows threads. )
4.4 Threading Issues

4.4.1 The fork( ) and exec( ) System Calls

• Q: If one thread forks, is the entire process copied, or is the new process
single-threaded?
• A: System dependent.
• A: If the new process execs right away, there is no need to copy all the other
threads. If it doesn't, then the entire process should be copied.
• A: Many versions of UNIX provide multiple versions of the fork call for this
purpose.

4.4.2 Signal Handling

• Q: When a multi-threaded process receives a signal, to what thread should that signal be delivered?
• A: There are four major options:
1. Deliver the signal to the thread to which the signal applies.
2. Deliver the signal to every thread in the process.
3. Deliver the signal to certain threads in the process.
4. Assign a specific thread to receive all signals in a process.
• The best choice may depend on which specific signal is involved.
• UNIX allows individual threads to indicate which signals they are accepting
and which they are ignoring. However the signal can only be delivered to one
thread, which is generally the first thread that is accepting that particular
signal.
• UNIX provides two separate system calls, kill( pid, signal
) and pthread_kill( tid, signal ), for delivering signals to processes or specific
threads respectively.
• Windows does not support signals, but they can be emulated using
Asynchronous Procedure Calls ( APCs ). APCs are delivered to specific
threads, not processes.

4.4.3 Thread Cancellation

• Threads that are no longer needed may be cancelled by another thread in one
of two ways:
1. Asynchronous Cancellation cancels the thread immediately.
2. Deferred Cancellation sets a flag indicating the thread should cancel itself
when it is convenient. It is then up to the cancelled thread to check this flag
periodically and exit nicely when it sees the flag set.
• ( Shared ) resource allocation and inter-thread data transfers can be
problematic with asynchronous cancellation.

4.4.4 Thread-Local Storage ( was 4.4.5 Thread-Specific Data )

• Most data is shared among threads, and this is one of the major benefits of
using threads in the first place.
• However sometimes threads need thread-specific data also.
• Most major thread libraries ( pThreads, Win32, Java ) provide support for
thread-specific data, known as thread-local storage or TLS. Note that this is
more like static data than local variables, because it does not cease to exist when the function ends.

4.4.5 Scheduler Activations


• Many implementations of threads provide a virtual processor as an interface
between the user thread and the kernel thread, particularly for the many-to-
many or two-tier models.
• This virtual processor is known as a "Lightweight Process", LWP.
o There is a one-to-one correspondence between LWPs and kernel
threads.
o The number of kernel threads available, ( and hence the number of
LWPs ) may change dynamically.
o The application ( user level thread library ) maps user threads onto
available LWPs.
o kernel threads are scheduled onto the real processor(s) by the OS.
o The kernel communicates to the user-level thread library when certain
events occur ( such as a thread about to block ) via an upcall, which is
handled in the thread library by an upcall handler. The upcall also
provides a new LWP for the upcall handler to run on, which it can then
use to reschedule the user thread that is about to become blocked. The
OS will also issue upcalls when a thread becomes unblocked, so the
thread library can make appropriate adjustments.
• If the kernel thread blocks, then the LWP blocks, which blocks the user thread.
• Ideally there should be at least as many LWPs available as there could be
concurrently blocked kernel threads. Otherwise if all LWPs are blocked, then
user threads will have to wait for one to become available.
CPU Scheduling

References:

1. Abraham Silberschatz, Greg Gagne, and Peter Baer Galvin, "Operating System Concepts,
Ninth Edition ", Chapter 6

2.1 Basic Concepts

• Almost all programs have some alternating cycle of CPU number crunching and waiting
for I/O of some kind. ( Even a simple fetch from memory takes a long time relative to CPU
speeds. )
• In a simple system running a single process, the time spent waiting for I/O is wasted, and
those CPU cycles are lost forever.
• A scheduling system allows one process to use the CPU while another is waiting for I/O,
thereby making full use of otherwise lost CPU cycles.
• The challenge is to make the overall system as "efficient" and "fair" as possible, subject to
varying and often dynamic conditions, and where "efficient" and "fair" are somewhat
subjective terms, often subject to shifting priority policies.

2.1.1 CPU-I/O Burst Cycle

• Almost all processes alternate between two states in a continuing cycle, as shown in Figure
2.1 below :
o A CPU burst of performing calculations, and
o An I/O burst, waiting for data transfer in or out of the system.

Figure 2.1 - Alternating sequence of CPU and I/O bursts.

2.1.2 CPU Scheduler

• Whenever the CPU becomes idle, it is the job of the CPU Scheduler (the short-term
scheduler ) to select another process from the ready queue to run next.
• The ready queue is not necessarily a FIFO queue; there are several data structures and selection algorithms to choose from, as well as numerous adjustable parameters for each algorithm.

2.1.3. Preemptive Scheduling

• CPU scheduling decisions take place under one of four conditions:


1. When a process switches from the running state to the waiting state, such as for an
I/O request or invocation of the wait( ) system call.
2. When a process switches from the running state to the ready state, for example in
response to an interrupt.
3. When a process switches from the waiting state to the ready state, say at completion
of I/O or a return from wait( ).
4. When a process terminates.
• For conditions 1 and 4 there is no choice - A new process must be selected.
• For conditions 2 and 3 there is a choice - To either continue running the current process,
or select a different one.
• If scheduling takes place only under conditions 1 and 4, the system is said to be non-
preemptive, or cooperative. Under these conditions, once a process starts running it keeps
running, until it either voluntarily blocks or until it finishes. Otherwise the system is said
to be preemptive.

2.1.4 Dispatcher

• The dispatcher is the module that gives control of the CPU to the process selected by the
scheduler. This function involves:
o Switching context.
o Switching to user mode.
o Jumping to the proper location in the newly loaded program.
• The dispatcher needs to be as fast as possible, as it is run on every context switch. The time
consumed by the dispatcher is known as dispatch latency.

2.2 Scheduling Criteria

• There are several different criteria to consider when trying to select the "best" scheduling
algorithm for a particular situation and environment, including:
o CPU utilization - Ideally the CPU would be busy 100% of the time, so as to waste 0
CPU cycles. On a real system CPU usage should range from 40% ( lightly loaded ) to
80% ( heavily loaded. )
o Throughput - Number of processes completed per unit time. May range from 10 /
second to 1 / hour depending on the specific processes.
o Turnaround time - Time required for a particular process to complete, from
submission time to completion. ( Wall clock time. )
o Waiting time - How much time processes spend in the ready queue waiting their turn
to get on the CPU.
▪ ( Load average - The average number of processes sitting in the ready queue waiting
their turn to get into the CPU. Reported in 1-minute, 5-minute, and 15-minute averages
by "uptime" and "who". )
o Response time - The time taken in an interactive program from the issuance of a command to the commencement of a response to that command.
• In general one wants to optimize the average value of a criterion ( maximize CPU utilization and throughput, and minimize all the others. ) However sometimes one wants to do something different, such as to minimize the maximum response time.
• Sometimes it is more desirable to minimize the variance of a criterion than its average value, i.e. users are more accepting of a consistent, predictable system than of an inconsistent one, even if it is a little bit slower.

2.3 Scheduling Algorithms


The following subsections will explain several common scheduling strategies, looking at only a
single CPU burst each for a small number of processes. Obviously real systems have to deal with
a lot more simultaneous processes executing their CPU-I/O burst cycles.

2.3.1 First-Come First-Serve Scheduling, FCFS

• FCFS is very simple - Just a FIFO queue, like customers waiting in line at the bank or the
post office or at a copying machine.
• Unfortunately, however, FCFS can yield some very long average wait times, particularly
if the first process to get there takes a long time. For example, consider the following three
processes:

Process Burst Time

P1 24

P2 3

P3 3

• In the first Gantt chart below, process P1 arrives first. The average waiting time for the
three processes is ( 0 + 24 + 27 ) / 3 = 17.0 ms.
• In the second Gantt chart below, the same three processes have an average wait time of ( 0
+ 3 + 6 ) / 3 = 3.0 ms. The total run time for the three bursts is the same, but in the second
case two of the three finish much quicker, and the other process is only delayed by a short
amount.
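Under FCFS each process simply waits for the total burst time of everything ahead of it in the queue, so the averages above are easy to reproduce. A minimal sketch ( the function name and hard-coded burst lists are assumptions of this sketch ):

#include <stdio.h>

/* Average waiting time under FCFS: each process waits for the
   sum of the bursts of all processes ahead of it in the queue. */
double fcfs_avg_wait( const int burst[ ], int n )
{
    int elapsed = 0, total_wait = 0;
    for( int i = 0; i < n; i++ ) {
        total_wait += elapsed;      /* process i waits for all earlier bursts */
        elapsed += burst[ i ];
    }
    return (double) total_wait / n;
}

int main( void )
{
    int order1[ ] = { 24, 3, 3 };   /* P1, P2, P3 - the first Gantt chart  */
    int order2[ ] = { 3, 3, 24 };   /* P2, P3, P1 - the second Gantt chart */
    printf( "P1 first: %.1f ms\n", fcfs_avg_wait( order1, 3 ) );   /* 17.0 ms */
    printf( "P1 last : %.1f ms\n", fcfs_avg_wait( order2, 3 ) );   /*  3.0 ms */
    return 0;
}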

2.3.2 Shortest-Job-First Scheduling, SJF

• The idea behind the SJF algorithm is to pick the quickest little job that needs to be done, get it out of the way first, and then pick the next smallest job to do next.
• ( Technically this algorithm picks a process based on the next shortest CPU burst, not the
overall process time. )
• For example, the Gantt chart below is based upon the following CPU burst times, ( and the
assumption that all jobs arrive at the same time. )
Process Burst Time

P1 6

P2 8

P3 7

P4 3

• In the case above the average wait time is ( 0 + 3 + 9 + 16 ) / 4 = 7.0 ms, ( as opposed to
10.25 ms for FCFS for the same processes. )

• SJF can be either preemptive or non-preemptive. Preemption occurs when a new process
arrives in the ready queue that has a predicted burst time shorter than the time remaining
in the process whose burst is currently on the CPU. Preemptive SJF is sometimes referred
to as shortest remaining time first scheduling.
• For example, the following Gantt chart is based upon the following data:

Process Arrival Time Burst Time

P1 0 8

P2 1 4

P3 2 9

p4 3 5

• The average wait time in this case is ( ( 10 - 1 ) + ( 1 - 1 ) + ( 17 - 2 ) + ( 5 - 3 ) ) / 4 = 26 / 4 = 6.5 ms. ( As opposed to 7.75 ms for non-preemptive SJF or 8.75 ms for FCFS. )
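For the simpler non-preemptive case with all jobs arriving at time 0, SJF is just FCFS applied to the bursts sorted into ascending order; the sketch below ( function names are assumptions ) reproduces the 7.0 ms figure from the first SJF example above:

#include <stdio.h>
#include <stdlib.h>

static int cmp_int( const void *a, const void *b )
{
    return *(const int *) a - *(const int *) b;
}

/* Non-preemptive SJF with all jobs arriving at time 0 is simply
   FCFS applied to the bursts sorted into ascending order. */
double sjf_avg_wait( int burst[ ], int n )
{
    qsort( burst, n, sizeof( int ), cmp_int );
    int elapsed = 0, total_wait = 0;
    for( int i = 0; i < n; i++ ) {
        total_wait += elapsed;
        elapsed += burst[ i ];
    }
    return (double) total_wait / n;
}

int main( void )
{
    int burst[ ] = { 6, 8, 7, 3 };   /* P1..P4 from the first SJF table */
    printf( "SJF average wait = %.1f ms\n", sjf_avg_wait( burst, 4 ) );   /* 7.0 ms */
    return 0;
}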

2.3.3 Priority Scheduling


• Priority scheduling is a more general case of SJF, in which each job is assigned a priority
and the job with the highest priority gets scheduled first. ( SJF uses the inverse of the next
expected burst time as its priority - The smaller the expected burst, the higher the priority.
)
• Note that in practice, priorities are implemented using integers within a fixed range, but there is no agreed-upon convention as to whether "high" priorities use large numbers or small numbers. This book uses low numbers for high priorities, with 0 being the highest possible priority.
• For example, the following Gantt chart is based upon these process burst times and
priorities, and yields an average waiting time of 8.2 ms:

Process Burst Time Priority

P1 10 3

P2 1 1

P3 2 4

P4 1 5

P5 5 2

• Priorities can be assigned either internally or externally. Internal priorities are assigned by
the OS using criteria such as average burst time, ratio of CPU to I/O activity, system
resource use, and other factors available to the kernel. External priorities are assigned by
users, based on the importance of the job, fees paid, politics, etc.
• Priority scheduling can be either preemptive or non-preemptive.
• Priority scheduling can suffer from a major problem known as indefinite blocking,
or starvation, in which a low-priority task can wait forever because there are always some
other jobs around that have higher priority.
o If this problem is allowed to occur, then processes will either run eventually when
the system load lightens ( at say 2:00 a.m. ), or will eventually get lost when the
system is shut down or crashes. ( There are rumors of jobs that have been stuck for
years. )
o One common solution to this problem is aging, in which priorities of jobs increase
the longer they wait. Under this scheme a low-priority job will eventually get its
priority raised high enough that it gets run.
2.3.4 Round Robin Scheduling

• Round robin scheduling is similar to FCFS scheduling, except that each process is given the CPU for at most a fixed time interval known as the time quantum.
• When a process is given the CPU, a timer is set for whatever value has been set for a time
quantum.
o If the process finishes its burst before the time quantum timer expires, then it is
swapped out of the CPU just like the normal FCFS algorithm.
o If the timer goes off first, then the process is swapped out of the CPU and moved
to the back end of the ready queue.
• The ready queue is maintained as a circular queue, so when all processes have had a turn,
then the scheduler gives the first process another turn, and so on.
• RR scheduling can give the effect of all processes sharing the CPU equally, although the average wait time can be longer than with other scheduling algorithms. In the following example ( with a time quantum of 4 ms ) the average wait time is 5.66 ms.

Process Burst Time

P1 24

P2 3

P3 3

• The performance of RR is sensitive to the time quantum selected. If the quantum is large enough, then RR reduces to the FCFS algorithm; if it is very small, then each of the n processes effectively gets 1/nth of the processor time, as if they were sharing the CPU equally on their own slower processors.
• BUT, a real system invokes overhead for every context switch, and the smaller the time
quantum the more context switches there are. ( See Figure 6.4 below. ) Most modern
systems use time quantum between 10 and 100 milliseconds, and context switch times on
the order of 10 microseconds, so the overhead is small relative to the time quantum.
Figure 6.4 - The way in which a smaller time quantum increases context switches.

• Turn around time also varies with quantum time, in a non-apparent manner.

• In general, turnaround time is minimized if most processes finish their next cpu burst within
one time quantum. For example, with three processes of 10 ms bursts each, the average
turnaround time for 1 ms quantum is 29, and for 10 ms quantum it reduces to 20. However,
if it is made too large, then RR just degenerates to FCFS. A rule of thumb is that 80% of
CPU bursts should be smaller than the time quantum.
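The effect of the quantum can be seen with a tiny simulation; the sketch below ( the function name, the fixed-size array, and the assumption that all processes arrive at time 0 are simplifications of this sketch ) reproduces the 5.66 ms average wait quoted above for the three processes and a 4 ms quantum:

#include <stdio.h>

/* Simulate round robin for processes that all arrive at time 0.
   Waiting time = completion time - burst time.
   ( n is assumed to be at most 16 for this sketch. ) */
double rr_avg_wait( const int burst[ ], int n, int quantum )
{
    int remaining[ 16 ], total_wait = 0, time = 0, done = 0;
    for( int i = 0; i < n; i++ )
        remaining[ i ] = burst[ i ];

    while( done < n ) {
        for( int i = 0; i < n; i++ ) {
            if( remaining[ i ] == 0 )
                continue;
            int slice = remaining[ i ] < quantum ? remaining[ i ] : quantum;
            time += slice;
            remaining[ i ] -= slice;
            if( remaining[ i ] == 0 ) {             /* process i just finished */
                total_wait += time - burst[ i ];
                done++;
            }
        }
    }
    return (double) total_wait / n;
}

int main( void )
{
    int burst[ ] = { 24, 3, 3 };    /* P1, P2, P3 */
    /* Total wait = 6 + 4 + 7 = 17 ms, average 17 / 3 = 5.66 ms */
    printf( "RR ( q = 4 ) average wait = %.2f ms\n", rr_avg_wait( burst, 3, 4 ) );
    return 0;
}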

2.3.5 Multilevel Queue Scheduling

• When processes can be readily categorized, then multiple separate queues can be
established, each implementing whatever scheduling algorithm is most appropriate for that
type of job, and/or with different parametric adjustments.
• Scheduling must also be done between queues, that is scheduling one queue to get time
relative to other queues. Two common options are strict priority ( no job in a lower priority
queue runs until all higher priority queues are empty ) and round-robin ( each queue gets a
time slice in turn, possibly of different sizes. )
• Note that under this algorithm jobs cannot switch from queue to queue - Once they are
assigned a queue, that is their queue until they finish.
Figure 6.5 - Multilevel queue scheduling

2.3.6 Multilevel Feedback-Queue Scheduling

• Multilevel feedback queue scheduling is similar to the ordinary multilevel queue


scheduling described above, except jobs may be moved from one queue to another for a
variety of reasons:
o If the characteristics of a job change between CPU-intensive and I/O intensive, then
it may be appropriate to switch a job from one queue to another.
o Aging can also be incorporated, so that a job that has waited for a long time can get
bumped up into a higher priority queue for a while.
• Multilevel feedback queue scheduling is the most flexible, because it can be tuned for any
situation. But it is also the most complex to implement because of all the adjustable
parameters. Some of the parameters which define one of these systems include:
o The number of queues.
o The scheduling algorithm for each queue.
o The methods used to upgrade or demote processes from one queue to another. (
Which may be different. )
o The method used to determine which queue a process enters initially.
Figure 6.6 - Multilevel feedback queues.
2.2 The Critical-Section Problem

• The producer-consumer problem described above is a specific example of a more general situation known as the critical section problem. The general idea is that in a number of cooperating processes, each has a critical section of code, with the following conditions and terminologies:
o Only one process in the group can be allowed to execute in their critical section at
any one time. If one process is already executing their critical section and another
process wishes to do so, then the second process must be made to wait until the first
process has completed their critical section work.
o The code preceding the critical section, and which controls access to the critical
section, is termed the entry section. It acts like a carefully controlled locking door.
o The code following the critical section is termed the exit section. It generally
releases the lock on someone else's door, or at least lets the world know that they
are no longer in their critical section.
o The rest of the code not included in either the critical section or the entry or exit
sections is termed the remainder section.

Figure 5.1 - General structure of a typical process Pi

• A solution to the critical section problem must satisfy the following three conditions:
1. Mutual Exclusion - Only one process at a time can be executing in their critical
section.
2. Progress - If no process is currently executing in their critical section, and one or
more processes want to execute their critical section, then only the processes not in
their remainder sections can participate in the decision, and the decision cannot be
postponed indefinitely. ( I.e. processes cannot be blocked forever waiting to get into
their critical sections. )
3. Bounded Waiting - There exists a limit as to how many other processes can get
into their critical sections after a process requests entry into their critical section
and before that request is granted. ( I.e. a process requesting entry into their critical
section will get a turn eventually, and there is a limit as to how many other processes
get to go first. )
• We assume that all processes proceed at a non-zero speed, but no assumptions can be made
regarding the relative speed of one process versus another.
• Kernel processes can also be subject to race conditions, which can be especially
problematic when updating commonly shared kernel data structures such as open file tables
or virtual memory management. Accordingly kernels can take on one of two forms:
o Non-preemptive kernels do not allow processes to be interrupted while in kernel
mode. This eliminates the possibility of kernel-mode race conditions, but requires
kernel mode operations to complete very quickly, and can be problematic for real-
time systems, because timing cannot be guaranteed.
o Preemptive kernels allow for real-time operations, but must be carefully written to
avoid race conditions. This can be especially tricky on SMP systems, in which
multiple kernel processes may be running simultaneously on different processors.

2.6 Semaphores

• A more robust alternative to simple mutexes is to use semaphores, which are integer
variables for which only two ( atomic ) operations are defined, the wait and signal
operations, as shown in the following figure.
• Note that not only must the variable-changing steps ( S-- and S++ ) be indivisible, it is also
necessary that for the wait operation when the test proves false that there be no interruptions
before S gets decremented. It IS okay, however, for the busy loop to be interrupted when
the test is true, which prevents the system from hanging forever.
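The figure is not reproduced in these notes; the classical busy-waiting definitions look roughly like the following sketch ( written as plain C purely for illustration - a real implementation must execute each operation atomically, which this code does not do ):

/* Conceptual definitions only - NOT atomic as written. */

void wait( int *S )          /* historically also called P( ) */
{
    while( *S <= 0 )
        ;                    /* busy wait ( spin ) until the semaphore is positive */
    ( *S )--;
}

void signal( int *S )        /* historically also called V( ) */
{
    ( *S )++;
}

/* Typical mutual-exclusion usage, with the semaphore mutex initialized to 1:
       wait( &mutex );
       ...critical section...
       signal( &mutex );
       ...remainder section...
*/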
2.6.1 Semaphore Usage

• In practice, semaphores can take on one of two forms:


o Binary semaphores can take on one of two values, 0 or 1. They can be used to solve the critical section problem as described above, and can be used as mutexes on systems that do not provide a separate mutex mechanism. The use of semaphores for this purpose is shown in Figure 6.9 ( from the 8th edition ) below.

Mutual-exclusion implementation with semaphores. ( From 8th edition. )

o Counting semaphores can take on any integer value, and are usually used to count
the number remaining of some limited resource. The counter is initialized to the
number of such resources available in the system, and whenever the counting
semaphore is greater than zero, then a process can enter a critical section and use
one of the resources. When the counter gets to zero ( or negative in some
implementations ), then the process blocks until another process frees up a resource
and increments the counting semaphore with a signal call. ( The binary semaphore
can be seen as just a special case where the number of resources initially available
is just one. )
o Semaphores can also be used to synchronize certain operations between processes.
For example, suppose it is important that process P1 execute statement S1 before
process P2 executes statement S2.
▪ First we create a semaphore named synch that is shared by the two
processes, and initialize it to zero.
▪ Then in process P1 we insert the code:

S1;
signal( synch );

▪ and in process P2 we insert the code:


wait( synch );
S2;

▪ Because synch was initialized to 0, process P2 will block on the wait until
after P1 executes the call to signal.

2.6.2 Semaphore Implementation

• The big problem with semaphores as described above is the busy loop in the wait call,
which consumes CPU cycles without doing any useful work. This type of lock is known
as a spinlock, because the lock just sits there and spins while it waits. While this is
generally a bad thing, it does have the advantage of not invoking context switches, and so
it is sometimes used in multi-processing systems when the wait time is expected to be short
- One thread spins on one processor while another completes their critical section on
another processor.
• An alternative approach is to block a process when it is forced to wait for an available
semaphore, and swap it out of the CPU. In this implementation each semaphore needs to
maintain a list of processes that are blocked waiting for it, so that one of the processes can
be woken up and swapped back in when the semaphore becomes available. ( Whether it
gets swapped back into the CPU immediately or whether it needs to hang out in the ready
queue for a while is a scheduling problem. )
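A sketch of this blocking implementation, close to the textbook's pseudocode ( the struct layout, the fixed-size waiting list, and the block( )/wakeup( ) primitives are schematic assumptions standing in for whatever the kernel actually provides ):

/* Schematic blocking semaphore - block( ) and wakeup( ) stand in for
   kernel primitives that suspend and resume a process. */

struct process;                        /* opaque PCB type                        */
void block( void );                    /* suspend the calling process            */
void wakeup( struct process *p );      /* move p back to the ready queue         */
struct process *current_process( void );

typedef struct {
    int value;                         /* may go negative: -value = # of waiters */
    struct process *list[ 64 ];        /* queue of blocked processes ( fixed size for this sketch ) */
    int head, tail, count;
} semaphore;

void sem_wait( semaphore *S )          /* must be executed atomically */
{
    S->value--;
    if( S->value < 0 ) {
        S->list[ S->tail ] = current_process( );   /* add self to the waiting list */
        S->tail = ( S->tail + 1 ) % 64;
        S->count++;
        block( );                                  /* give up the CPU              */
    }
}

void sem_signal( semaphore *S )        /* must be executed atomically */
{
    S->value++;
    if( S->value <= 0 && S->count > 0 ) {
        struct process *p = S->list[ S->head ];    /* FIFO removal helps avoid starvation */
        S->head = ( S->head + 1 ) % 64;
        S->count--;
        wakeup( p );
    }
}

Note that removing waiters in FIFO order is one design choice; as discussed in the next subsection, a LIFO policy could starve the first waiter.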

2.6.3 Deadlocks and Starvation

• One important problem that can arise when using semaphores to block processes waiting
for a limited resource is the problem of deadlocks, which occur when multiple processes
are blocked, each waiting for a resource that can only be freed by one of the other ( blocked
) processes, as illustrated in the following example. ( Deadlocks are covered more
completely in chapter 7. )
• Another problem to consider is that of starvation, in which one or more processes gets
blocked forever, and never get a chance to take their turn in the critical section. For
example, in the semaphores above, we did not specify the algorithms for adding processes
to the waiting queue in the semaphore in the wait( ) call, or selecting one to be removed
from the queue in the signal( ) call. If the method chosen is a FIFO queue, then every
process will eventually get their turn, but if a LIFO queue is implemented instead, then the
first process to start waiting could starve.

2.7 Monitors

• Semaphores can be very useful for solving concurrency problems, but only if
programmers use them properly. If even one process fails to abide by the proper use of
semaphores, either accidentally or deliberately, then the whole system breaks down. ( And
since concurrency problems are by definition rare events, the problem code may easily go
unnoticed and/or be heinous to debug. )
• For this reason a higher-level language construct has been developed, called monitors.

2.7.1 Monitor Usage

• A monitor is essentially a class, in which all data is private, and with the special restriction
that only one method within any given monitor object may be active at the same time. An
additional restriction is that monitor methods may only access the shared data within the
monitor and any data passed to them as parameters. I.e. they cannot access any data external
to the monitor.
Figure 5.15 - Syntax of a monitor.

• Figure 5.16 shows a schematic of a monitor, with an entry queue of processes waiting their
turn to execute monitor operations ( methods. )
Figure 5.16 - Schematic view of a monitor

• In order to fully realize the potential of monitors, we need to introduce one additional new
data type, known as a condition.
o A variable of type condition has only two legal operations, wait and signal. I.e. if
X was defined as type condition, then legal operations would be X.wait( ) and
X.signal( )
o The wait operation blocks a process until some other process calls signal, and adds
the blocked process onto a list associated with that condition.
o The signal operation does nothing if there are no processes waiting on that condition. Otherwise it wakes up exactly one process from the condition's list of waiting processes. ( Contrast this with counting semaphores, where a signal call always affects the semaphore's value. )
• Figure 5.17 below illustrates a monitor that includes condition variables within its data
space. Note that the condition variables, along with the list of processes currently waiting
for the conditions, are in the data space of the monitor - The processes on these lists are not
"in" the monitor, in the sense that they are not executing any code in the monitor.
Figure 5.17 - Monitor with condition variables

• But now there is a potential problem - If process P within the monitor issues a signal that
would wake up process Q also within the monitor, then there would be two processes
running simultaneously within the monitor, violating the exclusion requirement.
Accordingly there are two possible solutions to this dilemma:

Signal and wait - When process P issues the signal to wake up process Q, P then waits, either for
Q to leave the monitor or on some other condition.

Signal and continue - When P issues the signal, Q waits, either for P to exit the monitor or for
some other condition.

There are arguments for and against either choice. Concurrent Pascal offers a third alternative -
The signal call causes the signaling process to immediately exit the monitor, so that the waiting
process can then wake up and proceed.
2.8 Classic Problems of Synchronization

The following classic problems are used to test virtually every new proposed synchronization
algorithm.

2.8.1 The Bounded-Buffer Problem

• This is a generalization of the producer-consumer problem wherein access is controlled to a shared group of buffers of a limited size.
• In this solution, the two counting semaphores "full" and "empty" keep track of the current number of full and empty buffers respectively ( and are initialized to 0 and N respectively ). The binary semaphore mutex controls access to the critical section. The producer and consumer processes are nearly identical - one can think of the producer as producing full buffers, and the consumer as producing empty buffers.
Figures 5.9 and 5.10 use variables next_produced and next_consumed
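Figures 5.9 and 5.10 are not reproduced here; the following is a runnable equivalent sketch using POSIX semaphores and pthreads rather than the abstract wait( )/signal( ) of the figures ( the item counts, buffer size, and printf output are assumptions of this sketch ):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define BUFFER_SIZE 8
#define ITEMS 20

int buffer[ BUFFER_SIZE ];
int in = 0, out = 0;

sem_t empty;                                         /* counts empty slots, initialized to BUFFER_SIZE */
sem_t full;                                          /* counts full slots,  initialized to 0           */
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;   /* protects the buffer itself                     */

void *producer( void *arg )
{
    for( int item = 0; item < ITEMS; item++ ) {
        sem_wait( &empty );                 /* wait for an empty slot   */
        pthread_mutex_lock( &mutex );
        buffer[ in ] = item;                /* next_produced            */
        in = ( in + 1 ) % BUFFER_SIZE;
        pthread_mutex_unlock( &mutex );
        sem_post( &full );                  /* signal a full slot       */
    }
    return NULL;
}

void *consumer( void *arg )
{
    for( int i = 0; i < ITEMS; i++ ) {
        sem_wait( &full );                  /* wait for a full slot     */
        pthread_mutex_lock( &mutex );
        int item = buffer[ out ];           /* next_consumed            */
        out = ( out + 1 ) % BUFFER_SIZE;
        pthread_mutex_unlock( &mutex );
        sem_post( &empty );                 /* signal an empty slot     */
        printf( "consumed %d\n", item );
    }
    return NULL;
}

int main( void )
{
    pthread_t p, c;
    sem_init( &empty, 0, BUFFER_SIZE );
    sem_init( &full, 0, 0 );
    pthread_create( &p, NULL, producer, NULL );
    pthread_create( &c, NULL, consumer, NULL );
    pthread_join( p, NULL );
    pthread_join( c, NULL );
    return 0;
}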

2.8.2 The Readers-Writers Problem


• In the readers-writers problem there are some processes ( termed readers ) who only read
the shared data, and never change it, and there are other processes ( termed writers ) who
may change the data in addition to or instead of reading it. There is no limit to how many
readers can access the data simultaneously, but when a writer accesses the data, it needs
exclusive access.
• There are several variations to the readers-writers problem, most centered around relative
priorities of readers versus writers.
o The first readers-writers problem gives priority to readers. In this problem, if a
reader wants access to the data, and there is not already a writer accessing it, then
access is granted to the reader. A solution to this problem can lead to starvation of
the writers, as there could always be more readers coming along to access the data.
( A steady stream of readers will jump ahead of waiting writers as long as there is
currently already another reader accessing the data, because the writer is forced to
wait until the data is idle, which may never happen if there are enough readers. )
o The second readers-writers problem gives priority to the writers. In this problem,
when a writer wants access to the data it jumps to the head of the queue - All waiting
readers are blocked, and the writer gets access to the data as soon as it becomes
available. In this solution the readers may be starved by a steady stream of writers.
• The following code is an example of the first readers-writers problem, and involves an
important counter and two binary semaphores:
o readcount is used by the reader processes, to count the number of readers currently
accessing the data.
o mutex is a semaphore used only by the readers for controlled access to readcount.
o rw_mutex is a semaphore used to block and release the writers. The first reader to
access the data will set this lock and the last reader to exit will release it; The
remaining readers do not touch rw_mutex. ( Eighth edition called this variable wrt.
)
o Note that the first reader to come along will block on rw_mutex if there is currently
a writer accessing the data, and that all following readers will only block
on mutex for their turn to increment readcount.
• Some hardware implementations provide specific reader-writer locks, which are accessed
using an argument specifying whether access is requested for reading or writing. The use
of reader-writer locks is beneficial for situations in which: (1) processes can be easily
identified as either readers or writers, and (2) there are significantly more readers than
writers, making the additional overhead of the reader-writer lock pay off in terms of
increased concurrency of the readers.
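The code referred to above is not included in these notes; the following is a compact, runnable sketch of the first readers-writers solution using POSIX semaphores and pthreads ( the thread setup in main( ) and the printf calls standing in for the actual reading and writing are assumptions of this sketch ):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

/* First readers-writers problem: readers have priority. */
sem_t rw_mutex;                 /* locked by writers, and by the first/last reader */
sem_t mutex;                    /* protects read_count                             */
int read_count = 0;

void *writer( void *arg )
{
    sem_wait( &rw_mutex );      /* exclusive access to the shared data */
    printf( "writer %ld writing\n", (long) arg );
    sem_post( &rw_mutex );
    return NULL;
}

void *reader( void *arg )
{
    sem_wait( &mutex );
    read_count++;
    if( read_count == 1 )       /* first reader locks out the writers */
        sem_wait( &rw_mutex );
    sem_post( &mutex );

    printf( "reader %ld reading\n", (long) arg );

    sem_wait( &mutex );
    read_count--;
    if( read_count == 0 )       /* last reader lets the writers back in */
        sem_post( &rw_mutex );
    sem_post( &mutex );
    return NULL;
}

int main( void )
{
    pthread_t t[ 4 ];
    sem_init( &rw_mutex, 0, 1 );
    sem_init( &mutex, 0, 1 );
    pthread_create( &t[ 0 ], NULL, reader, (void *) 1L );
    pthread_create( &t[ 1 ], NULL, writer, (void *) 1L );
    pthread_create( &t[ 2 ], NULL, reader, (void *) 2L );
    pthread_create( &t[ 3 ], NULL, writer, (void *) 2L );
    for( int i = 0; i < 4; i++ )
        pthread_join( t[ i ], NULL );
    return 0;
}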

2.8.3 The Dining-Philosophers Problem

• The dining philosophers problem is a classic synchronization problem involving the allocation of limited resources amongst a group of processes in a deadlock-free and starvation-free manner:
o Consider five philosophers sitting around a table, in which there are five chopsticks
evenly distributed and an endless bowl of rice in the center, as shown in the diagram
below. ( There is exactly one chopstick between each pair of dining philosophers.
)
o These philosophers spend their lives alternating between two activities: eating and
thinking.
o When it is time for a philosopher to eat, it must first acquire two chopsticks - one
from their left and one from their right.
o When a philosopher thinks, it puts down both chopsticks in their original locations.

Figure 5.13 - The situation of the dining philosophers

• One possible solution, as shown in the following code section, is to use a set of five semaphores ( chopsticks[ 5 ] ), and to have each hungry philosopher first wait on their left chopstick ( chopsticks[ i ] ), and then wait on their right chopstick ( chopsticks[ ( i + 1 ) % 5 ] ).
• But suppose that all five philosophers get hungry at the same time, and each starts by
picking up their left chopstick. They then look for their right chopstick, but because it is
unavailable, they wait for it, forever, and eventually all the philosophers starve due to the
resulting deadlock.
Figure 5.14 - The structure of philosopher i.
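Since the figure body is not reproduced here, the structure of philosopher i in the naive semaphore solution looks roughly like this ( a schematic C fragment in the textbook's wait( )/signal( ) style; the semaphore type, eat( ) and think( ) are placeholders, and this naive version can deadlock, as noted above ):

/* Schematic structure of philosopher i ( as in Figure 5.14 ).
   chopsticks[ ] is an array of five semaphores, each initialized to 1. */

semaphore chopsticks[ 5 ];

void philosopher( int i )
{
    while( 1 ) {
        wait( &chopsticks[ i ] );                 /* pick up left chopstick  */
        wait( &chopsticks[ ( i + 1 ) % 5 ] );     /* pick up right chopstick */

        eat( );

        signal( &chopsticks[ i ] );               /* put down left           */
        signal( &chopsticks[ ( i + 1 ) % 5 ] );   /* put down right          */

        think( );
    }
}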

• Some potential solutions to the problem include:


o Only allow four philosophers to dine at the same time. ( Limited simultaneous
processes. )
o Allow philosophers to pick up chopsticks only when both are available, in a critical
section. ( All or nothing allocation of critical resources. )
o Use an asymmetric solution, in which odd philosophers pick up their left chopstick
first and even philosophers pick up their right chopstick first. ( Will this solution
always work? What if there are an even number of philosophers? )
• Note carefully that a deadlock-free solution to the dining philosophers problem does not necessarily guarantee a starvation-free one. ( While some or even most of the philosophers may be able to get on with their normal lives of eating and thinking, there may be one unlucky soul who never seems to be able to get both chopsticks at the same time. )

Message-Passing Systems

• Message passing systems must support at a minimum system calls for "send
message" and "receive message".
• A communication link must be established between the cooperating processes before
messages can be sent.
• There are three key issues to be resolved in message passing systems as further
explored in the next three subsections:
o Direct or indirect communication ( naming )
o Synchronous or asynchronous communication
o Automatic or explicit buffering.
Memory Management
The term Memory can be defined as a collection of data in a specific format. It is used to store
instructions and process data.

3.1 What is Main Memory:

The main memory is central to the operation of a modern computer. Main Memory is a large
array of words or bytes, ranging in size from hundreds of thousands to billions. Main memory is
a repository of rapidly available information shared by the CPU and I/O devices. Main memory
is the place where programs and information are kept when the processor is effectively utilizing
them. Main memory is associated with the processor, so moving instructions and information
into and out of the processor is extremely fast. Main memory is also known as RAM ( Random Access Memory ). This memory is volatile: RAM loses its data when a power interruption occurs.

3.1.1 What is Memory Management :

In a multiprogramming computer, the operating system resides in a part of memory and the rest
is used by multiple processes. The task of subdividing the memory among different processes is
called memory management. Memory management is a method in the operating system to
manage operations between main memory and disk during process execution. The main aim of
memory management is to achieve efficient utilization of memory.

3.1.2 Why Memory Management is required:

• Allocate and de-allocate memory before and after process execution.


• To keep track of the memory space used by processes.
• To minimize fragmentation issues.
• To ensure proper utilization of main memory.
• To maintain data integrity while a process is executing.

3.2 Swapping :

When a process is executed it must reside in main memory. Swapping is the act of temporarily moving a process out of main memory into secondary storage ( the backing store ) and later bringing it back into main memory for continued execution; main memory is fast compared to secondary storage. Swapping allows more processes to be run than could otherwise fit into memory at one time. The major cost of swapping is transfer time, and the total transfer time is directly proportional to the amount of memory swapped. Swapping is also known as roll out, roll in: if a higher priority process arrives and wants service, the memory manager can swap out a lower priority process and then load and execute the higher priority process. After the higher priority work finishes, the lower priority process is swapped back into memory and continues its execution.

3.2.1 Memory management with uniprogramming(without swapping):

This is the simplest memory management approach: the memory is divided into two sections:
• one part for the operating system
• one part for the user program
3.3 Contiguous Memory Allocation :
The main memory must accommodate both the operating system and the various user processes. Therefore, the allocation of memory becomes an important task in the operating system. The memory is usually divided into two partitions: one for the resident operating system and one for the user processes. We normally need several user processes to reside in memory simultaneously. Therefore, we need to consider how to allocate available memory to the processes that are in the input queue waiting to be brought into memory. In contiguous memory allocation, each process is contained in a single contiguous section of memory.

Contiguous memory allocation

3.3.1 Memory allocation:

To achieve proper memory utilization, memory must be allocated in an efficient manner. One of the simplest methods is to divide memory into several fixed-sized partitions, where each partition contains exactly one process; the degree of multiprogramming is thus bounded by the number of partitions.
Fixed ( multiple ) partition allocation: In this method, a process is selected from the input queue and loaded into a free partition. When the process terminates, the partition becomes available for other processes.
Variable partition allocation: In this method, the operating system maintains a table that indicates which parts of memory are available and which are occupied by processes. Initially, all memory is available for user processes and is considered one large block of available memory, known as a "hole". When a process arrives and needs memory, we search for a hole that is large enough to hold it. If one is found, we allocate only as much memory as is needed, keeping the rest available to satisfy future requests. This raises the dynamic storage allocation problem, which concerns how to satisfy a request of size n from a list of free holes. There are several solutions to this problem:
First fit:-
In the first fit, the first hole that is large enough to satisfy the request is allocated to the process. Here, in this diagram, the 40 KB memory block is the first available free hole that can store process A ( size 25 KB ), because the first two blocks do not have sufficient memory space.
Best fit:-
In the best fit, we allocate the smallest hole that is big enough for the process's requirements. For this, we must search the entire list, unless the list is ordered by size.

Here in this example, we first traverse the complete list and find that the last hole, of 25 KB, is the best suited hole for process A ( size 25 KB ). In this method memory utilization is maximum as compared to the other memory allocation techniques.
Worst fit:- In the worst fit, we allocate the largest available hole to the process. This method produces the largest leftover hole.

Here in this example, process A ( size 25 KB ) is allocated to the largest available memory block, which is 60 KB. Inefficient memory utilization is the major issue with the worst fit.
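A tiny sketch of how the three placement policies choose a hole from a free list ( the array-of-holes representation, the hole sizes, and the function name are assumptions of this sketch, chosen to match the description above ):

#include <stdio.h>

#define FIRST_FIT 0
#define BEST_FIT  1
#define WORST_FIT 2

/* Return the index of the hole chosen for a request of 'size' KB,
   or -1 if no hole is large enough. */
int choose_hole( const int hole[ ], int n, int size, int policy )
{
    int chosen = -1;
    for( int i = 0; i < n; i++ ) {
        if( hole[ i ] < size )
            continue;                                       /* too small to hold the process */
        if( policy == FIRST_FIT )
            return i;                                       /* first hole that fits          */
        if( chosen == -1 ||
            ( policy == BEST_FIT  && hole[ i ] < hole[ chosen ] ) ||
            ( policy == WORST_FIT && hole[ i ] > hole[ chosen ] ) )
            chosen = i;
    }
    return chosen;
}

int main( void )
{
    int hole[ ] = { 10, 20, 40, 60, 25 };   /* free holes in KB ( assumed layout ) */
    int request = 25;                       /* process A */
    printf( "first fit -> hole %d\n", choose_hole( hole, 5, request, FIRST_FIT ) );   /* the 40 KB hole */
    printf( "best fit  -> hole %d\n", choose_hole( hole, 5, request, BEST_FIT ) );    /* the 25 KB hole */
    printf( "worst fit -> hole %d\n", choose_hole( hole, 5, request, WORST_FIT ) );   /* the 60 KB hole */
    return 0;
}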
3.3.2 Fragmentation

All the memory allocation strategies suffer from external fragmentation, though first and
best fits experience the problems more so than worst fit. External fragmentation means that
the available memory is broken up into lots of little pieces, none of which is big enough to
satisfy the next memory requirement, although the sum total could.

The amount of memory lost to fragmentation may vary with algorithm, usage patterns, and
some design decisions such as which end of a hole to allocate and which end to save on the
free list.

Statistical analysis of first fit, for example, shows that for N blocks of allocated memory,
another 0.5 N will be lost to fragmentation.

Internal fragmentation also occurs, with all memory allocation strategies. This is caused by
the fact that memory is allocated in blocks of a fixed size, whereas the actual memory needed
will rarely be that exact size. For a random distribution of memory requests, on the average
1/2 block will be wasted per memory request, because on the average the last allocated block
will be only half full.
Note that the same effect happens with hard drives, and that modern hardware gives
us increasingly larger drives and memory at the expense of ever larger block sizes,
which translates to more memory lost to internal fragmentation.

Some systems use variable size blocks to minimize losses due to internal
fragmentation.

If the programs in memory are relocatable, ( using execution-time address binding ), then the
external fragmentation problem can be reduced via compaction, i.e. moving all processes
down to one end of physical memory. This only involves updating the relocation register for
each process, as all internal work is done using logical addresses.

Another solution as we will see in upcoming sections is to allow processes to use non-
contiguous blocks of physical memory, with a separate relocation register for each block.

3.4 Paging

• Paging is a memory management scheme that allows a process's physical memory to be non-contiguous, and which eliminates problems with external fragmentation by allocating memory in equal sized blocks known as pages.
• Paging eliminates most of the problems of the other methods discussed previously, and is
the predominant memory management technique used today.

3.4.1 Basic Method

• The basic idea behind paging is to divide physical memory into a number of equal sized blocks called frames, and to divide a program's logical memory space into blocks of the same size called pages.
• Any page ( from any process ) can be placed into any available frame.
• The page table is used to look up what frame a particular page is stored in at the moment.
In the following example, for instance, page 2 of the program's logical memory is currently
stored in frame 3 of physical memory:
- Paging hardware

Paging model of logical and physical memory


• A logical address consists of two parts: A page number in which the address resides, and
an offset from the beginning of that page. ( The number of bits in the page number limits
how many pages a single process can address. The number of bits in the offset determines
the maximum size of each page, and should correspond to the system frame size. )
• The page table maps the page number to a frame number, to yield a physical address which
also has two parts: The frame number and the offset within that frame. The number of bits
in the frame number determines how many frames the system can address, and the number
of bits in the offset determines the size of each frame.
• Page numbers, frame numbers, and frame sizes are determined by the architecture, but are typically powers of two, allowing addresses to be split at a certain number of bits. For example, if the logical address space is 2^m bytes and the page size is 2^n bytes, then the high-order m-n bits of a logical address designate the page number and the remaining n bits represent the offset.
• Note also that the number of bits in the page number and the number of bits in the frame
number do not have to be identical. The former determines the address range of the logical
address space, and the latter relates to the physical address space.

• ( DOS used an addressing scheme with 16-bit segment numbers and 16-bit offsets, on hardware that only supported 20-bit physical addresses. The result was a resolution of starting segment addresses finer than the size of a single segment, and multiple segment-offset combinations that mapped to the same physical hardware address. )
• Consider the following micro example, in which a process has 16 bytes of logical memory,
mapped in 4 byte pages into 32 bytes of physical memory. ( Presumably some other
processes would be consuming the remaining 16 bytes of physical memory. )
Paging example for a 32-byte memory with 4-byte pages
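The split of a logical address into page number and offset is just a shift and a mask; a minimal sketch for the 32-byte memory / 4-byte page example above ( the small page table contents are an assumption chosen for illustration ):

#include <stdio.h>

#define PAGE_BITS 2                          /* 4-byte pages => 2 offset bits */
#define PAGE_SIZE ( 1 << PAGE_BITS )

/* A hypothetical page table for a 16-byte logical space ( 4 pages ),
   mapping each page to a frame of the 32-byte physical memory ( 8 frames ). */
int page_table[ 4 ] = { 5, 6, 1, 2 };

int translate( int logical )
{
    int page   = logical >> PAGE_BITS;           /* high-order bits = page number */
    int offset = logical & ( PAGE_SIZE - 1 );    /* low-order bits  = offset      */
    int frame  = page_table[ page ];
    return frame * PAGE_SIZE + offset;           /* physical address              */
}

int main( void )
{
    for( int addr = 0; addr < 16; addr++ )
        printf( "logical %2d -> physical %2d\n", addr, translate( addr ) );
    return 0;
}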

• Note that paging is like having a table of relocation registers, one for each page of the
logical memory.
• There is no external fragmentation with paging. All blocks of physical memory are used,
and there are no gaps in between and no problems with finding the right sized hole for a
particular chunk of memory.
• There is, however, internal fragmentation. Memory is allocated in chunks the size of a
page, and on the average, the last page will only be half full, wasting on the average half a
page of memory per process. ( Possibly more, if processes keep their code and data in
separate pages. )
• Larger page sizes waste more memory, but are more efficient in terms of overhead. Modern
trends have been to increase page sizes, and some systems even have multiple size pages
to try and make the best of both worlds.
• Page table entries ( frame numbers ) are typically 32 bit numbers, allowing access to 2^32
physical page frames. If those frames are 4 KB in size each, that translates to 16 TB of
addressable physical memory. ( 32 + 12 = 44 bits of physical address space. )
• When a process requests memory ( e.g. when its code is loaded in from disk ), free frames
are allocated from a free-frame list, and inserted into that process's page table.
• Processes are blocked from accessing anyone else's memory because all of their memory
requests are mapped through their page table. There is no way for them to generate an
address that maps into any other process's memory space.
• The operating system must keep track of each individual process's page table, updating it
whenever the process's pages get moved in and out of memory, and applying the correct
page table when processing system calls for a particular process. This all increases the
overhead involved when swapping processes in and out of the CPU. ( The currently active
page table must be updated to reflect the process that is currently running. )

Free frames (a) before allocation and (b) after allocation

3.4.2 Shared Pages


• Paging systems can make it very easy to share blocks of memory, by simply mapping the same frame number into multiple page tables ( or multiple entries of the same page table ). This may be done with either code or data.
• If code is reentrant, that means that it does not write to or change the code in any way ( it
is non self-modifying ), and it is therefore safe to re-enter it. More importantly, it means
the code can be shared by multiple processes, so long as each has their own copy of the
data and registers, including the instruction register.
• In the example given below, three different users are running the editor simultaneously,
but the code is only loaded into memory ( in the page frames ) one time.
• Some systems also implement shared memory in this fashion.

Sharing of code in a paging environment

3.4.3 Structure of the Page Table


3.4.3.1 Hierarchical Paging

• Most modern computer systems support logical address spaces of 2^32 to 2^64.
• With a 2^32 address space and 4K ( 2^12 ) pages, this leaves 2^20 entries in the page table. At 4 bytes per entry, this amounts to a 4 MB page table, which is too large to reasonably keep in contiguous memory ( and to swap in and out of memory with each process switch. ) Note that with 4K pages, this would take 1024 pages just to hold the page table!
• One option is to use a two-tier paging system, i.e. to page the page table.
• For example, the 20 bits described above could be broken down into two 10-bit page
numbers. The first identifies an entry in the outer page table, which identifies where in
memory to find one page of an inner page table. The second 10 bits finds a specific entry
in that inner page table, which in turn identifies a particular frame in physical memory.
( The remaining 12 bits of the 32 bit logical address are the offset within the 4K frame. )

A two-level page-table scheme


Address translation for a two-level 32-bit paging architecture
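The 10/10/12 split described above can be expressed directly with shifts and masks; a minimal sketch ( the macro names and sample address are assumptions of this sketch ):

#include <stdio.h>
#include <stdint.h>

/* 32-bit logical address, 4 KB pages, two-level page table:
   | p1 ( 10 bits ) | p2 ( 10 bits ) | offset ( 12 bits ) | */
#define OFFSET_BITS 12
#define INDEX_BITS  10
#define INDEX_MASK  ( ( 1u << INDEX_BITS ) - 1 )
#define OFFSET_MASK ( ( 1u << OFFSET_BITS ) - 1 )

int main( void )
{
    uint32_t logical = 0x12345678;

    uint32_t p1     = logical >> ( OFFSET_BITS + INDEX_BITS );   /* outer page-table index       */
    uint32_t p2     = ( logical >> OFFSET_BITS ) & INDEX_MASK;   /* inner page-table index       */
    uint32_t offset = logical & OFFSET_MASK;                     /* offset within the 4 KB frame */

    printf( "p1 = %u, p2 = %u, offset = %u\n", p1, p2, offset );
    return 0;
}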

• VAX Architecture divides 32-bit addresses into 4 equal sized sections, and each page is
512 bytes, yielding an address form of:

• With a 64-bit logical address space and 4K pages, there are 52 bits worth of page numbers,
which is still too many even for two-level paging. One could increase the paging level, but
with 10-bit page tables it would take 7 levels of indirection, which would be prohibitively
slow memory access. So some other approach must be used.

64-bits Two-tiered leaves 42 bits in outer table

Going to a fourth level still leaves 32 bits in the outer table.

3.4.3.2 Hashed Page Tables

• One common data structure for accessing data that is sparsely distributed over a broad
range of possible values is with hash tables. Figure 8.16 below illustrates a hashed page
table using chain-and-bucket hashing:
Hashed page table

3.4.3.3 Inverted Page Tables

• Another approach is to use an inverted page table. Instead of a table listing all of the pages
for a particular process, an inverted page table lists all of the pages currently loaded in
memory, for all processes. ( I.e. there is one entry per frame instead of one entry per page. )
• Access to an inverted page table can be slow, as it may be necessary to search the entire table in order to find the desired page ( or to discover that it is not there. ) Hashing the table can help speed up the search process.
• Inverted page tables prohibit the normal method of implementing shared memory, which
is to map multiple logical pages to a common physical frame. ( Because each frame is now
mapped to one and only one process. )
Inverted page table

3.5 Segmentation

3.5.1 Basic Method

• Most users ( programmers ) do not think of their programs as existing in one continuous
linear address space.
• Rather they tend to think of their memory in multiple segments, each dedicated to a
particular use, such as code, data, the stack, the heap, etc.
• Memory segmentation supports this view by providing addresses with a segment number
( mapped to a segment base address ) and an offset from the beginning of that segment.
• For example, a C compiler might generate 5 segments for the user code, library code, global
( static ) variables, the stack, and the heap, as shown in Figure
Programmer's view of a program.

Figure - Example of segmentation


Virtual Memory
3.0 Demand Paging

• The basic idea behind demand paging is that when a process is swapped in, its
pages are not swapped in all at once. Rather they are swapped in only when
the process needs them. ( on demand. ) This is termed a lazy swapper,
although a pager is a more accurate term.

Figure 3.0- Transfer of a paged memory to contiguous disk space

3.1 Basic Concepts

• The basic idea behind demand paging is that when a process is swapped in, the pager only loads into memory those pages that it expects the process to need ( right away. )
• Pages that are not loaded into memory are marked as invalid in the page
table, using the invalid bit. ( The rest of the page table entry may either be
blank or contain information about where to find the swapped-out page on
the hard drive. )
• If the process only ever accesses pages that are loaded in memory ( memory
resident pages ), then the process runs exactly as if all the pages were loaded
in to memory.

Figure 3.1 - Page table when some pages are not in main memory.

• On the other hand, if a page is needed that was not originally loaded up, then
a page fault trap is generated, which must be handled in a series of steps:
1. The memory address requested is first checked, to make sure it was a
valid memory request.
2. If the reference was invalid, the process is terminated. Otherwise, the
page must be paged in.
3. A free frame is located, possibly from a free-frame list.
4. A disk operation is scheduled to bring in the necessary page from disk. (
This will usually block the process on an I/O wait, allowing some other
process to use the CPU in the meantime. )
5. When the I/O operation is complete, the process's page table is
updated with the new frame number, and the invalid bit is changed to
indicate that this is now a valid page reference.
6. The instruction that caused the page fault must now be restarted from
the beginning, ( as soon as this process gets another turn on the CPU. )

Figure 3.2 - Steps in handling a page fault

• In an extreme case, NO pages are swapped in for a process until they are
requested by page faults. This is known as pure demand paging.
• In theory each instruction could generate multiple page faults
• The hardware necessary to support virtual memory is the same as for paging
and swapping: A page table and secondary memory.
• A crucial part of the process is that the instruction must be restarted from
scratch once the desired page has been made available in memory. For most
simple instructions this is not a major difficulty. However there are some
architectures that allow a single instruction to modify a fairly large block of
data, ( which may span a page boundary ), and if some of the data gets
modified before the page fault occurs, this could cause problems. One
solution is to access both ends of the block before executing the instruction,
guaranteeing that the necessary pages get paged in before the instruction
begins.

3.1.1 Performance of Demand Paging

• Obviously there is some slowdown and performance hit whenever a page fault occurs and the system has to go get the page from disk, but just how big a hit is it exactly? ( A worked estimate of the effective access time is sketched after this list. )

• A subtlety is that swap space is faster to access than the regular file system,
because it does not have to go through the whole directory structure. For this
reason some systems will transfer an entire process from the file system to
swap space before starting up the process, so that future paging all occurs
from the ( relatively ) faster swap space.
• Some systems use demand paging directly from the file system for binary code
( which never changes, and hence never has to be written back to disk when paged
out; it can simply be re-read from the original file ), and reserve swap space for
pages that are modified, such as the stack and data segments. This approach is used
by both Solaris and BSD Unix.

3.2 Page Replacement

• In order to make the most use of virtual memory, we load several processes
into memory at the same time. Since we only load the pages that are actually
needed by each process at any given time, there is room to load many more
processes than if we had to load in the entire process.
• However memory is also needed for other purposes ( such as I/O buffering ),
and what happens if some process suddenly decides it needs more pages and
there aren't any free frames available? There are several possible solutions to
consider:
1. Adjust the memory used by I/O buffering, etc., to free up some frames
for user processes. The decision of how to allocate memory for I/O
versus user processes is a complex one, yielding different policies on
different systems. ( Some allocate a fixed amount for I/O, and others let
the I/O system contend for memory along with everything else. )
2. Put the process requesting more pages into a wait queue until some
free frames become available.
3. Swap some process out of memory completely, freeing up its page
frames.
4. Find some page in memory that isn't being used right now, and swap
that page only out to disk, freeing up a frame that can be allocated to
the process requesting it. This is known as page replacement, and is the
most common solution. There are many different algorithms for page
replacement, which is the subject of the remainder of this section.

Figure 3.4 - Need for page replacement.

3.2.1 Basic Page Replacement

• The previously discussed page-fault processing assumed that there would be
free frames available on the free-frame list. Now the page-fault handling must
be modified to free up a frame if necessary, as follows:
1. Find the location of the desired page on the disk, either in swap space
or in the file system.
2. Find a free frame:
a. If there is a free frame, use it.
b. If there is no free frame, use a page-replacement algorithm to
select an existing frame to be replaced, known as the victim
frame.
c. Write the victim frame to disk. Change all related page tables to
indicate that this page is no longer in memory.
3. Read in the desired page and store it in the frame. Adjust all related
page and frame tables to indicate the change.
4. Restart the process that was waiting for this page.
Figure 3.5 - Page replacement.

• Note that step 3c adds an extra disk write to the page-fault handling,
effectively doubling the time required to process a page fault. This can be
alleviated somewhat by assigning a modify bit, or dirty bit to each page,
indicating whether or not it has been changed since it was last loaded in from
disk. If the dirty bit has not been set, then the page is unchanged, and does
not need to be written out to disk. Otherwise the page write is required. It
should come as no surprise that many page replacement strategies specifically
look for pages that do not have their dirty bit set, and preferentially select
clean pages as victim pages. It should also be obvious that unmodifiable code
pages never get their dirty bits set.
• There are two major requirements to implement a successful demand paging
system. We must develop a frame-allocation algorithm and a page-
replacement algorithm. The former centers around how many frames are
allocated to each process ( and to other needs ), and the latter deals with how
to select a page for replacement when there are no free frames available.
• The overall goal in selecting and tuning these algorithms is to generate the
fewest number of overall page faults. Because disk access is so slow relative to
memory access, even slight improvements to these algorithms can yield large
improvements in overall system performance.
• Algorithms are evaluated using a given string of memory accesses known as
a reference string, which can be generated in one of ( at least ) three common
ways:
1. Randomly generated, either evenly distributed or with some
distribution curve based on observed system behavior. This is the
fastest and easiest approach, but may not reflect real performance well,
as it ignores locality of reference.
2. Specifically designed sequences. These are useful for illustrating the
properties of comparative algorithms in published papers and
textbooks, ( and also for homework and exam problems. :-) )
3. Recorded memory references from a live system. This may be the best
approach, but the amount of data collected can be enormous, on the
order of a million addresses per second. The volume of collected data
can be reduced by making two important observations:
1. Only the page number that was accessed is relevant. The offset
within that page does not affect paging operations.
2. Successive accesses within the same page can be treated as a
single page request, because all requests after the first are
guaranteed to be page hits

3.2.2 FIFO Page Replacement

• A simple and obvious page replacement strategy is FIFO, i.e. first-in-first-out.


• As new pages are brought in, they are added to the tail of a queue, and the
page at the head of the queue is the next victim. In the following example, 20
page requests result in 15 page faults:

Figure 3.7 - FIFO page-replacement algorithm.

• Although FIFO is simple and easy, it is not always optimal, or even efficient.
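
The FIFO bookkeeping is simple enough to sketch directly. The following is an illustrative simulation only ( the function name and the 16-frame limit are arbitrary choices ); run on the 20-entry reference string commonly used in textbooks for this comparison, it reports the same 15 faults quoted above for 3 frames.

#include <stdio.h>

/* Count page faults for a reference string under FIFO replacement,
 * starting with 'frames' empty frames ( frames <= 16 assumed ). */
static int fifo_faults(const int *refs, int n, int frames)
{
    int mem[16];                      /* resident pages, kept as a circular FIFO queue */
    int head = 0, used = 0, faults = 0;

    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (mem[j] == refs[i]) { hit = 1; break; }
        if (hit) continue;            /* page already resident */

        faults++;
        if (used < frames) {
            mem[used++] = refs[i];    /* a free frame is still available */
        } else {
            mem[head] = refs[i];      /* victim = oldest resident page */
            head = (head + 1) % frames;
        }
    }
    return faults;
}

int main(void)
{
    int refs[] = { 7,0,1,2,0,3,0,4,2,3,0,3,2,1,2,0,1,7,0,1 };
    printf("FIFO faults with 3 frames: %d\n", fifo_faults(refs, 20, 3));  /* prints 15 */
    return 0;
}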

3.2.3 Optimal Page Replacement

• An optimal page-replacement algorithm is one that yields the lowest possible
page-fault rate of all algorithms.
• Such an algorithm does exist, and is called OPT or MIN. This algorithm is
simply "Replace the page that will not be used for the longest time in the
future."
• Unfortunately OPT cannot be implemented in practice, because it requires
foretelling the future, but it makes a nice benchmark for the comparison and
evaluation of real proposed new algorithms.
• In practice most page-replacement algorithms try to approximate OPT by
predicting (estimating) in one fashion or another what page will not be used
for the longest period of time. The basis of FIFO is the prediction that the page
that was brought in the longest time ago is the one that will not be needed
again for the longest future time, but as we shall see, there are many other
prediction methods, all striving to match the performance of OPT.

Figure 3.8 - Optimal page-replacement algorithm

3.2.4 LRU Page Replacement

• The prediction behind LRU, the Least Recently Used, algorithm is that the
page that has not been used in the longest time is the one that will not be
used again in the near future. ( Note the distinction between FIFO and LRU:
The former looks at the oldest load time, and the latter looks at the
oldest use time. )
• Some view LRU as analogous to OPT, except looking backwards in time instead
of forwards. ( OPT has the interesting property that for any reference string S
and its reverse R, OPT will generate the same number of page faults for S and
for R. It turns out that LRU has this same property. )
• Figure 3.9 illustrates LRU for our sample string, yielding 12 page faults, ( as
compared to 15 for FIFO and 9 for OPT. )
Figure 3.9 - LRU page-replacement algorithm.

• LRU is considered a good replacement policy, and is often used. The problem
is how exactly to implement it. There are two simple approaches commonly
used:
1. Counters. Every memory access increments a counter, and the current
value of this counter is stored in the page table entry for that page.
Then finding the LRU page involves simply searching the table for the
page with the smallest counter value. Note that overflow of the
counter must be considered.
2. Stack. Another approach is to use a stack, and whenever a page is
accessed, pull that page from the middle of the stack and place it on the
top. The LRU page will always be at the bottom of the stack. Because
this requires removing objects from the middle of the stack, a doubly
linked list is the recommended data structure.
• Note that both implementations of LRU require hardware support, either for
incrementing the counter or for managing the stack, as these operations must
be performed for every memory access.
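
The counter ( timestamp ) approach can be sketched as a user-space simulation, keeping in mind that a real implementation needs hardware support as noted above. The names and sizes below are arbitrary; on the same textbook reference string used earlier it reports the 12 faults quoted above.

#include <stdio.h>

/* Count page faults under LRU replacement using per-page timestamps
 * ( frames <= 16 assumed; illustrative only ). */
static int lru_faults(const int *refs, int n, int frames)
{
    int page[16], stamp[16];          /* resident page numbers and their last-use times */
    int used = 0, faults = 0;

    for (int t = 0; t < n; t++) {
        int slot = -1;
        for (int j = 0; j < used; j++)
            if (page[j] == refs[t]) { slot = j; break; }

        if (slot < 0) {                              /* page fault */
            faults++;
            if (used < frames) {
                slot = used++;                       /* free frame available */
            } else {
                slot = 0;                            /* victim = smallest timestamp */
                for (int j = 1; j < used; j++)
                    if (stamp[j] < stamp[slot]) slot = j;
            }
            page[slot] = refs[t];
        }
        stamp[slot] = t;                             /* record the time of use */
    }
    return faults;
}

int main(void)
{
    int refs[] = { 7,0,1,2,0,3,0,4,2,3,0,3,2,1,2,0,1,7,0,1 };
    printf("LRU faults with 3 frames: %d\n", lru_faults(refs, 20, 3));   /* prints 12 */
    return 0;
}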

3.3 Thrashing

• If a process cannot maintain its minimum required number of frames, then it
must be swapped out, freeing up frames for other processes. This is the
intermediate ( medium-term ) level of CPU scheduling.
• But what about a process that can keep its minimum, but cannot keep all of
the frames that it is currently using on a regular basis? In this case it is forced
to page out pages that it will need again in the very near future, leading to
large numbers of page faults.
• A process that is spending more time paging than executing is said to
be thrashing.

3.3.1 Cause of Thrashing


• Early process scheduling schemes would control the level of
multiprogramming allowed based on CPU utilization, adding in more
processes when CPU utilization was low.
• The problem is that when memory filled up and processes started spending
lots of time waiting for their pages to page in, then CPU utilization would
drop, causing the scheduler to add in even more processes and exacerbating
the problem! Eventually the system would essentially grind to a halt.
• Local page replacement policies can prevent one thrashing process from
taking pages away from other processes, but it still tends to clog up the I/O
queue, thereby slowing down any other process that needs to do even a little
bit of paging ( or any other I/O for that matter. )

Figure 3.10 - Thrashing

• To prevent thrashing we must provide processes with as many frames as they
really need "right now", but how do we know what that is?

3.3.2 Working-Set Model

• The working set model is based on the concept of locality, and defines
a working set window, of length delta. Whatever pages are included in the
most recent delta page references are said to be in the process's working-set
window, and comprise its current working set, as illustrated in Figure 3.11:
Figure 3.11 - Working-set model.

• The selection of delta is critical to the success of the working set model - If it is
too small then it does not encompass all of the pages of the current locality,
and if it is too large, then it encompasses pages that are no longer being
frequently accessed.
• The total demand, D, is the sum of the sizes of the working sets for all
processes. If D exceeds the total number of available frames, then at least one
process is thrashing, because there are not enough frames available to satisfy
its minimum working set. If D is significantly less than the currently available
frames, then additional processes can be launched.
• The hard part of the working-set model is keeping track of what pages are in
the current working set, since every reference adds one to the set and
removes one older page. An approximation can be made using reference bits
and a timer that goes off after a set interval of memory references.
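
The definition above translates almost directly into code. The sketch below computes the working-set size at each point of an arbitrary, made-up reference string ( the string, delta = 5, and the assumption that page numbers are below 64 are all illustrative choices, not values from this text ).

#include <stdio.h>

/* Working-set size at time t: the number of distinct pages referenced
 * in the window of the most recent 'delta' references. */
static int working_set_size(const int *refs, int t, int delta)
{
    int seen[64] = { 0 };             /* assumes page numbers 0..63 */
    int size = 0;
    int start = (t - delta + 1 > 0) ? t - delta + 1 : 0;
    for (int i = start; i <= t; i++)
        if (!seen[refs[i]]) { seen[refs[i]] = 1; size++; }
    return size;
}

int main(void)
{
    int refs[] = { 1,2,5,6,2,1,2,1,7,7,7,7,5,1,6,2,3,4 };
    int n = (int)(sizeof(refs) / sizeof(refs[0]));
    for (int t = 0; t < n; t++)
        printf("t=%2d page=%d working-set size (delta=5) = %d\n",
               t, refs[t], working_set_size(refs, t, 5));
    return 0;
}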

3.3.3 Page-Fault Frequency

• A more direct approach is to recognize that what we really want to control is
the page-fault rate, and to allocate frames based on this directly measurable
value. If the page-fault rate exceeds a certain upper bound then that process
needs more frames, and if it is below a given lower bound, then it can afford
to give up some of its frames to other processes.
• A page-replacement strategy could be devised that would select victim frames
based on the process with the lowest current page-fault frequency.
Figure 3.12 - Page-fault frequency.

• Note that there is a direct relationship between the page-fault rate and the
working-set, as a process moves from one locality to another.

Deadlocks

A process in an operating system uses resources in the following way:

1. Requests the resource
2. Uses the resource
3. Releases the resource
A deadlock is a situation where a set of processes are blocked because each process is holding a
resource and waiting for another resource acquired by some other process.
Consider an example when two trains are coming toward each other on the same track and there
is only one track, none of the trains can move once they are in front of each other. A similar
situation occurs in operating systems when there are two or more processes that hold some
resources and wait for resources held by other(s). For example, in the below diagram, Process 1
is holding Resource 1 and waiting for resource 2 which is acquired by process 2, and process 2
is waiting for resource 1.
Figure 1 - Deadlock
Examples Of Deadlock
1. The system has 2 tape drives. P1 and P2 each hold one tape drive and each needs another
one.
2. Semaphores A and B, each initialized to 1. Processes P0 and P1 become deadlocked as follows:
• P0 executes wait(A) and is then preempted.
• P1 executes wait(B).
• Now P0 blocks on wait(B) and P1 blocks on wait(A), so P0 and P1 are deadlocked.

P0: wait(A); wait(B);
P1: wait(B); wait(A);

4.1 Resources
• For the purposes of deadlock discussion, a system can be modeled as a collection of limited
resources, which can be partitioned into different categories, to be allocated to a number of
processes, each having different needs.
• Resource categories may include memory, printers, CPUs, open files, tape drives, CD-
ROMS, etc.
• By definition, all the resources within a category are equivalent, and a request of this
category can be equally satisfied by any one of the resources in that category. If this is not
the case ( i.e. if there is some difference between the resources within a category ), then
that category needs to be further divided into separate categories. For example, "printers"
may need to be separated into "laser printers" and "color inkjet printers".
• Some categories may have a single resource.
• In normal operation a process must request a resource before using it, and release it when
it is done, in the following sequence:
1. Request - If the request cannot be immediately granted, then the process must wait
until the resource(s) it needs become available. For example the system calls
open( ), malloc( ), new( ), and request( ).
2. Use - The process uses the resource, e.g. prints to the printer or reads from the file.
3. Release - The process relinquishes the resource, so that it becomes available for
other processes. For example, close( ), free( ), delete( ), and release( ).
• For all kernel-managed resources, the kernel keeps track of what resources are free and
which are allocated, to which process they are allocated, and a queue of processes waiting
for this resource to become available. Application-managed resources can be controlled
using mutexes or wait( ) and signal( ) calls, ( i.e. binary or counting semaphores. )
• A set of processes is deadlocked when every process in the set is waiting for a resource
that is currently allocated to another process in the set ( and which can only be released
when that other waiting process makes progress. )
4.2 Necessary Conditions
• There are four conditions that are necessary to achieve deadlock:
1. Mutual Exclusion - At least one resource must be held in a non-sharable
mode; If any other process requests this resource, then that process must
wait for the resource to be released.
2. Hold and Wait - A process must be simultaneously holding at least one
resource and waiting for at least one resource that is currently being held by
some other process.
3. No preemption - Once a process is holding a resource ( i.e. once its request
has been granted ), then that resource cannot be taken away from that
process until the process voluntarily releases it.
4. Circular Wait - A set of processes { P0, P1, P2, . . ., PN } must exist such
that every P[ i ] is waiting for P[ ( i + 1 ) % ( N + 1 ) ]. ( Note that this
condition implies the hold-and-wait condition, but it is easier to deal with
the conditions if the four are considered separately. )
4.3 Methods for Handling Deadlocks
• Generally speaking there are three ways of handling deadlocks:
1. Deadlock prevention or avoidance - Do not allow the system to get into a
deadlocked state.
2. Deadlock detection and recovery - Abort a process or preempt some resources when
deadlocks are detected.
3. Ignore the problem altogether - If deadlocks only occur once a year or so, it may
be better to simply let them happen and reboot as necessary than to incur the
constant overhead and system performance penalties associated with deadlock
prevention or detection. This is the approach that both Windows and UNIX take.
• In order to avoid deadlocks, the system must have additional information about all
processes. In particular, the system must know what resources a process will or may request
in the future. ( Ranging from a simple worst-case maximum to a complete resource request
and release plan for each process, depending on the particular algorithm. )
• Deadlock detection is fairly straightforward, but deadlock recovery requires either aborting
processes or preempting resources, neither of which is an attractive alternative.
• If deadlocks are neither prevented nor detected, then when a deadlock occurs the system
will gradually slow down, as more and more processes become stuck waiting for resources
currently held by the deadlock and by other waiting processes. Unfortunately this
slowdown can be indistinguishable from a general system slowdown when a real-time
process has heavy computing needs.
4.4 Deadlock Prevention
• Deadlocks can be prevented by preventing at least one of the four required conditions:
4.4.1 Mutual Exclusion
• Shared resources such as read-only files do not lead to deadlocks.
• Unfortunately some resources, such as printers and tape drives, require exclusive
access by a single process.
4.4.2 Hold and Wait
• To prevent this condition processes must be prevented from holding one or more
resources while simultaneously waiting for one or more others. There are several
possibilities for this:
o Require that all processes request all resources at one time. This can be
wasteful of system resources if a process needs one resource early in its
execution and doesn't need some other resource until much later.
o Require that processes holding resources must release them before
requesting new resources, and then re-acquire the released resources along
with the new ones in a single new request. This can be a problem if a process
has partially completed an operation using a resource and then fails to get it
re-allocated after releasing it.
o Either of the methods described above can lead to starvation if a process
requires one or more popular resources.
4.4.3 No Preemption
• Preemption of process resource allocations can prevent this condition of deadlocks,
when it is possible.
o One approach is that if a process is forced to wait when requesting a new
resource, then all other resources previously held by this process are
implicitly released, ( preempted ), forcing this process to re-acquire the old
resources along with the new resources in a single request, similar to the
previous discussion.
o Another approach is that when a resource is requested and not available,
then the system looks to see what other processes currently have those
resources and are themselves blocked waiting for some other resource. If
such a process is found, then some of their resources may get preempted
and added to the list of resources for which the process is waiting.
o Either of these approaches may be applicable for resources whose states are
easily saved and restored, such as registers and memory, but are generally
not applicable to other devices such as printers and tape drives.
4.4.4 Circular Wait
• One way to avoid circular wait is to number all resources, and to require that
processes request resources only in strictly increasing ( or decreasing ) order.
• In other words, in order to request resource Rj, a process must first release all Ri
such that i >= j.
• One big challenge in this scheme is determining the relative ordering of the
different resources
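
As a concrete ( and hedged ) illustration of resource ordering, the sketch below uses two POSIX mutexes as the "resources". Both threads always acquire lock1 before lock2, so the circular-wait condition can never arise; the lock names and the helper function are invented for this example.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;   /* resource R1 ( lower number ) */
static pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;   /* resource R2 ( higher number ) */

/* Every thread requests resources in strictly increasing order: R1, then R2. */
static void use_both_resources(const char *who)
{
    pthread_mutex_lock(&lock1);
    pthread_mutex_lock(&lock2);
    printf("%s holds R1 and R2\n", who);
    pthread_mutex_unlock(&lock2);
    pthread_mutex_unlock(&lock1);
}

static void *worker(void *arg)
{
    use_both_resources((const char *) arg);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, "thread A");
    pthread_create(&b, NULL, worker, "thread B");
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}

If one thread instead locked lock2 first while the other locked lock1 first, the hold-and-wait and circular-wait conditions could both be satisfied at once, which is exactly the two-semaphore deadlock shown at the start of this unit.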
4.5 Deadlock Avoidance
• The general idea behind deadlock avoidance is to never allow the system to enter a state
from which a deadlock could result: each resource request is examined, and is granted only
if the resulting allocation state is still safe.
• This requires more information about each process, AND tends to lead to low device
utilization. ( I.e. it is a conservative approach. )
• In some algorithms the scheduler only needs to know the maximum number of each
resource that a process might potentially use. In more complex algorithms the scheduler
can also take advantage of the schedule of exactly what resources may be needed in what
order.
• When a scheduler sees that starting a process or granting resource requests may lead to
future deadlocks, then that process is just not started or the request is not granted.
• A resource allocation state is defined by the number of available and allocated resources,
and the maximum requirements of all processes in the system.
4.5.1 Safe State
• A state is safe if the system can allocate all resources requested by all processes (
up to their stated maximums ) without entering a deadlock state.
• More formally, a state is safe if there exists a safe sequence of processes { P0, P1,
P2, ..., PN } such that all of the resource requests for Pi can be granted using the
resources currently allocated to Pi and all processes Pj where j < i. ( I.e. if all the
processes prior to Pi finish and free up their resources, then Pi will be able to finish
also, using the resources that they have freed up. )
• If a safe sequence does not exist, then the system is in an unsafe state,
which MAY lead to deadlock. ( All safe states are deadlock free, but not all unsafe
states lead to deadlocks. )

Figure 2 - Safe, unsafe, and deadlocked state spaces.

• For example, consider a system with 12 tape drives, allocated as follows. Is this a
safe state? What is the safe sequence?

          Maximum Needs    Current Allocation
P0             10                  5
P1              4                  2
P2              9                  2

• What happens to the above table if process P2 requests and is granted one more
tape drive?
• Key to the safe state approach is that when a request is made for resources, the
request is granted only if the resulting allocation state is a safe one.
4.5.2 Resource-Allocation Graph Algorithm
• If resource categories have only single instances of their resources, then deadlock
states can be detected by cycles in the resource-allocation graphs.
• In this case, unsafe states can be recognized and avoided by augmenting the
resource-allocation graph with claim edges, noted by dashed lines, which point
from a process to a resource that it may request in the future.
• In order for this technique to work, all claim edges must be added to the graph for
any particular process before that process is allowed to request any resources.
( Alternatively, processes may only make requests for resources for which they
have already established claim edges, and claim edges cannot be added to any
process that is currently holding resources. )
• When a process makes a request, the claim edge Pi->Rj is converted to a request
edge. Similarly, when the resource is later released, the assignment edge reverts back
to a claim edge.
• This approach works by denying requests that would produce cycles in the
resource-allocation graph, taking claim edges into effect.
• Consider for example what happens when process P2 requests resource R2:

Figure 3 - Resource allocation graph for deadlock avoidance


• The resulting resource-allocation graph would have a cycle in it, and so the request
cannot be granted.
Figure 4 - An unsafe state in a resource allocation graph
4.5.3 Banker's Algorithm
• For resource categories that contain more than one instance the resource-allocation
graph method does not work, and more complex ( and less efficient ) methods must
be chosen.
• The Banker's Algorithm gets its name because it is a method that bankers could use
to assure that when they lend out resources they will still be able to satisfy all their
clients. ( A banker won't loan out a little money to start building a house unless they
are assured that they will later be able to loan out the rest of the money to finish the
house. )
• When a process starts up, it must state in advance the maximum allocation of
resources it may request, up to the amount available on the system.
• When a request is made, the scheduler determines whether granting the request
would leave the system in a safe state. If not, then the process must wait until the
request can be granted safely.
• The banker's algorithm relies on several key data structures: ( where n is the number
of processes and m is the number of resource categories. )
o Available[ m ] indicates how many resources are currently available of each
type.
o Max[ n ][ m ] indicates the maximum demand of each process of each
resource.
o Allocation[ n ][ m ] indicates the number of each resource category
allocated to each process.
o Need[ n ][ m ] indicates the remaining resources needed of each type for
each process. ( Note that Need[ i ][ j ] = Max[ i ][ j ] - Allocation[ i ][ j ] for
all i, j. )
• For simplification of discussions, we make the following notations / observations:
o One row of the Need vector, Need[ i ], can be treated as a vector
corresponding to the needs of process i, and similarly for Allocation and
Max.
o A vector X is considered to be <= a vector Y if X[ i ] <= Y[ i ] for all i.
4.5.3.1 Safety Algorithm
• In order to apply the Banker's algorithm, we first need an algorithm for
determining whether or not a particular state is safe.
• This algorithm determines if the current state of a system is safe, according
to the following steps:
1. Let Work and Finish be vectors of length m and n respectively.
▪ Work is a working copy of the available resources, which
will be modified during the analysis.
▪ Finish is a vector of booleans indicating whether a particular
process can finish. ( or has finished so far in the analysis. )
▪ Initialize Work to Available, and Finish to false for all
elements.
2. Find an i such that both (A) Finish[ i ] == false, and (B) Need[ i ] <=
Work. This process has not finished, but could finish with the given
available working set. If no such i exists, go to step 4.
3. Set Work = Work + Allocation[ i ], and set Finish[ i ] to true. This
corresponds to process i finishing up and releasing its resources back
into the work pool. Then loop back to step 2.
4. If finish[ i ] == true for all i, then the state is a safe state, because a
safe sequence has been found.
• ( JTB's Modification:
1. In step 1. instead of making Finish an array of booleans initialized
to false, make it an array of ints initialized to 0. Also initialize an int
s = 0 as a step counter.
2. In step 2, look for Finish[ i ] == 0.
3. In step 3, set Finish[ i ] to ++s. S is counting the number of finished
processes.
4. For step 4, the test can be either Finish[ i ] > 0 for all i, or s >= n.
The benefit of this method is that if a safe state exists, then Finish[ ]
indicates one safe sequence ( of possibly many. ) )
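
The safety algorithm above can be sketched compactly in C. The code below is illustrative only; the array sizes and the example data in main( ) come from the 12-tape-drive example of Section 4.5.1 ( 3 processes, one resource type, 3 drives free ), and a real system would of course fill these structures from its actual allocation state.

#include <stdbool.h>
#include <stdio.h>

#define N 3   /* number of processes       ( P0, P1, P2 in the tape-drive example ) */
#define M 1   /* number of resource types  ( tape drives only ) */

/* Safety algorithm: Work starts as Available, and we repeatedly "finish"
 * any process whose Need fits within Work, releasing its Allocation. */
static bool is_safe(int available[M], int allocation[N][M], int need[N][M])
{
    int  work[M];
    bool finish[N] = { false };

    for (int j = 0; j < M; j++) work[j] = available[j];      /* step 1 */

    for (;;) {
        bool advanced = false;
        for (int i = 0; i < N; i++) {                        /* step 2 */
            if (finish[i]) continue;
            bool fits = true;
            for (int j = 0; j < M; j++)
                if (need[i][j] > work[j]) { fits = false; break; }
            if (!fits) continue;
            for (int j = 0; j < M; j++)                      /* step 3 */
                work[j] += allocation[i][j];
            finish[i] = true;
            advanced  = true;
        }
        if (!advanced) break;
    }

    for (int i = 0; i < N; i++)                              /* step 4 */
        if (!finish[i]) return false;
    return true;
}

int main(void)
{
    /* Section 4.5.1 example: 12 drives total, 9 allocated, 3 available. */
    int available[M]     = { 3 };
    int allocation[N][M] = { {5}, {2}, {2} };      /* P0, P1, P2 */
    int max[N][M]        = { {10}, {4}, {9} };
    int need[N][M];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            need[i][j] = max[i][j] - allocation[i][j];
    printf("state is %s\n", is_safe(available, allocation, need) ? "SAFE" : "UNSAFE");
    return 0;
}

Running this prints SAFE, corresponding to the safe sequence P1, P0, P2 for that example. Changing allocation[2][0] to 3 and available[0] to 2 ( i.e. granting P2 one more drive ) makes the state unsafe, which answers the question posed in Section 4.5.1.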
4.5.3.2 Resource-Request Algorithm ( The Bankers Algorithm )
• Now that we have a tool for determining if a particular state is safe or not,
we are now ready to look at the Banker's algorithm itself.
• This algorithm determines if a new request is safe, and grants it only if it is
safe to do so.
• When a request is made ( that does not exceed currently available resources
), pretend it has been granted, and then see if the resulting state is a safe one.
If so, grant the request, and if not, deny the request, as follows:
1. Let Request[ n ][ m ] indicate the number of resources of each type
currently requested by processes. If Request[ i ] > Need[ i ] for any
process i, raise an error condition.
2. If Request[ i ] > Available for any process i, then that process must
wait for resources to become available. Otherwise the process can
continue to step 3.
3. Check to see if the request can be granted safely, by pretending it
has been granted and then seeing if the resulting state is safe. If so,
grant the request, and if not, then the process must wait until its
request can be granted safely. The procedure for granting a request (
or pretending to for testing purposes ) is:
▪ Available = Available - Request
▪ Allocation = Allocation + Request
▪ Need = Need - Request
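
Continuing the sketch from the safety algorithm above ( and reusing its is_safe( ), N, and M ), the request-granting step can be written as a pretend-grant followed by a safety check, with a rollback if the pretended state turns out to be unsafe. This is an illustrative sketch; a real kernel would also have to block the requesting process rather than simply return false.

/* Banker's resource-request algorithm for process 'pid'.
 * Returns true if the request is granted, false if it must wait ( or is in error ). */
static bool request_resources(int pid, int request[M],
                              int available[M], int allocation[N][M], int need[N][M])
{
    for (int j = 0; j < M; j++) {
        if (request[j] > need[pid][j]) return false;    /* step 1: exceeds declared maximum */
        if (request[j] > available[j]) return false;    /* step 2: not enough free resources */
    }

    /* Step 3: tentatively grant the request. */
    for (int j = 0; j < M; j++) {
        available[j]       -= request[j];
        allocation[pid][j] += request[j];
        need[pid][j]       -= request[j];
    }

    if (is_safe(available, allocation, need))
        return true;                                    /* resulting state is safe: grant */

    /* Unsafe: roll the tentative grant back; the process must wait. */
    for (int j = 0; j < M; j++) {
        available[j]       += request[j];
        allocation[pid][j] -= request[j];
        need[pid][j]       += request[j];
    }
    return false;
}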
4.5.3.3 An Illustrative Example
• Consider the following situation, with five processes P0 - P4 and three resource
types A ( 10 instances ), B ( 5 instances ), and C ( 7 instances ):

             Allocation     Max       Available
              A  B  C      A  B  C     A  B  C
    P0        0  1  0      7  5  3     3  3  2
    P1        2  0  0      3  2  2
    P2        3  0  2      9  0  2
    P3        2  1  1      2  2  2
    P4        0  0  2      4  3  3

• And now consider what happens if process P1 requests 1 instance of A and
2 instances of C. ( Request[ 1 ] = ( 1, 0, 2 ) )
• What about requests of ( 3, 3, 0 ) by P4? or ( 0, 2, 0 ) by P0? Can these be
safely granted? Why or why not?
4.6 Deadlock Detection
• If deadlocks are not avoided, then another approach is to detect when they have occurred
and recover somehow.
• In addition to the performance hit of constantly checking for deadlocks, a policy / algorithm
must be in place for recovering from deadlocks, and there is potential for lost work when
processes must be aborted or have their resources preempted.
4.6.1 Single Instance of Each Resource Type
• If each resource category has a single instance, then we can use a variation of the
resource-allocation graph known as a wait-for graph.
• A wait-for graph can be constructed from a resource-allocation graph by
eliminating the resources and collapsing the associated edges, as shown in the
figure below.
• An arc from Pi to Pj in a wait-for graph indicates that process Pi is waiting for a
resource that process Pj is currently holding.

Figure 4 - (a) Resource allocation graph. (b) Corresponding wait-for graph


• As before, cycles in the wait-for graph indicate deadlocks.
• This algorithm must maintain the wait-for graph, and periodically search it for
cycles.
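
Cycle detection in a wait-for graph is a standard depth-first search. The sketch below is illustrative only; waits_for[ i ][ j ] and the sample edges in main( ) are hypothetical, and a real kernel would build this matrix from its resource-allocation tables.

#include <stdbool.h>
#include <stdio.h>

#define NPROC 4

/* waits_for[i][j] is true when process Pi is waiting for a resource held by Pj. */
static bool waits_for[NPROC][NPROC];

/* Depth-first search: an edge back to a vertex still on the DFS stack means a cycle. */
static bool has_cycle_from(int v, bool visited[], bool on_stack[])
{
    visited[v] = on_stack[v] = true;
    for (int w = 0; w < NPROC; w++) {
        if (!waits_for[v][w]) continue;
        if (on_stack[w]) return true;
        if (!visited[w] && has_cycle_from(w, visited, on_stack)) return true;
    }
    on_stack[v] = false;
    return false;
}

static bool deadlocked(void)
{
    bool visited[NPROC] = { false }, on_stack[NPROC] = { false };
    for (int v = 0; v < NPROC; v++)
        if (!visited[v] && has_cycle_from(v, visited, on_stack))
            return true;
    return false;
}

int main(void)
{
    waits_for[0][1] = true;    /* P0 waits for P1 */
    waits_for[1][2] = true;    /* P1 waits for P2 */
    waits_for[2][0] = true;    /* P2 waits for P0, closing the cycle */
    printf("deadlock %s\n", deadlocked() ? "detected" : "not detected");
    return 0;
}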
4.6.2 Several Instances of a Resource Type
• The detection algorithm outlined here is essentially the same as the Banker's
algorithm, with two subtle differences:
o In step 1, the Banker's Algorithm sets Finish[ i ] to false for all i. The
algorithm presented here sets Finish[ i ] to false only if Allocation[ i ] is not
zero. If the currently allocated resources for this process are zero, the
algorithm sets Finish[ i ] to true. This is essentially assuming that IF all of
the other processes can finish, then this process can finish also.
Furthermore, this algorithm is specifically looking for which processes are
involved in a deadlock situation, and a process that does not have any
resources allocated cannot be involved in a deadlock, and so can be removed
from any further consideration.
o Steps 2 and 3 are essentially unchanged, except that the detection algorithm compares
Request[ i ] ( what the process is currently requesting ) against Work, rather than
Need[ i ], since a process's future needs cannot be known.
o In step 4, the basic Banker's Algorithm says that if Finish[ i ] == true for all
i, that there is no deadlock. This algorithm is more specific, by stating that
if Finish[ i ] == false for any process Pi, then that process is specifically
involved in the deadlock which has been detected.
• ( Note: An alternative method was presented above, in which Finish held integers
instead of booleans. This vector would be initialized to all zeros, and then filled
with increasing integers as processes are detected which can finish. If any processes
are left at zero when the algorithm completes, then there is a deadlock, and if not,
then the integers in finish describe a safe sequence. To modify this algorithm to
match this section of the text, processes with allocation = zero could be filled in
with N, N - 1, N - 2, etc. in step 1, and any processes left with Finish = 0 in step 4
are the deadlocked processes. )
• Consider, for example, the following state, and determine if it is currently
deadlocked:
• Now suppose that process P2 makes a request for an additional instance of type C,
yielding the state shown below. Is the system now deadlocked?

4.6.3 Detection-Algorithm Usage


• When should the deadlock detection be done? Frequently, or infrequently?
• The answer may depend on how frequently deadlocks are expected to occur, as well
as the possible consequences of not catching them immediately. ( If deadlocks are
not removed immediately when they occur, then more and more processes can
"back up" behind the deadlock, making the eventual task of unblocking the system
more difficult and possibly damaging to more processes. )
• There are two obvious approaches, each with trade-offs:
1. Do deadlock detection after every resource allocation which cannot be
immediately granted. This has the advantage of detecting the deadlock right
away, while the minimum number of processes are involved in the
deadlock. ( One might consider that the process whose request triggered the
deadlock condition is the "cause" of the deadlock, but realistically all of the
processes in the cycle are equally responsible for the resulting deadlock. )
The down side of this approach is the extensive overhead and performance
hit caused by checking for deadlocks so frequently.
2. Do deadlock detection only when there is some clue that a deadlock may
have occurred, such as when CPU utilization reduces to 40% or some other
magic number. The advantage is that deadlock detection is done much less
frequently, but the down side is that it becomes impossible to detect the
processes involved in the original deadlock, and so deadlock recovery can
be more complicated and damaging to more processes.
3. ( As I write this, a third alternative comes to mind: Keep a historical log of
resource allocations, since that last known time of no deadlocks. Do
deadlock checks periodically ( once an hour or when CPU usage is low?),
and then use the historical log to trace through and determine when the
deadlock occurred and what processes caused the initial deadlock.
Unfortunately I'm not certain that breaking the original deadlock would then
free up the resulting log jam. )
4.7 Recovery From Deadlock
• There are three basic approaches to recovery from deadlock:
1. Inform the system operator, and allow him/her to take manual intervention.
2. Terminate one or more processes involved in the deadlock
3. Preempt resources.
4.7.1 Process Termination
• Two basic approaches, both of which recover resources allocated to terminated
processes:
o Terminate all processes involved in the deadlock. This definitely solves the
deadlock, but at the expense of terminating more processes than would be
absolutely necessary.
o Terminate processes one by one until the deadlock is broken. This is more
conservative, but requires doing deadlock detection after each step.
• In the latter case there are many factors that can go into deciding which processes
to terminate next:
1. Process priorities.
2. How long the process has been running, and how close it is to finishing.
3. How many and what type of resources is the process holding. ( Are they
easy to preempt and restore? )
4. How many more resources does the process need to complete.
5. How many processes will need to be terminated
6. Whether the process is interactive or batch.
7. ( Whether or not the process has made non-restorable changes to any
resource. )
4.7.2 Resource Preemption
• When preempting resources to relieve deadlock, there are three important issues to
be addressed:
1. Selecting a victim - Deciding which resources to preempt from which
processes involves many of the same decision criteria outlined above.
2. Rollback - Ideally one would like to roll back a preempted process to a safe
state prior to the point at which that resource was originally allocated to the
process. Unfortunately it can be difficult or impossible to determine what
such a safe state is, and so the only safe rollback is to roll back all the way
back to the beginning. ( I.e. abort the process and make it start over. )
3. Starvation - How do you guarantee that a process won't starve because its
resources are constantly being preempted? One option would be to use a
priority system, and increase the priority of a process every time its
resources get preempted. Eventually it should get a high enough priority
that it won't get preempted any more.

5.1 File Concept

5.1.1 File Attributes

• Different OSes keep track of different file attributes, including:


o Name - Some systems give special significance to names, and particularly
extensions ( .exe, .txt, etc. ), and some do not. Some extensions may be of
significance to the OS ( .exe ), and others only to certain applications ( .jpg
)
o Identifier ( e.g. inode number )
o Type - Text, executable, other binary, etc.
o Location - on the hard drive.
o Size
o Protection
o Time & Date
o User ID

5.1.2 File Operations

• The file ADT supports many common operations:


o Creating a file
o Writing a file
o Reading a file
o Repositioning within a file
o Deleting a file
o Truncating a file.
• Most OSes require that files be opened before access and closed after all access is
complete. Normally the programmer must open and close files explicitly, but some
rare systems open the file automatically at first access. Information about currently
open files is stored in an open file table, containing for example:
o File pointer - records the current position in the file, for the next read or
write access.
o File-open count - How many times has the current file been opened
( simultaneously by different processes ) and not yet closed? When this
counter reaches zero the file can be removed from the table.
o Disk location of the file.
o Access rights
• Some systems provide support for file locking.
o A shared lock is for reading only.
o An exclusive lock is for writing as well as reading.
o An advisory lock is informational only, and not enforced. ( A "Keep Out"
sign, which may be ignored. )
o A mandatory lock is enforced. ( A truly locked door. )
o UNIX uses advisory locks, and Windows uses mandatory locks. ( A sketch of
advisory locking with fcntl( ) follows below. )
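
The sketch below shows advisory locking through the POSIX fcntl( ) interface. It is illustrative only: the file name is made up, error handling is minimal, and the lock is honored only by other processes that also check for it ( that is what makes it advisory ).

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("example.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type   = F_WRLCK;     /* exclusive ( write ) lock; F_RDLCK would be a shared lock */
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;
    fl.l_len    = 0;           /* length 0 means "lock the whole file" */

    if (fcntl(fd, F_SETLKW, &fl) < 0) {   /* F_SETLKW blocks until the lock is granted */
        perror("fcntl");
        return 1;
    }

    /* ... read / write the file while holding the lock ... */

    fl.l_type = F_UNLCK;                  /* release the lock */
    fcntl(fd, F_SETLK, &fl);
    close(fd);
    return 0;
}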

5.1.3 File Types

• Windows ( and some other systems ) use special file extensions to indicate
the type of each file:

Figure 5.1 - Common file types.

• Macintosh stores a creator attribute for each file, according to the program that first
created it with the create( ) system call.
• UNIX stores magic numbers at the beginning of certain files. ( Experiment with the
"file" command, especially in directories such as /bin and /dev )

5.1.4 File Structure


• Some files contain an internal structure, which may or may not be known to the
OS.
• For the OS to support particular file formats increases the size and complexity of
the OS.
• UNIX treats all files as sequences of bytes, with no further consideration of the
internal structure. ( With the exception of executable binary programs, which it
must know how to load and find the first executable statement, etc. )
• Macintosh files have two forks - a resource fork, and a data fork. The resource
fork contains information relating to the UI, such as icons and button images, and
can be modified independently of the data fork, which contains the code or data as
appropriate.

5.1.5 Internal File Structure

• Disk files are accessed in units of physical blocks, typically 512 bytes or some
power-of-two multiple thereof. ( Larger physical disks use larger block sizes, to
keep the range of block numbers within the range of a 32-bit integer. )
• Internally files are organized in units of logical units, which may be as small as a
single byte, or may be a larger size corresponding to some data record or structure
size.
• The number of logical units which fit into one physical block determines
its packing, and has an impact on the amount of internal fragmentation ( wasted
space ) that occurs.
• As a general rule, half a physical block is wasted for each file, and the larger the
block sizes the more space is lost to internal fragmentation.

5.2 Access Methods

5.2.1 Sequential Access

• A sequential access file emulates magnetic tape operation, and generally supports
a few operations:
o read next - read a record and advance the tape to the next position.
o write next - write a record and advance the tape to the next position.
o rewind
o skip n records - May or may not be supported. N may be limited to positive
numbers, or may be limited to +/- 1.

Figure 5.2 - Sequential-access file.


5.2.2 Direct Access

• Jump to any record and read that record. Operations supported include:
o read n - read record number n. ( Note an argument is now required. )
o write n - write record number n. ( Note an argument is now required. )
o jump to record n - could be 0 or the end of file.
o Query current record - used to return back to this record later.
o Sequential access can be easily emulated using direct access. The inverse is
complicated and inefficient.

Figure 5.3 - Simulation of sequential access on a direct-access file.
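
The emulation in the figure above amounts to keeping a current-record pointer cp and implementing "read next" as a direct read of record cp followed by cp = cp + 1. A hedged sketch using the POSIX lseek( ) and read( ) calls ( the file name and the fixed record size are illustrative assumptions ):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define RECORD_SIZE 64          /* assumed fixed-length records */

static int cp = 0;              /* current record pointer */

/* "read next": direct-access read of record cp, then advance cp. */
static ssize_t read_next(int fd, char *buf)
{
    off_t pos = (off_t) cp * RECORD_SIZE;
    if (lseek(fd, pos, SEEK_SET) < 0) return -1;
    ssize_t n = read(fd, buf, RECORD_SIZE);
    if (n > 0) cp++;
    return n;
}

int main(void)
{
    char buf[RECORD_SIZE];
    int fd = open("records.dat", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    while (read_next(fd, buf) == RECORD_SIZE)
        printf("read record %d\n", cp - 1);
    close(fd);
    return 0;
}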

5.2.3 Other Access Methods

• An indexed access scheme can be easily built on top of a direct access system. Very
large files may require a multi-tiered indexing scheme, i.e. indexes of indexes.

Figure 5.4 - Example of index and relative files.

5.3 Directory Structure

5.3.1 Storage Structure


• A disk can be used in its entirety for a file system.
• Alternatively a physical disk can be broken up into multiple partitions, slices, or
mini-disks, each of which becomes a virtual disk and can have its own filesystem.
( or be used for raw storage, swap space, etc. )
• Or, multiple physical disks can be combined into one volume, i.e. a larger virtual
disk, with its own filesystem spanning the physical disks.

Figure 5.5 - A typical file-system organization.

5.3.2 Directory Overview

• Directory operations to be supported include:


o Search for a file
o Create a file - add to the directory
o Delete a file - erase from the directory
o List a directory - possibly ordered in different ways.
o Rename a file - may change sorting order
o Traverse the file system.

5.3.3. Single-Level Directory

• Simple to implement, but each file must have a unique name.

Figure 5.6 - Single-level directory.


5.3.4 Two-Level Directory

• Each user gets their own directory space.


• File names only need to be unique within a given user's directory.
• A master file directory is used to keep track of each user's directory, and must be
maintained when users are added to or removed from the system.
• A separate directory is generally needed for system ( executable ) files.
• Systems may or may not allow users to access other directories besides their own
o If access to other directories is allowed, then provision must be made to
specify the directory being accessed.
o If access is denied, then special consideration must be made for users to run
programs located in system directories. A search path is the list of
directories in which to search for executable programs, and can be set
uniquely for each user.

Figure 5.7 - Two-level directory structure.

5.3.5 Tree-Structured Directories

• An obvious extension to the two-tiered directory structure, and the one with which
we are all most familiar.
• Each user / process has the concept of a current directory from which all ( relative )
searches take place.
• Files may be accessed using either absolute pathnames ( relative to the root of the
tree ) or relative pathnames ( relative to the current directory. )
• Directories are stored the same as any other file in the system, except there is a bit
that identifies them as directories, and they have some special structure that the OS
understands.
• One question for consideration is whether or not to allow the removal of directories
that are not empty - Windows requires that directories be emptied first, and UNIX
provides an option for deleting entire sub-trees.
Figure 5.8 - Tree-structured directory structure.

5.3.6 Acyclic-Graph Directories

• When the same files need to be accessed in more than one place in the directory
structure ( e.g. because they are being shared by more than one user / process ), it
can be useful to provide an acyclic-graph structure. ( Note the directed arcs from
parent to child. )
• UNIX provides two types of links for implementing the acyclic-graph structure. (
See "man ln" for more details. )
o A hard link ( usually just called a link ) involves two or more directory entries
that all refer to the same file. Hard links are only valid for ordinary files
in the same filesystem.
o A symbolic link, that involves a special file, containing information about
where to find the linked file. Symbolic links may be used to link directories
and/or files in other filesystems, as well as ordinary files in the current
filesystem.
• Windows only supports symbolic links, termed shortcuts.
• Hard links require a reference count, or link count for each file, keeping track of
how many directory entries are currently referring to this file. Whenever one of the
references is removed the link count is reduced, and when it reaches zero, the disk
space can be reclaimed.
• For symbolic links there is some question as to what to do with the symbolic links
when the original file is moved or deleted:
o One option is to find all the symbolic links and adjust them also.
o Another is to leave the symbolic links dangling, and discover that they are
no longer valid the next time they are used.
o What if the original file is removed, and replaced with another file having
the same name before the symbolic link is next used?
Figure 5.9 - Acyclic-graph directory structure.

5.3.7 General Graph Directory

• If cycles are allowed in the graphs, then several problems can arise:
o Search algorithms can go into infinite loops. One solution is to not follow
links in search algorithms. ( Or not to follow symbolic links, and to only
allow symbolic links to refer to directories. )
o Sub-trees can become disconnected from the rest of the tree and still not
have their reference counts reduced to zero. Periodic garbage collection is
required to detect and resolve this problem. ( chkdsk in DOS and fsck in
UNIX search for these problems, among others, even though cycles are not
supposed to be allowed in either system. Disconnected disk blocks that are
not marked as free are added back to the file systems with made-up file
names, and can usually be safely deleted. )
Figure 5.10 - General graph directory.

5.4 File-System Implementation

5.4.1 Overview

• File systems store several important data structures on the disk:


o A boot-control block, ( per volume ) a.k.a. the boot block in UNIX or
the partition boot sector in Windows contains information about how to
boot the system off of this disk. This will generally be the first sector of the
volume if there is a bootable system loaded on that volume, or the block
will be left vacant otherwise.
o A volume control block, ( per volume ) a.k.a. the superblock in UNIX or part of
the master file table in NTFS ( Windows ), which contains information such as the
number of blocks in the volume, the block size, a free-block count and free-block
pointers, and a free-FCB count and FCB pointers.
o A directory structure ( per file system ), containing file names and pointers
to corresponding FCBs. UNIX uses inode numbers, and NTFS uses
a master file table.
o The File Control Block, FCB, ( per file ) containing details about
ownership, size, permissions, dates, etc. UNIX stores this information in
inodes, and NTFS in the master file table as a relational database structure.
Figure 5.11 - A typical file-control block.

• There are also several key data structures stored in memory:


o An in-memory mount table.
o An in-memory directory cache of recently accessed directory information.
o A system-wide open file table, containing a copy of the FCB for every
currently open file in the system, as well as some other related information.
o A per-process open file table, containing a pointer to the system open file
table as well as some other information. ( For example the current file
position pointer may be either here or in the system file table, depending on
the implementation and whether the file is being shared or not. )
• Figure 5.12 illustrates some of the interactions of file system components when files
are created and/or used:
o When a new file is created, a new FCB is allocated and filled out with
important information regarding the new file. The appropriate directory is
modified with the new file name and FCB information.
o When a file is accessed during a program, the open( ) system call reads in
the FCB information from disk, and stores it in the system-wide open file
table. An entry is added to the per-process open file table referencing the
system-wide table, and an index into the per-process table is returned by the
open( ) system call. UNIX refers to this index as a file descriptor, and
Windows refers to it as a file handle.
o If another process already has a file open when a new request comes in for
the same file, and it is sharable, then a counter in the system-wide table is
incremented and the per-process table is adjusted to point to the existing
entry in the system-wide table.
o When a file is closed, the per-process table entry is freed, and the counter
in the system-wide table is decremented. If that counter reaches zero, then
the system wide table is also freed. Any data currently stored in memory
cache for this file is written out to disk if necessary.
Figure 5.12 - In-memory file-system structures. (a) File open. (b) File read.

5.4.2 Partitions and Mounting

• Physical disks are commonly divided into smaller units called partitions. They can
also be combined into larger units, but that is most commonly done for RAID
installations and is left for later chapters.
• Partitions can either be used as raw devices ( with no structure imposed upon them
), or they can be formatted to hold a filesystem ( i.e. populated with FCBs and initial
directory structures as appropriate. ) Raw partitions are generally used for swap
space, and may also be used for certain programs such as databases that choose to
manage their own disk storage system. Partitions containing filesystems can
generally only be accessed using the file system structure by ordinary users, but can
often be accessed as a raw device also by root.
• The boot block is accessed as part of a raw partition, by the boot program prior to
any operating system being loaded. Modern boot programs understand multiple
OSes and filesystem formats, and can give the user a choice of which of several
available systems to boot.
• The root partition contains the OS kernel and at least the key portions of the OS
needed to complete the boot process. At boot time the root partition is mounted,
and control is transferred from the boot program to the kernel found there. ( Older
systems required that the root partition lie completely within the first 1024 cylinders
of the disk, because that was as far as the boot program could reach. Once the kernel
had control, then it could access partitions beyond the 1024 cylinder boundary. )
• Continuing with the boot process, additional filesystems get mounted, adding their
information into the appropriate mount table structure. As a part of the mounting
process the file systems may be checked for errors or inconsistencies, either because
they are flagged as not having been closed properly the last time they were used, or
just for general principals. Filesystems may be mounted either automatically or
manually. In UNIX a mount point is indicated by setting a flag in the in-memory
copy of the inode, so all future references to that inode get re-directed to the root
directory of the mounted filesystem.

5.4.3 Virtual File Systems

• Virtual File Systems, VFS, provide a common interface to multiple different
filesystem types. In addition, it provides for a unique identifier ( vnode ) for files
across the entire space, including across all filesystems of different types. ( UNIX
across the entire space, including across all filesystems of different types. ( UNIX
inodes are unique only across a single filesystem, and certainly do not carry across
networked file systems. )
• The VFS in Linux is based upon four key object types:
o The inode object, representing an individual file
o The file object, representing an open file.
o The superblock object, representing a filesystem.
o The dentry object, representing a directory entry.
• Linux VFS provides a set of common functionalities for each filesystem, using
function pointers accessed through a table. The same functionality is accessed
through the same table position for all filesystem types, though the actual functions
pointed to by the pointers may be filesystem-specific. See /usr/include/linux/fs.h
for full details. Common operations provided include open( ), read( ), write( ), and
mmap( ).

Figure 5.13 - Schematic view of a virtual file system.


5.5 Overview of Mass-Storage Structure

5.5.1 Magnetic Disks

• Traditional magnetic disks have the following basic structure:


o One or more platters in the form of disks covered with magnetic
media. Hard disk platters are made of rigid metal, while "floppy" disks are
made of more flexible plastic.
o Each platter has two working surfaces. Older hard disk drives would
sometimes not use the very top or bottom surface of a stack of platters, as
these surfaces were more susceptible to potential damage.
o Each working surface is divided into a number of concentric rings
called tracks. The collection of all tracks that are the same distance from
the edge of the platter, ( i.e. all tracks immediately above one another in the
following diagram ) is called a cylinder.
o Each track is further divided into sectors, traditionally containing 512 bytes
of data each, although some modern disks occasionally use larger sector
sizes. ( Sectors also include a header and a trailer, including checksum
information among other things. Larger sector sizes reduce the fraction of
the disk consumed by headers and trailers, but increase internal
fragmentation and the amount of disk that must be marked bad in the case
of errors. )
o The data on a hard drive is read by read-write heads. The standard
configuration ( shown below ) uses one head per surface, each on a
separate arm, and controlled by a common arm assembly which moves all
heads simultaneously from one cylinder to another. ( Other configurations,
including independent read-write heads, may speed up disk access, but
involve serious technical difficulties. )
o The storage capacity of a traditional disk drive is equal to the number of
heads ( i.e. the number of working surfaces ), times the number of tracks
per surface, times the number of sectors per track, times the number of bytes
per sector. A particular physical block of data is specified by providing the
head-sector-cylinder number at which it is located.
Figure 5.14 - Moving-head disk mechanism.
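
As a purely numerical illustration of the capacity formula above ( the geometry figures are invented for the example, not taken from any real drive ): a drive with 16 heads, 10,000 tracks per surface, an average of 400 sectors per track, and 512 bytes per sector would hold 16 x 10,000 x 400 x 512 bytes, which is about 32.8 GB.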

• In operation the disk rotates at high speed, such as 7200 rpm ( 120 revolutions per
second. ) The rate at which data can be transferred from the disk to the computer is
composed of several steps:
o The positioning time, a.k.a. the seek time or random access time is the time
required to move the heads from one cylinder to another, and for the heads
to settle down after the move. This is typically the slowest step in the
process and the predominant bottleneck to overall transfer rates.
o The rotational latency is the amount of time required for the desired sector
to rotate around and come under the read-write head. This can range
anywhere from zero to one full revolution, and on the average will equal
one-half revolution. This is another physical step and is usually the second
slowest step behind seek time. ( For a disk rotating at 7200 rpm, the average
rotational latency would be 1/2 revolution / 120 revolutions per second, or
just over 4 milliseconds, a long time by computer standards. )
o The transfer rate, which is the time required to move the data electronically
from the disk to the computer. ( Some authors may also use the term transfer
rate to refer to the overall transfer rate, including seek time and rotational
latency as well as the electronic data transfer rate. )
• Disk heads "fly" over the surface on a very thin cushion of air. If they should
accidentally contact the disk, then a head crash occurs, which may or may not
permanently damage the disk or even destroy it completely. For this reason it is
normal to park the disk heads when turning a computer off, which means to move
the heads off the disk or to an area of the disk where there is no data stored.
• Floppy disks are normally removable. Hard drives can also be removable, and some
are even hot-swappable, meaning they can be removed while the computer is
running, and a new hard drive inserted in their place.
• Disk drives are connected to the computer via a cable known as the I/O Bus. Some
of the common interface formats include Enhanced Integrated Drive Electronics,
EIDE; Advanced Technology Attachment, ATA; Serial ATA, SATA; Universal
Serial Bus, USB; Fiber Channel, FC; and Small Computer Systems Interface, SCSI.
• The host controller is at the computer end of the I/O bus, and the disk controller is
built into the disk itself. The CPU issues commands to the host controller via I/O
ports. Data is transferred between the magnetic surface and onboard cache by the
disk controller, and then the data is transferred from that cache to the host controller
and the motherboard memory at electronic speeds.

5.5.2 Solid-State Disks - New

• As technologies improve and economics change, old technologies are often used in
different ways. One example of this is the increasing use of solid-state disks, or
SSDs.
• SSDs use memory technology as a small fast hard disk. Specific implementations
may use either flash memory or DRAM chips protected by a battery to sustain the
information through power cycles.
• Because SSDs have no moving parts they are much faster than traditional hard
drives, and certain problems such as the scheduling of disk accesses simply do not
apply.
• However SSDs also have their weaknesses: They are more expensive than hard
drives, generally not as large, and may have shorter life spans.
• SSDs are especially useful as a high-speed cache of hard-disk information that must
be accessed quickly. One example is to store filesystem meta-data, e.g. directory
and inode information, that must be accessed quickly and often. Another variation
is a boot disk containing the OS and some application executables, but no vital user
data. SSDs are also used in laptops to make them smaller, faster, and lighter.
• Because SSDs are so much faster than traditional hard disks, the throughput of the
bus can become a limiting factor, causing some SSDs to be connected directly to
the system PCI bus for example.

5.5.3 Magnetic Tapes

• Magnetic tapes were once used for common secondary storage before the days of
hard disk drives, but today are used primarily for backups.
• Accessing a particular spot on a magnetic tape can be slow, but once reading or
writing commences, access speeds are comparable to disk drives.
• Capacities of tape drives can range from 20 to 200 GB, and compression can double
that capacity.

5.6 Disk Structure

• The traditional head-sector-cylinder, HSC numbers are mapped to linear block addresses
by numbering the first sector on the first head on the outermost track as sector 0.
Numbering proceeds with the rest of the sectors on that same track, and then the rest of the
tracks on the same cylinder before proceeding through the rest of the cylinders to the center
of the disk. In modern practice these linear block addresses are used in place of the HSC
numbers for a variety of reasons ( a small conversion sketch follows the list below ):
1. The linear length of tracks near the outer edge of the disk is much longer than for
those tracks located near the center, and therefore it is possible to squeeze many
more sectors onto outer tracks than onto inner ones.
2. All disks have some bad sectors, and therefore disks maintain a few spare sectors
that can be used in place of the bad ones. The mapping of spare sectors to bad
sectors is managed internally by the disk controller.
3. Modern hard drives can have thousands of cylinders, and hundreds of sectors per
track on their outermost tracks. These numbers exceed the range of HSC numbers
for many ( older ) operating systems, and therefore disks can be configured for any
convenient combination of HSC values that falls within the total number of sectors
physically on the drive.
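• A small C sketch of the CHS-to-LBA numbering just described ( the geometry constants are assumed, and sectors are numbered from 1 within a track, as on traditional drives ):

    #include <stdio.h>
    #include <stdint.h>

    #define HEADS             16u   /* assumed heads per cylinder */
    #define SECTORS_PER_TRACK 63u   /* assumed sectors per track  */

    /* Sector 0 is the first sector of the first head on the outermost cylinder;
       numbering continues through that track, then the rest of the cylinder,
       then inward cylinder by cylinder. */
    static uint64_t chs_to_lba( uint32_t cyl, uint32_t head, uint32_t sector )
    {
        return ( (uint64_t) cyl * HEADS + head ) * SECTORS_PER_TRACK + ( sector - 1 );
    }

    int main( void )
    {
        printf( "C=0 H=0 S=1  -> LBA %llu\n", (unsigned long long) chs_to_lba( 0, 0, 1 ) );
        printf( "C=2 H=3 S=10 -> LBA %llu\n", (unsigned long long) chs_to_lba( 2, 3, 10 ) );
        return 0;
    }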
• There is a limit to how closely packed individual bits can be placed on a physical media,
but that limit is growing increasingly more packed as technological advances are made.
• Modern disks pack many more sectors into outer cylinders than inner ones, using one of
two approaches:
o With Constant Linear Velocity, CLV, the density of bits is uniform from cylinder
to cylinder. Because there are more sectors in outer cylinders, the disk spins slower
when reading those cylinders, causing the rate of bits passing under the read-write
head to remain constant. This is the approach used by modern CDs and DVDs.
o With Constant Angular Velocity, CAV, the disk rotates at a constant angular speed,
with the bit density decreasing on outer cylinders. ( These disks would have a
constant number of sectors per track on all cylinders. )

5.7 Disk Attachment

Disk drives can be attached either directly to a particular host ( a local disk ) or to a network.

5.7.1 Host-Attached Storage

• Local disks are accessed through I/O Ports as described earlier.


• The most common interfaces are IDE or ATA, each of which allow up to two drives
per host controller.
• SATA is similar with simpler cabling.
• High end workstations or other systems in need of a larger number of disks typically
use SCSI disks:
o The SCSI standard supports up to 16 targets on each SCSI bus, one of
which is generally the host adapter and the other 15 of which can be disk or
tape drives.
o A SCSI target is usually a single drive, but the standard also supports up to
8 units within each target. These would generally be used for accessing
individual disks within a RAID array. ( See below. )
o The SCSI standard also supports multiple host adapters in a single
computer, i.e. multiple SCSI busses.
o Modern advancements in SCSI include "fast" and "wide" versions, as well
as SCSI-2.
o SCSI cables may be either 50 or 68 conductors. SCSI devices may be
external as well as internal.
• FC is a high-speed serial architecture that can operate over optical fiber or four-
conductor copper wires, and has two variants:
o A large switched fabric having a 24-bit address space. This variant allows
for multiple devices and multiple hosts to interconnect, forming the basis
for the storage-area networks, SANs, to be discussed in a future section.
o The arbitrated loop, FC-AL, that can address up to 126 devices ( drives and
controllers. )

5.7.2 Network-Attached Storage

• Network attached storage connects storage devices to computers using a remote
procedure call, RPC, interface, typically with something like NFS filesystem
mounts. This is convenient for giving several computers in a group common
access to, and naming conventions for, shared storage.
• NAS can also be implemented using the iSCSI protocol, which carries SCSI commands
over Internet protocols and standard network connections, allowing long-distance
remote access to shared files.
• NAS allows computers to easily share data storage, but tends to be less efficient
than standard host-attached storage.

Figure 5.15 - Network-attached storage.

5.7.3 Storage-Area Network

• A Storage-Area Network, SAN, connects computers and storage devices in a
network, using storage protocols instead of network protocols.
• One advantage of this is that storage access does not tie up regular networking
bandwidth.
• SAN is very flexible and dynamic, allowing hosts and devices to attach and detach
on the fly.
• SAN is also controllable, allowing restricted access to certain hosts and devices.
Figure 5.16 - Storage-area network.

5.8 Disk Scheduling

• As mentioned earlier, disk transfer speeds are limited primarily by seek
times and rotational latency. When multiple requests are to be processed there is
also some inherent delay in waiting for other requests to be processed.
• Bandwidth is measured by the amount of data transferred divided by the total amount of
time from the first request being made to the last transfer being completed, ( for a series of
disk requests. )
• Both bandwidth and access time can be improved by processing requests in a good order.
• Disk requests include the disk address, memory address, number of sectors to transfer, and
whether the request is for reading or writing.

5.8.1 FCFS Scheduling

• First-Come First-Serve is simple and intrinsically fair, but not very efficient.
Consider in the following sequence the wild swing from cylinder 122 to 14 and then
back to 124:
Figure 5.17 - FCFS disk scheduling.

5.8.2 SSTF Scheduling

• Shortest Seek Time First scheduling is more efficient, but may lead to starvation
if a constant stream of requests arrives for the same general area of the disk.
• SSTF reduces the total head movement to 236 cylinders, down from the 640 required
for the same set of requests under FCFS. Note, however, that the distance could be
reduced still further, to 208, by servicing 37 and then 14 first before processing
the rest of the requests. ( See the comparison sketch after the figure below. )
Figure 5.18 - SSTF disk scheduling.
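• The totals above can be reproduced with a short C simulation. ( The request queue 98, 183, 37, 122, 14, 124, 65, 67 and the starting head position of 53 are assumed here; they are the values behind the 640 / 236 cylinder figures quoted above. )

    #include <stdio.h>
    #include <stdlib.h>

    #define N 8

    /* total head movement when requests are served in arrival order */
    static int total_fcfs( int start, const int *req )
    {
        int moved = 0, pos = start;
        for ( int i = 0; i < N; i++ ) {
            moved += abs( req[ i ] - pos );
            pos = req[ i ];
        }
        return moved;
    }

    /* total head movement when the closest pending request is always served next */
    static int total_sstf( int start, const int *req )
    {
        int done[ N ] = { 0 }, moved = 0, pos = start;
        for ( int served = 0; served < N; served++ ) {
            int best = -1;
            for ( int i = 0; i < N; i++ )
                if ( !done[ i ] && ( best < 0 ||
                     abs( req[ i ] - pos ) < abs( req[ best ] - pos ) ) )
                    best = i;
            moved += abs( req[ best ] - pos );
            pos = req[ best ];
            done[ best ] = 1;
        }
        return moved;
    }

    int main( void )
    {
        int requests[ N ] = { 98, 183, 37, 122, 14, 124, 65, 67 };
        printf( "FCFS total head movement: %d cylinders\n", total_fcfs( 53, requests ) );
        printf( "SSTF total head movement: %d cylinders\n", total_sstf( 53, requests ) );
        return 0;
    }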

5.8.3 SCAN Scheduling

• The SCAN algorithm, a.k.a. the elevator algorithm moves back and forth from one
end of the disk to the other, similarly to an elevator processing requests in a tall
building.
Figure 5.19 - SCAN disk scheduling.

• Under the SCAN algorithm, if a request arrives just ahead of the moving head then
it will be processed right away, but if it arrives just after the head has passed, then
it will have to wait for the head to pass going the other way on the return trip. This
leads to a fairly wide variation in access times which can be improved upon.
• Consider, for example, when the head reaches the high end of the disk: Requests
with high cylinder numbers just missed the passing head, which means they are all
fairly recent requests, whereas requests with low numbers may have been waiting
for a much longer time. Making the return scan from high to low then ends up
accessing recent requests first and making older requests wait that much longer.

5.8.4 C-SCAN Scheduling

• The Circular-SCAN algorithm improves upon SCAN by treating all requests in a
circular queue fashion - Once the head reaches the end of the disk, it returns to the
other end without processing any requests, and then starts again from the beginning
of the disk:
Figure 5.20 - C-SCAN disk scheduling.

5.8.5 LOOK Scheduling

• LOOK scheduling improves upon SCAN by looking ahead at the queue of pending
requests, and not moving the heads any farther towards the end of the disk than is
necessary. The following diagram illustrates the circular form of LOOK, and a small
ordering sketch follows it:
Figure 5.21 - C-LOOK disk scheduling.
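• A small C sketch of the C-LOOK ordering shown in the figure ( using the same assumed request queue and head position as in the earlier scheduling examples ):

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_int( const void *a, const void *b )
    {
        return *(const int *) a - *(const int *) b;
    }

    /* service all requests at or above the head in increasing order,
       then wrap around to the lowest pending request and continue upward */
    static void c_look( int head, int *req, int n )
    {
        qsort( req, n, sizeof( int ), cmp_int );
        int split = 0;
        while ( split < n && req[ split ] < head )
            split++;
        printf( "service order:" );
        for ( int i = split; i < n; i++ ) printf( " %d", req[ i ] );
        for ( int i = 0; i < split; i++ ) printf( " %d", req[ i ] );
        printf( "\n" );
    }

    int main( void )
    {
        int requests[] = { 98, 183, 37, 122, 14, 124, 65, 67 };
        c_look( 53, requests, 8 );
        return 0;
    }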

5.8.6 Selection of a Disk-Scheduling Algorithm

• With very low loads all algorithms are equal, since there will normally only be one
request to process at a time.
• For slightly larger loads, SSTF offers better performance than FCFS, but may lead
to starvation when loads become heavy enough.
• For busier systems, SCAN and LOOK algorithms eliminate starvation problems.
• The actual optimal algorithm may be something even more complex than those
discussed here, but the incremental improvements are generally not worth the
additional overhead.
• Some improvement to overall filesystem access times can be made by intelligent
placement of directory and/or inode information. If those structures are placed in
the middle of the disk instead of at the beginning of the disk, then the maximum
distance from those structures to data blocks is reduced to only one-half of the disk
size. If those structures can be further distributed and furthermore have their data
blocks stored as close as possible to the corresponding directory structures, then
that reduces still further the overall time to find the disk block numbers and then
access the corresponding data blocks.
• On modern disks the rotational latency can be almost as significant as the seek time;
however, it is not within the OS's control to account for that, because modern disks
do not reveal their internal sector mapping schemes, ( particularly when bad blocks
have been remapped to spare sectors. )
o Some disk manufacturers provide for disk scheduling algorithms directly
on their disk controllers, ( which do know the actual geometry of the disk
as well as any remapping ), so that if a series of requests are sent from the
computer to the controller then those requests can be processed in an
optimal order.
o Unfortunately there are some considerations that the OS must take into
account that are beyond the abilities of the on-board disk-scheduling
algorithms, such as priorities of some requests over others, or the need to
process certain requests in a particular order. For this reason OSes may elect
to spoon-feed requests to the disk controller one at a time in certain
situations.

5.9 Disk Management

5.9.1 Disk Formatting

• Before a disk can be used, it has to be low-level formatted, which means laying
down all of the headers and trailers marking the beginning and ends of each sector.
Included in the header and trailer are the linear sector numbers, and error-
correcting codes, ECC, which allow damaged sectors to not only be detected, but
in many cases for the damaged data to be recovered ( depending on the extent of
the damage. ) Sector sizes are traditionally 512 bytes, but may be larger, particularly
in larger drives.
• ECC calculation is performed with every disk read or write, and if damage is
detected but the data is recoverable, then a soft error has occurred. Soft errors are
generally handled by the on-board disk controller, and never seen by the OS. ( See
below. )
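• The layout of a single formatted sector might be sketched as the following C structure ( the field sizes are illustrative assumptions; real drives use proprietary, controller-internal formats ):

    #include <stdint.h>

    struct sector {
        /* header, laid down during low-level formatting */
        uint32_t sector_number;   /* linear sector number             */
        uint8_t  flags;           /* e.g. marked-bad / remapped flags */

        /* data area visible to the operating system */
        uint8_t  data[ 512 ];     /* traditional 512-byte payload     */

        /* trailer: error-correcting code over the data area, checked by
           the disk controller on every read and write */
        uint8_t  ecc[ 40 ];
    };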
• Once the disk is low-level formatted, the next step is to partition the drive into one
or more separate partitions. This step must be completed even if the disk is to be
used as a single large partition, so that the partition table can be written to the
beginning of the disk.
• After partitioning, then the filesystems must be logically formatted, which involves
laying down the master directory information ( FAT table or inode structure ),
initializing free lists, and creating at least the root directory of the filesystem. ( Disk
partitions which are to be used as raw devices are not logically formatted. This
saves the overhead and disk space of the filesystem structure, but requires that the
application program manage its own disk storage requirements. )

5.9.2 Boot Block

• Computer ROM contains a bootstrap program ( OS independent ) with just enough
code to find the first sector on the first hard drive on the first controller, load that
sector into memory, and transfer control over to it. ( The ROM bootstrap program
may look in floppy and/or CD drives before accessing the hard drive, and is smart
enough to recognize whether it has found valid boot code or not. )
• The first sector on the hard drive is known as the Master Boot Record, MBR, and
contains a very small amount of code in addition to the partition table. The partition
table documents how the disk is partitioned into logical disks, and indicates
specifically which partition is the active or boot partition.
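• The classic PC MBR layout can be sketched as a C structure ( a well-known format, shown here only for illustration; the text above does not describe its exact fields ):

    #include <stdint.h>

    #pragma pack( push, 1 )
    struct mbr_partition_entry {
        uint8_t  status;          /* 0x80 = active / boot partition, 0x00 = inactive */
        uint8_t  chs_first[ 3 ];  /* CHS address of first sector ( legacy )          */
        uint8_t  type;            /* partition type code                             */
        uint8_t  chs_last[ 3 ];   /* CHS address of last sector ( legacy )           */
        uint32_t lba_first;       /* LBA of the first sector                         */
        uint32_t sector_count;    /* number of sectors in the partition              */
    };

    struct mbr {
        uint8_t  boot_code[ 446 ];              /* tiny first-stage boot loader */
        struct mbr_partition_entry part[ 4 ];   /* the partition table          */
        uint16_t signature;                     /* 0xAA55 marks a valid MBR     */
    };
    #pragma pack( pop )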
• The boot program then looks to the active partition to find an operating system,
possibly loading up a slightly larger / more advanced boot program along the way.
• In a dual-boot ( or larger multi-boot ) system, the user may be given a choice of
which operating system to boot, with a default action to be taken in the event of no
response within some time frame.
• Once the kernel is found by the boot program, it is loaded into memory and then
control is transferred over to the OS. The kernel will normally continue the boot
process by initializing all important kernel data structures, launching important
system services ( e.g. network daemons, sched, init, etc. ), and finally providing
one or more login prompts. Boot options at this stage may include single-
user a.k.a. maintenance or safe modes, in which very few system services are
started - These modes are designed for system administrators to repair problems or
otherwise maintain the system.

Figure 5.21 - Booting from disk in Windows 2000.

5.9.3 Bad Blocks

• No disk can be manufactured to 100% perfection, and all physical objects wear out
over time. For these reasons all disks are shipped with a few bad blocks, and
additional blocks can be expected to go bad slowly over time. If a large number of
blocks go bad then the entire disk will need to be replaced, but a few here and there
can be handled through other means.
• In the old days, bad blocks had to be checked for manually. Formatting of the disk
or running certain disk-analysis tools would identify bad blocks, and attempt to read
the data off of them one last time through repeated tries. Then the bad blocks would
be mapped out and taken out of future service. Sometimes the data could be
recovered, and sometimes it was lost forever. ( Disk analysis tools could be either
destructive or non-destructive. )
• Modern disk controllers make much better use of the error-correcting codes, so that
bad blocks can be detected earlier and the data usually recovered. ( Recall that
blocks are tested with every write as well as with every read, so often errors can be
detected before the write operation is complete, and the data simply written to a
different sector instead. )
• Note that re-mapping of sectors from their normal linear progression can throw off
the disk scheduling optimization of the OS, especially if the replacement sector is
physically far away from the sector it is replacing. For this reason most disks
normally keep a few spare sectors on each cylinder, as well as at least one spare
cylinder. Whenever possible a bad sector will be mapped to another sector on the
same cylinder, or at least a cylinder as close as possible. Sector slipping may also
be performed, in which all sectors between the bad sector and the replacement
sector are moved down by one, so that the linear progression of sector numbers can
be maintained.
• If the data on a bad block cannot be recovered, then a hard error has occurred,
which requires replacing the file(s) from backups, or rebuilding them from scratch.

5.10 Swap-Space Management

• Modern systems typically swap out pages as needed, rather than swapping out entire
processes. Hence the swapping system is part of the virtual memory management system.
• Managing swap space is obviously an important task for modern OSes.

5.10.1 Swap-Space Use

• The amount of swap space needed by an OS varies greatly according to how it is
used. Some systems require an amount equal to physical RAM; some want a
multiple of that; some want an amount equal to the amount by which virtual
memory exceeds physical RAM; and some systems use little or none at all!
• Some systems support multiple swap spaces on separate disks in order to speed up
the virtual memory system.

5.10.2 Swap-Space Location

Swap space can be physically located in one of two locations:

• As a large file which is part of the regular filesystem. This is easy to
implement, but inefficient. Not only must the swap space be accessed
through the directory system, the file is also subject to fragmentation issues.
Caching the block location helps in finding the physical blocks, but that is
not a complete fix.
• As a raw partition, possibly on a separate or little-used disk. This allows the
OS more control over swap space management, which is usually faster and
more efficient. Fragmentation of swap space is generally not a big issue, as
the space is re-initialized every time the system is rebooted. The downside
of keeping swap space on a raw partition is that it can only be grown by
repartitioning the hard drive.

5.10.3 Swap-Space Management: An Example


• Historically OSes swapped out entire processes as needed. Modern systems swap
out only individual pages, and only as needed. ( For example process code blocks
and other blocks that have not been changed since they were originally loaded are
normally just freed from the virtual memory system rather than copying them to
swap space, because it is faster to go find them again in the filesystem and read
them back in from there than to write them out to swap space and then read them
back. )
• In the mapping system shown below for Linux systems, a map of swap space is
kept in memory, where each entry corresponds to a 4K block in the swap space.
Zeros indicate free slots and non-zeros refer to how many processes have a mapping
to that particular block ( >1 for shared pages only. )

Figure 5.22 - The data structures for swapping on Linux systems.
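• A minimal C sketch of this counting scheme ( an illustration of the idea only, not actual Linux source; the swap-area size is assumed ):

    #include <stdio.h>

    #define SWAP_SLOTS 1024                         /* assumed number of 4K slots  */

    static unsigned short swap_map[ SWAP_SLOTS ];   /* all zero ( free ) initially */

    /* find a free slot and claim it for one process; return -1 if swap is full */
    static int swap_alloc( void )
    {
        for ( int i = 0; i < SWAP_SLOTS; i++ )
            if ( swap_map[ i ] == 0 ) {
                swap_map[ i ] = 1;
                return i;
            }
        return -1;
    }

    /* drop one reference; the slot becomes free when the count reaches zero */
    static void swap_free( int slot )
    {
        if ( slot >= 0 && slot < SWAP_SLOTS && swap_map[ slot ] > 0 )
            swap_map[ slot ]--;
    }

    int main( void )
    {
        int s = swap_alloc();
        swap_map[ s ]++;        /* a second process sharing the same page */
        printf( "slot %d has %d mappers\n", s, swap_map[ s ] );
        swap_free( s );
        swap_free( s );
        printf( "slot %d is now %s\n", s, swap_map[ s ] == 0 ? "free" : "in use" );
        return 0;
    }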

5.11 RAID Structure

• The general idea behind RAID is to employ a group of hard drives together with some form
of duplication, either to increase reliability or to speed up operations, ( or sometimes both. )
• RAID originally stood for Redundant Array of Inexpensive Disks, and was designed to
use a bunch of cheap small disks in place of one or two larger more expensive ones. Today
RAID systems employ large possibly expensive disks as their components, switching the
definition to Independent disks.

5.11.1 Improvement of Reliability via Redundancy

• The more disks a system has, the greater the likelihood that one of them will go bad
at any given time. Hence increasing disks on a system actually decreases the Mean
Time To Failure, MTTF of the system.
• If, however, the same data was copied onto multiple disks, then the data would not
be lost unless both ( or all ) copies of the data were damaged simultaneously, which
is a MUCH lower probability than for a single disk going bad. More specifically,
the second disk would have to go bad before the first disk was repaired, which
brings the Mean Time To Repair into play. For example if two disks were
involved, each with a MTTF of 100,000 hours and a MTTR of 10 hours, then
the Mean Time to Data Loss would be 100,000^2 / ( 2 * 10 ) = 500 * 10^6 hours,
or about 57,000 years!
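• The same calculation in a few lines of C:

    #include <stdio.h>

    int main( void )
    {
        double mttf_hours = 100000.0;   /* mean time to failure of each disk */
        double mttr_hours = 10.0;       /* mean time to repair a failed disk */

        /* mean time to data loss for a mirrored pair ~= MTTF^2 / ( 2 * MTTR ) */
        double mttdl_hours = ( mttf_hours * mttf_hours ) / ( 2.0 * mttr_hours );

        printf( "MTTDL = %.0f hours ( about %.0f years )\n",
                mttdl_hours, mttdl_hours / ( 24.0 * 365.0 ) );
        return 0;
    }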
• This is the basic idea behind disk mirroring, in which a system contains identical
data on two or more disks.
o Note that a power failure during a write operation could cause both disks to
contain corrupt data, if both disks were writing simultaneously at the time
of the power failure. One solution is to write to the two disks in series, so
that they will not both become corrupted ( at least not in the same way ) by
a power failure. An alternate solution involves non-volatile RAM as a
write cache, which is not lost in the event of a power failure and which is
protected by error-correcting codes.

5.11.2 Improvement in Performance via Parallelism

• There is also a performance benefit to mirroring, particularly with respect to reads.
Since every block of data is duplicated on multiple disks, read operations can be
satisfied from any available copy, and multiple disks can be reading different data
blocks simultaneously in parallel. ( Writes could possibly be sped up as well
through careful scheduling algorithms, but it would be complicated in practice. )
• Another way of improving disk access time is with striping, which basically means
spreading data out across multiple disks that can be accessed simultaneously.
o With bit-level striping the bits of each byte are striped across multiple disks.
For example if 8 disks were involved, then each 8-bit byte would be read in
parallel by 8 heads on separate disks. A single disk read would access 8 *
512 bytes = 4K worth of data in the time normally required to read 512 bytes.
Similarly if 4 disks were involved, then two bits of each byte could be stored
on each disk, for 2K worth of disk access per read or write operation.
o Block-level striping spreads a filesystem across multiple disks on a block-
by-block basis, so if block N were located on disk 0, then block N + 1 would
be on disk 1, and so on. This is particularly useful when filesystems are
accessed in clusters of physical blocks. Other striping possibilities exist,
with block-level striping being the most common.
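• A small C sketch of block-level striping across D disks: logical block N lives on disk ( N mod D ) at block offset ( N / D ). ( The disk count here is an assumption for illustration. )

    #include <stdio.h>

    #define NUM_DISKS 4

    struct location { int disk; long offset; };

    static struct location stripe_map( long logical_block )
    {
        struct location loc;
        loc.disk   = (int) ( logical_block % NUM_DISKS );
        loc.offset = logical_block / NUM_DISKS;
        return loc;
    }

    int main( void )
    {
        for ( long n = 0; n < 8; n++ ) {
            struct location l = stripe_map( n );
            printf( "logical block %ld -> disk %d, block %ld\n", n, l.disk, l.offset );
        }
        return 0;
    }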

5.11.3 RAID Levels

• Mirroring provides reliability but is expensive; striping improves performance, but
does not improve reliability. Accordingly there are a number of different schemes
that combine the principles of mirroring and striping in different ways, in order to
balance reliability versus performance versus cost. These are described by
different RAID levels, as follows: ( In the diagram that follows, "C" indicates a
copy, and "P" indicates parity, i.e. checksum bits. )
1. Raid Level 0 - This level includes striping only, with no mirroring.
2. Raid Level 1 - This level includes mirroring only, no striping.
3. Raid Level 2 - This level stores error-correcting codes on additional disks,
allowing for any damaged data to be reconstructed by subtraction from the
remaining undamaged data. Note that this scheme requires only three extra
disks to protect 4 disks worth of data, as opposed to full mirroring. ( The
number of disks required is a function of the error-correcting algorithms,
and the means by which the particular bad bit(s) is(are) identified. )
4. Raid Level 3 - This level is similar to level 2, except that it takes advantage
of the fact that each disk is still doing its own error-detection, so that when
an error occurs, there is no question about which disk in the array has the
bad data. As a result a single parity bit is all that is needed to recover the
lost data from an array of disks. Level 3 also includes striping, which
improves performance. The downside with the parity approach is that every
disk must take part in every disk access, and the parity bits must be
constantly calculated and checked, reducing performance. Hardware-level
parity calculations and NVRAM cache can help with both of those issues.
In practice level 3 is greatly preferred over level 2.
5. Raid Level 4 - This level is similar to level 3, employing block-level striping
instead of bit-level striping. The benefits are that multiple blocks can be
read independently, and changes to a block only require writing two blocks
( data and parity ) rather than involving all disks. Note that new disks can
be added seamlessly to the system provided they are initialized to all zeros,
as this does not affect the parity results.
6. Raid Level 5 - This level is similar to level 4, except the parity blocks are
distributed over all disks, thereby more evenly balancing the load on the
system. For any given block on the disk(s), one of the disks will hold the
parity information for that block and the other N-1 disks will hold the data.
Note that the same disk cannot hold both data and parity for the same block,
as both would be lost in the event of a disk crash.
7. Raid Level 6 - This level extends raid level 5 by storing multiple bits of
error-recovery codes, ( such as the Reed-Solomon codes ), for each bit
position of data, rather than a single parity bit. In the example shown below
2 bits of ECC are stored for every 4 bits of data, allowing data recovery in
the face of up to two simultaneous disk failures. Note that this still involves
only a 50% increase in storage needs, as opposed to the 100% increase required by
simple mirroring, which could only tolerate a single disk failure.
Figure 5.23 - RAID levels.
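• A small C sketch of the single-parity scheme used by levels 3 through 5: the parity block is the XOR of the data blocks, so any one lost block can be rebuilt by XOR-ing the survivors. ( The block size and disk count here are arbitrary illustration values. )

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define DISKS 4        /* data disks               */
    #define BLK   8        /* tiny block size for demo */

    static void xor_into( uint8_t *dst, const uint8_t *src )
    {
        for ( int i = 0; i < BLK; i++ ) dst[ i ] ^= src[ i ];
    }

    int main( void )
    {
        uint8_t data[ DISKS ][ BLK ] = { "blk-0..", "blk-1..", "blk-2..", "blk-3.." };
        uint8_t parity[ BLK ] = { 0 }, rebuilt[ BLK ] = { 0 };

        for ( int d = 0; d < DISKS; d++ )       /* compute the parity block */
            xor_into( parity, data[ d ] );

        /* pretend disk 2 failed: rebuild its block from parity plus the others */
        memcpy( rebuilt, parity, BLK );
        for ( int d = 0; d < DISKS; d++ )
            if ( d != 2 ) xor_into( rebuilt, data[ d ] );

        printf( "rebuilt block matches the original: %s\n",
                memcmp( rebuilt, data[ 2 ], BLK ) == 0 ? "yes" : "no" );
        return 0;
    }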

• There are also two RAID levels which combine RAID levels 0 and 1 ( striping and
mirroring ) in different combinations, designed to provide both performance and
reliability at the expense of increased cost.
o RAID level 0 + 1 disks are first striped, and then the striped disks mirrored
to another set. This level generally provides better performance than RAID
level 5.
o RAID level 1 + 0 mirrors disks in pairs, and then stripes the mirrored pairs.
The storage capacity, performance, etc. are all the same, but there is an
advantage to this approach in the event of multiple disk failures, as
illustrated below:
▪ In diagram (a) below, the 8 disks have been divided into two sets of
four, each of which is striped, and then one stripe set is used to
mirror the other set.
▪ If a single disk fails, it wipes out the entire stripe set, but the
system can keep on functioning using the remaining set.
▪ However if a second disk from the other stripe set now fails,
then the entire system is lost, as a result of two disk failures.
▪ In diagram (b), the same 8 disks are divided into four sets of two,
each of which is mirrored, and then the file system is striped across
the four sets of mirrored disks.
▪ If a single disk fails, then that mirror set is reduced to a single
disk, but the system rolls on, and the other three mirror sets
continue mirroring.
▪ Now if a second disk fails, ( that is not the mirror of the
already failed disk ), then another one of the mirror sets is
reduced to a single disk, but the system can continue without
data loss.
▪ In fact the second arrangement could handle as many as four
simultaneously failed disks, as long as no two of them were
from the same mirror pair.

Figure 5.24 - RAID 0 + 1 and 1 + 0

5.11.4 Selecting a RAID Level

• Trade-offs in selecting the optimal RAID level for a particular application include
cost, volume of data, need for reliability, need for performance, and rebuild time,
the latter of which can affect the likelihood that a second disk will fail while the
first failed disk is being rebuilt.
• Other decisions include how many disks are involved in a RAID set and how many
disks to protect with a single parity bit. More disks in the set increases performance
but increases cost. Protecting more disks per parity bit saves cost, but increases the
likelihood that a second disk will fail before the first bad disk is repaired.

5.11.5 Extensions

• RAID concepts have been extended to tape drives ( e.g. striping tapes for faster
backups or parity checking tapes for reliability ), and for broadcasting of data.

5.11.6 Problems with RAID

• RAID protects against physical errors, but not against any number of bugs or other
errors that could write erroneous data.
• ZFS adds an extra level of protection by including data block checksums in all
inodes along with the pointers to the data blocks. If data are mirrored and one copy
has the correct checksum and the other does not, then the data with the bad
checksum will be replaced with a copy of the data with the good checksum. This
increases reliability greatly over RAID alone, at a cost of a performance hit that is
acceptable because ZFS is so fast to begin with.

Figure 5.25 - ZFS checksums all metadata and data.

• Another problem with traditional filesystems is that their sizes are fixed, and
relatively difficult to change. Where RAID sets are involved it becomes even harder
to adjust filesystem sizes, because a filesystem cannot span across multiple
volumes ( or RAID sets ).
• ZFS solves these problems by pooling RAID sets, and by dynamically allocating
space to filesystems as needed. Filesystem sizes can be limited by quotas, and space
can also be reserved to guarantee that a filesystem will be able to grow later, but
these parameters can be changed at any time by the filesystem's owner. Otherwise
filesystems grow and shrink dynamically as needed.

Figure 5.26 - (a) Traditional volumes and file systems. (b) a ZFS pool and file systems.
