Design and Implementation of The UVM Virtual Memory System in NetBSD
DISSERTATION ACCEPTANCE
This student’s dissertation, entitled Design and Implementation of the UVM Virtual
Memory System, has been examined by the undersigned committee of five faculty members
and has received full approval for acceptance in partial fulfillment of the requirements for
the degree Doctor of Science.
APPROVAL: Chairman
Short Title: Design and Implementation of UVM Cranor, D.Sc. 1998
WASHINGTON UNIVERSITY
SEVER INSTITUTE OF TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE
Doctor of Science
August, 1998
Saint Louis, Missouri
WASHINGTON UNIVERSITY
SEVER INSTITUTE OF TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE
ABSTRACT
by Charles D. Cranor
August, 1998
Saint Louis, Missouri
We introduce UVM, a new virtual memory subsystem for 4.4BSD that makes better
use of existing hardware memory management features to reduce overhead and improve
performance. Our novel approach focuses on allowing processes to pass memory to and
from other processes and the kernel, and to share memory. This approach reduces or elim-
inates the need to copy data thus reducing the time spent within the kernel and freeing
up cycles for application processing. Unlike the approaches that focus exclusively on the
networking and inter-process communications (IPC) subsystems, our approach provides a
general framework for solutions that can improve efficiency of the entire I/O subsystem.
Our primary objective in creating UVM was to produce a virtual memory system
that provides a Unix-like operating system kernel’s I/O and IPC subsystems with efficient
VM-based data movement facilities that have less overhead than a traditional data copy.
Our work seeks to:
- allow a process to safely let a shared copy-on-write copy of its memory be used either
by other processes, the I/O system, or the IPC system;
- allow pages of memory from the I/O system, the IPC system, or from other processes
to be inserted easily into a process' address space; and
- allow processes and the kernel to exchange large chunks of their virtual address
spaces using the VM system's higher-level memory mapping data structures.
UVM allows processes to exchange and share memory through three innovative new
mechanisms: page loanout, page transfer, and map entry passing. We present test results
that show that our new VM-based data movement mechanisms are more efficient than data
copying.
UVM is implemented entirely within the framework of BSD and thus maintains all
the features and standard parts of the traditional Unix environment that programmers have
come to expect. The first release of UVM in NetBSD runs on several platforms including
I386-PC, DEC Alpha, Sun Sparc, Motorola m68k, and DEC VAX systems. It is already
being used on systems around the world.
to my family, especially Lorrie
Contents
Acknowledgments xv
1 Introduction 1
2 Background 7
2.1 The Role of the VM System . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 The VM System and Process Life Cycle . . . . . . . . . . . . . . . . . . . 10
2.2.1 Process Start-up . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Running Processes and Page Faults . . . . . . . . . . . . . . . . . 13
2.2.3 Forking and Exiting . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.4 VM Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 The Evolution of the VM System In BSD . . . . . . . . . . . . . . . . . . 17
2.4 Design Overview of the BSD VM System . . . . . . . . . . . . . . . . . . 19
2.4.1 Layering in the BSD VM System . . . . . . . . . . . . . . . . . . 19
2.4.2 The Machine-Dependent Layer . . . . . . . . . . . . . . . . . . . 19
2.4.3 The Machine-Independent Layer . . . . . . . . . . . . . . . . . . . 21
2.4.4 Copy-on-write and Object Chaining . . . . . . . . . . . . . . . . . 27
2.4.5 Page Fault Handling . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4.6 Memory Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1 New Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.2 Improved Features . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 High-Level Design Overview . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.1 UVM’s Machine-Independent Layer . . . . . . . . . . . . . . . . . 43
3.3.2 Data Structure Locking . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.3 VM Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.4 UVM Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.5 Anonymous Memory Structures . . . . . . . . . . . . . . . . . . . 50
3.3.6 Pager Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.7 Pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4.1 Other Virtual Memory Systems . . . . . . . . . . . . . . . . . . . 57
3.4.2 I/O and IPC Subsystems That Use VM Features . . . . . . . . . . . 69
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2.11 The Make Put Cluster Operation . . . . . . . . . . . . . . . . . . . 126
6.2.12 The Share Protect Operation . . . . . . . . . . . . . . . . . . . . . 127
6.2.13 The Async I/O Done Operation . . . . . . . . . . . . . . . . . . . 127
6.2.14 The Release Page Operation . . . . . . . . . . . . . . . . . . . . . 128
6.3 BSD VM Pager vs. UVM Pager . . . . . . . . . . . . . . . . . . . . . . . 129
6.3.1 Data Structure Layout . . . . . . . . . . . . . . . . . . . . . . . . 129
6.3.2 Page Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.3.3 Other Minor Differences . . . . . . . . . . . . . . . . . . . . . . . 132
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
9.1.1 Phase 1: Study of BSD VM . . . . . . . . . . . . . . . . . . . . . 194
9.1.2 Phase 2: UVM Under BSD VM . . . . . . . . . . . . . . . . . . . 194
9.1.3 Phase 3: Users Under UVM . . . . . . . . . . . . . . . . . . . . . 195
9.1.4 Phase 4: Kernel Under UVM . . . . . . . . . . . . . . . . . . . . . 196
9.1.5 Source Tree Management . . . . . . . . . . . . . . . . . . . . . . 196
9.2 UVMHIST: UVM Function Call History . . . . . . . . . . . . . . . . . . . 197
9.2.1 UVMHIST As an Educational Tool . . . . . . . . . . . . . . . . . 198
9.2.2 UVMHIST and Redundant VM Calls . . . . . . . . . . . . . . . . 199
9.2.3 UVMHIST and UVM Design Adjustments . . . . . . . . . . . . . 201
9.3 UVM and the Kernel Debugger . . . . . . . . . . . . . . . . . . . . . . . . 202
9.4 UVM User-Level Applications . . . . . . . . . . . . . . . . . . . . . . . . 202
9.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
11 Conclusions and Future Research 232
11.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
11.2 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
List of Tables
List of Figures
8.1 The BSD VM call paths for the four mapping system calls . . . . . . . . . 167
8.2 UVM’s mapping call path . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
9.2 A screen dump of the xuvmstat program . . . . . . . . . . . . . . . . . 205
Acknowledgments
Designing, implementing, and testing a software system as large and complex as UVM was
quite a challenge. I would never have been able to complete such a task without support
from my colleagues, friends, and family. Thus, I am pleased to have the opportunity to
thank them in my dissertation.
I would first like to thank my advisor Dr. Gurudatta Parulkar for encouraging me to
attend Washington University and for providing funding throughout my years in St. Louis
(even during the “hard times”). Having an advisor who is accessible, easy to communicate
with, and supportive certainly has made my life easier.
I would also like to thank the rest of my committee for taking the time to learn
about my work and providing me with valuable feedback on how it could be improved. In
addition to providing feedback on my research, my committee has also been a source of
inspiration. Dr. George Varghese’s constant enthusiasm for research and my work helped
me keep my spirits up throughout this project. Dr. Douglas Schmidt’s boundless energy
and views on how to sell systems research helped me improve the presentation of my work
by teaching me to look at it from new points of view. Dr. Ron Cytron’s natural curiosity
and willingness to examine research outside of his primary area of interest showed me the
value of being open minded. Finally, Dr. Ronald Minnich taught me the value of patience
and perseverance when working on large software projects.
Several BSD developers helped in the latter stages of this project. I would like to
thank Matthew Green for helping to write and debug UVM’s swap management code. Matt
also coordinated the integration of UVM into the NetBSD source tree. Thanks to Chuck
Silvers for writing the aobj pager for me and helping with debugging. Finally, thanks to
Jason Thorpe for helping to port UVM to new platforms.
There are a number of people who contributed indirectly to the success of the UVM
project. I would have never gone to graduate school if it was not for my involvement
in undergraduate research at University of Delaware. I credit Professor David Farber for
allowing me to join his research group “F-Troup.” My positive experiences as a member of
F-Troup convinced me that I wanted to continue my education beyond my undergraduate
degree. I would like to thank all my friends from University of Delaware.
Washington University’s Computer Science Department, Computer and Communi-
cations Research Center (CCRC), and Applied Research Lab (ARL) have been a wonderful
environment in which to work. The three key ingredients to this environment are the faculty,
staff, and students. The faculty has contributed to this environment by obtaining research
funding for graduate students, teaching interesting courses, and mentoring their students. I
would particularly like to thank Dr. Turner, Dr. Cox, Dr. Franklin, and Dr. Chamberlain for
their roles in forming and maintaining the CCRC and ARL research labs.
The CS, CCRC, and ARL office staff have helped me maintain my sense of humor
while I have made my way through the administrative red tape that comes with a university
bureaucracy. I would like to thank Jean, Sharon, Myrna, Peggy, Paula, Diana, Vykky, and
the rest of the staff for their assistance during my years at Washington University.
I would like to thank my fellow students for their friendship during my stay in
St. Louis. First, I would like to thank the members of my research group. Thanks to
Milind Buddhikot, Zubin Dittia, R. Gopalakrishnan (Gopal), and Christos Papadopoulos.
Milind kept me entertained through his tales of woe and wonder and also by leaving his
keys and wallet at random locations throughout CCRC (despite my best efforts to break
him of this habit). Zubin taught me the value of “great fun” by being willing to drop all his
work at any time on short notice so that we could go watch a movie. Gopal shared both
his wit and his technique for giving technical presentations (the “freight-train approach”).
Christos went from spending a year car-less to becoming our research group’s top auto
mechanic, and I thank him for all his service. I would also like to thank Anshul Kantawala
for coining the often-used phrase “you fool!” and Hari Adiseshu for letting me resolve
some of his technical doubts. Thanks to the rest of the group (James, Tony, Sanjay, Jim,
Vikas, Deepak, Larry, Hossam, Marcel, Dan, Jane, Woody, Sherlia, Daphne, Yuhang, and
Ed) for their support.
Thanks to Sarang Joshi and Gaurav Garg for helping me get through my first year
in St. Louis. Thanks to Greg Peterson for helping me to arrange the “phool” movie nights.
Thanks to Brad Noble for the great conversations, keeping Rob in control, and introducing
Elvis to the back hallway. Thanks to Giuseppe Bianchi (Geppo) for pulling off the best
practical jokes and not making me a target. Thanks to Rex Hill for introducing me to my
wife and for helping Zubin move the APIC a little closer to reality. Thanks to Shree for
teaching me about the “junta.” And thanks to Maurizio Casoni, Girish Chandranmenon,
Apostolos Dailianas, Andy Fingerhut, Rob Jackson, Steve von Worley, and Ellen Zegura
for making the back hall a more interesting place.
Thanks to Ken Wong for letting me help manage the CCRC systems and for being
easy and fun to work with. Thanks to John deHart and Fred Kuhns for helping with the
ARL equipment that I used in my research.
I would like to thank AT&T-Research for allowing me to use their summer stu-
dent space during the winter. In particular, I would like to thank Paul Resnick and Brian
LaMacchia for allowing me to tag along with their group when I was not in the best of
moods (having had to move to New Jersey without completing UVM). I would also like
to briefly thank the following people for their support: Kalpana Parulkar, Nikita Parulkar,
Betsy Cytron, Jessica Cytron, Melanie Cytron, Maya Gokhale, Austin Minnich, Amanda
Minnich, Luke Beegle, Chris Handley, Pam Perkins, Erik Perkins, Scott Hassan, Penny
Noble, Theo de Raadt, AGES, Joe Edwards, Dr. Hurley, Elvis Presley, H.P. Lovecraft, and
the authors of the BSD games in /usr/games.
Finally, I would like to thank my family to whom this dissertation is dedicated. My
wife Lorrie has been a constant source of support throughout the development of UVM.
Her feedback on both the written and oral presentation of UVM helped me present complex
ideas in a way that is clear, concise, and readable. I would like to thank my parents, Morris
and Connie Cranor, for raising and supporting me. I would not be where I am today without
their support and I am proud to honor them with my accomplishments. I would like to thank
Lorrie’s parents, Drs. Michael and Judy Ackerman, for accepting me into their family and
for their encouragement and understanding during my career as a doctoral student. Finally,
I would like to thank the rest of my family, including (but not limited to) my sister Ginny,
my brother-in-law Jeremy, and all eight of my grandparents for their support.
Charles D. Cranor
Washington University in Saint Louis
August 1998
Chapter 1
Introduction
New computer applications in areas such as multimedia, imaging, and distributed comput-
ing demand high levels of performance from the entire computer system. For example,
multimedia applications often require bandwidth on the order of gigabits per second. Al-
though hardware speeds have improved, traditional operating systems have had a hard time
keeping up [32, 50]. One major problem with traditional operating systems is that they
require unnecessary data copies. Data copies are expensive for a number of reasons. First,
the bandwidth of main memory is limited, and each copy of the data consumes part of that limited bandwidth.
Second, to copy data, the CPU must move it word-by-word from the source buffer to the
destination. While the CPU is busy moving data it is unavailable to the user application.
Finally, data copying often flushes useful information out of the cache. Since the CPU
accesses main memory through the cache, the process of copying data fills the cache with the
data being copied — displacing potentially useful data that was resident in the cache before
the copy.
Applications that transmit data from disk storage over a network are an example of
a class of applications that suffer from unnecessary data copying. Applications in this class
include video, file, and web servers. Figure 1.1 shows the path the data in this class of
application takes in a traditional system. First the server issues a read request. To satisfy
this request, the kernel first has the disk device store the data into a kernel buffer in main
memory. To complete the read request the kernel copies the data from the first kernel buffer
into the buffer the user specified. Note that this data copy traverses the MMU and cache.
Once the data has been read, the server application then immediately writes the data to
the network socket (without even accessing the data). This causes the kernel to copy the
data from the user buffer into another kernel buffer that is associated with the networking
system. The kernel can then program the network interface to read the data directly from
the second kernel buffer and then transmit it. This operation requires two processor data
copies; however, if the operating system used only one buffer instead of three, then no
processor data copies would be required to read and transmit the data.

[Figure 1.1: The data path of a disk-to-network server on a traditional system. The disk
deposits data into a first kernel buffer (kern buf1) in main memory; the CPU copies it
through the MMU and cache into the user buffer (user buf) and then into a second kernel
buffer (kern buf2), from which the network interface transmits it over the bus.]
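The two processor copies correspond to the read and write calls in the server's transfer
loop. A minimal sketch of such a loop, using only the standard Unix I/O interface (error
handling abbreviated):

    #include <unistd.h>

    /* Copy a file descriptor to a socket the traditional way: each read()
       copies from a kernel buffer into buf, and each write() copies from
       buf into a kernel network buffer -- two processor copies per block. */
    void serve(int disk_fd, int sock_fd) {
        char buf[8192];
        ssize_t n;
        while ((n = read(disk_fd, buf, sizeof(buf))) > 0)
            write(sock_fd, buf, (size_t)n);
    }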
A variety of efforts have sought to improve system performance by increasing the
efficiency of data flow through the operating system. Most of these efforts have been cen-
tered on adapting the I/O and networking subsystems of the operating system to reduce data
copying and protocol overhead [4, 24, 27, 53]. These approaches tend to involve superficial
modifications of the virtual memory subsystem that do not involve changing the underlying
design of the system. This is understandable because the virtual memory subsystem is typi-
cally much larger and more complex than the networking subsystem. However, substantial
gains can be made by changing the design of the virtual memory subsystem to exploit the
ability of the hardware to map a physical address to one or more virtual addresses.
This research seeks to modify a traditional Unix-like operating system’s virtual
memory subsystem to allow applications to better take advantage of the features of the
memory management hardware of the computer. Our novel approach focuses on allowing
processes to pass memory to and from other processes and the kernel, and to share mem-
ory. The memory passed or shared may be a block of pages, or it may be a chunk of virtual
memory space that may not be entirely mapped. The memory may or may not be pageable.
This approach reduces or eliminates the need to copy data thus reducing the time spent
within the kernel and freeing up cycles for application processing. Unlike the approaches
that focus exclusively on the networking and inter-process communications (IPC) subsys-
tems, our approach provides a general framework for solutions that can improve efficiency
of the entire I/O subsystem.
Modifying the kernel’s virtual memory system is more of a challenge than working
with other parts of the kernel (e.g. writing a device driver) for several reasons. First, the
virtual memory system’s data structures are shared between the kernel and the processes
running on the system and thus must be properly protected for concurrent access with lock-
ing. Second, the VM system must be able to handle both synchronous and asynchronous
I/O operations. Managing asynchronous operations is more difficult than synchronous op-
erations. Third, errors in the VM system can be harder to diagnose than in other kernel
subsystems because the kernel itself depends on the VM system to manage its memory
(thus if an error occurs it may not be possible to get the kernel into a debuggable state) and
also because the VM system is typically in the process of handling many service requests
at once. Fourth, the BSD VM system is poorly documented when compared to other kernel
subsystems (for example, the networking stack has several books describing its operation
while the BSD VM system is only documented in a few book chapters and Mach papers).
In short, the virtual memory system is both large and complicated and has vital components
that all other parts of the kernel depend on to function properly.
We selected NetBSD [54] as the platform for this work. NetBSD is a descendant
of the well known 4.4BSD-lite operating system from University of California, Berkeley
[39]. NetBSD is a multi-platform operating system that runs on systems such as i386 PCs,
DEC Alphas, and Sun Sparc systems. It is currently being used by numerous institutions
and individuals for advanced research as well as recreational computing. NetBSD is an
ideal platform for university research because the source code is freely available on the In-
ternet and well documented in books and papers, and the TCP/IP stack is the standard BSD
TCP/IP that is used throughout the world. NetBSD is also closely related to other free BSD
operating systems such as FreeBSD [25] and OpenBSD [55], thus allowing applications to
easily migrate between these systems. When we began work on this project the NetBSD
virtual memory subsystem was largely unchanged from its roots in 4.4BSD.
The primary purpose of a virtual memory system is to manage the virtual address
space of both the kernel and user processes [18, 19]. A virtual address space is defined by
a map structure. There is a map for each process (and the kernel) on the system, consisting
of entries that describe the locations of memory objects mapped into the virtual address
space. In addition, the virtual memory system manages the mappings of individual virtual
memory pages to the corresponding physical page that belongs to each mapped memory
object. The virtual memory system is responsible for managing these mappings so that the
data integrity of the operating system is maintained.
In this dissertation we introduce UVM¹, a new virtual memory system based on the
4.4BSD VM system. UVM allows processes to exchange and share memory through three
innovative new mechanisms: page loanout, page transfer, and map entry passing.
- In page loanout, a process can loan its memory to other processes or to non-VM
kernel subsystems. This is achieved with a reference counter and special handling
of kernel memory mappings. Special care is taken to handle events such as write
faults and pageouts that could break the loan. Applications that transfer data over a
network can take advantage of this feature to loan data pages from the user’s address
space to the kernel’s networking subsystem. In our tests, single-page loanouts to
the networking subsystem took 26% less time than copying data. Tests involving
multi-page loanouts show that page loaning can reduce the processing time further;
for example, a 256-page loanout took 78% less time than copying data.
- In page transfer, a process can receive pages of memory from other processes or the
kernel. Once a page has been transferred into a process' address space it is treated
like any other page of a process' memory. Page transfer is achieved without the need
for complex memory object operations that occur in BSD VM. Our tests show that
single-page transfers do not show an improvement over data copying, but a two-page
transfer shows a 22% reduction in processing time. The processing time decreases as
larger chunks of pages are transferred; for example, an eight-page transfer reduces
processing time by 57%.
- In map entry passing, a process exports a range of its address space to other processes
or the kernel by allowing a chunk of its map entry structures to be copied. This map
entry copying can be done in such a way that the exporting process can retain, share,
or surrender control of the mappings. Map entry passing can be used to quickly
transfer memory between processes in ways that were not possible with the BSD
VM system. Single-page map entry passing tests do not show any improvement
over data copying; however, two-page map entry passing shows a 30% reduction in
processing time. Processing time decreases further when larger blocks of data are
passed; for example, an eight-page transfer of memory decreases processing time by
69%.

¹Following Larry Wall [70] we provide no definitive explanation for what UVM stands for.
Expanding this acronym is left as an exercise for the reader.
UVM also includes well-documented changes to the BSD VM system that improve
the accessibility of the code and the overall performance and stability of the operating
system:
- The old anonymous memory handling system that used "shadow" and "copy object
chains" is completely replaced by a simpler two-layer system. Anonymous memory
is memory that is freed as soon as it is no longer referenced.
- The new anonymous memory pageout clustering code can improve anonymous memory
pageout by a factor of six.
- The new pager interface gives the pager more control over its pages and eliminates
the need for awkward "fictitious" device pages.
- The new memory object handling code eliminates the need for the redundant "object
cache" and allows vnode-based objects to persist as long as their backing vnodes are
valid.
- Kernel map fragmentation has been reduced by revising the handling of wired memory.
- The code to handle contiguous and non-contiguous memory has been merged into a
single interface, thus reducing the complexity of the source code.
- Redundant virtual memory calls have been eliminated in the fork-exec-exit path, thus
reducing the processing time of a fork operation. For example, the processing time
of a fork operation on a small process has been reduced by 13% under UVM.
- A new system call allows unprivileged processes to get the virtual memory system
status. UVM is distributed with an X window program that uses this system call to
display virtual memory system status graphically.
- The new i386 pmap eliminates "ptdi" panics and deadlock conditions that plagued
the old pmap in some cases.
UVM has been designed to completely replace the BSD VM system in NetBSD. It
was incorporated into the master NetBSD source tree in February of 1998 and has proven
stable in NetBSD's Alpha, ARM32, Atari, HP300, I386, MAC68k, MVME68k, NewsMips,
PC532, PMAX, Sparc, and VAX ports. It will appear in NetBSD release 1.4.
The rest of this dissertation is organized as follows: In Chapter 2 we present back-
ground on the VM subsystem. In Chapter 3 we detail the overall UVM design and describe
related work. We describe anonymous memory handling, our new page fault handler, our
new pager interface, and the page/map loanout interface in Chapters 4, 5, 6, and 7 respec-
tively. We describe secondary design elements in Chapter 8. In Chapter 9 we introduce our
implementation methods. Finally we present our results in Chapter 10 and our conclusions
and suggestions for future work in Chapter 11. Definitions of key terms can be found in
the glossary.
Chapter 2
Background
This chapter describes the traditional role of the virtual memory subsystem. It explains the
history and evolution of the BSD VM and presents an overview of the current BSD VM
system as background for understanding the UVM system.
Allocation of each process’ virtual address space. This is done by keeping a list of all
allocated regions in each virtual address space and allocating free regions as needed
by the process.
Mapping physical pages into virtual address space. This is done by programming
the computer’s MMU to map physical pages into one or more virtual address spaces
with the appropriate protection.
Handling invalid memory accesses (i.e. page faults). This is done through the VM
system’s page fault handler. Page faults can happen when unmapped memory is
referenced, or when memory is referenced in a way that is inconsistent with the
current protection on the page.
Moving data between physical memory and backing store. This is done by making
requests to the I/O system to read and write data into physical pages of memory.
Managing memory shortages on the system. This is done by freeing unused and
inactive pages of physical memory (saving to backing store if necessary).
Duplicating the address space of a process during a fork. This can include changing
protection of a virtual memory region to allow for “copy-on-write.” In copy-on-write
the VM subsystem write-protects the pages so that they will be duplicated when
modified [9].
[Figure 2.1: The role of the VM system. On one side are the operations that make calls
into the VM system (memory-related system calls, process management, and page faults);
on the other are the operations called by the VM system (the vnode system and device
d_mmap functions).]
The role of the VM system is shown in Figure 2.1. As shown on the left side of the
figure, the services of the VM system are requested in three ways. First, system calls such
as mmap and mprotect may be called by a process to change the configuration of its
virtual memory address space. Second, virtual memory process management services are
called by the kernel when a process is in the midst of a fork, exec, or exit operation.
Finally, the VM system is called when a page fault occurs to resolve the fault either by
providing memory for the process to use or by signaling an error with a “segmentation
violation.”
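As a concrete illustration of the first of these, a user process might manipulate its address
space with mmap and mprotect as in the following minimal user-level sketch (the file
name is arbitrary; error handling abbreviated):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/sbin/init", O_RDONLY);   /* any readable file */
        if (fd < 0) return 1;

        /* add a private (copy-on-write) read-only mapping of the first page */
        char *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) return 1;

        char first = p[0];   /* first touch causes a page fault; the VM
                                system faults the page in from the file */
        (void)first;

        mprotect(p, 4096, PROT_NONE);   /* further access would now fault
                                           with a segmentation violation */
        munmap(p, 4096);
        close(fd);
        return 0;
    }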
In the course of its operation the VM system may need to request the services of
other parts of the kernel in the form of an I/O request. The VM system makes two types of
I/O requests, as show in the right half of Figure 2.1. For devices such as frame buffers that
allow their device memory to be memory mapped (with the mmap system call), the device’s
d mmap function is called to convert an offset into the device into a physical address that
can be mapped in. For normal files the vnode system is called to perform I/O between
the VM system’s buffers and the underlying file. As shown in the figure, the underlying
file may reside on a local filesystem or it may reside on a remote computer (and is thus
accessed via the network file system). Note that because it is possible for these I/O systems to
make requests back into the VM system, care must be taken to avoid potential deadlock.
text: the compiled code for the init program. The text starts off with a special header
data structure¹ that indicates the size of each segment of the program and the starting
address of the first machine instruction of the program.
data: the pre-initialized variable data for the program. For example, if init declared a
global variable “x” like this: “int x = 55;” then the value 55 would be stored in
the data area of the /sbin/init file.
symbols: the symbols stored in the program are used for debugging and are often stripped
out of system programs such as init. They are ignored by the kernel.
bss: the area where global data variables that have not been assigned a value are stored.
Also, if the program does dynamic memory allocation with malloc the bss area is often
expanded to provide memory for malloc to manage³. This expanded area is called the
"heap." The stack area
is where the program’s stack, including local variables, is stored.
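For instance, a program's heap can grow either by extending the bss with the traditional
break mechanism or, in newer allocators, by mapping anonymous zero-fill memory. A
minimal user-level sketch of both, using the standard BSD interfaces (MAP_ANON is the
BSD flag for anonymous memory):

    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        /* classic heap growth: extend the bss/heap by one page */
        void *old_break = sbrk(4096);

        /* newer style: map a page of zero-fill anonymous memory instead */
        void *anon = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_ANON | MAP_PRIVATE, -1, 0);

        if (old_break == (void *)-1 || anon == MAP_FAILED)
            return 1;
        munmap(anon, 4096);
        return 0;
    }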
When the kernel prepares the init process for execution it must use the VM system
to establish the virtual memory mappings of the text, data, bss, and stack areas. This is part
of the “exec” process of a process’ life cycle. In an exec system call, the kernel opens the
file to be run and reads the header to determine the size of the text, data, and bss areas. It
then uses the VM system to establish virtual memory mappings for these areas, as shown in
Figure 2.2. Each of the four mappings has three main attributes as shown in Table 2.1: the
backing object, the copy-on-write status, and the protection. The backing object indicates
what object is mapped into the space. This is usually a file or “zero-fill” memory which
means that the area should be filled with zeroed memory as it is accessed. The copy-
on-write status indicates whether changes made to the memory should be reflected to the
backing object (if it is not zero-fill). If part of a file is memory mapped without copy-
on-write (“shared mappings”) then changes to that portion of memory will cause the file
itself to be changed. For zero-fill memory, the copy-on-write flag doesn’t matter. Finally,
the protection is usually one of read-only or read-write, but it can also be set to “none.”
The MMU is used to map pages in the allocated ranges of virtual address space to their
corresponding physical address.
When the kernel does an exec call on /sbin/init the following mappings are set
up:
text: The text memory area is a read-only mapping of the text part of the /sbin/init
file. The mapping is marked copy-on-write in case the program is run under a debug-
ger (in which case the debugger may change the protection of the text to read-write
to insert and remove break-points). Under normal operation a process never writes
to the text area of memory.

³Some newer versions of malloc allocate memory using mmap rather than expanding the bss area.

[Figure 2.2: The init program's virtual memory layout. The stack and bss/heap mappings
are zero-fill, while the text and data mappings are taken from the /sbin/init file. The
MMU maps addresses in the virtual address space to their corresponding physical address.]
data: The data memory area is a read-write copy-on-write mapping of the data part of the
/sbin/init program. This means that the initial values of the data area are loaded
from the file, but as the data is changed the changes are private to the process. Thus,
if other users run the program, they will get the correct initial values in their data
segment.
bss: The bss memory area is mapped as a zero-fill area after the data memory area.
stack: The stack memory area is mapped as a zero-fill area at the end of the user’s address
space.
Note that there is a very large gap of free virtual address space between the end of the bss
area and the beginning of the stack that can be used by the process or the operating system
to map in other files and data as needed. For programs that use shared libraries this area is
used by the dynamic linker (“ld.so”) at run time to mmap in the shared library files. In
the case of init we assume it is statically linked, and thus does not need to map in any extra
files.
1. The process accesses a memory address that is not mapped (or improperly mapped)
into its virtual address space. The processor's memory management unit generates
a page fault trap. At this point the process is frozen until the page fault is resolved.
2. The processor-specific part of the kernel catches the page fault and uses the MMU
to determine the virtual address that triggered the page fault. It also determines the
access type of the fault by checking the hardware to see if the process was reading
from or writing to the faulting address.
3. The VM system’s page fault routine is called with information on the faulting process
and the address and access type of the fault.
4. The VM system looks in the process’ mappings to see what data should be mapped
at the faulting address (for example, the fault address could be mapped to the first
page of text of the /sbin/init program).
5. If the fault routine finds that the process’ address space doesn’t allow access to the
faulting address, then it returns an error code, which under Unix-like operating sys-
tems causes the process to get a “segmentation violation” error.
6. On the other hand, if the fault routine finds that the memory access was valid, it loads
the data needed by the process into a page of physical memory and then has the MMU
map that page of memory into the process’ address space, as shown in Figure 2.2.
This is referred to as faulting in data. The fault routine then returns “success” and the
process is unfrozen and can resume running with the needed memory mapped into
its address space.
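The sequence above can be summarized in a C-flavored sketch. All of the helper names
here are invented for exposition; only pmap_enter echoes the real pmap interface, and its
exact signature varies between BSD releases:

    /* An illustrative sketch of fault resolution (steps 3 through 6 above);
       not the actual BSD VM fault handler. */
    int
    vm_fault_sketch(struct vm_map *map, vm_offset_t va, vm_prot_t access)
    {
        /* step 4: find the mapping that covers the faulting address */
        struct vm_map_entry *entry = map_entry_lookup(map, va);

        /* step 5: no mapping, or insufficient protection -> SIGSEGV */
        if (entry == NULL || (entry->protection & access) != access)
            return -1;

        /* step 6: bring the needed data into a physical page... */
        struct vm_page *pg = find_or_fetch_page(entry, va);

        /* ...and program the MMU to map it into the address space */
        pmap_enter(map->pmap, va, pg->phys_addr, access, 0 /* not wired */);
        return 0;    /* the process resumes at the faulting instruction */
    }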
The init program starts by faulting in the first page of text from the /sbin/init
file. It then continues running, faulting in more pages of text, data, bss, and stack. As
shown in Figure 2.3, regions mapped copy-on-write require special handling by the fault
routine to ensure that the underlying executable file’s memory is not modified by a write
to a copy-on-write mapping of it. Copy-on-write pages start off in the "pages un-written"
state. If a process has a read-fault on a copy-on-write page, then that page is fetched from
the backing file and mapped in read-only (thus protecting the object from being changed).
If the process has a write-fault on the page (either because it was not mapped at all, or
because it was mapped read-only due to a previous read-fault), then a copy of the backing
object’s page is made. The new copy of the page is then mapped in read-write, replacing
any previous read-only mappings of the data. At that point the page is in the “written”
state. Once the page is written, all future read and write faults cause the page to be mapped
in read-write. Such faults can be caused by the pagedaemon paging out the “written” page.
Pages in the written state stay in that state until they are freed or they are part of a copy-on-
write fork operation (described below).
Figure 2.3: Handling copy-on-write pages when read and write faults occur
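In C-flavored pseudocode, the state machine of Figure 2.3 looks roughly like the
following; every name here is invented for exposition and is not the BSD VM API:

    /* An illustrative sketch of the copy-on-write states in Figure 2.3. */
    struct vm_page *
    cow_fault_sketch(struct cow_mapping *m, vm_offset_t offset, int is_write)
    {
        struct vm_page *pg = shadow_lookup(m->shadow, offset);
        if (pg != NULL)                   /* "written" state:               */
            return map_read_write(pg);    /* always map the copy read-write */

        pg = backing_fetch(m->backing_object, offset);
        if (!is_write)                    /* read fault, "pages un-written": */
            return map_read_only(pg);     /* protect the backing object      */

        /* write fault: copy the backing object's page into the shadow
           object and map the copy read-write, replacing any previous
           read-only mappings of the data ("written" state) */
        struct vm_page *copy = page_copy(pg);
        shadow_insert(m->shadow, offset, copy);
        return map_read_write(copy);
    }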
The fork system call takes a process and creates a child process that has a copy of the parent process'
virtual address space. The VM system handles the duplication of the parent’s address space.
If the VM system actually copied every page in the parent’s address space to a new page for
the child process, a fork operation would be very expensive (because it takes a long time
to move that much data). The VM system avoids this by using the copy-on-write feature
again. For example, when the init program forks, all the pages that are copied to the child
process are put into the “pages un-written” state, as shown in Figure 2.3.
Typically, after doing a fork operation, the child process will do another exec oper-
ation on a different file (for example, /bin/sh). In preparing for the exec operation, the
kernel will first ask the VM system to unmap all active regions in the process doing the
exec. This will cause the process to return to a completely unmapped state. Then the kernel
will map the program into the address space, as previously described above.
Finally, when the process exits, the VM system removes all mappings from the
process’ address space. Handling all the cases that can occur with copy-on-write and shared
memory, and recovering from error conditions add to the complexity of the VM system.
2.2.4 VM Operations
Other VM operations that the process may perform while running include the following
system calls:
break: changes the size of a process’ heap area.
mprotect: changes the current protection of a mapping. Each mapped region has a
“maximum allowed protection” value controlled by the kernel that keeps a process
from gaining unauthorized access to a block of memory.
minherit: changes the inheritance of a mapped region. The inheritance attribute is used
during a fork to control the child process' access to the parent's memory. It can be
either “copy,” “share,” or “none.”
msync: flushes modified memory back to backing store. This is used when a file is
mapped directly into a process’ address space and the process wants to ensure that
the changes it has made to the file’s memory are saved to disk.
madvise: changes the current usage pattern of the memory region. The advice is a hint
from a process to the VM system as to what access pattern will be used to access the
memory. Access patterns include: normal (the default), random, sequential, “will
need,” and “won’t need.”
mlock: locks data into physical memory. This ensures timely access to memory by forc-
ing the data to stay resident.
swapctl: configures the swap area. The system manager of a system uses swapctl to
add and remove swap space from the system.
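Several of these operations can be exercised from an ordinary program. The following
user-level sketch uses only standard interfaces (the file name is arbitrary; error handling
abbreviated):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDWR);   /* hypothetical file */
        if (fd < 0) return 1;
        size_t len = 8192;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) return 1;

        madvise(p, len, MADV_SEQUENTIAL);  /* hint: sequential access pattern */
        mlock(p, len);                     /* wire the pages into physical memory */
        p[0] = 1;                          /* modify the mapped file */
        msync(p, len, MS_SYNC);            /* flush the change back to the file */
        munlock(p, len);
        mprotect(p, len, PROT_READ);       /* drop write permission */

        munmap(p, len);
        close(fd);
        return 0;
    }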
In normal operation the majority of the VM system’s time is spent resolving page
faults and adding, removing, and copying mappings in process’ virtual address space. As
the system runs it is likely that more and more of the pages of physical memory will be
allocated for use by the operating system. When the number of free pages drops below
a threshold, then the VM system causes the “pagedaemon” to run. The pagedaemon is a
special system process that looks for pages of memory that are allocated but haven’t been
used for a while and adds them back to the free list. If a page contains modified data, it has
to be flushed out before it can be freed. The pagedaemon manages this flushing process.
2.3 The Evolution of the VM System In BSD
The VM system has changed quite a bit throughout the evolution of the BSD operating
system. This evolution is shown in Figure 2.4. Early BSD VM started with a VAX VM
system [36]. This VM system was tightly bound to the memory management features
provided by the DEC VAX processor. Thus, when porting BSD to non-VAX processors,
it was important to make the processor’s VM hardware emulate the VAX as closely as
possible. For example, SunOS 3.x, which ran on Motorola processors, did this. Making a
non-VAX system emulate a VAX is quite cumbersome and inefficient. In addition to being
non-portable, the BSD VAX VM did not support memory mapped files, thus complicating
the implementation of shared libraries. So, Sun threw out the early BSD VAX VM system
and designed a new one from scratch. This VM system appeared in SunOS 4 and its
descendant is currently used in Solaris [28, 46]. In the meantime, a new operating system
project called Mach was started at Carnegie-Mellon University (CMU) [72, 57, 71, 64].
The virtual memory system of Mach had a clean separation between its machine-dependent
and machine-independent features. This separation is important because it allows a virtual
memory system to easily support new architectures. The Mach 2.5 VM system was ported
to BSD, thus creating the BSD VM system [39, 69]. This VM system appeared in 4.3BSD
Net2. Eventually the CMU Mach project came to a close and 4.4BSD-Lite was released
in source code form. A number of operating system projects are based on the 4.4BSD-
Lite code, and thus use the BSD VM system. As time has gone on the BSD VM system
has diverged into three branches: the BSDI branch [33], the FreeBSD branch [25], and
the NetBSD/OpenBSD branch [54, 55]. The BSDI branch and the FreeBSD branch have
stuck with the basic BSD VM concepts imported from Mach, however they have worked to
improve the performance of the BSD VM in several areas. The NetBSD/OpenBSD branch
of the VM system is not as much changed from the original 4.4BSD VM code as the other
branches.
UVM’s roots lie partly in the BSD VM system from the NetBSD/OpenBSD and the
FreeBSD branches, and also with the SunOS4 VM system. UVM’s basic structure is based
on the NetBSD/OpenBSD branch of the BSD VM system. UVM’s new i386 machine-
dependent layer includes several ideas from the FreeBSD branch of BSD VM. UVM’s new
anonymous memory system is based on the anonymous memory system found in SunOS4.
UVM has recently been added to NetBSD and will eventually replace the BSD VM. There
are also plans to add UVM to OpenBSD as well.
[Figure 2.4: The evolution of the VM system in BSD. The early BSD VAX VM leads to
SunOS4 (and later Solaris2) on one branch and, via Mach 2.5, to the BSD VM of the
4.4BSD family on another; the timeline runs from roughly 1990 through 1995, ending
with UVM in 1998.]
pmap_protect: changes the protection of all mappings in a specified range of a pmap.
The pmap_protect function is called during a copy-on-write operation to write
protect copy-on-write memory.
pmap_page_protect: changes the protection of all mappings of a single page in every
pmap that references it. The pmap_page_protect function is called before a
pageout operation to ensure that all pmap references to a page are removed.
pmap_is_referenced, pmap_is_modified: test the referenced and modified at-
tributes for a page. These attributes are calculated over all mappings of a page. These
functions are called by the pagedaemon when looking for pages to free.
pmap_clear_reference, pmap_clear_modify: clear the reference and modify at-
tributes on all mappings of a page. These functions are called by the pagedaemon to
help it identify pages that are no longer in demand.
pmap_copy: copies mappings from one pmap to another. The pmap_copy function is
called during a fork operation to give the child process an initial set of low-level
mappings.
Note that pmap operations such as pmap_page_protect may require the pmap module
to keep a list of all pmaps that reference a page.
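This part of the interface corresponds roughly to the following C declarations, written in
the 4.4BSD style. Treat the exact types and argument lists as assumptions; they vary
between BSD releases:

    /* Approximate 4.4BSD/Mach-style pmap declarations (a sketch only). */
    void      pmap_protect(pmap_t pmap, vm_offset_t sva, vm_offset_t eva,
                           vm_prot_t prot);        /* protect a range          */
    void      pmap_page_protect(vm_offset_t pa,
                                vm_prot_t prot);   /* all mappings of one page */
    boolean_t pmap_is_referenced(vm_offset_t pa);  /* referenced attribute     */
    boolean_t pmap_is_modified(vm_offset_t pa);    /* modified attribute       */
    void      pmap_clear_reference(vm_offset_t pa);
    void      pmap_clear_modify(vm_offset_t pa);
    void      pmap_copy(pmap_t dst, pmap_t src, vm_offset_t dst_addr,
                        vm_size_t len, vm_offset_t src_addr);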
vmspace: describes a virtual address space of a process. The vmspace structure con-
tains pointers to a process' vm_map and pmap structures, and contains statistics on the
process' memory usage.
vm_map: describes the virtual address space of a process or the kernel. It contains a list of
valid mappings in the virtual address space and those mappings' attributes.
vm_object: describes a file, a zero-fill memory area, or a device that can be mapped into
a virtual address space. The vm_object contains a list of vm_page structures that
contain data from that object.
vm_pager: describes how backing store can be accessed. Each vm_object on the sys-
tem has a vm_pager structure. This structure contains a pointer to a list of func-
tions used by the object to fetch and store pages between the memory pointed to by
vm_page structures and backing store.
vm_page: describes a page of physical memory⁴. When the system is booted a vm_page
structure is allocated for each page of physical memory that can be used by the VM
system.

⁴On a few systems that have small hardware page sizes, such as the VAX, the VM system has one
vm_page structure manage two or more hardware pages. This is all handled at the pmap layer and thus is
transparent to the machine-independent VM code.
Figure 2.5: The five main machine-independent data structures: vmspace, vm_map,
vm_page, vm_object, vm_pager. The triangles represent vm_objects, and the dots
within them represent vm_pages. A vm_object can contain any number of pages. Note
that the text and data areas of a file are different parts of a single object.
As shown in Figure 2.5, vm_map structures map VM objects into an address space. VM
objects store their data in VM pages. Data in VM pages is copied to and from backing
store by VM pagers. Note that each vm_map structure has an associated pmap structure to
contain the lower level mapping information for the virtual address space mapped by the
vm_map. The vm_map and the pmap together are referred to by a vmspace structure (not
shown).
In order to find which vm_page should be mapped in at a virtual address (for ex-
ample, during a page fault), the VM system must look in the vm_map for the mapping of
the virtual address. It then must check the backing vm_object for the needed page. If the
page is resident (often due to having been recently accessed by some other process), then
the search is finished. However, if the page is not resident, then the VM system must issue
a request to the vm_object's vm_pager to fetch the data from backing store.

[Figure 2.6: The vm_map structure. Fields: a pointer to the pmap (pmap), a pointer to the
list of map entries (header), the size of the mapped region (size), a reference count
(refcnt), and the starting and ending virtual addresses (min, max).]
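The resident-page check just described can be sketched in C; the helper names are
invented for exposition and are not the BSD VM API:

    /* Illustrative sketch of the page lookup described above. */
    struct vm_page *
    object_find_page(struct vm_object *obj, vm_offset_t offset)
    {
        /* a global hash table maps (object, offset) to a vm_page */
        struct vm_page *pg = page_hash_lookup(obj, offset);
        if (pg != NULL)
            return pg;                       /* resident: search finished */

        /* not resident: ask the object's pager to fetch the data from
           backing store into a newly allocated page (an I/O request) */
        return pager_getpage(obj->pager, obj, offset);
    }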
We will now examine each of the machine-independent data structures in more de-
tail.
VM Maps
A VM map (vm_map) structure maps memory objects into regions of a virtual address
space. Each VM map structure on the system contains a sorted doubly-linked list of “map
entry” structures. Each map entry structure contains a record of a mapping in the VM map’s
virtual address space. For example, in Figure 2.5 the map for the init process would have
four map entry structures: text, data, bss, and stack. The kernel and each process on the
system have their own VM map structures to handle the allocations of their virtual address
space.
The vm_map data structure is shown in Figure 2.6. This structure contains the start-
ing and ending virtual addresses of the managed region of virtual memory, a reference
count, and the size of the region. It also contains two pointers: one to a linked list of map
entries that describe the valid areas of virtual memory in the map, and one to the machine-
dependent pmap data structure that contains the lower-level mapping information for that
map. The map entry structure is shown in Figure 2.7. This structure contains pointers to
the next and previous map entry in the map entry list. It also contains a starting and ending
virtual address, a pointer to the backing object (if any), and attributes such as protection
and inheritance.
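Taken together, Figures 2.6 and 2.7 suggest C declarations along the following lines. This
is a sketch only: field names follow the figures, and the real BSD declarations carry
additional fields and locks:

    /* Rough sketch of the BSD VM map structures (after Figures 2.6 and 2.7). */
    struct vm_map_entry {
        struct vm_map_entry *prev, *next;  /* sorted, doubly-linked entry list */
        vm_offset_t          start, end;   /* virtual address range mapped     */
        void                *object;       /* backing vm_object (or vm_map)    */
        int                  is_a_map;     /* backing object is a share map    */
        int                  is_sub_map;   /* backing object is a submap       */
        vm_prot_t            protection;   /* current protection               */
        int                  inheritance;  /* fork-time inheritance attribute  */
    };

    struct vm_map {
        struct pmap         *pmap;         /* machine-dependent mappings   */
        struct vm_map_entry  header;       /* head of the map entry list   */
        vm_size_t            size;         /* total size of mapped regions */
        int                  refcnt;       /* reference count              */
        vm_offset_t          min, max;     /* start/end virtual addresses  */
    };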
A map entry usually points to the VM object it is mapping. However, in certain
cases a map entry can also point to another map. There are two types of maps that can be
pointed to by a map entry: submaps and share maps.
[Figure 2.7: The vm_map_entry structure. Fields: previous and next pointers
(prev/next), starting and ending virtual addresses (start, end), a backing object
pointer (object, a vm_object or vm_map), flags indicating whether the backing object
is a map or a submap (map, submap), and attributes such as protection and inheritance.]
- Submaps can only be used by the kernel. The main purpose of submaps is to break
up the kernel virtual address space into smaller units. The lock on the main kernel
map locks the entire kernel virtual address space except for regions that are mapped
by submaps. Those regions (and their maps) are locked by the lock on the submap.
- A share map allows two or more processes to share a range of virtual address space
and thus all the mappings within it. When one process changes the mappings in a
share map, all other processes accessing the share map see the change.
The kernel cannot use malloc to allocate map entries for its own maps because malloc
itself may need to allocate map entries. Thus, the kernel allocates its VM map entries
from a private pool of map entry structures rather than with malloc.
VM Objects
A VM map contains a list of map entries that define allocated regions in a map’s address
space. Each map entry points to a memory object that is mapped into that region. Objects
may have more than one map entry pointing to them. In the BSD VM all memory objects
are defined by the VM object structure. The VM object structure is shown in Figure 2.8. A
VM object contains a list of all pages of memory that contain data belonging to that object
(resident pages). Pages in a VM object are identified by their offset in the object. Pages are
often added to VM objects in response to a page fault. There is also a global linked list of
all VM objects allocated by the system. The VM object contains a reference counter and a
flag to indicate if data is currently being transferred between one of the VM object's pages
and backing store (this is called “paging”). The VM object contains a shadow object and
a copy object pointer. These pointers are used for copy-on-write operations, described in
detail in Section 2.4.4. The VM object contains a pointer to a VM pager structure. The
pager is used to read and write data from backing store. Finally the VM object contains
pointers for the object cache.
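The fields just listed suggest a C declaration along these lines; the field names are
approximations of the 4.4BSD declarations, not the exact source:

    /* Rough sketch of the BSD VM object structure as described above. */
    struct vm_object {
        struct vm_page   *page_list;          /* resident pages of this object */
        struct vm_object *object_list;        /* global list of all objects    */
        int               ref_count;          /* references from map entries   */
        int               paging_in_progress; /* pages moving to/from store    */
        struct vm_object *shadow;             /* shadow object (copy-on-write) */
        struct vm_object *copy;               /* copy object (copy-on-write)   */
        struct vm_pager  *pager;              /* how to reach backing store    */
        struct vm_object *cached_list;        /* object cache linkage          */
    };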
The object cache is a list of VM objects that are currently not referenced (i.e. their
reference count is zero). Objects can be removed from the cache by having their reference
count increased. Objects in the object cache are said to be “persisting.” The goal of having
persisting objects is to allow memory objects that are often reused to be reclaimed rather
than reallocated and read in from backing store. For example, the program /bin/ls is
run frequently, but only for a short time. Allowing the VM object that represents this file
to persist in memory saves the VM system the trouble of having to read the /bin/ls file
from disk every time someone runs it.

[Figure 2.9: The vm_pager structure. Fields: pointers for the global list of all pagers on
the system (list), an external handle (handle), the pager type (type), a pointer to the
pager operations (ops), and a pointer to private pager data (data).]
VM Pagers
VM objects read and write data from backing store into a page of memory with a pager,
which is defined by a VM pager structure. Each VM object that accesses backing store has
its own VM pager data structure. There are three types of VM pagers: device, swap, and
vnode. A device pager is used for device files that allow their memory to be mmap’d (e.g.
files in /dev). A swap pager is used for anonymous memory objects that are paged out to
the swap area of the disk. A vnode pager is used for normal files that are memory mapped.
The kernel maintains a linked list of all the VM pager structures on the system.
The VM pager structure is shown in Figure 2.9. The structure contains pointers to
maintain the global linked list of pager structures. It contains a “handle” that is used as
an identification tag for the pager. It contains a type field that indicates which of the three
types of pagers the pager is. The pager also has a private data pointer that can be used by
pager-specific code to store pager-specific data. Finally, the pager structure has a pointer to
a set of pager operations. Supported operations include: allocating a new pager structure,
freeing a pager structure, reading pages in from backing store, and saving pages back to
backing store.
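The pager operations table can be sketched as a structure of function pointers along the
following lines; the names are illustrative approximations, and the real BSD pager
operations differ in detail:

    /* Rough sketch of a pager-operations table (after the list above). */
    struct pagerops {
        struct vm_pager *(*pgo_alloc)(void *handle,      /* new pager     */
                                      vm_size_t size);
        void (*pgo_dealloc)(struct vm_pager *pager);     /* free a pager  */
        int  (*pgo_getpage)(struct vm_pager *pager,      /* read a page   */
                            struct vm_page *pg, int sync);  /* from store */
        int  (*pgo_putpage)(struct vm_pager *pager,      /* save a page   */
                            struct vm_page *pg, int sync);  /* to store   */
    };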
VM Pages
The physical memory of a system is divided into pages. The size of a page is fixed by the
memory management hardware of the computer. The VM system manages these hardware
pages with the VM page structure. On most systems there is a VM page structure for every
page of memory that is available for the VM system to use.
[Figure 2.10: The vm_page structure. Fields: page list pointers (pageq), hash table
pointers (hashq), object list pointers (listq), a pointer to the owning object (object),
the offset in that object (offset), flags such as busy and clean (flags), and the physical
address of the page (phys_addr).]
The VM page structure is shown in Figure 2.10. A page structure can independently
appear on three different lists of pages. These lists are:
pageq: active, inactive, and free pages. Active pages are currently in use. Inactive pages
are allocated and contain valid data but are currently not being used (they are being
saved for possible reuse). Free pages are not being used at all and contain no valid
data.
hashq: used to attach a page to a global hash table that maps a VM object and an offset to
a VM page structure. This allows a page to be quickly looked up by its object and
offset.
listq: a list of pages that belong to a VM object. This allows an object to easily access
pages that it owns.
Each page structure contains a pointer to the object that owns it, and its offset in that object.
Each page structure also contains the physical address of the page to which it refers. Finally,
there are a number of flags that are used by the VM system to store the state of the page.
Such flags include “busy,” “clean,” and “wanted.”
The VM system provides functions to allocate and free pages, to add and remove
pages from the hash table, and to zero or copy pages.
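Following Figure 2.10, the page structure can be sketched in C as below. The real
4.4BSD declaration uses queue macros for the three list linkages; simple pointers stand
in for them here:

    /* Rough sketch of the BSD vm_page structure (after Figure 2.10). */
    struct vm_page {
        struct vm_page   *pageq_next;   /* active/inactive/free page queues */
        struct vm_page   *hashq_next;   /* (object, offset) hash chain      */
        struct vm_page   *listq_next;   /* list of pages in owning object   */
        struct vm_object *object;       /* object this page belongs to      */
        vm_offset_t       offset;       /* offset of the page in the object */
        int               flags;        /* busy, clean, wanted, ...         */
        vm_offset_t       phys_addr;    /* physical address of the page     */
    };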
In a private copy-on-write mapping, changes made by the process with the mapping
are private to that process, but changes made to the underlying object that are not
shadowed by a page in the shadow object are seen by the mapping process. This is
the standard form of copy-on-write that most Unix-like operating systems use, and it
is the behavior specified for the mmap system call by the Unix standards [30].
Private Copy-on-write
Consider a file “test” that has just been mapped private copy-on-write. The map entry
in the VM map structure that maps test will point to the VM object that corresponds to
test, but it will have both the "copy-on-write" and "needs copy" attributes set. The copy-
on-write attribute in a map entry indicates that the mapped object is mapped copy-on-write.
The needs copy attribute indicates that a shadow object to hold changed pages will need to
be allocated. When the process first writes to the mapped area, a shadow object containing
the changed page will be allocated and inserted between the map entry and the underlying
test object. This is shown in Figure 2.11.
Now consider what happens if the process mapping test forks off a child process.
In that case the child will want its own copy of the copy-on-write region. In the BSD VM
this causes another layer of copying to be invoked. The original shadow object is treated
like a backing object, and thus both the parent’s and the child’s mapping enter the “needs
copy” state again, as shown in Figure 2.12(a). Now when either process attempts to write
to the memory mapped by the shadow object, the VM system will catch it and insert a
new shadow object between the map entry and the original shadow object and clear "needs
copy." This is referred to as "shadow object chaining." After both processes perform writes,
the objects will be as shown in Figure 2.12(b).

[Figure 2.11: The copy-on-write mapping of a file, before and after a write fault. Before
the fault, needs_copy is set; after the fault, a shadow object holding the changed page has
been inserted in front of the ./test object and needs_copy is clear (copy_on_write
remains set). Note that the pages in an object mapped copy-on-write cannot be changed.]
Note that in Figure 2.12(b) the parent process has modified the center page in the
first shadow object and thus it has its own copy of it in its most recent shadow object (on
the left). The child process has not modified the center page, and thus if it were to read
that page it would read the version stored in the original shadow object (which is shared by
both the parent and child processes). Now consider what would happen if the child process
were to exit. This is shown in Figure 2.13. Note that the center page appears in both
remaining shadow objects. However, the parent process only needs access to the copy of
the page in the most recent shadow object. The center page in the other shadow object is
inaccessible and not needed. There is no longer a need for two shadow objects; they should
be collapsed together and the inaccessible memory freed. However, the BSD VM has no
way to realize this, and thus the inaccessible memory resources in the other shadow object
remain allocated even though they are not in use.
This is referred to as the “object collapse problem.” 4.4BSD VM did not properly
collapse all objects, and thus inaccessible memory could get stranded in chains of shadow
objects. Eventually the inaccessible memory would get paged out to the swap area so the
physical page could be reused. However, the VM system would eventually run out of swap
space and the system would deadlock. Thus this problem was also referred to as the “swap
memory leak” bug because the only way to free this memory is to cause all processes
referencing the shadow object chain to exit.
The object collapse problem has been partly addressed in most versions of BSD.
However, the object collapse code used to address the problem is rather complicated and has
been the source of many hours of debugging. UVM handles copy-on-write in a completely different
way that avoids the object collapse problem.
Copy Copy-on-write
Consider two processes mapping the file test, as shown in Figure 2.14. Process “A” has
a shared mapping of test, and thus when process A changes its memory the changes get
reflected back to the backing file. Process “B” has a private copy-on-write mapping of
test. Note that process B has only written to the center page of the object. If process
A writes to the left-hand page of test before process B does, then process B will see the
changes made to that page by A up to that point. However, once the page gets added to B’s
shadow object then B will no longer see changes made by A.
Now consider a setup in which process A has a shared mapping of test and process
B has a copy copy-on-write mapping as shown in Figure 2.15. Before process A changes
the left page, it must create a copy object so that process B can access the original version
of the left page. When B changes the middle page it creates a shadow object. The shadow
object now shadows the copy object rather than test’s object. A list of objects formed by
the copy object pointer is called a “copy object chain.”
Copy objects are needed to support the non-standard copy copy-on-write semantics.
Copy objects add another layer of complexity to the VM code and make the object collapse
problem more difficult to address. Maintaining copy copy-on-write semantics is also more
expensive than maintaining private copy-on-write semantics: there are more kernel data
structures to allocate and track, and there is an extra data copy of a page that the private
mapping semantics does not require. Processes with shared mappings of files must have their mappings write
protected whenever a copy object is made so that write accesses to the object can be caught
by a page fault. Because of this, several branches of the BSD family have removed copy
objects and copy copy-on-write mappings from the kernel.
When a process references a virtual address, one of two things happens:
The referenced address has a valid mapping and the data is accessed from the
mapped physical page.
The referenced address has an invalid mapping and the MMU causes a page fault
to occur. The low level machine-dependent code catches the page fault and calls the
VM page fault routine. If the page fault routine cannot handle the fault then it returns
a failure code that causes the program to receive a “segmentation violation.”
The page fault handler is called with the VM map of the process that caused the fault,
the virtual address of the fault, and whether the faulting process was reading or writing
memory when the fault occurred. A simplified version of the fault routine’s operation is as
follows (a code sketch appears after the list):
1. The list of entries in the VM map is searched for the map entry whose virtual address
range the faulting address falls in. If there is no such map entry, then the fault routine
returns an error code and the process gets a segmentation violation.
2. Once the proper map entry is found, then the fault routine starts at the VM object it
points to and searches for the needed page. If the page is not found in the first object,
the fault routine traverses the shadow object chain pointers until it either finds the
page or runs out of objects to try. If the fault routine runs out of objects to try it may
either cause the fault to fail or allocate a zero fill page (depending on the mapping).
The fault routine leaves a busy page in the top level object to prevent other processes
from changing its object while it is processing the fault.
3. Once the proper page is found, the object’s VM pager must retrieve it from backing
store if it is not resident. Then the fault routine checks to see if this is a write-fault
and if the object that owns that page has a copy object. If it does, then a copy of the
page is made and the old version of the page is placed in the object pointed to by the
copy object chain pointer.
4. If the fault routine had a VM pager do I/O to get the needed page, then the pager
unlocked all the data structures before starting the I/O. In this case the fault routine
must reverify the lookup in the VM map to ensure that the process’ mappings have
not been changed since the start of the fault operation.
5. Next the fault routine asks the pmap layer to map the page being faulted. The fault
routine must determine the appropriate protection to use depending on the copy-on-
write status and the protection of the map entry.
6. Finally, the fault routine returns a success code so that the process can resume run-
ning.
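The sequence above can be summarized in C-style pseudocode. All of the helper
names below are invented for clarity; the real 4.4BSD vm_fault() is far more
involved (locking, clustering, and error handling are omitted here).

    int
    vm_fault_sketch(vm_map_t map, vm_offset_t va, boolean_t is_write)
    {
        vm_map_entry_t entry;
        vm_object_t    obj;
        vm_page_t      pg = NULL;

        /* Step 1: find the map entry covering the faulting address. */
        if (!lookup_entry(map, va, &entry))
            return (KERN_INVALID_ADDRESS);   /* segmentation violation */

        /* Step 2: search the first object, then its shadow chain. */
        for (obj = entry->object; obj != NULL; obj = obj->shadow) {
            if ((pg = lookup_page(obj, obj_offset(entry, obj, va))) != NULL)
                break;                       /* a busy page is left in the top object */
        }
        if (pg == NULL) {
            if (!zero_fill_ok(entry))
                return (KERN_FAILURE);
            pg = alloc_zero_fill(entry->object);
        }

        /* Step 3: page in if needed; push old data to any copy object. */
        if (!page_resident(pg))
            pager_get(obj, pg);              /* drops all locks around I/O... */
        if (is_write && obj->copy != NULL)
            push_page_to_copy_object(obj, pg);

        /* Step 4: ...so reverify the map lookup if I/O occurred. */
        /* Step 5: enter the mapping with copy-on-write aware protection. */
        pmap_enter(map->pmap, va, page_phys_addr(pg),
            fault_protection(entry, is_write), FALSE);

        /* Step 6: success; the faulting process can resume. */
        return (KERN_SUCCESS);
    }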
2.4.6 Memory Sharing
We have discussed the various ways memory can be copied in the BSD VM system. Memory
sharing is much simpler than memory copying and does not require object chains. Memory
can also be shared in a number of ways in the BSD VM system, including:
Memory objects such as files can be mapped into multiple virtual address spaces by
being referenced by multiple VM map entry structures. This allows those memory
objects to be shared among processes. For example, there is only one /bin/ls
memory object and its text and data pages are shared by all processes on the system.
Objects that are copied with copy-on-write use shared read-only pages to defer the
copy until the pages are actually written.
2.5 Summary
In this chapter we have reviewed the overall architecture and operation of the BSD VM
system. We introduced the major data structures and functions, and explained in detail the
copy-on-write operation. In the next chapter we will give an overview of the UVM system
and explain how it differs from the BSD VM system.
Chapter 3
3.1 Goals
Our primary objective in creating UVM is to produce a virtual memory system that provides
a Unix-like operating system kernel’s I/O and IPC subsystems with efficient VM-based data
movement facilities that have less overhead than a traditional data copy. While traditional
VM research often focuses on working set size and page replacement policy, our research is
focused on efficient VM-based data movement. Unlike many other VM research projects,
our work has been implemented as part of an operating system that is in widespread use.
Thus, we have designed our new virtual memory features so that their presence in the
kernel does not disrupt other kernel subsystems. This allows experimental changes to be
introduced into the I/O and IPC subsystem gradually, thus easing the adoption of these
features. Our work centers around five major goals:
Allow a process to safely let a shared copy-on-write copy of its memory be used
either by other processes, the I/O system, or the IPC system. The mechanism used
to do this should allow copied memory to come from a memory mapped file, anonymous
memory, or a combination of the two. It should provide copied memory either as wired
pages for the kernel’s I/O or IPC subsystems, or as pageable anonymous memory for trans-
fer to another process. It should gracefully preserve copy-on-write in the presence of page
faults, pageouts, and memory flushes. Finally, it should operate in such a way that it pro-
vides access to memory at page-level granularity without fragmenting or disrupting the VM
system’s higher-level memory mapping data structures. Section 7.1 describes how UVM
meets this goal through the page loanout mechanism.
Allow pages of memory from the I/O system, the IPC system, or from other
processes to be inserted easily into a process’ address space. Once the pages are in-
serted into the process they should become anonymous memory. Such anonymous mem-
ory should be indistinguishable from anonymous memory allocated by traditional means.
The mechanism used to do this should be able to handle pages that have been copied from
another process’ address space using the previous mechanism (page loanout). Also, if the
operating system is allowed to choose the virtual address where the inserted pages are
placed, then it should be able to insert them without fragmenting or disrupting the VM
system’s higher-level memory mapping data structures. Section 7.2 describes how UVM
meets this goal through the page transfer mechanism.
Allow processes and the kernel to exchange large chunks of their virtual ad-
dress spaces using the VM system’s higher-level memory mapping data structures.
Such a mechanism should be able to copy, move, or share any range of a virtual address
space. This can be a problem for some VM systems because it introduces the possibility
of allowing a copy-on-write area of memory to become shared with another process. The
per-page cost for this mechanism should be minimized. Section 7.3 describes how UVM
meets this goal through the map entry passing mechanism.
Optimize the parts of the VM system that affect the performance and complex-
ity of our new VM operations. We wish to take advantage of lessons we have learned
from observing the Mach-based BSD VM system (and other VM systems); we hope to
make use of BSD VM’s positive aspects while avoiding its pitfalls. UVM addresses four
major pitfalls of the BSD VM as follows:
UVM eliminates object chaining and replaces it with a simple two-level memory
scheme, thus removing the inefficiencies and code complexity associated with main-
taining and traversing shadow object chains of arbitrary length.
UVM eliminates swap memory leaks associated with partially unmapping anony-
mous memory objects by providing efficient per-page reference counters that are
invoked when memory is partially unmapped.
UVM eliminates unnecessary map entry fragmentation associated with the wiring
and unwiring of memory.
UVM introduces a new i386 pmap module that takes advantage of all the VM hard-
ware features of the i386 (e.g. single page TLB flush, global TLB entries) and re-
duces the probability of deadlock and system crashes.
vmspace: describes a virtual address space of a process. The vmspace structure con-
tains pointers to a process’ vm map and pmap structures, and contains statistics on
the process’ memory usage.
1. UVM data structures are named “vm_” if they are either new or minor modifications of BSD
VM data structures. They are named “uvm_” if there is a corresponding “vm_” data structure
in BSD VM that is completely different from the uvm one. Once UVM no longer has to co-exist
in the source tree with BSD VM, the data structures will be renamed so that they are more uniform.
vm map: describes the virtual address space of a process or the kernel. It contains a list
of vm map entry structures that describe valid mappings and their attributes. The
vm map plays the same role in both BSD VM and UVM.
uvm object: forms the lower layer of UVM’s two-layer mapping scheme. A UVM
object describes a file, a zero-fill memory area, or a device that can be mapped into a
virtual address space. The uvm object contains a list of vm page structures that
contain data from that object. Unlike VM objects, UVM objects are not chained for
copy-on-write.
vm amap: forms the upper layer of UVM’s two-layer mapping scheme. A vm amap de-
scribes an area of anonymous memory. The area may have “holes” in it that allow
references through to the underlying object layer.
vm anon: describes a single virtual page of anonymous memory. The page’s data may
reside in a vm page structure, or it may be paged out to backing store (i.e. the swap
area).
vm aref: a small data structure that points to a vm amap and an offset in it. The aref
structure is part of the map entry structure that is linked to the vm map structure.
uvm pagerops: a set of functions pointed to by a uvm object that describe how to
access backing store. In the BSD VM there is one vm pager structure per memory
object that accesses backing store, and each vm pager structure points to a set of
pager operations. In UVM the vm pager layer has been removed and uvm object
structures now point directly to a uvm pagerops structure.
vm page: describes a page of physical memory. When the system is booted a vm page
structure is allocated for each page of physical memory that can be used by the VM
system.
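As a rough guide to how these structures fit together, the following sketch shows
plausible (simplified) C declarations for the anonymous-memory layer. The field
names approximate UVM’s, but this is an illustration, not the actual source.

    struct vm_aref {
        int              ar_pageoff;  /* offset into the amap, in pages */
        struct vm_amap  *ar_amap;     /* amap for this entry, or NULL */
    };

    struct vm_amap {
        int              am_ref;      /* reference count */
        int              am_nslot;    /* number of page slots */
        struct vm_anon **am_anon;     /* per-slot anons; NULL = a "hole"
                                         that falls through to the object */
    };

    struct vm_anon {
        int              an_ref;      /* >1 means copy-on-write shared */
        struct vm_page  *an_page;     /* resident page, or NULL */
        int              an_swslot;   /* swap location if paged out */
    };

    struct uvm_object {
        struct uvm_pagerops *pgops;   /* direct pointer to pager ops */
        struct pglist        memq;    /* pages owned by this object */
        int                  uo_refs; /* reference count */
    };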
Figure 3.1 illustrates the general layout of the UVM data structures. The kernel
and each process on the system have a vm map structure and a machine-dependent pmap
structure that define a vmspace. The vm map structure contains a list of map entries that
define mapped areas in the virtual address space. Each map entry structure defines two
levels of memory. The map entry has a vm aref structure that points to the vm amap
structure that defines the top-level memory for that entry. The vm amap structure contains
a list of vm anon structures that define the anonymous memory in the top layer. The
vm anon structure identifies the location of its memory with a pointer to a vm page and
a location on the swap area where paged out data may reside. The map entry defines the
lower-level memory with a pointer to a uvm object. The uvm object structure
contains a list of vm page structures that belong to it, and a pointer to a uvm pagerops
structure that describes how the object can communicate with backing store.
The vm page structures are kept in two additional sets of data structures: the page
queues and the object-offset hash table. UVM has three page queues: the active queue, the
inactive queue, and the free queue. Pages that contain valid data and are likely to be in use
by a process or the kernel are kept on the active queue. Pages that contain valid data but
are currently not being used are kept on the inactive queue. Since inactive pages contain
valid data, it is possible to “reclaim” them from the inactive queue without the need to read
them from backing store if the data is needed again. Pages on the free queue do not contain
valid data and are available for allocation. If physical memory becomes scarce, the VM
system will wake the pagedaemon and it will force active pages to become inactive and
free inactive pages.
The object-offset hash table maps a page’s uvm object pointer and its offset in
that object to a pointer to the page’s vm page structure. This hash table allows pages in a
uvm object to be looked up quickly.
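A lookup in this hash table might look roughly like the following sketch; the hash
function, table, and names here are placeholders rather than UVM’s actual code.

    #define PG_HASH_SIZE 1024
    struct vm_page *pg_hashtable[PG_HASH_SIZE];

    struct vm_page *
    page_lookup_sketch(struct uvm_object *uobj, vm_offset_t off)
    {
        struct vm_page *pg;
        int bucket = page_hash(uobj, off) % PG_HASH_SIZE;

        for (pg = pg_hashtable[bucket]; pg != NULL; pg = pg->hash_next)
            if (pg->uobject == uobj && pg->offset == off)
                return (pg);    /* resident page found */
        return (NULL);          /* page not resident in this object */
    }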
UVM’s simplified machine-independent layering makes supporting new UVM fea-
tures such as page loanout and page transfer easier. For example, UVM’s two-level amap-
object mapping scheme simplifies the process of locating a page of memory compared to
BSD VM’s object chaining mechanism. UVM’s anon-based anonymous memory system
allows virtual pages to be exchanged between processes without the need for object chain
manipulations.
Figure 3.1: UVM data structures at a glance. Note that there is one vm aref data structure
within each map entry structure. The page queues and object-offset hash table are not
shown.
With a sleep lock, a process that cannot acquire the lock is put to sleep and removed from the
processor until the lock can be acquired (thus allowing other processes to run). With a spin
lock, the process continues to attempt to acquire the lock until it gets it.
The BSD VM system is based on the Mach VM system, which supports fine-grain
locking of its data structures for multiprocessor systems. When Mach VM was ported to
BSD, the fine-grain locking support was ignored because BSD was being used on single
processor systems. Over time, as changes were made to BSD, the locking support decayed.
While UVM currently is only being used on single processor systems, BSD is in the process
of being ported to multiprocessor systems. Thus, as part of the UVM project, the fine-
grain locking code was restored for future use. However, adding amaps and removing
shadow/copy object chaining from the VM system fundamentally changes the way data
structures are used, and thus the fine-grain locking scheme for UVM had to be redesigned
to take these changes into account.
The main challenge with data structure locking is to choose a set of locks and a
locking scheme that avoids both data corruption and deadlock. The fact that the BSD VM
system’s locking code is unused, incomplete, and poorly documented made repairing the
damage and addressing the new data structures in UVM difficult. In addition to figuring
out what locking scheme to use, the locking semantics of all the functions of the VM system
had to be defined, and used consistently within all VM code. For example, some functions
are designed to be called with unlocked maps while others are designed to be called with
locked maps. Calling a function with the lock in the wrong state will either lead to
kernel data structure corruption or system deadlock.
A deadlock occurs when a process is waiting to obtain a lock that it is already
holding (thus it can never acquire it), or when two or more processes are waiting in a
circular pattern for locks that the other processes are holding (e.g. A is waiting for a lock
that B is holding, and B is waiting for a lock that A is holding). It is well known that
deadlock can be avoided if all processes acquire their locks in the same order (thus no
loops can form). UVM uses this scheme to avoid deadlock. UVM also attempts to reduce
contention for data structure locks by minimizing the amount of time locks are held. Before
starting an I/O operation, UVM will drop all locks being held. Also, UVM does not hold
any locks during the normal running of a process. So, when a VM request comes into
UVM, it is safe to assume that the requesting process is not holding any locks.
The following data structures have locks in UVM. The data structures are presented
in the order in which they must be locked.
map: the lock on a map data structure prevents any process other than the lock holder
from changing the sorted list of map entries chained off the map structure. A map
can be read or write locked. Read locking indicates that the map is just being read
and no mappings will be changed. Write locking the map indicates that mappings
will be changed and that the version number of the map should be incremented. Map
version numbers are used to tell if a map has been changed while it was unlocked. A
write-locked map can only be accessed by the holder of the lock. A read-locked map
can be accessed by multiple readers. Attempting to lock a map that is already locked
will cause the locking process to be put to sleep until the lock is available.
amap: the amap lock prevents any process from adding or removing anon structures from
the locked amap. Note that this does not prevent the anon structures from being
added or removed from other amaps.
object: the object lock protects the list of pages associated with the object from being
changed. It also protects the “flags” field of the page data structure of all the pages
in that object from being changed.
anon: the anon lock protects the page or disk block the anon points to from being changed.
It also protects the “flags” field of the page data structure from being changed.
page queues: the page queue lock protects pages from being added or removed from the
active and inactive page queues. It also blocks the page daemon from running.
For most of the UVM system this locking order is easy to maintain. However, there
is a problem with the page daemon. The page daemon runs when there is a shortage of
physical memory. The page daemon operates by locking the page queues and traversing
them looking for pages to page out. If it finds a target page it must lock the object or anon
that the page belongs to in order to remove it from that object. This is a direct violation of
the locking order described above. However, this problem can be worked around. When
the page daemon attempts to lock the memory object owning the page it is interested in it
should only “try” to lock the object, as shown in Figure 3.2. If the object is already locked,
then the page daemon should skip the current page and move on to the next one rather than
wait for the memory object to be unlocked. Having the page daemon skip to the next page
does not violate UVM’s locking protocol and thus has no dire consequences.
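In code, the pagedaemon’s work-around looks roughly like the fragment below.
The lock and helper names are invented; NetBSD’s simple_lock_try() is the kind
of non-sleeping primitive assumed here.

    /*
     * Pagedaemon scan (sketch).  We hold the page queue lock, so we
     * must not sleep waiting for an object or anon lock -- that would
     * invert the locking order.  A busy lock means "skip this page."
     */
    for (pg = TAILQ_FIRST(&inactive_queue); pg != NULL; pg = nextpg) {
        nextpg = TAILQ_NEXT(pg, pageq);
        if (!simple_lock_try(page_owner_lock(pg)))
            continue;               /* owner busy: move on to next page */
        try_to_pageout(pg);         /* safe: the owner is now locked */
        simple_unlock(page_owner_lock(pg));
    }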
3.3.3 VM Maps
In UVM, the vm map structure and its corresponding list of vm map entry structures are
handled in much the same way as in BSD VM. However, there are some differences. In
UVM, the map entry structure has been modified for UVM’s two-layer mapping scheme.
Not only does it contain a pointer to the backing uvm object structure, but it also con-
tains a vm aref structure. The aref points to any amap associated with the mapped region.
Some of the functions that operate on maps have also been changed. For example,
UVM provides a new function called uvm map that establishes a mapping with the speci-
fied attributes. The BSD VM system does not have such a function: mappings are always
established with default attributes. Thus, in the BSD VM, after establishing a mapping, fur-
ther function calls are often required to change the attributes from their default values to the
desired values. On a multi-threaded system this can be a problem because the default value
for protection is read-write. Thus, when establishing a read-only mapping of a file there
is a brief window — between the time the mapping is established (with the default protec-
tion) and the time the system write-protects the mapping — where the map is unlocked and
could be accessed by another thread, thus allowing it to bypass system security.
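The difference can be illustrated with the following sketch; the argument lists are
simplified and are not the exact prototypes of either system.

    /* BSD VM: two steps, with a vulnerable window in between. */
    vm_map_find(map, obj, off, &addr, len, TRUE);    /* default prot: rw */
    /* <-- another thread could touch the read-write mapping here */
    vm_map_protect(map, addr, addr + len, VM_PROT_READ, FALSE);

    /* UVM: the protection is part of the initial mapping request. */
    uvm_map(map, &addr, len, uobj, off,
        UVM_MAPFLAG(UVM_PROT_READ, UVM_PROT_READ,
            UVM_INH_COPY, UVM_ADV_NORMAL, 0));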
UVM also provides a corresponding uvm unmap call. The unmap call has been
restructured so that it holds the lock on a map for a shorter amount of time than the corre-
sponding BSD VM unmap call.
For features such as page loanout and map entry passing, new map related functions
had to be written. These functions perform a variety of tasks including:
extracting a list of map entries from a map. The extracted list can either be a copy of
the map entries in the source map or the actual map entries from the source map (in
which case the mappings are being removed from the source map).
replacing a map entry that is reserving a block of virtual address space with a new
set of map entries.
extracting pages for loanout. The two-level structure of the amap layer makes it easy
to do this. To extract the pages, the loanout routine only needs to look in one of two
places for each page being loaned, as compared to the BSD VM where loan out code
would have to walk the object chains for each page it wished to loan.
Map entry structures contain aref structures. Each aref can point to an amap struc-
ture. The amap structure forms the top layer of UVM’s two-layered virtual memory map-
ping scheme, and the uvm object forms the backing layer. Each amap can contain one
or more anon structures. Each anon represents a page of anonymous virtual memory. An
anon’s data can either be resident in a page of physical memory (in which case the anon
has a pointer to the corresponding vm page structure), or it can be paged out to the swap
area of the disk (in which case the anon contains the index into the swap area where the
data is located).
Copy-on-write is achieved in UVM using the amap layer. Copy-on-write data is
originally mapped in read-only from a backing uvm object. When copy-on-write data
is first written, the page fault routine allocates a new anon with a new page, copies the data
from the uvm object’s page into the new page, and then installs the anon in the amap
for that mapping. When UVM’s fault routine copies copy-on-write data from a lower-layer
uvm object into an upper-layer anon it is called a “promotion.” Once copy-on-write
data has been promoted to the amap layer, it becomes anonymous memory. Anonymous
memory can also be subject to the copy-on-write operation. An anon can be copy-on-write
copied by write-protecting its data and adding a reference to the anon. When the page fault
routine detects a write to an anon with a reference count greater than one, it will copy-on-
write the anon.
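The anon-level copy-on-write check can be sketched as follows; the helper names
are invented for clarity.

    if (is_write && anon->an_ref > 1) {
        /* shared copy-on-write anon: give the faulter a private copy */
        struct vm_anon *nanon = anon_alloc();
        nanon->an_page = page_alloc(nanon);
        page_copy(nanon->an_page, anon->an_page);  /* copy the old data */
        amap_replace(amap, slot, nanon);           /* install private anon */
        anon->an_ref--;                            /* drop old reference */
        anon = nanon;
    }
    /* map anon->an_page writable for the faulting process */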
The introduction of the amap layer and the removal of object chaining had a signif-
icant effect on the handling of page faults, and thus the page fault routine had to be
rewritten from scratch. This is because in the BSD VM when a page fault occurs the object
chains must be traversed. But in UVM, rather than start looking directly at VM objects (or
object chains) the fault routine first looks at the amap layer to see if the needed pages are
there. If there is no page in the amap layer, then the underlying object is asked for it. If
the underlying object doesn’t have the needed data, then the fault routine fails. Thus, in
UVM’s fault routine, there are only two places to check for data and no shadow linked lists
have to be traversed or managed.
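The two-place lookup might be sketched like this; amap_lookup() follows the spirit
of UVM’s amap interface, while the other helpers are invented.

    anon = amap_lookup(&entry->aref, va);          /* 1st: the amap layer */
    if (anon != NULL) {
        pg = anon_get_page(anon);                  /* may page in from swap */
    } else if (entry->uobj != NULL) {              /* 2nd: the object layer */
        pg = uobj_get_page(entry->uobj, offset);
        if (is_write && is_copy_on_write(entry))
            pg = promote_to_anon(entry, va, pg);   /* the "promotion" case */
    } else {
        return (EFAULT);    /* no amap page and no object: fault fails */
    }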
The amap concept was first introduced in the SunOS4 VM system [28, 46]. In
UVM, the SunOS amap concept was enhanced, and then adapted for the more Mach-like
environment provided by UVM. Amaps and anons may also be used for page loanout and
page transfer.
3.3.7 Pages
In UVM, the vm page structure has been updated for amap-based anonymous memory and
for page loanout. In the BSD VM system, an allocated page of memory can be associated
with a vm object. That object could be a normal object, or it could be a shadow or copy
object that was created as part of a copy-on-write operation. In contrast, in UVM a page
can be associated with either a uvm object or an anon structure.
For page loanout, a loan counter has been added to each page structure. A page is
loaned out if it has a non-zero loan count. One of the challenges faced when designing the
page loanout system was to determine how to handle cases where the VM system wants to
do something to a page that is currently on loan. Often this will require that the loan be
“broken.” If a process or the kernel tries to modify a loaned page, then the loan must be
broken. If memory becomes scarce and the VM system wants to pageout a loaned page,
then the loan must be broken. If the VM system tries to free or flush out a loaned page then
the loan must be broken. All these cases must be handled properly by the page loanout
code, or data corruption will result. Thus, the addition of the loan count to the vm page
structure was critical in getting loanout to work properly.
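A sketch of how such a check might look is shown below; the logic is in the spirit
of loan breaking, but the code and its helpers are illustrative only.

    if (pg->loan_count > 0) {
        /*
         * The page is on loan: before it can be modified, paged out,
         * or freed, the owner must switch to a private copy and leave
         * the original page to the borrowers.
         */
        struct vm_page *npg = page_alloc_for_owner(pg);
        page_copy(npg, pg);          /* owner's data moves to the copy */
        replace_owner_page(pg, npg); /* loaned page stays with borrowers */
    }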
Table 3.1: Changes from BSD VM. The “extent of change” values are: new (new code),
replaced (totally rewritten code), revised (same basic design as BSD VM, but revised and
improved for UVM), adjusted (same basic design as BSD VM with minor cleanups), re-
moved (BSD layer removed from UVM). Each function corresponds to a UVM source-code
file.
2. research on I/O and IPC subsystems that take advantage of virtual memory services
to improve performance.
The Mach virtual memory system is used for memory management and IPC in the Mach
microkernel developed at Carnegie Mellon University [4, 57, 64, 71, 72]. The BSD VM
system is a simplified version of the Mach virtual memory system. BSD VM and Mach VM
both use the same mapping structure to map process address spaces, shadow and copy ob-
ject chains for copy-on-write memory, and a pmap-based machine-dependent virtual mem-
ory layer. However, Mach VM differs from BSD VM in the areas of pager support and IPC
data movement. These differences make virtual memory management under Mach much
more complex than managing memory under BSD.
Mach is a microkernel. This means that only the core aspects of the operating sys-
tem are actually part of the kernel. These services include virtual memory management,
interprocess communication, and task scheduling. All other aspects of the operating sys-
tem such as protocol processing, filesystems, pagers, and system call handling have been
pushed out into separate “server” processes. Processes communicate with the kernel and
each other by sending messages to each other using Mach message passing IPC. The goal
behind this structure is to enforce modularity on the operating system and to ease debugging
by allowing the Mach servers to be debugged without the need to reboot the microkernel.
Message passing. Because of the number of server processes needed to provide a
useful operating system, the performance of Mach is greatly dependent on the performance
of its message passing mechanism. Mach message passing operates in one of two ways.
If the message is small, then the data is copied into a kernel buffer and then copied out
to the recipient. However, if the message is large then the virtual memory system is used
to transfer the message’s data. Since the data travels separately from the message this is
called an “out-of-line” message.
Out-of-line messages are transferred between address spaces using a virtual memory
data structure called a copy map. In this context a copy map can be thought of as a vm map
that is not associated with any process. To send a Mach message, the virtual address space
containing the message is copy-on-write copied into the copy map using Mach’s shadow
and copy objects. The copy map is then passed to the recipient of the message. The
recipient then copies the message to its address space. This mechanism is similar to using
the copy-on-write feature of UVM’s map entry passing mechanism.
The out-of-line message passing facility was found to be too expensive in a net-
working environment due to the high overhead of Mach virtual memory object operations
[4]. To solve this problem, the copy map was modified. Instead of just being able to trans-
fer a list of map entries, the copy map was overloaded to also be able to transfer a list of
vm page structures. This reduces the overhead of sending data over a network. To send
a message with this mechanism the data pages are faulted in, marked busy, and put in the
page list of the copy map. The copy map is then passed to the networking system which
transmits the pages and then clears the busy bit. The pages are marked busy while they
are in transit, so they cannot be modified until the network operation is complete. Further
optimizations to this approach allow pages to be “pinned” in memory for some cases, but
for other cases the data has to be “stolen” (i.e. copied into a freshly allocated page). A
copy map may contain only a limited number of pages. If the message contains more pages
than will fit in a copy map, then a “continuation” is used. A continuation is a callback that
allows the next set of pages to be loaded into a copy map.
External pagers. Another feature of Mach is external pagers. In BSD VM all
pagers are compiled into the kernel and are trusted. Mach allows a user process (possibly
a Mach “server”) to act as a pager for a vm object. When Mach’s pagedaemon wants to
pageout a page that is managed by an external pager it has to send that pager a message to
pageout the page. Since the pager is an untrusted user process it may decide to ignore the
pagedaemon’s pageout request, thus creating an unfair situation. To address this problem,
the copy map structure was modified to contain a pointer to a copy map copy object. When
the pagedaemon wants to page out a page of memory managed by an untrusted external
pager it allocates a new copy map and a new vm object to act as that copy map’s copy
object. The pages to be paged out are then inserted into the copy map’s copy object and
the copy map is passed to the external pager. The copy map’s copy object is managed by
the default pager. The default pager is a trusted pager that is part of the kernel and pages
anonymous memory out to the swap area of disk. So, if an external pager ignores a pageout
request from the pagedaemon its pages will get paged out by the default pager through the
copy map’s copy object. Note that this creates a problem: the pages to be paged out have
to belong to both the vm object that belongs to the external pager and to the copy map’s
copy object. Mach solves this problem by allowing pages to be “double mapped.” A double
mapped page is a page of physical memory that has two or more vm page structures that
refer to it. In our example, the main vm page will belong to the externally managed
object and the second vm page belongs to the copy map’s copy object. Note that neither
BSD VM nor UVM need or allow double mapped pages. In BSD VM there is a one-to-one
correspondence between a vm page and a page of physical memory.
In current versions of Mach a copy map can contain one of four possible items:
a list of pages
There are a number of functions that convert copy maps between these four formats (but
not all conversions are supported).
Fictitious pages. Mach also makes heavy use of fictitious pages. A fictitious page
is a vm page structure that has no physical memory allocated to it. These pages are often
used as busy-page place holders to temporarily block access to certain data structures.
Mach freely moves physical memory between normal pages and fictitious pages, so normal
pages can easily become fictitious and vice-versa.
Mach VM suffers from the same sort of problems that BSD VM does. It has a com-
plex multi-level object-chaining based copy-on-write and mapping mechanism. It does not
allow pages to be shared copy-on-write without using this mechanism. Thus, it is difficult
to have page-level granularity without the extra overhead of allocating numerous objects.
This combined with its page wiring mechanism can lead to map entry fragmentation. It
also suffers from the same partial unmap swap-memory leak problem that BSD VM does.
The FreeBSD virtual memory system is an improved version of the BSD VM system [25].
One of the main emphases of work on the FreeBSD VM system is ensuring that FreeBSD
performs well under load. Thus, FreeBSD is a popular operating system for network and
file servers.
Work on FreeBSD VM has focused on a number of areas including simplifying
data structure management, data caching, and efficient paging algorithms. FreeBSD VM’s
data structures are similar to BSD VM’s data structures, although some structures have
been eliminated. For example, FreeBSD VM no longer has share maps or copy object
chains. Since neither of these are needed to provide a Unix-like virtual memory envi-
ronment their elimination reduces the complexity of FreeBSD VM. While FreeBSD re-
tains Mach-style shadow object chaining for copy-on-write, the swap memory leaks asso-
ciated with BSD VM’s poor handling of the object collapse problem have been addressed.
FreeBSD has successfully merged the VM data cache with the Unix-style buffer cache thus
allowing VM and I/O to take better advantage of available memory. FreeBSD’s paging
algorithms use clustered I/O when possible and attempt to optimize read-ahead and pa-
geout operations to minimize VM I/O overhead. These paging algorithms contribute to
FreeBSD’s good performance under load.
FreeBSD’s improvements to object chaining produce performance improvements simi-
lar to those of UVM’s elimination of object chaining; however, the complexities of object
chaining remain in FreeBSD. Many of FreeBSD’s improvements in the areas of data caching
and paging algorithms are applicable to UVM, thus FreeBSD will be a good reference for
future work on UVM in these areas. FreeBSD does not have UVM-style features such as
page loanout and page transfer. These features could be added to FreeBSD with some dif-
ficulty; alternatively, object chaining could be eliminated from FreeBSD and UVM-style
features added.
The SunOS4 virtual memory system is a modern VM system that was designed to replace
the 4.3BSD VM system that appeared in SunOS3 [28, 46]. This virtual memory system is
also used by Sun’s Solaris operating system.
The basic structure of the SunOS virtual memory system is similar to the structure
of BSD VM. The kernel and each process on the system have an address space that is de-
scribed by an as structure. The as structure contains a list of currently mapped regions
called segments. Each segment is described by a seg structure. The seg structure con-
tains its starting virtual address, its size, a pointer to a “segment driver” and a pointer to
the object mapped into that segment. A segment driver is a standard set of functions that
perform operations on a memory mapped object. SunOS’s as and seg structure corre-
spond to BSD VM’s vm map and vm map entry structures. The SunOS segment driver
corresponds to BSD VM pager operations. The SunOS mapped object corresponds to
BSD VM’s vm object structure. One difference between SunOS VM and BSD VM is
that in SunOS the format of the memory object structure is private to the object’s segment
driver. Thus, the pointer to the memory object in the seg structure is a void pointer. This
forces the pointer to the segment driver function to be in the seg structure. BSD VM
stores the pointer to its pager operation structure off of its vm object structure (through
the vm pager). Both SunOS VM and BSD VM have “page” structures that are allocated
for each page of physical memory at VM startup time.
The division of labor between the pager operations and the high-level VM in SunOS
VM is different from BSD VM. In SunOS almost all high-level decisions are delegated to
the segment driver, while in BSD VM the high-level VM handles most of the work and
only references the pager if data transfer is needed between physical memory and backing
store. For example, in BSD VM the high-level fault routine calls the pager to fetch a page
from backing store. It then goes on to handle copy-on-write and mapping the new page into
the faulting address space. In contrast, the high-level SunOS VM fault routine determines
which segment the page fault occurred in and calls that segment driver’s fault routine to
resolve the fault. Issues such as copy-on-write and mapping in the faulting page
are handled by the segment driver, not the high-level fault routine.
Both BSD VM and SunOS VM take a similar approach to dividing the machine-
dependent and machine-independent aspects of the virtual memory system. SunOS has a
machine-dependent layer called the hardware address translation (HAT) layer. The SunOS
HAT layer corresponds to the BSD VM pmap layer.
SunOS VM manages copy-on-write and anonymous memory through the use of
amaps and anons. In UVM, the SunOS amap concept was enhanced, and then adapted for
the more Mach-like environment provided by UVM. In SunOS amaps are not a general
purpose VM abstraction as they are in UVM. SunOS amaps can only be used by segment
drivers that explicitly allow amaps to be used. One such segment driver is the vnode
segment driver. In SunOS areas of memory that are mapped shared must always remain
shared and areas of memory that are mapped copy-on-write must always remain copy-on-
write. Thus, SunOS cannot support Mach-style memory inheritance or certain types of
map entry passing. UVM has no such restriction. SunOS’s amap implementation does not
appear to support UVM-style quick traversal or deferred amap allocation during a fork.
One improvement introduced to SunOS VM in Solaris is the introduction of the
“virtual swap file system” [13]. In SunOS4 each anon on the system was statically as-
signed a pageout location in the system’s swap area. This forced the size of the swap area
to be greater than or equal to the size of main memory. The virtual swap file system allows
the backing store of an anon to be dynamically assigned. UVM has dynamically assigned
backing store for anonymous memory as well, but a filesystem abstraction is not necessary,
and thus is not used. Furthermore, UVM extends this idea to implement aggressively clus-
tered anonymous memory pageout. In this form of pageout a cluster of anonymous memory
is dynamically assigned a contiguous block of backing store so that it can be paged out in
a single I/O operation.
SunOS VM currently does not provide UVM-like features such as page loanout,
page transfer, and map entry passing. Thus, for bulk data transfer under SunOS one would
have to use traditional mechanisms such as data copying. However, SunOS’s anon-style
anonymous memory system and modular design would ease the implementation of UVM-
style features.
Linux is a popular free Unix-like operating system written by Linus Torvalds [66]. We
examined the virtual memory system that appears in what is currently the most recent
version of Linux (version 2.1.106).
In Linux, the kernel and each process has an address space that is described by a
mm struct structure. The mm struct structure contains a pointer to a sorted linked list
of vm area struct structures that describe mapped regions in the address space. Each
mapped area of memory has a set of flags, an offset, and a pointer to the file that is mapped
into that area. Each page of physical memory has a corresponding page structure.
VM Layering. Linux VM’s machine-dependent/machine-independent layering is
quite different from either SunOS VM or BSD VM. In BSD VM the machine-dependent
pmap module provides the machine-independent VM code with a set of functions that add,
remove, and change low-level mappings of pages. In Linux, the machine-independent
code expects the machine-dependent code to provide a set of page tables for a three-level
forward mapped MMU. The machine-independent code reads and writes entries to these
page tables using machine-dependent functions (or macros). For machines whose MMU
matches the Linux machine-independent model the machine-dependent code can arrange
for the machine-independent code to write directly into the real page tables. However,
machines whose MMU does not match the Linux three-level MMU machine-independent
model must both emulate this style MMU for the machine-independent code and internally
translate MMU requests to the native MMU format. An unfortunate result of this arrange-
ment is that hooks for machine-dependent cache and TLB flushing must appear throughout
the machine-independent code. In SunOS and BSD VM such machine-dependent details
can safely be hidden behind the HAT/pmap layer. Another unfortunate result of this struc-
ture is that all machine-independent virtual memory operations that operate on a range
of virtual addresses must be prepared to walk all three page table levels of the machine-
independent MMU model.
Linux requires that the entire contents of physical memory be contiguously mapped
into the kernel’s address space. Thus, the kernel’s virtual address space must be at least
as large as physical memory. By mapping all physical memory into the kernel’s address
space Linux can access any page of memory without having to map it in. It can also easily
translate between page structure pointers and physical addresses.
Copy-on-write. Linux handles copy-on-write as follows. First, each page of mem-
ory has a reference counter. Each mapping of the page counts as a reference. Pages
appearing in the buffer cache also hold a reference. Thus any page that is mapped in
directly from a file will have a reference count of at least two. Any page whose reference
count is greater than or equal to two is considered “shared.” Copy-on-write memory is
identified by a copy-on-write flag in the vm area struct structure that maps the region.
The first read-fault that occurs on a copy-on-write area of memory will cause the backing
file’s page to be mapped in read-only. When a write-fault occurs on a copy-on-write area
of memory the faulting page’s reference count is checked to see if the page is shared. If so,
then a new page is allocated (with a reference count of one), the data from the old page is
copied to the new page, and the new page is mapped in. If a process with a copy-on-write
region forks, the copy-on-write region is write protected and the reference counter for each
mapped page is incremented. This causes future write-faults to perform a copy-on-write.
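The write-fault logic described above can be sketched as follows; the names are
schematic and do not match the actual 2.1-era kernel functions.

    if (page_ref_count(page) >= 2) {
        /* shared page: give the faulting process a private copy */
        new_page = alloc_page();                   /* refcount starts at 1 */
        memcpy(page_address(new_page), page_address(page), PAGE_SIZE);
        map_page(vma, address, new_page, /*writable=*/1);
        put_page(page);                            /* drop our reference */
    } else {
        /* sole user: simply make the existing mapping writable */
        map_page(vma, address, page, /*writable=*/1);
    }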
Copy-on-write memory can be paged out to swap. The swap area is divided into
page-sized blocks. Each block has a reference counter that says how many page tables are
referencing it, so the swap subsystem knows when it is no longer in use. To page a copy-
on-write page out to swap the page’s PTE is replaced with an invalid PTE containing the
address on swap where the data is located and the page is placed in the “swap” file object.
The page’s offset is set to the address of its location on swap. When all PTEs pointing to a
copy-on-write page are invalidated, the page is removed from the swap file object and
freed for reuse. Note that Linux attempts to place successive swap block allocations in the
same area of swap to reduce I/O overhead (but it appears to write to swap a page at a time).
When a process references a page whose PTE has been modified to point to an area
of swap a page fault is generated. The page fault routine extracts the location of the page
on swap from the PTE and searches the swap file object to see if the page is still resident. If
so, the page is mapped in. If the fault is a write fault, the swap block is freed since the data
will be modified. If the page is not resident in the swap file object a new page is allocated,
added to the swap file object at the appropriate offset, and read in from swap. Then the
page can be mapped as before.
Page tables. Note that in a HAT or pmap based VM system, the information stored
in the hardware page tables can safely be thrown away because it can be easily recon-
structed based on information in the address space or map structure. Systems that use this
approach will often call on the HAT or pmap layer to free memory when a process gets
swapped out or memory becomes scarce. However, in Linux the page tables are being used
to store swap location information, and thus the information contained within them cannot
be freed. One way to allow such memory to be freed is to allow certain page tables to be
paged out themselves as in Windows-NT [62], however Linux currently does not appear to
support this.
Linux, like SunOS, does not support the sharing of copy-on-write memory or mem-
ory inheritance. Additionally, since Linux stores copy-on-write state in its page tables,
operations such as map entry passing could be expensive since they could require travers-
ing all the page table entries for a mapped region as well as the high-level map in the
mm struct. Linux does have some support for remapping memory within a process. The
non-standard mremap system call is used by some versions of Linux’s malloc memory
allocator to resize its heap. The mremap system call takes a range of mapped memory and
changes its size. If the new size is smaller than the old, then the extra memory is simply
unmapped. If the new size is larger than the old and there is room to grow the allocation in
place, then mremap does so. However, if there is not enough room, then mremap moves
the mapped memory to a new location that is large enough for the new size. The current
version of mremap has two limitations: it will not work across VM area boundaries and
it does not handle wired memory properly. Additionally, the mremap system call has an
interesting side effect. If the region of memory being remapped points to a file rather than
a zero-fill area of memory and the size of the region is being increased then the amount of the
file mapped is increased. This may have security implications because it is giving the user
the ability to change a file’s mapping without the need for a valid open file descriptor on
that file. Since such a feature is not necessary for malloc it could easily be removed if it
is determined to be a problem.
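For reference, here is a minimal user-level example of the mremap call discussed
above (written against the modern glibc prototype; the 2.1-era interface was similar).

    /* Grow an anonymous mapping with mremap(2); error handling abbreviated. */
    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    int
    main(void)
    {
        size_t oldsz = 4096, newsz = 8 * 4096;
        void *p = mmap(NULL, oldsz, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* grow in place if possible, otherwise move the mapping */
        void *q = mremap(p, oldsz, newsz, MREMAP_MAYMOVE);
        if (q == MAP_FAILED)
            perror("mremap");
        else
            printf("region now at %p (was %p)\n", q, p);
        return 0;
    }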
Windows-NT is the most advanced of the Windows operating systems from Microsoft [62].
Although not a Unix-style operating system, NT’s virtual memory system shares many
aspects of such systems. Under NT, each process has its own private virtual address space
that is described by a list of virtual address space descriptors (VADs). Unlike BSD VM’s
map entry structures, which are organized as a sorted linked list, VADs are arranged in a
self-balancing binary tree data structure. Each VAD contains a protection and a pointer to the
section object that it maps. A section object is the NT-internal name for a memory mapped
file.
Memory allocation. NT allows processes to allocate and deallocate memory in
their address spaces in two phases. Virtual memory can be “reserved” by adding an entry
to the VAD list. Reserved virtual memory must remain unused until it is “committed”
by associating it with a mapped memory object. Processes can “decommit” and “free”
memory when it is no longer needed. This two-phase system is used by NT for stack
allocation. NT also allows processes to reserve and commit memory with a single system
call.
Page table and page management. NT’s management of page tables is similar
to the way Linux’s VM manages them. But unlike Linux, which assumes a three-level
MMU structure, NT assumes a two-level MMU structure. If the hardware does not directly
support such a page table structure, then the machine-dependent NT hardware abstraction
layer (HAL) must translate between two-level page tables and the native MMU format.
NT also maintains a page frame database (PFN database) that contains one database
entry for each page of physical memory on the system. An entry in the PFN database
contains a share (reference) count, the physical address of the page, and the rest of the
page’s current state.
Prototype PTEs. NT manages pages that may be shared through a mechanism
called “prototype PTEs.” Under NT memory can be shared either through shared memory
mappings or copy-on-write memory mappings (before the write occurs). Each memory
mapped section object has an array of prototype PTEs associated with it. These prototype
PTEs are used to determine the location of each page that the section object contains. Such
pages could be resident, zero-fill, or on disk. While the format of prototype PTEs is similar
to the format used in real PTEs, prototype PTEs are only used to keep track of the current location
of a page of memory and never actually appear in real page tables. For example, when a
memory mapped page is first faulted on, NT first checks the page tables of the faulting
process to determine the PTE of the page that caused the fault. Since this is the first time
the page is being faulted, the PTE contains no useful information, so NT checks the VAD
list to determine which section object is mapped into the faulting virtual address. Once
the section object is found, the prototype PTE entry for the faulting page is located. If the
prototype PTE indicates that the page is resident in physical memory, then the prototype
entry is copied into the faulting process’ page table with the protection set appropriately.
The reference counter for the page in the PFN database is also incremented. If the prototype
PTE does not point to physical memory then it will either point to a page-sized block of a
swap file or normal file, or it will indicate that the page should be zero-filled.
When a page of physical memory is removed from a process’ address space by the
NT pagedaemon the PFN database’s reference counter for that page is decremented and
the process’ PTE mapping the page is replaced with a pointer to the prototype PTE for
that page. This pointer is treated as an invalid PTE by the hardware. If the PFN database
reference counter reaches zero, then the page of physical memory is no longer being mapped
and it can be recycled. This reference counter is also used during a copy-on-write fault to
determine if a copy needs to occur.
NT does not support the sharing of copy-on-write memory or memory inheritance.
Additionally, since NT, like Linux, stores copy-on-write state in its page tables, operations
such as map entry passing could be expensive since they may require traversing all page
table entries for a mapped region as well as the VADs. NT does have an internal-only IPC
facility called local procedure calls (LPC) that uses virtual memory features in some cases
to transfer data. For LPC messages of less than 256 bytes data copying is used. For LPC
messages larger than that, a shared section object is allocated and used to pass data. For
very large data that will not fit in a shared section, NT’s LPC mechanism allows the server
to directly read or write data from the client’s address space.
Sprite was an experimental operating system that was developed at University of California
at Berkeley in the late 1980s [51]. Sprite’s original virtual memory system provided the
basic functionality of the 4.2BSD virtual memory system [48]. Thus, it was lacking sup-
port for copy-on-write, mmap, and shared libraries. This allowed page table management
to be simplified by associating page tables with files rather than processes. These page
tables could be shared among all processes. Sprite did support a primitive form of memory
inheritance. When a process forked, it had the option of sharing its data and heap segments
with its child process. Sprite was later modified to support copy-on-write [49] for data and
stack pages. This was done through a segment list structure. Each copy-on-write copy of a
segment adds a segment to the segment list. Only one segment on the list, the master segment,
can have a valid mapping of a given page at any one time. The PTEs in
the rest of the segments contain pointers to the master segment (the hardware treats these
pointers as invalid entries). If a copy-on-write fault occurs on the master segment, then a
copy is made and another segment is made master. If a copy-on-write fault occurs on a
non-master segment, then the data is just copied to a private page.
Chorus is a microkernel based operating system with an object-oriented virtual
memory interface developed in the early 1990s at Chorus Systems and INRIA, France
[1, 2, 17]. The interface for the virtual memory system is called the “generic memory man-
agement interface” or GMI. The GMI defines a set of virtual memory operations that the
Chorus microkernel supports. Each process under Chorus has a virtual memory “context”
that consists of a set of mapped “regions.” This corresponds to Mach’s maps and map en-
tries. Each region points to a local “cache” that maps a “segment.” Data in a segment is
accessed through its “segment manager.” A cache corresponds to a Mach memory object,
a segment corresponds to backing store, and a segment manager corresponds to a Mach
pager. The operations defined by the GMI that can be performed on these objects in-
clude adding and removing a mapped region from a context, splitting a mapped region into
smaller parts, changing the protection of a mapped region, and locking and unlocking a
mapped region in memory. The “paged virtual memory manager” (PVM) is a portable im-
plementation of the GMI interface for paged memory architectures. PVM includes both the
hardware-independent and hardware-dependent parts of the virtual memory system. PVM
supports copy-on-write through the use of “history objects.” History objects are similar to
Mach’s shadow object chains. They form a binary tree that must be searched to resolve a
copy-on-write page fault. Chorus supports a per-page form of copy-on-write through the
use of “copy-on-write page stubs” which are similar to fictitious pages in Mach. These
stubs are kept on a linked list that is rooted off the real page’s descriptor.
Spring is an experimental object-oriented microkernel operating system developed
at Sun [35, 31, 44]. Each process under Spring has an address space called a “domain.”
Each domain contains a sorted linked list of regions that are mapped into it. Each region
points to the memory objects that it maps. Mapped memory object regions can be copy-on-
write copied into other address spaces. A cache object is a container for physical memory.
A pager object is used to transfer data between physical memory and backing store. Spring,
like Mach, supports external pagers. Spring also provides coherent file mappings across
a network. Spring supports copy-on-write through a copy-on-write map structure. This
structure contains pointers to regions of the source and destination cache, and a bitmap
indicating which pages have been copied from the source cache. The caches that the copy-
on-write map structures point to can chain to arbitrary levels, much like Mach shadow
objects. Spring has a bulk data movement facility for IPC that moves data either by moving
it or by copy-on-write copying it (using the copy-on-write map mechanism). If the data is
being moved rather than copied and the data is resident in memory, then Spring will steal
the pages from the source object and place them in the destination. This is important for
external filesystem pagers since filesystems often have to move lots of data. Spring reuses
the SunOS machine-dependent HAT layer to avoid duplicate work.
3.4.2 I/O and IPC Subsystems That Use VM Features
In this section we examine research on I/O and IPC subsystems that can take advantage
of services offered by virtual memory systems to reduce data movement overheads. While
this research shares some of the same goals as UVM, our approach is different. In UVM
we are interested in creating a virtual memory system whose internal structure provides
efficient and flexible support for data movement. In I/O and IPC research, the focus is on
using features that the virtual memory system is assumed to already provide. Thus, I/O and IPC
research addresses some problems that UVM does not address such as buffering schemes
and API design.
Recent research by Brustoloni and Steenkiste analyzed the effects of data passing seman-
tics on kernel I/O and IPC [10, 11]. Since the traditional Unix API uses “copy” semantics
to move data, this research focuses on providing optimizations that reduce data transfer
overhead while maintaining the illusion of copy semantics. These optimizations were
implemented under Genie, a prototype I/O system that runs under NetBSD 1.1. Since Ge-
nie was intended as an experimental testbed to compare various IPC schemes it contains
features that may not be suitable for production use. Several techniques that preserve the
copy semantics are presented, including temporary copy-on-write (TCOW) and input
alignment. We briefly describe each of these below.
Genie’s TCOW mechanism is similar to UVM’s page loanout mechanism. TCOW
allows a process to loan its pages out to the Genie I/O system for processing without fear of
interference from the pagedaemon or page faults from other processes. TCOW differs from
page loanout in three ways. First, TCOW only allows pages to be loaned out to wired kernel
pages, while UVM’s page loanout allows both object and anonymous pages to be loaned
out to pageable anonymous memory. Second, TCOW was implemented on top of the old
BSD VM object chaining model, while page loanout is implemented on top of UVM’s new
anonymous memory layer. Third, in UVM we have demonstrated how page loanout can
easily be integrated with BSD’s mbuf-based networking subsystem while TCOW was only
demonstrated with the Genie I/O framework, completely bypassing the traditional BSD
IPC system.
Input alignment is a technique that can be used to preserve an API with copy se-
mantics on data input while using page remapping to actually move the data. In order to do
input alignment one of two conditions must hold. Either the input request must be issued
before the data arrives so Genie can analyze the alignment requirements of the input buffer
and configure its buffers appropriately, or the client must query Genie about the alignment
of buffers that are currently waiting to be transferred so that it can configure its buffers to
match. When remapping pages that are not completely filled with data, Genie uses “reverse
copyout” to fill the remaining areas with the appropriate data from the receiving process’
address space. Genie also has several techniques for avoiding data fragmentation when the
input data contains packet headers from network interfaces.
The fast buffers (fbufs) kernel subsystem is an operating system facility for IPC buffer man-
agement developed at the University of Arizona [23, 24, 65]. Fbufs provide fast data transfer
across protection domain boundaries. The fbuf system is based on the assumption that IPC
buffers are immutable. An immutable buffer is one that is not modified after it has been
initialized. An example of an immutable buffer is a buffer that has been queued for trans-
mission on a network interface. Once the data has been loaded into the buffer it will not be
changed until the buffer is freed and reused.
Fbufs were implemented as an add-on to the Mach microkernel. The kernel and
each process on the system share a fixed-size area of virtual memory set aside as the “fbuf
region.” All fbuf data pages must be mapped into this shared region. Although the fbuf
region of virtual memory is shared among all processes the individual fbufs mapped into
the region are protected so that only authorized processes have access to the fbufs they are
using. The fbuf region can be thought of as a large vm object that is completely mapped
into all address spaces. The fbuf code acts as a pager for this object. This structure allows
the fbuf system to operate under Mach without the need to make changes to the core of the
Mach virtual memory system.
To send a message using an fbuf, a process first allocates an fbuf using the fbuf
system’s API. The fbuf allocator will allocate an unused chunk of virtual memory from the
fbuf region and mark it for use by the process requesting the allocation. At this point the
fbuf system will allow the process read and write access to the allocated part of the fbuf
region. The application can then place its data in the fbuf and request that the data be sent.
Since fbufs are immutable, the application should not attempt to access the fbuf after it is
sent. The fbuf system gives the recipient of the fbuf read access to it so that it may process
the data. Data stored in the fbuf object is pageable unless it has been wired for I/O. If a
process attempts to read or write an area of the fbuf region that it is not authorized to access
then it receives a memory access violation.
The fbuf facility has two useful optimizations: fbuf caching and volatile fbufs. In
fbuf caching, the sending process specifies the path the fbuf will take through the system
at fbuf allocation time. This allows the fbuf system to reuse the same fbuf for multiple
IPC operations. Once an fbuf has been sent, it is returned to the originating process for
reuse. This saves the fbuf system from having to allocate and zero-fill new fbufs for each
transaction. Volatile fbufs are fbufs that are not write protected in the sending process. This
saves memory management operations at the cost of allowing a sending process to write an
immutable buffer after it has been sent.
While noteworthy, the fbuf facility has several limitations. First, the only way to get
data from the filesystem into an fbuf is to copy it. Second, if an application has data that it
wants to send using an fbuf but that data is not in the fbuf region, then it must first be copied
there. Third, it is not possible to use copy-on-write with an fbuf. It should be possible to
combine UVM features such as page loanout and page transfer with the fbuf concept to lift
these limitations.
Container Shipping
The Container Shipping I/O system allows a process to transfer data without having direct
access to it [53]. Developed at the University of California, San Diego, container shipping
is based on the container-shipping mechanism used by the cargo-transportation industry.
In container shipping, goods are placed on protective pallets which are then loaded into
containers. Because the containers are uniform in size they can easily be moved, stacked,
and transported.
The container shipping I/O system operates in a similar way. Using a new API, a
process can allocate a container. It then can “fill” the container with pointers to data buffers.
Once a container has been filled with the necessary data a process can “ship” it to another
process or the kernel. Rather than receiving the data directly, the receiving party receives
a handle to the container. The receiver then has four options. First, it could “unload” the
container, thus causing the data in the container to be mapped into the receiver’s address
space. Second, it could “free” the container (and the data within). Third, it could “ship” the
container off to another process. Fourth, it could “empty” the container, thus freeing data
from it. There is also an operation that allows a process to query the status of a container.
By using these operations processes can pass data between themselves without mapping
it into their address space. A container shipping I/O system could easily take advantage
of UVM services such as page loanout and page transfer when loading and unloading
containers.
The Peer-to-Peer I/O System
Another I/O system developed at the University of California, San Diego is the Peer-to-Peer
I/O system. The Peer-to-Peer I/O system was designed to efficiently support I/O endpoints
without direct user process intervention [26, 27]. This is achieved through the use of the
splice mechanism. The splice mechanism allows a data source to directly transfer data to a
data sink without going through a process’ address space.
To use a splice a process must first open the data source for reading and the data
sink for writing. The endpoints can either be files or devices. The process then calls the
splice system call with the source and sink file descriptors as arguments. This causes
the kernel to create a splice that associates the source file descriptor with the sink file
descriptor. The splice system call returns a reference to the splice to the calling process
in the form of a new, third file descriptor. Once the splice is established, the process can
transfer a chunk of data from the source to sink by writing a control message to the splice’s
file descriptor. The control message contains the size of the chunk. Such data is transferred
through kernel buffers rather than through the process’ address space. The process can also
delegate control of data transfer to the operating system. In that case, the operating system
will continue to transfer chunks of data until the end-of-file for the data source is reached.
The process can query the status of a splice by reading a status message from the splice file
descriptor. The splice is terminated simply by closing the splice file descriptor.
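As a concrete illustration, the following C fragment sketches the flow just described. The
splice system call, the control-message layout, and the endpoint paths are hypothetical
stand-ins for illustration only; the actual Peer-to-Peer I/O API is not reproduced here.

#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* hypothetical control message: requests transfer of one chunk of data */
struct splice_ctl {
    size_t size;
};

int splice(int src, int sink);              /* hypothetical system call */

int
splice_example(void)
{
    int src = open("/data/source", O_RDONLY);   /* data source (assumed path) */
    int sink = open("/dev/atm0", O_WRONLY);     /* data sink (assumed device) */
    int sp = splice(src, sink);                 /* create the splice */
    struct splice_ctl ctl = { 64 * 1024 };

    write(sp, &ctl, sizeof(ctl));   /* transfer one 64KB chunk to the sink */
    close(sp);                      /* closing the descriptor ends the splice */
    close(src);
    close(sink);
    return (0);
}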
The splice mechanism also allows for some data processing to occur within the
kernel. This is done by allowing software data processing modules to be put between the
data source and sink. Such modules can either be compiled into the kernel or dynamically
loaded (if the process has suitable privileges). This mechanism is similar to the STREAMS
mechanism that is part of System V Unix [56, 58, 68].
Rice University’s I/O Lite unified buffering and caching system unifies the buffering and
caching of all I/O subsystems in the kernel [52]. I/O Lite extends Fbuf’s concept of im-
mutable buffers to the entire operating system. These immutable buffers are read-only
mapped into the address space of processes that are accessing them. Processes access the
immutable buffers through lists of pointers to the immutable buffers called mutable buffer
aggregates. A new API is used to pass a mutable list of buffer aggregates between end-
points. When a process wants to modify data stored in an immutable buffer it has three
options. First, if the entire buffer is being changed it can be completely replaced with
a new immutable buffer. Second, if only part of an immutable buffer is being changed
then a new buffer aggregate can be created that directs references to the changed area to a
new immutable buffer. Finally, for applications that require write access to the immutable
buffer I/O Lite has a special “mmap” mode that provides shared access to the (no longer)
immutable buffer.
University of Maryland’s Roadrunner I/O system also unifies buffering and caching
of all I/O systems in the kernel [41, 42]. All I/O systems in Roadrunner use a “common
buffer” to pass data. Roadrunner supports a more generalized version of the splice called a
“stream” that uses a kernel thread to provide autonomous I/O between any two endpoints.
Roadrunner’s caching layer is global. Unlike Unix, Roadrunner uses the same layer and
interfaces for all devices.
Washington University’s UCM I/O system is an I/O system that was designed to
unify the buffering of all I/O subsystems of the kernel [15, 16]. UCM I/O includes a new
API that supports both network and file I/O in a general way. It was designed to use page
remapping to move data between users and the kernel. The design also includes a splice-like
facility for moving multimedia data and a mechanism to combine separate system calls
into a single system call.
Other Work
Research in the area of remote procedure call has used virtual memory services to help
move data. The DEC Firefly RPC system used a permanently mapped unprotected area
of memory shared across all processes to pass RPC data [61] and provide low latency.
The LRPC and URPC systems copy data into memory statically shared between the client
and the server [5, 6]. The DASH IPC mechanism used a reserved area of virtual space in
processes to exchange IPC data through page remapping [67]. The Peregrine IPC system
used copy-on-write for output, copy for input, and page remapping to move data between
IPC endpoints [73]. Finally, the Mach IPC system uses copy maps, as described previously.
Some research has produced special purpose I/O systems to work around a specific
problem. For example, the MMBUF system avoids data copying between the filesystem
and networking layers by using a special buffer, the MMBUF, that can act as either
a network mbuf or a filesystem buffer [12]. This system is used to transmit video
data from a disk file over an ATM network using real-time upcalls to maintain quality of
service [29]. This system could be adapted to use a mechanism like UVM’s page loanout
to move data without copying it (or mapping it) through a user address space.
New designs in host-network interfaces also look to take advantage of services that a
VM system such as UVM can offer. For example, the APIC ATM host-network adaptor can
separate protocol headers from data and store inbound data into pages that can be remapped
into a user’s address space using a mechanism such as UVM’s page transfer [20, 21, 22].
Researchers are often reluctant to change traditional APIs for fear of breaking user
programs. However, new applications are often taking advantage of communications mid-
dleware such as ACE wrappers [60]. ACE wrappers abstract away the differences be-
tween operating systems to provide applications with a uniform interface to IPC services.
Such techniques could be applied in the I/O and IPC area to hide differences between tradi-
tional APIs and newer experimental ones. Applications that use such wrappers would not
have to be recompiled to take advantage of new APIs. Instead they could dynamically link
with a new middleware layer that provides the appropriate abstractions and understands the
new API.
Recent operating systems research has focused on extensible operating systems.
For example, the SPIN kernel allows programs written in a type-safe language to be down-
loaded into the kernel and executed [7, 8]. The Exokernel attempts to provide processes
with as much access to the hardware of the system as safely possible so that alternate oper-
ating system abstractions can easily be evaluated without the Exokernel getting in the way
[34]. The Scout operating system is intended for small information appliances [45, 47].
Scout allows optimized “paths” to be built through the operating system for specific ap-
plications. Such work on extensible systems can provide us with insight into services that a
VM system should provide, and in some cases allow applications more direct access to the
services that a VM system provides the kernel.
Research using the L3 and L4 microkernels sought to determine the maximal
achievable level of IPC performance that the hardware will allow [37, 38]. This was done by
hand-coding the IPC mechanism in assembly language, ensuring that code and data were
positioned in memory in such a way that they could all be co-resident in the cache, and by
minimizing the number of TLB flushes used for an IPC operation. The result of this exer-
cise shows that software adds a substantial amount to IPC overhead and provides a “best
case” target for software developers to shoot for when developing new IPC mechanisms.
3.5 Summary
In this chapter we have presented an overview of UVM. The overview consisted of the
goals of the UVM project, a description of new and improved features that UVM provides,
and a description of UVM’s main data structures and functions. Table 3.1 and Table 3.2
summarize the changes introduced to BSD by UVM and list the design issues we faced
during the implementation of UVM. We also discussed related research in the area of vir-
tual memory systems, I/O, and IPC. In the remaining chapters of this dissertation we will
examine the design and implementation of UVM in greater detail.
Chapter 4

Anonymous Memory Handling
This chapter describes the data structures, functions, and operation of UVM’s anonymous
memory system. Anonymous memory is memory that is freed as soon as it is no longer
referenced. This memory is referred to as anonymous because it is not associated with a
file and thus does not have a file name. Anonymous memory is paged out to the “swap”
area when memory is scarce. Anonymous memory is used for zero-fill memory, for pages
copied by copy-on-write operations, and for shared regions of memory that can be accessed
with the System V shared memory system calls.
In the BSD VM system all anonymous memory is managed through objects whose pager
is the “swap” pager. Such objects are allocated either for zero-fill memory, or to act as
shadow or copy objects for a region of copy-on-write memory.
UVM uses a two-level amap-object memory mapping scheme, as previously de-
scribed in Chapter 3. In UVM’s two-level scheme, anonymous memory can be found
at either level. At the lower backing object level, UVM supports anonymous memory
uvm_object structures (“aobj” objects). These objects use the anonymous memory ob-
ject pager (“aobj” pager) to access backing store (i.e. the system “swap” area). UVM
aobj-objects are used to support the System V shared memory interface and to provide
pageable kernel memory.
UVM’s upper level of memory consists entirely of anonymous memory. Memory at
this level is referenced through data structures called anonymous memory maps (“amaps”).
Thus, the upper layer can be referred to as the “amap layer.” The amap layer sits on top
of the backing layer. When resolving a page fault, UVM must first check to see if the data
being faulted is in the amap layer or not. If it is not, then the backing object is checked.
Note that unlike the BSD VM, there is only one backing object — there are no chains of
objects to traverse. The UVM amap layer serves the same function as a shadow object
chain in BSD VM; the amap layer is used for zero-fill mappings and for copy-on-write.
In a typical kernel running UVM, most anonymous memory resides in the amap
layer rather than in “aobj” objects. The amap layer can also be used for page loanout and
transfer (see Chapter 7). Thus amap-based anonymous memory plays a central role in
UVM. The remainder of this chapter will focus on issues relating to amaps. More informa-
tion on aobj-based anonymous memory can be found in Section 8.3 on the aobj pager.
When a map entry is split, one of the entries can keep the old slot offset, but one of them
will need a new slot offset that has been adjusted by the size of the other entry, as shown in
Figure 4.2.
The format of an anon structure is shown in Figure 4.3. An anon contains a lock, a
reference counter indicating how many amaps have references to the anon, a pointer, and
an address in the swap area (the swap slot number). The anon’s pointer is a C union that
can be used in one of two ways. Unused anon structures use their pointer to form the free
list. In-use anon structures use their pointer to point to the vm_page structure associated
with the anon’s data. The swap slot number is non-zero only if the anon’s data has been
paged out. In that case, the location of the data on backing store is stored in the swap slot
number. Note that amaps track anons rather than pages because they need to keep track of
the extra information stored in the anon structure (which is not stored in the page structure).
Anons that have been paged out may not even have a page of physical memory associated
with them. When UVM needs to access data contained in an anon, it first looks for a page
structure. If there is none, then it allocates one and arranges for the data to be brought in
from backing store.
Figure 4.3: The vm_anon structure. Its fields are a spin lock (lock), a reference count
(ref), a union pointer (nxt/page) that is either a link to the next free anon or a pointer to
the anon's vm_page, and a swap slot number (swslot).
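Rendered as C, the anon might look like the following minimal sketch. The exact
declarations are assumptions based on the figure and the surrounding text (UVM's real
field names carry an an_ prefix), not the literal UVM source.

struct vm_page;                     /* defined elsewhere in the VM system */

struct vm_anon {
    simple_lock_data_t an_lock;     /* spin lock protecting this anon */
    int an_ref;                     /* number of amap references */
    union {
        struct vm_anon *an_nxt;     /* next anon on the free list (unused) */
        struct vm_page *an_page;    /* resident page (in use) */
    } u;
    int an_swslot;                  /* swap slot number; zero = unassigned */
};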
UVM defines a set of functions that make up the operations that can be performed
on an amap. The amap API is shown in Table 4.1.

Table 4.1: The amap API.
Function              Description
amap_alloc            allocate a new amap of the specified size; the new
                      amap does not contain any anons
amap_free             free an amap that is no longer in use
amap_ref              add a reference to an amap
amap_unref            drop a reference from an amap
amap_splitref         split a single reference into two separate references
amap_extend           increase the size of an amap
amap_add              add an anon to an amap
amap_unadd            remove an anon from an amap
amap_lookup           look in a slot in an amap for an anon
amap_lookups          look up a range of slots
amap_copy             make a copy-on-write copy of an amap
amap_cow_now          resolve all copy-on-write faults in an amap now (used
                      for wired memory)
amap_share_protect    protect pages in an amap that resides in a share map
Figure 4.4: An amap implemented as an array of anon pointers indexed by slot number.

Figure 4.5: An amap implemented as a dynamically sized list of slot-number/anon pairs
(in the example, anons “a,” “b,” and “c” occupy slots 4, 1, and 7).
Machines with limited RAM may prefer an amap implementation that is conservative in
its RAM usage, whereas machines that have lots of RAM may wish to devote part of it to
speeding up amap operations.
There are many ways that an amap can be implemented. For example, one pos-
sible implementation of an amap is an array of pointers to anon structures, as shown in
Figure 4.4. The pointer array is indexed by the slot number. This implementation makes
amap lookup operations fast; however, operations that require traversing all active slots in
the amap would have to examine every slot in the array to determine whether it is in use.
This is how amaps are implemented in SunOS4 VM.
Another possible amap implementation is shown in Figure 4.5. In this implemen-
tation, a dynamically sized array contains a list of anons and their slot number. If the list
array fills, a new larger one can be allocated to replace it. This implementation can save
physical memory, especially when an amap is mapping a sparsely populated area of vir-
tual memory. This implementation slows down amap lookups (because the list has to be
searched for the correct anon on each lookup); however, it handles a traversal of the entire
amap efficiently.
UVM includes a novel amap implementation that combines the ideas behind these
two simple implementations.
Figure 4.6: The vm_amap structure. Its fields are a spin lock (l), a reference count (ref),
flags including the “shared” flag (flags), the number of slots allocated (maxslot), the
number of active slots (nslot), the number of slots in use (nused), a contiguous int array
of active slots (slots), an int array of back pointers into am_slots (bckptr), an array of
anon pointers (anon), and an optional int array of per-page reference counts (ppref).
Figure 4.7: An example amap. Anons “a,” “b,” and “c” occupy slots 4, 1, and 7 of the
am_anon array; am_slots = {4, 1, 7}; and am_bckptr[4] = 0, am_bckptr[1] = 1, and
am_bckptr[7] = 2.
The am_anon array consists of a set of pointers to anon structures. This array is
indexed by slot number. The am_anon array allows quick amap lookup operations.
The am_slots array contains a contiguous list of the slots in the am_anon array
that are currently in use. This list is not in any special order. The am_slots array
can be used to quickly traverse all anons that are currently pointed to by the amap.
The am_bckptr array is used to keep the am_anon array and am_slots array
in sync. The am_bckptr array maps an active slot number to the index in the
am_slots array in which its contiguous entry lies. For example, in Figure 4.7
am_anon[7] points to anon “c” and its entry in the contiguous am_slots array
is at am_slots[2].
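The following sketch shows how the three arrays are kept synchronized during add and
remove operations; the simplified structure and the helper names are illustrative, not the
real amap_add and amap_unadd.

#include <stddef.h>

struct vm_anon;

struct vm_amap_sketch {
    int am_nused;               /* number of slots currently in use */
    int *am_slots;              /* contiguous list of in-use slot numbers */
    int *am_bckptr;             /* slot number -> index into am_slots */
    struct vm_anon **am_anon;   /* slot number -> anon (NULL if unused) */
};

/* install an anon at a slot in O(1), keeping am_slots and am_bckptr in sync */
static void
amap_add_sketch(struct vm_amap_sketch *amap, int slot, struct vm_anon *anon)
{
    amap->am_anon[slot] = anon;
    amap->am_bckptr[slot] = amap->am_nused;
    amap->am_slots[amap->am_nused++] = slot;
}

/* remove the anon at a slot in O(1): the last entry of am_slots is swapped
   into the hole left by the departing slot */
static void
amap_unadd_sketch(struct vm_amap_sketch *amap, int slot)
{
    int hole = amap->am_bckptr[slot];
    int last = --amap->am_nused;

    if (hole != last) {
        amap->am_slots[hole] = amap->am_slots[last];
        amap->am_bckptr[amap->am_slots[hole]] = hole;
    }
    amap->am_anon[slot] = NULL;
}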
Figure 4.9: Per-page reference counters before and after an unmapping. Before the unmap,
the amap’s overall reference count is one, and after the unmap its overall reference count is
two (one for each aref).
The cost of updating a block of per-page reference counters can be reduced by storing the
reference count only in the first slot of a contiguous chunk, and then having another array
to indicate the length of the chunk. An example of this encoding is shown in Figure 4.10.
Adding a second array to keep track of the length of a mapping would allow us
to change the reference count on a block of pages quickly. However, it also doubles the
memory needed by the per-page reference count system. UVM takes advantage of the large
number of “don’t care” values in the array to compact the reference and length arrays into
one single array. This is done by dividing the blocks of references into two groups: those
of one page in length and those of more than one page in length. Single page blocks can be
distinguished from multi-page blocks by the sign bit on their reference count. If the value
is positive, then it is a single page block. If the value is negative, then there is a multi-page
block, and the length of the block is the following value. There is one more problem: how
to handle blocks whose reference count is zero. Since zero is neither negative nor positive
it is impossible to tell the length of the block. This issue is addressed by offsetting all
reference counts by one. Thus, the reference count and length of a block can be found with
the code shown in Figure 4.11. An example of this encoding is shown in Figure 4.12.
An attractive feature of this original scheme is that its processing cost starts off
small and only rises as the amap is broken into smaller chunks. Processes that do not do
any partial unmaps will not have any per-page reference count arrays active. Processes that
only do a few partial unmaps will have only a few chunks in their per-page reference count
array.
refs before   refs after   before (ref/len)   after (ref/len)
     1            1             1/10               1/1
     1            0             x/x                0/8
     1            0             x/x                x/x
     1            0             x/x                x/x
     1            0             x/x                x/x
     1            0             x/x                x/x
     1            0             x/x                x/x
     1            0             x/x                x/x
     1            0             x/x                x/x
     1            1             x/x                1/1

Figure 4.10: Per-page reference counters with length. The symbol “x” means that this value
does not matter.
if (am_ppref[x] > 0) {
    /* positive value: a single-page block; counts are offset by one */
    reference_count = am_ppref[x] - 1;
    length = 1;
} else {
    /* negative value: a multi-page block; its length is in the next slot */
    reference_count = -am_ppref[x] - 1;
    length = am_ppref[x + 1];
}
Figure 4.11: Code to find the per-page reference count and length
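The inverse operation, storing a reference count and length into the compacted array,
might look like the following sketch (the helper name is illustrative):

/* store reference count "ref" for a block of "len" pages starting at slot
   "x", using the offset-by-one encoding described above */
static void
pp_setreflen_sketch(int *am_ppref, int x, int ref, int len)
{
    if (len == 1) {
        am_ppref[x] = ref + 1;          /* single-page block: positive value */
    } else {
        am_ppref[x] = -(ref + 1);       /* multi-page block: negative value */
        am_ppref[x + 1] = len;          /* length stored in the following slot */
    }
}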
Access to backing store for an aobj-based anonymous page of memory is done through
the uvm_object's pager operations pointer. This will not work for an anon because an
anon is not a uvm_object and thus does not have a pager operations pointer. Since each
anon is backed by the swap
area, there is no need for such a pointer. However, there still needs to be a way to map a
paged-out anon to its data on backing store. In the SunOS4 VM, this is done by statically
assigning a page-sized block of swap to each anon. There are several problems with this
scheme. First, it forces the system's swap size to be greater than or equal to the size
of physical memory so that each anon allocated at boot-time has a location to which it
can be swapped out (this problem was later addressed in Solaris [13]). Second, the static
allocation of anon to swap area makes it hard to cluster anonymous memory pageouts and
thus anonymous memory pageout takes longer.
Rather than statically assigning a swap block to each anon, in UVM the assignment
of a swap block to an anon is done by the pagedaemon at pageout time. The swapping layer
of the VM system presents swap to the rest of the kernel as a big “file.” This file, known as
the “drum,” can be accessed by the root user via the device file /dev/drum. Internal to
the swap layer of the VM, the drum might be interleaved across several different devices
or network files. To most of UVM the drum is simply a large array of page-sized chunks
of disk. Each page-sized chunk of swap is assigned a “slot” number. When an anon has
backing store assigned to it, it saves the slot number in its an_swslot field. Swap slot
number zero is reserved: it means that backing store has not yet been assigned to the anon.
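Viewed this way, translating a slot number into a byte offset within the drum is simple
arithmetic; the following sketch assumes slots are laid out consecutively, which the
interleaving internal to the swap layer may complicate.

#define PAGE_SIZE_SKETCH 4096       /* assumed page size, for illustration */

/* byte offset of a swap slot within the drum; slot zero is reserved */
static long
swslot_to_drum_offset(int swslot)
{
    if (swslot == 0)
        return (-1);                /* zero means "no backing store" */
    return ((long)swslot * PAGE_SIZE_SKETCH);
}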
The data in an anon can be in one of three possible states:
resident with no backing store slot assigned: In this case the anon's an_page pointer
points to the page in which the anon's data resides. The anon's an_swslot field is
zero, indicating that no swap slot has yet been assigned to this anon. All anons start
in this state.
non-resident: An anon can be in this state if a swap slot was allocated for it, its page was
paged out to its area of the drum, and then its page was freed. In this state the page
pointer will be null and the swap slot number will be non-zero. An anon can enter
this state from either of the other two states.
resident with backing store slot assigned: An anon enters this state if it was previously
non-resident and its data was paged in from backing store. In this case it has both
a page pointer and a swap slot assigned. As long as the data in the resident page
remains unmodified the data on backing store is valid.
Note that in UVM it is not possible to have an anon that has neither a page nor a swap slot
assigned to it.
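In terms of the vm_anon sketch shown earlier, the three states can be distinguished with
two tests, as in this illustrative helper:

#include <stddef.h>

/* classify an anon's state from its page pointer and swap slot number */
static const char *
anon_state(const struct vm_anon *anon)
{
    int resident = (anon->u.an_page != NULL);
    int on_swap = (anon->an_swslot != 0);

    if (resident && !on_swap)
        return ("resident, no swap slot assigned");
    if (!resident && on_swap)
        return ("non-resident, data on backing store");
    if (resident && on_swap)
        return ("resident, swap copy valid while the page is unmodified");
    return ("invalid: an anon always has a page or a swap slot");
}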
4.6.1 Copy-on-write
As previously described in Chapter 3, a process’ virtual address space is described by a map
structure. Each mapped region in the address space is described by map entry structures
that are kept in a sorted linked list off the map. Each map entry maps two levels: a top-
level anonymous amap layer, and a bottom-level backing object layer. Either or both of
these layers can be null. For copy-on-write mappings, all copy-on-write anonymous data
is stored in the amap layer. Each map entry describes the attributes of the mapping it maps.
With respect to copy-on-write, there are two important boolean attributes in the map entry:
copy-on-write: If copy-on-write is true, then it indicates that the mapped region is a copy-
on-write mapping and that all changes made to data in the mapping should be done
in anonymous memory. If copy-on-write is false, then that indicates that the mapping
is a shared mapping and that all changes should be made directly to the object that is
mapped.
needs-copy: If needs-copy is true, then it indicates that this region of memory needs its
own private amap, but it has not been allocated yet. The region either has no amap
at all, or it has a reference to an amap that it needs to copy on the first write to the
region. The needs-copy flag allows UVM to defer the creation of an amap until the
first memory write occurs. The needs-copy flag is used for copy-on-write operations.
When a process writes to an anon whose reference count is greater than one, the data
must first be copied into a new anon:
1. A new anon with a reference count of one is allocated.
2. A new page is allocated and the data is copied from the old anon's page to the new
anon's page.
3. The old anon’s reference counter is dropped by one and the new anon is installed in
the current amap.
4. The copy-on-write operation is complete and the write can proceed as normal.
Thus, data in an anon can only be changed if its reference count is one. Note that the pro-
cedure described above is incomplete because it does not take into account the possibility
that the system is out of memory. If the system is out of virtual memory, then the allocation
of the new anon will fail. This can happen if all of the system’s swap space is allocated.
To recover from this, the process can either wait in hopes that some other process will free
some memory, or it can be killed. Currently in UVM the process is killed, however, further
work to find a better solution might be useful. On the other hand, if the system is out of
physical memory then the allocation of the new page will fail. In this case, the process
must drop all its locks and wake up the pagedaemon. The pagedaemon will (hopefully)
free some pages by doing some pageouts and then wake the process back up.
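Putting the steps together, the copy-on-write write path might be sketched as follows. It
builds on the vm_anon and amap sketches above; anon_alloc, anon_free,
page_alloc_sketch, and page_copy_sketch are hypothetical stand-ins for the real
allocation and copy routines, and error handling is reduced to returning NULL.

struct vm_anon *anon_alloc(void);           /* hypothetical helpers */
void anon_free(struct vm_anon *);
struct vm_page *page_alloc_sketch(void);
void page_copy_sketch(struct vm_page *, struct vm_page *);

/* resolve a write to the anon at "slot" whose data may be shared */
static struct vm_anon *
anon_cow_sketch(struct vm_amap_sketch *amap, int slot)
{
    struct vm_anon *old = amap->am_anon[slot];
    struct vm_anon *new;

    if (old->an_ref == 1)
        return (old);               /* sole reference: write in place */

    new = anon_alloc();             /* fails if swap space is exhausted */
    if (new == NULL)
        return (NULL);
    new->u.an_page = page_alloc_sketch();   /* may require a pageout first */
    if (new->u.an_page == NULL) {
        anon_free(new);
        return (NULL);
    }
    page_copy_sketch(old->u.an_page, new->u.an_page);
    old->an_ref--;                  /* drop one reference on the old anon */
    amap->am_anon[slot] = new;      /* install the new anon in the amap */
    return (new);
}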
These are the basics of how copy-on-write works in UVM. Note that amaps combined
with the anon's reference counter perform the same role in copy-on-write that object chain-
ing does in BSD VM. However, UVM's amap-based scheme is simpler than BSD VM's
object chaining scheme: it requires only a two-level lookup rather than a traversal of an
object chain of arbitrary length, it does not require complex object collapse schemes, and
it does not leak swap memory. In the next section the details of copy-on-write with respect
to forking are described.
4.6.2 Forking
When a process forks a child process it must create a new address space for the child
process based on the parent process’ address space. In traditional Unix, regions mapped
copy-on-write are copy-on-write copied to the child process while regions mapped shared
are shared with the child process. In the Mach operating system the forking operation is
more complex because the parent process has more control over the access given to the
child process. Both the BSD VM and UVM include this feature from Mach. Each map
entry in a map has an inheritance value. The inheritance value is one of the following:
none: An inheritance code of “none” indicates that the child process should not get access
to the area of memory mapped by the map entry. The fork routine processes this by
skipping over this map entry and moving on to the next one in the list.
share: An inheritance code of “share” indicates that the parent and child processes should
share the area of memory mapped by the map entry. This means that both the ref-
erences to any amap and backing object mapped should be shared between the two
processes. Note that the copy-on-write flag of a mapping in the child process is al-
ways the same as in the parent (the child can’t suddenly get direct access to a backing
memory object).
(Footnote: there is also a “donate copy” inheritance value that does not appear in Mach
but is defined, though not implemented, in BSD VM.)
copy: An inheritance code of “copy” indicates that the child process should get a copy-
on-write copy of the parent’s map entry. First, in order to achieve proper copy-on-
write semantics, all resident pages in the parent’s mapping are write protected. This
ensures that a page fault will occur the next time the parent attempts to write to one
of the pages. As part of the fault recovery procedure, the page fault handler will
do the copy-on-write on the faulting page. Second, the child gets a reference to the
parent’s backing object, if any. Third, the child adds a reference to the parent’s amap
(if any) and sets the copy-on-write flag. Fourth, the needs-copy flag is set in both the
parent and child. In some cases (described later) the needs-copy flag must be cleared
immediately.
The default inheritance value for a region mapped copy-on-write is copy, while the default
inheritance value for a region mapped shared is share. Thus, the default behavior emulates
traditional Unix. However, it is possible for a process to change the inheritance attribute of
a region of memory by using the minherit system call.
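For example, a process that wants a region to be shared with (rather than copied to) its
future children might use minherit as sketched below; this follows the interface as it
appears on NetBSD-derived systems, though the exact constant names should be checked
against the target system.

#include <sys/mman.h>
#include <err.h>
#include <stddef.h>

/* request that the region [addr, addr+len) be shared with future children */
static void
share_with_children(void *addr, size_t len)
{
    if (minherit(addr, len, MAP_INHERIT_SHARE) == -1)
        err(1, "minherit");
}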
Note that in share inheritance the amap structure can be shared between mappings.
When this happens a special “shared” flag is set in the amap structure to indicate that
the amap is being shared. When the shared flag is set, the amap module is careful to
propagate changes in the amap to all processes that are sharing the mapping. For example,
when an anon is removed from an amap any resident page associated with that anon must
be unmapped from the process using the amap. If the amap is known not to be shared,
then simply removing the mapping for that page in the current process’ pmap is sufficient.
However, if the amap is being shared (and thus the shared flag is set) then not only must
the mapping be removed from the current process’ pmap, but it must also be removed
from any other process that is sharing the mapping. Removing a mapping from the current
process’ pmap can be accomplished by using the pmap remove function, while removing
all mappings of a page can be accomplished with the pmap page protect function.
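The choice between the two pmap operations can be sketched as follows. pmap_remove
and pmap_page_protect are the functions named above; the flag name AMAP_SHARED and
the simplified argument types are assumptions for illustration.

/* unmap the resident page of an anon that is being removed from an amap */
static void
amap_unmap_page_sketch(int amap_flags, struct pmap *pmap, vaddr_t va,
    struct vm_page *pg)
{
    if ((amap_flags & AMAP_SHARED) == 0) {
        /* private amap: only the current process can have the page mapped */
        pmap_remove(pmap, va, va + PAGE_SIZE);
    } else {
        /* shared amap: revoke every mapping of the page, in every pmap */
        pmap_page_protect(pg, VM_PROT_NONE);
    }
}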
In UVM, the function that copies the address space of a process during a fork opera-
tion is uvmspace_fork. The C pseudo-code for this function is shown in Figure 4.13.
Note that each map entry in the parent’s address space is copied to the child process based
on its inheritance value.
It is often easier to understand the relationship between amaps and forking, faulting,
and inheritance changes by looking at a memory diagram that shows the state of the data
structure through these critical events in a process’ lifetime. Table 4.2 defines the symbols
used in a memory diagram. Figure 4.14 shows the arrangement of the data structure in a
memory diagram. Note that in a map entry the upper left corner contains the mapping type
92
uvmspace_fork() {
    for (each map entry in the parent's map) {
        switch (entry->inherit) {
        case VM_INHERIT_NONE:
            /* do nothing */
            break;
        case VM_INHERIT_SHARE:
            /* add shared mapping to child */
            break;
        case VM_INHERIT_COPY:
            /* add copy-on-write mapping to child */
            break;
        default:
            panic("uvmspace_fork");
        }
    }
    return(new vmspace);
}

Figure 4.13: C pseudo-code for the uvmspace_fork function.
Table 4.2: Symbols used in memory diagrams. The symbols denote an anon structure
with its reference count, an amap structure with its reference count, the needs-copy flag
(NC) being true, the “shared” amap flag being true, the “copy” and “share” inheritance
values, events in which a process forks, changes an inheritance value, triggers a read-fault,
or triggers a write-fault, and a “don't care” value.
Figure 4.14: The arrangement of the data structures in a memory diagram: a map entry
(labeled with its process id) points to its amap, which points to its anons.

Figure 4.15: A fork with share inheritance. After the fork the amap's “shared” flag is set
and its reference count goes from one to two, while the shared anon is unchanged.
and the lower right corner contains the inheritance value. Figure 4.15 shows what happens
to a map entry during a fork when its inheritance is set to share. Note that the amap’s share
flag becomes true after the fork and the amap’s reference count goes from one to two while
the anon (that is shared) remains unchanged.
Figure 4.16 shows what happens to a map entry during a fork when its inheritance
is set to copy. Note that in this case the amap gets an additional reference, but it is not a
shared one. Both process one and process two have their needs-copy flag set. Also, note
that the mappings in process one have been write protected to ensure that the copy-on-write
operation will happen. The needs-copy flag will remain set on the map entries until the first
write fault either in process one or process two. If process one had a write fault on anon “A”
then the result would be as shown in Figure 4.17. When that write fault occurs, the needs-
copy flag will first need to be cleared with amap_copy before the fault can be processed.
This causes process one to get a new amap and both anons “A” and “B” get a reference
Figure 4.16: A fork with copy inheritance. Both processes reference the amap (reference
count two, but not shared), needs-copy is set in both, and the parent's mappings have been
write protected.
Figure 4.17: Clearing needs-copy after a copy-inheritance fork. Needs-copy was cleared
because process one took a write fault on anon “A” causing a new anon (“C”) to be allo-
cated.
count of two. Once the new amap has been created, then a new anon (“C”) for process one
can be allocated and inserted into it. After the copy-on-write fault anon “B” is still shared
between the two processes. Note that process two still has its needs-copy flag set: this is
harmless. If process two has a write fault on the mapped area, amap_copy will be called
on the mapping to clear needs-copy. The amap_copy function will notice that the amap
pointed to by process two has a reference count of one. This indicates that process two is
the only process referencing the amap, and thus the amap does not need to be copied. In
this case amap_copy can clear the needs-copy flag without actually copying the amap.
Figure 4.18: An erroneous fork. Process two is the parent process that has needs-copy set
and shared inheritance. Process three is the child process. Process one is also using the
amap.
Figure 4.19: Erroneous fork after process three write-faults. Processes two and three should
be sharing the same amap, but they are not.
If process three then takes a write fault, it will clear needs-copy by creating a new amap
for itself, producing the erroneous result shown in Figure 4.19. Given the data structures
involved there is no way for process three to know that it needs to update process two's
amap pointer to point at the new amap it just created.
UVM avoids this problem by checking the needs-copy flag on the parent process
during a share-inheritance fork. If the flag is set, then UVM calls amap_copy
immediately so that the amap to be shared is created before a problem like the one shown
in Figure 4.19 can occur. The correct handling of the erroneous fork situation is shown in
Figure 4.20.
Figure 4.20: The correct handling of the erroneous fork: amap_copy is called during the
fork, so processes two and three share the newly created amap.
A related problem arises when an amap is being shared by several processes and one of
the processes mapping the amap has its inheritance value set to copy. This configuration
is shown in Figure 4.21. Note that processes one and two should be sharing the amap.
Now suppose process two forks. If the scheme shown in Figure 4.16 were followed
the result would be as shown in Figure 4.22. That configuration is wrong. A write fault by
any of the three processes will cause problems:
If process one writes to a page in the amap, process two will see the change because
it is sharing the same amap with process one. However, process three will also see
the change, and it should not.
If process two writes to a page in the amap, the needs-copy flag on its mapping would
cause a new amap to be allocated to store the changed page. This is wrong because
processes one and two are supposed to be sharing the same amap.
If process three writes to the mapping then it will get a new amap as expected, and
the copy-on-write will proceed normally. However process two will still have its
needs-copy flag set after process three’s write is finished, and that will lead to the
problem described above.
The correct way to handle a copy-inherit fork when the amap is being shared is to
use amap copy during the fork to immediately clear needs-copy in the child process. The
correct result of the fork is shown in Figure 4.23. The share flag has been added to the
amap structure in UVM to detect this situation.
Figure 4.21: An amap shared by processes one and two, where process two's inheritance
value is set to copy.
Figure 4.22: The incorrect result of process two forking under the scheme of Figure 4.16:
process two still shares the amap with process one but has needs-copy set.
After a correctly handled copy-inheritance fork:
The data in the amap layer immediately becomes copy-on-write in both the parent
and child process.
The data in the object layer remains shared in the parent process, but becomes copy-
on-write in the child process.
4.8 Summary
In this chapter we have reviewed the function of anonymous memory in UVM. We in-
troduced the amap interface and described UVM’s reference amap implementation. We
explained how UVM handles partial unmaps of regions of anonymous memory efficiently.
We then explained the role of amaps in copy-on-write and how Mach-style memory inher-
itance interacts with the amap system.
Chapter 5

Page Fault Handling
This chapter describes the operation of uvm_fault, the UVM page fault handler. Page
faults occur when a process or the kernel attempts to read or write virtual memory that is not
mapped to allow such an access. This chapter includes an overview of UVM’s page fault
handling, a discussion of error recovery in the fault handler, and a comparison between the
BSD VM’s fault routine and UVM’s fault routine. Note that the effect of page loanout on
the fault routine is described in Chapter 7.
The fault routine operates by first looking up the faulting address in the faulting process'
VM map. Once the appropriate map entry is found the fault routine must check for the
faulting page in each of UVM's two levels of mapping. The fault routine first checks the
amap level to see if the faulting page is part of anonymous memory. If not, then the fault
routine requests the faulting page from the uvm_object associated with the faulting map
entry. In
either case, if a copy-on-write fault is detected a new anon is allocated and inserted into
the amap at the appropriate position. Finally, the fault routine establishes a mapping in
the faulting process’ pmap and returns successfully. High-level pseudo code for the fault
routine is shown in Figure 5.1. The remainder of Section 5.1 describes the step-by-step
operation of UVM’s page fault routine.
faulting virtual address: The machine-dependent code obtains from the MMU the ad-
dress that was accessed in virtual memory to trigger the page fault.
faulting vm map: Based on the faulting virtual address, hardware information, and the
pointer to the currently running process the machine-dependent code must determine
if the fault occurred in memory managed by the current process’ vm map or by the
kernel’s vm map. The faulting map is passed to the fault routine.
fault type: Based on information from the hardware, the machine-dependent code must
determine the type of page fault that occurred. The fault type is “invalid” if the
virtual address being accessed was not mapped. The fault type is “protect” if the
virtual address being access was mapped, but the access violated the protections on
the mapping (e.g. writing to a read-only mapping). Finally, there is an internally
used fault type, “wire,” that is used when wiring pages.
access type: Based on information from the hardware, the machine-dependent code must
determine if the faulting process was reading or writing memory at the time of the
fault. This is the fault access type.
The fault routine will use these four pieces of information to process the fault. Once the
fault routine has resolved the fault it returns a status value to the machine-dependent code
that called it. The fault routine will either return “success” indicating that the fault was
resolved and that the faulting process can resume running, or it will return an error code
that indicates that faulting process should be sent an error signal. If there is an error and
the faulting process is the kernel the system will usually crash (“panic”).
Figure 5.2: The faultinfo data structure. The first three fields are filled out at the start of
the fault routine. The remaining fields are filled in by the uvmfault_lookup function
when it does the map lookup. All virtual addresses are truncated to a page boundary.
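A C sketch of the structure, limited to the fields that the text names, might look like this;
the field names and types are assumptions, and the remaining fields are omitted.

/* sketch of the uvm_faultinfo structure (partial; names assumed) */
struct uvm_faultinfo_sketch {
    struct vm_map *orig_map;        /* faulting map, filled in at start */
    vaddr_t orig_rvaddr;            /* faulting address (page aligned) */
    vsize_t size;                   /* fault size (one page for faults) */
    struct vm_map *parent_map;      /* original map when a share map or
                                       submap was entered, else NULL */
    unsigned int parent_ver;        /* parent map's version number */
    struct vm_map *map;             /* lowest-level ("current") map */
    unsigned int map_ver;           /* current map's version number */
    struct vm_map_entry *entry;     /* map entry in the current map */
};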
the fault routine has to do a lengthy I/O operation to get the needed data, then it unlocks
the faulting map so that other processes or threads may access the map while the I/O is
in progress. Once the I/O operation is complete, the fault routine must relock the map to
check that its mapping is still valid.
If a map lookup results in a map entry that points to a share map or submap, then
the lookup must be repeated in that map. Since map entries in share maps and submaps
are not allowed to point to share maps or submaps themselves, only two levels of maps can
be encountered: the faulting map (called the “parent map”) and the share or submap. Note
that share maps and submaps have their own locks that must be held during lookups.
While the fault routine must handle map locking, share maps, and submaps properly,
the details can be isolated in a data structure and a few helper functions so that the bulk
of the fault routine code doesn’t have to worry about how many maps it is dealing with.
The data structure that contains information on the current mapping of the faulting address
is called the uvm_faultinfo data structure, shown in Figure 5.2. The faultinfo data
structure is often called “ufi” for short.
The first thing the fault routine does is to store the faulting map, address, and size in
its faultinfo data structure. For page faults, the fault size is always the page size (the fault
size field is for use by other functions such as page loanout).
The fault routine then calls the uvmfault_lookup function to look up the map
entry that maps the faulting address. The fault lookup function takes a faultinfo structure
with its first three fields filled in as input. It then locks the faulting map, and does a lookup
on the faulting address. If the faulting address is not mapped, the lookup routine returns
an error code. If the faulting address is in a region mapped by a share or submap, then
the lookup routine makes the original map the “parent” map, locks the new map, and then
repeats the lookup in the new map. If the lookup is successful, then the fault lookup func-
tion will fill in the remaining seven fields of the faultinfo structure. For faulting addresses
that map to a share map or submap, the parent map field will point to the original map,
and the map field will point to the share or submap. If there is no share or submap, then
parent map will be null and map will point to the original map. Thus, the lowest-level
map always ends up in the map field of the faultinfo structure. This map is sometimes
referred to as the “current” map. The entry field is always a map entry in this map. If
successful, the lookup function returns with the maps locked.
Note that the faultinfo structure contains version numbers for both the parent map
and the current map. If the fault routine unlocks a map for I/O, these version numbers can
be used to detect if the map was changed while unlocked. If the map was not changed, then
the fault routine knows that the lookup information cached in its faultinfo structure is still
valid, and thus a fresh lookup is not needed. However, if the version numbers have changed,
then the fault routine must repeat the lookup in case the mappings have changed while the
map was unlocked.
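Version-number relocking might look like the following sketch, which builds on the
faultinfo sketch above and assumes the map structure carries a version field that is
incremented on every change (the field name is assumed; parent-map handling is omitted).

/* try to relock the current map using its cached version number; returns
   nonzero if the cached lookup in the faultinfo is still usable */
static int
uvmfault_relock_sketch(struct uvm_faultinfo_sketch *ufi)
{
    vm_map_lock_read(ufi->map);
    if (ufi->map_ver != ufi->map->timestamp) {  /* field name assumed */
        vm_map_unlock_read(ufi->map);   /* map changed while unlocked */
        return (0);                     /* caller must redo the lookup */
    }
    return (1);                         /* cached lookup data still valid */
}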
The UVM fault routine has four helper functions relating to map lookups. They are
summarized in Table 5.1.
Table 5.1: Map lookup helper functions.
Function              Description
uvmfault_lookup       Looks up a virtual address in a map, handling
                      share maps and submaps. The virtual address and
                      map are passed to this function in a faultinfo
                      structure. If successful, the lookup function
                      returns with the maps locked.
uvmfault_unlockmaps   Unlocks the map pointed to by a faultinfo
                      structure.
uvmfault_relock       Attempts to relock the maps pointed to by
                      a faultinfo structure without doing a new
                      lookup. This will succeed only if the maps'
                      version numbers have not changed since they
                      were locked.
uvmfault_unlockall    Unlocks an anon, an amap, a uvm_object, and
                      the maps in a faultinfo structure.
This is done by calling the uvmfault_amapcopy function. This function drops
the read-lock on the maps, obtains write-locks on the maps, and clears needs-copy
by calling amap_copy. Then the fault routine restarts the fault processing.
Once needs-copy has been cleared, the fault routine checks the current map entry to
ensure that it has something mapped in at one of the two levels. If the map entry
has neither an amap nor a uvm_object attached to it, then there are no memory
objects associated with it. In this case the fault routine unlocks the maps and returns
an invalid address error.
The first thing that the fault routine must do in an anon fault is ensure that the data in
the faulting anon is resident in an available page of physical memory. This is done with
the “anonget” function (called uvmfault_anonget). The anonget function is called by
the fault routine with faultinfo, amap, and anon as arguments. These structures must be
locked before the function is called. The anonget function must be prepared to handle the
following three cases:
non-resident: The data has been paged out to the swap area.
unavailable resident page: The data is resident, but the page it resides in is currently
busy.
available resident page: The data is resident and its page is available for use.
while (1) {
if (page resident) {
if (page is available)
return(success); /* done! */
mark page wanted;
unlock data structures;
sleep;
} else {
allocate new page;
unlock data structures and read page from swap;
}
relock data structures;
} /* end of while loop */
Once the anon’s data is resident, the fault routine divides anon faults into two sub-cases, as
shown in Figure 5.4.
In case 1A, the page in the anon is mapped directly into the faulting process’ address
space. Case 1A is triggered by either a read-fault on an anon’s page, or by a write-fault on
an anon whose reference count is equal to one. Case 1B is triggered by a write-fault on
an anon whose reference count is greater than one. Case 1B is required for proper copy-
on-write operation. An anon’s reference count can only be greater than one if two or more
Figure 5.4: Anon fault sub-cases. The difference between case 1a and case 1b fault is that
in case 1b the data must go through the copy-on-write process in order to resolve the fault.
processes are sharing its page as part of a copy-on-write operation. Thus, when writing to
an anon in this state, a copy of the data must be made first to preserve the copy-on-write
semantics. Likewise, for case 1A, when faulting on anons whose reference count is greater
than one, the faulting page must be mapped in read-only to prevent it from being changed
while the reference count is greater than one.
To handle case 1B, the fault routine must allocate a new anon containing a new page.
The data from the old anon must be copied into the new anon, and then the new anon must
be installed in place of the old one in the amap. The new anon will have a reference count
of one, and the old anon will lose one reference. The fault routine must be prepared to
handle the sort of out-of-memory errors described in Section 5.2 while handling case 1B.
Once the copy-on-write process of case 1B has been addressed, the new page can be entered
into the faulting process’ pmap, thus establishing a low-level mapping for the page being
faulted. This is done with a call to the pmap_enter function. Once the low-level mapping
has been established the fault routine can return successfully. The faulting process will
resume execution with the memory reference it made to the page that was just faulted on.
Earlier in the fault process, the fault routine did a locked get to find any resident pages in
the object that were close to the faulting page. If the faulting page was resident and not
busy, then the locked get returned the faulting page as well. The locked get sets the “busy”
flag on the pages it returns. The fault routine does not have to worry about getting a page
that has been marked busy by another process, because the locked get will not return such
a page.
If the page is not resident, then the fault routine must issue an unlocked get request
to the pager get function. To do this, the fault routine unlocks all its data structures (with
the exception of the object — the pager get function will unlock that before starting I/O)
and calls the pager get function. After starting I/O, the process will sleep in the pager get
Figure 5.5: Object fault sub-cases. In case 2a the faulting process has direct access to
the object’s memory, while in case 2b the faulting process has copy-on-write access to the
object.
function until the data is available. When the data is available, the pager get function will
return with the faulting page resident and busy. The fault routine must check for errors and
relock the data structures.
Once the object’s data is resident, the fault routine divides object faults into two sub-cases,
as shown in Figure 5.5.
In case 2A the object’s page is mapped directly into the faulting process’ address
space. Case 2A is triggered by either a read-fault on an object’s page, or by a write-fault
on an object that is mapped shared (i.e. not copy-on-write). If a read-fault occurs on a
copy-on-write region, then the page will be mapped in read-only to ensure that the faulting
process cannot attempt to modify it without generating a page fault. Note that if the object
is a memory mapped file, then changes made to its pages will be reflected back to the file
and seen by other processes.
Case 2B is triggered by a write-fault on an object that is mapped copy-on-write,
or by a fault on a copy-on-write region that has no object associated with it. The latter
case happens with regions of virtual memory that are mapped “zero-fill.” When data is
copied from the lower-level object layer to the upper-level anon layer the data is said to be
“promoted.” To promote data, the fault routine must allocate a new anon with a new page
and copy the data from the old page to the new page. The fault routine must be prepared to
handle the sort of out-of-memory errors described in Section 5.2 while handling case 2B.
Once the object page is resident and case 2B is taken care of, then the new page can be
entered into the faulting process' pmap with pmap enter. After mapping the page,
the fault routine must clear the faulting page’s busy flag, and check the faulting page’s
wanted flag. If the page is wanted, then the fault routine wakes up any process that may
be sleeping waiting for it. Once this is done, the fault routine can return successfully. The
faulting process will resume execution with the memory reference it made to the page that
was just faulted on.
5.1.8 Summary
In summary, a page fault happens as follows: First the faultinfo structure is filled in for
the lookup operation. After the lookup operation is performed, the protection is checked
and needs-copy is cleared if necessary. Next, available resident neighboring pages are
mapped in. Finally, the fault routine resolves the fault by determining which case applies,
getting the needed data, and mapping it in. Pseudo code showing the overall structure
of UVM's fault routine is shown in Figure 5.6.
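The overall structure can also be summarized with the following simplified sketch (locking
and error recovery are omitted; this is illustrative rather than UVM's exact code):

    uvm_fault(map, va, access_type)
    {
        fill in the uvm_faultinfo structure and look up va in the map;
        check protection; clear needs-copy if necessary;
        map in neighboring resident pages;
        if (the page is in the anon layer)
            handle case 1A or 1B;          /* copy-on-write if the anon is shared */
        else {
            get the page from the object;  /* locked or unlocked get */
            handle case 2A or 2B;          /* promote if copy-on-write */
        }
        pmap_enter(pmap, va, page, protection);
        return(success);
    }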
If the faulting address is not part of a valid mapping, the fault routine returns an invalid
address error code to the caller. This error most often occurs when buggy programs attempt
to reference data through an invalid pointer.
Because of UVM’s two-level mapping scheme, there is no need for UVM’s fault routine to
worry about races down an object chain or objects being collapsed out from under it. This
simplifies UVM’s fault handling.
5.3.3 Mapping Neighboring Resident Pages
As explained earlier, the UVM fault routine maps in neighboring resident pages while
processing a fault. This usually reduces the total number of page faults processes have to
take. In contrast, the BSD VM fault routine ignores neighboring pages. If they are needed
they will be faulted in later. Thus, UVM reduces a process' fault overhead. For example, the
number of page faults for the command “ls /” was reduced from 59 to 33 by UVM (see
Section 10.5.3).
It should be noted that UVM’s two-level mapping scheme makes mapping in neigh-
boring pages relatively easy. The neighboring pages are either in the amap or the object
layer. In the BSD VM system, searching for neighboring pages would require the additional
overhead of walking the object chains.
5.4 Summary
In this chapter we have reviewed the function of the UVM fault routine. We explained the
steps the fault routine goes through to resolve a fault. The possible errors that the fault
routine may encounter and how it recovers from them were described. Finally, some of the
differences between the UVM fault routine and the BSD VM fault routine were presented.
Chapter 6
Pager Interface
This chapter describes UVM’s pager interface. The pager is responsible for performing
operations on uvm object structures. Such operations include changing the reference
count of a uvm object and transferring data between backing store and physical mem-
ory. This chapter contains a description of the role of the pager, explanations of the pager
functions, and a comparison of UVM’s pager interface with the BSD VM pager interface.
UVM has three pagers:
aobj: The aobj pager is used by uvm object structures that contain anonymous memory.
device: The device pager is used by uvm object structures that contain device memory.
vnode: The vnode pager is used by uvm object structures that contain normal file mem-
ory.
UVM performs all pager-related operations on a uvm object through its pager functions.
This allows UVM to treat uvm object structures that represent different types of memory
alike.
Note that there are two types of memory objects in UVM: uvm object struc-
tures and anon structures. Pagers only perform operations on uvm object memory ob-
jects. For the remainder of this chapter the term “object” will be used as shorthand for
“uvm object structure.”
Function          Description
pgo init          init a pager's private data structures – called when the system is booted
attach            gain a reference to a kernel data structure that can be memory mapped by an object
pgo reference     add a reference to an object
pgo detach        drop a reference to an object
pgo get           get pages from backing store
pgo asyncget      start an asynchronous I/O operation to get pages from backing store
pgo fault         object fault handler function
pgo put           write pages to backing store
pgo flush         flush an object's pages from memory
pgo cluster       determine how many pages can be fetched or written at the same time
pgo mk pcluster   make a cluster for a pageout operation
pgo shareprot     change the protection of an object that is mapped by a share map
pgo aiodone       handle the completion of an asynchronous I/O operation
pgo releasepg     free a released page
Table 6.1: The Pager Operations. Note that the attach function is not part of the
uvm pagerops structure.
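Each pager supplies these operations through a structure of function pointers. The
following C sketch shows the general shape of such a structure; the return types and the
omitted argument lists are assumptions, not UVM's exact declarations:

    struct uvm_pagerops {
            void    (*pgo_init)();             /* init private data at boot */
            void    (*pgo_reference)();        /* add a reference to an object */
            void    (*pgo_detach)();           /* drop a reference to an object */
            int     (*pgo_get)();              /* get pages from backing store */
            int     (*pgo_asyncget)();         /* start an asynchronous get */
            int     (*pgo_fault)();            /* object fault handler */
            int     (*pgo_put)();              /* write pages to backing store */
            boolean_t (*pgo_flush)();          /* flush an object's pages */
            void    (*pgo_cluster)();          /* compute cluster bounds */
            struct vm_page **(*pgo_mk_pcluster)(); /* build a pageout cluster */
            void    (*pgo_shareprot)();        /* change share map protection */
            void    (*pgo_aiodone)();          /* async I/O completion */
            void    (*pgo_releasepg)();        /* free a released page */
    };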
Each pager provides its own attach function:
uao create: the aobj pager attach function. The uao create function allocates a
new aobj-object of the specified size and returns a pointer to its uvm object struc-
ture. This type of object is used to map System V shared memory.
udv attach: the device pager attach function. This function takes a dev t device iden-
tifier and returns a pointer to its uvm object. If the device does not currently have
a uvm object structure allocated for it then the attach function will allocate one
for the device.
uvn attach: the vnode pager attach function. This function takes a pointer to a vn-
ode structure and returns a pointer to the vnode’s uvm object structure. If the
uvm object structure is not being used, then uvn attach will initialize it. The
uvn attach function also takes an access level as an argument. The access level is
used to determine if the vnode is being mapped writable or not.
All of the attach functions return NULL if the attach operation fails.
The pager get routine takes the following arguments:
uobj: A pointer to the locked object whose data is being requested. This object may be
unlocked for I/O only in an unlocked get operation.
offset: The location in the object of the first page being requested.
pps: An array of pointers to vm page structures. There is one pointer per page in the
requested range. The calling function initializes the pointers to null if the get routine
is to fetch the page. Any other value (including the “don’t care” value) will cause the
get function to skip over that page.
npagesp: A pointer to an integer that contains the number of page pointers in the “pps”
array. For locked get operations, the number of pages found is returned in this integer.
access type: Indicates how the caller wants to access the page. The access type will either
be read or write.
advice: A hint to the pager as to the calling process’ access pattern on the object’s memory
(for example, normal, random, or sequential). This value usually comes from the map
entry data structure’s advice field.
flags: Flags for the get operation. The get function has two flags: “locked” and “allpages.”
The get function returns an integer code value that indicates the results of the get
operation. Possible return values are:
VM PAGER OK: The requested pages were fetched successfully.
VM PAGER BAD: The requested pages are not part of the object.
VM PAGER ERROR: An I/O error occurred while fetching the data. This is usually the
result of a hardware problem.
VM PAGER UNLOCK: The requested pages cannot be retrieved without unlocking the ob-
ject for I/O. This value can only be returned by a locked get operation.
The get operation sets the busy flag on any page it returns on behalf of the caller. When
the caller is finished with the pages it must clear the busy flag. Note that there is also a
VM PAGER PEND return value that is used for asynchronous I/O. Since the get operation
is always synchronous it cannot return this value.
The get operation can be called by the fault routine or the loanout routine (described
in Chapter 7). Once the calling routine looks up the faulting or loaning address in the map,
it then determines the range of interest. The range includes the page currently being faulted
or loaned and any neighboring pages of interest. The calling routine then checks the amap
layer for the page. If the page is not found in the amap layer and there is an object in
the object layer, then the calling routine performs a locked get operation using the object’s
get routine. The calling routine passes an array of page pointers into the get routine with
the faulting or loaning page as the center page. The pointers for neighboring pages that
are already in the amap layer are set to “don’t care.” The rest of the pointers are set to
null. If the locked get finds the center page resident and available it returns VM PAGER OK,
otherwise it returns VM PAGER UNLOCK. If the pager returns “ok” then the calling routine
has all the pages it needs. It need only clear the busy flag on the pages it gets from the get
operation before returning. Note that a get operation can still return neighboring resident
pages even if it returns VM PAGER UNLOCK.
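For example, the calling routine's locked get can be sketched as follows; the flag and
“don't care” symbol names are assumptions based on the description above:

    /* one pointer per page in the range of interest; faulting page at pps[center] */
    for (i = 0; i < npages; i++)
            pps[i] = page_in_amap_layer(i) ? PGO_DONTCARE : NULL;
    result = uobj->pgops->pgo_get(uobj, offset, pps, &npages,
                                  center, access_type, advice, PGO_LOCKED);
    if (result == VM_PAGER_OK) {
            /* center page is resident and busy; clear the busy flags when done */
    }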
If the locked get returns VM PAGER UNLOCK, then the calling routine will eventu-
ally have to unlock its data structures and perform an unlocked get. Note that an unlocked
get operation requires a locked object. The get routine will unlock the object before start-
ing I/O. When I/O is complete, the get routine will return with the object unlocked. The
calling routine must then relock all its data structures and reverify the map lookup before
continuing to process the fault.
The pager fault routine takes the following arguments:
ufi: A uvm faultinfo data structure that contains the current state of the fault.
vaddr: The virtual address of the first page in the requested region in the current vm map’s
address space.
pps: An array of vm page pointers. The fault routine ignores any page with a non-null
pointer.
The fault operation has the same set of return values as the get operation. In order for
the fault operation to successfully resolve a fault, it must call pmap enter to establish a
mapping for the page being faulted.
The put routine takes the following arguments:
uobj: The locked object to which the pages belong. The put function will unlock the object
before starting I/O and return with the object unlocked.
pps: An array of pointers to the pages being flushed. The pages must have been marked
busy by the caller.
flags: The flags. Currently there is only one flag: PGO SYNCIO. If this flag is set then
synchronous I/O rather than asynchronous I/O is used.
The put operation returns either one of the return values used by the get operation or —
when an asynchronous I/O operation has been successfully started — VM PAGER PEND.
In cases when the put operation returns VM PAGER PEND, the pager, rather than the
calling function, will unbusy the pages and set their clean flags.
If the put routine returns anything other than VM PAGER PEND then the calling
function must clear the pages’ busy flags. If the put was successful it must also set the
pages’ clean flags to indicate that the page and backing store are now in sync.
The flush operation is used by the msync system call to write memory to backing store
and invalidate pages. The vnode pager uses its flush operation to clean out all pages in a
vnode’s uvm object before freeing it.
The flush routine takes the following arguments:
uobj: The locked object to which the pages belong. The flush function returns with the
object locked. The calling function must not be holding the page queue lock since
the flush function may have to lock it. The flush function will unlock the object if it
has to perform any I/O.
start: The starting offset in the object. Flushing begins at this offset.
end: The ending offset in the object. Pages from the ending offset onward are not flushed.
The flush operation returns “TRUE” unless there was an I/O error.
The flush operation’s flags control how the object’s pages are flushed. The flags are:
PGO CLEANIT: Write dirty pages to backing store. If this flag is set, then the flush func-
tion will unlock the object while performing I/O. If this flag is not set, then the flush
function will never unlock the object.
PGO SYNCIO: Causes all page cleaning I/O operations to be synchronous. The flush rou-
tine will not return until I/O is complete. This flag has no effect unless the “cleanit”
flag is specified.
PGO DOACTCLUST: Causes the flush function to consider pages that are currently
mapped for clustering when writing to backing store. Normally when writing a page
to backing store the flush function looks for neighboring dirty inactive pages to write
along with the page being flushed. Specifying this flag causes the flush function to
also consider active (mapped) pages for clustering.
PGO DEACTIVATE: Causes an object’s pages to be marked inactive. This makes them
more likely to be recycled by the pagedaemon if memory becomes scarce.
PGO FREE: Causes an object’s pages to be freed. Note that PGO DEACTIVATE and
PGO FREE cannot both be specified at the same time. If both PGO CLEANIT and
PGO FREE are specified, the pages are cleaned before they are freed.
PGO ALLPAGES: Causes the flush operation to ignore the “start” and “end” arguments
and perform the specified flushing operation on all pages in the object.
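As an illustration, an msync-style clean of part of an object might be issued as in the
following sketch (the argument order is an assumption):

    lock(uobj);
    ok = uobj->pgops->pgo_flush(uobj, start, end, PGO_CLEANIT | PGO_SYNCIO);
    /* flush returns with the object locked; ok is TRUE unless an I/O error occurred */
    unlock(uobj);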
[Figure: the BSD VM vnode mapping structures (a vm object pointing to a separate
vm pager with its pagerops and vn pager referencing the vnode) compared with UVM's
uvm object embedded in the uvm vnode and sharing a single pagerops structure.]
In BSD VM, callers reach a pager through small wrapper functions such as
vm pager put pages, which simply invoke the pager's pgo putpages operation. The extra
overhead of such a function call is not needed. In UVM all pager operations are routed
directly to the pager rather than through a small function like vm pager put pages. The
BSD VM wrapper looks like this:
vm_pager_put_pages(pager, mlist, npages, sync)
	vm_pager_t pager;
	vm_page_t *mlist;
	int npages;
	boolean_t sync;
{
	if (pager == NULL)
		panic("vm_pager_put_pages: null pager");
	return ((*pager->pg_ops->pgo_putpages)
	    (pager, mlist, npages, sync));
}
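In UVM the equivalent request is a direct call through the object's shared pager
operations, along these lines:

    /* sketch: direct dispatch through the object, no wrapper function */
    result = (*uobj->pgops->pgo_put)(uobj, pps, npages, flags);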
Another difference between BSD VM and UVM is how the pager get operation gets
pages. In BSD VM, the process fetching the data from backing store must allocate a free
page, add it to the object, and then pass it into the pager get routine. In UVM, the process
fetching the data does not allocate anything. If the pager needs a free page it allocates it
itself. This allows the pager to have full control over when pages get added to the object.
A related problem with the BSD VM pager get operation is that it is required to
return a vm page structure. This causes problems for pagers that manage device memory
because device memory does not have any vm page structures associated with it. Consider
what happens when a process generates a page fault on a device’s page under BSD VM.
The fault routine will allocate a free page and add it to the device’s memory object. It will
then call the pager get operation to fill the page with “data” (actually the device’s memory).
To service this request, the device pager must use the kernel memory allocator to allocate a
fictitious vm page structure that is associated with the device’s memory. The device pager
get operation must then remove the normal page of memory passed into it from the object,
free the page, and then install the fictitious page in the object and return. This will allow
the fault routine to resolve the fault. The BSD VM system must be careful that it does
not allow a fictitious page to be added to the page queues or free page list as the fictitious
page’s memory is part of device memory.
In UVM fictitious pages are not necessary. The device pager uses the pager fault
operation to map device memory into a faulting process’ address space without the need
for fictitious pages.
6.3.3 Other Minor Differences
There are two noteworthy minor differences between pagers in UVM and BSD VM:
• In BSD VM, the pager get operation can return VM PAGER FAIL if the offset of the
requested page is valid but the page does not reside in the object. This is used when
walking an object chain. If a pager returns VM PAGER FAIL then the fault routine
knows to skip to the next object. Since UVM does not have any object chains, the
VM PAGER FAIL error code is not used.
• UVM provides a generic structure (uvm aiodesc) for maintaining the state of
asynchronous I/O operations. This structure is allocated by a pager when the asyn-
chronous I/O operation is initiated. When the device has finished all parts of an asyn-
chronous I/O operation, the uvm aiodesc structure for that operation is placed on a
globally linked list of completed asynchronous I/O operations called
uvm.aio done. At this time the pagedaemon is woken up. As part of the state in-
formation for the asynchronous I/O operation, the uvm aiodesc structure contains
a pointer to an “asynchronous I/O done” pager operation. The pagedaemon will tra-
verse the completed asynchronous I/O operations on the uvm.aio done list calling
each operation’s asynchronous I/O done function. This function frees all resources
being used for that asynchronous I/O operation, including the uvm aiodesc struc-
ture.
In contrast, BSD VM provides very little generic structure for asynchronous I/O
operations. In BSD VM, the pager put operation is overloaded to be the asynchronous
I/O done operation as well. If the pager put operation is called with a non-null pager,
then the request is treated like a normal put operation. However, if the pager put
operation is called with its pager argument set to null, then that is a signal to the
pager to look for any completed asynchronous I/O operations to clean up.
When the BSD VM pagedaemon first begins running, it calls each pager's put func-
tion with pager set to null — regardless of whether there are any asynchronous I/O
operations waiting to be finished. This is inefficient because there often are no I/O
operations waiting to be finished. BSD VM’s overloading of the pager put function
is confusing because it is not well documented and the completion of asynchronous
I/O operations has little to do with starting a pageout. BSD VM pagers that do not
support asynchronous I/O ignore put requests with null pagers.
It should be noted that the BSD I/O subsystem, which is used by all the operating systems
that run UVM, currently requires that all pages that are in I/O be mapped in the kernel's
virtual address space.
access memory (i.e. use DMA) this mapping is unnecessary because the kernel mappings
are never used. In the future the I/O subsystem should be modified to take a list of pages
rather than a kernel virtual address. That way only drivers that need to have their pages
mapped in the kernel’s address space will bother to do so and the extra overhead can be
avoided for drivers that do not need kernel mappings.
6.4 Summary
In this chapter the design of UVM’s pager interface was presented. The pager provides a
set of functions that present a uniform interface to uvm object memory objects. Func-
tions in the pager interface include functions that adjust an object’s reference count, and
functions that transfer data between backing store and physical memory. The differences
between the BSD VM pager interface and UVM’s pager interface were also discussed. In
BSD VM, each vm object data structure is individually allocated and has its own pri-
vate vm pager data structure. In UVM, the uvm object structure is designed to be
embedded within a mappable kernel data structure such as a vnode. Rather than having a
uvm pager structure, the uvm object contains a pointer to a shared set of pager oper-
ations.
Chapter 7
This chapter describes the design and function of UVM’s page loan out, page transfer, and
map entry passing interfaces. With page loanout, a process can safely “loan” read-only
copies of its memory pages out to the kernel or other processes. Page transfer allows a
process to receive pages from the kernel or other processes. Map entry passing allows
processes to copy or move chunks of virtual address space between themselves. All three
interfaces allow a process to move data using the virtual memory system and thus avoid
costly data copies. These interfaces provide the basic building blocks on which advanced
I/O and IPC subsystems can be built. In this chapter we explain the workings of UVM’s
page loanout, page transfer, and map entry passing. In Chapter 10 we will present perfor-
mance data for our implementation of these functions.
Moving data with the virtual memory system is more efficient than copying data.
To transfer a page of data from one page to another through copying, the kernel must
first ensure that both the source and destination pages are properly mapped in the kernel
and then copy, word by word, each piece of data in the page. To transfer a page of data
using the virtual memory system the kernel need only ensure that the page is read-only and
establish a second mapping of the page at the destination virtual address. (On an i386
system there are 1024 words in a page, thus copying a page requires a loop of 1024 memory
loads and 1024 memory stores. Transferring a page of data using the virtual memory system
requires two memory loads and stores, one load-store pair to write protect the source virtual
address and one pair to establish a mapping at the destination address. There is also some
overhead associated with looking up the page and adjusting some counters.)
However, moving data with the virtual memory system is more complicated than
copying data. First, the kernel must use copy-on-write to prevent data corruption. Second,
for pages that are going to be used by the kernel for I/O, the kernel must ensure that the data
remains wired and is not paged out or flushed. Third, virtual memory based data moving
only works for large chunks of data (close to page-sized). Copying small chunks of data
is more efficient than using the virtual memory system to move the data. Thus, a hybrid
approach is optimal.
Page loanout is a feature of UVM that addresses these issues by allowing a process
to safely loan out its pages to another process or the kernel. This allows the process to
send data directly from its memory to a device or another process without having to copy
the data to an intermediate kernel buffer. This reduces the overhead of I/O and allows the
kernel to spend more time doing useful work and less time copying data.
At an abstract level, loaning a page of memory out is not overly complex. To loan
out a page, UVM must make the page read-only and increment the page's loan counter.
[Figure: page loanout of object and anon pages from a process' virtual address space
to wired kernel pages and to pageable anonymous memory (anons).]
Then the page can safely be mapped (read-only). The complexity of page loanout arises
when handling the special cases where some other process needs to make use of a page
currently loaned out. In that case the loan may have to be broken.
There are four types of page loans:
object to kernel: A page from a uvm object loaned out to the kernel.
anon to kernel: A page from an anon loaned out to the kernel.
object to anon: A page from a uvm object loaned out to an anon.
anon to anon: A page from an anon loaned out as an anon. Since the data is already in an
anon, no special action other than incrementing the anon's reference count is required
to create a loan of this type.
The procedures for performing each of these types of loans will be described in Sec-
tion 7.1.5. Note that it is possible for a page from a uvm object to be loaned to an
anon and then later loaned from that anon to the kernel.
All types of loaned-out pages are read-only. As long as the page is loaned-out its
data cannot be modified. In order to modify data in a loaned out page, the kernel must
terminate the loan. This is called “breaking” the loan. Breaking a loan involves allocating
a new page off the free list, copying the data from the loaned page to the new page, and
then using the new page.
There are several events that can cause UVM to break a loan. These events include:
• The owner of a loaned page wants to free the page.
• A write fault on a loaned page.
• The pagedaemon wants to pageout a page currently loaned out.
• An object needs to flush a loaned page.
UVM had to be modified to handle loaned pages in each of these cases. These modifications
are described later in this chapter.
Memory pages loaned to the kernel have three important properties. First, pages
loaned to the kernel must be resident. If the data is on backing store then it is fetched
before the loanout proceeds. Second, each page loaned to the kernel is “wired” to prevent
the pagedaemon from attempting to pageout a page loaned to the kernel while the kernel is
still using it. This is important because if the pagedaemon was allowed to pageout the page
then the next time the kernel accessed the loaned page an unresolvable page fault would be
generated (and the system would crash). Third, pages loaned to the kernel are entered into
the kernel’s pmap with a special function that prevents page-based pmap operations from
affecting the kernel mapping. This allows UVM to perform normal VM operations on the
loaned-out page without having to worry about disrupting the kernel’s loaned mapping and
causing an unresolvable page fault.
Memory pages loaned from an object to an anon have two important properties.
First, while most pages are referenced by either a uvm object or an anon, these pages
are referenced by both types of memory objects. However, it should be noted that the pages
are considered owned by their uvm object and not the anon that they are loaned to. This
means that in order to lock the fields of a page loaned from a uvm object to an anon, the
uvm object must be locked. Second, a uvm object’s page can only be loaned to one
anon at a time. Future loans to an anon need only gain a reference to the anon the page is
already loaned to (rather than allocating a new anon to loan to).
When an object that owns a loaned page decides to free it, the page cannot be placed
on the free page list because it is still being used by the process that the page is loaned to.
Rather than being freed, the page becomes ownerless. If an ownerless page was originally
loaned from a uvm object to an anon, then the anon is free to take over the ownership of
the page the next time it holds the appropriate lock. When the loan count of an ownerless
page is decremented to zero, then the page can be placed on the free list.
To loan a page from a uvm object to the kernel the following steps are taken. First the
object is locked. Second, the page is fetched from the object using the object’s pager get
function. If the page is not resident, then the object is unlocked and an I/O is initiated. Once
the page is resident the object is relocked. Third, the page queues are locked. Since both the
object and the page queues are locked, the page's loan counter can be incremented.
If the loan counter was zero, then the page must be globally write-protected to trigger a
copy-on-write operation in the event that some process attempts to modify the page while
the loan is established. Fourth, the page is wired (i.e. removed from the pagedaemon’s
queues). Finally the page queues and object can be unlocked. The pseudo code for this
process is:
lock(object);
page = get_page(object,offset); /* make resident */
lock(page queues);
if (page->loan_count == 0)
write_protect(page);
page->loan_count++;
wire(page);
unlock(object, page queues);
return(page);
To loan a page from an anon to a wired kernel page the following steps are taken. First,
the anon must be locked. Second, the uvmfault anonget function must be called to
make the anon's page resident. Note that uvmfault anonget checks for anons that have
pages on loan from uvm objects. In that case the anonget function will ensure that both
the anon and the uvm object are locked. Third, the page queues are locked and the page
is write protected (if needed), the loan count is incremented, and the page is wired. Finally,
the anon, uvm object (if there is one), and the page queues are unlocked and the page is
returned. The pseudo code for this process is:
lock(anon);
uvmfault_anonget(anon); /* makes resident */
/* if anon’s page has an object, it is locked */
lock(page queues);
if (anon->page->loan_count == 0)
write_protect(anon->page);
anon->page->loan_count++;
wire(anon->page);
if (anon->page->object)
unlock(anon->page->object);
unlock(anon, page queues);
return(anon->page);
To loan a page from an object to an anon the following steps are taken. First the object is
locked and the page is looked up using the pager get function. If the page is not resident,
then it is fetched. Second, UVM checks to see if the page has already been loaned to an
anon. If so, it locks the anon, increments the anon’s reference count, unlocks the anon and
object, and returns a pointer to the anon. Third, if the page has not been loaned to an anon,
then a new anon is allocated. The page queues are then locked, the page is write protected
if necessary, the loan count is incremented, and a bi-directional link is established between
the new anon and the page. Finally, the page queues and object are unlocked and the new
anon is returned. The pseudo code for this process is:
lock(object);
page = get_page(object,offset); /* make resident */
if (page->anon) { /* already loaned to an anon? */
lock(anon);
anon->reference_count++;
unlock(anon,object);
return(page->anon);
}
allocate new anon;
point new anon at page and vice-versa;
lock(page queues);
if (page->loan_count == 0)
write_protect(page);
page->loan_count++;
unlock(page queues,object);
return(new anon);
Loaning a page from an anon to an anon is a simple matter since the page is already in an
anon. All that needs to be done is to lock the anon, increment the anon’s reference count,
write protect the page (if necessary), and then unlock and return the anon.
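In the style of the preceding pseudo code, an anon-to-anon loan is simply:

lock(anon);
anon->reference_count++;
if (anon's page is not already write protected)
    write_protect(anon->page);
unlock(anon);
return(anon);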
Once a page has been transferred into a process' address space it is treated like any other
page of anonymous memory. Page transfer is shown in Figure 7.2.
There are two types of page transfer: kernel page transfer and anonymous page
transfer. These two types of page transfer are described in the following sections.
A process can also set a region's inheritance attribute to
“share.” When that process forks a child process it will share the memory with the child
process.
UVM provides a new way for processes to exchange data: map entry passing. For
large data sizes, this is more efficient than pipes because data is moved without being
copied. Map entry passing is also more flexible than traditional shared memory. Map entry
passing does not have System V shared memory’s requirement that a process pre-allocate a
shared memory segment. It also does not require the services of the filesystem layer, unlike
mmap. Additionally, map entry passing allows a range of virtual addresses to be shared.
This range can include multiple memory mappings and unmapped areas of virtual memory.
Map entry passing is shown in Figure 7.3.
To export a block of virtual address space, a process fills in a mexpimp info structure,
as shown in Figure 7.4. The base and len fields identify the block of virtual address space
that is exported. Note that more than one object can be mapped in this space and it also can
have unmapped gaps in it. In fact, it can even be completely unmapped. The export type
describes how the block of memory is exported. There are currently four possible values
for this field (described in the next paragraph). The export flags along with the process id,
user id, group id, and mode are used to control access to the exported region of memory. A
process has the option of exporting memory to a specific process or establishing a file-like
protection code. The export flags control which access control mechanism is used.
The four export type values are as follows:
share: causes the importing process to share the memory with the exporting process.
copy: causes the importing process to get a copy-on-write copy of the exported virtual
memory.
donate: causes the memory to be removed from the exporting process’ virtual address
space and placed in the importing process’ space. After the import the virtual space
in the exporting process will be unmapped.
donate with zero: the same as “donate,” except rather than leaving the exported range of
virtual space unmapped after the import it is reset to be zero-fill memory.
On a successful export of memory, a “tag” structure is created for the memory. This
structure is called mexpimp tag and it contains a process ID and timestamp. To import
virtual memory this tag must be passed in to the import system call. The tag is used to look
up the exported region of memory and import it.
UVM provides two system calls for the export and import operations. Their signa-
tures are:
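A plausible form for these signatures, with the system call names and exact types treated
as assumptions, is:

    int mexport(struct mexpimp_info *info, struct mexpimp_tag *tag);
    int mimport(struct mexpimp_tag *tag, struct mexpimp_info *info);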
Both system calls return zero on success and -1 if there is an error. For the export sys-
tem call, the exporting process fills out a mexpimp info structure and if successful the
kernel will fill out the tag structure with the tag of the exported region of memory. For the
import system call, the importing process fills out the tag structure and calls import. If the
mexpimp info pointer is non-null, then the info used to export that region of memory
will be returned in the structure pointed to by this pointer.
The uvm map extract function accepts the following flags:
remove: remove the mapping from the source map after it has been transferred.
zero: valid only if “remove” is set, this causes the removed area to become zero-fill after
the extract operation.
contig: abort if there are unmapped gaps in the region. The contig flag is a “best effort”
flag. It is possible for the extract to succeed even if there are gaps in the map in some
cases.
qref: “quick” references. The qref flag is used for brief map extractions such as the ones
generated by ptrace. With ptrace the following pattern is used: extract a single
page from a process’ map, read or write a word in that page, unmap the extracted
region. Under normal VM operation this sort of operation would lead to a lot of map
entry fragmentation because of the clipping used to extract the page-sized chunks.
Quick references take advantage of the fact that the extracted mapping is going to
be short lived to relax the management of references to uvm object structures and
amaps and avoid map entry fragmentation.
fixprot: causes the protection of the extracted area to be set to the maximum protection.
This is necessary for the ptrace call because it needs to write break points to a
process' text area. The text area is normally mapped copy-on-write with its protection
set to read-only and its maximum protection set to read-write. In order for ptrace
to be able to write to this memory, the protection must be raised from its current
read-only value to its maximum value. Note that this does not affect the source map,
only the destination map.
The actual uvm map extract function is a bit more complex than this due to the need
to handle quick references and data structure locking properly.
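For example, a ptrace-style quick extraction might look like the following sketch; the
flag spellings and argument order are assumptions based on the list above:

    /* extract one page of the target's map into the kernel's map */
    error = uvm_map_extract(srcmap, va, PAGE_SIZE, kernel_map, &kva,
                            EXTRACT_QREF | EXTRACT_FIXPROT);
    if (error == 0) {
            /* read or write the word of interest through kva, then unmap */
            uvm_unmap(kernel_map, kva, kva + PAGE_SIZE);
    }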
Chapter 8
This chapter describes a number of the secondary design elements of UVM. These elements
are small parts of the UVM design that differ from BSD VM and contribute to the overall
function of UVM. Topics covered in this chapter include amap chunking, clustered anony-
mous pageout, the design of UVM’s three pagers, memory mapping issues, page flags, VM
startup, and pmap issues.
– If there is already a pager associated with the vnode, then the allocate function
looks up the vm object associated with that pager, gains a reference to it,
and returns a pointer to the pager.
– If there is no pager associated with the vnode, then the allocate function mal-
locs a new vm pager, vn pager, and vm object structure for the vnode
and ties them all together. The allocate function then gains a reference to the
backing vnode using the VREF macro and returns a pointer to the freshly allo-
cated pager.
• The pager is then used by vm mmap to look up the vnode's vm object (again).
(Describing the details of this code is beyond the scope of this section; see
vm mmap.c for details.)
It should be noted that in BSD VM the vnode’s vm object maintains an active reference
to the vnode. The reference is gained with VREF when the vm object is allocated and it
is dropped with vrele when the object is freed. The vm object has its own reference
count that is used to count how many VM data structures are using it.
The BSD VM system also has a data structure called the “object cache.” The pur-
pose of the object cache is to allow vm object structures that are not currently being used
to remain active for a time after their last reference is dropped. Objects in the object cache
are said to be “persisting.” This is useful for vm object structures that map files that are
repeatedly used for short periods of time. For example, it is useful for the /bin/ls object
to remain active even when the program is not being run because it is used frequently. Each
object on the system has a flag that says whether it is allowed to persist or not. When a
vm object structure’s reference count drops to zero, if its persist flag is true it is added
to the object cache, otherwise it is freed.
The number of objects allowed in BSD VM’s object cache is statically limited to
100 objects (vm cache max). If this limit is reached, an older object from the cache may
be freed to make room for a new one. If a process memory maps an object that is currently
in the cache then it is removed from the cache when the vm object’s reference count is
incremented to one.
The main problem with the object cache is that persisting unreferenced vnode ob-
jects that are in the object cache are still holding a reference to their backing vnode. BSD’s
vnode layer has a similar caching mechanism to the object cache, and the fact that vnodes
persisting in the VM system have an active reference prevents the vnodes from being re-
cycled off of the vnode cache. The net result is that in BSD VM there are two layers of
persistence caching code and the VM object cache interferes with the proper operation of
the vnode cache. Thus, useful persisting virtual memory information may be thrown away
prematurely, and the selection of a vnode to recycle may be less than optimal because vn-
odes with virtual memory data in the object cache cannot be recycled even though they are
no longer being used.
– If the uvm object is not currently in use, then the attach function will get the
size of the vnode, initialize the uvm object structure, gain a reference to the
vnode with VREF, and then return a pointer to the uvm object structure.
– If the uvm object is already active then the attach function will check the
object’s reference count. If it is zero then the object is currently persisting. In
that case the reference count is incremented to one and the attach function gains
a reference to the backing vnode with VREF. If the object is not persisting, then
the only action required is to increment the reference count. A pointer to the
uvm object structure is returned.
• The uvm map function is called to map that object in with the appropriate attributes.
When the final reference to a vnode uvm object is dropped the reference to the
underlying vnode is dropped (with vrele). This causes the vnode to enter the vnode
cache. If the vnode’s flag is set to allow the uvm object to persist, then the persisting
flag is set and processing is complete. If the object is not allowed to persist, then all of its
pages are freed and it is marked as being invalid.
In order for UVM’s scheme to work, UVM must be notified when a vnode is be-
ing removed from the vnode cache for recycling. The vnode pager provides a function
uvm vnp terminate for this. When the vnode layer of the operating system wants to
recycle an unreferenced vnode off the vnode cache it calls uvm vnp terminate on the
vnode. This function checks to see if the vnode’s uvm object is persisting. If so, it re-
leases all the uvm object’s resources and marks it invalid. The uvm vnp terminate
function is called from the vclean function in vfs subr.c.
The second argument of uvn attach is the protection access level that the caller desires.
For example, if a file is to be memory mapped read-only or copy-on-write, then the access
protection would be “read” to indicate that the attacher will not modify the object’s pages.
On the other hand, if the attacher is going to modify the object, then it specifies an access
protection of “write.” The vnode pager uses this information to maintain a list of vnode
objects that are currently writable. Then when the vnode sync helper function is called the
vnode pager only considers vnode objects on the writable list. Since most vnode objects
are mapped read-only, this reduces the overhead of the sync operation: rather than
traversing every page of every active vnode object to see if it needs to be cleaned, only the
writable vnodes need to be checked. (This helper function is necessary in both BSD VM
and UVM because the VM cache and buffer cache have not yet been merged in NetBSD.)
Table 8.1: The eight BSD VM memory mapping functions. The vm map find,
vm allocate, vm allocate with pager, and vm mmap functions map memory at
either a specified virtual address or they use vm map findspace to find an available
virtual address for the mapping.
Function                 Usage
vm map lookup entry      looks for the map entry that maps the specified address; if the address is not found, returns the map entry that maps the area preceding the specified address
vm map entry link        links a new map entry into a map
vm map findspace         finds unallocated space in a map
vm map insert            inserts a mapping of an object in a map at the specified address
vm map find              inserts a mapping of an object in a map
vm allocate              allocates zero fill in a map
vm allocate with pager   maps a pager's object into a map; the new mapping is shared (rather than copy-on-write)
vm mmap                  maps a file or device into a map; the vm mmap function handles the special object chain adjustments required to map an object in copy-on-write
There are four system calls that establish memory mappings:
exec: maps a newly executed program's text, data, and bss areas into the process' address
space.
mmap: maps a file or device into the process' address space.
obreak: grows a process' heap by mapping zero-fill memory at the end of the heap.
shmat: maps a System V shared memory segment into a process' address space.
The BSD VM system provides eight functions that are used to establish mappings
for these four system calls. These eight functions are shown in Table 8.1. The call path for
these functions is complex, as shown in Figure 8.1.
On the other hand, UVM provides five functions for establishing memory mappings.
These functions are shown in Table 8.2. Unlike BSD VM, UVM has one main mapping
Figure 8.1: The BSD VM call paths for the four mapping system calls
function, uvm map. All mapping operations use this function to establish their mappings.
This makes UVM’s code cleaner and easier to understand than the BSD VM code. UVM’s
mapping call path is shown in Figure 8.2.
The prototype for the uvm map function is:
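Based on the description that follows, the prototype is approximately (the exact type
names are assumptions):

    int uvm_map(vm_map_t map, vaddr_t *startp, vsize_t size,
                struct uvm_object *uobj, vaddr_t uoffset, uvm_flag_t flags);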
A call to uvm map maps size bytes of object uobj into map map. The first page of
the mapping maps to offset uoffset in the object being mapped. The startp argument
contains a hint for the address where the mapping should be established. Once the mapping
is established, the actual address used for the mapping is returned in startp. The flags,
described below, are used to control the attributes of the mapping.
Function               Usage
uvm map lookup entry   looks for the map entry that maps the specified address; if the address is not found, returns the map entry that maps the area preceding the specified address
uvm map entry link     links a new map entry into a map
uvm map findspace      finds space in a map and inserts a mapping of the specified object
uvm map                establishes a mapping
uvm mmap               attaches to a file's or device's uvm object and maps it in
Table 8.2: The five UVM memory mapping functions.
The uvm map function returns KERN SUCCESS if the mapping was successfully established,
otherwise it returns an error code.
The uvm map function’s flags are a bitwise encoding of the desired attributes of
the mapping. This encoding is usually created with a call to the UVM MAPFLAG macro.
This macro takes the following arguments:
protection: the initial protection of the mapping. This protection must not exceed the
maximum protection.
[Figure 8.2: UVM's mapping call path: uvm mmap calls uvm map, which uses
uvm map findspace, uvm map lookup entry, and uvm map entry link.]
trylock: try and lock the map. If the map is already locked, return an error code
rather than sleep waiting for the map to be unlocked.
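For example, a read-only mapping of a file object can be established in a single call
along these lines; the flag spellings are assumptions based on the macro arguments:

    error = uvm_map(map, &addr, size, uobj, uoffset,
                    UVM_MAPFLAG(PROT_READ,    /* initial protection */
                                PROT_READ,    /* maximum protection */
                                INH_COPY,     /* inheritance */
                                ADV_NORMAL,   /* advice */
                                0));          /* flags (e.g. trylock) */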
Note that under UVM the uvm map function allows all the mapping attributes to be
specified at map time. This cannot be done under BSD VM. BSD VM’s mapping functions
always establish a mapping with a default protection and inheritance. For example, the
default protection of a mapping under BSD VM is read-write. Thus, if the kernel wants to
establish a read-only mapping of a file using the BSD VM system it must first use the map-
ping functions to establish a read-write mapping. It then must use the vm map protect
function to write-protect the mapping. This is both wasteful and dangerous. It is waste-
ful because the mapping function locks the map, establishes the mapping with the default
protection, and then unlocks the map. Then the vm map protect function must relock
the map, repeat the map lookup, change the protection, and then unlock the map. Thus,
under BSD VM the map must be locked and unlocked twice, and there is an extra lookup
and protection change. This is also dangerous because when establishing a memory map-
ping there is a brief window of time between establishing the mapping and write protecting
the mapping where the map is unlocked and the object is mapped read-write. On a mul-
tithreaded system this could allow a malicious process to race the kernel and successfully
modify a read-only file before the kernel has a chance to write-protect the mapping. UVM’s
uvm map function avoids both these problems by allowing the attributes of the mapping to
be specified at the time the mapping is created.
If a caller to uvm map does not wish to map an object in at the object layer of the
mapping, it can call uvm map with uobj set to null. Thus, a zero-fill mapping can be
created by setting uobj to null and setting the “copyonw” flag.
Note that both BSD VM and UVM support the BSD “pmap prefer” interface for
systems with virtually addressed caches (VACs) or MMUs that do not allow mappings
in certain virtual address ranges (e.g. Sun sparc systems). This interface is used by the
mapping functions to push a mapping address forward so that it does not create VAC cache
aliases or place a mapping at an invalid address.
UVM provides the following kernel memory allocation functions:
uvm km alloc: The alloc function allocates wired memory in the kernel's virtual ad-
dress space. The allocated memory is uninitialized. If there is not enough physical
memory available to satisfy the allocation, then the allocating process will wake the
pagedaemon and sleep until memory is available. If there is not enough virtual space
in the map this function will fail. BSD VM does not have a function that fills this
role (but see uvm km zalloc below).
uvm km zalloc: This function operates in the same way as uvm km alloc except the
allocated memory is zeroed by the memory allocator. This function is known as
kmem alloc in BSD VM.
uvm km valloc: The valloc function allocates zero-fill virtual memory in the kernel
address space. No physical pages are allocated for the memory. As the memory is
accessed the fault routine will map in zeroed pages to resolve the faults. If there is
no free virtual space in the map this function will fail. This function is known as
kmem alloc pageable in BSD VM.
uvm km valloc wait: This function operates in the same way as uvm km valloc
except that if the map is out of virtual space then the caller will sleep until space is
available. This function is known as kmem alloc wait in BSD VM.
uvm km kmemalloc: This function is the low-level memory allocator for the kernel mal-
loc. It allocates wired memory in the specified map. If there is not enough physical
memory available to satisfy the allocation, then the allocating process will wake the
pagedaemon and sleep until memory is available unless a “cannot wait” flag is set. If
there is not enough virtual space for the allocation then this function will fail. This
function is known as kmem malloc in BSD VM.
uvm km free: This function is a simple front end for the standard unmap function. It is
known as kmem free in BSD VM.
uvm km free wakeup: This function is the same as uvm km free except that after
freeing the memory it wakes up any processes that are waiting for space in the map.
It is known as kmem free wakeup in BSD VM.
UVM records the wired state of memory differently from BSD VM in several cases:
private kernel memory: Private kernel memory is always wired and only appears within
the kernel map. BSD VM stores the wired state of wired kernel memory in both
the kernel map and the vm page structures that are mapped by the kernel. In UVM
the wired state is only stored in the vm page structures, reducing kernel map entry
fragmentation.
pageable kernel memory: Pageable kernel memory is used for processes' user struc-
tures. BSD VM stores the wired state of a process' user structure in both a kernel
map entry and in a flag in the process' proc structure. UVM takes advantage of this redundancy.
Since the wired state is stored in the proc structure, there is no need to store a redun-
dant copy of the state information in the kernel map. In UVM, instead of fragment-
ing the kernel map the map’s wire counters are left alone. When a user structure
is wired or unwired both the flag in the proc structure and the wire counters in the
vm page structures that map the user structure are adjusted appropriately. This
also reduces kernel map entry fragmentation.
sysctl and physio: In both of these functions, memory is wired, an operation is per-
formed, memory is unwired, and the function returns. BSD VM stores the wired state
of processes performing these functions in both the process’ map, and on the process’
kernel stack. In UVM the wired state of memory is only stored on the process’ kernel
stack. This reduces user map entry fragmentation.
It should be noted that in UVM, unlike in BSD VM, the vslock and vsunlock
interface used by sysctl and physio wires memory using the vm page structure's wire
counter and does not change map entries.
In UVM, the only time a map entry’s wire counter is used is for the mlock and
munlock system calls. All other places that use wired memory store the wired state
information elsewhere to avoid map entry fragmentation. As a result maps under UVM
have fewer entries than under BSD VM.
The page flags are:
PG BUSY: The data in the page is currently in the process of being transferred between
memory and backing store. All processes that wish to access the page must wait
until the page becomes “unbusy” before accessing the page (see PG WANTED).
PG CLEAN: The page has not been modified since it was loaded from backing store. Pages
that are not clean are said to be “dirty.”
PG CLEANCHK: The clean check flag is used as a hint by the clustering code to indicate
if a mapped page has been checked to see if it is clean or not. If the hint is wrong it
may prevent clustering but it will not cause any problems. The clean check flag is an
optimization borrowed by UVM from FreeBSD.
PG FAKE: The fake flag indicates that a page has been allocated to an object but has not
yet been filled with valid data. UVM currently does not use this flag for anything
other than to assist with debugging.
PG RELEASED: The released flag indicates that a page that is currently busy should be
freed when it becomes unbusy. It is the responsibility of the process that set the busy
flag to check the released flag when unbusying the page.
PG TABLED: The tabled flag indicates that the page is currently part of the object/offset
hash table. When a page that is tabled is freed, it is removed from the hash table
before being added to the free list.
PG WANTED: The wanted flag is set on a busy page to indicate that a process is waiting
for the page to become unbusy so that it can access it. When the process that set the
busy flag on the page clears it, it must check the wanted flag. If the wanted flag is
set, then it must issue a wakeup call to wake up the process waiting for the page.
In addition to these flags there are five additional page flags that appear in BSD VM that
should no longer be used: PG FICTITIOUS, PG FAULTING, PG DIRTY, PG FILLED,
and PG LAUNDRY. The elimination of the PG FICTITIOUS flag for device pages in UVM
was described earlier in Section 8.4. The PG FAULTING flag is used by BSD VM for
debugging the swap pager and is not applicable to UVM. BSD VM uses the dirty and filled
flags for debugging by the I/O system. This usage should be discontinued because the I/O
system does not obtain the proper locks before accessing the flags and thus could corrupt
the flags if fine-grain locking is in use. The laundry flag is used by BSD VM to indicate a
dirty inactive page. This flag is not needed by UVM.
Note that functions that set and clear the busy bit of a page need to check the wanted
and released bits only if the object that owns the page was unlocked while the page was
busy. If the object was not unlocked while the page was busy, then no other process could
step in and set the wanted or released bits because they are protected by the object’s lock.
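The protocol for clearing the busy flag can thus be sketched as:

    /* done with a busy page; the object was unlocked while the page was busy */
    if (pg->flags & PG_RELEASED) {
            release_the_page(pg);            /* e.g. via pgo releasepg */
    } else {
            if (pg->flags & PG_WANTED)
                    wakeup(pg);              /* wake any waiting process */
            pg->flags &= ~(PG_BUSY | PG_WANTED);
    }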
The valid pqflags are:
PQ FREE: The page is on the free page list. If PQ FREE is set, then all other page queue
flags must be clear.
PQ SWAPBACKED: The page is backed by the swap area, and thus must be either
PQ ANON or PQ AOBJ (and in fact, this flag is defined as the logical-or of those
two flags).
Contiguous Memory
If the MACHINE NONCONTIG symbol is not defined for the C pre-processor then the old
contiguous physical memory interface is selected. To use this interface, a process must set
up four global variables before calling vm mem init. The variables are avail start,
avail end, virtual avail, and virtual end. These variables define the start and
end of available physical memory, and the start and end of available kernel virtual memory
(respectively).
Non-contiguous Memory
If the MACHINE NONCONTIG symbol is defined in the C pre-processor then the non-
contiguous physical memory interface is selected. The non-contig interface is defined by
several functions:
pmap virtual space: returns values that indicate the range of kernel virtual space
that is available for use by the VM system.
pmap free pages: returns the number of pages that are currently free.
pmap next page: returns the physical address of the next free page of available mem-
ory.
pmap page index: given a physical address, returns the index of the page in the array of
vm page structures.
pmap steal memory: allocates memory. This allocator can be used only before the
VM system is started.
Under MACHINE NEW NONCONTIG (MNN), the array of physical memory segments can
be sorted and searched in one of three ways:
random: The array of physical memory segments is stored in random order and searched
linearly for the requested page.
bsearch: The array of physical memory segments is sorted by physical address and a bi-
nary search is used to find the correct entry.
bigfirst: The array of physical memory segments is sorted by the size of memory segment
and searched linearly. This is useful for systems like the i386 that have one small
chunk of physical memory and one large chunk.
MNN encapsulates these options into a single function with the following prototype:
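A plausible reconstruction of this prototype, with the function name and types treated
as assumptions, is:

    int vm_physseg_find(paddr_t pa, int *offp);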
This function returns the index of the entry in the array of physical memory segments for
the specified physical address (or it returns -1 if the physical address is not valid). The
“offset” is the offset in the segment for the given physical address.
When a system using MNN is booting, before calling the VM startup function, it
must first load a description of physical memory into the array of physical memory seg-
ments. This is done with the uvm page physload function. This function takes the
starting and ending physical addresses of a segment of physical memory and inserts the
segment into the array, maintaining the requested sorting order. This single function re-
places the two older interfaces supported by BSD VM.
For example, under the old interface a page is write-protected with:
pmap_page_protect(page->phys_addr, VM_PROT_READ);
In order to process this, the pmap module must look up the physical address in the table
of managed pages to determine what vm page the physical page belongs to so the list of
mappings for that page can be traversed. If the page is not a managed page, then there is
no such list and no action is performed.
The current procedure is wasteful because the function calling pmap page protect
knows the vm page structure for the page, but does not pass this informa-
tion to the pmap module. This forces the pmap module to do a page lookup to determine
the identity of the physical address’ page. In the new pmap interface, these functions take
a pointer to a vm page structure rather than a physical address so that this extra lookup
operation can be avoided.
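Under the new interface the same operation passes the page pointer directly:

    pmap_page_protect(page, VM_PROT_READ);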
8.15 Summary
In this chapter a number of secondary design elements of UVM were presented. These
elements include:
- the chunking of amap allocation to reduce the amount of kernel memory allocated
for a sparsely populated amap.
- the aggressive pageout clustering of anonymous memory that allows UVM to quickly
recover from a memory shortage.
- the design of UVM’s three pagers: the aobj pager, the device pager, and the vnode
pager.
- the uvm_map memory mapping function that provides a single uniform interface to
memory mapping under UVM.
- the unmapping function that separates unmapping memory from dropping references
to memory objects.
- the management of kernel memory maps and objects that reduces the number of VM
calls needed to establish kernel memory mappings.
- changes in the management of wired memory that prevent unnecessary map entry
fragmentation and thus reduce map lookup time.
- the redesign of the page flags to accommodate fine-grained locking and to eliminate
unneeded flags.
- MACHINE_NEW_NONCONTIG, the new unified scheme used to manage a computer’s
physical memory.
- the new pmap interface that supports features incorporated from FreeBSD and also
kernel page loanout.
- the new i386 pmap module that manages pmaps more efficiently and does not block
machine-independent code.
While these design elements are secondary, they combine to produce a beneficial effect on
UVM’s design and performance, and thus play an important role within UVM.
Chapter 9
Implementation Methods
This chapter describes the methods used to implement and debug UVM. First, the strategy
behind the UVM implementation is presented. Then the tools used to assist with debugging
are presented. These tools include the UVM history mechanism and functions that can be
called from the kernel debugger. Finally, two user-level applications that display the status
of UVM are presented.
invocation number.
- Four integers that contain the data for the printf format.
Clearly both unmap operations are redundant. To fix this problem, the exec code
was modified to call uvm_map directly, since the special handling of uvm_mmap is
not needed in UVM.
- UVMHIST traces also detected a redundant call to uvm_map_protect in the exec
code path. The function vmcmd_map_zero establishes zero-fill mappings. It is used
to memory map the bss and stack segments of a process. As previously described,
in BSD VM all mappings are established with the default protection (read-write).
Early versions of UVM’s memory mapping function also always used the default
protection. It turns out that in the BSD code the vmcmd_map_zero function always
used uvm_map_protect to set the protection after the mapping was established,
even if the desired protection was the default protection. This results in the following
trace during an exec:
1. zero-map the bss segment (the protection is set to the default, read-write).
2. set the protection of the bss segment to read-write [redundant].
3. zero-map the reserved stack area.
4. set the protection of the reserved stack area to “none”.
5. zero-map the active stack area.
6. set the protection of the active stack area to read-write [redundant].
The two redundant uvm_map_protect calls set the protection to the value it is
already set to. This trace was one of the motivating factors for allowing the protection
to be specified in the flags of the uvm_map function (see the sketch following this
list). Thus, in UVM not only are the redundant uvm_map_protect calls eliminated,
but the call to set the protection of the reserved stack area to “none” is also eliminated
since that protection can now be set while memory mapping the area.
- When a process exits, it must unmap its memory to free it. The UVMHIST traces
show that the BSD kernel was unmapping an exiting process’ address space in three
places on i386 systems: once in the main exit function, once in the machine-dependent
cpu_exit function, and once in the uvmspace_free function. Based on this
information the unmap in cpu_exit was removed, and the unmap operation in
uvmspace_free was modified to only occur if the map being freed still has entries.
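As an illustration of the protection-in-flags change mentioned in the vmcmd_map_zero
item above, a zero-fill mapping can now be established with its protection in a single
call. A minimal sketch (UVM_MAPFLAG is the flag-packing macro from NetBSD's
uvm_extern.h; the exact call made by the exec code may differ, and error handling is
omitted):

    /* sketch: zero-fill mapping with its protection set at map time;
       p, addr, size, and prot are assumed from the surrounding code */
    error = uvm_map(&p->p_vmspace->vm_map, &addr, size,
        NULL, UVM_UNKNOWN_OFFSET,
        UVM_MAPFLAG(prot, VM_PROT_ALL, UVM_INH_COPY, UVM_ADV_NORMAL, 0));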
The commands and their usage are as follows:
ps: Lists all active processes on the system. If the “a” option is specified (“ps/a”),
then the kernel address of each process’ vm_map structure is printed.
show map: Shows the contents of the specified vm_map structure. If the “f” option is
specified, then the list of map entries attached to the map is also printed. For user
processes, use the “ps/a” command to get map addresses. For the kernel map, use
the “show map *kernel_map” command.
show object: Shows the contents of the specified uvm_object structure. If the “f”
option is specified, then a list of all pages in the object is also printed.
show page: Shows the contents of the specified vm_page structure. If the “f” option is
specified, extra sanity checks on the page are performed.
call uvm_dump: Shows the current values of the counters used by UVM and the kernel
addresses of kernel objects.
is no longer the case. UVM supports a sysctl system call that any process can use to get
UVM status information without the need for special privileges. This gives normal users
the flexibility to write their own status-gathering programs without having to bother their
system administrator.
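A minimal sketch of such a status-gathering program, assuming the VM_UVMEXP sysctl
name and the struct uvmexp definition found in NetBSD's headers (the header paths shown
follow later NetBSD trees and may differ in the 1998 sources):

    #include <sys/param.h>
    #include <sys/sysctl.h>
    #include <uvm/uvm_extern.h>     /* struct uvmexp */
    #include <stdio.h>

    int
    main(void)
    {
            int mib[2] = { CTL_VM, VM_UVMEXP };
            struct uvmexp ue;
            size_t len = sizeof(ue);

            /* no special privileges are required for this query */
            if (sysctl(mib, 2, &ue, &len, NULL, 0) == -1)
                    return (1);
            printf("faults %d pageins %d pdpageouts %d\n",
                ue.faults, ue.pageins, ue.pdpageouts);
            return (0);
    }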
The first UVM status program is called uvmexp. It is a simple text-based program
that gets the status of UVM, prints it, and exits. This is much like the “call uvm_dump”
command in DDB. The second UVM status program is an X11-based program called
xuvmstat. This program creates a window and graphically displays various system
counts (updating the display once a second). A screen dump of the xuvmstat program
is shown in Figure 9.2.
9.5 Summary
In this chapter we have reviewed the methods used to implement and debug UVM. UVM
was implemented in four phases. First the design of the BSD VM was studied. Second,
UVM’s core memory mapping and page fault functions were implemented and tested under
a system running BSD VM. Third, virtual memory management of all user-processes was
switched from BSD VM to UVM. Finally, the kernel’s memory management was switched
over to UVM. Throughout the implementation process, changes made to the BSD source
tree were merged into the UVM kernel source tree using CVS. This eased the integration of
UVM into the main source tree. Additionally, UVM was designed to co-exist in the source
tree with BSD VM. This allows for a smooth transition for BSD users from BSD VM to
UVM.
The UVMHIST tracing facility, UVM DDB commands, and UVM status programs
were also presented in this chapter. UVMHIST allows kernel developers to produce dy-
namic code traces through the VM system. This helps developers better understand the
operation of UVM and was also used to detect redundant calls in the VM system. The
UVM DDB commands allow a user to easily examine the state of UVM on a running sys-
tem. The UVM status programs are built on top of a non-privileged system call that allows
users to query the status of UVM.
Chapter 10
Results
In previous chapters we described the design and implementation of new VM features first
introduced to BSD by UVM. Features such as page loanout, page transfer, and map entry
passing provide the I/O and IPC subsystems with multiple data movement options. But
such gains in flexibility are useful only if they are more efficient than traditional data
copying. In this chapter we describe five sets of tests designed to measure the
performance of UVM’s new features. The results of these measurements show that page
loanout, page transfer, and map entry passing can significantly improve I/O and IPC perfor-
mance over traditional systems. We also measure the effects of UVM’s secondary design
elements. Our results show that due to our secondary design improvements UVM outper-
forms BSD VM in several critical areas, even when our new data movement mechanisms
are not used.
All our tests were performed on 200 MHz Pentium Pro processors running NetBSD
1.3A with either UVM or BSD VM. Our Pentium Pros have a 16K level-one cache, a 256K
level-two cache, and 32MB of physical memory. The main memory bandwidth of our
Pentium Pros is 225 MB/sec for reads, 82 MB/sec for writes, and 50 MB/sec for copies, as
reported by the lmbench memory benchmarks [40].
Figure 10.1: Socket write code path (socket layer: sosend; protocol layer: tcp_usrreq,
udp_usrreq). The code in the box has been duplicated and modified to use page loanout
for our measurements. The modified code is accessed by an alternate write system call.
file descriptors: sockets and vnodes. The write file operation for sockets is the soo_write
function. The write operation for vnodes is vn_write.
The soo_write function is a one-line function that translates the write call into a
sosend call. The sosend function does the lion’s share of the work of the socket layer,
including handling all the numerous options supported by the socket layer. If the sosend
function determines that it is OK to send data on the socket, it will allocate an mbuf chain
for the data, copy the data from the user buffer pointed to by the uio structure into the
chain, and then pass the newly created mbuf chain to the protocol layer’s user request
function (the PRU_SEND request) for transmission. Once the socket layer has passed the
data to the protocol layer, its role in the transmit process is complete. The code path for a
write is shown in Figure 10.1.
To enable the socket layer to use page loanout rather than data copying for data
transmission, a second write system call was added to the kernel. Having two write sys-
tem calls in the kernel allows us to easily switch between data copying and page loanout
when taking measurements. The new write system call calls a modified version of
soo_write when a write to a socket is detected. The alternate soo_write function
calls xsosend. The xsosend function is a modified version of sosend that looks for
large user buffers.
When xsosend detects a large user buffer, it uses UVM’s page loanout to loan the
user’s pages out to the kernel. It then maps those pages into the kernel address space
and allocates an external mbuf for each one. Each external mbuf points to a special free
function that will drop the loan once the pages have been transmitted.
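The mechanics are roughly as follows (a sketch: MEXTADD is the BSD macro for
attaching external storage to an mbuf, with the argument order shown here following
NetBSD's mbuf.h; kva, pg, and xsosend_loan_free are hypothetical stand-ins for the
actual code):

    /* sketch: wrap one loaned, kernel-mapped page in an external mbuf */
    struct mbuf *m;

    MGET(m, M_WAIT, MT_DATA);
    /* kva is the kernel mapping of the loaned page pg (assumed set up) */
    MEXTADD(m, kva, PAGE_SIZE, M_MBUF, xsosend_loan_free, pg);
    m->m_len = PAGE_SIZE;
    /* xsosend_loan_free() (hypothetical) drops the loan on pg once the
       protocol layer has finished transmitting the data */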
3. A buffer is allocated.
4. The entire buffer is written to the null socket in fixed-size chunks (the “write size”)
a specified number of times.
The test program takes as parameters the size of the buffer, the number of times the buffer
is transmitted, and the amount of data to pass to the write system call (the write size).
The test program was run using a two-megabyte buffer that was transmitted 1024
times. Thus, two gigabytes of data were transmitted for each run of the test program. Each
run of the program was timed so that the bandwidth of the null socket could be determined.
The write size was varied from 1K to 2048K. The results of the test using copying and page
loanout are shown in Figure 10.2.
For write sizes that are less than the hardware page size (4K), copying the data
produces a higher bandwidth than using page loanout. This is because the page loanout
mechanism is being used to loan out pages that are only partially full of valid data. For
example, when the write size is 1K, for each page loaned out there is 3K of the page that
is not being used. Once the write size reaches the page size, page loanout’s bandwidth
overtakes data copying. As the write size is increased to allow multiple pages to be
transmitted in a single write call, the data copying bandwidth increases until the write size
reaches 32K.
Figure 10.2: Comparison of copy and loan procedures for writing to a null socket
(bandwidth in MB/sec versus write size in kbytes).
The data copy bandwidth never exceeds 172 MB/sec. Meanwhile, the
loanout bandwidth rises sharply as the write size is increased. When the write size is 32K
the loanout bandwidth is 560 MB/sec. The loanout bandwidth levels off at 750 MB/sec
when the write size is 512K. Clearly, page loanout improved the I/O performance of the
test program by reducing data copy overhead.
Note that the high bandwidth achieved by page loanout is through virtual memory
based operations rather than through data copying. The results of the lmbench memory
bandwidth benchmarks show that such high bandwidths cannot be achieved if each byte of
the data is read or copied. Not all applications require the data to be accessed. For example,
the multimedia video server described in [12] transmits data read from disk without reading
or touching it and thus could benefit from page loanout.
Figure 10.3: Socket read code path (socket layer: soreceive; protocol layer: tcp_usrreq).
The code in the box has been duplicated and modified to use page transfer for our
measurements. The modified code is accessed by an alternate read system call.
1. Unused portions of the cluster mbuf’s data area are zeroed to ensure data security.
2. A new anon and a replacement page are allocated.
3. The new page is mapped into the cluster mbuf’s data area and the old page is attached
to the new anon.
4. The new anon is returned (it is ready to be transferred into the receiving process’
address space).
Once the data pages have been removed from the cluster mbufs and placed into anons,
the anons can be transferred into the user’s address space using the normal page transfer
functions.
Figure 10.4: Comparison of copy and transfer procedures for reading from a null socket
(bandwidth in MB/sec versus read size in kbytes).
2. A buffer is allocated if data copying is being used. This buffer will be used with the
read system call (its size is called the “read size”). If page transfer is being used,
then the operating system will provide the buffer.
3. The data is read using null socket reads in fixed-size chunks.
4. If page transfer is used, the transferred chunk of memory is freed using the anflush
system call after the read operation.
The test program takes as input parameters the amount of data to read, and the number of
bytes the program should read in one system call (the read size).
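When page transfer is used, the receive loop looks roughly like the following sketch
(xread and anflush are the experimental system calls described in this chapter; their
exact signatures are assumptions made for this illustration, as are fd, total, and readsize):

    /* sketch: page-transfer receive loop (signatures are assumptions) */
    size_t left = total;
    while (left > 0) {
            void *p;
            ssize_t n = xread(fd, &p, readsize); /* kernel supplies buffer */
            if (n <= 0)
                    break;
            /* ...consume the data at p here if desired... */
            anflush(p, (size_t)n);  /* release the transferred pages */
            left -= (size_t)n;
    }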
The test program was used to transfer 1GB of data, first using data copying and then
using page transfer. The read size was varied from 4K (one page) to 2048K, and the
bandwidth was measured. The results of the test are shown in Figure 10.4.
When copying data, smaller reads produce a greater bandwidth. As the read size
increases, the null socket read bandwidth decreases from 315 MB/sec to 50 MB/sec. This
decrease is due to data caching on the Pentium Pro. Memory accesses to the cache are
significantly faster than memory accesses to main memory. If an application’s buffer en-
tirely fits within a cache then memory references to it will be faster than the bandwidth of
main memory. For example, when the test program uses 4K buffers both the source and
destination buffers of the data copy fit within the level-one cache producing a bandwidth
of 315 MB/sec. When the size of the test program’s buffers are increased to 8K they no
longer fit within the level-one cache, but they do fit within the level-two cache. This results
in a bandwidth drop to 225 MB/sec. The bandwidth remains nearly constant for read sizes
from 8K to 64K. However, between 128K and 256K the bandwidth drops to 50 MB/sec
as the size of the data buffers becomes larger than the level-two cache. Finally, the band-
width eventually levels out at 50 MB/sec (the same value we got from lmbench’s memory
copy benchmark). The 4K read size gets the full benefit of the high bandwidth of the Pen-
tium Pro’s level-one cache since the test program was the only process running on the
system when the measurements were taken (so its data was always fresh in the cache).
On the other hand, the cache does not play as great a role in page transfer. The
bandwidth starts around 190 MB/sec for page sized transfers, and then increases up to
730 MB/sec for a 1MB page transfer. The final dip in the page transfer curve is due to
the cache. As the transfer size gets larger, the page transfer code touches more kernel data
structures, and at some point the touched data structures exceed the size of the level-one
cache.
Note that the high bandwidth achieved by page transfer is through virtual mem-
ory based operations rather than through data copying. Such high bandwidths cannot be
achieved if each byte of the data is read or copied. Not all applications require the data to
be accessed. For example, the multimedia video recorder used in our video server project
receives data from an ATM network and immediately writes it to disk without reading or
touching the data [12].
1. The parameters are checked to see if they are favorable to page transfer. If they are
not, then soo_read2 uses the traditional soo_read function to move the data.
2. If there is no data to receive, then the soo_read2 function causes the calling process
to sleep until data arrives (unless the socket is in non-blocking mode, in which case
an error is returned).
3. Once soo_read2 has data, the mbuf chain the data resides in is checked to see if
its data can be transferred using page transfer. In order to do this, the mbuf chain
must consist of either cluster mbufs or of external mbufs whose data areas are pages
loaned out from some other process. If the mbuf chain is not a candidate for page
transfer, then soo_read2 uses a traditional data copy to deliver the data.
4. For mbuf chains that can be used with page transfer, the soo_read2 function calls
the so_transfer function to transfer the data to the user process. It then returns
the address to which the pages were transferred to the xread function.
The so_transfer function handles the page transfer of an mbuf chain. It operates
as follows:
1. Space in the receiving process’ page transfer area is reserved for the data.
2. The mbuf chain is then traversed. Each mbuf’s data area is loaded into an anon. For
cluster mbufs, so_mcl_steal is used (see Section 10.2.1). For external mbufs, the
so_mextloan_reloan function is used (see below).
3. Finally, the anons are inserted into the receiving process’ address space.
The so_mextloan_reloan function is used for external mbufs whose data area is on
loan from a process’ address space due to page loanout caused by the xwrite system call.
This function looks up the page associated with the mbuf. If the page is attached to an
anon, then the anon’s reference count is incremented. Otherwise, a new anon is allocated
and the page is attached to it. The function returns the anon.
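This logic amounts to the following sketch (uvm_analloc is UVM's anon allocator; the
field names follow the NetBSD sources, and locking and loan accounting are omitted):

    /* sketch of so_mextloan_reloan(): wrap a loaned page in an anon */
    struct vm_anon *
    so_mextloan_reloan(struct vm_page *pg)
    {
            struct vm_anon *anon = pg->uanon;

            if (anon != NULL) {
                    anon->an_ref++;        /* page already owned by an anon */
            } else {
                    anon = uvm_analloc();  /* allocate a fresh anon... */
                    anon->u.an_page = pg;  /* ...and attach the page */
                    pg->uanon = anon;
            }
            return (anon);
    }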
These changes allow a process to receive data from a pipe using page transfer.
3. The child process writes a specified amount of data into the pipe, using page loanout
if requested. Once the data is written the child process closes the pipe and exits.
4. The parent process reads data out of the pipe, using page transfer if requested. After
a read operation, the data is released using the anflush system call. Once all data
is read, the parent process exits.
The test program takes as parameters the amount of data to transfer over the pipe and the
read and write sizes.
Figure 10.5: Comparison of copy, transfer, loan, and both procedures for moving data over
a pipe (bandwidth in MB/sec versus transfer size in kbytes).
We used the test program to produce four sets of data. For each set of data we
transferred 1GB of data, varying the read and write sizes from 4K to 2048K. The four data
sets are:
copy: Data was copied on both the sending and receiving ends of the pipe.
transfer: Data was copied on the sending side and moved with page transfer on the re-
ceiving side of the pipe.
loan: Data was loaned out on the sending side and copied on the receiving side of the pipe.
both: Data was loaned out on the sending side and moved with page transfer on the re-
ceiving side of the pipe.
4. For the test program that uses copying to move the data, the parent process initializes
the buffer (if data “touching” is requested), and writes the buffer into the socket. The
child process then reads the data from the socket. If data touching is requested, then
the child process modifies the data. Then the child process writes the data back into
its end of the socket. The parent process reads the buffer back. This entire process is
repeated the specified number of times.
5. For the test program that uses map entry passing to move the data, the procedure is
similar to the data copying procedure except that the export and import system calls
are used. The parent process initializes the data (if touching), exports the buffer, and
writes the export tag to the socket. The child process reads the export tag from the
socket and uses that to import the buffer from the parent process (thus removing it
from the parent’s address space). The child process then modifies the data buffer
(if touching), exports the buffer, and writes the export tag to the socket. The parent
reads the tag from the socket and imports the buffer back from the child. This entire
process is repeated the specified number of times.
Both test programs take as parameters the size of the buffer to transfer and the number of
times the buffer should be transferred.
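One round trip of the exchange can be sketched as follows (mexport, mimport, and the
exptag_t tag type are hypothetical stand-ins for the dissertation's export and import
system calls, whose exact interfaces are not shown here; error checks are omitted):

    /* sketch: parent's side of one exchange round (names hypothetical) */
    exptag_t tag;

    mexport(buf, len, &tag);         /* remove buf from our address space */
    write(sock, &tag, sizeof(tag));  /* hand the export tag to the child */
    read(sock, &tag, sizeof(tag));   /* wait for the child's tag... */
    buf = mimport(tag);              /* ...and import the buffer back */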
Each test program was run using transfer sizes ranging from 4K to 4096K. The
number of times the data was transferred from parent to child process and back was adjusted
so that the test program ran for at least thirty seconds. Each run of the program was timed
so that the bandwidth could be determined.
As shown in Figure 10.6, the bandwidth curves for the test program that uses data
copying via the socket pair start off near 35 MB/sec for a one page transfer size. The
bandwidth decreases as the transfer size increases due to caching effects. The bandwidth
of the data-copying tests levels off at 12 MB/sec. On the other hand, the bandwidth curve for
the map entry passing test program peaks around 96 MB/sec for a transfer size of 64K and
then caching effects on kernel VM data structures cause it to fall and level off at 34 MB/sec.
Figure 10.6: Comparison of copy and map entry passing (m.e.p.) transferring memory
between two processes (bandwidth in MB/sec versus transfer size in kbytes).
If the application does not touch the data after it is transferred, then the bandwidth
for data copying levels off at 18 MB/sec rather than 12 MB/sec. On the other hand, if the
map entry passing test program does not touch the data, then the bandwidth rises rapidly,
reaching 18500 MB/sec for a transfer size of 4096K. This high bandwidth occurs because
the data being transferred is accessed neither by the application program nor the kernel, so
the data is never loaded into the processor from main memory.
10.4.2 Data Pipeline With Map Entry Passing and Page Loanout
Another test program we wrote combines the use of map entry passing with page loanout.
The test program consists of three processes. The first process allocates and initializes a
data buffer of a specified size and then sends it to the second process. The second process
modifies the data and then sends it to the third process. The third process receives the
data and writes it to a null protocol socket. Once the data has been written to the null
protocol socket, the first process can generate another buffer’s worth of data.
Figure 10.7: Comparison of copy and map entry passing/page loanout pipeline
(bandwidth in MB/sec versus transfer size in kbytes).
3. The test program then repeatedly uses fork or vfork to create a child process. The
child process immediately exits.
The test program takes as parameters the size of anonymous memory to allocate, whether
to use fork or vfork, and whether or not to touch the anonymous memory after each
fork operation.
The test program was run under both BSD VM and UVM for memory allocations
of 0MB to 16MB. We had the test program perform 10,000 forks for each allocation size.
Figure 10.8: Process creation and deletion overhead under BSD VM and UVM (fork+exit
time in microseconds versus memory allocation in MB; curves for fork with and without
touching the allocated memory, and for vfork, under each VM system).
For the fork operation, we ran the test program with and without touching the allocated
memory area. The results of this test are shown in Figure 10.8.
For vfork operations, the time it took for a process to fork and exit remained
constant regardless of the memory allocation for both BSD VM (218 microseconds) and
UVM (210 microseconds). This was expected because with vfork the child process never
gets an address space of its own. Note that a normal fork operation takes at least five times
as much time to complete as a vfork operation.
For fork operations where the allocated data is not touched for each fork opera-
tion there is a major difference between BSD VM and UVM. For UVM, the fork and exit
time remained constant at 1035 microseconds. On the other hand, the time for BSD VM
grows with the size of the memory allocation. UVM’s behavior is easy to explain. At the
time of the first fork operation, the map entry in the parent’s map that maps the allocated
area of memory is write protected and marked “needs-copy.” Since the parent process never
writes to the allocated area memory between fork operations, the “needs-copy” flag re-
mains set. This indicates that the allocated area of memory is already set for copy-on-write.
As explained in Section 4.6 and Section 4.7, the UVM fork code takes advantage of this
to avoid repeating the copy-on-write setup operation, thus bypassing the overhead associ-
ated with the memory allocation. On the other hand, the BSD VM code appears to perform
some shadow-object chain related operations regardless of the needs-copy state of its map
entry, thus contributing to the overhead of the fork operation.
If the allocated data is touched between fork operations, then needs-copy gets
cleared and both virtual memory systems are forced to do all copy-on-write related opera-
tions on the allocated memory each time through the loop. In this case the time for a fork
and exit for both BSD VM and UVM increases with the size of the memory allocation.
For an allocation of zero, the times for BSD VM and UVM are 1177 microseconds and
1025 microseconds, respectively. For an allocation of 16MB, the loop times for BSD VM
and UVM rise to 8652 microseconds and 3764 microseconds, respectively. BSD VM’s
fork processing time increases more rapidly than UVM’s due to the expense of maintaining
object chains.
[Figure: time (seconds) versus memory allocation (MB) under BSD VM and UVM]
clustering its performance would improve, provided that the pagedaemon found contiguous
dirty pages to build clusters around.
Table 10.1: Page fault counts. The cc command was run on a “hello world” program.
When measuring the number of map entries under BSD VM, we did not count the map en-
tries allocated for the “buffer map” (see Section 8.10.1). The results of these measurements
are shown in Table 10.2.
The number of statically allocated map entries under UVM remains nearly constant
(seventeen or eighteen) as the number of processes on the system increases. On the other
hand, the number of statically allocated map entries under BSD VM grows with the number
of processes, reaching ninety-four entries when there are thirty-seven processes active on
Table 10.2: Comparison of the number of allocated map entries (the BSD VM numbers do
not include 859 dynamically allocated map entries used for the buffer map)
the system. This is due to the way BSD VM wires each process’ “user” structure into
kernel memory.
The total number of allocated map entries rises with the number of active processes
for both BSD VM and UVM, with BSD VM’s map entry allocation rising more rapidly
than UVM’s. For example, for thirty-seven processes BSD VM requires 785 map entries,
while UVM requires only 487.
Chapter 11
In this dissertation we have presented the design and implementation of the UVM virtual
memory system. UVM was designed to meet the needs of new computer applications that
require data to be moved in bulk through the kernel’s I/O system. Our novel approach fo-
cuses on allowing processes to pass memory to and from other processes and the kernel,
and to share memory. The memory passed or shared may be a block of pages, or it may be
a chunk of virtual memory space that may not be entirely mapped. This approach reduces
or eliminates the need to copy data thus reducing the time spent within the kernel and free-
ing up cycles for application processing. Unlike the approaches that focus exclusively on
the networking subsystem, our approach provides a general solution that can improve effi-
ciency of the entire I/O subsystem. UVM also improves the kernel’s overall performance
in other key areas. In this chapter we describe the contributions of the UVM project and
directions for future research.
11.1 Contributions
UVM provides the BSD kernel with three new mechanisms that allow processes to ex-
change and share page-sized chunks of memory without data copies. These three new
mechanisms are page loanout, page transfer, and map entry passing. The design of these
mechanisms was based on several key principles that we believe should be applicable to
other operating systems:
- First, although traditional virtual memory systems operate at a mapping-level gran-
ularity, we believe that page granular operations are also important and should be
provided. This gives kernel subsystems such as the I/O and IPC system the ability to
add and extract pages from a process’ address space without disrupting or fragment-
ing the high-level address space mappings.
- Second, a virtual memory system should be designed to avoid complex memory
lookup mechanisms. Simple memory lookup mechanisms are more efficient and ease
implementation of VM-based data movement mechanisms. For example, UVM’s
two-level memory mapping scheme improves performance over BSD object chains
and simplifies traditional functions such as the fault routine. It also eased the imple-
mentation of UVM’s page granular data movement mechanisms.
- Third, a virtual memory system should be designed in such a way that it allows
its pages to be shared easily between processes, the virtual memory layer, and
kernel subsystems. This allows data to be moved throughout the operating system
without it having to be copied. While BSD VM assumes that it always has complete
control over virtual memory pages, UVM does not.
While it would be foolish to use UVM’s data movement mechanisms on small data
chunks, our measurements clearly show that for larger chunks of data they provide a sig-
nificant reduction in kernel overhead as compared to data copying. For example:
- UVM’s page loanout mechanism provides processes with the ability to avoid costly
data copies by loaning a copy-on-write copy of their memory out to the kernel or
other processes. Our measurements show that loaning a single 4K page’s worth of
data out to the kernel rather than copying it can increase bandwidth by 35%. Larger
chunks of data provide an even greater performance improvement. For example,
loaning out a 2MB buffer rather than copying it increases the bandwidth by a
factor of 4.5.
- UVM’s page transfer mechanism allows a process to receive anonymous pages of
memory from other processes or the kernel without having to copy the data. Once a
page has been received with page transfer it becomes a normal page of anonymous
memory and requires no further special treatment from UVM. Our measurements
show that page transfer pays off when transferring two or more pages of data. A two-
page page transfer increases the bandwidth over copying by 28%. Increasing the
page transfer size to 1MB (256 pages) yields a bandwidth that is 14 times greater
than the data copying bandwidth.
- In addition to being used separately, we have shown that page transfer and page
loanout can be used together to improve the bandwidth delivered by a pipe by a
factor of 1.7 for page-sized chunks of data, and by a factor of 16 for 1MB-sized
chunks of data.
- UVM’s map entry passing mechanism allows processes to easily copy, share, and
exchange chunks of virtual memory from their address space. Measurements show
that map entry passing between two processes is more efficient than data copying for
two or more pages. For a two-page exchange, map entry passing provides a 44%
improvement over data copying. For a 512K (128 page) data exchange, map entry
passing outperforms data copying by a factor of 2.8.
- We have also shown that map entry passing can be combined with page loanout to
improve the performance of a data pipeline application. Our measurements show
that the pipeline using map entry passing and page loanout performs better than data
copying for data chunks of two or more pages. While map entry passing and page
loanout yield a modest 2% improvement for data chunks of two pages, they improve
performance by a factor of 2.4 for sixteen-page data chunks.
Our measurements also clearly show the effect on the cache of the kernel copying large
buffers of data multiple times. These data copies effectively flush all the data out of a
system’s cache, thus causing it to have to be reloaded. UVM’s new mechanisms eliminate
this cache-flushing effect.
In addition to providing new features, UVM also improves the BSD kernel’s perfor-
mance in traditional areas. These areas include the following:
- Forking time has been reduced under UVM. The vfork system call time has been
reduced by 4%. The fork system call time has been reduced by 13% for small
processes. As the process size increases, UVM’s improvement increases as well. For
example, a process with a 16MB region forks 56% faster under UVM than under
BSD VM.
- Pageout time has been reduced under UVM. For example, on our 32MB system the
time to satisfy a memory allocation of 50MB has been reduced by 83%.
- The number of calls to the page fault routine has been reduced under UVM. For ex-
ample, when compiling a “hello world” program the page fault routine is called 46%
less often under UVM.
- Map entry fragmentation has been greatly reduced under UVM. Static kernel map en-
tries are no longer allocated each time a process is created under UVM. And UVM’s
improved handling of wired memory reduces the number of map entries required for
traditional processes. For example, a system freshly booted multiuser under UVM
uses 40% fewer map entries than it would under BSD VM.
UVM also includes numerous design improvements over BSD VM. These improvements
include: the removal of object chaining and fictitious pages, the merging of the vnode and
VM object cache, the removal of deadlock conditions in the pmap, the elimination of the
troublesome “ptdi” panic condition, a unified mechanism to register the configuration of
physical memory with the VM system, and new unprivileged system calls that allow users
to obtain VM status without the need for “root” access.
In addition, UVM’s well-documented implementation methods show how a project
of this size can be accomplished by a single person.
UVM is now part of the standard NetBSD distribution and is scheduled to replace
the BSD VM system for NetBSD release 1.4. UVM already runs on almost all of NetBSD’s
platforms and is expected to run on every NetBSD platform soon. A port to OpenBSD is
also expected.
The UVM system is well organized and documented, thus making it well suited
for additional enhancements. Given the work we’ve already accomplished with UVM,
taking on these tasks for future research should be an interesting, but not insurmountable
challenge. Additional efforts in this area will allow applications to take full advantage of
the benefits of UVM.
Appendix A
Glossary
address space: a numeric range that identifies a portion of physical or virtual memory.
amap: a data structure also known as an anonymous map. An amap structure contains
pointers to a set of anons that are mapped together in virtual memory.
anon: a data structure that describes a single page of anonymous memory. Information
contained in an anon includes a reference counter and the current location of the data
(i.e. in RAM or in backing store).
aref: a data structure that is part of a map entry structure and contains a reference and an
offset to an amap structure.
BSD VM: the VM system imported into 4.4BSD from the Mach operating system.
backing store: a non-volatile area of memory, typically on disk, where data is stored when
it is not resident. Examples of backing store include plain files and the system’s swap
area.
bss area: the area of a process’ virtual address space where uninitialized data is placed.
The VM system fills this area with zeroed pages as it is referenced.
busy page: a page that should not be accessed because it is being read or written.
chunking, amap: the process of breaking up the allocation of large amaps into smaller
allocations.
clustered I/O: I/O in which operations on smaller contiguous regions are merged into a
single large I/O operation.
copy-on-write: a way of using the VM system to minimize the amount of data that must
be copied. The actual copy of each page of data is deferred until it is first written in
hopes that it will not be written at all. There are two forms of copy-on-write: private
copy-on-write and copy copy-on-write.
copy object: in the BSD VM, an anonymous memory object that stores the original ver-
sion of a changed page in a copied object. Copy objects are necessary to support
the non-standard copy copy-on-write semantics. A list of objects connected by their
copy object pointers is called a copy object chain.
data area: the area of a process’ virtual address space where initialized data is placed.
The data area comes from the file containing the program being run and is mapped
copy-on-write.
dirty page: a page whose data has been modified. Dirty pages must be cleaned before
they can be paged out.
fictitious page: a vm_page structure used by the BSD VM system to refer to device
memory.
kernel: the program that controls a computer’s hardware and software resources. The
kernel has several subsystems including networking, I/O, process scheduling, and
virtual memory. In a Unix-like operating system the kernel is stored in a file such as
/vmunix, or /netbsd. The kernel is part of an operating system.
memory management unit (MMU): the part of a computer’s hardware that translates vir-
tual addresses to physical addresses.
memory mapped object: a memory object that has pages which are currently mapped
into a user’s address space. Examples of memory mapped objects include memory
mapped files and memory mapped devices.
memory object: any kernel data structure that represents data that can be mapped into a
virtual address space.
object collapse problem: a problem with the BSD VM system where memory remains
allocated even though it can no longer be accessed. This eventually leads to the
kernel running out of swap space.
operating system: the kernel and associated system programs that come with it (e.g. sys-
tem utilities, windowing system, etc.).
page fault: a hardware exception triggered by the MMU when virtual memory is accessed
in a way that is inconsistent with how it is mapped. Page faults may be triggered,
for example, by accessing a virtual address that is not currently mapped to physical
memory, or by writing to a virtual address that is mapped read-only. There are several
kinds of page faults. A read fault occurs when an unmapped area of memory is
read. A write fault occurs when an unmapped or write-protected area of memory is
written.
released page: a page that needs to be freed but cannot be freed because it is busy. Such a
page is marked released and will be freed when it is no longer busy.
share map: a map which defines a range of shared virtual address space. Mapping changes
in a share map are seen by all processes sharing the map.
shared mapping: a mapping in which changes made to memory are reflected back to the
backing object. Thus these changes are shared by any process mapping the object.
shadow object: in the BSD VM, an object that stores the changed version of an object
mapped copy-on-write. A list of objects connected by their shadow object pointers
is called a shadow object chain.
stack area: the area of a process’ virtual address space where the program execution stack
is placed. This area is filled with zeroed pages by the VM system as it is referenced.
submap: a map used by the kernel to break up its address space into smaller chunks.
swap space: the area of backing store where anonymous memory is paged out to when
physical memory is scarce.
swap memory leak: in the BSD VM, when anonymous vm object structures contain
memory that is not accessible by any process, those pages eventually get paged out
to swap space. Swap space eventually gets filled, and the operating system can dead-
lock.
text area: the area of a process’ virtual address space where the executable instructions
of a program are placed. The text area comes from the file containing the program
being run and is mapped read-only.
virtual address space: an address space in which virtual memory resides. In a Unix-like
operating system each process on the system has its own private virtual address space.
virtual memory: memory that must be mapped to a page of physical memory by the VM
system before it can be referenced.
vm_object: a data structure that describes a file, device, or zero-fill area which can be
mapped into virtual address space.
wanted page: a page that is needed but currently busy. A process can mark a busy page
wanted and then wait for it to become un-busy.
wired page: a page of virtual memory that is always resident in physical memory.
zero-filled memory: memory that is allocated and filled with zeros only when it is refer-
enced.
References
[3] E. Anderson. Container Shipping: A Uniform Interface for Fast, Efficient, High-
Bandwidth I/O. PhD thesis, University of California, San Diego, 1995.
[4] J. Barrera. A fast Mach network IPC implementation. In Proceedings of the USENIX
Mach Symposium, pages 1–11, November 1991.
[10] J. Brustoloni. Effects of Data Passing Semantics and Operating System Structure on
Network I/O Performance. PhD thesis, Carnegie Mellon University, September 1997.
[12] M. Buddhikot, X. Chen, D. Wu, and G. Parulkar. Enhancements to 4.4 BSD UNIX
for networked multimedia in project MARS. In Proceedings of IEEE Multimedia
Systems’98, June 1998.
[13] H. Chartock and P. Snyder. Virtual swap space in SunOS. In Proceedings of the
Autumn 1991 European UNIX Users Group Conference, September 1991.
[14] H. Chu. Zero-copy TCP in Solaris. In Proceedings of the USENIX Conference, pages
253–264. USENIX, 1996.
[15] C. Cranor and G. Parulkar. Universal continuous media I/O: Design and implementa-
tion. Technical Report WUCS-94-34, Washington University, 1994.
[16] C. Cranor and G. Parulkar. Design of universal continuous media I/O. In Proceedings
of the 5th International Workshop on Network and Operating Systems Support for
Digital Audio and Video, pages 83–86, 1995.
[17] R. Dean and F. Armand. Data movement in kernelized systems. Technical Report
CS-TR-92-51, Chorus Systèmes, 1992.
[19] P. Denning. Virtual memory. ACM Computing Surveys, 28(1), March 1996.
[20] Z. Dittia, J. Cox, and G. Parulkar. Design of the APIC: A high performance ATM
host-network interface chip. In The Proceedings of IEEE INFOCOM 1995, pages
179–187, 1995.
[21] Z. Dittia, J. Cox, and G. Parulkar. Design and implementation of a versatile multime-
dia network interface and I/O chip. In Proceedings of the 6th International Workshop
on Network and Operating Systems Support for Digital Audio and Video, April 1996.
[22] Z. Dittia, G. Parulkar, and J. Cox. The APIC approach to high performance network
interface design: Protected DMA and other techniques. In The Proceedings of IEEE
INFOCOM 1997, 1997.
[23] P. Druschel. Operating System Support for High-Speed Networking. PhD thesis,
University of Arizona, August 1994.
[26] K. Fall. A Peer-to-Peer I/O System in Support of I/O Intensive Workloads. PhD thesis,
University of California, San Diego, 1994.
[27] K. Fall and J. Pasquale. Exploiting in-kernel data paths to improve I/O throughput
and CPU availability. In Proceedings of USENIX Winter Conference, pages 327–333.
USENIX, January 1993.
[29] R. Gopalakrishnan and G. Parulkar. Bringing real-time scheduling theory and practice
closer for multimedia computing. In Proceedings ACM SIGMETRICS, May 1996.
[30] The Open Group. The Single UNIX Specification, Version 2. The Open Group, 1998.
See http://www.UNIX-systems.org/online.html.
[31] G. Hamilton and P. Kougiouris. The Spring nucleus: A microkernel for objects. In
Proceedings of USENIX Summer Conference, pages 147–160. USENIX, June 1993.
[35] Y. Khalidi and M. Nelson. The Spring virtual memory system. Technical Report
SMLI TR-93-09, Sun Microsystems Laboratories, Inc., February 1993.
[36] S. Leffler, M. McKusick, M. Karels, and J. Quarterman. The Design and Implemen-
tation of the 4.3BSD Operating System. Addison-Wesley, 1989.
[37] J. Liedtke. Improving IPC by kernel design. In Proceedings of the Fourteenth ACM
Symposium on Operating Systems Principles, pages 175–188, 1993.
[39] M. McKusick, K. Bostic, M. Karels, and J. Quarterman. The Design and Implemen-
tation of the 4.4BSD Operating System. Addison Wesley, 1996.
[40] L. McVoy and C. Staelin. lmbench: Portable tools for performance analysis. In
Proceedings of USENIX Conference, pages 279–294. USENIX, 1996.
[41] F. Miller. Input/Output System Design For Streaming. PhD thesis, University of
Maryland, 1998.
[42] F. Miller and S. Tripathi. An integrated input/output system for kernel data streaming.
Multimedia Computing and Networking, 1998.
[43] R. Minnich. Mether-NFS: A modified NFS which supports virtual shared memory.
In Proceedings of USENIX Experiences with Distributed and Multiprocessor Systems
(SEDMS IV), September 1993.
[46] J. Moran. SunOS virtual memory implementation. In Proceedings of the Spring 1988
European UNIX Users Group Conference, April 1988.
[47] D. Mosberger and L. Peterson. Making paths explicit in the Scout operating system.
In OSDI 1996, pages 153–168, 1996.
[48] M. Nelson. Virtual memory for the Sprite operating system. Technical Report
UCB/CSD 86/301, Computer Science Division, EECS Department, University of
California, Berkeley, June 1986.
[50] J. Ousterhout. Why aren’t operating systems getting faster as fast as hardware? In
Proceedings of USENIX Summer Conference, pages 247–256. USENIX, 1990.
[51] J. Ousterhout, A. Cherenson, F. Douglis, M. Nelson, and B. Welch. The Sprite net-
work operating system. Computer, 21(2):23–36, February 1988.
[52] V. Pai, P. Druschel, and W. Zwaenepoel. IO-Lite: A unified I/O buffering and caching
system. Technical Report TR97-294, Department of Computer Science, Rice Univer-
sity, 1997.
[53] J. Pasquale, E. Anderson, and P. Muller. Container Shipping: Operating system sup-
port for I/O intensive applications. Computer, 27(3), March 1994.
[58] D. Ritchie. A stream input-output system. AT&T Bell Laboratories Technical Journal,
63(8):1897–1910, October 1984.
[61] M. Schroeder and M. Burrows. The performance of Firefly RPC. ACM Transactions
on Computer Systems, 8(1):1–17, February 1990.
[62] D. Solomon. Inside Windows NT. Microsoft Press, 2nd edition, 1998. Based on the
first edition by Helen Custer.
[63] R. Stallman and Cygnus Support. Debugging with GDB. The Free Software Founda-
tion, 1994.
[65] M. Thadani and Y. Khalidi. An efficient zero-copy I/O framework for Unix. Technical
Report SMLI TR-95-39, Sun Microsystems Laboratories, May 1995.
[66] L. Torvalds et al. The Linux operating system. See http://www.linux.org for
more information.
[67] S. Tzou and D. Anderson. The performance of message-passing using restricted vir-
tual memory remapping. Software - Practice and Experience, 21(3):251–267, March
1991.
[68] USL. Unix System V Release 4.2 STREAMS Modules and Drivers. Prentice-Hall,
1992.
[73] W. Zwaenepoel and D. Johnson. The Peregrine high-performance RPC system. Soft-
ware - Practice and Experience, 23(2):201–221, February 1993.
Vita
Charles D. Cranor
Date of Birth June 14, 1967
August 1998