Virtualization Chapter Reading
• In the Normal World, in kernel mode are (1) the HAL and its extensions and
(2) the kernel and its executive, which load drivers and DLL dependencies.
In user mode are a collection of system processes, the Win32 environment
subsystem, and various services.
• In the Secure World, if VSM is enabled, are a secure kernel and executive
(within which a secure micro-HAL is embedded). A collection of isolated
trustlets (discussed later) run in secure user mode.
• Finally, the bottommost layer in Secure World runs in a special processor
mode (called, for example, VMX Root Mode on Intel processors), which
contains the Hyper-V hypervisor component, which uses hardware virtu-
alization to construct the Normal-to-Secure-World boundary. (The user-to-
kernel boundary is provided by the CPU natively.)
21.3.4 Kernel
The kernel layer of Windows has the following main responsibilities: thread
scheduling and context switching, low-level processor synchronization, inter-
rupt and exception handling, and switching between user mode and kernel
mode through the system-call interface. Additionally, the kernel layer imple-
ments the initial code that takes over from the boot loader, formalizing the tran-
sition into the Windows operating system. It also implements the initial code
that safely crashes the kernel in case of an unexpected exception, assertion, or
other inconsistency. The kernel is mostly implemented in the C language, using
assembly language only when absolutely necessary to interface with the lowest
level of the hardware architecture and when direct register access is needed.
The dispatcher provides the foundation for the executive and the subsystems.
Most of the dispatcher is never paged out of memory, and its execution is
never preempted. Its main responsibilities are thread scheduling and context
switching, implementation of synchronization primitives, timer management,
software interrupts (asynchronous and deferred procedure calls), interproces-
sor interrupts (IPIs), and exception dispatching. It also manages hardware and
software interrupt prioritization under the system of interrupt request levels
(IRQLs).
Like many other modern operating systems, Windows uses threads as the key
schedulable unit of executable code, with processes serving as containers of
threads. Therefore, each process must have at least one thread, and each thread
has its own scheduling state, including actual priority, processor affinity, and
CPU usage information.
There are eight possible thread states: initializing, ready, deferred-
ready, standby, running, waiting, transition, and terminated. ready
indicates that the thread is waiting to execute, while deferred-ready indicates
that the thread has been selected to run on a specific processor but has not yet
been scheduled. A thread is running when it is executing on a processor core. It
runs until it is preempted by a higher-priority thread, until it terminates, until
its allotted execution time (quantum) ends, or until it waits on a dispatcher
object, such as an event signaling I/O completion. If a thread is preempting
another thread on a different processor, it is placed in the standby state on
that processor, which means it is the next thread to run.
Preemption is instantaneous—the current thread does not get a chance to
finish its quantum. Therefore, the processor sends a software interrupt—in
this case, a deferred procedure call (DPC)—to signal to the other processor
that a thread is in the standby state and should be immediately picked up for
execution. Interestingly, a thread in the standby state can itself be preempted
if yet another processor finds an even higher-priority thread to run on this
processor. At that point, the new higher-priority thread will go to standby,
and the previous thread will go to the ready state. A thread is in the waiting
state when it is waiting for a dispatcher object to be signaled. A thread is in
the transition state while it waits for resources necessary for execution; for
example, it may be waiting for its kernel stack to be paged in from secondary
storage. A thread enters the terminated state when it nishes execution, and a
thread begins in the initializing state as it is being created, before becoming
ready for the rst time.
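These eight states can be summarized as a simple enumeration. The following C sketch is illustrative only; the names are simplified and do not claim to match the kernel's internal definitions.

```c
/* Illustrative enumeration of the eight scheduling states described above. */
typedef enum _THREAD_SCHED_STATE {
    ThreadStateInitializing,  /* being created, not yet ready                         */
    ThreadStateReady,         /* waiting for a processor                              */
    ThreadStateDeferredReady, /* chosen for a specific processor, not yet scheduled   */
    ThreadStateStandby,       /* next to run on a particular processor                */
    ThreadStateRunning,       /* currently executing on a core                        */
    ThreadStateWaiting,       /* blocked on a dispatcher object                       */
    ThreadStateTransition,    /* waiting for a resource, e.g., a paged-out kernel stack */
    ThreadStateTerminated     /* finished execution                                   */
} THREAD_SCHED_STATE;
```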
The dispatcher uses a 32-level priority scheme to determine the order of
thread execution. Priorities are divided into two classes: variable class and
static class. The variable class contains threads having priorities from 1 to 15,
and the static class contains threads with priorities ranging from 16 to 31. The
dispatcher uses a linked list for each scheduling priority; this set of lists is called
the dispatcher database. The database uses a bitmap to indicate the presence
of at least one entry in the list associated with the priority of the bit’s position.
Therefore, instead of having to traverse the set of lists from highest to lowest
until it finds a thread that is ready to run, the dispatcher can simply find the
list associated with the highest bit set.
Prior to Windows Server 2003, the dispatcher database was global,
resulting in heavy contention on large CPU systems. In Windows Server 2003
and later versions, the global database was broken apart into per-processor
databases, with per-processor locks. With this new model, a thread will only
be in the database of its ideal processor. It is thus guaranteed to have a
processor affinity that includes the processor on whose database it is located.
The dispatcher can now simply pick the first thread in the list associated with
the highest bit set and does not have to acquire a global lock. Dispatching
is therefore a constant-time operation, parallelizable across all CPUs on the
machine.
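The constant-time lookup can be illustrated with a small C sketch. The structure and function names below are hypothetical rather than the kernel's actual data structures; the point is the summary bitmap plus one ready list per priority level.

```c
#include <stdint.h>
#include <stddef.h>

#define PRIORITY_LEVELS 32

/* Hypothetical per-processor dispatcher database. */
typedef struct _READY_QUEUE {
    uint32_t       ready_summary;                /* bit n set => list n is non-empty */
    struct thread *ready_list[PRIORITY_LEVELS];  /* head of the list for priority n  */
} READY_QUEUE;

/* Pick the highest-priority ready thread in constant time: find the highest
   set bit in the summary and take the head of that list.
   __builtin_clz is the GCC/Clang intrinsic; _BitScanReverse is the MSVC
   equivalent. */
static struct thread *pick_next_thread(READY_QUEUE *q) {
    if (q->ready_summary == 0)
        return NULL;   /* nothing ready: the idle thread runs instead */
    int top = 31 - __builtin_clz(q->ready_summary);
    return q->ready_list[top];
}
```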
On a single-processor system, if no ready thread is found, the dispatcher
executes a special thread called the idle thread, whose role is to begin the
transition to one of the CPU’s initial sleep states. Priority class 0 is reserved for
the idle thread. On a multiprocessor system, before executing the idle thread,
the dispatcher looks at the dispatcher databases of other nearby processors,
taking caching topologies and NUMA node distances into consideration. This
operation requires acquiring the locks of other processor cores in order to safely
inspect their lists. If no thread can be stolen from a nearby core, the dispatcher
looks at the next nearest core, and so on. If no threads can be stolen at all, then
the processor executes the idle thread. Therefore, in a multiprocessor system,
each CPU will have its own idle thread.
Putting each thread on only the dispatcher database of its ideal processor
causes a locality problem. Imagine a CPU executing a thread at priority 2 in a
CPU-bound way, while another CPU is executing a thread at priority 18, also
CPU-bound. Then, a thread at priority 17 becomes ready. If the ideal processor
of this thread is the first CPU, the thread preempts the current running thread.
But if the ideal processor is the latter CPU, it goes into the ready queue instead,
waiting for its turn to run (which won’t happen until the priority 18 thread
gives up the CPU by terminating or entering a wait state).
Windows 7 introduced a load-balancer algorithm to address this situation,
but it was a heavy-handed and disruptive approach to the locality issue. Win-
dows 8 and later versions solved the problem in a more nuanced way. Instead
of a global database as in Windows XP and earlier versions, or a per-processor
database as in Windows Server 2003 and later versions, the newer Windows
versions combine these approaches to form a shared ready queue among a
group of some, but not all, processors. The number of CPUs that form one
shared group depends on the topology of the system, as well as on whether
it is a server or client system. The number is chosen to keep contention low
on very large processor systems, while avoiding locality (and thus latency and
contention) issues on smaller client systems. Additionally, processor affinities
are still respected, so that a processor in a given group is guaranteed that all
threads in the shared ready queue are appropriate—it never needs to “skip”
over a thread, keeping the algorithm constant time.
Windows has a timer expire every 15 milliseconds to create a clock “tick” to
examine system states, update the time, and do other housekeeping. That tick
is received by the thread on every non-idle core. The interrupt handler (being
run by the thread, now in KT mode) determines if the thread’s quantum has
expired. When a thread’s time quantum runs out, the clock interrupt queues
a quantum-end DPC to the processor. Queuing the DPC results in a software
interrupt when the processor returns to normal interrupt priority. The software
interrupt causes the thread to run dispatcher code in KT mode to reschedule the
processor to execute the next ready thread at the preempted thread’s priority
level in a round-robin fashion. If no other thread at this level is ready, a lower-
priority ready thread is not chosen, because a higher-priority ready thread
already exists—the one that exhausted its quantum in the first place. In this
situation, the quantum is simply restored to its default value, and the same
thread executes once again. Therefore, Windows always executes the highest-
priority ready thread.
When a variable-priority thread is awakened from a wait operation, the
dispatcher may boost its priority. The amount of the boost depends on the type
of wait associated with the thread. If the wait was due to I/O, then the boost
depends on the device for which the thread was waiting. For example, a thread
waiting for sound I/O would get a large priority increase, whereas a thread
waiting for a disk operation would get a moderate one. This strategy enables
I/O-bound threads to keep the I/O devices busy while permitting compute-
bound threads to use spare CPU cycles in the background.
Another type of boost is applied to threads waiting on mutex, semaphore,
or event synchronization objects. This boost is usually a hard-coded value
of one priority level, although kernel drivers have the option of making a
different change. (For example, the kernel-mode GUI code applies a boost of
two priority levels to all GUI threads waking up to process window messages.)
This strategy is used to reduce the latency between when a lock or other
notification mechanism is signaled and when the next waiter in line executes
in response to the state change.
In addition, the thread associated with the user’s active GUI window
receives a priority boost of two whenever it wakes up for any reason, on top
of any other existing boost, to enhance its response time. This strategy, called
the foreground priority separation boost, tends to give good response times to
interactive threads.
Finally, Windows Server 2003 added a lock-handoff boost for certain classes
of locks, such as critical sections. This boost is similar to the mutex, semaphore,
and event boost, except that it tracks ownership. Instead of boosting the waking
thread by a hard-coded value of one priority level, it boosts to one priority
level above that of the current owner (the one releasing the lock). This helps in
situations where, for example, a thread at priority 12 is releasing a mutex, but
the waiting thread is at priority 8. If the waiting thread receives a boost only to
9, it will not be able to preempt the releasing thread. But if it receives a boost
to 13, it can preempt and instantly acquire the critical section.
Because threads may run with boosted priorities when they wake up from
waits, the priority of a thread is lowered at the end of every quantum as long
as the thread is above its base (initial) priority. This is done according to the
following rule: For I/O threads and threads boosted due to waking up because
of an event, mutex, or semaphore, one priority level is lost at quantum end.
For threads boosted due to the lock-handoff boost or the foreground priority
separation boost, the entire value of the boost is lost. Threads that have received
boosts of both types will obey both of these rules (losing one level of the
first boost, as well as the entirety of the second boost). Lowering the thread’s
priority makes sure that the boost is applied only for latency reduction and for
keeping I/O devices busy, not to give undue execution preference to compute-
bound threads.
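The decay rules can be expressed as a short C sketch. The fields and helper below are hypothetical bookkeeping, not actual kernel state, but they capture both rules: wait-related boosts decay one level per quantum end, lock-handoff and foreground boosts are removed entirely, and priority never drops below the base.

```c
/* Hypothetical per-thread priority bookkeeping. */
typedef struct _THREAD_PRIO {
    int base_priority;     /* the thread's base (initial) priority             */
    int current_priority;  /* base plus any boosts currently in effect         */
    int lock_or_fg_boost;  /* portion due to lock-handoff or foreground boost  */
} THREAD_PRIO;

static void decay_at_quantum_end(THREAD_PRIO *t) {
    /* Lock-handoff and foreground separation boosts are removed entirely. */
    t->current_priority -= t->lock_or_fg_boost;
    t->lock_or_fg_boost = 0;

    /* Wait-related boosts (I/O, event, mutex, semaphore) decay one level per
       quantum end, but never below the base priority. */
    if (t->current_priority > t->base_priority)
        t->current_priority--;
}
```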
Scheduling occurs when a thread enters the ready or waiting state, when a
thread terminates, or when an application changes a thread’s processor affinity.
As we have seen throughout the text, a thread could become ready at any
time. If a higher-priority thread becomes ready while a lower-priority thread is
running, the lower-priority thread is preempted immediately. This preemption
gives the higher-priority thread instant access to the CPU, without waiting on
the lower-priority thread’s quantum to complete.
It is the lower-priority thread itself, performing some event that caused it
to operate in the dispatcher, that wakes up the waiting thread and immedi-
ately context-switches to it while placing itself back in the ready state. This
model essentially distributes the scheduling logic throughout dozens of Win-
dows kernel functions and makes each currently running thread behave as
the scheduling entity. In contrast, other operating systems rely on an external
“scheduler thread” triggered periodically based on a timer. The advantage of
the Windows approach is latency reduction, with the cost of added overhead
inside every I/O and other state-changing operation, which causes the current
thread to perform scheduler work.
Windows is not a hard-real-time operating system, however, because it
does not guarantee that any thread, even the highest-priority one, will start to
execute within a particular time limit or have a guaranteed period of execution.
Threads are blocked indefinitely while DPCs and interrupt service routines
(ISRs) are running (as further discussed below), and they can be preempted at
any time by a higher-priority thread or be forced to round-robin with another
thread of equal priority at quantum end.
Traditionally, the Windows scheduler uses sampling to measure CPU uti-
lization by threads. The system timer fires periodically, and the timer inter-
rupt handler takes note of what thread is currently scheduled and whether it
is executing in user or kernel mode when the interrupt occurred. This sam-
pling technique originally came about because either the CPU did not have a
high-resolution clock or the clock was too expensive or unreliable to access
frequently. Although efficient, sampling is inaccurate and leads to anomalies
such as charging the entire duration of the clock (15 milliseconds) to the cur-
rently running thread (or DPC or ISR). Therefore, the system ends up completely
ignoring some number of milliseconds—say, 14.999—that could have been
spent idle, running other threads, running other DPCs and ISRs, or a combi-
nation of all of these operations. Additionally, because quantum is measured
based on clock ticks, this causes the premature round-robin selection of a new
thread, even though the current thread may have run for only a fraction of the
quantum.
Starting with Windows Vista, execution time is also tracked using the
hardware timestamp counter (TSC) included in all processors since the Pen-
tium Pro. Using the TSC results in more accurate accounting of CPU usage (for
applications that use it—note that Task Manager does not) and also causes
the scheduler not to switch out threads before they have run for a full quan-
tum. Additionally, Windows 7 and later versions track, and charge, the TSC
to ISRs and DPCs, resulting in more accurate “Interrupt Time” measurements
as well (again, for tools that use this new measurement). Because all possible
execution time is now accounted for, it is possible to add it to idle time (which
is also tracked using the TSC) and accurately compute the exact number of
CPU cycles out of all possible CPU cycles in a given period (due to the fact
that modern processors have dynamically shifting frequencies), resulting in
cycle-accurate CPU usage measurements. Tools such as Microsoft’s SysInternals
Process Explorer use this mechanism in their user interface.
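User-mode code can read these cycle counts directly. The following minimal C example uses the documented Win32 call QueryThreadCycleTime (available since Windows Vista) to retrieve the cycles charged to the current thread.

```c
#include <windows.h>
#include <stdio.h>

/* Cycle-accurate CPU accounting for the current thread, using the
   TSC-based counters maintained by the kernel since Windows Vista. */
int main(void) {
    ULONG64 cycles = 0;
    if (QueryThreadCycleTime(GetCurrentThread(), &cycles))
        printf("cycles charged to this thread so far: %llu\n",
               (unsigned long long)cycles);
    return 0;
}
```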
All of the dispatcher objects can be accessed from user mode via an open
operation that returns a handle. The user-mode code waits on handles to
synchronize with other threads as well as with the operating system (see
Section 21.7.1).
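A minimal C example of this pattern: a named event (one kind of dispatcher object) is created, yielding a handle, and the thread then blocks on that handle with WaitForSingleObject. The event name is arbitrary and chosen only for illustration.

```c
#include <windows.h>

int main(void) {
    /* Creating (or opening) the dispatcher object returns a handle. */
    HANDLE ev = CreateEventW(NULL, FALSE, FALSE, L"Local\\ExampleEvent");
    if (ev == NULL)
        return 1;

    /* Another thread or process would call SetEvent on the same named
       event; this thread waits (in the kernel's waiting state) for up to
       five seconds. */
    DWORD status = WaitForSingleObject(ev, 5000);
    CloseHandle(ev);
    return (status == WAIT_OBJECT_0) ? 0 : 1;
}
```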
Both hardware and software interrupts are prioritized and are serviced in
priority order. There are 16 interrupt request levels (IRQLs) on all Windows ISAs
except the legacy IA-32, which uses 32. The lowest level, IRQL 0, is called the
PASSIVE LEVEL and is the default level at which all threads execute, whether in
kernel or user mode. The next levels are the software interrupt levels for APCs
and DPCs. Levels 3 to 10 are used to represent hardware interrupts based on
selections made by the PnP manager with the help of the HAL and the PCI/ACPI
bus drivers. Finally, the uppermost levels are reserved for the clock interrupt
(used for quantum management) and IPI delivery. The last level, HIGH LEVEL,
blocks all maskable interrupts and is typically used when crashing the system
in a controlled manner.
The Windows IRQLs are defined in Figure 21.2.
The kernel dispatcher also provides trap handling for exceptions and
interrupts generated by hardware or software. Windows defines several
architecture-independent exceptions, including memory-access violation, integer
or floating-point divide by zero, illegal instruction, data misalignment, privileged
instruction, and debugger breakpoint.
The trap handlers deal with the hardware-level exceptions (called traps) and
call the elaborate exception-handling code performed by the kernel’s exception
dispatcher. The exception dispatcher creates an exception record containing
the reason for the exception and finds an exception handler to deal with it.
When an exception occurs in kernel mode, the exception dispatcher simply
calls a routine to locate the exception handler. If no handler is found, a fatal
system error occurs and the user is left with the infamous “blue screen of death”
that signifies system failure. In Windows 10, this is now a friendlier “sad face
of sorrow” with a QR code, but the blue color remains.
Exception handling is more complex for user-mode processes, because the
Windows error reporting (WER) service sets up an ALPC error port for every
process, on top of the Win32 environment subsystem, which sets up an ALPC
exception port for every process it creates. (For details on ports, see Section
21.3.5.4.) Furthermore, if a process is being debugged, it gets a debugger port.
If a debugger port is registered, the exception handler sends the exception to
the port. If the debugger port is not found or does not handle that exception, the
dispatcher attempts to find an appropriate exception handler. If none exists, it
contacts the default unhandled exception handler, which will notify WER of the
process crash so that a crash dump can be generated and sent to Microsoft. If
there is a handler, but it refuses to handle the exception, the debugger is called
again to catch the error for debugging. If no debugger is running, a message is
sent to the process’s exception port to give the environment subsystem a chance
to react to the exception. Finally, a message is sent to WER through the error
port, in the case where the unhandled exception handler may not have had a
chance to do so, and then the kernel simply terminates the process containing
the thread that caused the exception.
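The user-mode side of this machinery is visible through structured exception handling. The sketch below combines a frame-based __try/__except handler with SetUnhandledExceptionFilter, which installs the process-wide last-chance handler consulted before WER is involved; the deliberately triggered access violation here is caught by the frame-based handler first, so the last-chance filter never runs in this example.

```c
#include <windows.h>
#include <stdio.h>

/* Process-wide last-chance filter, consulted only if no frame-based
   handler claims the exception. */
static LONG WINAPI last_chance(EXCEPTION_POINTERS *info) {
    fprintf(stderr, "unhandled exception 0x%08lx\n",
            info->ExceptionRecord->ExceptionCode);
    return EXCEPTION_EXECUTE_HANDLER;   /* terminate instead of invoking WER */
}

int main(void) {
    SetUnhandledExceptionFilter(last_chance);

    __try {
        volatile int *p = NULL;
        *p = 1;                          /* raises an access-violation exception */
    } __except (GetExceptionCode() == EXCEPTION_ACCESS_VIOLATION
                    ? EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH) {
        puts("access violation handled by the frame-based handler");
    }
    return 0;
}
```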
WER will typically send the information back to Microsoft for further anal-
ysis, unless the user has opted out or is using a local error-reporting server. In
some cases, Microsoft’s automated analysis may be able to recognize the error
immediately and suggest a x or workaround.
The interrupt dispatcher in the kernel handles interrupts by calling either
an interrupt service routine (ISR) supplied by a device driver or a kernel trap-
handler routine. The interrupt is represented by an interrupt object that con-
tains all the information needed to handle the interrupt. Using an interrupt
object makes it easy to associate interrupt-service routines with an interrupt
without having to access the interrupt hardware directly.
Different processor architectures have different types and numbers of inter-
rupts. For portability, the interrupt dispatcher maps the hardware interrupts
into a standard set.
The kernel uses an interrupt-dispatch table to bind each interrupt level
to a service routine. In a multiprocessor computer, Windows keeps a separate
interrupt-dispatch table (IDT) for each processor core, and each processor’s
IRQL can be set independently to mask out interrupts. All interrupts that occur
at a level equal to or less than the IRQL of a processor are blocked until the
IRQL is lowered by a kernel-level thread or by an ISR returning from interrupt
processing. Windows takes advantage of this property and uses software inter-
rupts to deliver APCs and DPCs, to perform system functions such as synchro-
nizing threads with I/O completion, to start thread execution, and to handle
timers.
21.3.5 Executive
The Windows executive provides a set of services that all environment sub-
systems use. To give you a good basic overview, we discuss the following
services here: object manager, virtual memory manager, process manager,
advanced local procedure call facility, I/O manager, cache manager, security
reference monitor, plug-and-play and power managers, registry, and startup.
Note, though, that the Windows executive includes more than two dozen ser-
vices in total.
The executive is organized according to object-oriented design principles.
An object type in Windows is a system-defined data type that has a set of
attributes (data values) and a set of methods (for example, functions or opera-
tions) that help define its behavior. An object is an instance of an object type.
The executive performs its job by using a set of objects whose attributes store
the data and whose methods perform the activities.
The executive component that manages the virtual address space, physical
memory allocation, and paging is the memory manager (MM). The design of
the MM assumes that the underlying hardware supports virtual-to-physical
mapping, a paging mechanism, and transparent cache coherence on multipro-
cessor systems, as well as allowing multiple page-table entries to map to the
same physical page frame. The MM in Windows uses a page-based manage-
ment scheme based on the page sizes supported by hardware (4 KB, 2 MB, and
1 GB). Pages of data allocated to a process that are not in physical memory
are either stored in the paging files on secondary storage or mapped directly
to a regular file on a local or remote file system. A page can also be marked
zero-fill-on-demand, which initializes the page with zeros before it is mapped,
thus erasing the previous contents.
On 32-bit processors such as IA-32 and ARM, each process has a 4-GB virtual
address space. By default, the upper 2 GB are mostly identical for all processes
and are used by Windows in kernel mode to access the operating-system code
and data structures. For 64-bit architectures such as the AMD64 architecture,
Windows provides a 256-TB per-process virtual address space, divided into two
128-TB regions for user mode and kernel mode. (These restrictions are based on
hardware limitations that will soon be lifted. Intel has announced that its future
processors will support up to 128 PB of virtual address space, out of the 16 EB
theoretically available.)
The availability of the kernel’s code in each process’s address space is
important, and commonly found in many other operating systems as well.
Generally, virtual memory is used to map the kernel code into the address
space of each process. Then, when, say, a system call is executed or an interrupt
is received, the context switch to allow the current core to run that code is
lighter-weight than it would otherwise be without this mapping. Specifically,
no memory-management registers need to be saved and restored, and the cache
does not get invalidated. The net result is much faster movement between user
and kernel code, compared to older architectures that keep kernel memory
separate and not available within the process address space.
The Windows MM uses a two-step process to allocate virtual memory. The
first step reserves one or more pages of virtual addresses in the process’s virtual
address space. The second step commits the allocation by assigning virtual
memory space (physical memory or space in the paging les). Windows limits
the amount of virtual memory space a process consumes by enforcing a quota
on committed memory. A process de-commits memory that it is no longer using
to free up virtual memory space for use by other processes. The APIs used
to reserve virtual addresses and commit virtual memory take a handle on a
process object as a parameter. This allows one process to control the virtual
memory of another.
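The two steps map directly onto the Win32 VirtualAlloc API, as in the following minimal sketch: the first call only reserves address space, and the second commits a page within the reservation.

```c
#include <windows.h>

int main(void) {
    SIZE_T size = 64 * 1024;

    /* Step 1: reserve address space only; no storage is charged yet. */
    void *base = VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
    if (base == NULL)
        return 1;

    /* Step 2: commit the first page, charging it against the process's
       commit quota and making it usable. */
    void *page = VirtualAlloc(base, 4096, MEM_COMMIT, PAGE_READWRITE);
    if (page != NULL)
        ((char *)page)[0] = 42;

    VirtualFree(base, 4096, MEM_DECOMMIT);  /* return the commit charge   */
    VirtualFree(base, 0, MEM_RELEASE);      /* release the reservation    */
    return 0;
}
```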
Windows implements shared memory by defining a section object. After
getting a handle to a section object, a process maps the memory of the section to
a range of addresses, called a view. A process can establish a view of the entire
section or only the portion it needs. Windows allows sections to be mapped
not just into the current process but into any process for which the caller has a
handle.
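In the Win32 API, section objects appear as file mappings. The sketch below creates a pagefile-backed section and maps a 4-KB view of it; the section name is illustrative only.

```c
#include <windows.h>
#include <string.h>

int main(void) {
    /* A section object backed by the paging file rather than a regular file. */
    HANDLE section = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL,
                                        PAGE_READWRITE, 0, 64 * 1024,
                                        L"Local\\ExampleSection");
    if (section == NULL)
        return 1;

    /* Map a view of the first 4 KB only; another process opening the same
       named section would see the same bytes. */
    void *view = MapViewOfFile(section, FILE_MAP_ALL_ACCESS, 0, 0, 4096);
    if (view != NULL) {
        memcpy(view, "shared", 7);
        UnmapViewOfFile(view);
    }
    CloseHandle(section);
    return 0;
}
```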
Sections can be used in many ways. A section can be backed by secondary
storage either in the system-paging file or in a regular file (a memory-mapped
file). A section can be based, meaning that it appears at the same virtual address
for all processes attempting to access it. Sections can also represent physical
memory, allowing a 32-bit process to access more physical memory than can
fit in its virtual address space. Finally, the memory protection of pages in the
section can be set to read only, read–write, read–write–execute, execute only,
no access, or copy-on-write.
Let’s look more closely at the last two of these protection settings:
[Figure: page-table layout, showing the top-level page-directory entries, page tables 0 through 511 with 512 entries each, and 4-KB pages.]
only 8 MB. Additionally, the MM allocates pages of PDEs and PTEs as needed and
moves page-table pages to secondary storage when not in use, so that the actual
physical memory overhead of the paging structures for each process is usually
approximately 2 KB. The page-table pages are faulted back into memory when
referenced.
We next consider how virtual addresses are translated into physical
addresses on IA-32-compatible processors. A 2-bit value can represent the
values 0, 1, 2, 3. A 9-bit value can represent values from 0 to 511; a 12-bit value,
values from 0 to 4,095. Thus, a 12-bit value can select any byte within a 4-KB
page of memory. A 9-bit value can represent any of the 512 PDEs or PTEs in a
page directory or PTE-table page. As shown in Figure 21.4, translating a virtual
address pointer to a byte address in physical memory involves breaking the
32-bit pointer into four values, starting from the most significant bits (a short sketch following this list makes the split concrete):
• Two bits are used to index into the four PDEs at the top level of the page
table. The selected PDE contains the physical page number of one of the
four page-directory pages, each of which maps 1 GB of the address space.
• Nine bits are used to select another PDE, this time from a second-level page
directory. That page directory contains the physical page numbers of up to
512 PTE-table pages; the selected PDE identifies one of them.
• Nine bits are used to select one of 512 PTEs from the selected PTE-table
page. The selected PTE will contain the physical page number for the byte
we are accessing.
• Twelve bits are used as the byte offset into the page. The physical address
of the byte we are accessing is constructed by appending the lowest 12 bits
of the virtual address to the end of the physical page number we found in
the selected PTE.
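A small C program can make the bit arithmetic concrete. The example address below is arbitrary; the shifts and masks follow the 2 + 9 + 9 + 12 split described in the list above.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t va = 0xBF701234u;                 /* an arbitrary example address      */

    uint32_t top_pde   = (va >> 30) & 0x3;     /* 2 bits: one of 4 top-level PDEs   */
    uint32_t pde_index = (va >> 21) & 0x1FF;   /* 9 bits: entry in a page directory */
    uint32_t pte_index = (va >> 12) & 0x1FF;   /* 9 bits: entry in a PTE-table page */
    uint32_t offset    =  va        & 0xFFF;   /* 12 bits: byte offset in the page  */

    printf("top=%u pde=%u pte=%u offset=0x%03x\n",
           top_pde, pde_index, pte_index, offset);
    return 0;
}
```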
Note that the number of bits in a physical address may be different from
the number of bits in a virtual address. For example, when PAE is enabled
(the only mode supported by Windows 8 and later versions), the IA-32 MMU
is extended to the larger 64-bit PTE size, while the hardware supports 36-bit
physical addresses, granting access to up to 64 GB of RAM, even though a
single process can only map an address space up to 4 GB in size. Today, on
the AMD64 architecture, server versions of Windows support very, very large
physical addresses—more than we can possibly use or even buy (24 TB as of
the latest release). (Of course, at one time 4 GB seemed optimistically large
for physical memory.)
To improve performance, the MM maps the page-directory and PTE-table
pages into the same contiguous region of virtual addresses in every process.
This self-map allows the MM to use the same pointer to access the current PDE
or PTE corresponding to a particular virtual address no matter what process is
running. The self-map for the IA-32 takes a contiguous 8-MB region of kernel
virtual address space; the AMD64 self-map occupies 512 GB. Although the self-
map occupies significant address space, it does not require any additional
virtual memory pages. It also allows the page table’s pages to be automatically
paged in and out of physical memory.
In the creation of a self-map, one of the PDEs in the top-level page directory
refers to the page-directory page itself, forming a “loop” in the page-table
translations. The virtual pages are accessed if the loop is not taken, the PTE-table
pages are accessed if the loop is taken once, the lowest-level page-directory
pages are accessed if the loop is taken twice, and so forth.
The additional levels of page directories used for 64-bit virtual memory are
translated in the same way except that the virtual address pointer is broken up
into even more values. For the AMD64, Windows uses four full levels, each of
which maps 512 pages, or 9 + 9 + 9 + 9 + 12 = 48 bits of virtual address.
To avoid the overhead of translating every virtual address by looking up
the PDE and PTE, processors use translation look-aside buffer (TLB) hardware,
which contains an associative memory cache for mapping virtual pages to PTEs.
The TLB is part of the memory-management unit (MMU) within each processor.
The MMU needs to “walk” (navigate the data structures of) the page table in
memory only when a needed translation is missing from the TLB.
The PDEs and PTEs contain more than just physical page numbers. They
also have bits reserved for operating-system use and bits that control how the
hardware uses memory, such as whether hardware caching should be used for
each page. In addition, the entries specify what kinds of access are allowed for
both user and kernel modes.
A PDE can also be marked to say that it should function as a PTE rather than
a PDE. On IA-32, the first 11 bits of the virtual address pointer select a PDE in
the first two levels of translation. If the selected PDE is marked to act as a PTE,
then the remaining 21 bits of the pointer are used as the offset of the byte. This
results in a 2-MB size for the page. Mixing and matching 4-KB and 2-MB page
sizes within the page table is easy for the operating system and can significantly
improve the performance of some programs. The improvement results from
reducing how often the MMU needs to reload entries in the TLB, since one PDE
mapping 2 MB replaces 512 PTEs, each mapping 4 KB. Newer AMD64 hardware
even supports 1-GB pages, which operate in a similar fashion.
Managing physical memory so that 2-MB pages are available when needed
is difficult, as they may continually be broken up into 4-KB pages, causing
external fragmentation of memory. Also, the large pages can result in very
significant internal fragmentation. Because of these problems, it is typically
only Windows itself, along with large server applications, that use large pages
to improve the performance of the TLB. They are better suited to do so because
operating-system and server applications start running when the system boots,
before memory has become fragmented.
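For completeness, a user-mode process that holds the required privilege can request large pages explicitly. The following sketch uses the documented MEM_LARGE_PAGES flag and GetLargePageMinimum; the call fails if the SeLockMemoryPrivilege is missing or physical memory is already too fragmented to supply contiguous large pages.

```c
#include <windows.h>

int main(void) {
    SIZE_T large = GetLargePageMinimum();      /* 2 MB on typical x64 systems */
    if (large == 0)
        return 1;                              /* large pages not supported   */

    void *p = VirtualAlloc(NULL, large,
                           MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                           PAGE_READWRITE);
    if (p == NULL)
        return 1;                /* privilege missing or memory too fragmented */

    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}
```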
Windows manages physical memory by associating each physical page
with one of seven states: free, zeroed, modified, standby, bad, transition, or
valid.
• A free page is an available page that has stale or uninitialized content.
• A zeroed page is a free page that has been zeroed out and is ready for
immediate use to satisfy zero-on-demand faults.
• A modified page has been written by a process and must be sent to sec-
ondary storage before it is usable by another process.
• A standby page is a copy of information already stored on secondary
storage. Standby pages may be pages that were not modified, modified
pages that have already been written to secondary storage, or pages that
were prefetched because they were expected to be used soon.
• A bad page is unusable because a hardware error has been detected.
• A transition page is on its way from secondary storage to a page frame
allocated in physical memory.
• A valid page either is part of the working set of one or more processes and
is contained within these processes’ page tables, or is being used by the
system directly (such as to store the nonpaged pool).
While valid pages are contained in processes’ page tables, pages in other
states are kept in separate lists according to state type. Additionally, to improve
performance and protect against aggressive recycling of the standby pages,
Windows Vista and later versions implement eight prioritized standby lists.
The lists are constructed by linking the corresponding entries in the page-frame
number (PFN) database, which includes an entry for each physical memory
page. The PFN entries also include information such as reference counts, locks,
and NUMA information. Note that the PFN database represents pages of phys-
ical memory, whereas the PTEs represent pages of virtual memory.
When the valid bit in a PTE is zero, hardware ignores all the other bits,
and the MM can define them for its own use. Invalid pages can have a number
of states represented by bits in the PTE. Page-file pages that have never been
faulted in are marked zero-on-demand. Pages mapped through section objects
encode a pointer to the appropriate section object. PTEs for pages that have
been written to the page file contain enough information to locate the page on
secondary storage, and so forth. The structure of the page-file PTE is shown in
Figure 21.5. The T, P, and V bits are all zero for this type of PTE. The PTE includes
5 bits for page protection, 32 bits for page-file offset, and 4 bits to select the
paging file. There are also 20 bits reserved for additional bookkeeping.
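A C bit-field struct conveys the layout compactly. The field names and their ordering below are illustrative only; the text specifies the widths, but not the exact bit positions used by the MM.

```c
#include <stdint.h>

/* Sketch of the 64-bit page-file PTE described in the text. The valid (V),
   prototype (P), and transition (T) bits are zero for this kind of PTE. */
typedef struct _PAGEFILE_PTE {
    uint64_t valid      : 1;   /* V: 0 for a page-file PTE                  */
    uint64_t prototype  : 1;   /* P: 0                                      */
    uint64_t transition : 1;   /* T: 0                                      */
    uint64_t protection : 5;   /* page protection                           */
    uint64_t pagefile   : 4;   /* which of the paging files                 */
    uint64_t reserved   : 20;  /* additional bookkeeping                    */
    uint64_t offset     : 32;  /* offset of the page within the paging file */
} PAGEFILE_PTE;
```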
Windows uses a per-working-set, least recently used (LRU) replacement
policy to take pages from processes as appropriate. When a process is started, it
is assigned a default minimum working-set size, at which point the MM starts to
track the age of the pages in each working set. The working set of each process
is allowed to grow until the amount of remaining physical memory starts to
run low. Eventually, when the available memory runs critically low, the MM
trims the working set to remove older pages.
The age of a page depends not on how long it has been in memory but on
when it was last referenced. The MM makes this determination by periodically
passing through the working set of each process and incrementing the age for
pages that have not been marked in the PTE as referenced since the last pass.
When it becomes necessary to trim the working sets, the MM uses heuristics to
decide how much to trim from each process and then removes the oldest pages
first.
A process can have its working set trimmed even when plenty of memory is
available, if it was given a hard limit on how much physical memory it could
use. In Windows 7 and later versions, the MM also trims processes that are
growing rapidly, even if memory is plentiful. This policy change significantly
improved the responsiveness of the system for other processes.
Windows tracks working sets not only for user-mode processes but also
for various kernel-mode regions, which include the file cache and the pageable
kernel heap. Pageable kernel and driver code and data have their own working
sets, as does each TS session. The distinct working sets allow the MM to use
different policies to trim the different categories of kernel memory.
The MM does not fault in only the page immediately needed. Research
shows that the memory referencing of a thread tends to have a locality prop-
erty. That is, when a page is used, it is likely that adjacent pages will be
referenced in the near future. (Think of iterating over an array or fetching
sequential instructions that form the executable code for a thread.) Because of
locality, when the MM faults in a page, it also faults in a few adjacent pages.
This prefetching tends to reduce the total number of page faults and allows
reads to be clustered to improve I/O performance.
In addition to managing committed memory, the MM manages each pro-
cess’s reserved memory, or virtual address space. Each process has an asso-
ciated tree that describes the ranges of virtual addresses in use and what the
uses are. This allows the MM to fault in page-table pages as needed. If the PTE
for a faulting address is uninitialized, the MM searches for the address in the
process’s tree of virtual address descriptors (VADs) and uses this information
to fill in the PTE and retrieve the page. In some cases, a PTE-table page may not
exist; such a page must be transparently allocated and initialized by the MM. In
other cases, the page may be shared as part of a section object, and the VAD will
contain a pointer to that section object. The section object contains information
on how to find the shared virtual page so that the PTE can be initialized to point
to it directly.
Starting with Vista, the Windows MM includes a component called Super-
Fetch. This component combines a user-mode service with specialized kernel-
mode code, including a file-system filter, to monitor all paging operations on
the system. Each second, the service queries a trace of all such operations and
uses a variety of agents to monitor application launches, fast user switches,
standby/sleep/hibernate operations, and more as a means of understanding
the system’s usage patterns. With this information, it builds a statistical model,
using Markov chains, of which applications the user is likely to launch when, in
combination with what other applications, and what portions of these appli-
cations will be used. For example, SuperFetch can train itself to understand
that the user launches Microsoft Outlook in the mornings mostly to read e-
mail but composes e-mails later, after lunch. It can also understand that once
Outlook is in the background, Visual Studio is likely to be launched next, and
that the text editor is going to be in high demand, with the compiler demanded
a little less frequently, the linker even less frequently, and the documentation
code hardly ever. With this data, SuperFetch will prepopulate the standby list,
making low-priority I/O reads from secondary storage at idle times to load
what it thinks the user is likely to do next (or another user, if it knows a fast
user switch is likely). Additionally, by using the eight prioritized standby lists
that Windows offers, each such prefetched page can be cached at a level that
matches the statistical likelihood that it will be needed. Thus, unlikely-to-be-
demanded pages can cheaply and quickly be evicted by an unexpected need
for physical memory, while likely-to-be-demanded-soon pages can be kept in
place for longer. Indeed, SuperFetch may even force the system to trim working
sets of other processes before touching such cached pages.
SuperFetch’s monitoring does create considerable system overhead. On
mechanical (rotational) drives, which have seek times in the milliseconds, this
cost is balanced by the bene t of avoiding latencies and multisecond delays in
application launch times. On server systems, however, such monitoring is not
beneficial, given the random multiuser workloads and the fact that throughput
is more important than latency. Further, the combined latency improvements
and bandwidth on systems with fast, efficient nonvolatile memory, such as
SSDs, make the monitoring less beneficial for those systems as well. In such
situations, SuperFetch disables itself, freeing up a few spare CPU cycles.
Windows 10 brings another large improvement to the MM by introducing
a component called the compression store manager. This component creates
a compressed store of pages in the working set of the memory compression
process, which is a type of system process. When shareable pages go on the
standby list and available memory is low (or certain other internal algorithm
decisions are made), pages on the list will be compressed instead of evicted.
This can also happen to modified pages targeted for eviction to secondary
storage—both by reducing memory pressure, perhaps avoiding the write in
the first place, and by causing the written pages to be compressed, thus con-
suming less page file space and taking less I/O to page out. On today’s fast
multiprocessor systems, often with built-in hardware compression algorithms,
the small CPU penalty is highly preferable to the potential secondary storage
I/O cost.
The Windows process manager provides services for creating, deleting, inter-
rogating, and managing processes, threads, and jobs. It has no knowledge
about parent–child relationships or process hierarchies, although it can group
processes in jobs, and the latter can have hierarchies that must then be main-
tained. The process manager is also not involved in the scheduling of threads,
other than setting the priorities and affinities of the threads in their owner
processes. Additionally, through jobs, the process manager can effect various
changes in scheduling attributes (such as throttling ratios and quantum val-
ues) on threads. Thread scheduling proper, however, takes place in the kernel
dispatcher.
Each process contains one or more threads. Processes themselves can be
collected into larger units called job objects. The original use of job objects
was to place limits on CPU usage, working-set size, and processor affinities
that control multiple processes at once. Job objects were thus used to man-
age large data-center machines. In Windows XP and later versions, job objects
were extended to provide security-related features, and a number of third-
party applications such as Google Chrome began using jobs for this purpose. In
Windows 8, a massive architectural change allowed jobs to in uence schedul-
ing through generic CPU throttling as well as per-user-session-aware fairness
throttling/balancing. In Windows 10, throttling support was extended to sec-
ondary storage I/O and network I/O as well. Additionally, Windows 8 allowed
job objects to nest, creating hierarchies of limits, ratios, and quotas that the
system must accurately compute. Additional security and power management
features were given to job objects as well.
As a result, all Windows Store applications and all UWP application
processes run in jobs. The DAM, introduced earlier, implements Connected
Standby support using jobs. Finally, Windows 10’s support for Docker
Containers, a key part of its cloud offerings, uses job objects, which it calls
silos. Thus, jobs have gone from being an esoteric data-center resource
management feature to a core mechanism of the process manager for multiple
features.
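Job-based throttling is exposed through documented Win32 calls. The sketch below caps a process at roughly 20 percent of CPU time using JOBOBJECT_CPU_RATE_CONTROL_INFORMATION (available since Windows 8); the 20 percent figure is an arbitrary example.

```c
#include <windows.h>

int main(void) {
    HANDLE job = CreateJobObjectW(NULL, NULL);
    if (job == NULL)
        return 1;

    JOBOBJECT_CPU_RATE_CONTROL_INFORMATION rate = {0};
    rate.ControlFlags = JOB_OBJECT_CPU_RATE_CONTROL_ENABLE |
                        JOB_OBJECT_CPU_RATE_CONTROL_HARD_CAP;
    rate.CpuRate = 2000;   /* hundredths of a percent: 2000 => 20% */

    SetInformationJobObject(job, JobObjectCpuRateControlInformation,
                            &rate, sizeof(rate));

    /* The current process is now subject to the cap; child processes it
       creates are placed in the job as well unless they break away. */
    AssignProcessToJobObject(job, GetCurrentProcess());
    return 0;
}
```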
Due to Windows’s layered architecture and the presence of environment
subsystems, process creation is quite complex. In the Win32 environment
under Windows 10, process creation proceeds through several steps shared
between the kernel, the executive, and the Win32 subsystem. Note that the
launching of UWP “Modern” Windows Store applications (which are called
packaged applications, or “AppX”) is significantly more complex and involves
factors outside the scope of this discussion.
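From the application's point of view, this complexity sits behind a single Win32 call. The following minimal example launches Notepad with CreateProcessW and waits for it to exit; the child program is arbitrary.

```c
#include <windows.h>

int main(void) {
    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi;
    wchar_t cmd[] = L"notepad.exe";   /* CreateProcess may modify this buffer */

    if (!CreateProcessW(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi))
        return 1;

    WaitForSingleObject(pi.hProcess, INFINITE);  /* wait for the child to exit */
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}
```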
The Windows APIs for manipulating virtual memory and threads and for
duplicating handles take a process handle, so their subsystem and other ser-
vices, when notified of process creation, can perform operations on behalf of
the new process without having to execute directly in the new process’s con-
text. Windows also supports a UNIX fork() style of process creation. A number
of features—including process reflection, which is used by the Windows error
reporting (WER) infrastructure during process crashes, as well as the Windows
subsystem for Linux’s implementation of the Linux fork() API —depend on
this capability.
The debugger support in the process manager includes the APIs to suspend
and resume threads and to create threads that begin in suspended mode.
There are also process-manager APIs that get and set a thread’s register context
and access another process’s virtual memory. Threads can be created in the
current process; they can also be injected into another process. The debugger
makes use of thread injection to execute code within a process being debugged.
Unfortunately, the ability to allocate, manipulate, and inject both memory and
threads across processes is often misused by malicious programs.
While running in the executive, a thread can temporarily attach to a dif-
ferent process. Thread attach is used by kernel worker threads that need to
execute in the context of the process originating a work request. For example,
the MM might use thread attach when it needs access to a process’s working set
or page tables, and the I/O manager might use it in updating the status variable
in a process for asynchronous I/O operations.
Like many other modern operating systems, Windows uses a client –server
model throughout, primarily as a layering mechanism, which allows putting
common functionality into a “service” (the equivalent of a daemon in UNIX
terms), as well as splitting out content-parsing code (such as a PDF reader or
Web browser) from system-action-capable code (such as the Web browser’s
capability to save a file on secondary storage or the PDF reader’s ability to
print out a document). For example, on a recent Windows 10 operating sys-
tem, opening the New York Times website with the Microsoft Edge browser
will likely result in 12 to 16 different processes in a complex organization of
“broker,” “renderer/parser,” “JITTer,” services, and clients.
The most basic such “server” on a Windows computer is the Win32 envi-
ronment subsystem, which is the server that implements the operating-system
personality of the Win32 API inherited from the Windows 95/98 days. Many
other services, such as user authentication, network facilities, printer spooling,
Web services, network le systems, and plug-and-play, are also implemented
using this model. To reduce the memory footprint, multiple services are often
collected into a few processes running the svchost.exe program. Each service
is loaded as a dynamic-link library (DLL), which implements the service by rely-
ing on user-mode thread-pool facilities to share threads and wait for messages
(see Section 21.3.5.3). Unfortunately, this pooling originally resulted in poor
user experience in troubleshooting and debugging runaway CPU usage and
memory leaks, and it weakened the overall security of each service. Therefore,
in recent versions of Windows 10, if the system has over 2 GB of RAM, each DLL
service runs in its own individual svchost.exe process.
In Windows, the recommended paradigm for implementing client–server
computing is to use RPCs to communicate requests, because of their inher-
ent security, serialization services, and extensibility features. The Win32 API
supports the Microsoft standard of the DCE-RPC protocol, called MS-RPC, as
described in Section 21.6.2.7.
RPC uses multiple transports (for example, named pipes and TCP/IP) that
can be used to implement RPCs between systems. When an RPC occurs only
between a client and a server on the local system, ALPC can be used as the
transport. Furthermore, because RPC is heavyweight and has multiple system-
level dependencies (including the Win32 environment subsystem itself),
many native Windows services, as well as the kernel, directly use ALPC, which
is not available (nor suitable) for third-party programmers.
ALPC is a message-passing mechanism similar to UNIX domain sockets
and Mach IPC. The server process publishes a globally visible connection-port
object. When a client wants services from the server, it opens a handle to the
server’s connection-port object and sends a connection request to the port. If
the server accepts the connection, then ALPC creates a pair of communication-
port objects, providing the client’s connect API with its handle to the pair, and
then providing the server’s accept API with the other handle to the pair.
At this point, messages can be sent across communication ports as either
datagrams, which behave like UDP and require no reply, or requests, which
must receive a reply. The client and server can then use either synchronous
messaging, in which one side is always blocking (waiting for a request or
expecting a reply), or asynchronous messaging, in which the thread-pool
mechanism can be used to perform work whenever a request or reply is
received, without the need for a thread to block for a message. For servers
located in kernel mode, communication ports also support a callback mech-
anism, which allows an immediate switch to the kernel side (KT) of the user-
mode thread (UT), immediately executing the server’s handler routine.
When an ALPC message is sent, one of two message-passing techniques can
be chosen. The first, suitable for small messages, uses the port’s message queue
as intermediate storage and copies the message from one process to the other.
The second, used for larger messages, maps a shared section object into both
the sender’s and the receiver’s address spaces so that the data need not be copied.
In many operating systems, caching is done by the block device system, usually
at the physical/block level. Instead, Windows provides a centralized caching
facility that operates at the logical/virtual file level. The cache manager works
closely with the MM to provide cache services for all components under the
control of the I/O manager. This means that the cache can operate on anything
from remote files on a network share to logical files on a custom file system. The
size of the cache changes dynamically according to how much free memory
is available in the system; it can grow as large as 2 TB on a 64-bit system.
The cache manager maintains a private working set rather than sharing the
system process’s working set, which allows trimming to page out cached files
more effectively. To build the cache, the cache manager memory-maps files into
kernel memory and then uses special interfaces to the MM to fault pages into
or trim them from this private working set, which lets it take advantage of
additional caching facilities provided by the memory manager.
The cache is divided into blocks of 256 KB. Each cache block can hold a
view (that is, a memory-mapped region) of a file. Each cache block is described
by a virtual address control block (VACB) that stores the virtual address and
file offset for the view, as well as the number of processes using the view.
The VACBs reside in arrays maintained by the cache manager, and there are
arrays for critical as well as low-priority cached data to improve performance
in situations of memory pressure.
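A rough C sketch of such a control block might look like the following. The field names are hypothetical and the real structure contains more bookkeeping; the point is simply the (virtual address, file offset, active count) triple described above.

```c
#include <stdint.h>

/* Illustrative sketch of a virtual address control block: one per 256-KB
   cache slot. Not the actual kernel definition. */
typedef struct _VACB_SKETCH {
    void     *base_address;   /* where the 256-KB view is mapped in kernel space */
    uint64_t  file_offset;    /* offset of the view within the cached file       */
    uint32_t  active_count;   /* number of active users of this view             */
} VACB_SKETCH;
```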
When the I/O manager receives a file’s user-level read request, the I/O
manager sends an IRP to the I/O stack for the volume on which the file resides.
For files that are marked as cacheable, the file system calls the cache manager
to look up the requested data in its cached file views. The cache manager
calculates which entry of that file’s VACB index array corresponds to the byte
offset of the request. The entry either points to the view in the cache or is
invalid. If it is invalid, the cache manager allocates a cache block (and the
corresponding entry in the VACB array) and maps the view into the cache block.
The cache manager then attempts to copy data from the mapped file to the
caller’s buffer. If the copy succeeds, the operation is completed.
If the copy fails, it does so because of a page fault, which causes the MM
to send a noncached read request to the I/O manager. The I/O manager sends
another request down the driver stack, this time requesting a paging operation,
which bypasses the cache manager and reads the data from the file directly into
the page allocated for the cache manager. Upon completion, the VACB is set to
point at the page. The data, now in the cache, are copied to the caller’s buffer,
and the original I/O request is completed. Figure 21.6 shows an overview of
these operations.
When possible, for synchronous operations on cached files, I/O is handled
by the fast I/O mechanism. This mechanism parallels the normal IRP-based I/O
[Figure 21.6: File I/O using the cache manager. A process's cached I/O flows through the cache manager and the file system; on a page fault, the VM manager reads the data through the disk driver.]