
directly through Microsoft, avoiding the use of installation DVDs or having the user scour the manufacturer's website.


Beyond peripherals, Windows Server also supports dynamic hot-add and
hot-replace of CPUs and RAM, as well as dynamic hot-remove of RAM. These
features allow the components to be added, replaced, or removed without
system interruption. While of limited use in physical servers, this technology is key to dynamic scalability in cloud computing, especially in Infrastructure-as-a-Service (IaaS) environments. In these scenarios, a physical machine can be configured to support a limited number of its
processors based on a service fee, which can then be dynamically upgraded,
without requiring a reboot, through a compatible hypervisor such as Hyper-V
and a simple slider in the owner’s user interface.

21.3 System Components


The architecture of Windows is a layered system of modules operating at specific privilege levels, as shown earlier in Figure 21.1. By default, these privilege levels are first implemented by the processor (providing a "vertical" privilege isolation between user mode and kernel mode). Windows 10 can also use its Hyper-V hypervisor to provide an orthogonal (logically independent) security model through virtual trust levels (VTLs). When users enable this feature, the system operates in a Virtual Secure Mode (VSM). In this mode, the layered privileged system now has two implementations, one called the Normal World, or VTL 0, and one called the Secure World, or VTL 1. Within each of these worlds, we find a user mode and a kernel mode.
Let’s look at this structure in somewhat more detail.

• In the Normal World, in kernel mode are (1) the HAL and its extensions and
(2) the kernel and its executive, which load drivers and DLL dependencies.
In user mode are a collection of system processes, the Win32 environment
subsystem, and various services.
• In the Secure World, if VSM is enabled, are a secure kernel and executive (within which a secure micro-HAL is embedded). A collection of isolated Trustlets (discussed later) run in secure user mode.
• Finally, the bottommost layer in Secure World runs in a special processor
mode (called, for example, VMX Root Mode on Intel processors), which
contains the Hyper-V hypervisor component, which uses hardware virtu-
alization to construct the Normal-to-Secure-World boundary. (The user-to-
kernel boundary is provided by the CPU natively.)

One of the chief advantages of this type of architecture is that interactions between modules, and between privilege levels, are kept simple, and that isolation needs and security needs are not necessarily conflated through privilege.
For example, a secure, protected component that stores passwords can itself be
unprivileged. In the past, operating-system designers chose to meet isolation
needs by making the secure component highly privileged, but this results in a
net loss for the security of the system when this component is compromised.
The remainder of this section describes these layers and subsystems.
21.3.1 Hyper-V Hypervisor
The hypervisor is the first component initialized on a system with VSM enabled,
which happens as soon as the user enables the Hyper-V component. It is
used both to provide hardware virtualization features for running separate
virtual machines and to provide the VTL boundary and related access to the
hardware’s Second Level Address Translation (SLAT) functionality (discussed
shortly). The hypervisor uses a CPU-specific virtualization extension, such as AMD's Pacifica (SVM) or Intel's Vanderpool (VT-x), to intercept any interrupt, exception, memory access, instruction, port, or register access that it chooses and deny, modify, or redirect the effect, source, or destination of the operation. It also provides a hypercall interface, which enables it to communicate with
the kernel in VTL 0, the secure kernel in VTL 1, and all other running virtual
machine kernels and secure kernels.

21.3.2 Secure Kernel


The secure kernel acts as the kernel-mode environment of isolated (VTL 1) user-
mode Trustlet applications (applications that implement parts of the Windows
security model). It provides the same system-call interface that the kernel does,
so that all interrupts, exceptions, and attempts to enter kernel mode from a
VTL 1 Trustlet result in entering the secure kernel instead. However, the secure
kernel is not involved in context switching, thread scheduling, memory management, interprocess communication, or any of the other standard kernel
tasks. Additionally, no kernel-mode drivers are present in VTL 1. In an attempt
to reduce the attack surface of the Secure World, these complex implementa-
tions remain the responsibility of Normal World components. Thus, the secure
kernel acts as a type of “proxy kernel” that hands off the management of its
resources, paging, scheduling, and more, to the regular kernel services in VTL
0. This does make the Secure World vulnerable to denial-of-service attacks, but
that is a reasonable tradeoff of the security design, which values data privacy
and integrity over service guarantees.
In addition to forwarding system calls, the secure kernel’s other responsi-
bility is providing access to the hardware secrets, the trusted platform module
(TPM), and code integrity policies that were captured at boot. With this infor-
mation, Trustlets can encrypt and decrypt data with keys that the Normal
World cannot obtain and can sign and attest (co-sign by Microsoft) reports
with integrity tokens that cannot be faked or replicated outside of the Secure
World. Using a CPU feature called Second Level Address Translation (SLAT),
the secure kernel also provides the ability to allocate virtual memory in such a
way that the physical pages backing it cannot be seen at all from the Normal
World. Windows 10 uses these capabilities to provide additional protection of
enterprise credentials through a feature called Credential Guard.
Furthermore, when Device Guard (mentioned earlier) is activated, it takes
advantage of VTL 1 capabilities by moving all digital signature checking into
the secure kernel. This means that even if attacked through a software vulner-
ability, the normal kernel cannot be forced to load unsigned drivers, as the VTL
1 boundary would have to be breached for that to occur. On a Device Guard–
protected system, for a kernel-mode page in VTL 0 to be authorized for execu-
tion, the kernel must first ask permission from the secure kernel, and only the
secure kernel can grant this page executable access. More secure deployments
(such as in embedded or high-risk systems) can require this level of signature
validation for user-mode pages as well.
Additionally, work is being done to allow special classes of hardware
devices, such as USB webcams and smartcard readers, to be directly managed
by user-mode drivers running in VTL 1 (using the UMDF framework described
later), allowing biometric data to be securely captured in VTL 1 without any
component in the Normal World being able to intercept it. Currently, the only
Trustlets allowed are those that provide the Microsoft-signed implementation
of Credential Guard and virtual-TPM support. Newer versions of Windows 10 will also support software enclaves, which will allow validly signed (but not necessarily Microsoft-signed) third-party code wishing to perform its own
cryptographic calculations to do so. Software enclaves will allow regular VTL
0 applications to “call into” an enclave, which will run executable code on top
of input data and return presumably encrypted output data.
For more information on the secure kernel, see https://blogs.technet.microsoft.com/ash/2016/03/02/windows-10-device-guard-and-credential-guard-demystified/.

21.3.3 Hardware-Abstraction Layer


The HAL is the layer of software that hides hardware chipset differences
from upper levels of the operating system. The HAL exports a virtual hard-
ware interface that is used by the kernel dispatcher, the executive, and the
device drivers. Only a single version of each device driver is required for
each CPU architecture, no matter what support chips might be present. Device
drivers map devices and access them directly, but the chipset-specific details of mapping memory, configuring I/O buses, setting up DMA, and coping with motherboard-specific facilities are all provided by the HAL interfaces.

21.3.4 Kernel
The kernel layer of Windows has the following main responsibilities: thread
scheduling and context switching, low-level processor synchronization, inter-
rupt and exception handling, and switching between user mode and kernel
mode through the system-call interface. Additionally, the kernel layer imple-
ments the initial code that takes over from the boot loader, formalizing the tran-
sition into the Windows operating system. It also implements the initial code
that safely crashes the kernel in case of an unexpected exception, assertion, or
other inconsistency. The kernel is mostly implemented in the C language, using
assembly language only when absolutely necessary to interface with the lowest
level of the hardware architecture and when direct register access is needed.

The dispatcher provides the foundation for the executive and the subsystems.
Most of the dispatcher is never paged out of memory, and its execution is
never preempted. Its main responsibilities are thread scheduling and context
switching, implementation of synchronization primitives, timer management,
software interrupts (asynchronous and deferred procedure calls), interprocessor interrupts (IPIs), and exception dispatching. It also manages hardware and software interrupt prioritization under the system of interrupt request levels (IRQLs).

What the programmer thinks of as a thread in traditional Windows is actually a thread with two modes of execution: a user-mode thread (UT) and a kernel-mode thread (KT). The thread has two stacks, one for UT execution and the other for KT. A UT requests a system service by executing an instruction that causes a trap to kernel mode. The kernel layer runs a trap handler that switches from the UT stack to its KT counterpart and changes the CPU mode to kernel. When a thread in KT mode has completed its kernel execution and is ready to switch back to the corresponding UT, the kernel layer is called to make the switch to the UT, which continues its execution in user mode. The KT switch also happens when an interrupt occurs.
Windows 7 modifies the behavior of the kernel layer to support user-
mode scheduling of the UTs. User-mode schedulers in Windows 7 support
cooperative scheduling. A UT can explicitly yield to another UT by calling
the user-mode scheduler; it is not necessary to enter the kernel. User-mode
scheduling is explained in more detail in Section 21.7.3.7.
In Windows, the dispatcher is not a separate thread running in the kernel.
Rather, the dispatcher code is executed by the KT component of a UT thread. A
thread goes into kernel mode in the same circumstances that, in other operating
systems, cause a kernel thread to be called. These same circumstances will
cause the KT to run through the dispatcher code after its other operations,
determining which thread to run next on the current core.

Like many other modern operating systems, Windows uses threads as the key
schedulable unit of executable code, with processes serving as containers of
threads. Therefore, each process must have at least one thread, and each thread
has its own scheduling state, including actual priority, processor affinity, and
CPU usage information.
There are eight possible thread states: initializing, ready, deferred-
ready, standby, running, waiting, transition, and terminated. ready
indicates that the thread is waiting to execute, while deferred-ready indicates that the thread has been selected to run on a specific processor but has not yet
been scheduled. A thread is running when it is executing on a processor core. It
runs until it is preempted by a higher-priority thread, until it terminates, until
its allotted execution time (quantum) ends, or until it waits on a dispatcher
object, such as an event signaling I/O completion. If a thread is preempting
another thread on a different processor, it is placed in the standby state on
that processor, which means it is the next thread to run.
Preemption is instantaneous—the current thread does not get a chance to finish its quantum. Therefore, the processor sends a software interrupt—in this case, a deferred procedure call (DPC)—to signal to the other processor that a thread is in the standby state and should be immediately picked up for execution. Interestingly, a thread in the standby state can itself be preempted if yet another processor finds an even higher-priority thread to run on this processor. At that point, the new higher-priority thread will go to standby,
and the previous thread will go to the ready state. A thread is in the waiting
state when it is waiting for a dispatcher object to be signaled. A thread is in
the transition state while it waits for resources necessary for execution; for
example, it may be waiting for its kernel stack to be paged in from secondary
storage. A thread enters the terminated state when it finishes execution, and a thread begins in the initializing state as it is being created, before becoming ready for the first time.
The dispatcher uses a 32-level priority scheme to determine the order of
thread execution. Priorities are divided into two classes: variable class and
static class. The variable class contains threads having priorities from 1 to 15,
and the static class contains threads with priorities ranging from 16 to 31. The
dispatcher uses a linked list for each scheduling priority; this set of lists is called the dispatcher database. The database uses a bitmap to indicate the presence of at least one entry in the list associated with the priority of the bit's position. Therefore, instead of having to traverse the set of lists from highest to lowest until it finds a thread that is ready to run, the dispatcher can simply find the list associated with the highest bit set.
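This bitmap lookup can be sketched in C. The sketch below is illustrative only; the names (ready_summary, ready_lists, pick_next) are invented for the example, the real kernel structures are more elaborate, and a GCC/Clang-style __builtin_clz is assumed for the find-highest-set-bit step:

#include <stdint.h>

#define PRIORITY_LEVELS 32

typedef struct thread thread_t;             /* opaque for this sketch */

struct dispatcher_db {
    uint32_t ready_summary;                 /* bit n set => list n non-empty */
    thread_t *ready_lists[PRIORITY_LEVELS]; /* one list head per priority   */
};

/* Find the highest-priority non-empty ready list in constant time by
   locating the highest set bit instead of scanning all 32 lists. */
static thread_t *pick_next(struct dispatcher_db *db)
{
    if (db->ready_summary == 0)
        return 0;                           /* no ready thread: run idle */
    int top = 31 - __builtin_clz(db->ready_summary);
    return db->ready_lists[top];
}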
Prior to Windows Server 2003, the dispatcher database was global,
resulting in heavy contention on large CPU systems. In Windows Server 2003
and later versions, the global database was broken apart into per-processor
databases, with per-processor locks. With this new model, a thread will only
be in the database of its ideal processor. It is thus guaranteed to have a processor affinity that includes the processor on whose database it is located. The dispatcher can now simply pick the first thread in the list associated with
the highest bit set and does not have to acquire a global lock. Dispatching
is therefore a constant-time operation, parallelizable across all CPUs on the
machine.
On a single-processor system, if no ready thread is found, the dispatcher
executes a special thread called the idle thread, whose role is to begin the
transition to one of the CPU’s initial sleep states. Priority class 0 is reserved for
the idle thread. On a multiprocessor system, before executing the idle thread,
the dispatcher looks at the dispatcher databases of other nearby processors,
taking caching topologies and NUMA node distances into consideration. This
operation requires acquiring the locks of other processor cores in order to safely
inspect their lists. If no thread can be stolen from a nearby core, the dispatcher
looks at the next nearest core, and so on. If no threads can be stolen at all, then
the processor executes the idle thread. Therefore, in a multiprocessor system,
each CPU will have its own idle thread.
Putting each thread on only the dispatcher database of its ideal processor
causes a locality problem. Imagine a CPU executing a thread at priority 2 in a
CPU-bound way, while another CPU is executing a thread at priority 18, also
CPU-bound. Then, a thread at priority 17 becomes ready. If the ideal processor
of this thread is the rst CPU, the thread preempts the current running thread.
But if the ideal processor is the latter CPU, it goes into the ready queue instead,
waiting for its turn to run (which won’t happen until the priority 17 thread
gives up the CPU by terminating or entering a wait state).
Windows 7 introduced a load-balancer algorithm to address this situation,
but it was a heavy-handed and disruptive approach to the locality issue. Win-
dows 8 and later versions solved the problem in a more nuanced way. Instead
of a global database as in Windows XP and earlier versions, or a per-processor
database as in Windows Server 2003 and later versions, the newer Windows
versions combine these approaches to form a shared ready queue among a
group of some, but not all, processors. The number of CPUs that form one
shared group depends on the topology of the system, as well as on whether
it is a server or client system. The number is chosen to keep contention low
on very large processor systems, while avoiding locality (and thus latency and
contention) issues on smaller client systems. Additionally, processor affinities
are still respected, so that a processor in a given group is guaranteed that all
threads in the shared ready queue are appropriate—it never needs to “skip”
over a thread, keeping the algorithm constant time.
Windows has a timer that expires every 15 milliseconds to create a clock "tick" used to examine system states, update the time, and do other housekeeping. That tick is received by the thread running on every non-idle core. The interrupt handler (being
run by the thread, now in KT mode) determines if the thread’s quantum has
expired. When a thread’s time quantum runs out, the clock interrupt queues
a quantum-end DPC to the processor. Queuing the DPC results in a software
interrupt when the processor returns to normal interrupt priority. The software
interrupt causes the thread to run dispatcher code in KT mode to reschedule the
processor to execute the next ready thread at the preempted thread’s priority
level in a round-robin fashion. If no other thread at this level is ready, a lower-
priority ready thread is not chosen, because a higher-priority ready thread
already exists—the one that exhausted its quantum in the first place. In this
situation, the quantum is simply restored to its default value, and the same
thread executes once again. Therefore, Windows always executes the highest-
priority ready thread.
When a variable-priority thread is awakened from a wait operation, the
dispatcher may boost its priority. The amount of the boost depends on the type
of wait associated with the thread. If the wait was due to I/O, then the boost
depends on the device for which the thread was waiting. For example, a thread
waiting for sound I/O would get a large priority increase, whereas a thread
waiting for a disk operation would get a moderate one. This strategy enables
I/O-bound threads to keep the I/O devices busy while permitting compute-
bound threads to use spare CPU cycles in the background.
Another type of boost is applied to threads waiting on mutex, semaphore,
or event synchronization objects. This boost is usually a hard-coded value
of one priority level, although kernel drivers have the option of making a
different change. (For example, the kernel-mode GUI code applies a boost of
two priority levels to all GUI threads waking up to process window messages.)
This strategy is used to reduce the latency between when a lock or other notification mechanism is signaled and when the next waiter in line executes
in response to the state change.
In addition, the thread associated with the user’s active GUI window
receives a priority boost of two whenever it wakes up for any reason, on top
of any other existing boost, to enhance its response time. This strategy, called
the foreground priority separation boost, tends to give good response times to
interactive threads.
Finally, Windows Server 2003 added a lock-handoff boost for certain classes
of locks, such as critical sections. This boost is similar to the mutex, semaphore,
and event boost, except that it tracks ownership. Instead of boosting the waking
thread by a hard-coded value of one priority level, it boosts to one priority
level above that of the current owner (the one releasing the lock). This helps in
situations where, for example, a thread at priority 12 is releasing a mutex, but
the waiting thread is at priority 8. If the waiting thread receives a boost only to
9, it will not be able to preempt the releasing thread. But if it receives a boost
to 13, it can preempt and instantly acquire the critical section.
Because threads may run with boosted priorities when they wake up from
waits, the priority of a thread is lowered at the end of every quantum as long
as the thread is above its base (initial) priority. This is done according to the
following rule: For I/O threads and threads boosted due to waking up because
of an event, mutex, or semaphore, one priority level is lost at quantum end.
For threads boosted due to the lock-handoff boost or the foreground priority
separation boost, the entire value of the boost is lost. Threads that have received
boosts of both types will obey both of these rules (losing one level of the first boost, as well as the entirety of the second boost). Lowering the thread's priority makes sure that the boost is applied only for latency reduction and for
keeping I/O devices busy, not to give undue execution preference to compute-
bound threads.
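A minimal sketch of these quantum-end decay rules, with invented field names (the actual kernel bookkeeping is more involved):

typedef struct {
    int base_priority;     /* initial priority                         */
    int current_priority;  /* may be above base due to boosts          */
    int handoff_boost;     /* lock-handoff / foreground separation part */
} sched_state_t;

static void quantum_end(sched_state_t *t)
{
    /* The entire lock-handoff or foreground separation boost is lost. */
    t->current_priority -= t->handoff_boost;
    t->handoff_boost = 0;

    /* One level of any remaining I/O or event/mutex/semaphore boost is
       lost, but the priority never drops below the base priority. */
    if (t->current_priority > t->base_priority)
        t->current_priority--;
}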

Scheduling occurs when a thread enters the ready or waiting state, when a thread terminates, or when an application changes a thread's processor affinity.
As we have seen throughout the text, a thread could become ready at any
time. If a higher-priority thread becomes ready while a lower-priority thread is
running, the lower-priority thread is preempted immediately. This preemption
gives the higher-priority thread instant access to the CPU, without waiting on
the lower-priority thread’s quantum to complete.
It is the lower-priority thread itself, while performing some operation that takes it through the dispatcher code, that wakes up the waiting thread and immediately context-switches to it, placing itself back in the ready state. This
model essentially distributes the scheduling logic throughout dozens of Win-
dows kernel functions and makes each currently running thread behave as
the scheduling entity. In contrast, other operating systems rely on an external
“scheduler thread” triggered periodically based on a timer. The advantage of
the Windows approach is latency reduction, with the cost of added overhead
inside every I/O and other state-changing operation, which causes the current
thread to perform scheduler work.
Windows is not a hard-real-time operating system, however, because it
does not guarantee that any thread, even the highest-priority one, will start to
execute within a particular time limit or have a guaranteed period of execution.
Threads are blocked indefinitely while DPCs and interrupt service routines (ISRs) are running (as further discussed below), and they can be preempted at any time by a higher-priority thread or be forced to round-robin with another thread of equal priority at quantum end.
Traditionally, the Windows scheduler uses sampling to measure CPU uti-
lization by threads. The system timer fires periodically, and the timer inter-
rupt handler takes note of what thread is currently scheduled and whether it
is executing in user or kernel mode when the interrupt occurred. This sam-
pling technique originally came about because either the CPU did not have a
high-resolution clock or the clock was too expensive or unreliable to access
frequently. Although efficient, sampling is inaccurate and leads to anomalies
such as charging the entire duration of the clock (15 milliseconds) to the cur-
rently running thread (or DPC or ISR). Therefore, the system ends up completely
ignoring some number of milliseconds—say, 14.999—that could have been
spent idle, running other threads, running other DPCs and ISRs, or a combi-
nation of all of these operations. Additionally, because the quantum is measured in clock ticks, this causes the premature round-robin selection of a new thread, even though the current thread may have run for only a fraction of the
quantum.
Starting with Windows Vista, execution time is also tracked using the hardware timestamp counter (TSC) included in all processors since the Pentium Pro. Using the TSC results in more accurate accounting of CPU usage (for
applications that use it —note that Task Manager does not) and also causes
the scheduler not to switch out threads before they have run for a full quan-
tum. Additionally, Windows 7 and later versions track, and charge, the TSC
to ISRs and DPCs, resulting in more accurate "Interrupt Time" measurements
as well (again, for tools that use this new measurement). Because all possible
execution time is now accounted for, it is possible to add it to idle time (which
is also tracked using the TSC) and accurately compute the exact number of
CPU cycles out of all possible CPU cycles in a given period (due to the fact
that modern processors have dynamically shifting frequencies), resulting in
cycle-accurate CPU usage measurements. Tools such as Microsoft’s SysInternals
Process Explorer use this mechanism in their user interface.
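This cycle-based accounting is also visible to applications through the documented Win32 call QueryThreadCycleTime (available since Windows Vista). A minimal illustration:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG64 start, end;

    /* The kernel charges TSC cycles to the thread as it runs;
       this API reads the accumulated count for a thread. */
    QueryThreadCycleTime(GetCurrentThread(), &start);

    volatile unsigned long long x = 0;
    for (int i = 0; i < 10000000; i++)
        x += i;                          /* burn some CPU */

    QueryThreadCycleTime(GetCurrentThread(), &end);
    printf("consumed ~%llu CPU cycles\n",
           (unsigned long long)(end - start));
    return 0;
}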

Windows uses a number of dispatcher objects to control dispatching and synchronization in the system. Examples of these objects include the following:

• The event is used to record an event occurrence and to synchronize this occurrence with some action. Notification events signal all waiting threads, and synchronization events signal a single waiting thread.
• The mutex provides kernel-mode or user-mode mutual exclusion associated with the notion of ownership.
• The semaphore acts as a counter or gate to control the number of threads that access a resource.
• The thread is the entity that is scheduled by the kernel dispatcher. It is associated with a process, which encapsulates a virtual address space, a list of open resources, and more. The thread is signaled when the thread exits, and the process, when the process exits (that is, when all of its threads have exited).
• The timer is used to keep track of time and to signal timeouts when operations take too long and need to be interrupted or when a periodic activity needs to be scheduled. Just like events, timers can operate in notification mode (signal all) or synchronization mode (signal one).

All of the dispatcher objects can be accessed from user mode via an open
operation that returns a handle. The user-mode code waits on handles to
synchronize with other threads as well as with the operating system (see
Section 21.7.1).
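For example, a user-mode program can create, signal, and wait on a notification event entirely through handles, using documented Win32 APIs; this is only a minimal illustration:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* A notification ("manual-reset") event: signaling it releases
       every waiting thread until it is explicitly reset. */
    HANDLE ev = CreateEventW(NULL, TRUE /* manual reset */,
                             FALSE /* initially non-signaled */, NULL);
    if (ev == NULL)
        return 1;

    SetEvent(ev);                              /* signal the event    */
    DWORD rc = WaitForSingleObject(ev, 1000);  /* wait up to 1 second */
    printf("wait returned %lu (0 == WAIT_OBJECT_0)\n", rc);

    CloseHandle(ev);                           /* drop the handle     */
    return 0;
}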

Both hardware and software interrupts are prioritized and are serviced in
priority order. There are 16 interrupt request levels (IRQLs) on all Windows ISAs
except the legacy IA-32, which uses 32. The lowest level, IRQL 0, is called the
PASSIVE LEVEL and is the default level at which all threads execute, whether in
kernel or user mode. The next levels are the software interrupt levels for APCs
and DPCs. Levels 3 to 10 are used to represent hardware interrupts based on
selections made by the PnP manager with the help of the HAL and the PCI/ACPI
bus drivers. Finally, the uppermost levels are reserved for the clock interrupt
(used for quantum management) and IPI delivery. The last level, HIGH LEVEL,
blocks all maskable interrupts and is typically used when crashing the system
in a controlled manner.
The Windows IRQLs are defined in Figure 21.2.

The dispatcher implements two types of software interrupts: asynchronous procedure calls (APCs) and deferred procedure calls (DPCs, mentioned earlier).
APCs are used to suspend or resume existing threads, terminate threads, deliver notifications that an asynchronous I/O has completed, and extract or modify the contents of the CPU registers (the context) of a running thread. APCs are queued to specific threads and allow the system to execute both system and user code within a process's context. User-mode execution of an APC cannot
occur at arbitrary times, but only when the thread is waiting and is marked
alertable. Kernel-mode execution of an APC, in contrast, instantaneously exe-
cutes in the context of a running thread because it is delivered as a software
interrupt running at IRQL 1 (APC LEVEL), which is higher than the default IRQL
0 (PASSIVE LEVEL). Additionally, even if a thread is waiting in kernel mode, the
wait can be broken by the APC and resumed once the APC completes execution.
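The alertable-delivery rule for user-mode APCs can be demonstrated with the documented QueueUserAPC and SleepEx APIs; the APC queued below is not delivered until the thread performs an alertable wait:

#include <windows.h>
#include <stdio.h>

/* APC routine: runs in the context of the target thread, but only
   once that thread performs an alertable wait. */
static VOID CALLBACK my_apc(ULONG_PTR arg)
{
    printf("APC delivered with argument %lu\n", (unsigned long)arg);
}

int main(void)
{
    /* Queue a user-mode APC to the current thread... */
    QueueUserAPC(my_apc, GetCurrentThread(), 42);

    /* ...which is delivered only when the thread waits alertably.
       SleepEx returns WAIT_IO_COMPLETION after running the APC. */
    SleepEx(0, TRUE);
    return 0;
}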

interrupt level    type of interrupt

31      machine check or bus error
30      power fail
29      interprocessor notification (request another processor
        to act; e.g., dispatch a process or update the TLB)
28      clock (used to keep track of time)
27      profile
3–26    traditional PC IRQ hardware interrupts
2       dispatch and deferred procedure call (DPC) (kernel)
1       asynchronous procedure call (APC)
0       passive

Figure 21.2 Windows x86 interrupt-request levels (IRQLs).


DPCs are used to postpone interrupt processing. After handling all urgent
device-interrupt processing, the ISR schedules the remaining processing by
queuing a DPC. The associated software interrupt runs at IRQL 2 (DPC LEVEL),
which is lower than all other hardware/I/O interrupt levels. Thus, DPCs do not
block other device ISRs. In addition to deferring device-interrupt processing,
the dispatcher uses DPCs to process timer expirations and to interrupt current
thread execution at the end of the scheduling quantum.
Because IRQL 2 is higher than 0 (PASSIVE) and 1 (APC), execution of DPCs
prevents standard threads from running on the current processor and also
keeps APCs from signaling the completion of I/O. Therefore, it is important
for DPC routines not to take an extended amount of time. As an alternative, the
executive maintains a pool of worker threads. DPCs can queue work items to
the worker threads, where they will be executed using normal thread schedul-
ing at IRQL 0. Because the dispatcher itself runs at IRQL 2, and because paging
operations require waiting on I/O (and that involves the dispatcher), DPC rou-
tines are restricted in that they cannot take page faults, call pageable system
services, or take any other action that might result in an attempt to wait for a
dispatcher object to be signaled. Unlike APCs, which are targeted to a thread, DPC routines make no assumptions about the process context in which the processor is executing, since they execute in the same context as the currently executing thread, which was interrupted.

The kernel dispatcher also provides trap handling for exceptions and interrupts generated by hardware or software. Windows defines several architecture-independent exceptions, including:

• Integer or floating-point overflow
• Integer or floating-point divide by zero
• Illegal instruction
• Data misalignment
• Privileged instruction
• Access violation
• Paging file quota exceeded
• Debugger breakpoint

The trap handlers deal with the hardware-level exceptions (called traps) and call the elaborate exception-handling code performed by the kernel's exception dispatcher. The exception dispatcher creates an exception record containing the reason for the exception and finds an exception handler to deal with it. When an exception occurs in kernel mode, the exception dispatcher simply calls a routine to locate the exception handler. If no handler is found, a fatal system error occurs and the user is left with the infamous "blue screen of death" that signifies system failure. In Windows 10, this is now a friendlier "sad face of sorrow" with a QR code, but the blue color remains.
Exception handling is more complex for user-mode processes, because the
Windows error reporting (WER) service sets up an ALPC error port for every
process, on top of the Win32 environment subsystem, which sets up an ALPC
exception port for every process it creates. (For details on ports, see Section
21.3.5.4.) Furthermore, if a process is being debugged, it gets a debugger port.
If a debugger port is registered, the exception handler sends the exception to
the port. If the debugger port is not found or does not handle that exception, the
dispatcher attempts to find an appropriate exception handler. If none exists, it
contacts the default unhandled exception handler, which will notify WER of the
process crash so that a crash dump can be generated and sent to Microsoft. If
there is a handler, but it refuses to handle the exception, the debugger is called
again to catch the error for debugging. If no debugger is running, a message is
sent to the process’s exception port to give the environment subsystem a chance
to react to the exception. Finally, a message is sent to WER through the error
port, in the case where the unhandled exception handler may not have had a
chance to do so, and then the kernel simply terminates the process containing
the thread that caused the exception.
WER will typically send the information back to Microsoft for further anal-
ysis, unless the user has opted out or is using a local error-reporting server. In
some cases, Microsoft’s automated analysis may be able to recognize the error
immediately and suggest a fix or workaround.
The interrupt dispatcher in the kernel handles interrupts by calling either
an interrupt service routine (ISR) supplied by a device driver or a kernel trap-
handler routine. The interrupt is represented by an interrupt object that contains all the information needed to handle the interrupt. Using an interrupt
object makes it easy to associate interrupt-service routines with an interrupt
without having to access the interrupt hardware directly.
Different processor architectures have different types and numbers of inter-
rupts. For portability, the interrupt dispatcher maps the hardware interrupts
into a standard set.
The kernel uses an interrupt-dispatch table to bind each interrupt level
to a service routine. In a multiprocessor computer, Windows keeps a separate
interrupt-dispatch table (IDT) for each processor core, and each processor’s
IRQL can be set independently to mask out interrupts. All interrupts that occur
at a level equal to or less than the IRQL of a processor are blocked until the
IRQL is lowered by a kernel-level thread or by an ISR returning from interrupt
processing. Windows takes advantage of this property and uses software inter-
rupts to deliver APCs and DPCs, to perform system functions such as synchro-
nizing threads with I/O completion, to start thread execution, and to handle
timers.

21.3.5 Executive
The Windows executive provides a set of services that all environment sub-
systems use. To give you a good basic overview, we discuss the following
services here: object manager, virtual memory manager, process manager,
advanced local procedure call facility, I/O manager, cache manager, security
reference monitor, plug-and-play and power managers, registry, and startup.
Note, though, that the Windows executive includes more than two dozen ser-
vices in total.
The executive is organized according to object-oriented design principles. An object type in Windows is a system-defined data type that has a set of attributes (data values) and a set of methods (for example, functions or operations) that help define its behavior. An object is an instance of an object type. The executive performs its job by using a set of objects whose attributes store the data and whose methods perform the activities.

For managing kernel-mode entities, Windows uses a generic set of interfaces that are manipulated by user-mode programs. Windows calls these entities objects, and the executive component that manipulates them is the object manager. Examples of objects are files, registry keys, devices, ALPC ports, drivers, mutexes, events, processes, and threads. As we saw earlier, some of these, such as mutexes and processes, are dispatcher objects, which means that threads can block in the dispatcher waiting for any of these objects to be signaled. Additionally, most of the non-dispatcher objects include an internal dispatcher object, which is signaled by the executive service controlling it. For example, file objects have an event object embedded, which is signaled when a file is modified.
User-mode and kernel-mode code can access these objects using an opaque value called a handle, which is returned by many APIs. Each process has a handle table containing entries that track the objects used by the process. There is a "system process" (see Section 21.3.5.11) that has its own handle table, which is protected from user code and is used when kernel-mode code is manipulating handles. The handle tables in Windows are represented by a tree structure, which can expand from holding 1,024 handles to holding over 16 million. In addition to using handles, kernel-mode code can also access an object by using a referenced pointer, which it must obtain by calling a special API. When handles are used, they must eventually be closed, to avoid keeping an active reference on the object. Similarly, when kernel code uses a referenced pointer, it must use a special API to drop the reference.
A handle can be obtained by creating an object, by opening an existing
object, by receiving a duplicated handle, or by inheriting a handle from a parent
process. To work around the issue that developers may forget to close their
handles, all of the open handles of a process are implicitly closed when it exits
or is terminated. However, since kernel handles belong to the system-wide
handle table, when a driver unloads, its handles are not automatically closed,
and this can lead to resource leaks on the system.
Since the object manager is the only entity that generates object handles, it is the natural place to centralize calling the security reference monitor (SRM) (see Section 21.3.5.7) to check security. When an attempt is made to open an object, the object manager calls the SRM to check whether a process or thread has the right to access the object. If the access check is successful, the resulting rights (encoded as an access mask) are cached in the handle table. Therefore, the opaque handle both represents the object in the kernel and identifies the access that was granted to the object. This important optimization means that whenever a file is written to (which could happen hundreds of times a second), security checks are completely skipped, since the handle is already encoded as a "write" handle. Conversely, if a handle is a "read" handle, attempts to write to the file would instantly fail, without requiring a security check.
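This behavior is easy to observe from user mode. In the sketch below (documented Win32 APIs; the file name is arbitrary), the handle is opened with a read-only access mask, so the write fails immediately against the cached access mask rather than triggering a fresh security check:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Open (or create) a file with a read-only access mask. */
    HANDLE h = CreateFileW(L"demo.txt", GENERIC_READ,
                           0, NULL, OPEN_ALWAYS,
                           FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    /* The granted access is encoded in the handle: this write fails
       with ERROR_ACCESS_DENIED without a new security check. */
    DWORD written;
    if (!WriteFile(h, "x", 1, &written, NULL))
        printf("WriteFile failed: %lu\n", GetLastError());

    CloseHandle(h);
    return 0;
}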
The object manager also enforces quotas, such as the maximum amount of
memory a process may use, by charging a process for the memory occupied
by all its referenced objects and refusing to allocate more memory when the
accumulated charges exceed the process’s quota.
Because objects can be referenced through handles from user and kernel
mode, and referenced through pointers from kernel mode, the object manager
has to keep track of two counts for each object: the number of handles for the
object and the number of references. The handle count is the number of handles
that refer to the object in all of the handle tables (including the system handle
table). The reference count is the sum of all handles (which count as references)
plus all pointer references done by kernel-mode components. The count is
incremented whenever a new pointer is needed by the kernel or a driver and
decremented when the component is done with the pointer. The purpose of
these reference counts is to ensure that an object is not freed while it still has a
reference, but can still release some of its data (such as the name and security
descriptor) when all handles are closed (since kernel-mode components don’t
need this information).
The object manager maintains the Windows internal name space. In contrast to UNIX, which roots the system name space in the file system, Windows uses an abstract object-manager name space that is visible only in memory or through specialized tools such as the debugger. Instead of file-system directories, the hierarchy is maintained by a special kind of object called a directory object that contains a hash bucket of other objects (including other directory objects). Note that some objects don't have names (such as threads), and even for other objects, whether an object has a name is up to its creator. For example, a process would name a mutex only if it wanted other processes to find, acquire, or inquire about the state of the mutex.
Because processes and threads are created without names, they are referenced through a separate numerical identifier, such as a process ID (PID) or thread ID (TID). The object manager also supports symbolic links in the name space. As an example, DOS drive letters are implemented using symbolic links; ∖Global??∖C: is a symbolic link to the device object ∖Device∖HarddiskVolumeN, representing a mounted file-system volume in the ∖Device directory.
Each object, as mentioned earlier, is an instance of an object type. The object type specifies how instances are to be allocated, how data fields are to be defined, and how the standard set of virtual functions used for all objects is to be implemented. The standard functions implement operations such as mapping names to objects, closing and deleting, and applying security checks. Functions that are specific to a particular type of object are implemented by system services designed to operate on that particular object type, not by the methods specified in the object type.
The parse() function is the most interesting of the standard object functions. It allows the implementation of an object to override the default naming behavior of the object manager (which is to use the virtual object directories). This ability is useful for objects that have their own internal namespace, especially when the namespace might need to be retained between boots. The I/O manager (for file objects) and the configuration manager (for registry key objects) are the most notable users of parse functions.
Returning to our Windows naming example, device objects used to represent file-system volumes provide a parse function. This allows a name like ∖Global??∖C:∖foo∖bar.doc to be interpreted as the file ∖foo∖bar.doc on the volume represented by the device object HarddiskVolume2. We can illustrate how naming, parse functions, objects, and handles work together by looking at the steps to open the file in Windows:

1. An application requests that a file named C:∖foo∖bar.doc be opened.

2. The object manager finds the device object HarddiskVolume2, looks up the parse procedure (for example, IopParseDevice) from the object's type, and invokes it with the file's name relative to the root of the file system.

3. IopParseDevice() looks up the file system that owns the volume HarddiskVolume2 and then calls into the file system, which looks up how to access ∖foo∖bar.doc on the volume, performing its own internal parsing of the foo directory to find the bar.doc file. The file system then allocates a file object and returns it to the I/O manager's parse routine.

4. When the file system returns, the object manager allocates an entry for the file object in the handle table for the current process and returns the handle to the application.

If the file cannot successfully be opened, IopParseDevice returns an error indication to the application.

The executive component that manages the virtual address space, physical memory allocation, and paging is the memory manager (MM). The design of the MM assumes that the underlying hardware supports virtual-to-physical mapping, a paging mechanism, and transparent cache coherence on multiprocessor systems, as well as allowing multiple page-table entries to map to the same physical page frame. The MM in Windows uses a page-based management scheme based on the page sizes supported by the hardware (4 KB, 2 MB, and 1 GB). Pages of data allocated to a process that are not in physical memory are either stored in the paging files on secondary storage or mapped directly to a regular file on a local or remote file system. A page can also be marked zero-fill-on-demand, which initializes the page with zeros before it is mapped, thus erasing the previous contents.
On 32-bit processors such as IA-32 and ARM, each process has a 4-GB virtual
address space. By default, the upper 2 GB are mostly identical for all processes
and are used by Windows in kernel mode to access the operating-system code
and data structures. For 64-bit architectures such as the AMD64 architecture,
Windows provides a 256-TB per-process virtual address space, divided into two
128-TB regions for user mode and kernel mode. (These restrictions are based on
hardware limitations that will soon be lifted. Intel has announced that its future
processors will support up to 128 PB of virtual address space, out of the 16 EB
theoretically available.)
The availability of the kernel’s code in each process’s address space is
important, and commonly found in many other operating systems as well.
Generally, virtual memory is used to map the kernel code into the address space of each process. Then, when, say, a system call is executed or an interrupt is received, the context switch that allows the current core to run that code is lighter-weight than it would otherwise be without this mapping. Specifically, no memory-management registers need to be saved and restored, and the cache
does not get invalidated. The net result is much faster movement between user
and kernel code, compared to older architectures that keep kernel memory
separate and not available within the process address space.
The Windows MM uses a two-step process to allocate virtual memory. The first step reserves one or more pages of virtual addresses in the process's virtual address space. The second step commits the allocation by assigning virtual memory space (physical memory or space in the paging files). Windows limits
the amount of virtual memory space a process consumes by enforcing a quota
on committed memory. A process de-commits memory that it is no longer using
to free up virtual memory space for use by other processes. The APIs used
to reserve virtual addresses and commit virtual memory take a handle on a
process object as a parameter. This allows one process to control the virtual
memory of another.
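The two-step reserve/commit protocol is exposed to applications through the documented VirtualAlloc API. A minimal sketch:

#include <windows.h>

int main(void)
{
    /* Step 1: reserve a 1-MB range of virtual addresses. No physical
       memory or paging-file space is charged yet. */
    SIZE_T size = 1 << 20;
    void *base = VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
    if (base == NULL)
        return 1;

    /* Step 2: commit the first page, charging it against the
       process's commit quota and making it usable. */
    void *page = VirtualAlloc(base, 4096, MEM_COMMIT, PAGE_READWRITE);
    if (page == NULL)
        return 1;
    ((char *)page)[0] = 1;           /* touching it is now legal */

    /* De-commit and release when done. */
    VirtualFree(base, 4096, MEM_DECOMMIT);
    VirtualFree(base, 0, MEM_RELEASE);
    return 0;
}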
Windows implements shared memory by defining a section object. After getting a handle to a section object, a process maps the memory of the section to a range of addresses, called a view. A process can establish a view of the entire section or only the portion it needs. Windows allows sections to be mapped not just into the current process but into any process for which the caller has a handle.
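In the Win32 API, sections are exposed as file mappings, and views are created with MapViewOfFile. A minimal sketch of a pagefile-backed section (the name Local∖DemoSection is arbitrary):

#include <windows.h>
#include <string.h>

int main(void)
{
    /* Create a pagefile-backed section (no named file behind it). */
    HANDLE sec = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL,
                                    PAGE_READWRITE, 0, 4096,
                                    L"Local\\DemoSection");
    if (sec == NULL)
        return 1;

    /* Map a view of the whole section into this process. Another
       process could open the same name and see the same pages. */
    void *view = MapViewOfFile(sec, FILE_MAP_ALL_ACCESS, 0, 0, 0);
    if (view == NULL)
        return 1;

    strcpy((char *)view, "shared data");

    UnmapViewOfFile(view);
    CloseHandle(sec);
    return 0;
}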
Sections can be used in many ways. A section can be backed by secondary storage either in the system paging file or in a regular file (a memory-mapped file). A section can be based, meaning that it appears at the same virtual address for all processes attempting to access it. Sections can also represent physical memory, allowing a 32-bit process to access more physical memory than can fit in its virtual address space. Finally, the memory protection of pages in the section can be set to read-only, read-write, read-write-execute, execute-only, no-access, or copy-on-write.
Let’s look more closely at the last two of these protection settings:

• A no-access page raises an exception if accessed. The exception can be used, for example, to check whether a faulty program iterates beyond the end of an array or simply to detect that the program attempted to access virtual addresses that are not committed to memory. User- and kernel-mode stacks use no-access pages as guard pages to detect stack overflows. Another use is to look for heap buffer overruns. Both the user-mode memory allocator and the special kernel allocator used by the device verifier can be configured to map each allocation onto the end of a page, followed by a no-access page to detect programming errors that access beyond the end of an allocation. (A short sketch after this list illustrates the guard-page idea.)
• The copy-on-write mechanism enables the MM to use physical memory more efficiently. When two processes want independent copies of data
from the same section object, the MM places a single shared copy into
virtual memory and activates the copy-on-write property for that region
of memory. If one of the processes tries to modify data in a copy-on-write
page, the MM makes a private copy of the page for the process.
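The guard-page idea from the first bullet can be sketched with the documented VirtualAlloc and VirtualProtect APIs: a usable page is followed by a no-access page, so the first byte written past the end of the allocation faults immediately:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Commit two pages: one usable, one marked no-access, mimicking
       the allocator layout described above. */
    char *p = VirtualAlloc(NULL, 2 * 4096, MEM_RESERVE | MEM_COMMIT,
                           PAGE_READWRITE);
    if (p == NULL)
        return 1;
    DWORD old;
    VirtualProtect(p + 4096, 4096, PAGE_NOACCESS, &old);

    p[4095] = 'x';        /* last legal byte: fine                  */
    /* p[4096] = 'x';        one byte further raises an access-
                             violation exception immediately        */
    printf("an overrun would fault at %p\n", (void *)(p + 4096));
    return 0;
}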
The virtual address translation on most modern processors uses a multilevel page table. For IA-32 (operating in Physical Address Extension, or PAE, mode) and AMD64 processors, each process has a page directory that contains 512 page-directory entries (PDEs), each 8 bytes in size. Each PDE points to a PTE table that contains 512 page-table entries (PTEs), each 8 bytes in size. Each PTE points to a 4-KB page frame in physical memory. For a variety of reasons, the hardware requires that the page directories or PTE tables at each level of a multilevel page table occupy a single page. Thus, the number of PDEs or PTEs that fit in a page determines how many virtual addresses are translated by that page. See Figure 21.3 for a diagram of this structure.
The structure described so far can be used to represent only 1 GB of virtual
address translation. For IA-32, a second page-directory level is needed, con-
taining only four entries, as shown in the diagram. On 64-bit processors, more
entries are needed. For AMD64, the processor can fill all the remaining entries
in the second page-directory level and thus obtain 512 GB of virtual address
space. Therefore, to support the 256 TB that are required, the processor needs
a third page-directory level (called the PML4), which also has 512 entries, each
pointing to the lower-level directory. As mentioned earlier, future processors
announced by Intel will support 128 PB, requiring a fourth page-directory level
(PML5). Thanks to this hierarchical mechanism, the total size of all page-table pages needed to fully represent a 32-bit virtual address space for a process is only 8 MB.

Figure 21.3 Page-table layout. (The figure shows a page-directory pointer table with four pointers, each referencing a page directory of 512 entries; each PDE references a page table of 512 entries; each PTE references a 4-KB page.)


Figure 21.4 Virtual-to-physical address translation on IA-32. (The figure shows a 32-bit virtual address split into PTR, PDE index, PTE index, and page offset fields.)

Additionally, the MM allocates pages of PDEs and PTEs as needed and
moves page-table pages to secondary storage when not in use, so that the actual
physical memory overhead of the paging structures for each process is usually
approximately 2 KB. The page-table pages are faulted back into memory when
referenced.
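As a quick check of the 8-MB figure, using the 8-byte PAE entries described above:

\[
\frac{2^{32}\ \text{bytes}}{2^{12}\ \text{bytes/page}} = 2^{20}\ \text{PTEs}, \qquad 2^{20} \times 8\ \text{bytes} = 2^{23}\ \text{bytes} = 8\ \text{MB},
\]

with the page-directory levels adding only a few more 4-KB pages on top of the PTE pages.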
We next consider how virtual addresses are translated into physical
addresses on IA-32-compatible processors. A 2-bit value can represent the
values 0, 1, 2, 3. A 9-bit value can represent values from 0 to 511; a 12-bit value,
values from 0 to 4,095. Thus, a 12-bit value can select any byte within a 4-KB
page of memory. A 9-bit value can represent any of the 512 PDEs or PTEs in a
page directory or PTE-table page. As shown in Figure 21.4, translating a virtual address pointer to a byte address in physical memory involves breaking the 32-bit pointer into four values, starting from the most significant bits (a short C sketch following the list illustrates the decoding):

• Two bits are used to index into the four PDEs at the top level of the page
table. The selected PDE will contain the physical page number for each of
the four page-directory pages that map 1 GB of the address space.
• Nine bits are used to select another PDE, this time from a second-level page
directory. This PDE will contain the physical page numbers of up to 512
PTE-table pages.

• Nine bits are used to select one of 512 PTEs from the selected PTE-table
page. The selected PTE will contain the physical page number for the byte
we are accessing.
• Twelve bits are used as the byte offset into the page. The physical address
of the byte we are accessing is constructed by appending the lowest 12 bits
of the virtual address to the end of the physical page number we found in
the selected PTE.
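The following sketch decodes a PAE-mode IA-32 virtual address into the four fields just described; the address value is arbitrary:

#include <stdint.h>
#include <stdio.h>

/* Decode an IA-32 PAE virtual address into its 2 + 9 + 9 + 12 bit
   fields, from most to least significant. */
int main(void)
{
    uint32_t va = 0xC0123ABC;

    unsigned ptr_index = (va >> 30) & 0x3;    /* top-level PDE (0..3)  */
    unsigned pde_index = (va >> 21) & 0x1FF;  /* page-directory entry  */
    unsigned pte_index = (va >> 12) & 0x1FF;  /* page-table entry      */
    unsigned offset    =  va        & 0xFFF;  /* byte within 4-KB page */

    printf("PTR=%u PDE=%u PTE=%u offset=0x%03X\n",
           ptr_index, pde_index, pte_index, offset);
    return 0;
}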

Note that the number of bits in a physical address may be different from
the number of bits in a virtual address. For example, when PAE is enabled
(the only mode supported by Windows 8 and later versions), the IA-32 MMU
is extended to the larger 64-bit PTE size, while the hardware supports 36-bit
physical addresses, granting access to up to 64 GB of RAM, even though a
single process can only map an address space up to 4 GB in size. Today, on
the AMD64 architecture, server versions of Windows support very, very large
physical addresses—more than we can possibly use or even buy (24 TB as of
the latest release). (Of course, at one time 4 GB seemed optimistically large
for physical memory.)
To improve performance, the MM maps the page-directory and PTE-table
pages into the same contiguous region of virtual addresses in every process.
This self-map allows the MM to use the same pointer to access the current PDE
or PTE corresponding to a particular virtual address no matter what process is
running. The self-map for the IA-32 takes a contiguous 8-MB region of kernel
virtual address space; the AMD64 self-map occupies 512 GB. Although the self-
map occupies significant address space, it does not require any additional
virtual memory pages. It also allows the page table’s pages to be automatically
paged in and out of physical memory.
In the creation of a self-map, one of the PDEs in the top-level page directory
refers to the page-directory page itself, forming a “loop” in the page-table
translations. The virtual pages are accessed if the loop is not taken, the PTE-table
pages are accessed if the loop is taken once, the lowest-level page-directory
pages are accessed if the loop is taken twice, and so forth.
The additional levels of page directories used for 64-bit virtual memory are
translated in the same way except that the virtual address pointer is broken up
into even more values. For the AMD64, Windows uses four full levels, each of
which maps 512 pages, or 9 + 9 + 9 + 9 + 12 = 48 bits of virtual address.
To avoid the overhead of translating every virtual address by looking up the PDE and PTE, processors use translation look-aside buffer (TLB) hardware, which contains an associative memory cache for mapping virtual pages to PTEs. The TLB is part of the memory-management unit (MMU) within each processor. The MMU needs to "walk" (navigate the data structures of) the page table in memory only when a needed translation is missing from the TLB.
The PDEs and PTEs contain more than just physical page numbers. They
also have bits reserved for operating-system use and bits that control how the
hardware uses memory, such as whether hardware caching should be used for
each page. In addition, the entries specify what kinds of access are allowed for
both user and kernel modes.
A PDE can also be marked to say that it should function as a PTE rather than a PDE. On IA-32, the first 11 bits of the virtual address pointer select a PDE in the first two levels of translation. If the selected PDE is marked to act as a PTE, then the remaining 21 bits of the pointer are used as the offset of the byte. This results in a 2-MB page size. Mixing and matching 4-KB and 2-MB page sizes within the page table is easy for the operating system and can significantly improve the performance of some programs. The improvement results from
reducing how often the MMU needs to reload entries in the TLB, since one PDE
mapping 2 MB replaces 512 PTEs, each mapping 4 KB. Newer AMD64 hardware
even supports 1-GB pages, which operate in a similar fashion.
Managing physical memory so that 2-MB pages are available when needed is difficult, as they may continually be broken up into 4-KB pages, causing external fragmentation of memory. Also, the large pages can result in very significant internal fragmentation. Because of these problems, it is typically
only Windows itself, along with large server applications, that use large pages
to improve the performance of the TLB. They are better suited to do so because
operating-system and server applications start running when the system boots,
before memory has become fragmented.
Windows manages physical memory by associating each physical page with one of seven states: free, zeroed, modified, standby, bad, transition, or valid.
• A free page is an available page that has stale or uninitialized content.
• A zeroed page is a free page that has been zeroed out and is ready for
immediate use to satisfy zero-on-demand faults.
• A modified page has been written by a process and must be sent to secondary storage before it is usable by another process.
• A standby page is a copy of information already stored on secondary storage. Standby pages may be pages that were not modified, modified pages that have already been written to secondary storage, or pages that were prefetched because they were expected to be used soon.
• A bad page is unusable because a hardware error has been detected.
• A transition page is on its way from secondary storage to a page frame
allocated in physical memory.
• A valid page either is part of the working set of one or more processes and
is contained within these processes’ page tables, or is being used by the
system directly (such as to store the nonpaged pool).

While valid pages are contained in processes’ page tables, pages in other
states are kept in separate lists according to state type. Additionally, to improve
performance and protect against aggressive recycling of the standby pages,
Windows Vista and later versions implement eight prioritized standby lists.
The lists are constructed by linking the corresponding entries in the page
frame number (PFN) database, which includes an entry for each physical memory
page. The PFN entries also include information such as reference counts, locks,
and NUMA information. Note that the PFN database represents pages of phys-
ical memory, whereas the PTEs represent pages of virtual memory.
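The following C fragment is an illustrative model, not the actual Windows definition, of the seven page states and the kind of bookkeeping a PFN entry carries:

    /* Type and field names are this sketch's own. */
    typedef enum {
        PS_FREE,        /* available; stale or uninitialized content       */
        PS_ZEROED,      /* zero-filled; ready for zero-on-demand faults    */
        PS_MODIFIED,    /* dirty; must be written out before reuse         */
        PS_STANDBY,     /* clean copy of data already on secondary storage */
        PS_BAD,         /* hardware error detected                         */
        PS_TRANSITION,  /* I/O in flight from secondary storage            */
        PS_VALID        /* in a working set or in direct system use        */
    } page_state;

    typedef struct pfn_entry {
        page_state        state;
        unsigned          ref_count;        /* reference count              */
        unsigned          standby_priority; /* 0-7: one of the eight lists  */
        struct pfn_entry *next;             /* link in the per-state list   */
    } pfn_entry;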
When the valid bit in a PTE is zero, hardware ignores all the other bits,
and the MM can define them for its own use. Invalid pages can have a number
of states represented by bits in the PTE. Page-file pages that have never been
faulted in are marked zero-on-demand. Pages mapped through section objects
encode a pointer to the appropriate section object. PTEs for pages that have
been written to the page file contain enough information to locate the page on
secondary storage, and so forth. The structure of the page-file PTE is shown in
Figure 21.5. The T, P, and V bits are all zero for this type of PTE. The PTE includes
5 bits for page protection, 32 bits for page-file offset, and 4 bits to select the
paging file. There are also 20 bits reserved for additional bookkeeping.
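A hedged C sketch of this layout follows; the field widths come from the text, but the exact bit positions are illustrative rather than taken from Windows:

    #include <stdint.h>

    /* Model of the 64-bit page-file PTE of Figure 21.5. */
    typedef struct {
        uint64_t valid      : 1;  /* V: zero for a page-file PTE      */
        uint64_t prototype  : 1;  /* P: zero here                     */
        uint64_t transition : 1;  /* T: zero here                     */
        uint64_t protection : 5;  /* page-protection bits             */
        uint64_t pagefile   : 4;  /* which of the paging files to use */
        uint64_t reserved   : 20; /* additional bookkeeping           */
        uint64_t offset     : 32; /* page-file offset of the page     */
    } pagefile_pte;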
Windows uses a per-working-set, least recently used (LRU) replacement
policy to take pages from processes as appropriate. When a process is started, it
is assigned a default minimum working-set size, at which point the MM starts to
track the age of the pages in each working set. The working set of each process
is allowed to grow until the amount of remaining physical memory starts to
run low. Eventually, when the available memory runs critically low, the MM
trims the working set to remove older pages.
The age of a page depends not on how long it has been in memory but on
when it was last referenced. The MM makes this determination by periodically
passing through the working set of each process and incrementing the age for
pages that have not been marked in the PTE as referenced since the last pass.
When it becomes necessary to trim the working sets, the MM uses heuristics to
decide how much to trim from each process and then removes the oldest pages
first.

Figure 21.5 Page-file page-table entry. The valid bit is zero. (Bits 63–32 hold the page-file offset; the low word holds the T, P, protection, and V bits and the 4-bit paging-file selector.)
A process can have its working set trimmed even when plenty of memory is
available, if it was given a hard limit on how much physical memory it could
use. In Windows 7 and later versions, the MM also trims processes that are
growing rapidly, even if memory is plentiful. This policy change significantly
improved the responsiveness of the system for other processes.
Windows tracks working sets not only for user-mode processes but also
for various kernel-mode regions, which include the file cache and the pageable
kernel heap. Pageable kernel and driver code and data have their own working
sets, as does each TS session. The distinct working sets allow the MM to use
different policies to trim the different categories of kernel memory.
The MM does not fault in only the page immediately needed. Research
shows that the memory referencing of a thread tends to have a locality prop-
erty. That is, when a page is used, it is likely that adjacent pages will be
referenced in the near future. (Think of iterating over an array or fetching
sequential instructions that form the executable code for a thread.) Because of
locality, when the MM faults in a page, it also faults in a few adjacent pages.
This prefetching tends to reduce the total number of page faults and allows
reads to be clustered to improve I/O performance.
In addition to managing committed memory, the MM manages each pro-
cess’s reserved memory, or virtual address space. Each process has an asso-
ciated tree that describes the ranges of virtual addresses in use and what the
uses are. This allows the MM to fault in page-table pages as needed. If the PTE
for a faulting address is uninitialized, the MM searches for the address in the
process’s tree of virtual address descriptors (VADs) and uses this information
to ll in the PTE and retrieve the page. In some cases, a PTE-table page may not
exist; such a page must be transparently allocated and initialized by the MM. In
other cases, the page may be shared as part of a section object, and the VAD will
contain a pointer to that section object. The section object contains information
on how to find the shared virtual page so that the PTE can be initialized to point
to it directly.
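A small Win32 example of the reserve-then-commit pattern that the VAD tree tracks (the sizes and offsets here are arbitrary):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Reserve 1 MB of address space: recorded in the process's VAD
         * tree, but no committed memory is consumed yet. */
        char *base = VirtualAlloc(NULL, 1 << 20, MEM_RESERVE, PAGE_NOACCESS);
        if (!base) return 1;

        /* Commit just one 4-KB page inside the reserved range. */
        char *page = VirtualAlloc(base + 4096, 4096, MEM_COMMIT,
                                  PAGE_READWRITE);
        if (!page) return 1;

        page[0] = 42;  /* first touch: a demand-zero fault fills in the PTE */
        printf("reserved at %p, committed page at %p\n", base, page);

        VirtualFree(base, 0, MEM_RELEASE);  /* release the reservation */
        return 0;
    }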
Starting with Vista, the Windows MM includes a component called Super-
Fetch. This component combines a user-mode service with specialized kernel-
mode code, including a file-system filter, to monitor all paging operations on
the system. Each second, the service queries a trace of all such operations and
uses a variety of agents to monitor application launches, fast user switches,
standby/sleep/hibernate operations, and more as a means of understanding
the system’s usage patterns. With this information, it builds a statistical model,
using Markov chains, of which applications the user is likely to launch when, in
combination with what other applications, and what portions of these appli-
cations will be used. For example, SuperFetch can train itself to understand
that the user launches Microsoft Outlook in the mornings mostly to read e-
mail but composes e-mails later, after lunch. It can also understand that once
Outlook is in the background, Visual Studio is likely to be launched next, and
that the text editor is going to be in high demand, with the compiler demanded
a little less frequently, the linker even less frequently, and the documentation
code hardly ever. With this data, SuperFetch will prepopulate the standby list,
making low-priority I/O reads from secondary storage at idle times to load
what it thinks the user is likely to do next (or another user, if it knows a fast
user switch is likely). Additionally, by using the eight prioritized standby lists
that Windows offers, each such prefetched page can be cached at a level that
matches the statistical likelihood that it will be needed. Thus, unlikely-to-be-
demanded pages can cheaply and quickly be evicted by an unexpected need
for physical memory, while likely-to-be-demanded-soon pages can be kept in
place for longer. Indeed, SuperFetch may even force the system to trim working
sets of other processes before touching such cached pages.
SuperFetch’s monitoring does create considerable system overhead. On
mechanical (rotational) drives, which have seek times in the milliseconds, this
cost is balanced by the benefit of avoiding latencies and multisecond delays in
application launch times. On server systems, however, such monitoring is not
beneficial, given the random multiuser workloads and the fact that throughput
is more important than latency. Further, the combined latency improvements
and bandwidth on systems with fast, efficient nonvolatile memory, such as
SSDs, make the monitoring less beneficial for those systems as well. In such
situations, SuperFetch disables itself, freeing up a few spare CPU cycles.
Windows 10 brings another large improvement to the MM by introducing
a component called the compression store manager. This component creates
a compressed store of pages in the working set of the memory compression
process, which is a type of system process. When shareable pages go on the
standby list and available memory is low (or certain other internal algorithm
decisions are made), pages on the list will be compressed instead of evicted.
This can also happen to modified pages targeted for eviction to secondary
storage—both by reducing memory pressure, perhaps avoiding the write in
the first place, and by causing the written pages to be compressed, thus con-
suming less page-file space and taking less I/O to page out. On today’s fast
multiprocessor systems, often with built-in hardware compression algorithms,
the small CPU penalty is highly preferable to the potential secondary storage
I/O cost.

The Windows process manager provides services for creating, deleting, inter-
rogating, and managing processes, threads, and jobs. It has no knowledge
about parent –child relationships or process hierarchies, although it can group
processes in jobs, and the latter can have hierarchies that must then be main-
tained. The process manager is also not involved in the scheduling of threads,
other than setting the priorities and affinities of the threads in their owner
processes. Additionally, through jobs, the process manager can effect various
changes in scheduling attributes (such as throttling ratios and quantum val-
ues) on threads. Thread scheduling proper, however, takes place in the kernel
dispatcher.
Each process contains one or more threads. Processes themselves can be
collected into larger units called job objects. The original use of job objects
was to place limits on CPU usage, working-set size, and processor affinities
that control multiple processes at once. Job objects were thus used to man-
age large data-center machines. In Windows XP and later versions, job objects
were extended to provide security-related features, and a number of third-
party applications such as Google Chrome began using jobs for this purpose. In
Windows 8, a massive architectural change allowed jobs to influence schedul-
ing through generic CPU throttling as well as per-user-session-aware fairness
throttling/balancing. In Windows 10, throttling support was extended to sec-
ondary storage I/O and network I/O as well. Additionally, Windows 8 allowed
job objects to nest, creating hierarchies of limits, ratios, and quotas that the
system must accurately compute. Additional security and power management
features were given to job objects as well.
As a result, all Windows Store applications and all UWP application
processes run in jobs. The DAM, introduced earlier, implements Connected
Standby support using jobs. Finally, Windows 10’s support for Docker
Containers, a key part of its cloud offerings, uses job objects, which it calls
silos. Thus, jobs have gone from being an esoteric data-center resource
management feature to a core mechanism of the process manager for multiple
features.
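As a sketch of the classic resource-limiting use of jobs, using documented Win32 APIs (the particular limits chosen here are arbitrary):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE job = CreateJobObject(NULL, NULL);
        if (!job) return 1;

        /* Constrain working-set size and the number of active processes. */
        JOBOBJECT_EXTENDED_LIMIT_INFORMATION info = {0};
        info.BasicLimitInformation.LimitFlags =
            JOB_OBJECT_LIMIT_WORKINGSET | JOB_OBJECT_LIMIT_ACTIVE_PROCESS;
        info.BasicLimitInformation.MinimumWorkingSetSize = 1 << 20;  /*  1 MB */
        info.BasicLimitInformation.MaximumWorkingSetSize = 64 << 20; /* 64 MB */
        info.BasicLimitInformation.ActiveProcessLimit = 8;

        if (!SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                                     &info, sizeof(info)))
            return 1;

        /* The current process (and, by default, its children) now live in
         * the job and are governed by its limits. */
        if (!AssignProcessToJobObject(job, GetCurrentProcess()))
            return 1;

        printf("process is now governed by the job's limits\n");
        return 0;
    }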
Due to Windows’s layered architecture and the presence of environment
subsystems, process creation is quite complex. An example of process creation
in the Win32 environment under Windows 10 is as follows. Note that the
launching of UWP “Modern” Windows Store applications (which are called
packaged applications, or “AppX”) is significantly more complex and involves
factors outside the scope of this discussion.

1. A Win32 application calls CreateProcess().

2. A number of parameter conversions and behavioral conversions are done
from the Win32 world to the NT world.

3. CreateProcess() then calls the NtCreateUserProcess() API in the
process manager of the NT executive to actually create the process and
its initial thread.

4. The process manager calls the object manager to create a process object
and returns the object handle to Win32. It then calls the memory manager
to initialize the address space of the new process, its handle table, and
other key data structures, such as the process environment block (PEB)
(which contains internal process management data).

5. The process manager calls the object manager again to create a thread
object and returns the handle to Win32. It then calls the memory manager
to create the thread environment block (TEB) and the dispatcher to
initialize the scheduling attributes of the thread, setting its state to
initializing.

6. The process manager creates the initial thread startup context (which
will eventually point to the main() routine of the application), asks the
scheduler to mark the thread as ready, and then immediately suspends
it, putting it into a waiting state.

7. A message is sent to the Win32 subsystem to notify it that the process is
being created. The subsystem performs additional Win32-specific work to
initialize the process, such as computing its shutdown level and drawing
the animated hourglass or “donut” mouse cursor.

8. Back in CreateProcess(), inside the parent process, the
ResumeThread() API is called to wake up the process’s initial thread.
Control returns to the parent.

9. Now, inside the initial thread of the new process, the user-mode link
loader takes control (inside ntdll.dll, which is automatically mapped
into all processes). It loads all the library dependencies (DLLs) of the
application, creates its initial heap, sets up exception handling and
application compatibility options, and eventually calls the main()
function of the application.
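From user mode, the visible part of this sequence can be exercised with CreateProcess() itself; the sketch below creates the process suspended and resumes its initial thread explicitly (notepad.exe is just a convenient example target):

    #include <windows.h>

    int main(void)
    {
        STARTUPINFOW si = { sizeof(si) };
        PROCESS_INFORMATION pi;
        wchar_t cmd[] = L"notepad.exe";

        if (!CreateProcessW(NULL, cmd, NULL, NULL, FALSE,
                            CREATE_SUSPENDED, NULL, NULL, &si, &pi))
            return 1;

        /* The new process exists, but its initial thread has not yet run
         * ntdll's loader; resuming it lets initialization proceed. */
        ResumeThread(pi.hThread);

        WaitForSingleObject(pi.hProcess, INFINITE);
        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
        return 0;
    }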

The Windows APIs for manipulating virtual memory and threads and for
duplicating handles take a process handle, so their subsystem and other ser-
vices, when notified of process creation, can perform operations on behalf of
the new process without having to execute directly in the new process’s con-
text. Windows also supports a UNIX fork() style of process creation. A number
of features—including process reflection, which is used by the Windows error
reporting (WER) infrastructure during process crashes, as well as the Windows
subsystem for Linux’s implementation of the Linux fork() API—depend on
this capability.
The debugger support in the process manager includes the APIs to suspend
and resume threads and to create threads that begin in suspended mode.
There are also process-manager APIs that get and set a thread’s register context
and access another process’s virtual memory. Threads can be created in the
current process; they can also be injected into another process. The debugger
makes use of thread injection to execute code within a process being debugged.
Unfortunately, the ability to allocate, manipulate, and inject both memory and
threads across processes is often misused by malicious programs.
While running in the executive, a thread can temporarily attach to a dif-
ferent process. Thread attach is used by kernel worker threads that need to
execute in the context of the process originating a work request. For example,
the MM might use thread attach when it needs access to a process’s working set
or page tables, and the I/O manager might use it in updating the status variable
in a process for asynchronous I/O operations.
Like many other modern operating systems, Windows uses a client –server
model throughout, primarily as a layering mechanism, which allows putting
common functionality into a “service” (the equivalent of a daemon in UNIX
terms), as well as splitting out content-parsing code (such as a PDF reader or
Web browser) from system-action-capable code (such as the Web browser’s
capability to save a file on secondary storage or the PDF reader’s ability to
print out a document). For example, on a recent Windows 10 operating sys-
tem, opening the New York Times website with the Microsoft Edge browser
will likely result in 12 to 16 different processes in a complex organization of
“broker,” “renderer/parser,” “JITTer,” services, and clients.
The most basic such “server” on a Windows computer is the Win32 envi-
ronment subsystem, which is the server that implements the operating-system
personality of the Win32 API inherited from the Windows 95/98 days. Many
other services, such as user authentication, network facilities, printer spooling,
Web services, network file systems, and plug-and-play, are also implemented
using this model. To reduce the memory footprint, multiple services are often
collected into a few processes running the svchost.exe program. Each service
is loaded as a dynamic-link library (DLL), which implements the service by rely-
ing on user-mode thread-pool facilities to share threads and wait for messages
(see Section 21.3.5.3). Unfortunately, this pooling originally resulted in poor
user experience in troubleshooting and debugging runaway CPU usage and
memory leaks, and it weakened the overall security of each service. Therefore,
in recent versions of Windows 10, if the system has over 2 GB of RAM, each DLL
service runs in its own individual svchost.exe process.
In Windows, the recommended paradigm for implementing client–server
computing is to use RPCs to communicate requests, because of their inher-
ent security, serialization services, and extensibility features. The Win32 API
supports the Microsoft standard of the DCE-RPC protocol, called MS-RPC, as
described in Section 21.6.2.7.
RPC uses multiple transports (for example, named pipes and TCP/IP) that
can be used to implement RPCs between systems. When an RPC occurs only
between a client and a server on the local system, ALPC can be used as the
transport. Furthermore, because RPC is heavyweight and has multiple system-
level dependencies (including the Win32 environment subsystem itself),
many native Windows services, as well as the kernel, directly use ALPC, which
is not available (nor suitable) for third-party programmers.
ALPC is a message-passing mechanism similar to UNIX domain sockets
and Mach IPC. The server process publishes a globally visible connection-port
object. When a client wants services from the server, it opens a handle to the
server’s connection-port object and sends a connection request to the port. If
the server accepts the connection, then ALPC creates a pair of communication-
port objects, providing the client’s connect API with its handle to the pair, and
then providing the server’s accept API with the other handle to the pair.
At this point, messages can be sent across communication ports as either
datagrams, which behave like UDP and require no reply, or requests, which
must receive a reply. The client and server can then use either synchronous
messaging, in which one side is always blocking (waiting for a request or
expecting a reply), or asynchronous messaging, in which the thread-pool
mechanism can be used to perform work whenever a request or reply is
received, without the need for a thread to block for a message. For servers
located in kernel mode, communication ports also support a callback mech-
anism, which allows an immediate switch to the kernel side (KT) of the user-
mode thread (UT), immediately executing the server’s handler routine.
When an ALPC message is sent, one of two message-passing techniques can
be chosen.

• The first technique is suitable for small to medium-sized messages (below
64 KB). In this case, the port’s kernel message queue is used as intermedi-
ate storage, and the messages are copied from one process, to the kernel,
to the other process. The disadvantage of this technique is the double
buffering, as well as the fact that messages remain in kernel memory until
the intended receiver consumes them. If the receiver is highly contended
or currently unavailable, this may result in megabytes of kernel-mode
memory being locked up.

• The second technique is for larger messages. In this case, a shared-
memory section object is created for the port. Messages sent through the
port’s message queue contain a “message attribute,” called a data view
attribute, that refers to the section object. The receiving side “exposes”
this attribute, resulting in a virtual address mapping of the section object
and a sharing of physical memory. This avoids the need to copy large
messages or to buffer them in kernel-mode memory. The sender places
data into the shared section, and the receiver sees them directly, as soon
as it consumes a message.

Many other possible ways of implementing client–server communication
exist, such as by using mailslots, pipes, sockets, section objects paired with
events, window messages, and more. Each one has its uses, benefits, and
disadvantages. RPC and ALPC remain the safest, most secure, and most
feature-rich mechanisms for such communication, however, and they are the
mechanisms used by the vast majority of Windows processes and services.
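ALPC itself is not available to third-party code, but a named pipe, one of the documented transports mentioned above, shows the same connect/request/reply shape. A minimal server sketch follows (the pipe name demo is arbitrary; a client would open \\.\pipe\demo with CreateFile() and exchange messages with ReadFile()/WriteFile()):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE pipe = CreateNamedPipeW(
            L"\\\\.\\pipe\\demo",
            PIPE_ACCESS_DUPLEX,
            PIPE_TYPE_MESSAGE | PIPE_READMODE_MESSAGE | PIPE_WAIT,
            1, 4096, 4096, 0, NULL);
        if (pipe == INVALID_HANDLE_VALUE) return 1;

        if (ConnectNamedPipe(pipe, NULL)) {      /* block for one client */
            char buf[256]; DWORD n;
            if (ReadFile(pipe, buf, sizeof(buf) - 1, &n, NULL)) {
                buf[n] = '\0';
                printf("request: %s\n", buf);
                WriteFile(pipe, "ok", 2, &n, NULL);  /* reply, RPC-style */
            }
        }
        CloseHandle(pipe);
        return 0;
    }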

The I/O manager is responsible for all device drivers on the system, as well as
for implementing and defining the communication model that allows drivers
to communicate with each other, with the kernel, and with user-mode clients
and consumers. Additionally, as in UNIX-based operating systems, I/O is
always targeted to a file object, even if the device is not a file system. The I/O
manager in Windows allows device drivers to be “filtered” by other drivers,
creating a driver stack through which I/O flows and which can be used to
modify, extend, or enhance the original request. Therefore, the I/O manager
always keeps track of which device drivers and filter drivers are loaded.
Due to the importance of file-system drivers, the I/O manager has special
support for them and implements interfaces for loading and managing file sys-
tems. It works with the MM to provide memory-mapped file I/O and controls
the Windows cache manager, which handles caching for the entire I/O sys-
tem. The I/O manager is fundamentally asynchronous, providing synchronous
I/O by explicitly waiting for an I/O operation to complete. The I/O manager
provides several models of asynchronous I/O completion, including setting of
events, updating of a status variable in the calling process, delivery of APCs to
initiating threads, and use of I/O completion ports, which allow a single thread
to process I/O completions from many other threads. It also manages buffers
for I/O requests.
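A user-mode sketch of the completion-port model mentioned above (the file name demo.txt and the completion key are arbitrary):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE file = CreateFileW(L"demo.txt", GENERIC_READ, FILE_SHARE_READ,
                                  NULL, OPEN_EXISTING,
                                  FILE_FLAG_OVERLAPPED, NULL);
        if (file == INVALID_HANDLE_VALUE) return 1;

        /* Associate the file with a new completion port; key 1 tags it. */
        HANDLE port = CreateIoCompletionPort(file, NULL, 1, 0);
        if (!port) return 1;

        char buf[512];
        OVERLAPPED ov = {0};                         /* read from offset 0 */
        ReadFile(file, buf, sizeof(buf), NULL, &ov); /* returns at once;   */
                                                     /* completion queued  */
        DWORD bytes; ULONG_PTR key; OVERLAPPED *done;
        if (GetQueuedCompletionStatus(port, &bytes, &key, &done, INFINITE))
            printf("key %llu completed with %lu bytes\n",
                   (unsigned long long)key, bytes);

        CloseHandle(port);
        CloseHandle(file);
        return 0;
    }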
Device drivers are arranged in a list for each device (called a driver or
I/O stack). A driver is represented in the system as a driver object. Because
a single driver can operate on multiple devices, the drivers are represented in
the I/O stack by a device object, which contains a link to the driver object.
Additionally, nonhardware drivers can use device objects as a way to expose
different interfaces. As an example, there are TCP6, UDP6, UDP, TCP, RawIp, and
RawIp6 device objects owned by the TCP/IP driver object, even though these
don’t represent physical devices. Similarly, each volume on secondary storage
is its own device object, owned by the volume manager driver object.
Once a handle is opened to a device object, the I/O manager always creates
a file object and returns a file handle instead of a device handle. It then converts
the requests it receives (such as create, read, and write) into a standard form
called an I/O request packet (IRP). It forwards the IRP to the first driver in the
targeted I/O stack for processing. After a driver processes the IRP, it calls the
I/O manager either to forward the IRP to the next driver in the stack or, if all
processing is finished, to complete the operation represented by the IRP.
The I/O request may be completed in a context different from the one in
which it was made. For example, if a driver is performing its part of an I/O
operation and is forced to block for an extended time, it may queue the IRP to
a worker thread to continue processing in the system context. In the original
thread, the driver returns a status indicating that the I/O request is pending
so that the thread can continue executing in parallel with the I/O operation.
An IRP may also be processed in interrupt-service routines and completed in
an arbitrary process context. Because some final processing may need to take
place in the context that initiated the I/O, the I/O manager uses an APC to do
final I/O-completion processing in the process context of the originating thread.
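In kernel mode, the pass-through step of this flow looks roughly like the WDM sketch below; DEVICE_EXTENSION and its LowerDevice field are assumptions of this sketch, not system-defined names:

    #include <ntddk.h>

    /* A filter's dispatch routine forwarding an IRP to the next driver in
     * the stack, the pattern the text describes. */
    typedef struct _DEVICE_EXTENSION {
        PDEVICE_OBJECT LowerDevice;  /* next device object in the I/O stack */
    } DEVICE_EXTENSION, *PDEVICE_EXTENSION;

    NTSTATUS PassThroughDispatch(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        PDEVICE_EXTENSION ext = DeviceObject->DeviceExtension;

        /* Reuse our stack location for the lower driver and hand the IRP
         * on; the I/O manager completes it when the stack finishes. */
        IoSkipCurrentIrpStackLocation(Irp);
        return IoCallDriver(ext->LowerDevice, Irp);
    }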
The I/O stack model is very flexible. As a driver stack is built, vari-
ous drivers have the opportunity to insert themselves into the stack as
filter drivers. Filter drivers can examine and potentially modify each I/O opera-
tion. Volume snapshotting (VSS) and disk encryption (BitLocker)
are two built-in examples of functionality implemented using filter drivers
that execute above the volume manager driver in the stack. File-system filter
drivers execute above the file system and have been used to implement func-
tionalities such as hierarchical storage management, single instancing of files
for remote boot, and dynamic format conversion. Third parties also use file-
system filter drivers to implement anti-malware tools. Due to the large number
of file-system filters, Windows Server 2003 and later versions now include a
filter manager component, which acts as the sole file-system filter and which
loads filters ordered by specific altitudes (relative priorities). This model
allows filters to transparently cache data and repeated queries without having
to know about each other’s requests. It also provides stricter load ordering.
Device drivers for Windows are written to the Windows Driver Model
(WDM) speci cation. This model lays out all the requirements for device
drivers, including how to layer filter drivers, share common code for han-
dling power and plug-and-play requests, build correct cancellation logic, and
so forth.
Because of the richness of the WDM, writing a full WDM device driver for
each new hardware device can involve a great deal of work. In some cases,
the port/miniport model makes it unnecessary to do this for certain hardware
devices. Within a range of devices that require similar processing, such as audio
drivers, storage controllers, or Ethernet controllers, each instance of a device
shares a common driver for that class, called a port driver. The port driver
implements the standard operations for the class and then calls device-specific
routines in the device’s miniport driver to implement device-specific function-
ality. The physical-link layer of the network stack is implemented in this way,
with the ndis.sys port driver implementing much of the generic network
processing functionality and calling out to the network miniport drivers for
specific hardware commands related to sending and receiving network frames
(such as Ethernet).
Similarly, the WDM includes a class/miniclass model. Here, a certain class
of devices can be implemented in a generic way by a single class driver, with
callouts to a miniclass for speci c hardware functionality. For example, the
Windows disk driver is a class driver, as are drivers for CD/DVDs and tape
drives. The keyboard and mouse driver are class drivers as well. These types
of devices don’t need a miniclass, but the battery class driver, for example,
does require a miniclass for each of the various external uninterruptible power
supplies (UPSs) sold by vendors.
Even with the port/miniport and class/miniclass model, significant
kernel-facing code must be written. And this model is not useful for custom
hardware or for logical (nonhardware) drivers. Starting with Windows 2000
Service Pack 4, kernel-mode drivers can be written using the Kernel-Mode
Driver Framework (KMDF), which provides a simplified programming
model for drivers on top of WDM. Another option is the User-Mode Driver
Framework (UMDF), which allows drivers to be written in user mode through
a reflector driver in the kernel that forwards the requests through the kernel’s
I/O stack. These two frameworks make up the Windows Driver Frameworks (WDF)
model, which has reached Version 2.1 in Windows 10 and contains a fully
compatible API between KMDF and UMDF. It has been fully open-sourced on
GitHub.
Because many drivers do not need to operate in kernel mode, and it is easier
to develop and deploy drivers in user mode, UMDF is strongly recommended
for new drivers. It also makes the system more reliable, because a failure in a
user-mode driver does not cause a kernel (system) crash.

In many operating systems, caching is done by the block device system, usually
at the physical/block level. Instead, Windows provides a centralized caching
facility that operates at the logical/virtual file level. The cache manager works
closely with the MM to provide cache services for all components under the
control of the I/O manager. This means that the cache can operate on anything
from remote files on a network share to logical files on a custom file system. The
size of the cache changes dynamically according to how much free memory
is available in the system; it can grow as large as 2 TB on a 64-bit system.
The cache manager maintains a private working set rather than sharing the
system process’s working set, which allows trimming to page out cached files
more effectively. To build the cache, the cache manager memory-maps files into
kernel memory and then uses special interfaces to the MM to fault pages into
or trim them from this private working set, which lets it take advantage of
additional caching facilities provided by the memory manager.
The cache is divided into blocks of 256 KB. Each cache block can hold a
view (that is, a memory-mapped region) of a file. Each cache block is described
by a virtual address control block (VACB) that stores the virtual address and
file offset for the view, as well as the number of processes using the view.
The VACBs reside in arrays maintained by the cache manager, and there are
arrays for critical as well as low-priority cached data to improve performance
in situations of memory pressure.
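Since views are a fixed 256 KB, locating the VACB for a given byte offset is a simple index computation; a sketch of the lookup the text describes, not the actual kernel code:

    #include <stddef.h>
    #include <stdint.h>

    #define VACB_BLOCK_SIZE (256 * 1024)

    static size_t vacb_index_for_offset(uint64_t file_offset)
    {
        return (size_t)(file_offset / VACB_BLOCK_SIZE);
    }

    /* Example: a read at file offset 1,300,000 falls in view 4
     * (1,300,000 / 262,144 = 4), which covers offsets 1,048,576
     * through 1,310,719 of the file. */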
When the I/O manager receives a file’s user-level read request, the I/O
manager sends an IRP to the I/O stack for the volume on which the file resides.
For files that are marked as cacheable, the file system calls the cache manager
to look up the requested data in its cached file views. The cache manager
calculates which entry of that file’s VACB index array corresponds to the byte
offset of the request. The entry either points to the view in the cache or is
invalid. If it is invalid, the cache manager allocates a cache block (and the
corresponding entry in the VACB array) and maps the view into the cache block.
The cache manager then attempts to copy data from the mapped file to the
caller’s buffer. If the copy succeeds, the operation is completed.
If the copy fails, it does so because of a page fault, which causes the MM
to send a noncached read request to the I/O manager. The I/O manager sends
another request down the driver stack, this time requesting a paging operation,
which bypasses the cache manager and reads the data from the file directly into
the page allocated for the cache manager. Upon completion, the VACB is set to
point at the page. The data, now in the cache, are copied to the caller’s buffer,
and the original I/O request is completed. Figure 21.6 shows an overview of
these operations.
When possible, for synchronous operations on cached files, I/O is handled
by the fast I/O mechanism. This mechanism parallels the normal IRP-based I/O

Figure 21.6 File I/O. (A process’s request enters the I/O manager; cached I/O goes to the cache manager and the file system, with data copied between the cache and the caller, while page faults and noncached I/O flow through the VM manager and the disk driver.)


but calls into the driver stack directly rather than passing down an IRP, which
saves memory and time. Because no IRP is involved, the operation should
not block for an extended period of time and cannot be queued to a worker
thread. Therefore, when the operation reaches the file system and calls the
cache manager, the operation fails if the information is not already in the cache.
The I/O manager then attempts the operation using the normal IRP path.
A kernel-level read operation is similar, except that the data can be accessed
directly from the cache rather than being copied to a buffer in user space.
To use file-system metadata (data structures that describe the file system),
the kernel uses the cache manager’s mapping interface to read the metadata.
To modify the metadata, the file system uses the cache manager’s pinning
interface. Pinning a page locks the page into a physical-memory page frame so
that the MM cannot move the page or page it out. After updating the
metadata, the file system asks the cache manager to unpin the page. A modified
page is marked dirty, and so the MM flushes the page to secondary storage.
To improve performance, the cache manager keeps a small history of read
requests and from this history attempts to predict future requests. If the cache
manager finds a pattern in the previous three requests, such as sequential
access forward or backward, it prefetches data into the cache before the next
request is submitted by the application. In this way, the application may find
its data already cached and not need to wait for secondary storage I/O.
The cache manager is also responsible for telling the MM to flush the
contents of the cache. The cache manager’s default behavior is write-back
caching: it accumulates writes for 4 to 5 seconds and then wakes up the cache-
writer thread. When write-through caching is needed, a process can set a flag
when opening the file, or can call an explicit cache-flush function.
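Both options are visible from user mode; a hedged sketch (log.dat is an arbitrary file name):

    #include <windows.h>

    int main(void)
    {
        /* Option 1: open with write-through so writes bypass lazy
         * write-back and reach storage before WriteFile returns. */
        HANDLE f = CreateFileW(L"log.dat", GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_FLAG_WRITE_THROUGH, NULL);
        if (f == INVALID_HANDLE_VALUE) return 1;

        DWORD n;
        WriteFile(f, "critical record\n", 16, &n, NULL);

        /* Option 2: an explicit flush; redundant here, shown for the
         * second option the text describes. */
        FlushFileBuffers(f);
        CloseHandle(f);
        return 0;
    }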
A fast-writing process could potentially fill all the free cache pages before
the cache-writer thread had a chance to wake up and flush the pages to sec-
ondary storage. The cache writer prevents a process from flooding the system
in the following way. When the amount of free cache memory becomes low,
the cache manager temporarily blocks processes attempting to write data and
wakes the cache-writer thread to flush pages to secondary storage. If the fast-
writing process is actually a network redirector for a network file system,
blocking it for too long could cause network transfers to time out and be
retransmitted. This retransmission would waste network bandwidth. To pre-
vent such waste, network redirectors can instruct the cache manager to limit
the backlog of writes in the cache.
Because a network file system needs to move data between secondary
storage and the network interface, the cache manager also provides a DMA
interface to move the data directly. Moving data directly avoids the need to
copy data through an intermediate buffer.

Centralizing management of system entities in the object manager enables
Windows to use a uniform mechanism to perform run-time access validation
and audit checks for every user-accessible entity in the system. Additionally,
and audit checks for every user-accessible entity in the system. Additionally,
even entities not managed by the object manager may have access to the API
routines for performing security checks. Whenever a thread opens a handle to
a protected data structure (such as an object), the
