
Chapter 4
Synchronization

Microsoft Windows XP is a multitasking operating system that can run in a symmetric multiprocessor environment. It’s not my
purpose here to provide a rigorous description of the multitasking capabilities of Microsoft Windows XP; one good place to get
more information is David Solomon and Mark Russinovich’s Inside Windows 2000, Third Edition (Microsoft Press, 2000). All
we need to understand as driver writers is that our code executes in the context of one thread or another (and the thread context
can change from one invocation of our code to another) and that the exigencies of multitasking can yank control away from us
at practically any moment. Furthermore, true simultaneous execution of multiple threads is possible on a multiprocessor
machine. In general, we need to assume two worst-case scenarios:
1. The operating system can preempt any subroutine at any moment for an arbitrarily long period of time, so we cannot be
sure of completing critical tasks without interference or delay.
2. Even if we take steps to prevent preemption, code executing simultaneously on another CPU in the same computer can
interfere with our code—it’s even possible that the same set of instructions belonging to one of our programs could be
executing in parallel in the context of two different threads.
Windows XP allows you to solve these general synchronization problems by using a variety of synchronization primitives. The
system prioritizes the handling of hardware and software interrupts by means of the interrupt request level (IRQL). Some of
these primitives are appropriate at times when you can safely block and unblock threads. One primitive, the spin lock, allows
you to synchronize access to shared resources even at times when thread blocking wouldn't be allowed because of the IRQL at
which the code runs.

4.1 An Archetypal Synchronization Problem


A hackneyed example will motivate this discussion. Suppose your driver has a static integer variable that you use for some
purpose, say, to count the number of I/O requests that are currently outstanding:

static LONG lActiveRequests;

Suppose further that you increment this variable when you receive a request and decrement it when you later complete the
request:

NTSTATUS DispatchPnp(PDEVICE_OBJECT fdo, PIRP Irp)


{
++lActiveRequests;
... // process PNP request
--lActiveRequests;
}

I’m sure you recognize already that a counter such as this one ought not to be a static variable: it should be a member of your
device extension so that each device object has its own unique counter. Bear with me, and pretend that your driver always
manages only a single device. To make the example more meaningful, suppose finally that a function in your driver will be
called when it’s time to delete your device object. You might want to defer the operation until no more requests are outstanding,
so you might insert a test of the counter:

NTSTATUS HandleRemoveDevice(PDEVICE_OBJECT fdo, PIRP Irp)


{
if (lActiveRequests)
<wait for all requests to complete>
IoDeleteDevice(fdo);
}

This example describes a real problem, by the way, which we’ll tackle in Chapter 6 in our discussion of Plug and Play (PnP)
requests. The I/O Manager can try to remove one of our devices at a time when requests are active, and we need to guard
against that by keeping some sort of counter. I’ll show you in Chapter 6 how to use IoAcquireRemoveLock and some related
functions to solve the problem.
A horrible synchronization problem lurks in the code fragments I just showed you, but it becomes apparent only if you look
behind the increment and decrement operations inside DispatchPnp. On an x86 processor, the compiler might implement them
using these instructions:

; ++lActiveRequests;
mov eax, lActiveRequests
add eax, 1
mov lActiveRequests, eax

; --lActiveRequests;
mov eax, lActiveRequests
sub eax, 1
mov lActiveRequests, eax

To expose the synchronization problem, let’s consider first what might go wrong on a single CPU. Imagine two threads that are
both trying to advance through DispatchPnp at roughly the same time. We know they’re not both executing truly
simultaneously because we have only a single CPU for them to share. But imagine that one of the threads is executing near the
end of the function and manages to load the current contents of lActiveRequests into the EAX register just before the other
thread preempts it. Suppose lActiveRequests equals 2 at that instant. As part of the thread switch, the operating system saves
the EAX register (containing the value 2) as part of the outgoing thread’s context image somewhere in main memory.

NOTE
The point being made in the text isn’t limited to thread preemption that occurs as a result of a time slice
expiring. Threads can also involuntarily lose control because of page faults, changes in CPU affinity, or priority
changes instigated by outside agents. Think, therefore, of preemption as being an all-encompassing term that
includes all means of giving control of a CPU to another thread without explicit permission from the currently
running thread.

Now imagine that the other thread manages to get past the incrementing code at the beginning of DispatchPnp. It will
increment lActiveRequests from 2 to 3 (because the first thread never got to update the variable). If the first thread preempts
this other thread, the operating system will restore the first thread’s context, which includes the value 2 in the EAX register.
The first thread now proceeds to subtract 1 from EAX and store the result back in lActiveRequests. At this point,
lActiveRequests contains the value 1, which is incorrect. Somewhere down the road, we might prematurely delete our device
object because we’ve effectively lost track of one I/O request.
Solving this particular problem is easy on an x86 computer—we just replace the load/add/store and load/subtract/store
instruction sequences with atomic instructions:

; ++lActiveRequests;
inc lActiveRequests

; --lActiveRequests;
dec lActiveRequests

On an Intel x86, the INC and DEC instructions cannot be interrupted, so there will never be a case in which a thread can be
preempted in the middle of updating the counter. As it stands, though, this code still isn’t safe in a multiprocessor environment
because INC and DEC are implemented in several microcode steps. It’s possible for two different CPUs to be executing their
microcode just slightly out of step such that one of them ends up updating a stale value. The multi-CPU problem can also be
avoided in the x86 architecture by using a LOCK prefix:

; ++lActiveRequests;
lock inc lActiveRequests

; --lActiveRequests;
lock dec lActiveRequests

The LOCK instruction prefix locks out all other CPUs while the microcode for the current instruction executes, thereby
guaranteeing data integrity.
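In driver code you'd normally get these locked instructions by calling the kernel's interlocked functions rather than writing assembly yourself. Here is a minimal sketch of the request counter using InterlockedIncrement and InterlockedDecrement; the return statement is a placeholder that the earlier fragments omitted:

static LONG lActiveRequests;

NTSTATUS DispatchPnp(PDEVICE_OBJECT fdo, PIRP Irp)
{
  InterlockedIncrement(&lActiveRequests);   // atomic ++lActiveRequests, safe on any number of CPUs
  // ... process PNP request ...
  InterlockedDecrement(&lActiveRequests);   // atomic --lActiveRequests
  return STATUS_SUCCESS;                    // placeholder status for this sketch
}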
Not all synchronization problems have such an easy solution, unfortunately. The point of this example isn’t to demonstrate how
to solve one simple problem on one of the platforms where Windows XP runs but rather to illustrate the two sources of
difficulty: preemption of one thread by another in the middle of a state change and simultaneous execution of conflicting
state-change operations. We can avoid difficulty by judiciously using synchronization primitives, such as mutual exclusion
objects, to block other threads while our thread accesses shared data. At times when thread blocking is impermissible, we can
avoid preemption by using the IRQL priority scheme, and we can prevent simultaneous execution by judiciously using spin locks.


4.2 Interrupt Request Level


Windows XP assigns an interrupt request level to each hardware interrupt and to a select few software events. Each CPU has
its own IRQL. We label the different IRQL levels with names such as PASSIVE_LEVEL, APC_LEVEL, and so on. Figure 4-1
illustrates the range of IRQL values for the x86 platform. (In general, the numeric values of IRQL depend on which platform
you’re talking about.) Most of the time, the computer executes in user mode at PASSIVE_LEVEL. All of your knowledge about
how multitasking operating systems work applies at PASSIVE_LEVEL. That is, the scheduler may preempt a thread at the end
of a time slice or because a higher-priority thread has become eligible to run. Threads can also voluntarily block while they
wait for events to occur.

Figure 4-1. Interrupt request levels.


When an interrupt occurs, the kernel raises the IRQL on the interrupting CPU to the level associated with that interrupt. The
activity of processing an interrupt can be—uh, interrupted—to process an interrupt at a higher IRQL but never to process an
interrupt at the same or a lower IRQL. I’m sorry to use the word interrupt in two slightly different ways here. I struggled to
find a word to describe the temporary suspension of an activity that wouldn’t cause confusion with thread preemption, and that
was the best choice.
What I just said is sufficiently important to be enshrined as a rule:
An activity on a given CPU can be interrupted only by an activity that executes at a higher IRQL.
You have to read this rule the way the computer does. Expiration of a time slice eventually invokes the thread scheduler at
DISPATCH_LEVEL. The scheduler can then make a different thread current. When the IRQL returns to PASSIVE_LEVEL, a
different thread is running. But it’s still true that the first PASSIVE_LEVEL activity wasn’t interrupted by the second
PASSIVE_LEVEL activity. I thought this interpretation was incredible hair-splitting until it was pointed out to me that this
arrangement allows a thread running at APC_LEVEL to be preempted by a different thread running at PASSIVE_LEVEL.
Perhaps a more useful statement of the rule is this one:
An activity on a given CPU can be interrupted only by an activity that executes at a higher IRQL. An activity at or above
DISPATCH_LEVEL cannot be suspended to perform another activity at or below the then-current IRQL.
Since each CPU has its own IRQL, it’s possible for any CPU in a multiprocessor computer to run at an IRQL that’s less than or
equal to the IRQL of any other CPU. In the next major section, I’ll tell you about spin locks, which combine the within-a-CPU
synchronizing behavior of an IRQL with a multiprocessor lockout mechanism. For the time being, though, I’m talking just
about what happens on a single CPU.
To repeat something I just said, user-mode programs execute at PASSIVE_LEVEL. When a user-mode program calls a function
in the native API, the CPU switches to kernel mode but continues to run at PASSIVE_LEVEL in the same thread context. Many
times, the native API function calls an entry point in a driver without raising the IRQL. Driver dispatch routines for most types
of I/O request packet (IRP) execute at PASSIVE_LEVEL. In addition, certain driver subroutines, such as DriverEntry and
AddDevice, execute at PASSIVE_LEVEL in the context of a system thread. In all of these cases, the driver code can be
preempted just as a user-mode application can be.
Certain common driver routines execute at DISPATCH_LEVEL, which is higher than PASSIVE_LEVEL. These include the
StartIo routine, deferred procedure call (DPC) routines, and many others. What they have in common is a need to access fields
in the device object and the device extension without interference from driver dispatch routines and one another. When one of
these routines is running, the rule stated earlier guarantees that no thread can preempt it on the same CPU to execute a driver
dispatch routine because the dispatch routine runs at a lower IRQL. Furthermore, no thread can preempt it to run another of
these special routines because that other routine will run at the same IRQL.

NOTE
Dispatch routine and DISPATCH_LEVEL unfortunately have similar names. Dispatch routines are so called
because the I/O Manager dispatches I/O requests to them. DISPATCH_LEVEL is so called because it’s the IRQL
at which the kernel’s thread dispatcher originally ran when deciding which thread to run next. (The thread
dispatcher runs at SYNCH_LEVEL, if you care. This is the same as DISPATCH_LEVEL on a uniprocessor machine,
if you really care.)

Between DISPATCH_LEVEL and PROFILE_LEVEL is room for various hardware interrupt levels. In general, each device that
generates interrupts has an IRQL that defines its interrupt priority vis-à-vis other devices. A WDM driver discovers the IRQL
for its interrupt when it receives an IRP_MJ_PNP request with the minor function code IRP_MN_START_DEVICE. The
device’s interrupt level is one of the many items of configuration information passed as a parameter to this request. We often
refer to this level as the device IRQL, or DIRQL for short. DIRQL isn’t a single request level. Rather, it’s the IRQL for the
interrupt associated with whichever device is under discussion at the time.
The other IRQL levels have meanings that sometimes depend on the particular CPU architecture. Since those levels are used
internally by the kernel, their meanings aren’t especially germane to the job of writing a device driver. The purpose of
APC_LEVEL, for example, is to allow the system to schedule an asynchronous procedure call (APC), which I’ll describe in
detail later in this chapter. Operations that occur at HIGH_LEVEL include taking a memory snapshot just prior to hibernating
the computer, processing a bug check, handling a totally spurious interrupt, and others. I’m not going to attempt to provide an
exhaustive list here because, as I said, you and I don’t really need to know all the details.
To summarize, drivers are normally concerned with three interrupt request levels:
- PASSIVE_LEVEL, at which many dispatch routines and a few special routines execute
- DISPATCH_LEVEL, at which StartIo and DPC routines execute
- DIRQL, at which an interrupt service routine executes

4.2.1 IRQL in Operation


To illustrate the importance of IRQL, refer to Figure 4-2, which illustrates a possible time sequence of events on a single CPU.
At the beginning of the sequence, the CPU is executing at PASSIVE_LEVEL. At time t1, an interrupt arrives whose service
routine executes at IRQL-1, one of the levels between DISPATCH_LEVEL and PROFILE_LEVEL. Then, at time t2, another
interrupt arrives whose service routine executes at IRQL-2, which is less than IRQL-1. Because of the rule already discussed,
the CPU continues servicing the first interrupt. When the first interrupt service routine completes at time t3, it might request a
DPC. DPC routines execute at DISPATCH_LEVEL. Consequently, the highest priority pending activity is the service routine
for the second interrupt, which therefore executes next. When it finishes at t4, assuming nothing else has occurred in the
meantime, the DPC will run at DISPATCH_LEVEL. When the DPC routine finishes at t5, IRQL can drop back to
PASSIVE_LEVEL.

Figure 4-2. Interrupt priority in action.

4.2.2 IRQL Compared with Thread Priorities


Thread priority is a very different concept from IRQL. Thread priority controls the actions of the scheduler in deciding when to
preempt running threads and what thread to start running next. The only “priority” that means anything at IRQLs above
APC_LEVEL is IRQL itself, and it controls which programs can execute rather than the thread context within which they
execute.


4.2.3 IRQL and Paging


One consequence of running at elevated IRQL is that the system becomes incapable of servicing page faults. The rule this fact
implies is simply stated:
Code executing at or above DISPATCH_LEVEL must not cause page faults.
One implication of this rule is that any of the subroutines in your driver that execute at or above DISPATCH_LEVEL must be
in nonpaged memory. Furthermore, all the data you access in such a subroutine must also be in nonpaged memory. Finally, as
IRQL rises, fewer and fewer kernel-mode support routines are available for your use.
The DDK documentation explicitly states the IRQL restrictions on support routines. For example, the entry for
KeWaitForSingleObject indicates two restrictions:
- The caller must be running at or below DISPATCH_LEVEL.
- If a nonzero timeout period is specified in the call, the caller must be running strictly below DISPATCH_LEVEL.
Reading between the lines, what is being said here is this: if the call to KeWaitForSingleObject might conceivably block for
any period of time (that is, you’ve specified a nonzero timeout), you must be below DISPATCH_LEVEL, where thread
blocking is permitted. If all you want to do is check to see whether an event has been signaled, however, you can be at
DISPATCH_LEVEL. You can’t call this routine at all from an interrupt service routine or other routine running above
DISPATCH_LEVEL.
For the sake of completeness, it’s well to point out that the rule against page faults is really a rule prohibiting any sort of
hardware exception, including page faults, divide checks, bounds exceptions, and so on. Software exceptions, like quota
violations and probe failures on nonpaged memory, are permissible. Thus, it’s acceptable to call ExAllocatePoolWithQuota to
allocate nonpaged memory at DISPATCH_LEVEL.
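As a concrete illustration of the nonpaged-code rule, here is a sketch of how a driver commonly separates pageable from nonpaged routines. The routine name is a placeholder; PAGED_CODE is a DDK macro that asserts, in checked builds, that the caller is below DISPATCH_LEVEL, and the alloc_text pragma places the routine in the pageable PAGE section:

NTSTATUS DispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp);

#ifdef ALLOC_PRAGMA
#pragma alloc_text(PAGE, DispatchSomething)   // this routine runs only at PASSIVE_LEVEL
#endif

NTSTATUS DispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp)
{
  PAGED_CODE();          // checked-build assertion that IRQL is below DISPATCH_LEVEL
  // ... work that may touch pageable code and data ...
  return STATUS_SUCCESS;
}

// A DPC or StartIo routine, by contrast, runs at DISPATCH_LEVEL and must stay in
// the default, nonpaged code section, so it gets no alloc_text(PAGE, ...) pragma.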

4.2.4 Implicitly Controlling IRQL


Most of the time, the system calls the routines in your driver at the correct IRQL for the activities you’re supposed to carry out.
Although I haven’t discussed many of these routines in detail, I want to give you an example of what I mean. Your first
encounter with a new I/O request occurs when the I/O Manager calls one of your dispatch routines to process an IRP. The call
usually occurs at PASSIVE_LEVEL because you might need to block the calling thread and you might need to call any support
routine at all. You can’t block a thread at a higher IRQL, of course, and PASSIVE_LEVEL is the level at which there are the
fewest restrictions on the support routines you can call.

NOTE
Driver dispatch routines usually execute at PASSIVE_LEVEL but not always. You can designate that you want to
receive IRP_MJ_POWER requests at DISPATCH_LEVEL by setting the DO_POWER_INRUSH flag, or by clearing
the DO_POWER_PAGABLE flag, in a device object. Sometimes a driver architecture requires that other drivers
be able to send certain IRPs at DISPATCH_LEVEL. The USB bus driver, for example, accepts data transfer
requests at DISPATCH_LEVEL or below. A standard serial-port driver accepts any read, write, or control
operation at or below DISPATCH_LEVEL.

If your dispatch routine queues the IRP by calling IoStartPacket, your next encounter with the request will be when the I/O
Manager calls your StartIo routine. This call occurs at DISPATCH_LEVEL because the system needs to access the queue of I/O
requests without interference from the other routines that are inserting and removing IRPs from the queue. As I’ll discuss later
in this chapter, queue access occurs under protection of a spin lock, and that carries with it execution at DISPATCH_LEVEL.
Later on, your device might generate an interrupt, whereupon your interrupt service routine will be called at DIRQL. It’s likely
that some registers in your device can’t safely be shared. If you access those registers only at DIRQL, you can be sure that no
one can interfere with your interrupt service routine (ISR) on a single-CPU computer. If other parts of your driver need to
access these crucial hardware registers, you would guarantee that those other parts execute only at DIRQL. The
KeSynchronizeExecution service function helps you enforce that rule, and I’ll discuss it in Chapter 7 in connection with
interrupt handling.
Still later, you might arrange to have a DPC routine called. DPC routines execute at DISPATCH_LEVEL because, among other
things, they need to access your IRP queue to remove the next request from a queue and pass it to your StartIo routine. You call
the IoStartNextPacket service routine to extract the next request from the queue, and it must be called at DISPATCH_LEVEL. It
might call your StartIo routine before returning. Notice how neatly the IRQL requirements dovetail here: queue access, the call
to IoStartNextPacket, and the possible call to StartIo are all required to occur at DISPATCH_LEVEL, and that’s the level at
which the system calls the DPC routine.
Although it’s possible for you to explicitly control IRQL (and I’ll explain how in the next section), there’s seldom any reason
to do so because of the correspondence between your needs and the level at which the system calls you. Consequently, you
don’t need to get hung up on which IRQL you’re executing at from moment to moment: it’s almost surely the correct level for
the work you’re supposed to do right then.


4.2.5 Explicitly Controlling IRQL


When necessary, you can raise and subsequently lower the IRQL on the current processor by calling KeRaiseIrql and
KeLowerIrql. For example, from within a routine running at PASSIVE_LEVEL:

KIRQL oldirql;

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);

KeRaiseIrql(DISPATCH_LEVEL, &oldirql);

KeLowerIrql(oldirql);

1. KIRQL is the typedef name for an integer that holds an IRQL value. We’ll need a variable to hold the current IRQL, so
we declare it this way.
2. This ASSERT expresses a necessary condition for calling KeRaiseIrql: the new IRQL must be greater than or equal to the
current level. If this relation isn’t true, KeRaiseIrql will bugcheck (that is, report a fatal error via a blue screen of death).
3. KeRaiseIrql raises the current IRQL to the level specified by the first argument. It also saves the current IRQL at the
location pointed to by the second argument. In this example, we’re raising IRQL to DISPATCH_LEVEL and saving the
current level in oldirql.
4. After executing whatever code we desired to execute at elevated IRQL, we lower the request level back to its previous
value by calling KeLowerIrql and specifying the oldirql value previously returned by KeRaiseIrql.
After raising the IRQL, you should eventually restore it to the original value. Otherwise, various assumptions made by code
you call later or by the code that called you can later turn out to be incorrect. The DDK documentation says that you must
always call KeLowerIrql with the same value as that returned by the immediately preceding call to KeRaiseIrql, but this
information isn’t exactly right. The only rule that KeLowerIrql actually applies is that the new IRQL must be less than or equal
to the current one. You can lower the IRQL in steps if you want to.
It’s a mistake (and a big one!) to lower IRQL below whatever it was when a system routine called your driver, even if you raise
it back before returning. Such a break in synchronization might allow some activity to preempt you and interfere with a data
object that your caller assumed would remain inviolate.

4.3 Spin Locks


To help you synchronize access to shared data in the symmetric multiprocessing world of Windows XP, the kernel lets you
define any number of spin lock objects. To acquire a spin lock, code on one CPU executes an atomic operation that tests and
then sets a memory variable in such a way that no other CPU can access the variable until the operation completes. If the test
indicates that the lock was previously free, the program continues. If the test indicates that the lock was previously held, the
program repeats the test-and-set in a tight loop: it “spins.” Eventually the owner releases the lock by resetting the variable,
whereupon one of the waiting CPUs’ test-and-set operations will report the lock as free.

Figure 4-3. Using a spin lock to guard a shared resource.


Figure 4-3 illustrates the concept of using a spin lock. Suppose we have some “resource” that might be used simultaneously on
two different CPUs. To make the example concrete, imagine that the resource is the LIST_ENTRY cell that anchors a linked list
of IRPs. The list might be accessed by one or more dispatch routines, a cancel routine, a DPC routine, and perhaps others as
well. Any number of these routines might be executing simultaneously on different CPUs and trying to modify the list anchor.


To prevent chaos, we associate a spin lock with this “resource.”


Suppose now that code executing on CPU A wants to access the shared resource at time t1. It acquires the spin lock and begins
its access. Shortly afterward, at time t2, code executing on CPU B also wants to access the same resource. The CPU-B program
tries to acquire the spin lock. Since CPU A currently owns the spin lock, CPU B spins in a tight loop, continually checking and
rechecking the spin lock to see whether it has become free. When CPU A releases the lock at time t3, CPU B finds the lock free
and claims it. Then CPU B has unfettered access to the resource. Finally, at time t4, CPU B finishes its access and releases the
lock.
I want to be very clear about how a spin lock and a shared resource come to be associated. We make the association when we
design the driver. We decide that we will access the resource only while owning the spin lock. The operating system isn’t
aware of our decision. Furthermore, we can define as many spin locks as we want, to guard as many shared resources as we
want.

4.3.1 Some Facts About Spin Locks


You need to know several important facts about spin locks. First of all, if a CPU already owns a spin lock and tries to obtain it
a second time, the CPU will deadlock. No usage counter or owner identifier is associated with a spin lock; somebody either
owns the lock or not. If you try to acquire the lock when it’s owned, you’ll wait until the owner releases it. If your CPU
happens to already be the owner, the code that would release the lock can never execute because you’re spinning in a tight loop
testing and setting the lock variable.

CAUTION
You can certainly avoid the deadlock that occurs when a CPU tries to acquire a spin lock it already owns by
following this rule: make sure that the subroutine that claims the lock releases it and never tries to claim it
twice, and then don’t call any other subroutine while you own the lock. There’s no policeman in the operating
system to ensure you don’t call other subroutines—it’s just an engineering rule of thumb that will help you avoid
an inadvertent mistake. The danger you’re guarding against is that you (or some maintenance programmer
who follows in your footsteps) might forget that you’ve already claimed a certain spin lock. I’ll tell you about an
ugly exception to this salutary rule in Chapter 5, when I discuss IRP cancel routines.

In addition, acquiring a spin lock raises the IRQL to DISPATCH_LEVEL automatically. Consequently, code that acquires a lock
must be in nonpaged memory and must not block the thread in which it runs. (There is an exception in Windows XP and later
systems. KeAcquireInterruptSpinLock raises the IRQL to the DIRQL for an interrupt and claims the spin lock associated with
the interrupt.)
As an obvious corollary of the previous fact, you can request a spin lock only when you’re running at or below
DISPATCH_LEVEL. Internally, the kernel is able to acquire spin locks at an IRQL higher than DISPATCH_LEVEL, but you
and I are unable to accomplish that feat.
Another fact about spin locks is that very little useful work occurs on a CPU that’s waiting for a spin lock. The spinning
happens at DISPATCH_LEVEL with interrupts enabled, so a CPU that’s waiting for a spin lock can service hardware interrupts.
But to avoid harming performance, you need to minimize the amount of work you do while holding a spin lock that some other
CPU is likely to want.
Two CPUs can simultaneously hold two different spin locks, by the way. This arrangement makes sense: you associate a spin
lock with a certain shared resource, or some collection of shared resources. There’s no reason to hold up processing related to
different resources protected by different spin locks.
As it happens, there are separate uniprocessor and multiprocessor kernels. The Windows XP setup program decides which
kernel to install after inspecting the computer. The multiprocessor kernel implements spin locks as I’ve just described. The
uniprocessor kernel realizes, however, that another CPU can’t be in the picture, so it implements spin locks a bit more simply.
On a uniprocessor system, acquiring a spin lock raises the IRQL to DISPATCH_LEVEL and does nothing else. Do you see how
you still get the synchronization benefit from claiming the so-called lock in this case? For some piece of code to attempt to
claim the same spin lock (or any other spin lock, actually, but that’s not the point here), it would have to be running at or below
DISPATCH_LEVEL, because you can request a spin lock only when running at or below DISPATCH_LEVEL. But we already know that's
impossible because, once you’re above PASSIVE_LEVEL, you can’t be interrupted by any other activity that would run at the
same or a lower IRQL. Q., as we used to say in my high school geometry class, E.D.

4.3.2 Working with Spin Locks


To use a spin lock explicitly, allocate storage for a KSPIN_LOCK object in nonpaged memory. Then call KeInitializeSpinLock
to initialize the object. Later, while running at or below DISPATCH_LEVEL, acquire the lock, perform the work that needs to
be protected from interference, and then release the lock. For example, suppose your device extension contains a spin lock
named QLock that you use for guarding access to a special IRP queue you’ve set up. You’ll initialize this lock in your
AddDevice function:

typedef struct _DEVICE_EXTENSION {
  KSPIN_LOCK QLock;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

NTSTATUS AddDevice(...)
{

PDEVICE_EXTENSION pdx = ...;


KeInitializeSpinLock(&pdx->QLock);

Elsewhere in your driver, say in the dispatch function for some type of IRP, you can claim (and quickly release) the lock
around some queue manipulation that you need to perform. Note that this function must be in nonpaged memory because it
executes for a period of time at an elevated IRQL.

NTSTATUS DispatchSomething(...)
{
KIRQL oldirql;
PDEVICE_EXTENSION pdx = ...;

KeAcquireSpinLock(&pdx->QLock, &oldirql);

KeReleaseSpinLock(&pdx->QLock, oldirql);
}

1. When KeAcquireSpinLock acquires the spin lock, it also raises IRQL to DISPATCH_LEVEL and returns the current (that
is, preacquisition) level in the variable to which the second argument points.
2. When KeReleaseSpinLock releases the spin lock, it also lowers IRQL back to the value specified in the second argument.
If you know you’re already executing at DISPATCH_LEVEL, you can save a little time by calling two special routines. This
technique is appropriate, for example, in DPC, StartIo, and other driver routines that execute at DISPATCH_LEVEL:

KeAcquireSpinLockAtDpcLevel(&pdx->QLock);

KeReleaseSpinLockFromDpcLevel(&pdx->QLock);

4.3.3 Queued Spin Locks


Windows XP introduces a new type of spin lock, called an in-stack queued spin lock, that has a more efficient implementation
than a regular spin lock. The mechanics of using this new kind of lock are a bit different from what I just described. You still
allocate a KSPIN_LOCK object in nonpaged memory to which all relevant parts of your driver have access, and you still
initialize it by calling KeInitializeSpinLock. To acquire and release the lock, however, you use code like the following:

KLOCK_QUEUE_HANDLE qh;

KeAcquireInStackQueuedSpinLock(&pdx->QLock, &qh);

KeReleaseInStackQueuedSpinLock(&qh);

1. The KLOCK_QUEUE_HANDLE structure is opaque—you’re not supposed to know what it contains, but you do have to
reserve storage for it. The best way to do that is to define an automatic variable (hence the in-stack part of the name).
2. Call KeAcquireInStackQueuedSpinLock instead of KeAcquireSpinLock to acquire the lock, and supply the address of the
KLOCK_QUEUE_HANDLE object as the second argument.
3. Call KeReleaseInStackQueuedSpinLock instead of KeReleaseSpinLock to release the lock.
The reason an in-stack queued spin lock is more efficient relates to the performance impact of a standard spin lock. With a
standard spin lock, each CPU that is contending for ownership constantly modifies the same memory location. Each
modification requires every contending CPU to reload the same dirty cache line. A queued spin lock, introduced for internal
use in Windows 2000, avoids this adverse effect by cleverly using interlocked exchange and compare-exchange operations to
track users and waiters for a lock. A waiting CPU continually reads (but does not write) a unique memory location. A CPU that
releases a lock alters the memory variable on which the next waiter is spinning.


Internal queued spin locks can’t be directly used by driver code because they rely on a fixed-size table of lock pointers to
which drivers don’t have access. Windows XP added the in-stack queued spin lock, which relies on an automatic variable
instead of the fixed-size table.
In addition to the two routines I showed you for acquiring and releasing this new kind of spin lock, you can also use two other
routines if you know you’re already executing at DISPATCH_LEVEL: KeAcquireInStackQueuedSpinLockAtDpcLevel and
KeReleaseInStackQueuedSpinLockFromDpcLevel. (Try spelling those names three times fast!)

NOTE
Because Windows versions earlier than XP don’t support the in-stack queued spin lock or interrupt spin lock
routines, you can’t directly call them in a driver intended to be binary portable between versions. The SPINLOCK
sample driver shows how to make a run-time decision to use the newer spin locks under XP and the old spin
locks otherwise.
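The run-time decision the note describes can be made with MmGetSystemRoutineAddress, which is available on Windows 2000 and later. This isn't the SPINLOCK sample's actual code, just a hedged sketch of the idea; the typedef, the global pointer, and the LockQueue helper are invented for illustration:

typedef VOID (FASTCALL *PACQUIRE_ISQSL)(PKSPIN_LOCK, PKLOCK_QUEUE_HANDLE);

PACQUIRE_ISQSL pKeAcquireInStackQueuedSpinLock;   // stays NULL on systems older than XP

VOID LookupNewerSpinLockRoutines(VOID)
{
  UNICODE_STRING name;
  RtlInitUnicodeString(&name, L"KeAcquireInStackQueuedSpinLock");
  pKeAcquireInStackQueuedSpinLock =
    (PACQUIRE_ISQSL) MmGetSystemRoutineAddress(&name);
}

VOID LockQueue(PDEVICE_EXTENSION pdx, PKLOCK_QUEUE_HANDLE qh, PKIRQL oldirql)
{
  if (pKeAcquireInStackQueuedSpinLock)
    pKeAcquireInStackQueuedSpinLock(&pdx->QLock, qh);   // XP or later
  else
    KeAcquireSpinLock(&pdx->QLock, oldirql);            // older systems
}

// The release path makes the same test and calls KeReleaseInStackQueuedSpinLock
// or KeReleaseSpinLock accordingly.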

4.4 Kernel Dispatcher Objects


The kernel provides five types of synchronization objects that you can use to control the flow of nonarbitrary threads. See
Table 4-1 for a summary of these kernel dispatcher object types and their uses. At any moment, one of these objects is in one
of two states: signaled or not-signaled. At times when it’s permissible for you to block a thread in whose context you’re
running, you can wait for one or more objects to reach the signaled state by calling KeWaitForSingleObject or
KeWaitForMultipleObjects. The kernel also provides routines for initializing and controlling the state of each of these objects.

Object      Data Type     Description
Event       KEVENT        Blocks a thread until some other thread detects that an event has occurred
Semaphore   KSEMAPHORE    Used instead of an event when an arbitrary number of wait calls can be satisfied
Mutex       KMUTEX        Excludes other threads from executing a particular section of code
Timer       KTIMER        Delays execution of a thread for a given period of time
Thread      KTHREAD       Blocks one thread until another thread terminates

Table 4-1. Kernel Dispatcher Objects


In the next few sections, I’ll describe how to use the kernel dispatcher objects. I’ll start by explaining when you can block a
thread by calling one of the wait primitives, and then I’ll discuss the support routines that you use with each of the object types.
I’ll finish this section by discussing the related concepts of thread alerts and asynchronous procedure call delivery.

4.4.1 How and When You Can Block


To understand when and how it’s permissible for a WDM driver to block a thread on a kernel dispatcher object, you have to
recall some of the basic facts about threads from Chapter 2. In general, whatever thread was executing at the time of a software
or hardware interrupt continues to be the current thread while the kernel processes the interrupt. We speak of executing
kernel-mode code in the context of this current thread. In response to interrupts of various kinds, the scheduler might decide to
switch threads, of course, in which case a new thread becomes “current.”
We use the terms arbitrary thread context and nonarbitrary thread context to describe the precision with which we can know
the thread in whose context we’re currently operating in a driver subroutine. If we know that we’re in the context of the thread
that initiated an I/O request, the context is not arbitrary. Much of the time, however, a WDM driver can’t know this fact
because chance usually controls which thread is active when the interrupt occurs that results in the driver being called. When
applications issue I/O requests, they cause a transition from user mode to kernel mode. The I/O Manager routines that create an
IRP and send it to a driver dispatch routine continue to operate in this nonarbitrary thread context, as does the first dispatch
routine to see the IRP. We use the term highest-level driver to describe the driver whose dispatch routine first receives the IRP.
As a general rule, only a highest-level driver can know for sure that it’s operating in a nonarbitrary thread context. Let’s
suppose you are a dispatch routine in a lower-level driver, and you’re wondering whether you’re getting called in an arbitrary
thread. If the highest-level driver just sent you an IRP directly from its dispatch routine, you’d be in the original, nonarbitrary,
thread. But suppose that driver had put an IRP on a queue and then returned to the application. That driver would have
removed the IRP from the queue in an arbitrary thread and then sent it or another IRP to you. Unless you know that didn’t
happen, you should assume you’re in an arbitrary thread if you’re not the highest-level driver.
Notwithstanding what I just said, in many situations you can be sure of the thread context. Your DriverEntry and AddDevice
routines are called in a system thread that you can block if you need to. You won’t often need to explicitly block inside these
routines, but you could if you wanted to. You receive IRP_MJ_PNP requests in a system thread too. In many cases, you must
block that thread to correctly process the request. Finally, you’ll sometimes receive I/O requests directly from an application, in
which case you’ll know you’re in a thread belonging to the application.


NOTE
Microsoft uses the term highest-level driver primarily to distinguish between file system drivers and the storage
device drivers they call to do actual I/O. The file system driver is “highest level,” while the storage driver is not.
It would be easy to confuse this concept with the layering of WDM drivers, but it’s not the same. The way I think
of things is that all the WDM drivers for a given piece of hardware, including all the filter drivers, the function
driver, and the bus driver, are collectively either “highest level” or not. A filter driver has no business queuing
an IRP that, but for the intervention of the filter, would have flowed down the stack in the original thread
context. So if the thread context was nonarbitrary when the IRP got to the topmost filter device object (FiDO),
it should still be nonarbitrary in every lower dispatch routine.

Also recall from the discussion earlier in this chapter that you must not block a thread if you’re executing at or above
DISPATCH_LEVEL.
Having recalled these facts about thread context and IRQL, we can state a simple rule about when it’s OK to block a thread:
Block only the thread that originated the request you’re working on, and only when executing at IRQL strictly less than
DISPATCH_LEVEL.
Several of the dispatcher objects, and the so-called Executive Fast Mutex I’ll discuss later in this chapter, offer “mutual
exclusion” functionality. That is, they permit one thread to access a given shared resource without interference from other
threads. This is pretty much what a spin lock does, so you might wonder how to choose between synchronization methods. In
general, I think you should prefer to synchronize below DISPATCH_LEVEL if you can because that strategy allows a thread
that owns a mutual exclusion lock to cause page faults and to be preempted by other threads if the thread continues to hold the
lock for a long time. In addition, this strategy allows other CPUs to continue doing useful work, even though threads have
blocked on those CPUs to acquire the same lock. If any of the code that accesses a shared resource can run at
DISPATCH_LEVEL, though, you must use a spin lock because the DISPATCH_LEVEL code might interrupt code running at
lower IRQL.

4.4.2 Waiting on a Single Dispatcher Object


You call KeWaitForSingleObject as illustrated in the following example:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);


LARGE_INTEGER timeout;
NTSTATUS status = KeWaitForSingleObject(object, WaitReason,
WaitMode, Alertable, &timeout);

As suggested by the ASSERT, you must be executing at or below DISPATCH_LEVEL to even call this service routine.
In this call, object points to the object you want to wait on. Although this argument is typed as a PVOID, it should be a pointer
to one of the dispatcher objects listed in Table 4-1. The object must be in nonpaged memory—for example, in a device
extension structure or other data area allocated from the nonpaged pool. For most purposes, the execution stack can be
considered nonpaged.
WaitReason is a purely advisory value chosen from the KWAIT_REASON enumeration. No code in the kernel actually cares
what value you supply here, so long as you don’t specify WrQueue. (Internally, scheduler code bases some decisions on
whether a thread is currently blocked for this “reason.”) The reason a thread is blocked is saved in an opaque data structure,
though. If you knew more about that data structure and were trying to debug a deadlock of some kind, you could perhaps gain
clues from the reason code. The bottom line: always specify Executive for this parameter; there’s no reason to say anything
else.
WaitMode is one of the two values of the MODE enumeration: KernelMode or UserMode. Alertable is a simple Boolean value.
Unlike WaitReason, these parameters do make a difference in the way the system behaves by controlling whether the wait can
be terminated early in order to deliver asynchronous procedure calls of various kinds. I’ll explain these interactions in more
detail in “Thread Alerts and APCs” later in this chapter. Waiting in user mode also authorizes the Memory Manager to swap
your thread’s kernel-mode stack out. You’ll see examples in this book and elsewhere where drivers create event objects, for
instance, as automatic variables. A bug check would result if some other thread were to call KeSetEvent at elevated IRQL at a
time when the event object was absent from memory. The bottom line: you should probably always wait in KernelMode and
specify FALSE for the Alertable parameter.
The last parameter to KeWaitForSingleObject is the address of a 64-bit timeout value, expressed in 100-nanosecond units. A
positive number for the timeout is an absolute timestamp relative to the January 1, 1601, epoch of the system clock. You can
determine the current time by calling KeQuerySystemTime, and you can add a constant to that value. A negative number is an
interval relative to the current time. If you specify an absolute time, a subsequent change to the system clock alters the duration
of the timeout you might experience. That is, the timeout doesn’t expire until the system clock equals or exceeds whatever
absolute value you specify. In contrast, if you specify a relative timeout, the duration of the timeout you experience is
unaffected by changes in the system clock.


Why January 1, 1601?


Years ago, when I was first learning the Win32 API, I was bemused by the choice of January 1, 1601, as the
origin for the timestamps in Windows NT. I understood the reason for this choice when I had occasion to write
a set of conversion routines. Everyone knows that years divisible by four are leap years. Many people know that
century years (such as 1900) are exceptions—they’re not leap years even though they’re divisible by 4. A few
people know that every fourth century year (such as 1600 and 2000) is an exception to the exception—they are
leap years. January 1, 1601, was the start of a 400-year cycle that ends in a leap year. If you base timestamps
on this origin, it’s possible to write programs that convert a Windows NT timestamp to a conventional
representation of the date (and vice versa) without doing any jumps.

Specifying a zero timeout causes KeWaitForSingleObject to return immediately with a status code indicating whether the
object is in the signaled state. If you’re executing at DISPATCH_LEVEL, you must specify a zero timeout because blocking is
not allowed. Each kernel dispatcher object offers a KeReadStateXxx service function that allows you to determine the state of
the object. Reading the state isn’t completely equivalent to waiting for zero time, however: when KeWaitForSingleObject
discovers that the wait is satisfied, it performs the side effects that the particular object requires. In contrast, reading the state of
the object doesn’t perform the operations, even if the object is already signaled and a wait would be satisfied if it were
requested right now.
Specifying a NULL pointer for the timeout parameter is OK and indicates an infinite wait.
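As a concrete sketch of the three flavors just described, remember that the units are 100 nanoseconds and that relative intervals are negative. (evSomething here is a hypothetical KEVENT in the device extension.)

LARGE_INTEGER timeout;
NTSTATUS status;

// Relative timeout: wait up to one second (10,000,000 units of 100 ns).
timeout.QuadPart = -10000000;
status = KeWaitForSingleObject(&pdx->evSomething, Executive, KernelMode,
  FALSE, &timeout);

// Zero timeout: just poll the object; this form is required at DISPATCH_LEVEL.
timeout.QuadPart = 0;
status = KeWaitForSingleObject(&pdx->evSomething, Executive, KernelMode,
  FALSE, &timeout);

// NULL timeout pointer: wait forever (permitted only below DISPATCH_LEVEL).
status = KeWaitForSingleObject(&pdx->evSomething, Executive, KernelMode,
  FALSE, NULL);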
The return value indicates one of several possible results. STATUS_SUCCESS is the result you expect and indicates that the
wait was satisfied. That is, either the object was in the signaled state when you made the call to KeWaitForSingleObject or else
the object was in the not-signaled state and later became signaled. When the wait is satisfied in this way, operations might need
to be performed on the object. The nature of these operations depends on the type of the object, and I’ll explain them later in
this chapter in connection with discussing each type of object. (For example, a synchronization type of event will be reset after
your wait is satisfied.)
The return value STATUS_TIMEOUT indicates that the specified timeout occurred without the object reaching the signaled
state. If you specify a zero timeout, KeWaitForSingleObject returns immediately with either this code (indicating that the
object is not-signaled) or STATUS_SUCCESS (indicating that the object is signaled). This return value isn’t possible if you
specify a NULL timeout parameter pointer because you thereby request an infinite wait.
Two other return values are possible. STATUS_ALERTED and STATUS_USER_APC mean that the wait has terminated without
the object having been signaled because the thread has received an alert or a user-mode APC, respectively. I’ll discuss these
concepts a bit further on in “Thread Alerts and APCs.”
Note that STATUS_TIMEOUT, STATUS_ALERTED, and STATUS_USER_APC all pass the NT_SUCCESS test. Therefore,
don’t simply use NT_SUCCESS on the return code from KeWaitForSingleObject in the expectation that it will distinguish
between cases in which the object was signaled and cases in which the object was not signaled.
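In other words, test for the specific status codes you care about rather than relying on NT_SUCCESS. A small sketch, reusing the hypothetical names from the preceding fragment:

status = KeWaitForSingleObject(&pdx->evSomething, Executive, KernelMode,
  FALSE, &timeout);
if (status == STATUS_SUCCESS)
  {
  // the wait was satisfied: the object was, or became, signaled
  }
else if (status == STATUS_TIMEOUT)
  {
  // the timeout expired before the object was signaled
  }
// STATUS_ALERTED and STATUS_USER_APC are also possible for alertable waits;
// note that NT_SUCCESS(status) would have been TRUE for all of these codes.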

Windows 98/Me Compatibility Note


KeWaitForSingleObject and KeWaitForMultipleObjects have a horrible bug in Windows 98 and Millennium in that
they can return the undocumented and nonsensical value 0xFFFFFFFF in two situations. One situation occurs
when a thread terminates while blocked on a WDM object. The wait returns early with this bogus code. The
return code should never happen (because it’s undocumented), and the wait shouldn’t terminate early unless
you specify TRUE for the Alertable parameter. You can work around this problem by just reissuing the wait.

The other circumstance in which you can get the bogus return occurs if the thread you’re trying to block is
already blocked. How, you might well ask, could you be executing in the context of a thread that’s really
blocked? This situation happens in Windows 98/Me when someone blocks on a VxD-level object with the
BLOCK_SVC_INTS flag and the system later calls a function in your driver at what’s called event time. You can
nominally be in the context of the blocked thread, and you simply cannot block a second time on a WDM object.
In fact, I’ve even seen KeWaitForSingleObject return with the IRQL raised to DISPATCH_LEVEL in this
circumstance. As far as I know, there’s no workaround for the problem. Thankfully, it seems to occur only with
drivers for serial devices, in which there’s a crossover between VxD and WDM code.

4.4.3 Waiting on Multiple Dispatcher Objects


KeWaitForMultipleObjects is a companion function to KeWaitForSingleObject that you use when you want to wait for one or
all of several dispatcher objects simultaneously. Call this function as in this example:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);


LARGE_INTEGER timeout;
NTSTATUS status = KeWaitForMultipleObjects(count, objects,
WaitType, WaitReason, WaitMode, Alertable, &timeout, waitblocks);

Here objects is the address of an array of pointers to dispatcher objects, and count is the number of pointers in the array. The
count must be less than or equal to the value MAXIMUM_WAIT_OBJECTS, which currently equals 64. The array, as well as
each of the objects to which the elements of the array point, must be in nonpaged memory. WaitType is one of the enumeration
values WaitAll or WaitAny and specifies whether you want to wait until all of the objects are simultaneously in the signaled
state or whether, instead, you want to wait until any one of the objects is signaled.
The waitblocks argument points to an array of KWAIT_BLOCK structures that the kernel will use to administer the wait
operation. You don’t need to initialize these structures in any way—the kernel just needs to know where the storage is for the
group of wait blocks that it will use to record the status of each of the objects during the wait. If you’re waiting for a small
number of objects (specifically, a number no bigger than THREAD_WAIT_OBJECTS, which currently equals 3), you can
supply NULL for this parameter. If you supply NULL, KeWaitForMultipleObjects uses a preallocated array of wait blocks that
lives in the thread object. If you’re waiting for more objects than this, you must provide nonpaged memory that’s at least count
* sizeof(KWAIT_BLOCK) bytes in length.
The remaining arguments to KeWaitForMultipleObjects are the same as the corresponding arguments to
KeWaitForSingleObject, and most return codes have the same meaning.
If you specify WaitAll, the return value STATUS_SUCCESS indicates that all the objects managed to reach the signaled state
simultaneously. If you specify WaitAny, the return value is numerically equal to the objects array index of the single object that
satisfied the wait. If more than one of the objects happens to be signaled, you’ll be told about one of them—maybe the
lowest-numbered of all the ones that are signaled at that moment, but maybe some other one. You can think of this value being
STATUS_WAIT_0 plus the array index. You can’t simply perform the usual NT_SUCCESS test of the returned status before
extracting the array index from the status code, though, because other possible return codes (including STATUS_TIMEOUT,
STATUS_ALERTED, and STATUS_USER_APC) would also pass the test. Use code like this:

NTSTATUS status = KeWaitForMultipleObjects(...);


if ((ULONG) status < count)
{
ULONG iSignaled = (ULONG) status - (ULONG) STATUS_WAIT_0;

When KeWaitForMultipleObjects returns a status code equal to an object’s array index in a WaitAny case, it also performs the
operations required by that object. If more than one object is signaled and you specified WaitAny, the operations are performed
only for the one that’s deemed to satisfy the wait and whose index is returned. That object isn’t necessarily the first one in your
array that happens to be signaled.
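Here's a minimal sketch of a WaitAny wait on two event objects (evA and evB are hypothetical KEVENTs in the device extension). Because the count doesn't exceed THREAD_WAIT_OBJECTS, the waitblocks parameter can be NULL:

PVOID objects[2];
NTSTATUS status;

objects[0] = &pdx->evA;
objects[1] = &pdx->evB;

status = KeWaitForMultipleObjects(2, objects, WaitAny, Executive,
  KernelMode, FALSE, NULL, NULL);      // NULL timeout means wait forever

if ((ULONG) status < 2)
  {
  ULONG iSignaled = (ULONG) status - (ULONG) STATUS_WAIT_0;
  // objects[iSignaled] is the one that satisfied the wait
  }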

4.4.4 Kernel Events


You use the service functions listed in Table 4-2 to work with kernel event objects. To initialize an event object, first reserve
nonpaged storage for an object of type KEVENT and then call KeInitializeEvent:

ASSERT(KeGetCurrentIrql() == PASSIVE_LEVEL);
KeInitializeEvent(event, EventType, initialstate);

Event is the address of the event object. EventType is one of the enumeration values NotificationEvent and
SynchronizationEvent. A notification event has the characteristic that, when it is set to the signaled state, it stays signaled until
it’s explicitly reset to the not-signaled state. Furthermore, all threads that wait on a notification event are released when the
event is signaled. This is like a manual-reset event in user mode. A synchronization event, on the other hand, gets reset to the
not-signaled state as soon as a single thread gets released. This is what happens in user mode when someone calls SetEvent on
an auto-reset event object. The only operation performed on an event object by KeWaitXxx is to reset a synchronization event
to not-signaled. Finally, initialstate is TRUE to specify that the initial state of the event is to be signaled and FALSE to specify
that the initial state is to be not-signaled.

Service Function     Description
KeClearEvent         Sets event to not-signaled; doesn't report previous state
KeInitializeEvent    Initializes event object
KeReadStateEvent     Determines current state of event (Windows XP and Windows 2000 only)
KeResetEvent         Sets event to not-signaled; returns previous state
KeSetEvent           Sets event to signaled; returns previous state

Table 4-2. Service Functions for Use with Kernel Event Objects

NOTE
In this series of sections on synchronization primitives, I’m repeating the IRQL restrictions that the DDK
documentation describes. In the current release of Microsoft Windows XP, the DDK is sometimes more
restrictive than the operating system actually is. For example, KeClearEvent can be called at any IRQL, not just
at or below DISPATCH_LEVEL. KeInitializeEvent can be called at any IRQL, not just at PASSIVE_LEVEL.
However, you should regard the statements in the DDK as being tantamount to saying that Microsoft might
someday impose the documented restriction, which is why I haven’t tried to report the true state of affairs.


You can call KeSetEvent to place an event in the signaled state:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);


LONG wassignaled = KeSetEvent(event, boost, wait);

As implied by the ASSERT, you must be running at or below DISPATCH_LEVEL to call this function. The event argument is a
pointer to the event object in question, and boost is a value to be added to a waiting thread’s priority if setting the event results
in satisfying someone’s wait. See the sidebar (“That Pesky Third Argument to KeSetEvent”) for an explanation of the Boolean
wait argument, which a WDM driver would almost never want to specify as TRUE. The return value is nonzero if the event
was already in the signaled state before the call and 0 if the event was in the not-signaled state.
A multitasking scheduler needs to artificially boost the priority of a thread that waits for I/O operations or synchronization
objects in order to avoid starving threads that spend lots of time waiting. This is because a thread that blocks for some reason
generally relinquishes its time slice and won’t regain the CPU until either it has a relatively higher priority than other eligible
threads or other threads that have the same priority finish their time slices. A thread that never blocks, however, gets to
complete its time slices. Unless a boost is applied to the thread that repeatedly blocks, therefore, it will spend a lot of time
waiting for CPU-bound threads to finish their time slices.
You and I won’t always have a good idea of what value to use for a priority boost. A good rule of thumb to follow is to specify
IO_NO_INCREMENT unless you have a good reason not to. If setting the event is going to wake up a thread that’s dealing
with a time-sensitive data flow (such as a sound driver), supply the boost that’s appropriate to that kind of device (such as
IO_SOUND_INCREMENT). The important thing is not to boost the waiter for a silly reason. For example, if you’re trying to
handle an IRP_MJ_PNP request synchronously—see Chapter 6—you’ll be waiting for lower-level drivers to handle the IRP
before you proceed, and your completion routine will be calling KeSetEvent. Since Plug and Play requests have no special
claim on the processor and occur only infrequently, specify IO_NO_INCREMENT, even for a sound card.
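To make that advice concrete, here’s a minimal sketch (the routine and variable names are hypothetical) of a completion routine that signals an event a dispatch routine is waiting on; since the waiter has no special claim on the processor, the boost is IO_NO_INCREMENT:

NTSTATUS OnRequestComplete(PDEVICE_OBJECT fdo, PIRP Irp, PVOID context)
{
    PKEVENT pev = (PKEVENT) context;          // event supplied when the routine was registered
    KeSetEvent(pev, IO_NO_INCREMENT, FALSE);  // wake the waiter; no priority boost
    return STATUS_MORE_PROCESSING_REQUIRED;   // the dispatch routine will finish the IRP
}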

That Pesky Third Argument to KeSetEvent


The purpose of the wait argument to KeSetEvent is to allow internal code to hand off control from one thread
to another very quickly. System components other than device drivers can, for example, create paired event
objects that are used by client and server threads to gate their communication. When the server wants to wake
up its paired client, it will call KeSetEvent with the wait argument set to TRUE and then immediately call
KeWaitXxx to put itself to sleep. The use of wait allows these two operations to be done atomically so that no
other thread can be awakened in between and possibly wrest control from the client and the server.

The DDK has always sort of described what happens internally, but I’ve found the explanation confusing. I’ll try
to explain it in a different way so that you can see why you should always say FALSE for this parameter.
Internally, the kernel uses a dispatcher database lock to guard operations related to thread blocking, waking,
and scheduling. KeSetEvent needs to acquire this lock, and so do the KeWaitXxx routines. If you say TRUE for
the wait argument, KeSetEvent sets a flag so that KeWaitXxx will know you did so, and it returns to you without
releasing this lock. When you turn around and (immediately, please—you’re running at a higher IRQL than
every hardware device, and you own a spin lock that’s very frequently in contention) call KeWaitXxx, it needn’t
acquire the lock all over again. The net effect is that you’ll wake up the waiting thread and put yourself to sleep
without giving any other thread a chance to start running.

You can see, first of all, that a function that calls KeSetEvent with wait set to TRUE has to be in nonpaged
memory because it will execute briefly above DISPATCH_LEVEL. But it’s hard to imagine why an ordinary device
driver would even need to use this mechanism because it would almost never know better than the kernel which
thread ought to be scheduled next. The bottom line: always say FALSE for this parameter. In fact, it’s not clear
why the parameter has even been exposed to tempt us.

You can determine the current state of an event (at any IRQL) by calling KeReadStateEvent:

LONG signaled = KeReadStateEvent(event);

The return value is nonzero if the event is signaled, 0 if it’s not-signaled.

NOTE
KeReadStateEvent isn’t supported in Microsoft Windows 98/Me, even though the other KeReadStateXxx
functions described here are. The absence of support has to do with how events and other synchronization
primitives are implemented in Windows 98/Me.

You can determine the current state of an event and, immediately thereafter, place it in the not-signaled state by calling the
KeResetEvent function (at or below DISPATCH_LEVEL):

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);


LONG signaled = KeResetEvent(event);

If you’re not interested in the previous state of the event, you can save a little time by calling KeClearEvent instead:


ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);


KeClearEvent(event);

KeClearEvent is faster because it doesn’t need to capture the current state of the event before setting it to not-signaled. But
beware of calling KeClearEvent when another thread might be using the same event since there’s no good way to control the
races between you clearing the event and some other thread setting it or waiting on it.

Using a Synchronization Event for Mutual Exclusion


I’ll tell you later in this chapter about two types of mutual exclusion objects—a kernel mutex and an executive fast
mutex—that you can use to limit access to shared data in situations in which a spin lock is inappropriate for some reason.
Sometimes you can simply use a synchronization event for this purpose. First define the event in nonpaged memory, as
follows:

typedef struct _DEVICE_EXTENSION {

KEVENT lock;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

Initialize it as a synchronization event in the signaled state:

KeInitializeEvent(&pdx->lock, SynchronizationEvent, TRUE);

Enter your lightweight critical section by waiting on the event. Leave by setting the event.

KeWaitForSingleObject(&pdx->lock, Executive, KernelMode, FALSE, NULL);

KeSetEvent(&pdx->lock, EVENT_INCREMENT, FALSE);

Use this trick only in a system thread, though, to prevent a user-mode call to NtSuspendThread from creating a
deadlock. (This deadlock can easily happen if a user-mode debugger is running on the same process.) If you’re running
in a user thread, you should prefer to use an executive fast mutex. Don’t use this trick at all for code that executes in the
paging path, as explained later in connection with the “unsafe” way of acquiring an executive fast mutex.
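Put together, the pattern looks something like the following sketch, in which a routine running in a system thread guards a hypothetical counter in the device extension:

VOID UpdateSharedCount(PDEVICE_EXTENSION pdx)
{
    KeWaitForSingleObject(&pdx->lock, Executive, KernelMode, FALSE, NULL);  // enter
    ++pdx->nSharedCount;    // stand-in for whatever shared data the event guards
    KeSetEvent(&pdx->lock, EVENT_INCREMENT, FALSE);                         // leave
}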

4.4.5 Kernel Semaphores


A kernel semaphore is an integer counter with associated synchronization semantics. The semaphore is considered signaled
when the counter is positive and not-signaled when the counter is 0. The counter cannot take on a negative value. Releasing a
semaphore increases the counter, whereas successfully waiting on a semaphore decrements the counter. If the decrement makes
the count 0, the semaphore is then considered not-signaled, with the consequence that other KeWaitXxx callers who insist on
finding it signaled will block. Note that if more threads are waiting for a semaphore than the value of the counter, not all of the
waiting threads will be unblocked.
The kernel provides three service functions to control the state of a semaphore object. (See Table 4-3.) You initialize a
semaphore by making the following function call at PASSIVE_LEVEL:

ASSERT(KeGetCurrentIrql() == PASSIVE_LEVEL);
KeInitializeSemaphore(semaphore, count, limit);

In this call, semaphore points to a KSEMAPHORE object in nonpaged memory. The count variable is the initial value of the
counter, and limit is the maximum value that the counter will be allowed to take on, which must be as large as the initial count.

Service Function Description


KeInitializeSemaphore Initializes semaphore object
KeReadStateSemaphore Determines current state of semaphore
KeReleaseSemaphore Sets semaphore object to the signaled state
Table 4-3. Service Functions for Use with Kernel Semaphore Objects
If you create a semaphore with a limit of 1, the object is somewhat similar to a mutex in that only one thread at a time will be
able to claim it. A kernel mutex has some features that a semaphore lacks, however, to help prevent deadlocks. Accordingly,
there’s almost no point in creating a semaphore with a limit of 1.
If you create a semaphore with a limit bigger than 1, you have an object that allows multiple threads to access a given resource.
A familiar theorem in queuing theory dictates that providing a single queue for multiple servers is more fair (that is, results in
less variation in waiting times) than providing a separate queue for each of several servers. The average waiting time is the
same in both cases, but the variation in waiting times is smaller with the single queue. (This is why queues in stores are
increasingly organized so that customers wait in a single line for the next available clerk.) This kind of semaphore allows you
to organize a set of software or hardware servers to take advantage of that theorem.
The owner (or one of the owners) of a semaphore releases its claim to the semaphore by calling KeReleaseSemaphore:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);


LONG wassignaled = KeReleaseSemaphore(semaphore, boost, delta, wait);

This operation adds delta, which must be positive, to the counter associated with semaphore, thereby putting the semaphore in
the signaled state and allowing other threads to be released. In most cases, you’ll specify 1 for this parameter to indicate that
one claimant of the semaphore is releasing its claim. The boost and wait parameters have the same import as the corresponding
parameters to KeSetEvent, discussed earlier. The return value is 0 if the previous state of the semaphore was not-signaled and
nonzero if the previous state was signaled.
KeReleaseSemaphore doesn’t allow you to increase the counter beyond the limit specified when you initialized the semaphore.
If you try, it doesn’t adjust the counter at all, and it raises an exception with the code
STATUS_SEMAPHORE_LIMIT_EXCEEDED. Unless someone has a structured exception handler to trap the exception, a bug
check will eventuate.
You can also interrogate the current state of a semaphore with this call:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);


LONG signaled = KeReadStateSemaphore(semaphore);

The return value is nonzero if the semaphore is signaled and 0 if the semaphore is not-signaled. You shouldn’t assume that the
return value is the current value of the counter—it could be any nonzero value if the counter is positive.
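As an illustration, a driver that owns, say, four identical staging buffers could let at most four threads claim one at a time. This is only a sketch with made-up names; the KSEMAPHORE would live in nonpaged memory such as the device extension:

KSEMAPHORE BufferSemaphore;

KeInitializeSemaphore(&BufferSemaphore, 4, 4);    // four buffers available, limit of four

KeWaitForSingleObject(&BufferSemaphore, Executive, KernelMode, FALSE, NULL);  // claim one
// ...use one of the buffers...
KeReleaseSemaphore(&BufferSemaphore, IO_NO_INCREMENT, 1, FALSE);              // give it back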
Having told you all this about how to use kernel semaphores, I feel I ought to tell you that I’ve never seen a driver that uses
one of them.

4.4.6 Kernel Mutexes


The word mutex is a contraction of mutual exclusion. A kernel mutex object provides one method (and not necessarily the best
one) to serialize access by competing threads to a given shared resource. The mutex is considered signaled if no thread owns it
and not-signaled if a thread currently does own it. When a thread gains control of a mutex after calling one of the KeWaitXxx
routines, the kernel also prevents delivery of any but special kernel APCs to help avoid possible deadlocks. This is the
operation referred to in the earlier discussion of KeWaitForSingleObject (in the section “Waiting on a Single Dispatcher
Object”).
It’s generally better to use an executive fast mutex rather than a kernel mutex, as I’ll explain in more detail later in “Fast Mutex
Objects.” The main difference between the two is that acquiring a fast mutex raises the IRQL to APC_LEVEL, whereas
acquiring a kernel mutex doesn’t change the IRQL. Among the reasons you care about this fact is that completion of so-called
synchronous IRPs requires delivery of a special kernel-mode APC, which cannot occur if the IRQL is higher than
PASSIVE_LEVEL. Thus, you can create and use synchronous IRPs while owning a kernel mutex but not while owning an
executive fast mutex. Another reason for caring arises for drivers that execute in the paging path, as elaborated later on in
connection with the “unsafe” way of acquiring an executive fast mutex.
Another, less important, difference between the two kinds of mutex object is that a kernel mutex can be acquired recursively,
whereas an executive fast mutex cannot. That is, the owner of a kernel mutex can make a subsequent call to KeWaitXxx
specifying the same mutex and have the wait immediately satisfied. A thread that does this must release the mutex an equal
number of times before the mutex will be considered free.
Table 4-4 lists the service functions you use with mutex objects.

Service Function Description


KeInitializeMutex Initializes mutex object
KeReadStateMutex Determines current state of mutex
KeReleaseMutex Sets mutex object to the signaled state

Table 4-4. Service Functions for Use with Kernel Mutex Objects
To create a mutex, you reserve nonpaged memory for a KMUTEX object and make the following initialization call:

ASSERT(KeGetCurrentIrql() == PASSIVE_LEVEL);
KeInitializeMutex(mutex, level);

where mutex is the address of the KMUTEX object, and level is a parameter originally intended to help avoid deadlocks when
your own code uses more than one mutex. Since the kernel currently ignores the level parameter, I’m not going to attempt to
describe what it used to mean.


The mutex begins life in the signaled—that is, unowned—state. An immediate call to KeWaitXxx would take control of the
mutex and put it in the not-signaled state.
You can interrogate the current state of a mutex with this function call:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);


LONG signaled = KeReadStateMutex(mutex);

The return value is 0 if the mutex is currently owned, nonzero if it’s currently unowned.
The thread that owns a mutex can release ownership and return the mutex to the signaled state with this function call:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);


LONG wassignaled = KeReleaseMutex(mutex, wait);

The wait parameter means the same thing as the corresponding argument to KeSetEvent. The return value is always 0 to
indicate that the mutex was previously owned because, if this were not the case, KeReleaseMutex would have bugchecked (it
being an error for anyone but the owner to release a mutex).
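The complete acquire/release pattern, then, looks like this sketch (the mutex field name is hypothetical):

KeInitializeMutex(&pdx->mutex, 0);    // once, at PASSIVE_LEVEL

KeWaitForSingleObject(&pdx->mutex, Executive, KernelMode, FALSE, NULL);   // acquire
// ...touch the guarded data; a recursive wait by this same thread would succeed...
KeReleaseMutex(&pdx->mutex, FALSE);                                       // release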
Just for the sake of completeness, I want to mention a macro in the DDK named KeWaitForMutexObject. (See WDM.H.) It’s
defined simply as follows:

#define KeWaitForMutexObject KeWaitForSingleObject

Using this special name offers no benefit at all. You don’t even get the benefit of having the compiler insist that the first
argument be a pointer to a KMUTEX instead of any random pointer type.

4.4.7 Kernel Timers


The kernel provides a timer object that functions something like an event that automatically signals itself at a specified
absolute time or after a specified interval. It’s also possible to create a timer that signals itself repeatedly and to arrange for a
DPC callback following the expiration of the timer. Table 4-5 lists the service functions you use with timer objects.

Service Function Description


KeCancelTimer Cancels an active timer
KeInitializeTimer Initializes a one-time notification timer
KeInitializeTimerEx Initializes a one-time or repetitive notification or synchronization timer
KeReadStateTimer Determines current state of a timer
KeSetTimer (Re)specifies expiration time for a notification timer
KeSetTimerEx (Re)specifies expiration time and other properties of a timer
Table 4-5. Service Functions for Use with Kernel Timer Objects
There are several usage scenarios for timers, which I’ll describe in the next few sections:
- Timer used like a self-signaling event
- Timer with a DPC routine to be called when a timer expires
- Periodic timer used to call a DPC routine over and over again

Notification Timers Used like Events


In this scenario, we’ll create a notification timer object and wait until it expires. First allocate a KTIMER object in nonpaged
memory. Then, running at or below DISPATCH_LEVEL, initialize the timer object, as shown here:

PKTIMER timer; // <== someone gives you this


ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
KeInitializeTimer(timer);

At this point, the timer is in the not-signaled state and isn’t counting down—a wait on the timer would never be satisfied. To
start the timer counting, call KeSetTimer as follows:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);


LARGE_INTEGER duetime;
BOOLEAN wascounting = KeSetTimer(timer, duetime, NULL);

The duetime value is a 64-bit time value expressed in 100-nanosecond units. If the value is positive, it’s an absolute time
relative to the same January 1, 1601, epoch used for the system timer. If the value is negative, it’s an interval relative to the
current time. If you specify an absolute time, a subsequent change to the system clock alters the duration of the timeout you
experience. That is, the timer doesn’t expire until the system clock equals or exceeds whatever absolute value you specify. In
contrast, if you specify a relative timeout, the duration of the timeout you experience is unaffected by changes in the system
clock. These are the same rules that apply to the timeout parameter to KeWaitXxx.
The return value from KeSetTimer, if TRUE, indicates that the timer was already counting down (in which case, our call to
KeSetTimer would have canceled it and started the count all over again).
At any time, you can determine the current state of a timer:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);


BOOLEAN counting = KeReadStateTimer(timer);

KeInitializeTimer and KeSetTimer are actually older service functions that have been superseded by newer functions. We could
have initialized the timer with this call:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);


KeInitializeTimerEx(timer, NotificationTimer);

We could also have used the extended version of the set timer function, KeSetTimerEx:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);


LARGE_INTEGER duetime;
BOOLEAN wascounting = KeSetTimerEx(timer, duetime, 0, NULL);

I’ll explain a bit further on in this chapter the purpose of the extra parameters in these extended versions of the service
functions.
Once the timer is counting down, it’s still considered to be not-signaled until the specified due time arrives. At that point, the
object becomes signaled, and all waiting threads are released. The system guarantees only that the expiration of the timer will
be noticed no sooner than the due time you specify. If you specify a due time with a precision finer than the granularity of the
system timer (which you can’t control), the timeout will be noticed later than the exact instant you specify. You can call
KeQueryTimeIncrement to determine the granularity of the system clock.
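As a concrete sketch, here’s how you might start the timer with a 250-millisecond relative due time and then block until it expires. Recall that a negative due time is a relative interval expressed in 100-nanosecond units:

LARGE_INTEGER duetime;
duetime.QuadPart = -250 * 10000LL;    // 250 ms in 100-ns units; negative means relative

KeSetTimer(timer, duetime, NULL);
KeWaitForSingleObject(timer, Executive, KernelMode, FALSE, NULL);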

Notification Timers Used with a DPC


In this scenario, we want expiration of the timer to trigger a DPC. You would choose this method of operation if you wanted to
be sure that you could service the timeout no matter what priority level your thread had. (Since you can wait only below
DISPATCH_LEVEL, regaining control of the CPU after the timer expires is subject to the normal vagaries of thread scheduling.
The DPC, however, executes at elevated IRQL and thereby effectively preempts all threads.)
We initialize the timer object in the same way. We also have to initialize a KDPC object for which we allocate nonpaged
memory. For example:

PKDPC dpc; // <== points to KDPC you've allocated


ASSERT(KeGetCurrentIrql() == PASSIVE_LEVEL);
KeInitializeTimer(timer);
KeInitializeDpc(dpc, DpcRoutine, context);

You can initialize the timer object by using either KeInitializeTimer or KeInitializeTimerEx, as you please. DpcRoutine is the
address of a deferred procedure call routine, which must be in nonpaged memory. The context parameter is an arbitrary 32-bit
value (typed as a PVOID) that will be passed as an argument to the DPC routine. The dpc argument is a pointer to a KDPC
object for which you provide nonpaged storage. (It might be in your device extension, for example.)
When we want to start the timer counting down, we specify the DPC object as one of the arguments to KeSetTimer or
KeSetTimerEx:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);


LARGE_INTEGER duetime;
BOOLEAN wascounting = KeSetTimer(timer, duetime, dpc);

You could also use the extended form KeSetTimerEx if you wanted to. The only difference between this call and the one we
examined in the preceding section is that we’ve specified the DPC object address as an argument. When the timer expires, the
system will queue the DPC for execution as soon as conditions permit. This would be at least as soon as you’d be able to wake
up from a wait. Your DPC routine would have the following skeletal appearance:

VOID DpcRoutine(PKDPC dpc, PVOID context, PVOID junk1, PVOID junk2)
{
    // context is the value you passed to KeInitializeDpc; junk1 and junk2 are the two
    // SystemArgument values, which a timer DPC can simply ignore
}
For what it’s worth, even when you supply a DPC argument to KeSetTimer or KeSetTimerEx, you can still call KeWaitXxx to
wait at PASSIVE_LEVEL or APC_LEVEL if you want. On a single-CPU system, the DPC would occur before the wait could
finish because it executes at a higher IRQL.
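Here’s a sketch that puts the pieces together (the names WatchdogTimer, WatchdogDpc, and WatchdogDpcRoutine are hypothetical): arm a one-shot watchdog that queues a DPC if two seconds elapse without the device responding:

KeInitializeTimer(&pdx->WatchdogTimer);
KeInitializeDpc(&pdx->WatchdogDpc, WatchdogDpcRoutine, pdx);

LARGE_INTEGER duetime;
duetime.QuadPart = -2 * 10000000LL;                            // 2 seconds, relative
KeSetTimer(&pdx->WatchdogTimer, duetime, &pdx->WatchdogDpc);

// If the device responds in time, call KeCancelTimer(&pdx->WatchdogTimer) so the DPC never runs.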

Synchronization Timers
Like event objects, timer objects come in both notification and synchronization flavors. A notification timer allows any number
of waiting threads to proceed once it expires. A synchronization timer, by contrast, allows only a single thread to proceed. Once
a thread’s wait is satisfied, the timer switches to the not-signaled state. To create a synchronization timer, you must use the
extended form of the initialization service function:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);


KeInitializeTimerEx(timer, SynchronizationTimer);

SynchronizationTimer is one of the values of the TIMER_TYPE enumeration. The other value is NotificationTimer.
If you use a DPC with a synchronization timer, think of queuing the DPC as being an extra thing that happens when the timer
expires. That is, expiration puts the timer in the signaled state and queues a DPC. One thread can be released as a result of the
timer being signaled.
The only use I’ve ever found for a synchronization timer is when you want a periodic timer (see the next section).

Periodic Timers
So far, I’ve discussed only timers that expire exactly once. By using the extended set timer function, you can also request a
periodic timeout:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);


LARGE_INTEGER duetime;
BOOLEAN wascounting = KeSetTimerEx(timer, duetime, period, dpc);

Here period is a periodic timeout, expressed in milliseconds (ms), and dpc is an optional pointer to a KDPC object. A timer of
this kind expires once at the due time and periodically thereafter. To achieve exact periodic expiration, specify the same
relative due time as the interval. Specifying a zero due time causes the timer to immediately expire, whereupon the periodic
behavior takes over. It often makes sense to start a periodic timer in conjunction with a DPC object, by the way, because doing
so allows you to be notified without having to repeatedly wait for the timeout.
Be sure to call KeCancelTimer to cancel a periodic timer before the KTIMER object or the DPC routine disappears
from memory. It’s quite embarrassing to let the system unload your driver and, 10 nanoseconds later, call your
nonexistent DPC routine. Not only that, but it causes a bug check. These problems are so hard to debug that the Driver
Verifier makes a special check for releasing memory that contains an active KTIMER.

An Example
One use for kernel timers is to conduct a polling loop in a system thread dedicated to the task of repeatedly checking a device
for activity. Not many devices nowadays need to be served by a polling loop, but yours may be one of the few exceptions. I’ll
discuss this subject in Chapter 14, and the companion content includes a sample driver (POLLING) that illustrates all of the
concepts involved. Part of that sample is the following loop that polls the device at fixed intervals. The logic of the driver is
such that the loop can be broken by setting a kill event. Consequently, the driver uses KeWaitForMultipleObjects. The code is
actually a bit more complicated than the following fragment, which I’ve edited to concentrate on the part related to the timer:

VOID PollingThreadRoutine(PDEVICE_EXTENSION pdx)


{
NTSTATUS status;
KTIMER timer;

KeInitializeTimerEx(&timer, SynchronizationTimer);

PVOID pollevents[] = {
(PVOID) &pdx->evKill,
(PVOID) &timer,
};
C_ASSERT(arraysize(pollevents) <= THREAD_WAIT_OBJECTS);

LARGE_INTEGER duetime = {0};


#define POLLING_INTERVAL 500

KeSetTimerEx(&timer, duetime, POLLING_INTERVAL, NULL);


while (TRUE)
{

status = KeWaitForMultipleObjects(arraysize(pollevents),
pollevents, WaitAny, Executive, KernelMode, FALSE,
NULL, NULL);
if (status == STATUS_WAIT_0)
break;

if (<device needs attention>)


<do something>;
}
KeCancelTimer(&timer);
PsTerminateSystemThread(STATUS_SUCCESS);
}

1. Here we initialize a kernel timer. You must specify a SynchronizationTimer here, because a NotificationTimer stays in the
signaled state after the first expiration.
2. We’ll need to supply an array of dispatcher object pointers as one of the arguments to KeWaitForMultipleObjects, and
this is where we set that up. The first element of the array is the kill event that some other part of the driver might set
when it’s time for this system thread to exit. The second element is the timer object. The C_ASSERT statement that
follows this array verifies that we have few enough objects in our array that we can implicitly use the default array of
wait blocks in our thread object.
3. The KeSetTimerEx statement starts a periodic timer running. The duetime is 0, so the timer goes immediately into the
signaled state. It will expire every 500 ms thereafter.
4. Within our polling loop, we wait for the timer to expire or for the kill event to be set. If the wait terminates because of the
kill event, we leave the loop, clean up, and exit this system thread. If the wait terminates because the timer has expired,
we go on to the next step.
5. This is where our device driver would do something related to our hardware.

Alternatives to Kernel Timers


Rather than use a kernel timer object, you can use two other timing functions that might be more appropriate. First of all, you
can call KeDelayExecutionThread to wait at PASSIVE_LEVEL for a given interval. This function is obviously less cumbersome
than creating, initializing, setting, and awaiting a timer by using separate function calls.

ASSERT(KeGetCurrentIrql() == PASSIVE_LEVEL);
LARGE_INTEGER duetime;
NTSTATUS status = KeDelayExecutionThread(WaitMode, Alertable, &duetime);

Here WaitMode, Alertable, and the returned status code have the same meaning as the corresponding parameters to KeWaitXxx,
and duetime is the same kind of timestamp that I discussed previously in connection with kernel timers. Note that this function
requires a pointer to a large integer for the timeout parameter, whereas other functions related to timers require the large
integer itself.
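For instance, a small helper like the following sketch (the name is mine) delays the current thread for a given number of milliseconds by building the relative, negative interval itself:

VOID SleepMilliseconds(ULONG ms)
{
    LARGE_INTEGER duetime;
    duetime.QuadPart = -(LONGLONG) ms * 10000;    // relative interval in 100-ns units
    KeDelayExecutionThread(KernelMode, FALSE, &duetime);
}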
If your requirement is to delay for a very brief period of time (less than 50 microseconds), you can call
KeStallExecutionProcessor at any IRQL:

KeStallExecutionProcessor(nMicroSeconds);

The purpose of this delay is to allow your hardware time to prepare for its next operation before your program continues
executing. The delay might end up being significantly longer than you request because KeStallExecutionProcessor can be
preempted by activities that occur at a higher IRQL than that which the caller is using.

4.4.8 Using Threads for Synchronization


The Process Structure component of the operating system provides a few routines that WDM drivers can use for creating and
controlling system threads. I’ll be discussing these routines later on in Chapter 14 from the perspective of how you can use
these functions to help you manage a device that requires periodic polling. For the sake of thoroughness, I want to mention
here that you can use a pointer to a kernel thread object in a call to KeWaitXxx to wait for the thread to complete. The thread
terminates itself by calling PsTerminateSystemThread.
Before you can wait for a thread to terminate, you need to first obtain a pointer to the opaque KTHREAD object that internally
represents that thread, which poses a bit of a problem. While running in the context of a thread, you can determine your own
KTHREAD easily:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);


PKTHREAD thread = KeGetCurrentThread();

Unfortunately, when you call PsCreateSystemThread to create a new thread, you get back only an opaque HANDLE for the
thread. To get the KTHREAD pointer, you use an Object Manager service function:

HANDLE hthread;
PKTHREAD thread;
PsCreateSystemThread(&hthread, ...);
ObReferenceObjectByHandle(hthread, THREAD_ALL_ACCESS,
NULL, KernelMode, (PVOID*) &thread, NULL);
ZwClose(hthread);

ObReferenceObjectByHandle converts your handle to a pointer to the underlying kernel object. Once you have the pointer, you
can discard the handle by calling ZwClose. At some point, you need to release your reference to the thread object by making a
call to ObDereferenceObject:

ObDereferenceObject(thread);
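A typical sequence, sketched here, waits for the thread to run to completion (that is, to call PsTerminateSystemThread) and then drops the reference:

KeWaitForSingleObject(thread, Executive, KernelMode, FALSE, NULL);
ObDereferenceObject(thread);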

4.4.9 Thread Alerts and APCs


Internally, the Windows NT kernel uses thread alerts as a way of waking threads. It uses an asynchronous procedure call as a
way of waking a thread to execute some particular subroutine in that thread’s context. The support routines that generate alerts
or APCs aren’t exposed for use by WDM driver writers. But since the DDK documentation and header files contain a great
many references to these concepts, I want to finish this discussion of kernel dispatcher objects by explaining them.
I’ll start by describing the “plumbing”—how these two mechanisms work. When someone blocks a thread by calling one of
the KeWaitXxx routines, they specify by means of a Boolean argument whether the wait is to be alertable. An alertable wait
might finish early—that is, without any of the wait conditions or the timeout being satisfied—because of a thread alert. Thread
alerts originate in user mode when someone calls the native API function NtAlertThread. The kernel returns the special status
value STATUS_ALERTED when a wait terminates early because of an alert.
An APC is a mechanism whereby the operating system can execute a function in the context of a particular thread. The
asynchronous part of an APC stems from the fact that the system effectively interrupts the target thread to execute an
out-of-line subroutine.
APCs come in three flavors: user mode, kernel mode, and special kernel mode. User-mode code requests a user-mode APC by
calling the Win32 API QueueUserAPC. Kernel-mode code requests an APC by calling an undocumented function for which
the DDK headers have no prototype. Diligent reverse engineers probably already know the name of this routine and something
about how to call it, but it’s really just for internal use and I’m not going to say any more about it. The system queues APCs to
a specific thread until appropriate execution conditions exist. Appropriate execution conditions depend on the type of APC, as
follows:
- Special kernel APCs execute as soon as possible—that is, as soon as an activity at APC_LEVEL can be scheduled in the
thread. A special kernel APC can even temporarily awaken a blocked thread in many circumstances.
- Normal kernel APCs execute after all special APCs have been executed but only when the target thread is running and no
other kernel-mode APC is executing in this thread. Delivery of normal kernel and user-mode APCs can be blocked by
calling KeEnterCriticalRegion.
- User-mode APCs execute after both flavors of kernel-mode APC for the target thread have been executed but only if the
thread has previously been in an alertable wait in user mode. Execution actually occurs the next time the thread is
dispatched for execution in user mode.
If the system awakens a thread to deliver an APC, the wait primitive on which the thread was previously blocked returns one of
the special status values STATUS_KERNEL_APC or STATUS_USER_APC.

The Strange Role of APC_LEVEL


The IRQ level named APC_LEVEL works in a way that I found to be unexpected. You’re allowed to block a thread running at
APC_LEVEL (or at PASSIVE_LEVEL, but we’re concerned only with APC_LEVEL right now). An APC_LEVEL thread can
also be interrupted by any hardware device, following which a higher-priority thread might become eligible to run. In either
situation, the thread scheduler can then give control of the CPU to another thread, which might be running at PASSIVE_LEVEL
or APC_LEVEL. In effect, the IRQL levels PASSIVE_LEVEL and APC_LEVEL pertain to a thread, whereas the higher IRQLs
pertain to a CPU.

How APCs Work with I/O Requests


The kernel uses the APC concept for several purposes. We’re concerned in this book just with writing device drivers, though,
so I’m only going to explain how APCs relate to the process of performing an I/O operation. In one of many possible scenarios,
when a user-mode program performs a synchronous ReadFile operation on a handle, the Win32 subsystem calls a kernel-mode
routine named NtReadFile. NtReadFile creates and submits an IRP to the appropriate device driver, which often returns
STATUS_PENDING to indicate that it hasn’t finished the operation. NtReadFile returns this status code to ReadFile, which
thereupon calls NtWaitForSingleObject to wait on the file object to which the user-mode handle points. NtWaitForSingleObject,
in turn, calls KeWaitForSingleObject to perform a nonalertable user-mode wait on an event object within the file object.
When the device driver eventually finishes the read operation, it calls IoCompleteRequest, which queues a special kernel-mode
APC. The APC routine calls KeSetEvent to signal the file object, thereby releasing the application to continue execution. Some
sort of APC is required because some of the tasks that need to be performed when an I/O request is completed (such as buffer
copying) must occur in the address context of the requesting thread. A kernel-mode APC is required because the thread in
question is not in an alertable wait state. A special APC is required because the thread is actually ineligible to run at the time
we need to deliver the APC. In fact, the APC routine is the mechanism for awakening the thread.
Kernel-mode routines can call ZwReadFile, which turns into a call to NtReadFile. If you obey the injunctions in the DDK
documentation when you call ZwReadFile, your call to NtReadFile will look almost like a user-mode call and will be
processed in almost the same way, with just two differences. The first, which is quite minor, is that any waiting will be done in
kernel mode. The other difference is that if you specified in your call to ZwCreateFile that you wanted to do synchronous
operations, the I/O Manager will automatically wait for your read to finish. The wait will be alertable or not, depending on the
exact option you specify to ZwCreateFile.

How to Specify Alertable and WaitMode Parameters


Now you have enough background to understand the ramifications of the Alertable and WaitMode parameters in the calls to the
various wait primitives. As a general rule, you’ll never be writing code that responds synchronously to requests from user
mode. You could do so for, say, certain I/O control requests. Generally speaking, however, it’s better to pend any operations
that take a long time to finish (by returning STATUS_PENDING from your dispatch routine) and to finish them asynchronously.
So, to continue speaking generally, you don’t often call a wait primitive in the first place. Thread blocking is appropriate in a
device driver in only a few scenarios, which I’ll describe in the following sections.

Kernel Threads
Sometimes you’ll create your own kernel-mode thread—when your device needs to be polled periodically, for example. In this
scenario, any waits performed will be in kernel mode because the thread runs exclusively in kernel mode.

Handling Plug and Play Requests


I’ll show you in Chapter 6 how to handle the I/O requests that the PnP Manager sends your way. Several such requests require
synchronous handling on your part. In other words, you pass them down the driver stack to lower levels and wait for them to
complete. You’ll be calling KeWaitForSingleObject to wait in kernel mode because the PnP Manager calls you within the
context of a kernel-mode thread. In addition, if you need to perform subsidiary requests as part of handling a PnP request—for
example, to talk to a universal serial bus (USB) device—you’ll be waiting in kernel mode.

Handling Other I/O Requests


When you’re handling other sorts of I/O requests and you know that you’re running in the context of a nonarbitrary thread that
must get the results of your deliberations before proceeding, it might conceivably be appropriate to block that thread by calling
a wait primitive. In such a case, you want to wait in the same processor mode as the entity that called you. Most of the time,
you can simply rely on the RequestorMode in the IRP you’re currently processing. If you gained control by means other than
an IRP, you could call ExGetPreviousMode to determine the previous processor mode. If you’re going to wait for a long time,
it would be well to use the result of these tests as the WaitMode argument in your KeWaitXxx call, and it would also be well to
specify TRUE for the Alertable argument.
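A sketch of that advice, using a hypothetical event in the device extension, might look like this:

NTSTATUS status = KeWaitForSingleObject(&pdx->ResultReady,    // hypothetical event
    UserRequest, Irp->RequestorMode, TRUE, NULL);
if (status == STATUS_ALERTED || status == STATUS_USER_APC)
    status = STATUS_CANCELLED;    // the wait was cut short; fail the request gracefully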

NOTE
The bottom line: perform nonalertable waits unless you know you shouldn’t.


4.5 Other Kernel-Mode Synchronization Primitives


The Windows XP kernel offers some additional methods for synchronizing execution between threads or for guarding access to
shared objects. In this section, I’ll discuss the fast mutex, which is a mutual exclusion object that offers faster performance than
a kernel mutex because it’s optimized for the case in which no contention is actually occurring. I’ll also describe the category
of support functions that include the word Interlocked somewhere in their name. These functions carry out certain common
operations—such as incrementing or decrementing an integer or inserting or removing an entry from a linked list—in an
atomic way that prevents multitasking or multiprocessing interference.

4.5.1 Fast Mutex Objects


An executive fast mutex provides an alternative to a kernel mutex for protecting a critical section of code. Table 4-6
summarizes the service functions you use to work with this kind of object.

Service Function Description


ExAcquireFastMutex Acquires ownership of mutex, waiting if necessary
ExAcquireFastMutexUnsafe Acquires ownership of mutex, waiting if necessary, when the caller has already disabled receipt of APCs
ExInitializeFastMutex Initializes mutex object
ExReleaseFastMutex Releases mutex
ExReleaseFastMutexUnsafe Releases mutex without reenabling APC delivery
ExTryToAcquireFastMutex Acquires mutex if possible to do so without waiting

Table 4-6. Service Functions for Use with Executive Fast Mutexes
Compared with kernel mutexes, fast mutexes have the strengths and weaknesses summarized in Table 4-7. On the plus side, a
fast mutex is much faster to acquire and release if there’s no actual contention for it. On the minus side, a thread that acquires a
fast mutex will not be able to receive certain types of asynchronous procedure call, depending on exactly which functions you
call, and this constrains how you send IRPs to other drivers.

Kernel Mutex                                            Fast Mutex

Can be acquired recursively by a single thread          Cannot be acquired recursively
(system maintains a claim counter)
Relatively slower                                       Relatively faster
Owner will receive only “special” kernel APCs           Owner won’t receive any APCs unless you use the
                                                        XxxUnsafe functions
Can be part of a multiple-object wait                   Cannot be used as an argument to
                                                        KeWaitForMultipleObjects

Table 4-7. Comparison of Kernel and Fast Mutex Objects


Incidentally, the DDK documentation about kernel mutex objects has long said that the kernel gives a priority boost to a thread
that claims a mutex. I’m reliably informed that this hasn’t actually been true since 1992 (the year, that is, not the Windows
build number). The documentation has also long said that a thread holding a mutex can’t be removed from the balance set (that
is, subjected to having all of its pages moved out of physical memory). This was true when Windows NT was young but hasn’t
been true for a long time.
To create a fast mutex, you must first allocate a FAST_MUTEX data structure in nonpaged memory. Then you initialize the
object by “calling” ExInitializeFastMutex, which is really a macro in WDM.H:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);


ExInitializeFastMutex(FastMutex);

where FastMutex is the address of your FAST_MUTEX object. The mutex begins life in the unowned state. To acquire
ownership later on, call one of these functions:

ASSERT(KeGetCurrentIrql() < DISPATCH_LEVEL);


ExAcquireFastMutex(FastMutex);

or

ASSERT(KeGetCurrentIrql() < DISPATCH_LEVEL);


ExAcquireFastMutexUnsafe(FastMutex);

The first of these functions waits for the mutex to become available, assigns ownership to the calling thread, and then raises the
current processor IRQL to APC_LEVEL. Raising the IRQL has the effect of blocking delivery of all APCs. The second of these
functions doesn’t change the IRQL.
You need to think about potential deadlocks if you use the “unsafe” function to acquire a fast mutex. A situation to
avoid is allowing user-mode code to suspend a thread in which you hold a mutex. That would deadlock other threads
that need the mutex. For this reason, the DDK recommends (and the Driver Verifier requires) that you forestall the
delivery of user-mode and normal kernel-mode APCs either by raising the IRQL to APC_LEVEL or by calling
KeEnterCriticalRegion before ExAcquireFastMutexUnsafe. (Thread suspension involves an APC, so user-mode code
can’t suspend your thread if you disallow user-mode APCs. Yes, I know the reasoning here is a bit of a stretch!)
Another possible deadlock can arise with a driver in the paging path—in other words, a driver that gets called to help the
memory manager process a page fault. Suppose you simply call KeEnterCriticalRegion and then ExAcquireFastMutexUnsafe.
Now suppose the system tries to execute a special kernel-mode APC in the same thread, which is possible because
KeEnterCriticalRegion doesn’t forestall special kernel APCs. The APC routine might page fault, which might then lead to you
being reentered and deadlocking on a second attempt to claim the same mutex. You avoid this situation by raising IRQL to
APC_LEVEL before acquiring the mutex in the first place or, more simply, by using ExAcquireFastMutex instead of
ExAcquireFastMutexUnsafe. The same problem can arise if you use a regular KMUTEX or a synchronization event, of course.

IMPORTANT
If you use ExAcquireFastMutex, you will be at APC_LEVEL. This means you can’t create any synchronous IRPs.
(The routines that do this must be called at PASSIVE_LEVEL.) Furthermore, you’ll deadlock if you try to wait for
a synchronous IRP to complete (because completion requires executing an APC, which can’t happen because of
the IRQL). In Chapter 5, I’ll discuss how to use asynchronous IRPs to work around this problem.

If you don’t want to wait if the mutex isn’t immediately available, use the “try to acquire” function:

ASSERT(KeGetCurrentIrql() < DISPATCH_LEVEL);


BOOLEAN acquired = ExTryToAcquireFastMutex(FastMutex);

If the return value is TRUE, you now own the mutex. If it’s FALSE, someone else owns the mutex and has prevented you from
acquiring it.
To release control of a fast mutex and allow some other thread to claim it, call the release function corresponding to the way
you acquired the fast mutex:

ASSERT(KeGetCurrentIrql() < DISPATCH_LEVEL);


ExReleaseFastMutex(FastMutex);

or

ASSERT(KeGetCurrentIrql() < DISPATCH_LEVEL);


ExReleaseFastMutexUnsafe(FastMutex);

A fast mutex is fast because the acquisition and release steps are optimized for the usual case in which there’s no contention for
the mutex. The critical step in acquiring the mutex is to atomically decrement and test an integer counter that indicates how
many threads either own or are waiting for the mutex. If the test indicates that no other thread owns the mutex, no additional
work is required. If the test indicates that another thread does own the mutex, the current thread blocks on a synchronization
event that’s part of the FAST_MUTEX object. Releasing the mutex entails atomically incrementing and testing the counter. If
the test indicates that no thread is currently waiting, no additional work is required. If another thread is waiting, however, the
owner calls KeSetEvent to release one of the waiters.
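To make the usage concrete, here’s a minimal sketch (with hypothetical field names) of a fast mutex guarding a list in the device extension. Remember that the thread is at APC_LEVEL between the acquire and the release, so don’t create or wait for synchronous IRPs inside the guarded region:

ExInitializeFastMutex(&pdx->listlock);             // once, during initialization

ExAcquireFastMutex(&pdx->listlock);                // raises IRQL to APC_LEVEL
InsertTailList(&pdx->PendingList, &entry->link);   // entry is whatever item you're queuing
ExReleaseFastMutex(&pdx->listlock);                // restores the previous IRQL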

Note on Deadlock Prevention


Whenever you use synchronization objects such as spin locks, fast mutexes, and so on, in a driver, you should
be on the lookout for potential deadlocks. We’ve already talked about two deadlock issues: trying to acquire a
spin lock that you already hold and trying to claim a fast mutex or synchronization event with APCs enabled in
the thread. This sidebar concerns a more insidious potential deadlock that can arise when your driver uses more
than one synchronization object.

Suppose there are two synchronization objects, A and B. It doesn’t matter what types of objects these are, and
they needn’t even be the same type. Now suppose we have two subroutines—I’ll call them Fred and Barney just
so I have names to work with. Subroutine Fred claims object A followed by object B. Subroutine Barney claims
B followed by A. This sets up a potential deadlock if Fred and Barney can be simultaneously active or if a thread
running one of those routines can be preempted by a thread running the other routine.

The deadlock arises, as you probably remember from studying this sort of thing in school, when two threads
manage to execute Fred and Barney at about the same time. The Fred thread gets object A, while the Barney
thread gets object B. Fred now tries to get object B, but can’t have it (Barney has it). Barney, on the other hand,
now tries to get object A, but can’t have it (Fred has it). Both threads are now deadlocked, waiting for the other
one to release the object each needs.


The easiest way to prevent this kind of deadlock is to always acquire objects such as A and B in the same order,
everywhere. The order in which you decide to acquire a set of resources is called the locking hierarchy. There
are other schemes, which involve conditional attempts to acquire resources combined with back-out loops, but
these are much harder to implement.
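As a sketch of the first approach (here I’ve arbitrarily made A and B fast mutexes in the device extension), both routines claim A before B, so neither can end up holding the object the other needs:

VOID Fred(PDEVICE_EXTENSION pdx)
{
    ExAcquireFastMutex(&pdx->A);       // A always comes first...
    ExAcquireFastMutex(&pdx->B);       // ...then B
    // ...work with the data both objects guard...
    ExReleaseFastMutex(&pdx->B);
    ExReleaseFastMutex(&pdx->A);
}

VOID Barney(PDEVICE_EXTENSION pdx)
{
    ExAcquireFastMutex(&pdx->A);       // same order as Fred, so no deadlock is possible
    ExAcquireFastMutex(&pdx->B);
    // ...
    ExReleaseFastMutex(&pdx->B);
    ExReleaseFastMutex(&pdx->A);
}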

If you engage the Deadlock Detection option, the Driver Verifier will look for potential deadlocks resulting from
locking hierarchy violations involving spin locks, kernel mutexes, and executive fast mutexes.

The DDK documents another synchronization primitive that I didn’t discuss in this chapter: an ERESOURCE. File
system drivers use ERESOURCE objects extensively because they allow for shared and exclusive ownership.
Because file system drivers often have to use complex locking logic, the Driver Verifier doesn’t check the locking
hierarchy for an ERESOURCE.

4.5.2 Interlocked Arithmetic


You can call several service functions in a WDM driver to perform arithmetic in a way that’s thread-safe and
multiprocessor-safe. (See Table 4-8.) These routines come in two flavors. The first type of routine has a name beginning with
Interlocked and performs an atomic operation in such a way that no other thread or CPU can interfere. The other flavor has a
name beginning with ExInterlocked and uses a spin lock.

Service Function Description


InterlockedCompareExchange Compares and conditionally exchanges
InterlockedDecrement Subtracts 1 from an integer
InterlockedExchange Exchanges two values
InterlockedExchangeAdd Adds two values and returns sum
InterlockedIncrement Adds 1 to an integer
InterlockedOr ORs bits into an integer
InterlockedAnd ANDs bits into an integer
InterlockedXor Exclusive-ORs bits into an integer
ExInterlockedAddLargeInteger Adds value to 64-bit integer
ExInterlockedAddLargeStatistic Adds ULONG value to a LARGE_INTEGER
ExInterlockedAddUlong Adds value to ULONG and returns initial value
ExInterlockedCompareExchange64 Exchanges two 64-bit values
Table 4-8. Service Functions for Interlocked Arithmetic
The InterlockedXxx functions can be called at any IRQL; they can also handle pageable data at PASSIVE_LEVEL because they
don’t require a spin lock. Although the ExInterlockedXxx routines can be called at any IRQL, they operate on the target data at
or above DISPATCH_LEVEL and therefore require a nonpaged argument. The only reason to use an ExInterlockedXxx function
is if you have a data variable that you sometimes need to increment or decrement and sometimes need to access throughout
some series of instructions. You would explicitly claim the spin lock around the multi-instruction accesses and use the
ExInterlockedXxx function to perform the simple increments or decrements.
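Here’s a sketch of that division of labor, with hypothetical field names: simple bumps of a counter go through ExInterlockedAddUlong, while a multi-instruction inspection of the same counter explicitly claims the same spin lock:

ExInterlockedAddUlong(&pdx->nPending, 1, &pdx->CountLock);    // simple atomic bump

KIRQL oldirql;
KeAcquireSpinLock(&pdx->CountLock, &oldirql);
if (pdx->nPending > pdx->HighWater)        // several statements need a consistent view
    pdx->HighWater = pdx->nPending;
KeReleaseSpinLock(&pdx->CountLock, oldirql);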

InterlockedXxx Functions
InterlockedIncrement adds 1 to a long integer in memory and returns the postincrement value to you:

LONG result = InterlockedIncrement(pLong);

where pLong is the address of a variable typed as a LONG (that is, a long integer). Conceptually, the operation of the function
is equivalent to the statement return ++*pLong in C, but the implementation differs from that simple statement in order to
provide thread safety and multiprocessor safety. InterlockedIncrement guarantees that the integer is successfully incremented
even if code on other CPUs or in other eligible threads on the same CPU is simultaneously trying to alter the same variable. In
the nature of the operation, InterlockedIncrement cannot guarantee that the value it returns is still the value of the variable even
one machine cycle later because other threads or CPUs will be able to modify the variable as soon as the atomic increment
operation completes.
InterlockedDecrement is similar to InterlockedIncrement, but it subtracts 1 from the target variable and returns the
postdecrement value, just like the C statement return --*pLong but with thread safety and multiprocessor safety.

LONG result = InterlockedDecrement(pLong);

You call InterlockedCompareExchange like this:


LONG target;
LONG result = InterlockedCompareExchange(&target, newval, oldval);

Here target is a long integer used both as input and output to the function, oldval is your guess about the current contents of the
target, and newval is the new value that you want installed in the target if your guess is correct. The function performs an
operation similar to that indicated in the following C code but does so via an atomic operation that’s both thread-safe and
multiprocessor-safe:

LONG CompareExchange(PLONG ptarget, LONG newval, LONG oldval)


{
LONG value = *ptarget;
if (value == oldval)
*ptarget = newval;
return value;
}

In other words, the function always returns the previous value of the target variable to you. In addition, if that previous value
equals oldval, it sets the target equal to the newval you specify. The function uses an atomic operation to do the compare and
exchange so that the replacement happens only if you’re correct in your guess about the previous contents.
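A typical use, sketched here, is a lock-free update loop that recomputes a new value from the old one and retries until no other thread has changed the variable in the meantime (counter stands for any shared LONG in nonpaged memory):

LONG oldval, newval;
do
{
    oldval = counter;            // capture the current value
    newval = oldval * 2 + 1;     // any computation based on the captured value
} while (InterlockedCompareExchange(&counter, newval, oldval) != oldval);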
You can also call the InterlockedCompareExchangePointer function to perform a similar sort of compare-and-exchange
operation with a pointer. This function is defined either as a compiler-intrinsic (that is, a function for which the compiler
supplies an inline implementation) or a real function call, depending on how wide pointers are on the platform for which
you’re compiling and on the ability of the compiler to generate inline code.
The last function in this class is InterlockedExchange, which simply uses an atomic operation to replace the value of an integer
variable and to return the previous value:

LONG value;
LONG oldval = InterlockedExchange(&value, newval);

As you might have guessed, there’s also an InterlockedExchangePointer that exchanges a pointer value (64-bit or 32-bit,
depending on the platform). Be sure to cast the target of the exchange operation to avoid a compiler error when building 64-bit
drivers:

PIRP Irp = (PIRP) InterlockedExchangePointer( (PVOID*) &foo, NULL);

InterlockedOr, InterlockedAnd and InterlockedXor are new with the XP DDK. You can use them in drivers that will run on
earlier Windows versions because they’re actually implemented as compiler-intrinsic functions.

Interlocked Fetches and Stores?


A frequently asked question is how to do simple fetch-and-store operations on data that’s otherwise being accessed by
InterlockedXxx functions. You don’t have to do anything special to fetch a self-consistent value from a variable that other
people are modifying with interlocked operations so long as the data in question is aligned on a natural address boundary. Data
so aligned cannot cross a memory cache boundary, and the memory controller will always update a cache-sized memory block
atomically. Thus, if someone is updating a variable at about the same time you’re trying to read it, you’ll get either the
preupdate or the postupdate value but never anything in between.
For store operations, however, the answer is more complex. Suppose you write the following sort of code to guard access to
some shared data:

if (InterlockedExchange(&lock, 42) == 0)
{
sharedthing++;
lock = 0; // <== don't do this
}

This code will work fine on an Intel x86 computer, where every CPU sees memory writes in the same order. On another type
of CPU, though, there could be a problem. One CPU might actually change the memory variable lock to 0 before updating
memory for the increment statement. That behavior could allow two CPUs to simultaneously access sharedthing. This problem
could happen because of the way the CPU performs operations in parallel or because of quirks in the memory controller.
Consequently, you should rework the code to use an interlocked operation for both changes to lock:

if (InterlockedExchange(&lock, 42) == 0)
{
sharedthing++;
InterlockedExchange(&lock, 0);
}

ExInterlockedXxx Functions
Each of the ExInterlockedXxx functions requires that you create and initialize a spin lock before you call it. Note that the
operands of these functions must all be in nonpaged memory because the functions operate on the data at elevated IRQL.

ExInterlockedAddLargeInteger adds two 64-bit integers and returns the previous value of the target:

LARGE_INTEGER value, increment;
KSPIN_LOCK spinlock;
LARGE_INTEGER prev = ExInterlockedAddLargeInteger(&value, increment, &spinlock);

Value is the target of the addition and one of the operands. Increment is an integer operand that’s added to the target. Spinlock
is a spin lock that you previously initialized. The return value is the target’s value before the addition. In other words, the
operation of this function is similar to the following function except that it occurs under protection of the spin lock:

__int64 AddLargeInteger(__int64* pvalue, __int64 increment)


{
__int64 prev = *pvalue;
*pvalue += increment;
return prev;
}

Note that the return value is the preaddition value, which contrasts with the postincrement return from InterlockedExchange
and similar functions. (Also, not all compilers support the __int64 integer data type, and not all computers can perform a 64-bit
addition operation using atomic instructions.)
ExInterlockedAddUlong is analogous to ExInterlockedAddLargeInteger except that it works with 32-bit unsigned integers:

ULONG value, increment;


KSPIN_LOCK spinlock;
ULONG prev = ExInterlockedAddUlong(&value, increment, &spinlock);

This function likewise returns the preaddition value of the target of the operation.
ExInterlockedAddLargeStatistic is similar to ExInterlockedAddUlong except that it adds a 32-bit value to a 64-bit value:

VOID ExInterlockedAddLargeStatistic(PLARGE_INTEGER Addend, ULONG Increment);

This new function is faster than ExInterlockedAddUlong because it doesn’t need to return the preincrement value of the
Addend variable. It therefore doesn’t need to employ a spin lock for synchronization. The atomicity provided by this function
is, however, only with respect to other callers of the same function. In other words, if you had code on one CPU calling
ExInterlockedAddLargeStatistic at the same time as code on another CPU was accessing the Addend variable for either reading
or writing, you could get inconsistent results. I can explain why this is so by showing you this paraphrase of the Intel x86
implementation of the function (not the actual source code):

mov eax, Addend


mov ecx, Increment
lock add [eax], ecx
lock adc [eax+4], 0

This code works correctly for purposes of incrementing the Addend because the lock prefixes guarantee atomicity of each
addition operation and because no carries from the low-order 32 bits can ever get lost. The instantaneous value of the 64-bit
Addend isn’t always consistent, however, because an incrementer might be poised between the ADD and the ADC just at the
instant someone makes a copy of the complete 64-bit value. Therefore, even a caller of ExInterlockedCompareExchange64 on
another CPU could obtain an inconsistent value.

4.5.3 Interlocked List Access


The Windows NT executive offers three sets of support functions for dealing with linked lists in a thread-safe and
multiprocessor-safe way. These functions support doubly-linked lists, singly-linked lists, and a special kind of singly-linked list
called an S-List. I discussed noninterlocked doubly-linked and singly-linked lists in the preceding chapter. To close this chapter
on synchronization within WDM drivers, I’ll explain how to use these interlocked accessing primitives.
If you need the functionality of a FIFO queue, you should use a doubly-linked list. If you need the functionality of a
thread-safe and multiprocessor-safe pushdown stack, you should use an S-List. In both cases, to achieve thread safety and
multiprocessor safety, you will allocate and initialize a spin lock. The S-List might not actually use the spin lock, however,
because the presence of a sequence number might allow the kernel to implement it using just atomic compare-exchange sorts
of operations.
The support functions for performing interlocked access to list objects are similar, so I’ve organized this section along
functional lines. I’ll explain how to initialize all three kinds of lists. Then I’ll explain how to insert an item into all three kinds.
After that, I’ll explain how to remove items.

Initialization
You can initialize these lists as shown here:

LIST_ENTRY DoubleHead;
SINGLE_LIST_ENTRY SingleHead;
SLIST_HEADER SListHead;

InitializeListHead(&DoubleHead);

SingleHead.Next = NULL;

ExInitializeSListHead(&SListHead);

Don’t forget that you must also allocate and initialize a spin lock for each list. Furthermore, the storage for the list heads and
all the items you put into the lists must come from nonpaged memory because the support routines perform their accesses at
elevated IRQL. Note that the spin lock isn’t used during initialization of the list head because it doesn’t make any sense to
allow contention for list access before the list has been initialized.
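
For completeness, here’s a sketch of the corresponding spin lock setup, using hypothetical names that parallel the list heads above; you’d normally do this in a place such as AddDevice, before any code can touch the lists:

KSPIN_LOCK DoubleLock, SingleLock, SListLock;

KeInitializeSpinLock(&DoubleLock);   // one lock per list
KeInitializeSpinLock(&SingleLock);
KeInitializeSpinLock(&SListLock);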

Inserting Items
You can insert items at the head and tail of a doubly-linked list and at the head (only) of a singly-linked list or an S-List:

PLIST_ENTRY pdElement, pdPrevHead, pdPrevTail;


PSINGLE_LIST_ENTRY psElement, psPrevHead;
PKSPIN_LOCK spinlock;

pdPrevHead = ExInterlockedInsertHeadList(&DoubleHead, pdElement, spinlock);


pdPrevTail = ExInterlockedInsertTailList(&DoubleHead, pdElement, spinlock);

psPrevHead = ExInterlockedPushEntryList(&SingleHead, psElement, spinlock);

psPrevHead = ExInterlockedPushEntrySList(&SListHead, psElement, spinlock);

The return values are the addresses of the elements previously at the head (or tail) of the list in question. Note that the element
addresses you use with these functions are the addresses of list entry structures that are usually embedded in larger structures
of some kind, and you’ll need to use the CONTAINING_RECORD macro to recover the address of the surrounding structure.

Removing Items
You can remove items from the head of any of these lists:

pdElement = ExInterlockedRemoveHeadList(&DoubleHead, spinlock);

psElement = ExInterlockedPopEntryList(&SingleHead, spinlock);

psElement = ExInterlockedPopEntrySList(&SListHead, spinlock);

The return values are NULL if the respective lists are empty. Be sure to test the return value for NULL before applying the
CONTAINING_RECORD macro to recover a containing structure pointer.
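
As a minimal sketch of those last two points, suppose (purely for illustration) that each queued item embeds its SINGLE_LIST_ENTRY inside a larger structure:

typedef struct _WORK_ITEM {           // hypothetical containing structure
    SINGLE_LIST_ENTRY linkfield;      // embedded linking field
    ULONG opcode;
} WORK_ITEM, *PWORK_ITEM;

psElement = ExInterlockedPopEntrySList(&SListHead, spinlock);
if (psElement)
{
    // Recover the surrounding structure from the embedded linking field
    PWORK_ITEM item = CONTAINING_RECORD(psElement, WORK_ITEM, linkfield);
    // ... use item->opcode ...
}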

IRQL Restrictions
You can call the S-List functions only while running at or below DISPATCH_LEVEL. The ExInterlockedXxx functions for
accessing doubly-linked or singly-linked lists can be called at any IRQL so long as all references to the list use an
ExInterlockedXxx call. The reason for no IRQL restrictions is that the implementations of these functions disable interrupts,
which is tantamount to raising IRQL to the highest possible level. Once interrupts are disabled, these functions then acquire the
spin lock you’ve specified. Since no other code can gain control on the same CPU, and since no code on another CPU can
acquire the spin lock, your lists are protected.

NOTE
The DDK documentation states this rule in an overly restrictive way for at least some of the ExInterlockedXxx
functions. It says that all callers must be running at some single IRQL less than or equal to the DIRQL of your
interrupt object. There is, in fact, no requirement that all callers be at the same IRQL because you can call the
functions at any IRQL. Likewise, no <= DIRQL restriction exists either, but there’s also no reason for the code
you and I write to raise IRQL higher than that.

It’s perfectly OK for you to use ExInterlockedXxx calls to access a singly-linked or doubly-linked list (but not an S-List) in
some parts of your code and to use the noninterlocked functions (InsertHeadList and so on) in other parts of your code if you
follow a simple rule. Before using a noninterlocked primitive, acquire the same spin lock that your interlocked calls use.
Furthermore, restrict list access to code running at or below DISPATCH_LEVEL. For example:

// Access list using noninterlocked calls:

VOID Function1()
{
ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
KIRQL oldirql;
KeAcquireSpinLock(spinlock, &oldirql);
InsertHeadList(...);
RemoveTailList(...);

KeReleaseSpinLock(spinlock, oldirql);
}

// Access list using interlocked calls:

VOID Function2()
{
ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
ExInterlockedInsertTailList(..., spinlock);
}

The first function must be running at or below DISPATCH_LEVEL because that’s a requirement of calling KeAcquireSpinLock.
The reason for the IRQL restriction on the interlocked calls in the second function is as follows: Suppose Function1 acquires
the spin lock in preparation for performing some list accesses. Acquiring the spin lock raises IRQL to DISPATCH_LEVEL.
Now suppose an interrupt occurs on the same CPU at a higher IRQL and Function2 gains control to use one of the
ExInterlockedXxx routines. The kernel will now attempt to acquire the same spin lock, and the CPU will deadlock. This
problem arises from allowing code running at two different IRQLs to use the same spin lock: Function1 is at
DISPATCH_LEVEL, and Function2 is—practically speaking, anyway—at HIGH_LEVEL when it tries to recursively acquire
the lock.

4.5.4 Windows 98/Me Compatibility Notes


In addition to the horrible problem with KeWaitXxx functions described in an earlier sidebar and the fact that
KeReadStateEvent isn’t supported, note the following additional compatibility issues between Windows 98/Me on the one hand
and Windows 2000/XP on the other.
You cannot wait on a KTHREAD object in Windows 98/Me. Attempting to do so crashes the system because the thread object
doesn’t have the fields necessary for VWIN32 to wait on it.
DISPATCH_LEVEL in a WDM driver corresponds to what is called “interrupt time” in a VxD driver. Every WDM interrupt
service routine runs at a higher IRQL, which means that WDM interrupts have higher priority than non-WDM interrupts. If a
WDM device shares an interrupt with a VxD device, however, both interrupt routines run at the WDM driver’s DIRQL.
WDM driver code running at PASSIVE_LEVEL won’t be preempted in Windows 98/Me unless it blocks explicitly by waiting
for a dispatcher object or implicitly by causing a page fault.
Windows 98/Me is inherently a single-CPU operating system, so the spin lock primitives always just raise the IRQL. This fact,
combined with the fact that nonpaged driver code won’t be preempted, means that synchronization problems are much less
frequent in this environment. (Therefore, do most of your debugging in Windows XP so you’ll trip on the problems.)


Chapter 5
5 The I/O Request Packet

The operating system uses a data structure known as an I/O request packet, or IRP, to communicate with a kernel-mode device
driver. In this chapter, I’ll discuss this important data structure and the means by which it’s created, sent, processed, and
ultimately destroyed. I’ll include a discussion of the relatively complex subject of IRP cancellation.
This chapter is rather abstract, I’m afraid, because I haven’t yet talked about any of the concepts that surround specific types of
I/O request packets (IRPs). You might, therefore, want to skim this chapter and refer back to it while you’re reading later
chapters. The last major section of this chapter contains a cookbook, if you will, that presents the bare-bones code for handling
IRPs in eight different scenarios. You can use the cookbook without understanding all the theory that this chapter contains.

5.1 Data Structures


Two data structures are crucial to the handling of I/O requests: the I/O request packet itself and the IO_STACK_LOCATION
structure. I’ll describe both structures in this section.

5.1.1 Structure of an IRP


Figure 5-1 illustrates the IRP data structure, with opaque fields shaded in the usual convention of this book. A brief description
of the important fields follows.

Figure 5-1. I/O request packet data structure.


MdlAddress (PMDL) is the address of a memory descriptor list (MDL) describing the user-mode buffer associated with this
request. The I/O Manager creates this MDL for IRP_MJ_READ and IRP_MJ_WRITE requests if the topmost device object’s
flags indicate DO_DIRECT_IO. It creates an MDL for the output buffer used with an IRP_MJ_DEVICE_CONTROL request if
the control code indicates METHOD_IN_DIRECT or METHOD_OUT_DIRECT. The MDL itself describes the user-mode
virtual buffer and also contains the physical addresses of locked pages containing that buffer. A driver has to do additional
work, which can be quite minimal, to actually access the user-mode buffer.

Flags (ULONG) contains flags that a device driver can read but not directly alter. None of these flags are relevant to a
Windows Driver Model (WDM) driver.
AssociatedIrp (union) is a union of three possible pointers. The alternative that a typical WDM driver might want to access is
named AssociatedIrp.SystemBuffer. The SystemBuffer pointer holds the address of a data buffer in nonpaged kernel-mode
memory. For IRP_MJ_READ and IRP_MJ_WRITE operations, the I/O Manager creates this data buffer if the topmost device
object’s flags specify DO_BUFFERED_IO. For IRP_MJ_DEVICE_CONTROL operations, the I/O Manager creates this buffer
if the I/O control function code indicates that it should. (See Chapter 9.) The I/O Manager copies data sent by user-mode code
to the driver into this buffer as part of the process of creating the IRP. Such data includes the data involved in a WriteFile call
or the so-called input data for a call to DeviceIoControl. For read requests, the device driver fills this buffer with data; the I/O
Manager later copies the buffer back to the user-mode buffer. For control operations that specify METHOD_BUFFERED, the
driver places the so-called output data in this buffer, and the I/O Manager copies it to the user-mode output buffer.
IoStatus (IO_STATUS_BLOCK) is a structure containing two fields that drivers set when they ultimately complete a request.
IoStatus.Status will receive an NTSTATUS code, while IoStatus.Information is a ULONG_PTR that will receive an information
value whose exact content depends on the type of IRP and the completion status. A common use of the Information field is to
hold the total number of bytes transferred by an operation such as IRP_MJ_READ that transfers data. Certain Plug and Play
(PnP) requests use this field as a pointer to a structure that you can think of as the answer to a query.
RequestorMode will equal one of the enumeration constants UserMode or KernelMode, depending on where the original I/O
request originated. Drivers sometimes inspect this value to know whether to trust some parameters.
PendingReturned (BOOLEAN) is meaningful in a completion routine and indicates whether the next lower dispatch routine
returned STATUS_PENDING. This chapter contains a disagreeably long discussion of how to use this flag.
Cancel (BOOLEAN) is TRUE if IoCancelIrp has been called to cancel this request and FALSE if it hasn’t (yet) been called. IRP
cancellation is a relatively complex topic that I’ll discuss fully later on in this chapter (in “Cancelling I/O Requests”).
CancelIrql (KIRQL) is the interrupt request level (IRQL) at which the special cancel spin lock was acquired. You reference this
field in a cancel routine when you release the spin lock.
CancelRoutine (PDRIVER_CANCEL) is the address of an IRP cancellation routine in your driver. You use IoSetCancelRoutine
to set this field instead of modifying it directly.
UserBuffer (PVOID) contains the user-mode virtual address of the output buffer for an IRP_MJ_DEVICE_CONTROL request
for which the control code specifies METHOD_NEITHER. It also holds the user-mode virtual address of the buffer for read
and write requests, but a driver should usually specify one of the device flags DO_BUFFERED_IO or DO_DIRECT_IO and
should therefore not usually need to access the field for reads or writes. When handling a METHOD_NEITHER control
operation, the driver can create its own MDL using this address.
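
Here’s a minimal sketch of that MDL creation for a METHOD_NEITHER output buffer, hedged because real code must also be running in the context of the requesting process and should validate the length; stack and Irp are the usual dispatch-routine variables:

PVOID uva = Irp->UserBuffer;    // user-mode virtual address of the output buffer
ULONG length = stack->Parameters.DeviceIoControl.OutputBufferLength;
PMDL mdl = IoAllocateMdl(uva, length, FALSE, FALSE, NULL);
if (!mdl)
    return STATUS_INSUFFICIENT_RESOURCES;
__try
{                               // probing a user address can raise an exception
    MmProbeAndLockPages(mdl, Irp->RequestorMode, IoWriteAccess);
}
__except (EXCEPTION_EXECUTE_HANDLER)
{
    IoFreeMdl(mdl);
    return GetExceptionCode();
}
// ... access the buffer via MmGetSystemAddressForMdlSafe(mdl, NormalPagePriority) ...
MmUnlockPages(mdl);
IoFreeMdl(mdl);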

Figure 5-2. Map of the Tail union in an IRP.


Tail.Overlay is a structure within a union that contains several members potentially useful to a WDM driver. Refer to Figure
5-2 for a map of the Tail union. In the figure, items at the same level as you read left to right are alternatives within a union,
while the vertical dimension portrays successive locations within a structure. Tail.Overlay.DeviceQueueEntry
(KDEVICE_QUEUE_ENTRY) and Tail.Overlay.DriverContext (PVOID[4]) are alternatives within an unnamed union within
Tail.Overlay. The I/O Manager uses DeviceQueueEntry as a linking field within the standard queue of requests for a device.
The cancel-safe queuing routines IoCsqXxx use the last entry in the DriverContext array. If these system usages don’t get in
your way, at moments when the IRP is not in some queue that uses this field and when you own the IRP, you can use the four
pointers in DriverContext in any way you please. Tail.Overlay.ListEntry (LIST_ENTRY) is available for you to use as a linking
field for IRPs in any private queue you choose to implement.
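For instance, a minimal sketch of a private FIFO queue built on this field might look like the following, with a hypothetical list head and spin lock in the device extension and no attention paid to cancellation:

// Enqueue an IRP (callable at or below DISPATCH_LEVEL):
ExInterlockedInsertTailList(&pdx->PendingIrpList,
    &Irp->Tail.Overlay.ListEntry, &pdx->PendingIrpLock);

// Dequeue the oldest IRP, if any:
PLIST_ENTRY entry = ExInterlockedRemoveHeadList(&pdx->PendingIrpList, &pdx->PendingIrpLock);
if (entry)
{
    PIRP nextIrp = CONTAINING_RECORD(entry, IRP, Tail.Overlay.ListEntry);
    // ... process nextIrp ...
}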
CurrentLocation (CHAR) and Tail.Overlay.CurrentStackLocation (PIO_STACK_LOCATION) aren’t documented for use by
drivers because support functions such as IoGetCurrentIrpStackLocation can be used instead. During debugging, however, it
might help you to realize that CurrentLocation is the index of the current I/O stack location and CurrentStackLocation is a
pointer to it.

5.1.2 The I/O Stack


Whenever any kernel-mode program creates an IRP, it also creates an associated array of IO_STACK_LOCATION structures:
one stack location for each of the drivers that will process the IRP and sometimes one more stack location for the use of the
originator of the IRP. (See Figure 5-3.) A stack location contains type codes and parameter information for the IRP as well as
the address of a completion routine. Refer to Figure 5-4 for an illustration of the stack structure.

Figure 5-3. Parallelism between driver and I/O stacks.

NOTE
I’ll discuss the mechanics of creating IRPs a bit further on in this chapter. It helps to know right now that the
StackSize field of a DEVICE_OBJECT indicates how many locations to reserve for an IRP sent to that device’s
driver.

MajorFunction (UCHAR) is the major function code associated with this IRP. This code is a value such as IRP_MJ_READ that
corresponds to one of the dispatch function pointers in the MajorFunction table of a driver object. Because the code is in the
I/O stack location for a particular driver, it’s conceivable that an IRP could start life as an IRP_MJ_READ (for example) and be
transformed into something else as it progresses down the stack of drivers. I’ll show you examples in Chapter 12 of how a
USB driver changes the personality of a read or write request into an internal control operation to submit the request to the
USB bus driver.
MinorFunction (UCHAR) is a minor function code that further identifies an IRP belonging to a few major function classes.
IRP_MJ_PNP requests, for example, are divided into a dozen or so subtypes with minor function codes such as
IRP_MN_START_DEVICE, IRP_MN_REMOVE_DEVICE, and so on.
Parameters (union) is a union of substructures, one for each type of request that has specific parameters. The substructures
include, for example, Create (for IRP_MJ_CREATE requests), Read (for IRP_MJ_READ requests), and StartDevice (for the
IRP_MN_START_DEVICE subtype of IRP_MJ_PNP).
DeviceObject (PDEVICE_OBJECT) is the address of the device object that corresponds to this stack entry. IoCallDriver fills
in this field.
FileObject (PFILE_OBJECT) is the address of the kernel file object to which the IRP is directed. Drivers often use the
FileObject pointer to correlate IRPs in a queue with a request (in the form of an IRP_MJ_CLEANUP) to cancel all queued
IRPs in preparation for closing the file object.


Figure 5-4. I/O stack location data structure.


CompletionRoutine (PIO_COMPLETION_ROUTINE) is the address of an I/O completion routine installed by the driver above
the one to which this stack location corresponds. You never set this field directly—instead, you call IoSetCompletionRoutine,
which knows to reference the stack location below the one that your driver owns. The lowest-level driver in the hierarchy of
drivers for a given device never needs a completion routine because it must complete the request. The originator of a request,
however, sometimes does need a completion routine but doesn’t usually have its own stack location. That’s why each level in
the hierarchy uses the next lower stack location to hold its own completion routine pointer.
Context (PVOID) is an arbitrary context value that will be passed as an argument to the completion routine. You never set this
field directly; it’s set automatically from one of the arguments to IoSetCompletionRoutine.
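
To make the relationship concrete, here’s a minimal sketch of a driver installing its own completion routine before forwarding an IRP (the copy macro and the forwarding call are discussed later in this chapter; CompletionRoutine is a hypothetical function in the same driver):

NTSTATUS CompletionRoutine(PDEVICE_OBJECT fdo, PIRP Irp, PVOID context);

// In a dispatch routine, before sending the IRP down the stack:
IoCopyCurrentIrpStackLocationToNext(Irp);
IoSetCompletionRoutine(Irp, CompletionRoutine, pdx,
    TRUE, TRUE, TRUE);          // invoke on success, on error, and on cancellation
NTSTATUS status = IoCallDriver(pdx->LowerDeviceObject, Irp);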

5.2 The “Standard Model” for IRP Processing


Particle physics has its “standard model” for the universe, and so does WDM. Figure 5-5 illustrates a typical flow of ownership
for an IRP as it progresses through various stages in its life. Not every type of IRP will go through these steps, and some of the
steps might be missing or altered depending on the type of device and the type of IRP. Notwithstanding the possible variability,
however, the picture provides a useful starting point for discussion.

Figure 5-5. The “standard model” for IRP processing.


When you engage I/O Verification, the Driver Verifier makes a few basic checks on how you handle IRPs. Extended
I/O Verification includes many more checks. Because there are so many tests, however, I didn’t put the Driver Verifier
flag in the margin for every one of them. Basically, if the DDK or this chapter tells you not to do something, there is
probably a Driver Verifier test to make sure you don’t.

5.2.1 Creating an IRP


The IRP begins life when an entity calls an I/O Manager function to create it. In Figure 5-5, I used the term I/O Manager to
describe this entity, as though there were a single system component responsible for creating IRPs. In reality, no such single
actor in the population of operating system routines exists, and it would have been more accurate to just say that somebody
creates the IRP. Your own driver will be creating IRPs from time to time, for example, and you’ll occupy the initial ownership
box for those particular IRPs.


You can use any of four functions to create a new IRP:
• IoBuildAsynchronousFsdRequest builds an IRP on whose completion you don’t plan to wait. This function and the next
are appropriate for building only certain types of IRP.
• IoBuildSynchronousFsdRequest builds an IRP on whose completion you do plan to wait.
• IoBuildDeviceIoControlRequest builds a synchronous IRP_MJ_DEVICE_CONTROL or
IRP_MJ_INTERNAL_DEVICE_CONTROL request.
• IoAllocateIrp builds an asynchronous IRP of any type.
The Fsd in the first two of these function names stands for file system driver (FSD). Any driver is allowed to call these
functions to create an IRP destined for any other driver, though. The DDK also documents a function named
IoMakeAssociatedIrp for building an IRP that’s subordinate to some other IRP. WDM drivers should not call this function.
Indeed, completion of associated IRPs doesn’t work correctly in Microsoft Windows 98/Me anyway.

NOTE
Throughout this chapter, I use the terms synchronous and asynchronous IRPs because those are the terms used
in the DDK. Knowledgeable developers in Microsoft wish that the terms threaded and nonthreaded had been
chosen because they better reflect the way drivers use these two types of IRP. As should become clear, you use
a synchronous, or threaded, IRP in a non-arbitrary thread that you can block while you wait for the IRP to finish.
You use an asynchronous, or nonthreaded, IRP in every other case.

Creating Synchronous IRPs


Deciding which of these functions to call and determining what additional initialization you need to perform on an IRP is a
rather complicated matter. IoBuildSynchronousFsdRequest and IoBuildDeviceIoControlRequest create a so-called synchronous
IRP. The I/O Manager considers that a synchronous IRP belongs to the thread in whose context you create the IRP. This
ownership concept has several consequences:
• If the owning thread terminates, the I/O Manager automatically cancels any pending synchronous IRPs that belong to that
thread.
• Because the creating thread owns a synchronous IRP, you shouldn’t create one in an arbitrary thread—you most
emphatically do not want the I/O Manager to cancel the IRP because this thread happens to terminate.
• Following a call to IoCompleteRequest, the I/O Manager automatically cleans up a synchronous IRP and signals an event
that you must provide.
• You must take care that the event object still exists at the time the I/O Manager signals it.
Refer to IRP handling scenario number 6 at the end of this chapter for a code sample involving a synchronous IRP.
You must call these two functions at PASSIVE_LEVEL only. In particular, you must not be at APC_LEVEL (say, as a result of
acquiring a fast mutex) because the I/O Manager won’t then be able to deliver the special kernel asynchronous procedure call
(APC) that does all the completion processing. In other words, you mustn’t do this:

PIRP Irp = IoBuildSynchronousFsdRequest(...);


ExAcquireFastMutex(...);
NTSTATUS status = IoCallDriver(...);
if (status == STATUS_PENDING)
KeWaitForSingleObject(...); // <== don't do this
ExReleaseFastMutex(...);

The problem with this code is that the KeWaitForSingleObject call will deadlock: when the IRP completes, IoCompleteRequest
will schedule an APC in this thread. The APC routine, if it could run, would set the event. But because you’re already at
APC_LEVEL, the APC cannot run in order to set the event.
If you need to synchronize IRPs sent to another driver, consider the following alternatives:
• Use a regular kernel mutex instead of an executive fast mutex. The regular mutex leaves you at PASSIVE_LEVEL and
doesn’t inhibit special kernel APCs.
• Use KeEnterCriticalRegion to inhibit all but special kernel APCs, and then use ExAcquireFastMutexUnsafe to acquire the
mutex. (A sketch of this pattern follows this list.) This technique won’t work in the original release of Windows 98 because
KeEnterCriticalRegion wasn’t supported there. It will work on all later WDM platforms.
• Use an asynchronous IRP. Signal an event in the completion routine. Refer to IRP-handling scenario 8 at the end of this
chapter for a code sample.
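
Here’s a minimal sketch of the second alternative, reworking the deadlock-prone fragment above; the fast mutex and the completion event are the ones you’d supply to ExAcquireFastMutexUnsafe and IoBuildSynchronousFsdRequest, respectively:

PIRP Irp = IoBuildSynchronousFsdRequest(...);   // event passed in here

KeEnterCriticalRegion();                 // blocks normal kernel APCs only
ExAcquireFastMutexUnsafe(&pdx->mutex);   // IRQL stays at PASSIVE_LEVEL
NTSTATUS status = IoCallDriver(...);
if (status == STATUS_PENDING)
    KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);  // completion APC can still run
ExReleaseFastMutexUnsafe(&pdx->mutex);
KeLeaveCriticalRegion();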


A final consideration in calling the two synchronous IRP routines is that you can’t create just any kind of IRP using these
routines. See Table 5-1 for the details. A common trick for creating another kind of synchronous IRP is to ask for an
IRP_MJ_SHUTDOWN, which has no parameters, and then alter the MajorFunction code in the first stack location.

Support Function                 Types of IRP You Can Create
IoBuildSynchronousFsdRequest     IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_FLUSH_BUFFERS, IRP_MJ_SHUTDOWN,
                                 IRP_MJ_PNP, IRP_MJ_POWER (but only for IRP_MN_POWER_SEQUENCE)
IoBuildDeviceIoControlRequest    IRP_MJ_DEVICE_CONTROL, IRP_MJ_INTERNAL_DEVICE_CONTROL

Table 5-1. Synchronous IRP Types
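
A minimal sketch of the shutdown trick just mentioned might look like this; the substitute major function (IRP_MJ_CLEANUP) and the target device object are purely illustrative:

KEVENT event;
IO_STATUS_BLOCK iosb;
KeInitializeEvent(&event, NotificationEvent, FALSE);
PIRP Irp = IoBuildSynchronousFsdRequest(IRP_MJ_SHUTDOWN, DeviceObject,
    NULL, 0, NULL, &event, &iosb);
if (Irp)
{
    PIO_STACK_LOCATION stack = IoGetNextIrpStackLocation(Irp);
    stack->MajorFunction = IRP_MJ_CLEANUP;   // change the IRP's personality
    NTSTATUS status = IoCallDriver(DeviceObject, Irp);
    if (status == STATUS_PENDING)
        KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
}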

Creating Asynchronous IRPs


The other two IRP creation functions—IoBuildAsynchronousFsdRequest and IoAllocateIrp—create an asynchronous IRP.
Asynchronous IRPs don’t belong to the creating thread, and the I/O Manager doesn’t schedule an APC and doesn’t clean up
when the IRP completes. Consequently:
• When a thread terminates, the I/O Manager doesn’t try to cancel any asynchronous IRPs that you happen to have created
in that thread.
• It’s OK to create asynchronous IRPs in an arbitrary or nonarbitrary thread.
• Because the I/O Manager doesn’t do any cleanup when the IRP completes, you must provide a completion routine that
will release buffers and call IoFreeIrp to release the memory used by the IRP.
• Because the I/O Manager doesn’t automatically cancel asynchronous IRPs, you might have to provide code to do that
when you no longer want the operation to occur.
• Because you don’t wait for an asynchronous IRP to complete, you can create and send one at IRQL <=
DISPATCH_LEVEL (assuming, that is, that the driver to which you send the IRP can handle the IRP at elevated
IRQL—you must check the specifications for that driver!). Furthermore, it’s OK to create and send an asynchronous IRP
while owning a fast mutex.
Refer to Table 5-2 for a list of the types of IRP you can create using the two asynchronous IRP routines. Note that
IoBuildSynchronousFsdRequest and IoBuildAsynchronousFsdRequest support the same IRP major function codes.

Support Function                 Types of IRP You Can Create
IoBuildAsynchronousFsdRequest    IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_FLUSH_BUFFERS, IRP_MJ_SHUTDOWN,
                                 IRP_MJ_PNP, IRP_MJ_POWER (but only for IRP_MN_POWER_SEQUENCE)
IoAllocateIrp                    Any (but you must initialize the MajorFunction field of the first stack location)

Table 5-2. Asynchronous IRP Types


IRP-handling scenario numbers 5 and 8 at the end of this chapter contain “cookbook” code for using asynchronous IRPs.
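
As a preview, here’s a minimal sketch of the completion routine that an asynchronous IRP requires; returning STATUS_MORE_PROCESSING_REQUIRED keeps the I/O Manager from touching the IRP after you’ve freed it:

NTSTATUS OnAsyncIrpComplete(PDEVICE_OBJECT junk, PIRP Irp, PVOID context)
{
    // Release any buffers or MDLs you attached to this IRP here, then free the IRP itself.
    IoFreeIrp(Irp);
    // Halt completion processing for this (now defunct) IRP.
    return STATUS_MORE_PROCESSING_REQUIRED;
}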

5.2.2 Forwarding to a Dispatch Routine


After you create an IRP, you call IoGetNextIrpStackLocation to obtain a pointer to the first stack location. Then you initialize
just that first location. If you’ve used IoAllocateIrp to create the IRP, you need to fill in at least the MajorFunction code. If
you’ve used another of the four IRP-creation functions, the I/O Manager might have already done the required initialization.
You might then be able to skip this step, depending on the rules for that particular type of IRP. Having initialized the stack, you
call IoCallDriver to send the IRP to a device driver:

PDEVICE_OBJECT DeviceObject; // <== somebody gives you this


PIO_STACK_LOCATION stack = IoGetNextIrpStackLocation(Irp);
stack->MajorFunction = IRP_MJ_Xxx;
<other initialization of "stack">
NTSTATUS status = IoCallDriver(DeviceObject, Irp);

The first argument to IoCallDriver is the address of a device object that you’ve obtained somehow. Often you’re sending an
IRP to the driver under yours in the PnP stack. In that case, the DeviceObject in this fragment is the LowerDeviceObject you
saved in your device extension after calling IoAttachDeviceToDeviceStack. I’ll describe some other common ways of locating a
device object in a few paragraphs.
The I/O Manager initializes the stack location pointer in the IRP to 1 before the actual first location. Because the I/O stack is an
array of IO_STACK_LOCATION structures, you can think of the stack pointer as being initialized to point to the “-1” element,
which doesn’t exist. (In fact, the stack “grows” from high toward low addresses, but that detail shouldn’t obscure the concept
I’m trying to describe here.) We therefore ask for the “next” stack location when we want to initialize the first one.

What IoCallDriver Does


You can imagine IoCallDriver as looking something like this (but I hasten to add that this is not a copy of the actual source
code):

NTSTATUS IoCallDriver(PDEVICE_OBJECT DeviceObject, PIRP Irp)


{
IoSetNextIrpStackLocation(Irp);
PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
stack->DeviceObject = DeviceObject;
ULONG fcn = stack->MajorFunction;
PDRIVER_OBJECT driver = DeviceObject->DriverObject;
return (*driver->MajorFunction[fcn])(DeviceObject, Irp);
}

As you can see, IoCallDriver simply advances the stack pointer and calls the appropriate dispatch routine in the driver for the
target device object. It returns the status code that that dispatch routine returns. Sometimes I see online help requests wherein
people attribute one or another unfortunate action to IoCallDriver. (For example, “IoCallDriver is returning an error code for
my IRP….”) As you can see, the real culprit is a dispatch routine in another driver.

Locating Device Objects


Apart from IoAttachDeviceToDeviceStack, drivers can locate device objects in at least two ways. I’ll tell you here about
IoGetDeviceObjectPointer and IoGetAttachedDeviceReference.

IoGetDeviceObjectPointer
If you know the name of the device object, you can call IoGetDeviceObjectPointer as shown here:

PUNICODE_STRING devname; // <== somebody gives you this


ACCESS_MASK access; // <== more about this later
PDEVICE_OBJECT DeviceObject;
PFILE_OBJECT FileObject;
NTSTATUS status;
ASSERT(KeGetCurrentIrql() == PASSIVE_LEVEL);
status = IoGetDeviceObjectPointer(devname, access, &FileObject, &DeviceObject);

This function returns two pointers: one to a FILE_OBJECT and one to a DEVICE_OBJECT.
To help defeat elevation-of-privilege attacks, specify the most restricted access consistent with your needs. For example,
if you’ll just be reading data, specify FILE_READ_DATA.
When you create an IRP for a target you discover this way, you should set the FileObject pointer in the first stack location.
Furthermore, it’s a good idea to take an extra reference to the file object until after IoCallDriver returns. The following
fragment illustrates both these ideas:

PIRP Irp = IoXxx(...);


PIO_STACK_LOCATION stack = IoGetNextIrpStackLocation(Irp);
ObReferenceObject(FileObject);
stack->FileObject = FileObject;
<etc.>
IoCallDriver(DeviceObject, Irp);
ObDereferenceObject(FileObject);

The reason you put the file object pointer in each stack location is that the target driver might be using fields in the file object
to record per-handle information. The reason you take an extra reference to the file object is that you’ll have code somewhere
in your driver that dereferences the file object in order to release your hold on the target device. (See the next paragraph.)
Should that code execute before the target driver’s dispatch routine returns, the target driver might be removed from memory
before its dispatch routine returns. The extra reference prevents that bad result.


NOTE
Removability of devices in a Plug and Play environment is the ultimate source of the early-unload problem
mentioned in the text. I discuss this problem in much greater detail in the next chapter. The upshot of that
discussion is that it’s your responsibility to avoid sending an IRP to a driver that might no longer be in memory
and to prevent the PnP manager from unloading a driver that’s still processing an IRP you’ve sent to that driver.
One aspect of how you fulfill that responsibility is shown in the text: take an extra reference to the file object
returned by IoGetDeviceObjectPointer around the call to IoCallDriver. In most drivers, you’ll probably need the
extra reference only when you’re sending an asynchronous IRP. In that case, the code that ordinarily
dereferences the file object is likely to be in some other part of your driver that runs asynchronously with the
call to IoCallDriver—say, in the completion routine you’re obliged to install for an asynchronous IRP. If you send
a synchronous IRP, you’re much more likely to code your driver in such a way that you don’t dereference the file
object until the IRP completes.

When you no longer need the device object, dereference the file object:

ObDereferenceObject(FileObject);

After making this call, don’t use either of the file or device object pointers.
IoGetDeviceObjectPointer performs several steps to locate the two pointers that it returns to you:
1. It uses ZwOpenFile to open a kernel handle to the named device object. Internally, this will cause the Object Manager to
create a file object and to send an IRP_MJ_CREATE to the target device. ZwOpenFile returns a file handle.
2. It calls ObReferenceObjectByHandle to get the address of the FILE_OBJECT that the handle represents. This address
becomes the FileObject return value.
3. It calls IoGetRelatedDeviceObject to get the address of the DEVICE_OBJECT to which the file object refers. This
address becomes the DeviceObject return value.
4. It calls ZwClose to close the handle.

Names for Device Objects


For you to use IoGetDeviceObjectPointer, a driver in the stack for the device to which you want to connect must
have named a device object. We studied device object naming in Chapter 2. Recall that a driver might have
specified a name in the \Device folder in its call to IoCreateDevice, and it might have created one or more
symbolic links in the \DosDevices folder. If you know the name of the device object or one of the symbolic links,
you can use that name in your call to IoGetDeviceObjectPointer.

Instead of naming a device object, the function driver for the target device might have registered a device
interface. I showed you the user-mode code for enumerating instances of registered interfaces in Chapter 2. I’ll
discuss the kernel-mode equivalent of that enumeration code in Chapter 6, when I discuss Plug and Play. The
upshot of that discussion is that you can obtain the symbolic link names for all the devices that expose a
particular interface. With a bit of effort, you can then locate the desired device object.

The reference that IoGetDeviceObjectPointer claims to the file object effectively pins the device object in memory too.
Releasing that reference indirectly releases the device object.
Based on this explanation of how IoGetDeviceObjectPointer works, you can see why it will sometimes fail with
STATUS_ACCESS_DENIED, even though you haven’t done anything wrong. If the target driver implements a “one handle
only” policy, and if a handle happens to be open, the driver will cause the IRP_MJ_CREATE to fail. That failure causes the
ZwOpenFile call to fail in turn. Note that you can expect this result if you try to locate a device object for a serial port or
SmartCard reader that happens to already be open.
Sometimes driver programmers decide they don’t want the clutter of two pointers to what appears to be basically the same
object, so they release the file object immediately after calling IoGetDeviceObjectPointer, as shown here:

status = IoGetDeviceObjectPointer(...);
ObReferenceObject(DeviceObject);
ObDereferenceObject(FileObject);

Referencing the device object pins it in memory until you dereference it. Dereferencing the file object allows the I/O Manager
to delete it right away.
Releasing the file object immediately might or might not be OK, depending on the target driver. Consider these fine points
before you decide to do it:


1. Dereferencing the file object will cause the I/O Manager to send an immediate IRP_MJ_CLEANUP to the target driver.
2. IRPs that the target driver queues will no longer be associated with a file object. When you eventually release the device
object reference, the target driver will probably not be able to cancel any IRPs you sent it that remain on its queues.
3. In many situations, the I/O Manager will also send an IRP_MJ_CLOSE to the target driver. (If you’ve opened a disk file,
the file system driver’s use of the system cache will probably cause the IRP_MJ_CLOSE to be deferred.) Many drivers,
including the standard driver for serial ports, will now refuse to process IRPs that you send them.
4. Instead of claiming an extra reference to the file object around calls to IoCallDriver, you’ll want to reference the device
object instead.

NOTE
I recommend avoiding an older routine named IoAttachDevice, which appears superficially to be a sort-of
combination of IoGetDeviceObjectPointer and IoAttachDeviceToDeviceStack. The older routine does its internal
ZwClose call after attaching your device object. Your driver will receive the resulting IRP_MJ_CLOSE. To handle
the IRP correctly, you must call IoAttachDevice in such a way that your dispatch routine has access to the
location you specify for the output DEVICE_OBJECT pointer. It turns out that IoAttachDevice sets your output
pointer before calling ZwClose and depends on you using it to forward the IRP_MJ_CLOSE to the target device.
This is the only example I’ve seen in many decades of programming where you’re required to use the return
value from a function before the function actually returns.

IoGetAttachedDeviceReference
To send an IRP to all the drivers in your own PnP stack, use IoGetAttachedDeviceReference, as shown here:

PDEVICE_OBJECT tdo = IoGetAttachedDeviceReference(fdo);
...
ObDereferenceObject(tdo);

This function returns the address of the topmost device object in your own stack and claims a reference to that object. Because
of the reference you hold, you can be sure that the pointer will remain valid until you release the reference. As discussed earlier,
you might also want to take an extra reference to the topmost device object until IoCallDriver returns.

5.2.3 Duties of a Dispatch Routine


An archetypal IRP dispatch routine would look similar to this example:

NTSTATUS DispatchXxx(PDEVICE_OBJECT fdo, PIRP Irp)


{

PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);

PDEVICE_EXTENSION pdx =
(PDEVICE_EXTENSION) fdo->DeviceExtension;
...
return STATUS_Xxx;
}

1. You generally need to access the current stack location to determine parameters or to examine the minor function code.
2. You also generally need to access the device extension you created and initialized during AddDevice.
3. You’ll be returning some NTSTATUS code to IoCallDriver, which will propagate the code back to its caller.
Where I used an ellipsis in the foregoing prototypical dispatch function, a dispatch function has to choose between three
courses of action. It can complete the request immediately, pass the request down to a lower-level driver in the same driver
stack, or queue the request for later processing by other routines in this driver.

Completing an IRP
Someplace, sometime, someone must complete every IRP. You might want to complete an IRP in your dispatch routine in
cases like these:
If the request is erroneous in some easily determined way (such as a request to rewind a printer or to eject the keyboard), the
dispatch routine should cause the request to fail by completing it with an appropriate status code.


If the request calls for information that the dispatch function can easily determine (such as a control request asking for the
driver’s version number), the dispatch routine should provide the answer and complete the request with a successful status
code.
Mechanically, completing an IRP entails filling in the Status and Information members within the IRP’s IoStatus block and
calling IoCompleteRequest. The Status value is one of the codes defined by manifest constants in the DDK header file
NTSTATUS.H. Refer to Table 5-3 for an abbreviated list of status codes for common situations. The Information value
depends on what type of IRP you’re completing and on whether you’re causing the IRP to succeed or to fail. Most of the time,
when you’re causing an IRP to fail (that is, completing it with an error status of some kind), you’ll set Information to 0. When
you cause an IRP that involves data transfer to succeed, you ordinarily set the Information field equal to the number of bytes
transferred.

Status Code Description


STATUS_SUCCESS Normal completion.
STATUS_UNSUCCESSFUL Request failed, but no other status code describes the reason specifically.
STATUS_NOT_IMPLEMENTED A function hasn’t been implemented.
STATUS_INVALID_HANDLE An invalid handle was supplied for an operation.
STATUS_INVALID_PARAMETER A parameter is in error.
STATUS_INVALID_DEVICE_REQUEST The request is invalid for this device.
STATUS_END_OF_FILE End-of-file marker reached.
STATUS_DELETE_PENDING The device is in the process of being removed from the system.
STATUS_INSUFFICIENT_RESOURCES Not enough system resources (often memory) to perform an operation.
Table 5-3. Some Commonly Used NTSTATUS Codes

NOTE
Always be sure to consult the DDK documentation for the correct setting of IoStatus.Information for the IRP
you’re dealing with. In some flavors of IRP_MJ_PNP, for example, this field is used as a pointer to a data
structure that the PnP Manager is responsible for releasing. If you were to overstore the Information field with
0 when causing the request to fail, you would unwittingly cause a resource leak.

Because completing a request is something you do so often, I find it useful to have a helper routine to carry out the mechanics:

NTSTATUS CompleteRequest(PIRP Irp, NTSTATUS status,


ULONG_PTR Information)
{
Irp->IoStatus.Status = status;
Irp->IoStatus.Information = Information;
IoCompleteRequest(Irp, IO_NO_INCREMENT);
return status;
}

I defined this routine in such a way that it returns whatever status value you supply as its second argument. That’s because I’m
such a lazy typist: the return value allows me to use this helper whenever I want to complete a request and then immediately
return a status code. For example:

NTSTATUS DispatchControl(PDEVICE_OBJECT fdo, PIRP Irp)


{
PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
ULONG code = stack->Parameters.DeviceIoControl.IoControlCode;
if (code == IOCTL_TOASTER_BOGUS)
return CompleteRequest(Irp, STATUS_INVALID_DEVICE_REQUEST, 0);

You might notice that the Information argument to the CompleteRequest function is typed as a ULONG_PTR. In other words,
this value can be either a ULONG or a pointer to something (and therefore potentially 64 bits wide).
When you call IoCompleteRequest, you supply a priority boost value to be applied to whichever thread is currently waiting for
this request to complete. You normally choose a boost value that depends on the type of device, as suggested by the manifest
constant names listed in Table 5-4. The priority adjustment improves the throughput of threads that frequently wait for I/O
operations to complete. Events for which the end user is directly responsible, such as keyboard or mouse operations, result in
greater priority boosts in order to give preference to interactive tasks. Consequently, you want to choose the boost value with at
least some care. Don’t use IO_SOUND_INCREMENT for absolutely every operation a sound card driver finishes, for
example—it’s not necessary to apply this extraordinary priority increment to a get-driver-version control request.


Manifest Constant Numeric Priority Boost


IO_NO_INCREMENT 0
IO_CD_ROM_INCREMENT 1
IO_DISK_INCREMENT 1
IO_KEYBOARD_INCREMENT 6
IO_MAILSLOT_INCREMENT 2
IO_MOUSE_INCREMENT 6
IO_NAMED_PIPE_INCREMENT 2
IO_NETWORK_INCREMENT 2
IO_PARALLEL_INCREMENT 1
IO_SERIAL_INCREMENT 2
IO_SOUND_INCREMENT 8
IO_VIDEO_INCREMENT 1
Table 5-4. Priority Boost Values for IoCompleteRequest
Don’t, by the way, complete an IRP with the special status code STATUS_PENDING. Dispatch routines often return
STATUS_PENDING as their return value, but you should never set IoStatus.Status to this value. Just to make sure, the
checked build of IoCompleteRequest generates an ASSERT failure if it sees STATUS_PENDING in the ending status. Another
popular value for people to use by mistake is apparently -1, which doesn’t have any meaning as an NTSTATUS code at all.
There’s a checked-build ASSERT to catch that mistake too. The Driver Verifier will complain if you try to do either of these
bad things.
Before calling IoCompleteRequest, be sure to remove any cancel routine that you might have installed for an IRP. As you’ll
learn later in this chapter, you install a cancel routine while you keep an IRP in a queue. You must remove an IRP from the
queue before completing it. All the queuing schemes I’ll discuss in this book clear the cancel routine pointer when they
dequeue an IRP. Therefore, you probably don’t need to have additional code in your driver as in this sample:

IoSetCancelRoutine(Irp, NULL); // <== almost certainly redundant


IoCompleteRequest(Irp, ...);

So far, I’ve just explained how to call IoCompleteRequest. That function performs several tasks that you need to understand:
• Calling completion routines that various drivers might have installed. I’ll discuss the important topic of I/O completion
routines later in this chapter.
• Unlocking any pages belonging to Memory Descriptor List (MDL) structures attached to the IRP. An MDL will be used
for the buffer for an IRP_MJ_READ or IRP_MJ_WRITE for a device whose device object has the DO_DIRECT_IO flag
set. Control operations also use an MDL if the control code’s buffering method specifies one of the
METHOD_XX_DIRECT methods. I’ll discuss these issues more fully in Chapter 7 and Chapter 9, respectively.
• Scheduling a special kernel APC to perform final cleanup on the IRP. This cleanup includes copying input data back to a
user buffer, copying the IRP’s ending status, and signaling whichever event the originator of the IRP might be waiting on.
The fact that completion processing includes an APC, and that the cleanup includes setting an event, imposes some
exacting requirements on the way a driver implements a completion routine, so I’ll also discuss this aspect of I/O
completion in more detail later.

Passing an IRP Down the Stack


The whole goal of the layering of device objects that WDM facilitates is for you to be able to easily pass IRPs from one layer
down to the next. Back in Chapter 2, I discussed how your AddDevice routine would contribute its portion of the effort
required to create a stack of device objects with a statement like this one:

pdx->LowerDeviceObject = IoAttachDeviceToDeviceStack(fdo, pdo);

where fdo is the address of your own device object and pdo is the address of the physical device object (PDO) at the bottom of
the device stack. IoAttachDeviceToDeviceStack returns to you the address of the device object immediately underneath yours.
When you decide to forward an IRP that you received from above, this is the device object you’ll specify in the eventual call to
IoCallDriver.
Before passing an IRP to another driver, be sure to remove any cancel routine that you might have installed for the IRP.
As I mentioned just a few paragraphs ago, you’ll probably fulfill this requirement without specifically worrying about it.
Your queue management code will zero the cancel routine pointer when it dequeues an IRP. If you never queued the IRP in the
first place, the driver above you will have made sure the cancel routine pointer was NULL. The Driver Verifier will make sure
that you don’t break this rule.
When you pass an IRP down, you have the additional responsibility of initializing the IO_STACK_LOCATION that the next
driver will use to obtain its parameters. One way of doing this is to perform a physical copy, like this:

IoCopyCurrentIrpStackLocationToNext(Irp);
status = IoCallDriver(pdx->LowerDeviceObject, Irp);

IoCopyCurrentIrpStackLocationToNext is a macro in WDM.H that copies all the fields in an IO_STACK_LOCATION—except


for the ones that pertain to the I/O completion routines—from the current stack location to the next one. In previous versions of
Windows NT, kernel-mode driver writers sometimes copied the entire stack location, which would cause the caller’s
completion routine to be called twice. The IoCopyCurrentIrpStackLocationToNext macro, which is new with the WDM, avoids
the problem.
If you don’t care what happens to an IRP after you pass it down the stack, use the following alternative to
IoCopyCurrentIrpStackLocationToNext:

NTSTATUS ForwardAndForget(PDEVICE_OBJECT fdo, PIRP Irp)


{
PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
IoSkipCurrentIrpStackLocation(Irp);
return IoCallDriver(pdx->LowerDeviceObject, Irp);
}

IoSkipCurrentIrpStackLocation retards the IRP’s stack pointer by one position. IoCallDriver will immediately advance the
stack pointer. The net effect is to not change the stack pointer. When the next driver’s dispatch routine calls
IoGetCurrentIrpStackLocation, it will retrieve exactly the same IO_STACK_LOCATION pointer that we were working with,
and it will thereby process exactly the same request (same major and minor function codes) with the same parameters.

Figure 5-6. Comparison of copying vs. skipping I/O stack locations.

CAUTION
The version of IoSkipCurrentIrpStackLocation that you get when you use the Windows Me or Windows 2000
build environment in the DDK is a macro that generates two statements without surrounding braces. Therefore,
you mustn’t use it in a construction like this:

if (<expression>)
IoSkipCurrentIrpStackLocation(Irp); // <== don't do this!

The explanation of why IoSkipCurrentIrpStackLocation works is so tricky that I thought an illustration might help. Figure 5-6
illustrates a situation in which three drivers are in a particular stack: yours (the function device object [FDO]) and two others
(an upper filter device object [FiDO] and the PDO). In the picture on the left, you see the relationship between stack locations,
parameters, and completion routines when we do the copy step with IoCopyCurrentIrpStackLocationToNext. In the picture on
the right, you see the same relationships when we use the IoSkipCurrentIrpStackLocation shortcut. In the right-hand picture,
the third and last stack location is fallow, but nobody gets confused by that fact.


Queuing an IRP for Later Processing


The third alternative action for a dispatch routine is to queue the IRP for later processing. The following code snippet assumes
you’re using one of my DEVQUEUE queue objects for IRP queuing. I’ll explain the DEVQUEUE object later in this chapter.

NTSTATUS DispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp)


{
PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
IoMarkIrpPending(Irp);
StartPacket(&pdx->dqSomething, fdo, Irp, CancelRoutine);

return STATUS_PENDING;
}

1. Whenever we return STATUS_PENDING from a dispatch routine (as we’re about to do here), we make this call to help
the I/O Manager avoid an internal race condition. We must do this before we relinquish ownership of the IRP.
2. If our device is currently busy or stalled because of a PnP or Power event, StartPacket puts the request in a queue.
Otherwise, StartPacket marks the device as busy and calls our StartIo routine. I’ll describe the StartIo routine in the next
section. The last argument is the address of a cancel routine. I’ll discuss cancel routines later in this chapter.
3. We return STATUS_PENDING to tell our caller that we’re not done with this IRP yet.
It’s important not to touch the IRP once we call StartPacket. By the time that function returns, the IRP might have been
completed and the memory it occupies released. The pointer we have might, therefore, now be invalid.

5.2.4 The StartIo Routine


IRP-queuing schemes often revolve around calling a StartIo function to process IRPs:

VOID StartIo(PDEVICE_OBJECT device, PIRP Irp)


{
PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) device->DeviceExtension;

A StartIo routine generally receives control at DISPATCH_LEVEL, meaning that it must not generate any page faults.
Your job in StartIo is to commence the IRP you’ve been handed. How you do this depends entirely on your device. Often you
will need to access hardware registers that are also used by your interrupt service routine (ISR) and, perhaps, by other routines
in your driver. In fact, sometimes the easiest way to commence a new operation is to store some state information in your
device extension and then fake an interrupt. Because either of these approaches needs to be carried out under the protection of
the same spin lock that protects your ISR, the correct way to proceed is to call KeSynchronizeExecution. For example:

VOID StartIo(...)
{

KeSynchronizeExecution(pdx->InterruptObject, TransferFirst, (PVOID) pdx);


}

BOOLEAN TransferFirst(PVOID context)


{
PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) context;
<initialize device for new operation>
return TRUE;
}

The TransferFirst routine shown here is an example of the generic class of SynchCritSection routines, so called because
they’re synchronized with the ISR. I’ll discuss the SynchCritSection concept in more detail in Chapter 7.
In Windows XP and later systems, you can follow this template instead of calling KeSynchronizeExecution:

VOID StartIo(...)
{
KIRQL oldirql = KeAcquireInterruptSpinLock(pdx->InterruptObject);


<initialize device for new operation>


KeReleaseInterruptSpinLock(pdx->InterruptObject, oldirql);
}

Once StartIo gets the device busy handling the new request, it returns. You’ll see the request next when your device interrupts
to signal that it’s done with whatever transfer you started.

5.2.5 The Interrupt Service Routine


When your device is finished transferring data, it might signal a hardware interrupt. In Chapter 7, I’ll show you how to use
IoConnectInterrupt to “hook” the interrupt. One of the arguments to IoConnectInterrupt is the address of your ISR. When an
interrupt occurs, the system calls your ISR. The ISR runs at the device IRQL (DIRQL) of your particular device and under the
protection of a spin lock associated specifically with your ISR. The ISR has the following skeleton:

BOOLEAN OnInterrupt(PKINTERRUPT InterruptObject, PDEVICE_EXTENSION pdx)
{
    if (<my device didn't interrupt>)
        return FALSE;
    <handle the interrupt>
    return TRUE;
}

The first argument of your ISR is the address of the interrupt object created by IoConnectInterrupt, but you’re unlikely to use
this argument. The second argument is whatever context value you specified in your original call to IoConnectInterrupt; it will
probably be the address of your device extension, as shown in this fragment.
I’ll discuss the duties of your ISR in detail in Chapter 7 in connection with reading and writing data, the subject to which
interrupt handling is most relevant. To carry on with this discussion of the standard model, I need to tell you that one of the
likely things for the ISR to do is to schedule a deferred procedure call (DPC). The purpose of the DPC is to let you do things,
such as calling IoCompleteRequest, that can’t be done at the rarified DIRQL at which your ISR runs. So you might have a line
of code like this one:

IoRequestDpc(pdx->DeviceObject, NULL, pdx);

You’ll next see the IRP in the DPC routine you registered inside AddDevice with your call to IoInitializeDpcRequest. The
traditional name for that routine is DpcForIsr because it’s the DPC routine your ISR requests.

5.2.6 Deferred Procedure Call Routine


The DpcForIsr routine requested by your ISR receives control at DISPATCH_LEVEL. Generally, its job is to finish up the
processing of the IRP that caused the most recent interrupt. Often that job entails calling IoCompleteRequest to complete this
IRP and StartNextPacket to remove the next IRP from your device queue for forwarding to StartIo.

VOID DpcForIsr(PKDPC Dpc, PDEVICE_OBJECT fdo, PIRP junk, PDEVICE_EXTENSION pdx)
{
    ...
    StartNextPacket(&pdx->dqSomething, fdo);
    IoCompleteRequest(Irp, boost);
}

StartNextPacket removes the next IRP from your queue and sends it to StartIo.
IoCompleteRequest completes the IRP you specify as the first argument. The second argument specifies a priority boost for the
thread that has been waiting for this IRP. You’ll also fill in the IoStatus block within the IRP before calling IoCompleteRequest,
as I explained earlier, in the section “Completing an IRP.”
I’m not (yet) showing you how to determine which IRP has just completed. You might notice that the third argument to the
DPC is typed as a pointer to an IRP. This is because, once upon a time, people often specified an IRP address as one of the
context parameters to IoRequestDpc, and that value showed up here. Trying to communicate an IRP pointer from the function
that queues a DPC is unwise, though, because it’s possible for there to be just one call to the DPC routine for any number of
requests to queue that DPC. Accordingly, the DPC routine should develop the current IRP pointer based on whatever scheme
you happen to be using for IRP queuing.
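For example, a driver that uses the DEVQUEUE package described later in this chapter can ask the queue which IRP it most recently handed to StartIo. The following is only a sketch: GetCurrentIrp is the helper I assume the queue package exposes for that purpose, and numxfer is an invented transfer-count field.

VOID DpcForIsr(PKDPC Dpc, PDEVICE_OBJECT fdo, PIRP junk, PDEVICE_EXTENSION pdx)
{
    PIRP Irp = GetCurrentIrp(&pdx->dqSomething);    // IRP our StartIo has been working on
    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = pdx->numxfer;       // hypothetical byte count for this transfer
    StartNextPacket(&pdx->dqSomething, fdo);        // let the queue send the next IRP to StartIo
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
}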
The call to IoCompleteRequest is the end of this standard way of handling an I/O request. After that call, the I/O Manager (or
whichever entity created the IRP in the first place) owns the IRP once more. That entity will destroy the IRP and might unblock
a thread that has been waiting for the request to complete.


5.3 Completion Routines


You often need to know the results of I/O requests that you pass down to lower levels of the driver hierarchy or that you
originate. To find out what happened to a request, you install a completion routine by calling IoSetCompletionRoutine:

IoSetCompletionRoutine(Irp, CompletionRoutine, context,
    InvokeOnSuccess, InvokeOnError, InvokeOnCancel);

Irp is the request whose completion you want to know about. CompletionRoutine is the address of the completion routine you
want called, and context is an arbitrary pointer-size value you want passed as an argument to the completion routine. The
InvokeOnXxx arguments are Boolean values indicating whether you want the completion routine called in three different
circumstances:
• InvokeOnSuccess means you want the completion routine called when somebody completes the IRP with a status code that passes the NT_SUCCESS test.
• InvokeOnError means you want the completion routine called when somebody completes the IRP with a status code that does not pass the NT_SUCCESS test.
• InvokeOnCancel means you want the completion routine called when somebody calls IoCancelIrp before completing the IRP. I worded this quite delicately: IoCancelIrp will set the Cancel flag in the IRP, and that’s the condition that gets tested if you specify this argument. A cancelled IRP might end up being completed with STATUS_CANCELLED (which would cause the NT_SUCCESS test to fail) or with any other status at all. If the IRP gets completed with an error and you specified InvokeOnError, InvokeOnError by itself will cause your completion routine to be called. Conversely, if the IRP gets completed without error and you specified InvokeOnSuccess, InvokeOnSuccess by itself will cause your completion routine to be called. In these cases, InvokeOnCancel will be redundant. But if you left out one or the other (or both) of InvokeOnSuccess or InvokeOnError, the InvokeOnCancel flag will let you see the eventual completion of an IRP whose Cancel flag has been set, no matter which status is used for the completion.
At least one of these three flags must be TRUE. Note that IoSetCompletionRoutine is a macro, so you want to avoid arguments
that generate side effects. The three flag arguments and the function pointer, in particular, are each referenced twice by the
macro.
IoSetCompletionRoutine installs the completion routine address and context argument in the next
IO_STACK_LOCATION—that is, in the stack location in which the next lower driver will find its parameters. Consequently,
the lowest-level driver in a particular stack of drivers doesn’t dare attempt to install a completion routine. Doing so would be
pretty futile, of course, because—by definition of lowest-level driver—there’s no driver left to pass the request on to.

CAUTION
Recall that you are responsible for initializing the next I/O stack location before you call IoCallDriver. Do this
initialization before you install a completion routine. This step is especially important if you use
IoCopyCurrentIrpStackLocationToNext to initialize the next stack location because that function clears some
flags that IoSetCompletionRoutine sets.
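A minimal sketch of the ordering the caution describes, assuming pdx->LowerDeviceObject is where this driver saved the next lower device object:

IoCopyCurrentIrpStackLocationToNext(Irp);              // initialize the next stack location first
IoSetCompletionRoutine(Irp, CompletionRoutine, pdx,    // then install the completion routine
    TRUE, TRUE, TRUE);
NTSTATUS status = IoCallDriver(pdx->LowerDeviceObject, Irp);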

A completion routine looks like this:

NTSTATUS CompletionRoutine(PDEVICE_OBJECT fdo, PIRP Irp, PVOID context)
{
    ...
    return <some status code>;
}

It receives pointers to the device object and the IRP, and it also receives whichever context value you specified in the call to
IoSetCompletionRoutine. Completion routines can be called at DISPATCH_LEVEL in an arbitrary thread context but can also
be called at PASSIVE_LEVEL or APC_LEVEL. To accommodate the worst case (DISPATCH_LEVEL), completion routines
therefore need to be in nonpaged memory and must call only service functions that are callable at or below DISPATCH_LEVEL.
To accommodate the possibility of being called at a lower IRQL, however, a completion routine shouldn’t call functions such
as KeAcquireSpinLockAtDpcLevel that assume they’re at DISPATCH_LEVEL to start with.
There are really just two possible return values from a completion routine:
• STATUS_MORE_PROCESSING_REQUIRED, which aborts the completion process immediately. The spelling of this status code obscures its actual purpose, which is to short-circuit the completion of an IRP. Sometimes, a driver actually does some additional processing on the same IRP. Other times, the flag just means, “Yo, IoCompleteRequest! Like, don’t touch this IRP no more, dude!” Future versions of the DDK will therefore define an enumeration constant, StopCompletion, that is numerically the same as STATUS_MORE_PROCESSING_REQUIRED but more evocatively named. (Future printings of this book may also employ better grammar in describing the meaning to be ascribed the constant, at least if my editors get their way.)
• Anything else, which allows the completion process to continue. Because any value besides STATUS_MORE_PROCESSING_REQUIRED has the same meaning as any other, I usually just code STATUS_SUCCESS. Future versions of the DDK will define STATUS_CONTINUE_COMPLETION and an enumeration constant, ContinueCompletion, that are numerically the same as STATUS_SUCCESS.
I’ll have more to say about these return codes a bit further on in this chapter.

NOTE
The device object pointer argument to a completion routine is the value left in the I/O stack location’s
DeviceObject pointer. IoCallDriver ordinarily sets this value. People sometimes create an IRP with an extra stack
location so that they can pass parameters to a completion routine without creating an extra context structure.
Such a completion routine gets a NULL device object pointer unless the creator sets the DeviceObject field.
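Here is a sketch of that extra-stack-location trick. The fields I touch are real IRP and IO_STACK_LOCATION members, but the particular use of Argument1 to carry a context pointer is just my illustration:

// Allocate one more stack location than the target driver needs.
PIRP Irp = IoAllocateIrp((CCHAR)(pdx->LowerDeviceObject->StackSize + 1), FALSE);
IoSetNextIrpStackLocation(Irp);                       // make the extra location the "current" one
PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
stack->DeviceObject = fdo;                            // so the completion routine gets our device object
stack->Parameters.Others.Argument1 = pdx;             // private value the completion routine can read back
// ...then fill in the next stack location for the target driver, install the
// completion routine, and call IoCallDriver as usual.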

How Completion Routines Get Called

Figure 5-7. Logic of IoCompleteRequest.


IoCompleteRequest is responsible for calling all of the completion routines that drivers installed in their respective stack
locations. The way the process works, as shown in the flowchart in Figure 5-7, is this: Somebody calls IoCompleteRequest to
signal the end of processing for the IRP. IoCompleteRequest then consults the current stack location to see whether the driver
above the current level installed a completion routine. If not, it moves the stack pointer up one level and repeats the test. This
process repeats until a stack location is found that does specify a completion routine or until IoCompleteRequest reaches the
top of the stack. Then IoCompleteRequest takes steps that eventually result in somebody releasing the memory occupied by the
IRP (among other things).
When IoCompleteRequest finds a stack frame with a completion routine pointer, it calls that routine and examines the return
code. If the return code is anything other than STATUS_MORE_PROCESSING_REQUIRED, IoCompleteRequest moves the
stack pointer up one level and continues as before. If the return code is STATUS_MORE_PROCESSING_REQUIRED, however,
IoCompleteRequest stops dead in its tracks and returns to its caller. The IRP will then be in a sort of limbo state. The driver
whose completion routine halted the stack unwinding process is expected to do more work with the IRP and call
IoCompleteRequest to resume the completion process.
Within a completion routine, a call to IoGetCurrentIrpStackLocation will retrieve the same stack pointer that was current when
somebody called IoSetCompletionRoutine. You shouldn’t rely in a completion routine on the contents of any lower stack
location. To reinforce this rule, IoCompleteRequest zeroes most of the next location just before calling a completion routine.


Actual Question from Who Wants to Be a Gazillionaire Driver Tycoon:


Suppose you install a completion routine and then immediately call IoCompleteRequest. What do you suppose
happens?

A. Your computer implodes, creating a gravitational singularity into which the universe instantaneously
collapses.

B. You receive the blue screen of death because you’re supposed to know better than to install a completion
routine in this situation.

C. IoCompleteRequest calls your completion routine. Unless the completion routine returns
STATUS_MORE_PROCESSING_REQUIRED, IoCompleteRequest then completes the IRP normally.

D. IoCompleteRequest doesn’t call your completion routine. It completes the IRP normally.

The Problem of IoMarkIrpPending


Completion routines have one more detail to attend to. You can learn this the easy way or the hard way, as they say in the
movies. First the easy way—just follow this rule:
Execute the following code in any completion routine that does not return STATUS_MORE_PROCESSING_REQUIRED:

if (Irp->PendingReturned) IoMarkIrpPending(Irp);

Now we’ll explore the hard way to learn about IoMarkIrpPending. Some I/O Manager routines manage an IRP with code that functions much like this example:

KEVENT event;
IO_STATUS_BLOCK iosb;
KeInitializeEvent(&event, ...);
PIRP Irp = IoBuildDeviceIoControlRequest(..., &event, &iosb);
NTSTATUS status = IoCallDriver(SomeDeviceObject, Irp);
if (status == STATUS_PENDING)
{
KeWaitForSingleObject(&event, ...);
status = iosb.Status;
}
else
<cleanup IRP>

The key here is that, if the returned status is STATUS_PENDING, the entity that creates this IRP will wait on the event that was
specified in the call to IoBuildDeviceIoControlRequest. This discussion could also be about an IRP built by
IoBuildSynchronousFsdRequest too—the important factor is the conditional wait on the event.
So who, you might well wonder, signals that event? IoCompleteRequest does this signaling indirectly by scheduling an APC to
the same routine that performs the <cleanup IRP> step in the preceding pseudocode. That cleanup code will do many tasks,
including calling IoFreeIrp to release the IRP and KeSetEvent to set the event on which the creator might be waiting. For some
types of IRP, IoCompleteRequest will always schedule the APC. For other types of IRP, though, IoCompleteRequest will
schedule the APC only if the SL_PENDING_RETURNED flag is set in the topmost stack location. You don’t need to know
which types of IRP fall into these two categories because Microsoft might change the way this function works and invalidate
the deductions you might make if you knew. You do need to know, though, that IoMarkIrpPending is a macro whose only purpose
is to set SL_PENDING_RETURNED in the current stack location. Thus, if the dispatch routine in the topmost driver on the
stack does this:

NTSTATUS TopDriverDispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp)
{
    IoMarkIrpPending(Irp);
    ...
    return STATUS_PENDING;
}


things will work out nicely. (I’m violating my naming convention here to emphasize where this dispatch function lives.)
Because this dispatch routine returns STATUS_PENDING, the originator of the IRP will call KeWaitForSingleObject. Because
the dispatch routine sets the SL_PENDING_RETURNED flag, IoCompleteRequest will know to set the event on which the
originator waits.
But suppose the topmost driver merely passed the request down the stack, and the second driver pended the IRP:

NTSTATUS TopDriverDispatchSomething(PDEVICE_OBJECT fido, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fido->DeviceExtension;
    IoCopyCurrentIrpStackLocationToNext(Irp);
    return IoCallDriver(pdx->LowerDeviceObject, Irp);
}

NTSTATUS SecondDriverDispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp)
{
    IoMarkIrpPending(Irp);
    ...
    return STATUS_PENDING;
}

Apparently, the second driver’s stack location contains the SL_PENDING_RETURNED flag, but the first driver’s does not.
IoCompleteRequest anticipates this situation, however, by propagating the SL_PENDING_RETURNED flag whenever it
unwinds a stack location that doesn’t have a completion routine associated with it. Because the top driver didn’t install a
completion routine, therefore, IoCompleteRequest will have set the flag in the topmost location, and it will have caused the
completion event to be signaled.
In another scenario, the topmost driver uses IoSkipCurrentIrpStackLocation instead of IoCopyCurrentIrpStackLocationToNext.
Here, everything works out by default. This is because the IoMarkIrpPending call in SecondDriverDispatchSomething sets the
flag in the topmost stack location to begin with.
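Sketched out, that scenario looks like this (purely illustrative):

NTSTATUS TopDriverDispatchSomething(PDEVICE_OBJECT fido, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fido->DeviceExtension;
    IoSkipCurrentIrpStackLocation(Irp);     // the lower driver reuses our stack location, so its
                                            // IoMarkIrpPending marks the topmost location directly
    return IoCallDriver(pdx->LowerDeviceObject, Irp);
}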
Things get sticky if the topmost driver installs a completion routine:

NTSTATUS TopDriverDispatchSomething(PDEVICE_OBJECT fido, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fido->DeviceExtension;
    IoCopyCurrentIrpStackLocationToNext(Irp);
    IoSetCompletionRoutine(Irp, TopDriverCompletionRoutine, ...);
    return IoCallDriver(pdx->LowerDeviceObject, Irp);
}

NTSTATUS SecondDriverDispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp)
{
    IoMarkIrpPending(Irp);
    ...
    return STATUS_PENDING;
}

Here IoCompleteRequest won’t propagate SL_PENDING_RETURNED into the topmost stack location. I’m not exactly sure
why the Windows NT designers decided not to do this propagation, but it’s a fact that they did so decide. Instead, just before
calling the completion routine, IoCompleteRequest sets the PendingReturned flag in the IRP to whichever value
SL_PENDING_RETURNED had in the immediately lower stack location. The completion routine must then take over the job
of setting SL_PENDING_RETURNED in its own location:

NTSTATUS TopDriverCompletionRoutine(PDEVICE_OBJECT fido, PIRP Irp, ...)
{
    if (Irp->PendingReturned)
        IoMarkIrpPending(Irp);
    return STATUS_SUCCESS;
}

If you omit this step, you’ll find that threads deadlock waiting for someone to signal an event that’s destined never to be
signaled. So don’t omit this step.
Given the importance of the call to IoMarkIrpPending, driver programmers through the ages have tried to find other ways of
dealing with the problem. Here is a smattering of bad ideas.


Bad Idea # 1—Conditionally Call IoMarkIrpPending in the Dispatch Routine


The first bad idea is to try to deal with the pending flag solely in the dispatch routine, thereby keeping the completion routine
pristine and understandable in some vague way:

NTSTATUS TopDriverDispatchSomething(PDEVICE_OBJECT fido, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fido->DeviceExtension;
    IoCopyCurrentIrpStackLocationToNext(Irp);
    IoSetCompletionRoutine(Irp, TopDriverCompletionRoutine, ...);
    NTSTATUS status = IoCallDriver(pdx->LowerDeviceObject, Irp);
    if (status == STATUS_PENDING)
        IoMarkIrpPending(Irp);    // <== Argh! Don't do this!
    return status;
}

The reason this is a bad idea is that the IRP might already be complete, and someone might already have called IoFreeIrp, by
the time IoCallDriver returns. You must treat the pointer as poison as soon as you give it away to a function that might
complete the IRP.

Bad Idea # 2—Always Call IoMarkIrpPending in the Dispatch Routine


Here the dispatch routine unconditionally calls IoMarkIrpPending and then returns whichever value IoCallDriver returns:

NTSTATUS TopDriverDispatchSomething(PDEVICE_OBJECT fido, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fido->DeviceExtension;
    IoMarkIrpPending(Irp);    // <== Don't do this either!
    IoCopyCurrentIrpStackLocationToNext(Irp);
    IoSetCompletionRoutine(Irp, TopDriverCompletionRoutine, ...);
    return IoCallDriver(pdx->LowerDeviceObject, Irp);
}

This is a bad idea if the next driver happens to complete the IRP in its dispatch routine and returns a nonpending status. In this
situation, IoCompleteRequest will cause all the completion cleanup to happen. When you return a nonpending status, the I/O
Manager routine that originated the IRP might call the same completion cleanup routine a second time. This leads to a
double-completion bug check.
Remember always to pair the call to IoMarkIrpPending with returning STATUS_PENDING. That is, do both or neither, but
never one without the other.

Bad Idea # 3—Call IoMarkIrpPending Regardless of the Return Code from the Completion Routine
In this example, the programmer forgot the qualification of the rule about when to make the call to IoMarkIrpPending from a
completion routine:

NTSTATUS TopDriverDispatchSomething(PDEVICE_OBJECT fido, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fido->DeviceExtension;
    KEVENT event;
    KeInitializeEvent(&event, NotificationEvent, FALSE);
    IoCopyCurrentIrpStackLocationToNext(Irp);
    IoSetCompletionRoutine(Irp, TopDriverCompletionRoutine, &event,
        TRUE, TRUE, TRUE);
    IoCallDriver(pdx->LowerDeviceObject, Irp);
    KeWaitForSingleObject(&event, ...);
    ...
    Irp->IoStatus.Status = status;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return status;
}

NTSTATUS TopDriverCompletionRoutine(PDEVICE_OBJECT fido, PIRP Irp, PVOID pev)
{
    if (Irp->PendingReturned)
        IoMarkIrpPending(Irp);    // <== oops
    KeSetEvent((PKEVENT) pev, IO_NO_INCREMENT, FALSE);
    return STATUS_MORE_PROCESSING_REQUIRED;
}

What’s probably going on here is that the programmer wants to forward the IRP synchronously and then resume processing the
IRP after the lower driver finishes with it. (See IRP-handling scenario 7 at the end of this chapter.) That’s how you’re supposed
to handle certain PnP IRPs, in fact. This example can cause a double-completion bug check, though, if the lower driver
happens to return STATUS_PENDING. This is actually the same scenario as in the previous bad idea: your dispatch routine is
returning a nonpending status, but your stack frame has the pending flag set. People often get away with this bad idea, which
existed in the IRP_MJ_PNP handlers of many early Windows 2000 DDK samples, because no one ever posts a Plug and Play
IRP. (Therefore, PendingReturned is never set, and the incorrect call to IoMarkIrpPending never happens.)
A variation on this idea occurs when you create an asynchronous IRP of some kind. You’re supposed to provide a completion
routine to free the IRP, and you’ll necessarily return STATUS_MORE_PROCESSING_REQUIRED from that completion
routine to prevent IoCompleteRequest from attempting to do any more work on an IRP that has disappeared:

SOMETYPE SomeFunction()
{
    PIRP Irp = IoBuildAsynchronousFsdRequest(...);
    IoSetCompletionRoutine(Irp, MyCompletionRoutine, ...);
    IoCallDriver(...);
}

NTSTATUS MyCompletionRoutine(PDEVICE_OBJECT junk, PIRP Irp, PVOID context)
{
    if (Irp->PendingReturned)
        IoMarkIrpPending(Irp);    // <== oops!
    IoFreeIrp(Irp);
    return STATUS_MORE_PROCESSING_REQUIRED;
}

The problem here is that there is no current stack location inside this completion routine! Consequently, IoMarkIrpPending
modifies a random piece of storage. Besides, it’s fundamentally silly to worry about setting a flag that IoCompleteRequest will
never inspect: you’re returning STATUS_MORE_PROCESSING_REQUIRED, which is going to cause IoCompleteRequest to
immediately return to its own caller without doing another single thing with your IRP.
Avoid both of these problems by remembering not to call IoMarkIrpPending from a completion routine that returns
STATUS_MORE_PROCESSING_REQUIRED.
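Here is the same completion routine with the offending call removed, just to show the corrected shape:

NTSTATUS MyCompletionRoutine(PDEVICE_OBJECT junk, PIRP Irp, PVOID context)
{
    IoFreeIrp(Irp);                             // we created this IRP, so we release it
    return STATUS_MORE_PROCESSING_REQUIRED;     // and tell IoCompleteRequest to stop right now
}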

Bad Idea # 4—Always Pend the IRP


Here the programmer gives up trying to understand and just always pends the IRP. This strategy avoids needing to do anything
special in the completion routine.

NTSTATUS TopDriverDispatchSomething(PDEVICE_OBJECT fido, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fido->DeviceExtension;
    IoMarkIrpPending(Irp);
    IoCopyCurrentIrpStackLocationToNext(Irp);
    IoSetCompletionRoutine(Irp, TopDriverCompletionRoutine, ...);
    IoCallDriver(pdx->LowerDeviceObject, Irp);
    return STATUS_PENDING;
}

NTSTATUS TopDriverCompletionRoutine(PDEVICE_OBJECT fido, PIRP Irp, ...)
{
    ...
    return STATUS_SUCCESS;
}

This strategy isn’t so much bad as inefficient. If SL_PENDING_RETURNED is set in the topmost stack location,
IoCompleteRequest schedules a special kernel APC to do the work in the context of the originating thread. Generally speaking,
if a dispatch routine posts an IRP, the IRP will end up being completed in some other thread. An APC is needed to get back into
the original context in order to do some buffer copying. But scheduling an APC is relatively expensive, and it would be nice to
avoid the overhead if you’re still in the original thread. Thus, if your dispatch routine doesn’t actually return
STATUS_PENDING, you shouldn’t mark your stack frame pending.
But nothing really awful will happen if you implement this bad idea, in the sense that the system will keep working normally.
Note also that Microsoft might someday change the way completion cleanup happens, so don’t write your driver on the assumption that an APC is always going to occur.

A Plug and Play Complication


The PnP Manager might conceivably decide to unload your driver before one of your completion routines has a chance to
return to the I/O Manager. Anyone who sends you an IRP is supposed to prevent this unhappy occurrence by making sure you
can’t be unloaded until you’ve finished handling that IRP. When you create an IRP, however, you have to protect yourself. Part
of the protection involves a so-called remove lock object, discussed in Chapter 6, which gates PnP removal until drivers under
you finish handling all outstanding IRPs. Another part of the protection is the following function, available in XP and later
releases of Windows:

IoSetCompletionRoutineEx(DeviceObject, Irp, CompletionRoutine,
    context, InvokeOnSuccess, InvokeOnError, InvokeOnCancel);

NOTE
The DDK documentation for IoSetCompletionRoutineEx suggests that it’s useful only for non-PnP drivers. As
discussed here, however, on many occasions a PnP driver might need to use this function to achieve full
protection from early unloading.

The DeviceObject parameter is a pointer to your own device object. IoSetCompletionRoutineEx takes an extra reference to this
object just before calling your completion routine, and it releases the reference when your completion routine returns. The
extra reference pins the device object and, more important, your driver, in memory. But because this function doesn’t exist in
Windows versions prior to XP, you need to consider carefully whether you want to go to the trouble of calling
MmGetSystemRoutineAddress (and loading a Windows 98/Me implementation of the same function) to dynamically link to this
routine if it happens to be available. It seems to me that there are five discrete situations to consider:

Situation 1: Synchronous Subsidiary IRP


The first situation to consider occurs when you create a synchronous IRP to help you process an IRP that
someone else has sent you. You intend to complete the main IRP after the subsidiary IRP completes.
You wouldn’t ordinarily use a completion routine with a synchronous IRP, but you might want to if you were going to
implement the safe cancel logic discussed later in this chapter. If you follow that example, your completion routine will safely
return before you completely finish handling the subsidiary IRP and, therefore, comfortably before you complete the main IRP.
The sender of the main IRP is keeping you in memory until then. Consequently, you won’t need to use
IoSetCompletionRoutineEx.

Situation 2: Asynchronous Subsidiary IRP


In this situation, you use an asynchronous subsidiary IRP to help you implement a main IRP that someone sends you. You
complete the main IRP in the completion routine that you’re obliged to install for the subsidiary IRP.
Here you should use IoSetCompletionRoutineEx if it’s available because the main IRP sender’s protection expires as soon as
you complete the main IRP. Your completion routine still has to return to the I/O Manager and therefore needs the protection
offered by this new routine.
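As a sketch of the dynamic-linking approach mentioned a moment ago, you might resolve the routine once and fall back to the ordinary call when it isn’t there. The typedef name is my own invention, and a real driver would check the NTSTATUS that IoSetCompletionRoutineEx returns and would cache the resolved pointer at DriverEntry time:

typedef NTSTATUS (*PFNIOSETCOMPLETIONROUTINEEX)(PDEVICE_OBJECT, PIRP,
    PIO_COMPLETION_ROUTINE, PVOID, BOOLEAN, BOOLEAN, BOOLEAN);

UNICODE_STRING name;
RtlInitUnicodeString(&name, L"IoSetCompletionRoutineEx");
PFNIOSETCOMPLETIONROUTINEEX pIoSetCompletionRoutineEx =
    (PFNIOSETCOMPLETIONROUTINEEX) MmGetSystemRoutineAddress(&name);

if (pIoSetCompletionRoutineEx)
    pIoSetCompletionRoutineEx(fdo, Irp, CompletionRoutine, pdx, TRUE, TRUE, TRUE);
else
    IoSetCompletionRoutine(Irp, CompletionRoutine, pdx, TRUE, TRUE, TRUE);   // no unload protection here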

Situation 3: IRP Issued from Your Own System Thread


The third situation in our analysis of completion routines occurs when a system thread you’ve created (see Chapter 14 for a
discussion of system threads) installs completion routines for IRPs it sends to other drivers. If you create a truly asynchronous
IRP in this situation, use IoSetCompletionRoutineEx to install the obligatory completion routine and make sure that your driver
can’t unload before the completion routine is actually called. You could, for example, claim an IO_REMOVE_LOCK that you
release in the completion routine. If you use scenario 8 from the cookbook at the end of this chapter to send a nominally
asynchronous IRP in a synchronous way, however, or if you use synchronous IRPs in the first place, there’s no particular
reason to use IoSetCompletionRoutineEx because you’ll presumably wait for these IRPs to finish before calling
PsTerminateSystemThread to end the thread. Some other function in your driver will be waiting for the thread to terminate
before allowing the operating system to finally unload your driver. This combination of protections makes it safe to use an
ordinary completion routine.

Situation 4: IRP Issued from a Work Item


Here I hope you’ll be using IoAllocateWorkItem and IoQueueWorkItem, which protect your driver from being unloaded until
the work item callback routine returns. As in the previous situation, you’ll want to use IoSetCompletionRoutineEx if you issue
an asynchronous IRP and don’t wait (as in scenario 8) for it to finish. Otherwise, you don’t need the new routine unless you
somehow return before the IRP completes, which would be against all the rules for IRP handling and not just the rules for
completion routines.


Situation 5: Synchronous or Asynchronous IRP for Some Other Purpose


Maybe you have some reason for issuing a synchronous IRP that is not in aid of an IRP that someone else has sent you and is
not issued from the context of your own system thread or a work item. I confess that I can’t think of a circumstance in which
you’d actually want to do this, but I think you’d basically be toast if you tried. Protecting your completion routine, if any,
probably helps a bit, but there’s no bulletproof way for you to guarantee that you’ll still be there when IoCallDriver returns. If
you think of a way, you’ll simply move the problem to after you do whatever it is you think of, at which point there has to be at
least a return instruction that will get executed without protection from outside your driver.
So don’t do this.

5.4 Queuing I/O Requests


Sometimes your driver receives an IRP that it can’t handle right away. Rather than reject the IRP by causing it to fail with an
error status, your dispatch routine places the IRP on a queue. In another part of your driver, you provide logic that removes one
IRP from the queue and passes it to a StartIo routine.
Queuing an IRP is conceptually very simple. You can provide a list anchor in your device extension, which you initialize in
your AddDevice function:

typedef struct _DEVICE_EXTENSION {
    ...
    LIST_ENTRY IrpQueue;
    BOOLEAN DeviceBusy;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

NTSTATUS AddDevice(...)
{
    ...
    InitializeListHead(&pdx->IrpQueue);
    ...

Then you can write two naive routines for queuing and dequeuing IRPs:

VOID NaiveStartPacket(PDEVICE_EXTENSION pdx, PIRP Irp)
{
    if (pdx->DeviceBusy)
        InsertTailList(&pdx->IrpQueue, &Irp->Tail.Overlay.ListEntry);
    else
    {
        pdx->DeviceBusy = TRUE;
        StartIo(pdx->DeviceObject, Irp);
    }
}

VOID NaiveStartNextPacket(PDEVICE_EXTENSION pdx)
{
    if (IsListEmpty(&pdx->IrpQueue))
        pdx->DeviceBusy = FALSE;
    else
    {
        PLIST_ENTRY foo = RemoveHeadList(&pdx->IrpQueue);
        PIRP Irp = CONTAINING_RECORD(foo, IRP, Tail.Overlay.ListEntry);
        StartIo(pdx->DeviceObject, Irp);
    }
}


Microsoft Queuing Routines


Apart from this sidebar, I’m omitting discussion of the functions IoStartPacket and IoStartNextPacket, which
have been part of Windows NT since the beginning. These functions implement a queuing model that’s
inappropriate for WDM drivers. In that model, a device is in one of three states: idle, busy with an empty queue,
or busy with a nonempty queue. If you call IoStartPacket at a time when the device is idle, it unconditionally
sends the IRP to your StartIo routine. Unfortunately, many times a WDM driver needs to queue an IRP even
though the device is idle. These functions also rely heavily on a global spin lock whose overuse has created a
serious performance bottleneck.

Just in case you happen to be working on an old driver that uses these obsolete routines, however, here’s how
they work. A dispatch routine would queue an IRP like this:

NTSTATUS DispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp)


{
IoMarkIrpPending(Irp);
IoStartPacket(fdo, Irp, NULL, CancelRoutine);
return STATUS_PENDING;
}

Your driver would have a single StartIo routine. Your DriverEntry routine would set the DriverStartIo field of the
driver object to point to this routine. If your StartIo routine completes IRPs, you would also call IoSetStartIoAttributes (in Windows XP or later) to help prevent excessive recursion into StartIo. IoStartPacket and
IoStartNextPacket call StartIo to process one IRP at a time. In other words, StartIo is the place where the I/O
manager serializes access to your hardware.

A DPC routine (see the later discussion of how DPC routines work) would complete the previous IRP and start
the next one using this code:

VOID DpcForIsr(PKDPC junk, PDEVICE_OBJECT fdo, PIRP Irp, PVOID morejunk)
{
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    IoStartNextPacket(fdo, TRUE);
}

To provide for canceling a queued IRP, you would need to write a cancel routine. Illustrating that and the cancel
logic in StartIo is beyond the scope of this book.

In addition, you can rely on the CurrentIrp field of a DEVICE_OBJECT to always contain NULL or the address of
the IRP most recently sent (by IoStartPacket or IoStartNextPacket) to your StartIo routine.

Then your dispatch routine calls NaiveStartPacket, and your DPC routine calls NaiveStartNextPacket in the manner discussed
earlier in connection with the standard model.
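In outline, the pieces fit together like this (a sketch only; as discussed earlier, the DPC routine has to identify the IRP it just finished by whatever means your queuing scheme provides):

NTSTATUS DispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    IoMarkIrpPending(Irp);
    NaiveStartPacket(pdx, Irp);
    return STATUS_PENDING;
}

VOID DpcForIsr(PKDPC junk1, PDEVICE_OBJECT fdo, PIRP junk2, PDEVICE_EXTENSION pdx)
{
    IoCompleteRequest(<the IRP just finished>, IO_NO_INCREMENT);
    NaiveStartNextPacket(pdx);
}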
There are many problems with this scheme, which is why I called it naive. The most basic problem is that your DPC routine
and multiple instances of your dispatch routine could all be simultaneously active on different CPUs. They would likely
conflict in trying to access the queue and the busy flag. You could address that problem by creating a spin lock and using it to
guard against the obvious races, as follows:

typedef struct _DEVICE_EXTENSION {
    ...
    LIST_ENTRY IrpQueue;
    KSPIN_LOCK IrpQueueLock;
    BOOLEAN DeviceBusy;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

NTSTATUS AddDevice(...)
{
    ...
    InitializeListHead(&pdx->IrpQueue);
    KeInitializeSpinLock(&pdx->IrpQueueLock);
    ...

VOID LessNaiveStartPacket(PDEVICE_EXTENSION pdx, PIRP Irp)
{
    KIRQL oldirql;
    KeAcquireSpinLock(&pdx->IrpQueueLock, &oldirql);
    if (pdx->DeviceBusy)
    {
        InsertTailList(&pdx->IrpQueue, &Irp->Tail.Overlay.ListEntry);
        KeReleaseSpinLock(&pdx->IrpQueueLock, oldirql);
    }
    else
    {
        pdx->DeviceBusy = TRUE;
        KeReleaseSpinLock(&pdx->IrpQueueLock, DISPATCH_LEVEL);
        StartIo(pdx->DeviceObject, Irp);
        KeLowerIrql(oldirql);
    }
}

VOID LessNaiveStartNextPacket(PDEVICE_EXTENSION pdx)
{
    KIRQL oldirql;
    KeAcquireSpinLock(&pdx->IrpQueueLock, &oldirql);
    if (IsListEmpty(&pdx->IrpQueue))
    {
        pdx->DeviceBusy = FALSE;
        KeReleaseSpinLock(&pdx->IrpQueueLock, oldirql);
    }
    else
    {
        PLIST_ENTRY foo = RemoveHeadList(&pdx->IrpQueue);
        KeReleaseSpinLock(&pdx->IrpQueueLock, DISPATCH_LEVEL);
        PIRP Irp = CONTAINING_RECORD(foo, IRP, Tail.Overlay.ListEntry);
        StartIo(pdx->DeviceObject, Irp);
        KeLowerIrql(oldirql);
    }
}

Incidentally, we always want to call StartIo at a single IRQL. Because DPC routines are among the callers of
LessNaiveStartNextPacket, and they run at DISPATCH_LEVEL, we pick DISPATCH_LEVEL. That means we want to stay at
DISPATCH_LEVEL when we release the spin lock.
(You did remember that these two queue management routines need to be in nonpaged memory because they run at
DISPATCH_LEVEL, right?)
These queuing routines are actually almost OK, but they have one more defect and a shortcoming. The shortcoming is that we
need a way to stall a queue for the duration of certain PnP and Power states. IRPs accumulate in a stalled queue until someone
unstalls the queue, whereupon the queue manager can resume sending IRPs to a StartIo routine. The defect in the “less naive”
set of routines is that someone could decide to cancel an IRP at essentially any time. IRP cancellation complicates IRP queuing
logic so much that I’ve devoted the next major section to discussing it. Before we get to that, though, let me explain how to use
the queuing routines that I crafted to deal with all the problems.

5.4.1 Using the DEVQUEUE Object


To solve a variety of IRP queuing problems, I created a package of subroutines for managing a queue object that I call a
DEVQUEUE. I’ll show you first the basic usage of a DEVQUEUE. Later in this chapter, I’ll explain how the major
DEVQUEUE service routines work. I’ll discuss in later chapters how your PnP and power management code interacts with the
DEVQUEUE object or objects you define.
You define a DEVQUEUE object for each queue of requests you’ll manage in the driver. For example, if your device manages
reads and writes in a single queue, you define one DEVQUEUE:

typedef struct _DEVICE_EXTENSION {
    ...
    DEVQUEUE dqReadWrite;
    ...
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

On the CD
Code for the DEVQUEUE is part of GENERIC.SYS. In addition, if you use my WDMWIZ to create a
skeleton driver and don’t ask for GENERIC.SYS support, your skeleton project will include the files
DEVQUEUE.CPP and DEVQUEUE.H, which fully implement exactly the same object. I don’t recommend trying to
type this code from the book because the code from the companion content will contain even more features
than I can describe in the book. I also recommend checking my Web site (www.oneysoft.com) for updates and
corrections.

Figure 5-8 illustrates the IRP processing logic for a typical driver using DEVQUEUE objects. Each DEVQUEUE has its own
StartIo routine, which you specify when you initialize the object in AddDevice:


NTSTATUS AddDevice(...)
{
    ...
    PDEVICE_EXTENSION pdx = ...;
    ...
    InitializeQueue(&pdx->dqReadWrite, StartIo);
    ...

Figure 5-8. IRP flow with a DEVQUEUE and a StartIo routine.


You can specify a common dispatch function for both IRP_MJ_READ and IRP_MJ_WRITE:

NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
{
    ...
    DriverObject->MajorFunction[IRP_MJ_READ] = DispatchReadWrite;
    DriverObject->MajorFunction[IRP_MJ_WRITE] = DispatchReadWrite;
    ...

#pragma PAGEDCODE

NTSTATUS DispatchReadWrite(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PAGED_CODE();
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    IoMarkIrpPending(Irp);
    StartPacket(&pdx->dqReadWrite, fdo, Irp, CancelRoutine);
    return STATUS_PENDING;
}

#pragma LOCKEDCODE

VOID CancelRoutine(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    CancelRequest(&pdx->dqReadWrite, Irp);
}

Note that the cancel argument to StartPacket is not optional: you must supply a cancel routine, but you can see how simple that
routine will be.
If you complete IRPs in a DPC routine, you’ll also call StartNextPacket:

VOID DpcForIsr(PKDPC junk1, PDEVICE_OBJECT fdo, PIRP junk2, PDEVICE_EXTENSION pdx)
{
    ...
    StartNextPacket(&pdx->dqReadWrite, fdo);
}

If you complete IRPs in your StartIo routine, schedule a DPC to make the call to StartNextPacket in order to avoid excessive recursion. For example:

typedef struct _DEVICE_EXTENSION {
    ...
    KDPC StartNextDpc;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

NTSTATUS AddDevice(...)
{
    ...
    KeInitializeDpc(&pdx->StartNextDpc,
        (PKDEFERRED_ROUTINE) StartNextDpcRoutine, pdx);
    ...

VOID StartIo(...)
{
    ...
    IoCompleteRequest(...);
    KeInsertQueueDpc(&pdx->StartNextDpc, NULL, NULL);
}

VOID StartNextDpcRoutine(PKDPC junk1, PDEVICE_EXTENSION pdx, PVOID junk2, PVOID junk3)
{
    StartNextPacket(&pdx->dqReadWrite, pdx->DeviceObject);
}

In this example, StartIo calls IoCompleteRequest to complete the IRP it has just handled. Calling StartNextPacket directly
might lead to a recursive call to StartIo. After enough recursive calls, we’ll run out of stack. To avoid the potential stack
overflow, we queue the StartNextDpc DPC object and return. Because StartIo runs at DISPATCH_LEVEL, it won’t be possible
for the DPC routine to be called before StartIo returns. Therefore, StartNextDpcRoutine can call StartNextPacket without
worrying about recursion.

NOTE
If you were using the Microsoft queue routines IoStartPacket and IoStartNextPacket, you’d have a single
StartIo routine. Your DriverEntry routine would set the DriverStartIo pointer in the driver object to the address
of this routine. To avoid the recursion problem discussed in the text in Windows XP or later, you could call
IoSetStartIoAttributes.
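For reference, such a call would be made from AddDevice once the device object exists. This is just a sketch (XP and later only), and the flag values shown are one reasonable choice rather than a requirement:

// DeferredStartIo avoids recursive calls into StartIo; FALSE for the second flag
// keeps the default cancellation behavior.
IoSetStartIoAttributes(fdo, TRUE /* DeferredStartIo */, FALSE /* NonCancelable */);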

5.4.2 Using Cancel-Safe Queues


Some drivers work better if they operate with a separate I/O thread. Such a thread wakes up each time there is an IRP to be
processed, processes IRPs until a queue is empty, and then goes back to sleep. I’ll discuss the details of how such a thread
routine works in Chapter 14, but this is the appropriate time to talk about how you can queue IRPs in such a driver. See Figure
5-9.
A DEVQUEUE isn’t appropriate for this situation because the DEVQUEUE wants to call a StartIo routine to process IRPs.
When you have a separate I/O thread, you want to be responsible in that thread for fetching IRPs. Microsoft provides a set of
routines for cancel-safe queue operations that provide most of the functionality you need. These routines don’t work
automatically with your PnP and Power logic, but I predict it won’t be hard to add such support. The Cancel sample in the
DDK shows how to work with a cancel-safe queue in exactly this situation, but I’ll go over the mechanics here as well.

NOTE
In their original incarnation, the cancel-safe queue functions weren’t appropriate when you wanted to use a
StartIo routine for actual I/O because they didn’t provide a way to set a CurrentIrp pointer and do a queue
operation inside one invocation of the queue lock. They were modified while I was writing this book to support
StartIo usage, but we didn’t have time to include an explanation of how to use the new features. I commend
you, therefore, to the DDK documentation.

Note also that the cancel-safe queue functions were first described in an XP release of the DDK. They are
implemented in a static library, however, and are therefore available for use on all prior platforms.


Figure 5-9. IRP flow with an I/O thread.

Initialization for Cancel-Safe Queue


To take advantage of the cancel-safe queue functions, first declare six helper functions (see Table 5-5) that the I/O Manager
can call to perform operations on your queue. Declare an instance of the IO_CSQ structure in your device extension structure.
Also declare an anchor for your IRP queue and whichever synchronization object you want to use. You initialize these objects
in your AddDevice function. For example:

typedef struct _DEVICE_EXTENSION {
    ...
    IO_CSQ IrpQueue;
    LIST_ENTRY IrpQueueAnchor;
    KSPIN_LOCK IrpQueueLock;
    ...
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

NTSTATUS AddDevice(PDRIVER_OBJECT DriverObject, PDEVICE_OBJECT pdo)
{
    ...
    KeInitializeSpinLock(&pdx->IrpQueueLock);
    InitializeListHead(&pdx->IrpQueueAnchor);
    IoCsqInitialize(&pdx->IrpQueue, InsertIrp, RemoveIrp,
        PeekNextIrp, AcquireLock, ReleaseLock, CompleteCanceledIrp);
    ...

Callback Routine        Purpose
AcquireLock             Acquire lock on the queue
CompleteCanceledIrp     Complete an IRP that has been cancelled
InsertIrp               Insert IRP into queue
PeekNextIrp             Retrieve pointer to next IRP in queue without removing it
ReleaseLock             Release queue lock
RemoveIrp               Remove IRP from queue

Table 5-5. Cancel-Safe Queue Callback Routines

Using the Queue


You queue an IRP in a dispatch routine like this:


NTSTATUS DispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    IoCsqInsertIrp(&pdx->IrpQueue, Irp, NULL);
    return STATUS_PENDING;
}

It’s unnecessary and incorrect to call IoMarkIrpPending yourself because IoCsqInsertIrp does so automatically. As is true with
other queuing schemes, the IRP might be complete by the time IoCsqInsertIrp returns, so don’t touch the pointer afterwards.
To remove an IRP from the queue (say, in your I/O thread), use this code:

PIRP Irp = IoCsqRemoveNextIrp(&pdx->IrpQueue, PeekContext);

I’ll describe the PeekContext argument a bit further on. Note that the return value is NULL if no IRPs are on the queue. The
IRP you get back hasn’t been cancelled, and any future call to IoCancelIrp is guaranteed to do nothing more than set the
Cancel flag in the IRP.
You’ll also want to provide a dispatch routine for IRP_MJ_CLEANUP that will interact with the queue. I’ll show you code for
that purpose a bit later in this chapter.
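For orientation, here is a sketch of the kind of I/O thread loop Figure 5-9 implies. The evRequest event and the ThreadShouldStop flag are assumptions about fields you’d add to the device extension (the dispatch routine would set evRequest after calling IoCsqInsertIrp); Chapter 14 covers the thread mechanics themselves:

VOID IoThread(PVOID context)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) context;
    while (TRUE)
    {
        KeWaitForSingleObject(&pdx->evRequest, Executive, KernelMode, FALSE, NULL);
        if (pdx->ThreadShouldStop)                  // hypothetical shutdown flag
            break;
        PIRP Irp;
        while ((Irp = IoCsqRemoveNextIrp(&pdx->IrpQueue, NULL)) != NULL)
        {
            <process the request against the hardware>
            Irp->IoStatus.Status = STATUS_SUCCESS;
            IoCompleteRequest(Irp, IO_NO_INCREMENT);
        }
    }
    PsTerminateSystemThread(STATUS_SUCCESS);
}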

Cancel-Safe Queue Callback Routines


The I/O Manager calls your cancel-safe queue callback routines with the address of the queue object as one of the arguments.
To recover the address of your device extension structure, use the CONTAINING_RECORD macro:

#define GET_DEVICE_EXTENSION(csq) \
CONTAINING_RECORD(csq, DEVICE_EXTENSION, IrpQueue)

You supply callback routines for acquiring and releasing the lock you’ve decided to use for your queue. For example, if you
had settled on using a spin lock, you’d write these two routines:

VOID AcquireLock(PIO_CSQ csq, PKIRQL Irql)
{
    PDEVICE_EXTENSION pdx = GET_DEVICE_EXTENSION(csq);
    KeAcquireSpinLock(&pdx->IrpQueueLock, Irql);
}

VOID ReleaseLock(PIO_CSQ csq, KIRQL Irql)
{
    PDEVICE_EXTENSION pdx = GET_DEVICE_EXTENSION(csq);
    KeReleaseSpinLock(&pdx->IrpQueueLock, Irql);
}

You don’t have to use a spin lock for synchronization, though. You can use a mutex, a fast mutex, or any other object that suits
your fancy.
When you call IoCsqInsertIrp, the I/O Manager locks your queue by calling your AcquireLock routine and then calls your
InsertIrp routine:

VOID InsertIrp(PIO_CSQ csq, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = GET_DEVICE_EXTENSION(csq);
    InsertTailList(&pdx->IrpQueueAnchor, &Irp->Tail.Overlay.ListEntry);
}

When you call IoCsqRemoveNextIrp, the I/O Manager locks your queue and calls your PeekNextIrp and RemoveIrp functions:

PIRP PeekNextIrp(PIO_CSQ csq, PIRP Irp, PVOID PeekContext)
{
    PDEVICE_EXTENSION pdx = GET_DEVICE_EXTENSION(csq);
    PLIST_ENTRY next = Irp
        ? Irp->Tail.Overlay.ListEntry.Flink
        : pdx->IrpQueueAnchor.Flink;
    while (next != &pdx->IrpQueueAnchor)
    {
        PIRP NextIrp = CONTAINING_RECORD(next, IRP, Tail.Overlay.ListEntry);
        if (PeekContext && <NextIrp matches PeekContext>)
            return NextIrp;
        if (!PeekContext)
            return NextIrp;
        next = next->Flink;
    }
    return NULL;
}

VOID RemoveIrp(PIO_CSQ csq, PIRP Irp)
{
    RemoveEntryList(&Irp->Tail.Overlay.ListEntry);
}

The parameters to PeekNextIrp require a bit of explanation. Irp, if not NULL, is the predecessor of the first IRP you should
look at. If Irp is NULL, you should look at the IRP at the front of the list. PeekContext is an arbitrary parameter that you can
use for any purpose you want as a way for the caller of IoCsqRemoveNextIrp to communicate with PeekNextIrp. A common
convention is to use this argument to point to a FILE_OBJECT that’s the current subject of an IRP_MJ_CLEANUP. I wrote
this function so that a NULL value for PeekContext means, “Return the next IRP, period.” A non-NULL value means, “Return
the next value that matches PeekContext.” You define what it means to “match” the peek context.
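For instance, a match test keyed to a FILE_OBJECT (the common convention just mentioned) might look like this fragment, which would replace the placeholder test shown above:

// Treat PeekContext as the FILE_OBJECT being cleaned up; this is a convention
// between the caller of IoCsqRemoveNextIrp and PeekNextIrp, not something the
// I/O Manager enforces.
if (PeekContext &&
    IoGetCurrentIrpStackLocation(NextIrp)->FileObject == (PFILE_OBJECT) PeekContext)
    return NextIrp;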
The sixth and last callback function is this one, which the I/O Manager calls when an IRP needs to be cancelled:

VOID CompleteCanceledIrp(PIO_CSQ csq, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = GET_DEVICE_EXTENSION(csq);
    Irp->IoStatus.Status = STATUS_CANCELLED;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
}

That is, all you do is complete the IRP with STATUS_CANCELLED.


To reiterate, the advantage you gain by using the cancel-safe queue functions is that you don’t need to write a cancel routine,
and you don’t need to include any code in your driver (apart from the CompleteCanceledIrp function, that is) that relates to
cancelling queued IRPs. The I/O Manager installs its own cancel routine, and it promises never to deliver a cancelled IRP back
from IoCsqRemoveNextIrp.

Parking an IRP on a Cancel-Safe Queue


The preceding sections described how you can use a cancel-safe queue to serialize I/O processing in a kernel thread. Another
way to use the cancel-safe queue functions is for parking IRPs while you process them. The idea is that you would place the
IRP into the queue when you first received it. Then, when it comes time to complete the IRP, you remove that specific IRP
from the queue. You’re not using the queue as a real queue in this scenario, because you don’t pay any attention to the order of
the IRPs in the queue.
To park an IRP, define a persistent context structure for use by the cancel-safe queue package. You need one such structure for
each separate IRP that you plan to park. Suppose, for example, that your driver processes “red” requests and “blue” requests
(fanciful names to avoid the baggage that real examples sometimes bring along with them).

typedef struct _DEVICE_EXTENSION {
    ...
    IO_CSQ_IRP_CONTEXT RedContext;
    IO_CSQ_IRP_CONTEXT BlueContext;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

When you receive a “red” IRP, you specify the context structure in your call to IoCsqInsertIrp:

IoCsqInsertIrp(&pdx->IrpQueue, RedIrp, &pdx->RedContext);

How to park a “blue” IRP should be pretty obvious.


When you later decide you want to complete a parked IRP, you write code like this:

PIRP RedIrp = IoCsqRemoveIrp(&pdx->IrpQueue, &pdx->RedContext);
if (RedIrp)
{
    RedIrp->IoStatus.Status = STATUS_XXX;
    RedIrp->IoStatus.Information = YYY;
    IoCompleteRequest(RedIrp, IO_NO_INCREMENT);
}

IoCsqRemoveIrp will return NULL if the IRP associated with the context structure has already been cancelled.
Bear in mind the following caveats when using this mechanism:
• It’s up to you to make sure that you haven’t previously parked an IRP using a particular context structure. IoCsqInsertIrp is a VOID function and therefore has no way to tell you when you violate this rule.
• You mustn’t touch an I/O buffer associated with a parked IRP because the IRP can be cancelled (and the I/O buffer released!) at any time while it’s parked. You should remove the IRP from the queue before trying to use a buffer.

5.5 Cancelling I/O Requests


Just as happens with people in real life, programs sometimes change their mind about the I/O requests they’ve asked you to
perform for them. We’re not talking about simple fickleness here. Applications might terminate after issuing requests that will
take a long time to complete, leaving requests outstanding. Such an occurrence is especially likely in the WDM world, where
the insertion of new hardware might require you to stall requests while the Configuration Manager rebalances resources or
where you might be told at any moment to power down your device.
To cancel a request in kernel mode, someone calls IoCancelIrp. The operating system automatically calls IoCancelIrp for
every IRP that belongs to a thread that’s terminating with requests still outstanding. A user-mode application can call CancelIo
to cancel all outstanding asynchronous operations issued by a given thread on a file handle. IoCancelIrp would like to simply
complete the IRP it’s given with STATUS_CANCELLED, but there’s a hitch: IoCancelIrp doesn’t know where you have salted
away pointers to the IRP, and it doesn’t know for sure whether you’re currently processing the IRP. So it relies on a cancel
routine you provide to do most of the work of cancelling an IRP.
It turns out that a call to IoCancelIrp is more of a suggestion than a demand. It would be nice if every IRP that somebody tried
to cancel really got completed with STATUS_CANCELLED. But it’s OK if a driver wants to go ahead and finish the IRP
normally if that can be done relatively quickly. You should provide a way to cancel I/O requests that might spend significant
time waiting in a queue between a dispatch routine and a StartIo routine. How long is significant is a matter for your own
sound judgment; my advice is to err on the side of providing for cancellation because it’s not that hard to do and makes your
driver fit better into the operating system.

5.5.1 If It Weren’t for Multitasking…


An intricate synchronization problem is associated with cancelling IRPs. Before I explain the problem and the solution, I want
to describe the way cancellation would work in a world where there was no multitasking and no concern with multiprocessor
computers. In that utopia, several pieces of the I/O Manager would fit together with your StartIo routine and with a cancel
routine you’d provide, as follows:
• When you queue an IRP, you set the CancelRoutine pointer in the IRP to the address of your cancel routine. When you dequeue the IRP, you set CancelRoutine to NULL.
• IoCancelIrp unconditionally sets the Cancel flag in the IRP. Then it checks to see whether the CancelRoutine pointer in the IRP is NULL. While the IRP is in your queue, CancelRoutine will be non-NULL. In this case, IoCancelIrp calls your cancel routine. Your cancel routine removes the IRP from the queue where it currently resides and completes the IRP with STATUS_CANCELLED.
• Once you dequeue the IRP, IoCancelIrp finds the CancelRoutine pointer set to NULL, so it doesn’t call your cancel routine. You process the IRP to completion with reasonable promptness (a concept that calls for engineering judgment), and it doesn’t matter to anyone that you didn’t actually cancel the IRP.
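In that utopia, the cancel routine itself would be only a few lines. This sketch assumes the naive LIST_ENTRY queue shown earlier; the one piece of real-world bookkeeping it keeps is releasing the global cancel spin lock, which (as described below) IoCancelIrp acquires before calling the routine:

VOID CancelRoutine(PDEVICE_OBJECT fdo, PIRP Irp)
{
    IoReleaseCancelSpinLock(Irp->CancelIrql);       // IoCancelIrp acquired this on our behalf
    RemoveEntryList(&Irp->Tail.Overlay.ListEntry);  // pull the IRP out of our queue
    Irp->IoStatus.Status = STATUS_CANCELLED;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
}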

5.5.2 Synchronizing Cancellation


Unfortunately for us as programmers, we write code for a multiprocessing, multitasking environment in which effects can
sometimes appear to precede causes. There are many possible race conditions between the queue insertion, queue removal, and
cancel routines in the naive scenario I just described. For example, what would happen if IoCancelIrp called your cancel
routine to cancel an IRP that happened to be at the head of your queue? If you were simultaneously removing an IRP from the
queue on another CPU, you can see that your cancel routine would probably conflict with your queue removal logic. But this is
just the simplest of the possible races.
In earlier times, driver programmers dealt with the cancel races by using a global spin lock—the cancel spin lock. Because you
shouldn’t use this spin lock for synchronization in your own driver, I’ve explained it briefly in the sidebar. Read the sidebar for
its historical perspective, but don’t plan to use this lock.


The Global Cancel Spin Lock


The original Microsoft scheme for synchronizing IRP cancellation revolved around a global cancel spin lock.
Routines named IoAcquireCancelSpinLock and IoReleaseCancelSpinLock acquire and release this lock. The
Microsoft queuing routines IoStartPacket and IoStartNextPacket acquire and release the lock to guard their
access to the cancel fields in an IRP and to the CurrentIrp field of the device object. IoCancelIrp acquires the
lock before calling your cancel routine but doesn’t release the lock. Your cancel routine runs briefly under the
protection of the lock and must call IoReleaseCancelSpinLock before returning.

In this scheme, your own StartIo routine must also acquire and release the cancel spin lock to safely test the
Cancel flag in the IRP and to reset the CancelRoutine pointer to NULL.

Hardly anyone was able to craft queuing and cancel logic that approached being bulletproof using this original
scheme. Even the best algorithms actually have a residual flaw arising from a coincidence in IRP pointer values.
In addition, the fact that every driver in the system needed to use a single spin lock two or three times in the
normal execution path created a measurable performance problem. Consequently, Microsoft now recommends
that drivers either use the cancel-safe queue routines or else copy someone else’s proven queue logic. Neither
Microsoft nor I would recommend that you try to design your own queue logic with cancellation because getting
it right is very hard.
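Purely as a historical illustration of that last paragraph, an old-style StartIo preamble under this scheme looked roughly like the following sketch. It is not code to copy; it exists only to show what every driver once had to do.

// Historical sketch only; new drivers should not touch the global cancel spin lock this way.
VOID OldStyleStartIo(PDEVICE_OBJECT fdo, PIRP Irp)
{
    KIRQL oldirql;
    IoAcquireCancelSpinLock(&oldirql);
    if (Irp->Cancel)
    {                                   // cancelled while it sat in the device queue
        IoReleaseCancelSpinLock(oldirql);
        Irp->IoStatus.Status = STATUS_CANCELLED;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
        IoStartNextPacket(fdo, TRUE);
        return;
    }
    IoSetCancelRoutine(Irp, NULL);      // we own the IRP from here on
    IoReleaseCancelSpinLock(oldirql);
    // ... program the hardware ...
}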

Nowadays, we handle the cancel races in one of two ways. We can implement our own IRP queue (or, more probably, cut and
paste someone else’s). Or, in certain kinds of drivers, we can use the IoCsqXxx family of functions. You don’t need to
understand how the IoCsqXxx functions handle IRP cancellation because Microsoft intends these functions to be a black box.
I’ll discuss in detail how my own DEVQUEUE handles cancellation, but I first need to tell you a bit more about the internal
workings of IoCancelIrp.
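If you go the IoCsqXxx route, the usage side really is that simple. Here is a minimal sketch of a dispatch routine, assuming the device extension contains an IO_CSQ named csq that was set up earlier with IoCsqInitialize and the six callback routines it requires:

NTSTATUS DispatchReadWrite(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    IoMarkIrpPending(Irp);
    IoCsqInsertIrp(&pdx->csq, Irp, NULL);   // cancellation is handled inside the black box
    return STATUS_PENDING;
}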

5.5.3 Some Details of IRP Cancellation


Here is a sketch of IoCancelIrp. You need to know this to correctly write IRP-handling code. (This isn’t a copy of the Windows
XP source code—it’s an abridged excerpt.)

BOOLEAN IoCancelIrp(PIRP Irp)
{
    IoAcquireCancelSpinLock(&Irp->CancelIrql);                        // 1
    Irp->Cancel = TRUE;                                               // 2
    PDRIVER_CANCEL CancelRoutine = IoSetCancelRoutine(Irp, NULL);     // 3
    if (CancelRoutine)
    {                                                                 // 4
        PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
        (*CancelRoutine)(stack->DeviceObject, Irp);
        return TRUE;
    }
    else
    {                                                                 // 5
        IoReleaseCancelSpinLock(Irp->CancelIrql);
        return FALSE;
    }
}

1. IoCancelIrp first acquires the global cancel spin lock. As you know if you read the sidebar earlier, lots of old drivers
contend for the use of this lock in their normal IRP-handling path. New drivers hold this lock only briefly while handling
the cancellation of an IRP.
2. Setting the Cancel flag to TRUE alerts any interested party that IoCancelIrp has been called for this IRP.
3. IoSetCancelRoutine performs an interlocked exchange to simultaneously retrieve the existing CancelRoutine pointer and
set the field to NULL in one atomic operation.
4. IoCancelIrp calls the cancel routine, if there is one, without first releasing the global cancel spin lock. The cancel routine
must release the lock! Note also that the device object argument to the cancel routine comes from the current stack
location, where IoCallDriver is supposed to have left it.
5. If there is no cancel routine, IoCancelIrp itself releases the global cancel spin lock. Good idea, huh?


5.5.4 How the DEVQUEUE Handles Cancellation


As I promised, I’ll now show you how the major DEVQUEUE routines work so you can see how they safely cope with IRP
cancellation.

DEVQUEUE Internals—Initialization
The DEVQUEUE object has this declaration in my DEVQUEUE.H and GENERIC.H header files:

typedef struct _DEVQUEUE {
    LIST_ENTRY head;
    KSPIN_LOCK lock;
    PDRIVER_STARTIO StartIo;
    LONG stallcount;
    PIRP CurrentIrp;
    KEVENT evStop;
    NTSTATUS abortstatus;
} DEVQUEUE, *PDEVQUEUE;

InitializeQueue initializes one of these objects like this:

VOID NTAPI InitializeQueue(PDEVQUEUE pdq, PDRIVER_STARTIO StartIo)
{
    InitializeListHead(&pdq->head);                               // 1
    KeInitializeSpinLock(&pdq->lock);                             // 2
    pdq->StartIo = StartIo;                                       // 3
    pdq->stallcount = 1;                                          // 4
    pdq->CurrentIrp = NULL;                                       // 5
    KeInitializeEvent(&pdq->evStop, NotificationEvent, FALSE);    // 6
    pdq->abortstatus = (NTSTATUS) 0;                              // 7
}

1. We use an ordinary (noninterlocked) doubly-linked list to queue IRPs. We don’t need to use an interlocked list because
we’ll always access it within the protection of our own spin lock.
2. This spin lock guards access to the queue and other fields in the DEVQUEUE structure. It also takes the place of the
global cancel spin lock for guarding nearly all of the cancellation process, thereby improving system performance.
3. Each queue has its own associated StartIo function that we call automatically in the appropriate places.
4. The stall counter indicates how many times somebody has requested that IRP delivery to StartIo be stalled. Initializing
the counter to 1 means that the IRP_MN_START_DEVICE handler must call RestartRequests to release an IRP. I’ll
discuss this issue more fully in Chapter 6.
5. The CurrentIrp field records the IRP most recently sent to the StartIo routine. Initializing this field to NULL indicates that
the device is initially idle.
6. We use this event when necessary to block WaitForCurrentIrp, one of the DEVQUEUE routines involved in handling
PnP requests. We’ll set the event inside StartNextPacket, which should always be called when the current IRP completes.
7. We reject incoming IRPs in two situations. The first situation occurs after we irrevocably commit to removing the device,
when we must start causing new IRPs to fail with STATUS_DELETE_PENDING. The second situation occurs during a
period of low power, when, depending on the type of device we’re managing, we might choose to cause new IRPs to fail
with the STATUS_DEVICE_POWERED_OFF code. The abortstatus field records the status code we should use in
rejecting IRPs in these situations.
In the steady state after all PnP initialization finishes, each DEVQUEUE will have a zero stallcount and abortstatus.

DEVQUEUE Internals—Queuing and Cancellation


Here is the complete implementation of the three DEVQUEUE routines whose usage I just showed you. I cut and pasted the


source code directly from GENERIC.SYS and did some minor formatting for the sake of readability on the printed page. I also
removed some power management code from StartNextPacket because it would just confuse this presentation.

VOID StartPacket(PDEVQUEUE pdq, PDEVICE_OBJECT fdo, PIRP Irp,
    PDRIVER_CANCEL cancel)
{
    KIRQL oldirql;
    KeAcquireSpinLock(&pdq->lock, &oldirql);                          // 1

    NTSTATUS abortstatus = pdq->abortstatus;
    if (abortstatus)
    {                                                                 // 2
        KeReleaseSpinLock(&pdq->lock, oldirql);
        Irp->IoStatus.Status = abortstatus;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
    }

    else if (pdq->CurrentIrp || pdq->stallcount)                      // 3
    {
        IoSetCancelRoutine(Irp, cancel);                              // 4
        if (Irp->Cancel && IoSetCancelRoutine(Irp, NULL))             // 5
        {
            KeReleaseSpinLock(&pdq->lock, oldirql);
            Irp->IoStatus.Status = STATUS_CANCELLED;
            IoCompleteRequest(Irp, IO_NO_INCREMENT);
        }
        else
        {
            InsertTailList(&pdq->head, &Irp->Tail.Overlay.ListEntry); // 6
            KeReleaseSpinLock(&pdq->lock, oldirql);
        }
    }

    else
    {
        pdq->CurrentIrp = Irp;                                        // 7
        KeReleaseSpinLockFromDpcLevel(&pdq->lock);
        (*pdq->StartIo)(fdo, Irp);
        KeLowerIrql(oldirql);
    }
}

VOID StartNextPacket(PDEVQUEUE pdq, PDEVICE_OBJECT fdo)
{
    KIRQL oldirql;
    KeAcquireSpinLock(&pdq->lock, &oldirql);                          // 8
    pdq->CurrentIrp = NULL;                                           // 9

    while (!pdq->stallcount && !pdq->abortstatus
        && !IsListEmpty(&pdq->head))                                  // 10
    {
        PLIST_ENTRY next = RemoveHeadList(&pdq->head);                // 11
        PIRP Irp = CONTAINING_RECORD(next, IRP, Tail.Overlay.ListEntry);

        if (!IoSetCancelRoutine(Irp, NULL))                           // 12
        {
            InitializeListHead(&Irp->Tail.Overlay.ListEntry);
            continue;
        }

        pdq->CurrentIrp = Irp;                                        // 13
        KeReleaseSpinLockFromDpcLevel(&pdq->lock);
        (*pdq->StartIo)(fdo, Irp);
        KeLowerIrql(oldirql);
        return;                               // the queue lock was already released above
    }
    KeReleaseSpinLock(&pdq->lock, oldirql);
}

VOID CancelRequest(PDEVQUEUE pdq, PIRP Irp)
{
    KIRQL oldirql = Irp->CancelIrql;
    IoReleaseCancelSpinLock(DISPATCH_LEVEL);                          // 14

    KeAcquireSpinLockAtDpcLevel(&pdq->lock);                          // 15
    RemoveEntryList(&Irp->Tail.Overlay.ListEntry);                    // 16
    KeReleaseSpinLock(&pdq->lock, oldirql);

    Irp->IoStatus.Status = STATUS_CANCELLED;                          // 17
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
}

Now I’ll explain in detail how these functions work together to provide cancel-safe queuing. I’ll do this by describing a series
of scenarios that involve all of the code paths.

1. The Normal Case for StartPacket


The normal case for StartPacket occurs in the steady state when an IRP, which we assume has not been cancelled, arrives after
all PnP processing has taken place and at a time when the device is fully powered. In this situation, stallcount and
abortstatus will both be 0. The path through StartPacket depends on whether the device is busy, as follows:
„ We first acquire the spin lock associated with the queue. (See point 1.) Nearly all the DEVQUEUE routines acquire this
lock (see points 8 and 15), so we can be sure that no other code on any other CPU can do anything to the queue that
would invalidate the decisions we’re about to make.
„ If the device is busy, the if statement at point 3 will find CurrentIrp not set to NULL. The if statement at point 5 will also
fail (I’ll explain later exactly why), so we’ll get to point 6 to put the IRP in the queue. Releasing the spin lock is the last
thing we do in this code path.
„ If the device is idle, the if statement at point 3 will find CurrentIrp set to NULL. I’ve already assumed that stallcount is 0,
so we’ll get to point 7 in order to process this IRP. Note how we manage to call StartIo at DISPATCH_LEVEL after
releasing the spin lock.

2. The Normal Case for StartNextPacket


The normal case for StartNextPacket is similar to that for StartPacket. The stallcount and abortstatus members are 0, and the
IRP at the head of the queue hasn’t been cancelled. StartNextPacket executes these steps:
„ Acquires the queue spin lock (point 8). This protects the queue from simultaneous access by other CPUs trying to execute
StartPacket or CancelRequest. No other CPU can be trying to execute StartNextPacket because the only caller of
StartNextPacket is someone who has just finished processing some other IRP. We allow only one IRP to be active at a
time, so there should never be more than one such entity.
„ If the list is empty, we just release the spin lock and return. If StartPacket had been waiting for the lock, it will now find
that the device isn’t busy and will call StartIo.
„ If the list isn’t empty, the if test at point 10 will succeed, and we’ll enter a loop looking for the next uncancelled IRP.
„ The first step in the loop (point 11) is to remove the next IRP from the list. Note that RemoveHeadList returns the address
of a LIST_ENTRY built into the IRP. We use CONTAINING_RECORD to get the address of the IRP.
„ IoSetCancelRoutine (point 12) will return the non-NULL address of the cancel routine originally supplied to StartPacket.
This is because nothing, least of all IoCancelIrp, has changed this pointer since StartPacket set it. Consequently, we’ll get
to point 13, where we’ll send this IRP to the StartIo routine at DISPATCH_LEVEL.

3. IRP Cancelled Prior to StartPacket; Device Idle


Suppose StartPacket receives an IRP that was cancelled some time ago. At the time IoCancelIrp executed, there wouldn’t have
been a cancel routine for the IRP. (If there had been, it would have belonged to a driver higher up the stack than us. That other
driver would have completed the IRP instead of sending it down to us.) All that IoCancelIrp would have done, therefore, is to
set the Cancel flag in the IRP.


If the device is idle, the if test at point 3 fails and we once again go directly to point 7, where we send the IRP to StartIo. In
effect, we’re going to ignore the Cancel flag. This is fine so long as we process the IRP “relatively quickly,” which is an
engineering judgment. If we won’t process the IRP with reasonable dispatch, StartIo and the downstream logic for handling the
IRP should have code to detect the Cancel flag and to complete the IRP early.

4. IRP Cancelled During StartPacket; Device Idle


In this scenario, someone calls IoCancelIrp while StartPacket is running. Just as in scenario 3, IoCancelIrp will set the Cancel
flag and return. We’ll ignore the flag and send the IRP to StartIo.

5. IRP Cancelled Prior to StartPacket; Device Busy


The initial conditions are the same as in scenario 3 except that now the device is busy and the if test at point 3 succeeds. We’ll
set the cancel routine (point 4) and then test the Cancel flag (point 5). Because the Cancel flag is TRUE, we’ll go on to call
IoSetCancelRoutine a second time. The function will return the non-NULL address we just installed, whereupon we’ll
complete the IRP with STATUS_CANCELLED.

6. IRP Cancelled During StartPacket; Device Busy


This is the first sticky wicket we encounter in the analysis. Assume the same initial conditions as scenario 3, but now the
device is busy and someone calls IoCancelIrp at about the same time StartPacket is running. There are several possible
situations now:
„ Suppose we test the Cancel flag (point 5) before IoCancelIrp manages to set that flag. Since we find the flag set to FALSE,
we go to point 6 and queue the IRP. What happens next depends on how IoCancelIrp, CancelRequest, and
StartNextPacket interact. StartPacket is in a not-my-problem field at this point, however, and needn’t worry about this
IRP any more.
„ Suppose we test the Cancel flag (point 5) after IoCancelIrp sets the flag. We have already set the cancel pointer (point 4).
What happens next depends on whether IoCancelIrp or we are first to execute the IoSetCancelRoutine call that changes
the cancel pointer back to NULL. Recall that IoSetCancelRoutine is an atomic operation based on an
InterlockedExchangePointer. If we execute our call first, we get back a non-NULL value and complete the IRP.
IoCancelIrp gets back NULL and therefore doesn’t call any cancel routine.
„ On the other hand, if IoCancelIrp executes its IoSetCancelRoutine first, we will get back NULL from our call. We’ll go
on to queue the IRP (point 6) and to enter that not-my-problem field I just referred to. IoCancelIrp will call our cancel
routine, which will block (point 15) until we release the queue spin lock. Our cancel routine will eventually complete the
IRP.

7. Normal IRP Cancellation


IRPs don’t get cancelled very often, so I’m not sure it’s really right to use the word normal in this context. But if there were a
normal scenario for IRP cancellation, this would be it: someone calls IoCancelIrp to cancel an IRP that’s in our queue, but the
cancel process runs to conclusion before StartNextPacket can possibly try to reach it. The potential race between
StartNextPacket and CancelRequest therefore can’t materialize. Events will unfold this way:
„ IoCancelIrp acquires the global cancel spin lock, sets the Cancel flag, and executes IoSetCancelRoutine to
simultaneously retrieve the address of our cancel routine and set the cancel pointer in the IRP to NULL. (Refer to the
earlier sketch of IoCancelIrp.)
„ IoCancelIrp calls our cancel routine without releasing the lock. The cancel routine locates the correct DEVQUEUE and
calls CancelRequest. CancelRequest immediately releases the global cancel spin lock (point 14).
„ CancelRequest acquires the queue spin lock (point 15). Past this point, there can be no more races with other
DEVQUEUE routines.
„ CancelRequest removes the IRP from the queue (point 16) and then releases the spin lock. If StartNextPacket were to run
now, it wouldn’t find this IRP on the queue.
„ CancelRequest completes the IRP with STATUS_CANCELLED (point 17).
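For concreteness, the driver-supplied cancel routine assumed in the second bullet is typically nothing more than a shim that locates the right DEVQUEUE; the dqReadWrite name below comes from the scenario 4 sample later in this chapter:

VOID CancelRoutine(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    CancelRequest(&pdx->dqReadWrite, Irp);   // releases the cancel spin lock, then completes the IRP
}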

8. Pathological IRP Cancellation


The most difficult IRP cancellation scenario to handle occurs when IoCancelIrp tries to cancel the IRP at the head of our queue
while StartNextPacket is active. At point 12, StartNextPacket will nullify the cancel pointer. If the return value from
IoSetCancelRoutine is not NULL, we’ve beaten IoCancelIrp to the punch and can go on to process the IRP (point 13).
If the return value from IoSetCancelRoutine is NULL, however, it means that IoCancelIrp has gotten there first. CancelRequest
is probably waiting right now on another CPU for us to release the queue spin lock, whereupon it will dequeue the IRP and
complete it. The trouble is, we’ve already removed the IRP from the queue. I’m a bit proud of the trick I devised for coping


with the situation: we simply initialize the linking field of the IRP as if it were the anchor of a list! The call to RemoveEntryList
at point 16 in CancelRequest will perform several motions with no net result to “remove” the IRP from the degenerate list it
now inhabits.

9. Things That Can’t Happen or Won’t Matter


The preceding list exhausts the possibilities for conflict between these DEVQUEUE routines and IoCancelIrp. (There is still a
race between IRP_MJ_CLEANUP and IRP cancellation, but I’ll discuss that a bit later in this chapter.) Here is a list of things
that might be causing you needless worry:
„ Could CancelRoutine be non-NULL when StartPacket gets control? It better not be, because a driver is supposed to
remove its cancel routine from an IRP before sending the IRP to another driver. StartPacket contains an ASSERT to this
effect. If you engage the Driver Verifier for your driver, it will verify that you nullify the cancel routine pointer in IRPs
that you pass down the stack, but it will not verify that the drivers above you have done this for IRPs they pass to you.
„ Could the cancel argument to StartPacket be NULL? It better not be: you might have noticed that much of the cancel
logic I described hinges on whether the IRP’s CancelRoutine pointer is NULL. StartPacket contains an ASSERT to test
this assumption.
„ Could someone call IoCancelIrp twice? The thing to think about is that the Cancel flag might be set in an IRP because of
some number of primeval calls to IoCancelIrp and that someone might call IoCancelIrp one more time (getting a little
impatient, are we?) while StartPacket is active. This wouldn’t matter because our first test of the Cancel flag occurs after
we install our cancel pointer. We would find the flag set to TRUE in this hypothetical situation and would therefore
execute the second call to IoSetCancelRoutine. Either IoCancelIrp or we win the race to reset the cancel pointer to NULL,
and whoever wins ends up completing the IRP. The residue from the primeval calls is simply irrelevant.

5.5.5 Cancelling IRPs You Create or Handle


Sometimes you’ll want to cancel an IRP that you’ve created or passed to another driver. Great care is required to avoid an
obscure, low-probability problem. Just for the sake of illustration, suppose you want to impose an overall 5-second timeout on
a synchronous I/O operation. If the time period elapses, you want to cancel the operation. Here is some naive code that, you
might suppose, would execute this plan:

SomeFunction()
{
    KEVENT event;
    IO_STATUS_BLOCK iosb;
    KeInitializeEvent(&event, ...);
    PIRP Irp = IoBuildSynchronousFsdRequest(..., &event, &iosb);
    NTSTATUS status = IoCallDriver(DeviceObject, Irp);
    if (status == STATUS_PENDING)
    {
        LARGE_INTEGER timeout;
        timeout.QuadPart = -5 * 10000000;
        if (KeWaitForSingleObject(&event, Executive, KernelMode,     // call A
            FALSE, &timeout) == STATUS_TIMEOUT)
        {
            IoCancelIrp(Irp);                                        // <== don't do this!
            KeWaitForSingleObject(&event, Executive, KernelMode,     // call B
                FALSE, NULL);
        }
    }
}

The first call (A) to KeWaitForSingleObject waits until one of two things happens. First, someone might complete the IRP, and
the I/O Manager’s cleanup code will then run and signal event.
Alternatively, the timeout might expire before anyone completes the IRP. In this case, KeWaitForSingleObject will return
STATUS_TIMEOUT. The IRP should now be completed quite soon in one of two paths. The first completion path is taken
when whoever was processing the IRP was really just about done when the timeout happened and has, therefore, already called
(or will shortly call) IoCompleteRequest. The other completion path is through the cancel routine that, we must assume, the
lower driver has installed. That cancel routine should complete the IRP. Recall that we have to trust other kernel-mode
components to do their jobs, so we have to rely on whomever we sent the IRP to complete it soon. Whichever path is taken, the
I/O Manager’s completion logic will set event and store the IRP’s ending status in iosb. The second call (B) to
KeWaitForSingleObject makes sure that the event and iosb objects don’t pass out of scope too soon. Without that second call,
we might return from this function, thereby effectively deleting event and iosb. The I/O Manager might then end up walking on
memory that belongs to some other subroutine.


The window for the problem in the preceding code is truly minuscule. Imagine that someone manages to call IoCompleteRequest for this
IRP at about the same time we decide to cancel it by calling IoCancelIrp. Maybe the operation finishes shortly after the
5-second timeout terminates the first KeWaitForSingleObject, for example. IoCompleteRequest initiates a process that finishes
with a call to IoFreeIrp. If the call to IoFreeIrp were to happen before IoCancelIrp was done mucking about with the IRP, you
can see that IoCancelIrp could inadvertently corrupt memory when it touched the CancelIrql, Cancel, and CancelRoutine fields
of the IRP. It’s also possible, depending on the exact sequence of events, for IoCancelIrp to call a cancel routine, just before
someone clears the CancelRoutine pointer in preparation for completing the IRP, and for the cancel routine to be in a race with
the completion process.
It’s very unlikely that the scenario I just described will happen. But, as someone (James Thurber?) once said in connection with
the chances of being eaten by a tiger on Main Street (one in a million, as I recall), “Once is enough.” This kind of bug is almost
impossible to find, so you want to prevent it if you can. I’ll show you two ways of cancelling your own IRPs. One way is
appropriate for synchronous IRPs, the other for asynchronous IRPs.

Don’t Do This…
A once common but now deprecated technique for avoiding the tiger-on-main-street bug described in the text
relies on the fact that, in earlier versions of Windows, the call to IoFreeIrp happened in the context of an APC
in the thread that originates the IRP. You could make sure you were in that same thread, raise IRQL to
APC_LEVEL, check whether the IRP had been completed yet, and (if not) call IoCancelIrp. You could be sure of
blocking the APC and the problematic call to IoFreeIrp.

You shouldn’t rely on future releases of Windows always using an APC to perform the cleanup for a synchronous
IRP. Consequently, you shouldn’t rely on boosting IRQL to APC_LEVEL as a way to avoid a race between
IoCancelIrp and IoFreeIrp.

Cancelling Your Own Synchronous IRP


Refer to the example in the preceding section, which illustrates a function that creates a synchronous IRP, sends it to another
driver, and then wants to wait no longer than 5 seconds for the IRP to complete. The key thing we need to accomplish in a
solution to the race between IoFreeIrp and IoCancelIrp is to prevent the call to IoFreeIrp from happening until after any
possible call to IoCancelIrp. We do this by means of a completion routine that returns
STATUS_MORE_PROCESSING_REQUIRED, as follows:

SomeFunction()
{
    KEVENT event;
    IO_STATUS_BLOCK iosb;
    KeInitializeEvent(&event, ...);
    PIRP Irp = IoBuildSynchronousFsdRequest(..., &event, &iosb);
    IoSetCompletionRoutine(Irp, OnComplete, (PVOID) &event, TRUE, TRUE, TRUE);
    NTSTATUS status = IoCallDriver(...);
    if (status == STATUS_PENDING)
    {
        LARGE_INTEGER timeout;
        timeout.QuadPart = -5 * 10000000;
        if (KeWaitForSingleObject(&event, Executive, KernelMode,     // call A
            FALSE, &timeout) == STATUS_TIMEOUT)
        {
            IoCancelIrp(Irp);                                        // <== okay in this context
            KeWaitForSingleObject(&event, Executive, KernelMode,     // call B
                FALSE, NULL);
        }
    }
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
}

NTSTATUS OnComplete(PDEVICE_OBJECT junk, PIRP Irp, PVOID pev)
{
    if (Irp->PendingReturned)
        KeSetEvent((PKEVENT) pev, IO_NO_INCREMENT, FALSE);
    return STATUS_MORE_PROCESSING_REQUIRED;
}

The new code (the IoSetCompletionRoutine call, the OnComplete routine, and the final IoCompleteRequest call) prevents the race. Suppose IoCallDriver returns STATUS_PENDING. In a normal case, the operation
will complete normally, and a lower-level driver will call IoCompleteRequest. Our completion routine gains control and signals
the event on which our mainline is waiting. Because the completion routine returns
STATUS_MORE_PROCESSING_REQUIRED, IoCompleteRequest will then stop working on this IRP. We eventually regain


control in our SomeFunction and notice that our wait (the one labeled A) terminated normally. The IRP hasn’t yet been cleaned
up, though, so we need to call IoCompleteRequest a second time to trigger the normal cleanup mechanism.
Now suppose we decide we want to cancel the IRP and that Thurber’s tiger is loose so we have to worry about a call to
IoFreeIrp releasing the IRP out from under us. Our first wait (labeled A) finishes with STATUS_TIMEOUT, so we perform a
second wait (labeled B). Our completion routine sets the event on which we’re waiting. It will also prevent the cleanup
mechanism from running by returning STATUS_MORE_PROCESSING_REQUIRED. IoCancelIrp can stomp away to its
heart’s content on our hapless IRP without causing any harm. The IRP can’t be released until the second call to
IoCompleteRequest from our mainline, and that can’t happen until IoCancelIrp has safely returned.
Notice that the completion routine in this example calls KeSetEvent only when the IRP’s PendingReturned flag is set to
indicate that the lower driver’s dispatch routine returned STATUS_PENDING. Making this step conditional is an optimization
that avoids the potentially expensive step of setting the event when SomeFunction won’t be waiting on the event in the first
place.
I want to mention one last fine point in connection with the preceding code. The call to IoCompleteRequest at the very end of
the subroutine will trigger a process that includes setting event and iosb so long as the IRP originally completed with a success
status. In the first edition, I had an additional call to KeWaitForSingleObject at this point to make sure that event and iosb
could not pass out of scope before the I/O Manager was done touching them. A reviewer pointed out that the routine that
references event and iosb will already have run by the time IoCompleteRequest returns; consequently, the additional wait is not
needed.

Cancelling Your Own Asynchronous IRP


To safely cancel an IRP that you’ve created with IoAllocateIrp or IoBuildAsynchronousFsdRequest, you can follow this
general plan. First define a couple of extra fields in your device extension structure:

typedef struct _DEVICE_EXTENSION {


PIRP TheIrp;
ULONG CancelFlag;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

Initialize these fields just before you call IoCallDriver to launch the IRP:

pdx->TheIrp = Irp;
pdx->CancelFlag = 0;
IoSetCompletionRoutine(Irp,
(PIO_COMPLETION_ROUTINE) CompletionRoutine,
(PVOID) pdx, TRUE, TRUE, TRUE);
IoCallDriver(..., Irp);

If you decide later on that you want to cancel this IRP, do something like the following:

VOID CancelTheIrp(PDEVICE_EXTENSION pdx)
{
    PIRP Irp = (PIRP) InterlockedExchangePointer((PVOID*) &pdx->TheIrp, NULL);   // 1
    if (Irp)
    {
        IoCancelIrp(Irp);                                                        // 2
        if (InterlockedExchange(&pdx->CancelFlag, 1))
            IoFreeIrp(Irp);                                                      // 3
    }
}

This function dovetails with the completion routine you install for the IRP:

NTSTATUS CompletionRoutine(PDEVICE_OBJECT junk, PIRP Irp,
    PDEVICE_EXTENSION pdx)
{
    if (InterlockedExchangePointer((PVOID*) &pdx->TheIrp, NULL)                  // 4
        || InterlockedExchange(&pdx->CancelFlag, 1))                             // 5
        IoFreeIrp(Irp);                                                          // 6
    return STATUS_MORE_PROCESSING_REQUIRED;
}

The basic idea underlying this deceptively simple code is that whichever routine sees the IRP last (either CompletionRoutine or
CancelTheIrp) will make the requisite call to IoFreeIrp, at point 3 or 6. Here’s how it works:
„ The normal case occurs when you don’t ever try to cancel the IRP. Whoever you sent the IRP to eventually completes it,
and your completion routine gets control. The first InterlockedExchangePointer (point 4) returns the non-NULL address
of the IRP. Since this is not 0, the || operator short-circuits the evaluation of the Boolean expression, and we go on to call
IoFreeIrp. Any subsequent call to CancelTheIrp will find the IRP pointer set to NULL at point 1 and won’t do anything
else.
„ Another easy case to analyze occurs when CancelTheIrp is called long before anyone gets around to completing this IRP,
which means that we don’t have any actual race. At point 1, we nullify the TheIrp pointer. Because the IRP pointer was
previously not NULL, we go ahead and call IoCancelIrp. In this situation, our call to IoCancelIrp will cause somebody to
complete the IRP reasonably soon, and our completion routine runs. It sees TheIrp as NULL and goes on to evaluate the
second half of the Boolean expression. Whoever executes the InterlockedExchange on CancelFlag first will get back 0
and skip calling IoFreeIrp. Whoever executes it second will get back 1 and will call IoFreeIrp.
„ Now for the case we were worried about: suppose someone is completing the IRP right about the time CancelTheIrp
wants to cancel it. The worst that can happen is that our completion routine runs before we manage to call IoCancelIrp.
The completion routine sees TheIrp as NULL and therefore exchanges CancelFlag with 1. Just as in the previous case,
the routine will get 0 as the return value and skip the IoFreeIrp call. IoCancelIrp can safely operate on the IRP. (It will
presumably just return without calling a cancel routine because whoever completed this IRP will undoubtedly have set
the CancelRoutine pointer to NULL first.)
The appealing thing about the technique I just showed you is its elegance: we rely solely on interlocked operations and
therefore don’t need any potentially expensive synchronization primitives.

Cancelling Someone Else’s IRP


To round out our discussion of IRP cancellation, suppose someone sends you an IRP that you then forward to another driver.
Situations might arise where you’d like to cancel that IRP. For example, perhaps you need that IRP out of the way so you can
proceed with a power-down operation. Or perhaps you’re waiting synchronously for the IRP to finish and you’d like to impose
a timeout as in the first example of this section.
To avoid the IoCancelIrp/IoFreeIrp race, you need to have your own completion routine in place. The details of the coding
then depend on whether you’re waiting for the IRP.

Canceling Someone Else’s IRP on Which You’re Waiting


Suppose your dispatch function passes down an IRP and waits synchronously for it to complete. (See usage scenario 7 at the
end of this chapter for the cookbook version.) Use code like this to cancel the IRP if it doesn’t finish quickly enough to suit
you:

NTSTATUS DispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp)


{
PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
KEVENT event;
KeInitializeEvent(&event, NotificationEvent, FALSE);
IoSetCompletionRoutine(Irp, OnComplete, (PVOID) &event, TRUE, TRUE, TRUE);
NTSTATUS status = IoCallDriver(...);
if (status == STATUS_PENDING)
{
LARGE_INTEGER timeout;
timeout.QuadPart = -5 * 10000000;
if (KeWaitForSingleObject(&event, Executive, KernelMode,
FALSE, &timeout) == STATUS_TIMEOUT)
{
IoCancelIrp(Irp);
KeWaitForSingleObject(&event, Executive, KernelMode,
FALSE, NULL);
}
}
status = Irp->IoStatus.Status;
IoCompleteRequest(Irp, IO_NO_INCREMENT);
return status;
}


NTSTATUS OnComplete(PDEVICE_OBJECT junk, PIRP Irp, PVOID pev)


{
if (Irp->PendingReturned)
KeSetEvent((PKEVENT) pev, IO_NO_INCREMENT, FALSE);
return STATUS_MORE_PROCESSING_REQUIRED;
}

This code is almost the same as what I showed earlier for canceling your own synchronous IRP. The only difference is that this
example involves a dispatch routine, which must return a status code. As in the earlier example, we install our own completion
routine to prevent the completion process from running to its ultimate conclusion before we get past the point where we might
call IoCancelIrp.
You might notice that I didn’t say anything about whether the IRP itself was synchronous or asynchronous. This is because the
difference between the two types of IRP only matters to the driver that creates them in the first place. File system drivers must
make distinctions between synchronous and asynchronous IRPs with respect to how they call the system cache manager, but
device drivers don’t typically have this complication. What matters to a lower-level driver is whether it’s appropriate to block a
thread in order to handle an IRP synchronously, and that depends on the current IRQL and whether you’re in an arbitrary or a
nonarbitrary thread.

Canceling Someone Else’s IRP on Which You’re Not Waiting


Suppose you’ve forwarded somebody else’s IRP to another driver, but you weren’t planning to wait for it to complete. For
whatever reason, you decide later on that you’d like to cancel that IRP.

typedef struct _DEVICE_EXTENSION {


PIRP TheIrp;
ULONG CancelFlag;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

NTSTATUS DispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp)


{
PDEVICE_EXTENSION pdx =
(PDEVICE_EXTENSION) fdo->DeviceExtension;
IoCopyCurrentIrpStackLocationToNext(Irp);
IoSetCompletionRoutine(Irp, (PIO_COMPLETION_ROUTINE) OnComplete,
(PVOID) pdx,
TRUE, TRUE, TRUE);
pdx->CancelFlag = 0;
pdx->TheIrp = Irp;
IoMarkIrpPending(Irp);
IoCallDriver(pdx->LowerDeviceObject, Irp);
return STATUS_PENDING;
}

VOID CancelTheIrp(PDEVICE_EXTENSION pdx)


{
PIRP Irp = (PIRP) InterlockedExchangePointer(
(PVOID*) &pdx->TheIrp, NULL);
if (Irp)
{
IoCancelIrp(Irp);
if (InterlockedExchange(&pdx->CancelFlag, 1))
IoCompleteRequest(Irp, IO_NO_INCREMENT);
}
}

NTSTATUS OnComplete(PDEVICE_OBJECT fdo, PIRP Irp,
    PDEVICE_EXTENSION pdx)
{
    if (InterlockedExchangePointer((PVOID*) &pdx->TheIrp, NULL)
        || InterlockedExchange(&pdx->CancelFlag, 1))
        return STATUS_SUCCESS;
    return STATUS_MORE_PROCESSING_REQUIRED;
}

This code is similar to the code I showed earlier for cancelling your own asynchronous IRP. Here, however, allowing
IoCompleteRequest to finish completing the IRP takes the place of the call to IoFreeIrp we made when we were dealing with
our own IRP. If the completion routine is last on the scene, it returns STATUS_SUCCESS to allow IoCompleteRequest to finish
completing the IRP. If CancelTheIrp is last on the scene, it calls IoCompleteRequest to resume the completion processing that


the completion routine short-circuited by returning STATUS_MORE_PROCESSING_REQUIRED.


One extremely subtle point regarding this example is the call to IoMarkIrpPending in the dispatch routine. Ordinarily, it would
be safe to just do this step conditionally in the completion routine, but not this time. If we should happen to call CancelTheIrp
in the context of some thread other than the one in which the dispatch routine runs, the pending flag is needed so that
IoCompleteRequest will schedule an APC to clean up the IRP in the proper thread. The easiest way to ensure that is simply to
mark the IRP pending every time.

5.5.6 Handling IRP_MJ_CLEANUP


Closely allied to the subject of IRP cancellation is the I/O request with the major function code IRP_MJ_CLEANUP. To
explain how you should process this request, I need to give you a little additional background.
When applications and other drivers want to access your device, they first open a handle to the device. Applications call
CreateFile to do this; drivers call ZwCreateFile. Internally, these functions create a kernel file object and send it to your driver
in an IRP_MJ_CREATE request. When the entity that opened the handle is done accessing your driver, it will call another
function, such as CloseHandle or ZwClose. Internally, these functions send your driver an IRP_MJ_CLOSE request. Just
before sending you the IRP_MJ_CLOSE, however, the I/O Manager sends you an IRP_MJ_CLEANUP so that you can cancel
any IRPs that belong to the same file object but that are still sitting in one of your queues. From the perspective of your driver,
the one thing all the requests have in common is that the stack location you receive points to the same file object in every
instance.
Figure 5-10 illustrates your responsibility when you receive IRP_MJ_CLEANUP. You should run through your queues of IRPs,
removing those that are tagged as belonging to the same file object. You should complete those IRPs with
STATUS_CANCELLED.

Figure 5-10. Driver responsibility for IRP_MJ_CLEANUP.

File Objects
Ordinarily, just one driver (the function driver, in fact) in a device stack implements all three of the following
requests: IRP_MJ_CREATE, IRP_MJ_CLOSE, and IRP_MJ_CLEANUP. The I/O Manager creates a file object (a
regular kernel object) and passes it in the I/O stack to the dispatch routines for all three of these IRPs. Anybody
who sends an IRP to a device should have a pointer to the same file object and should insert that pointer into
the I/O stack as well. The driver that handles these three IRPs acts as the owner of the file object in some sense,
in that it’s the driver that’s entitled to use the FsContext and FsContext2 fields of the object. So your
DispatchCreate routine can put something into one of these context fields for use by other dispatch routines and
for eventual cleanup by your DispatchClose routine.
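To make the sidebar concrete, here is one plausible way to use FsContext. The OPEN_CONTEXT structure is hypothetical, invented for this sketch; only the FsContext convention itself comes from the operating system.

typedef struct _OPEN_CONTEXT {
    ULONG OpenFlags;                       // whatever per-handle state you need
} OPEN_CONTEXT, *POPEN_CONTEXT;

NTSTATUS DispatchCreate(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
    POPEN_CONTEXT ctx = (POPEN_CONTEXT)
        ExAllocatePoolWithTag(NonPagedPool, sizeof(OPEN_CONTEXT), 'xtpO');
    if (!ctx)
        return CompleteRequest(Irp, STATUS_INSUFFICIENT_RESOURCES, 0);
    ctx->OpenFlags = 0;
    stack->FileObject->FsContext = ctx;    // we own this field for our own device
    return CompleteRequest(Irp, STATUS_SUCCESS, 0);
}

NTSTATUS DispatchClose(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
    if (stack->FileObject->FsContext)
        ExFreePool(stack->FileObject->FsContext);
    return CompleteRequest(Irp, STATUS_SUCCESS, 0);
}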

It’s easy to get confused about IRP_MJ_CLEANUP. In fact, programmers who have a hard time understanding IRP
cancellation sometimes decide (incorrectly) to just ignore this IRP. You need both cancel and cleanup logic in your driver,
though:
„ IRP_MJ_CLEANUP means a handle is being closed. You should purge all the IRPs that pertain to that handle.
„ The I/O Manager and other drivers cancel individual IRPs for a variety of reasons that have nothing to do with closing
handles.
„ One of the times the I/O Manager cancels IRPs is when a thread terminates. Threads often terminate because their parent
process is terminating, and the I/O Manager will also automatically close all handles that are still open when a process


terminates. The coincidence between this kind of cancellation and the automatic handle closing contributes to the
incorrect idea that a driver can get by with support for just one concept.
In this book, I’ll show you two ways of painlessly implementing support for IRP_MJ_CLEANUP, depending on whether
you’re using one of my DEVQUEUE objects or one of Microsoft’s cancel-safe queues.

5.5.7 Cleanup with a DEVQUEUE


If you’ve used a DEVQUEUE to queue IRPs, your IRP_MJ _CLEANUP routine will be astonishingly simple:

NTSTATUS DispatchCleanup(PDEVICE_OBJECT fdo, PIRP Irp)


{
PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
PFILE_OBJECT fop = stack->FileObject;
CleanupRequests(&pdx->dqReadWrite, fop, STATUS_CANCELLED);
return CompleteRequest(Irp, STATUS_SUCCESS, 0);
}

CleanupRequests will remove all IRPs from the queue that belong to the same file object and will complete those IRPs with
STATUS_CANCELLED. Note that you complete the IRP_MJ_CLEANUP request itself with STATUS_SUCCESS.
CleanupRequests contains a wealth of detail:

VOID CleanupRequests(PDEVQUEUE pdq, PFILE_OBJECT fop, NTSTATUS status)
{
    LIST_ENTRY cancellist;                                            // 1
    InitializeListHead(&cancellist);
    KIRQL oldirql;
    KeAcquireSpinLock(&pdq->lock, &oldirql);
    PLIST_ENTRY first = &pdq->head;
    PLIST_ENTRY next;

    for (next = first->Flink; next != first; )                        // 2
    {
        PIRP Irp = CONTAINING_RECORD(next, IRP, Tail.Overlay.ListEntry);
        PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp); // 3
        PLIST_ENTRY current = next;
        next = next->Flink;                                           // 4
        if (fop && stack->FileObject != fop)
            continue;

        if (!IoSetCancelRoutine(Irp, NULL))                           // 5
            continue;

        RemoveEntryList(current);                                     // 6
        InsertTailList(&cancellist, current);
    }

    KeReleaseSpinLock(&pdq->lock, oldirql);                           // 7
    while (!IsListEmpty(&cancellist))
    {
        next = RemoveHeadList(&cancellist);
        PIRP Irp = CONTAINING_RECORD(next, IRP, Tail.Overlay.ListEntry);
        Irp->IoStatus.Status = status;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
    }
}

1. Our strategy will be to move the IRPs that need to be cancelled into a private queue under protection of the queue’s spin
lock. Hence, we initialize the private queue and acquire the spin lock before doing anything else.
2. This loop traverses the entire queue until we return to the list head. Notice the absence of a loop increment step—the
third clause in the for statement. I’ll explain in a moment why it’s desirable to have no loop increment.


3. If we’re being called to help out with IRP_MJ_CLEANUP, the fop argument is the address of a file object that’s about to
be closed. We’re supposed to isolate the IRPs that pertain to the same file object, which requires us to first find the stack
location.
4. If we decide to remove this IRP from the queue, we won’t thereafter have an easy way to find the next IRP in the main
queue. We therefore perform the loop increment step here.
5. This especially clever statement comes to us courtesy of Jamie Hanrahan. We need to worry that someone might be trying
to cancel the IRP that we’re currently looking at during this iteration. They could get only as far as the point where
CancelRequest tries to acquire the spin lock. Before getting that far, however, they necessarily had to execute the
statement inside IoCancelIrp that nullifies the cancel routine pointer. If we find that pointer set to NULL when we call
IoSetCancelRoutine, therefore, we can be sure that someone really is trying to cancel this IRP. By simply skipping the
IRP during this iteration, we allow the cancel routine to complete it later on.
6. Here’s where we take the IRP out of the main queue and put it in the private queue instead.
7. Once we finish moving IRPs into the private queue, we can release our spin lock. Then we cancel all the IRPs we moved.

5.5.8 Cleanup with a Cancel-Safe Queue


To easily clean up IRPs that you’ve queued by calling IoCsqInsertIrp, simply adopt the convention that the peek context
parameter you use with IoCsqRemoveNextIrp, if not NULL, will be the address of a FILE_OBJECT. Your IRP_MJ_CLEANUP
routine will look like this (compare with the Cancel sample in the DDK):

NTSTATUS DispatchCleanup(PDEVICE_OBJECT fdo, PIRP Irp)


{
PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
PFILE_OBJECT fop = stack->FileObject;
PIRP qirp;
while ((qirp = IoCsqRemoveNextIrp(&pdx->csq, fop)))
CompleteRequest(qirp, STATUS_CANCELLED, 0);
return CompleteRequest(Irp, STATUS_SUCCESS, 0);
}

Implement your PeekNextIrp callback routine this way:

PIRP PeekNextIrp(PIO_CSQ csq, PIRP Irp, PVOID PeekContext)


{
PDEVICE_EXTENSION pdx = GET_DEVICE_EXTENSION(csq);
PLIST_ENTRY next = Irp ? Irp->Tail.Overlay.ListEntry.Flink
: pdx->IrpQueueAnchor.Flink;
while (next != &pdx->IrpQueueAnchor)
{
PIRP NextIrp = CONTAINING_RECORD(next, IRP,
Tail.Overlay.ListEntry);
PIO_STACK_LOCATION stack =
IoGetCurrentIrpStackLocation(NextIrp);
if (!PeekContext || (PFILE_OBJECT) PeekContext == stack->FileObject)
return NextIrp;
next = next->Flink;
}
return NULL;
}

5.6 Summary—Eight IRP-Handling Scenarios


Notwithstanding the length of the preceding explanations, IRP handling is actually quite easy. By my reckoning, only eight
significantly different scenarios are in common use, and the code required to handle those scenarios is pretty simple. In this
final section of this chapter, I’ve assembled some pictures and code samples to help you sort out all the theoretical knowledge.
Because this section is intended as a cookbook that you can use without completely understanding every last nuance, I’ve
included calls to the remove lock functions that I’ll discuss in detail in Chapter 6. I’ve also used the shorthand
IoSetCompletionRoutine[Ex] to indicate places where you ought to call IoSetCompletionRoutineEx, in a system where it’s
available, to install a completion routine. I’ve also used an overloaded version of my CompleteRequest helper routine that
doesn’t change IoStatus.Information in these examples because that would be correct for IRP_MJ_PNP and not incorrect for
other types of IRP.
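For reference, a plausible sketch of that helper follows; the real GENERIC.SYS version may differ in detail, but the examples depend only on this behavior:

NTSTATUS CompleteRequest(PIRP Irp, NTSTATUS status, ULONG_PTR info)
{
    Irp->IoStatus.Status = status;
    Irp->IoStatus.Information = info;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return status;
}

NTSTATUS CompleteRequest(PIRP Irp, NTSTATUS status)   // overload that leaves Information alone
{
    Irp->IoStatus.Status = status;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return status;
}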


5.6.1 Scenario 1—Pass Down with Completion Routine


In this scenario, someone sends you an IRP. You’ll forward this IRP to the lower driver in your PnP stack, and you’ll do some
postprocessing in a completion routine. See Figure 5-11. Adopt this strategy when all of the following are true:
„ Someone is sending you an IRP (as opposed to you creating the IRP yourself).
„ The IRP might arrive at DISPATCH_LEVEL or in an arbitrary thread (so you can’t block while the lower drivers handle
the IRP).
„ Your postprocessing can be done at DISPATCH_LEVEL if need be (because completion routines might be called at
DISPATCH_LEVEL).

Figure 5-11. Pass down with completion routine.


Your dispatch and completion routines will have this skeletal form:

NTSTATUS DispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp)


{
PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
NTSTATUS status = IoAcquireRemoveLock(&pdx->RemoveLock, Irp);
if (!NT_SUCCESS(status))
return CompleteRequest(Irp, status);
IoCopyCurrentIrpStackLocationToNext(Irp);
IoSetCompletionRoutine(Irp,
(PIO_COMPLETION_ROUTINE) CompletionRoutine, pdx, TRUE, TRUE, TRUE);
return IoCallDriver(pdx->LowerDeviceObject, Irp);
}

NTSTATUS CompletionRoutine(PDEVICE_OBJECT fdo, PIRP Irp,PDEVICE_EXTENSION pdx)


{
if (Irp->PendingReturned)
IoMarkIrpPending(Irp);
<whatever post processing you wanted to do>
IoReleaseRemoveLock(&pdx->RemoveLock, Irp);
return STATUS_SUCCESS;
}

5.6.2 Scenario 2—Pass Down Without Completion Routine


In this scenario, someone sends you an IRP. You’ll forward the IRP to the lower driver in your PnP stack, but you don’t need to
do anything with the IRP. See Figure 5-12. Adopt this strategy, which can also be called the “Let Mikey try it” approach, when
both of the following are true:
„ Someone is sending you an IRP (as opposed to you creating the IRP yourself).
„ You don’t process this IRP, but a driver below you might want to.


Figure 5-12. Pass down without completion routine.


This scenario is often used in a filter driver, which should act as a simple conduit for every IRP that it doesn’t specifically need
to filter.
I recommend writing the following helper routine, which you can use whenever you need to employ this strategy.

NTSTATUS ForwardAndForget(PDEVICE_EXTENSION pdx, PIRP Irp)
{
    NTSTATUS status = IoAcquireRemoveLock(&pdx->RemoveLock, Irp);
    if (!NT_SUCCESS(status))
        return CompleteRequest(Irp, status);
    IoSkipCurrentIrpStackLocation(Irp);
    status = IoCallDriver(pdx->LowerDeviceObject, Irp);
    IoReleaseRemoveLock(&pdx->RemoveLock, Irp);
    return status;
}

5.6.3 Scenario 3—Complete in the Dispatch Routine


In this scenario, you immediately complete an IRP that someone sends you. See Figure 5-13. Adopt this strategy when:
„ Someone is sending you an IRP (as opposed to you creating the IRP yourself), and
„ You can process the IRP immediately. This would be the case for many kinds of I/O control (IOCTL) requests. Or
„ Something is obviously wrong with the IRP, in which case causing it to fail immediately might be the kindest thing to do.

Figure 5-13. Complete in the dispatch routine.


Your dispatch routine has this skeletal form:

NTSTATUS DispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp)


{
PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
<process the IRP>
Irp->IoStatus.Status = STATUS_XXX;
Irp->IoStatus.Information = YYY;
IoCompleteRequest(Irp, IO_NO_INCREMENT);
return STATUS_XXX;
}


5.6.4 Scenario 4—Queue for Later Processing


In this scenario, someone sends you an IRP that you can’t handle right away. You put the IRP on a queue for later processing in
a StartIo routine. See Figure 5-14. Adopt this strategy when both of the following are true:
„ Someone is sending you an IRP (as opposed to you creating the IRP yourself).
„ You don’t know that you can process the IRP right away. This would frequently be the case for IRPs that require
serialized hardware access, such as reads and writes.

Figure 5-14. Queue for later processing.


Although you have many choices, a typical way of implementing this scenario involves using a DEVQUEUE to manage the
IRP queue. The following fragments (with elisions marked by ellipses) show how various parts of a driver for a programmed I/O,
interrupt-driven device would interact; the DEVQUEUE calls are the parts that pertain specifically to IRP handling.

typedef struct _DEVICE_EXTENSION {
    ...
    DEVQUEUE dqReadWrite;
    ...
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

NTSTATUS AddDevice(PDRIVER_OBJECT DriverObject, PDEVICE_OBJECT pdo)
{
    ...
    InitializeQueue(&pdx->dqReadWrite, StartIo);
    IoInitializeDpcRequest(fdo, (PIO_DPC_ROUTINE) DpcForIsr);
    ...
}

NTSTATUS DispatchReadWrite(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    IoMarkIrpPending(Irp);
    StartPacket(&pdx->dqReadWrite, fdo, Irp, CancelRoutine);
    return STATUS_PENDING;
}

VOID CancelRoutine(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    CancelRequest(&pdx->dqReadWrite, Irp);
}

VOID StartIo(PDEVICE_OBJECT fdo, PIRP Irp)
{
    ...
}

BOOLEAN OnInterrupt(PKINTERRUPT junk, PDEVICE_EXTENSION pdx)
{
    ...
    PIRP Irp = GetCurrentIrp(&pdx->dqReadWrite);
    Irp->IoStatus.Status = STATUS_XXX;
    Irp->IoStatus.Information = YYY;
    IoRequestDpc(pdx->DeviceObject, NULL, pdx);
    ...
}

VOID DpcForIsr(PKDPC junk1, PDEVICE_OBJECT fdo, PIRP junk2,
    PDEVICE_EXTENSION pdx)
{
    ...
    PIRP Irp = GetCurrentIrp(&pdx->dqReadWrite);
    StartNextPacket(&pdx->dqReadWrite, fdo);
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
}

5.6.5 Scenario 5—Your Own Asynchronous IRP


In this scenario, you create an asynchronous IRP, which you forward to another driver. See Figure 5-15. Adopt this strategy
when the following conditions are true:
„ You need another driver to perform an operation on your behalf.
„ Either you're in an arbitrary thread (which you shouldn't block) or you're running at DISPATCH_LEVEL (in which case you
can't block).

Figure 5-15. Your own asynchronous IRP.


You’ll have code like the following in your driver. This won’t necessarily be in an IRP dispatch routine, and the target device
object won’t necessarily be the next lower one in your PnP stack. Look in the DDK documentation for full details about how to
call IoBuildAsynchronousFsdRequest and IoAllocateIrp.

SOMETYPE SomeFunction(PDEVICE_EXTENSION pdx, PDEVICE_OBJECT DeviceObject)
{
    NTSTATUS status = IoAcquireRemoveLock(&pdx->RemoveLock, (PVOID) 42);   // A
    if (!NT_SUCCESS(status))
        return <status>;
    PIRP Irp;
    Irp = IoBuildAsynchronousFsdRequest(IRP_MJ_XXX, DeviceObject, ...);
    -or-
    Irp = IoAllocateIrp(DeviceObject->StackSize, FALSE);
    PIO_STACK_LOCATION stack = IoGetNextIrpStackLocation(Irp);
    stack->MajorFunction = IRP_MJ_XXX;
    <additional initialization>
    IoSetCompletionRoutine[Ex]([pdx->DeviceObject,] Irp,
        (PIO_COMPLETION_ROUTINE) CompletionRoutine, pdx,
        TRUE, TRUE, TRUE);
    ObReferenceObject(DeviceObject);                                       // B
    IoCallDriver(DeviceObject, Irp);
    ObDereferenceObject(DeviceObject);                                     // B
}

NTSTATUS CompletionRoutine(PDEVICE_OBJECT junk, PIRP Irp, PDEVICE_EXTENSION pdx)
{
    <IRP cleanup -- see below>
    IoFreeIrp(Irp);
    IoReleaseRemoveLock(&pdx->RemoveLock, (PVOID) 42);                     // A
    return STATUS_MORE_PROCESSING_REQUIRED;
}

The calls to IoAcquireRemoveLock and IoReleaseRemoveLock (the points labeled A) are necessary only if the device to which
you’re sending this IRP is the LowerDeviceObject in your PnP stack. The 42 is an arbitrary tag—it’s simply too complicated to
try to acquire the remove lock after the IRP exists just so we can use the IRP pointer as a tag in the debug build.
The calls to ObReferenceObject and ObDereferenceObject that precede and follow the call to IoCallDriver (the points labeled
B) are necessary only when you’ve used IoGetDeviceObjectPointer to obtain the DeviceObject pointer and when the
completion routine (or something it calls) will release the resulting reference to a device or file object.
You do not have both the A code and the B code—you have one set or neither.
If you use IoBuildAsynchronousFsdRequest to build an IRP_MJ_READ or IRP_MJ_WRITE, you have some relatively
complex cleanup to perform in the completion routine.

Cleanup for DO_DIRECT_IO Target


If the target device object indicates the DO_DIRECT_IO buffering method, you’ll have to release the memory descriptor lists
that the I/O Manager allocated for your data buffer:

NTSTATUS CompletionRoutine(...)
{
PMDL mdl;
while ((mdl = Irp->MdlAddress))
{
Irp->MdlAddress = mdl->Next;
MmUnlockPages(mdl); // <== only if you earlier
// called MmProbeAndLockPages
IoFreeMdl(mdl);
}
IoFreeIrp(Irp);
<optional release of remove lock>
return STATUS_MORE_PROCESSING_REQUIRED;
}

Cleanup for DO_BUFFERED_IO Target


If the target device object indicates DO_BUFFERED_IO, the I/O Manager will create a system buffer. Your completion routine
should theoretically copy data from the system buffer to your own buffer and then release the system buffer. Unfortunately, the
flag bits and fields needed to do this are not documented in the DDK. My advice is to simply not send reads and writes directly
to a driver that uses buffered I/O. Instead, call ZwReadFile or ZwWriteFile.
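For example, a read by handle boils down to the following sketch, which assumes you're at PASSIVE_LEVEL and already have a kernel handle (hDevice), a buffer, and a byte count from somewhere else; the I/O Manager takes care of all the buffered-I/O bookkeeping and cleanup:

IO_STATUS_BLOCK iosb;
LARGE_INTEGER offset;
offset.QuadPart = 0;
NTSTATUS status = ZwReadFile(hDevice, NULL, NULL, NULL, &iosb,
    buffer, nbytes, &offset, NULL);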

Cleanup for Other Targets


If the target device indicates neither DO_DIRECT_IO nor DO_BUFFERED_IO, there is no additional cleanup. Phew!

5.6.6 Scenario 6—Your Own Synchronous IRP


In this scenario, you create a synchronous IRP, which you forward to another driver. See Figure 5-16. Adopt this strategy when
all of the following are true:
„ You need another driver to perform an operation on your behalf.
„ You must wait for the operation to complete before proceeding.
„ You’re running at PASSIVE_LEVEL in a nonarbitrary thread.


Figure 5-16. Your own synchronous IRP.


You’ll have code like the following in your driver. This won’t necessarily be in an IRP dispatch routine, and the target device
object won’t necessarily be the next lower one in your PnP stack. Look in the DDK documentation for full details about how to
call IoBuildSynchronousFsdRequest and IoBuildDeviceIoControlRequest.

SOMETYPE SomeFunction(PDEVICE_EXTENSION pdx, PDEVICE_OBJECT DeviceObject)
{
    NTSTATUS status = IoAcquireRemoveLock(&pdx->RemoveLock, (PVOID) 42);   // A
    if (!NT_SUCCESS(status))
        return <status>;
    PIRP Irp;
    KEVENT event;
    IO_STATUS_BLOCK iosb;
    KeInitializeEvent(&event, NotificationEvent, FALSE);
    Irp = IoBuildSynchronousFsdRequest(IRP_MJ_XXX,
        DeviceObject, ..., &event, &iosb);
    -or-
    Irp = IoBuildDeviceIoControlRequest(IOCTL_XXX, DeviceObject,
        ..., &event, &iosb);
    status = IoCallDriver(DeviceObject, Irp);
    if (status == STATUS_PENDING)
    {
        KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
        status = iosb.Status;
    }
    IoReleaseRemoveLock(&pdx->RemoveLock, (PVOID) 42);                     // A
}

As in scenario 5, the calls to IoAcquireRemoveLock and IoReleaseRemoveLock (the points labeled A) are necessary only if the
device to which you’re sending this IRP is the LowerDeviceObject in your PnP stack. The 42 is an arbitrary tag—it’s simply
too complicated to try to acquire the remove lock after the IRP exists just so we can use the IRP pointer as a tag in the debug
build.
We’ll use this scenario frequently in Chapter 12 to send USB Request Blocks (URBs) synchronously down the stack. In the
examples we’ll study there, we’ll usually be doing this in the context of an IRP dispatch routine that independently acquires the
remove lock. Therefore, you won’t see the extra remove lock code in those examples.
You do not clean up after this IRP! The I/O Manager does it automatically.

5.6.7 Scenario 7—Synchronous Pass Down


In this scenario, someone sends you an IRP. You pass the IRP down synchronously in your PnP stack and then continue
processing. See Figure 5-17. Adopt this strategy when all of the following are true:


• Someone is sending you an IRP (as opposed to you creating the IRP yourself).
• You’re running at PASSIVE_LEVEL in a nonarbitrary thread.
• Your postprocessing for the IRP must be done at PASSIVE_LEVEL.

Figure 5-17. Synchronous pass down.


A good example of when you would need to use this strategy is while processing an IRP_MN_START_DEVICE flavor of PnP
request.
I recommend writing two helper routines to make it easy to perform this synchronous pass-down operation:

NTSTATUS ForwardAndWait(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    KEVENT event;
    KeInitializeEvent(&event, NotificationEvent, FALSE);
    IoCopyCurrentIrpStackLocationToNext(Irp);
    IoSetCompletionRoutine(Irp, (PIO_COMPLETION_ROUTINE)
        ForwardAndWaitCompletionRoutine, &event, TRUE, TRUE, TRUE);
    NTSTATUS status = IoCallDriver(pdx->LowerDeviceObject, Irp);
    if (status == STATUS_PENDING)
    {
        KeWaitForSingleObject(&event, Executive, KernelMode,
            FALSE, NULL);
        status = Irp->IoStatus.Status;
    }
    return status;
}

NTSTATUS ForwardAndWaitCompletionRoutine(PDEVICE_OBJECT fdo,
    PIRP Irp, PKEVENT pev)
{
    if (Irp->PendingReturned)
        KeSetEvent(pev, IO_NO_INCREMENT, FALSE);
    return STATUS_MORE_PROCESSING_REQUIRED;
}

The caller of this routine needs to call IoCompleteRequest for this IRP and to acquire and release the remove lock. It’s
inappropriate for ForwardAndWait to contain the remove lock logic because the caller might not want to release the lock so
soon.
Note that the Windows XP DDK function IoForwardIrpSynchronously encapsulates these same steps.
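On Windows XP and later, the two helpers can therefore collapse into a single call. A minimal sketch; as with ForwardAndWait, the caller still completes the IRP afterward and handles the remove lock itself:

NTSTATUS status;
if (IoForwardIrpSynchronously(pdx->LowerDeviceObject, Irp))
    status = Irp->IoStatus.Status;      // the lower layers have already completed the IRP
else
    status = STATUS_UNSUCCESSFUL;       // could not forward (no remaining stack locations)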

5.6.8 Scenario 8—Asynchronous IRP Handled Synchronously


In this scenario, you create an asynchronous IRP, which you forward to another driver. Then you wait for the IRP to complete.
See Figure 5-18. Adopt this strategy when all of the following are true:
• You need another driver to perform an operation on your behalf.
• You need to wait for the operation to finish before you can go on.
• You’re running at APC_LEVEL in a nonarbitrary thread.


Figure 5-18. Asynchronous IRP handled synchronously.


I use this technique when I’ve acquired an executive fast mutex and need to perform a synchronous operation. Your code
combines elements you’ve seen before (compare with scenarios 5 and 7):

SOMETYPE SomeFunction(PDEVICE_EXTENSION pdx, PDEVICE_OBJECT DeviceObject)
{
    NTSTATUS status = IoAcquireRemoveLock(&pdx->RemoveLock, (PVOID) 42);    // (A)
    if (!NT_SUCCESS(status))
        return <status>;
    PIRP Irp;
    Irp = IoBuildAsynchronousFsdRequest(IRP_MJ_XXX, DeviceObject, ...);
    -or-
    Irp = IoAllocateIrp(DeviceObject->StackSize, FALSE);
    PIO_STACK_LOCATION stack = IoGetNextIrpStackLocation(Irp);
    stack->MajorFunction = IRP_MJ_XXX;
    <additional initialization>
    KEVENT event;
    KeInitializeEvent(&event, NotificationEvent, FALSE);
    IoSetCompletionRoutine[Ex]([pdx->DeviceObject], Irp,
        (PIO_COMPLETION_ROUTINE) CompletionRoutine,
        &event, TRUE, TRUE, TRUE);
    status = IoCallDriver(DeviceObject, Irp);
    if (status == STATUS_PENDING)
        KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
    IoReleaseRemoveLock(&pdx->RemoveLock, (PVOID) 42);                      // (A)
}

NTSTATUS CompletionRoutine(PDEVICE_OBJECT junk, PIRP Irp, PKEVENT pev)
{
    if (Irp->PendingReturned)
        KeSetEvent(pev, EVENT_INCREMENT, FALSE);
    <IRP cleanup -- see above>
    IoFreeIrp(Irp);
    return STATUS_MORE_PROCESSING_REQUIRED;
}

The portions that differ from scenario 5 are the event object, the wait for it when IoCallDriver returns STATUS_PENDING, and the KeSetEvent call in the completion routine.


As in earlier scenarios, the calls to IoAcquireRemoveLock and IoReleaseRemoveLock (the points labeled A) are necessary only
if the device to which you’re sending this IRP is the LowerDeviceObject in your PnP stack. The 42 is an arbitrary tag—it’s
simply too complicated to try to acquire the remove lock after the IRP exists just so we can use the IRP pointer as a tag in the
debug build.
Note that you must still perform all the same cleanup discussed earlier because the I/O Manager doesn’t clean up after an
asynchronous IRP. You might also need to provide for cancelling this IRP, in which case you should use the technique


Chapter 6
6 Plug and Play for Function Drivers

The Plug and Play (PnP) Manager communicates information and requests to device drivers via I/O request packets (IRPs)
with the major function code IRP_MJ_PNP. This type of request was new with Microsoft Windows 2000 and the Windows
Driver Model (WDM): previous versions of Microsoft Windows NT required device drivers to do most of the work of
detecting and configuring their devices. Happily, WDM drivers can let the PnP Manager do that work. To work with the PnP
Manager, driver authors will have to understand a few relatively complicated IRPs.
Plug and Play requests play two roles in the WDM. In their first role, these requests instruct the driver when and how to
configure or deconfigure itself and the hardware. Table 6-1 lists the roughly two dozen minor functions that a PnP request can
designate. Only a bus driver handles the nine minor functions shown with an asterisk; a filter driver or function driver would
simply pass these IRPs down the stack. Of the remaining minor functions, three have special importance to a typical filter
driver or function driver. The PnP Manager uses IRP_MN_START_DEVICE to inform the function driver which I/O resources
it has assigned to the hardware and to instruct the function driver to do any necessary hardware and software setup so that the
device can function. IRP_MN_STOP_DEVICE tells the function driver to shut down the device. IRP_MN_REMOVE_DEVICE
tells the function driver to shut down the device and release the associated device object. I’ll discuss these three minor
functions in detail in this chapter and the next; along the way, I’ll also describe the purpose of the other unstarred minor
functions that a filter driver or function driver might need to handle.

IRP Minor Function Code Description


IRP_MN_START_DEVICE Configure and initialize device
IRP_MN_QUERY_REMOVE_DEVICE Can device be removed safely?
IRP_MN_REMOVE_DEVICE Shut down and remove device
IRP_MN_CANCEL_REMOVE_DEVICE Ignore previous QUERY_REMOVE
IRP_MN_STOP_DEVICE Shut down device
IRP_MN_QUERY_STOP_DEVICE Can device be shut down safely?
IRP_MN_CANCEL_STOP_DEVICE Ignore previous QUERY_STOP
IRP_MN_QUERY_DEVICE_RELATIONS Get list of devices that are related in some specified way
IRP_MN_QUERY_INTERFACE Obtain direct-call function addresses
IRP_MN_QUERY_CAPABILITIES Determine capabilities of device
IRP_MN_QUERY_RESOURCES* Determine boot configuration
IRP_MN_QUERY_RESOURCE_REQUIREMENTS* Determine I/O resource requirements
IRP_MN_QUERY_DEVICE_TEXT* Obtain description or location string
IRP_MN_FILTER_RESOURCE_REQUIREMENTS Modify I/O resource requirements list
IRP_MN_READ_CONFIG* Read configuration space
IRP_MN_WRITE_CONFIG* Write configuration space
IRP_MN_EJECT* Eject the device
IRP_MN_SET_LOCK* Lock/unlock device against ejection
IRP_MN_QUERY_ID* Determine hardware ID of device
IRP_MN_QUERY_PNP_DEVICE_STATE Determine state of device
IRP_MN_QUERY_BUS_INFORMATION* Determine parent bus type
IRP_MN_DEVICE_USAGE_NOTIFICATION Note creation or deletion of paging, dump, or hibernate file
IRP_MN_SURPRISE_REMOVAL Note fact that device has been removed

Table 6-1. Minor Function Codes for IRP_MJ_PNP (* Indicates Handled Only by Bus Drivers)
A second and more complicated purpose of PnP requests is to guide the driver through a series of state transitions, as illustrated
in Figure 6-1. WORKING and STOPPED are the two fundamental states of the device. The STOPPED state is the initial state
of a device immediately after you create the device object. The WORKING state indicates that the device is fully operational.
Two of the intermediate states—PENDINGSTOP and PENDINGREMOVE—arise because of queries that all drivers for a
device must process before making the transition from WORKING. SURPRISEREMOVED occurs after the sudden and
unexpected removal of the physical hardware.
I introduced my DEVQUEUE queue management routines in the preceding chapter. The main reason for needing a custom
queuing scheme in the first place is to facilitate the PnP state transitions shown in Figure 6-1 and the power state transitions I’ll
discuss in Chapter 8. I’ll describe the DEVQUEUE routines that support these transitions in this chapter.

Figure 6-1. State diagram for a device.


This chapter also discusses PnP notifications, which provide a way for drivers and user-mode programs to learn
asynchronously about the arrival and departure of devices. Properly handling these notifications is important for applications
that work with devices that can be hot plugged and unplugged.
I’ve devoted a separate chapter (Chapter 11) to bus and multifunction drivers.

TIP
You can save yourself a lot of work by copying and using my GENERIC.SYS library. Instead of writing your own
elaborate dispatch function for IRP_MJ_PNP, simply delegate this IRP to GenericDispatchPnp. See the
Introduction for a table that lists the callback functions your driver supplies to perform device-specific
operations. I’ve used the same callback function names in this chapter. In addition, I’m basically using
GENERIC’s PnP handling code for all of the examples.

6.1 IRP_MJ_PNP Dispatch Function


A simplified version of the dispatch function for IRP_MJ_PNP might look like the following:

NTSTATUS DispatchPnp(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
    ULONG fcn = stack->MinorFunction;
    static NTSTATUS (*fcntab[])(PDEVICE_OBJECT, PIRP) = {
        HandleStartDevice,      // IRP_MN_START_DEVICE
        HandleQueryRemove,      // IRP_MN_QUERY_REMOVE_DEVICE
        <etc.>,
    };
    if (fcn >= arraysize(fcntab))
        return DefaultPnpHandler(fdo, Irp);
    return (*fcntab[fcn])(fdo, Irp);
}

NTSTATUS DefaultPnpHandler(PDEVICE_OBJECT fdo, PIRP Irp)
{
    IoSkipCurrentIrpStackLocation(Irp);
    PDEVICE_EXTENSION pdx =
        (PDEVICE_EXTENSION) fdo->DeviceExtension;
    return IoCallDriver(pdx->LowerDeviceObject, Irp);
}

1. All the parameters for the IRP, including the all-important minor function code, are in the stack location. Hence, we
obtain a pointer to the stack location by calling IoGetCurrentIrpStackLocation.
2. We expect the IRP’s minor function code to be one of those listed in Table 6-1.
3. A method of handling the two dozen possible minor function codes is to write a subdispatch function for each one we’re
going to handle and then to define a table of pointers to those subdispatch functions. Many of the entries in the table will
be DefaultPnpHandler. Subdispatch functions such as HandleStartDevice will take pointers to a device object and an IRP
as parameters and will return an NTSTATUS code.
4. If we get a minor function code we don’t recognize, it’s probably because Microsoft defined a new one in a release of the
DDK after the DDK with which we built our driver. The right thing to do is to pass the minor function code down the
stack by calling the default handler. By the way, arraysize is a macro in one of my own header files that returns the
number of elements in an array. It’s defined as #define arraysize(p) (sizeof(p)/sizeof((p)[0])).
5. This is the operative statement in the dispatch routine, in which we index the table of subdispatch functions and call the
right one.
6. The DefaultPnpHandler routine is essentially the ForwardAndForget function I showed in connection with IRP-handling
scenario 2 in the preceding chapter. We’re passing the IRP down without a completion routine and therefore use
IoSkipCurrentIrpStackLocation to retard the IRP stack pointer in anticipation that IoCallDriver will immediately advance
it.

Using a Function Pointer Table


Using a table of function pointers to dispatch handlers for minor function codes as I’m showing you in
DispatchPnp entails some slight danger. A future version of the operating system might change the meaning of
some of the codes. That’s not a practical worry except during the beta test phase of a system, though, because
a later change would invalidate an unknown number of existing drivers. I like using a table of pointers to
subdispatch functions because having separate functions for the minor function codes seems like the right
engineering solution to me. If I were designing a C++ class library, for instance, I’d define a base class that
used virtual functions for each of the minor function codes.

Most programmers would probably place a switch statement in their DispatchPnp routine instead. With a switch on the symbolic minor function codes, you can simply recompile your driver to conform to any reassignment of those codes. Recompilation will also
highlight—by producing compilation errors!—name changes that might signal functionality shifts. That
happened a time or two during the Microsoft Windows 98 and Windows 2000 betas, in fact. Furthermore, an
optimizing compiler should be able to use a jump table to produce slightly faster code for a switch statement
than for calls to subdispatch functions.

I think the choice between a switch statement and a table of function pointers is mostly a matter of taste, with
readability and modularity winning over efficiency in my own evaluation. You can avoid uncertainty during a
beta test by placing appropriate assertions in your code. For example, the HandleStartDevice function can
assert that stack->MinorFunction == IRP_MN_START_DEVICE. If you recompile your driver with each new beta
DDK, you’ll catch any number reassignments or name changes.
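For comparison, here is a minimal sketch of the switch-based alternative the sidebar describes, using the same subdispatch names as the table-driven version shown earlier:

NTSTATUS DispatchPnp(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
    switch (stack->MinorFunction)
    {
    case IRP_MN_START_DEVICE:
        return HandleStartDevice(fdo, Irp);
    case IRP_MN_QUERY_REMOVE_DEVICE:
        return HandleQueryRemove(fdo, Irp);
    // <cases for the other minor functions you handle>
    default:
        return DefaultPnpHandler(fdo, Irp);     // unrecognized or future codes go down the stack
    }
}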

6.2 Starting and Stopping Your Device


Working with the bus driver, the PnP Manager automatically detects hardware and assigns I/O resources in Windows XP and
Windows 98/Me. Most modern devices have PnP features that allow system software to detect them automatically and to
electronically determine which I/O resources they require. In the case of legacy devices that have no electronic means of
identifying themselves to the operating system or of expressing their resource requirements, the registry database contains the
information needed for the detection and assignment operations.

NOTE
I find it hard to give an abstract definition of the term I/O resource that isn’t circular (for example, a resource
used for I/O), so I’ll give a concrete one instead. The WDM encompasses four standard I/O resource types: I/O
ports, memory registers, direct memory access (DMA) channels, and interrupt requests.

When the PnP Manager detects hardware, it consults the registry to learn which filter drivers and function drivers will manage
the hardware. As I discussed in Chapter 2, the PnP Manager loads these drivers (if necessary—one or more of them might
already be present, having been called into memory on behalf of some other hardware) and calls their AddDevice functions.
The AddDevice functions, in turn, create device objects and link them into a stack. At this point, the stage is set for the PnP
Manager, working with all of the device drivers, to assign I/O resources.
The PnP Manager initially creates a list of resource requirements for each device and allows the drivers to filter that list. I’m
going to ignore the filtering step for now because not every driver will need to participate in this step. Given a list of
requirements, the PnP Manager can then assign resources so as to harmonize the potentially conflicting requirements of all the
hardware present on the system. Figure 6-2 illustrates how the PnP Manager can arbitrate between two different devices that
have overlapping requirements for an interrupt request number, for example.

Figure 6-2. Arbitration of conflicting I/O resource requirements.

6.2.1 IRP_MN_START_DEVICE
Once the resource assignments are known, the PnP Manager notifies each device by sending it a PnP request with the minor
function code IRP_MN_START_DEVICE. Filter drivers are typically not interested in this IRP, so they usually pass the request
down the stack by using the DefaultPnpHandler technique I showed you earlier in “IRP_MJ_PNP Dispatch Function.”
Function drivers, on the other hand, need to do a great deal of work on the IRP to allocate and configure additional software
resources and to prepare the device for operation. This work needs to be done, furthermore, at PASSIVE_LEVEL after the lower
layers in the device hierarchy have processed this IRP.
You might implement IRP_MN_START_DEVICE in a subdispatch routine—reached from the DispatchPnp dispatch routine
shown earlier—that has the following skeletal form:

NTSTATUS HandleStartDevice(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    Irp->IoStatus.Status = STATUS_SUCCESS;
    NTSTATUS status = ForwardAndWait(fdo, Irp);
    if (!NT_SUCCESS(status))
        return CompleteRequest(Irp, status);
    PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
    status = StartDevice(fdo, <additional args>);
    EnableAllInterfaces(pdx, TRUE);
    return CompleteRequest(Irp, status);
}

1. The bus driver uses the incoming setting of IoStatus.Status to determine whether upper-level drivers have handled this
IRP. The bus driver makes a similar determination for several other minor functions of IRP_MJ_PNP. We therefore need
to initialize the Status field of the IRP to STATUS_SUCCESS before passing it down.
2. ForwardAndWait is the function I showed you in Chapter 5 in connection with IRP-handling scenario 7 (synchronous
pass down). The function returns a status code. If the status code denotes some sort of failure in the lower layers, we
propagate the code back to our own caller. Because our completion routine returned STATUS_MORE_PROCESSING_REQUIRED, we halted the completion process inside IoCompleteRequest. Therefore, we have to complete the
request all over again, as shown here.
3. Our configuration information is buried inside the stack parameters. I’ll show you where a bit further on.
4. StartDevice is a helper routine you write to handle the details of extracting and dealing with configuration information. In
my sample drivers, I’ve placed it in a separate source module named READWRITE.CPP. I’ll explain shortly what
arguments you would pass to this routine besides the address of the device object.
5. EnableAllInterfaces enables all the device interfaces that you registered in your AddDevice routine. This step allows
applications to find your device when they use SetupDiXxx functions to enumerate instances of your registered interfaces.
6. Since ForwardAndWait short-circuited the completion process for the START_DEVICE request, we need to complete the
IRP a second time. In this example, I’m using an overloaded version of CompleteRequest that doesn’t change
IoStatus.Information, in accordance with the DDK rules for handling PnP requests.
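The CompleteRequest helper that these fragments call isn't reproduced in this excerpt. A minimal sketch consistent with the way it's used (one overload that sets IoStatus.Information and one that leaves it alone) would be:

NTSTATUS CompleteRequest(PIRP Irp, NTSTATUS status, ULONG_PTR info)
{
    Irp->IoStatus.Status = status;
    Irp->IoStatus.Information = info;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return status;
}

NTSTATUS CompleteRequest(PIRP Irp, NTSTATUS status)
{                                       // overload that leaves IoStatus.Information untouched
    Irp->IoStatus.Status = status;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return status;
}
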
You might guess (correctly!) that the IRP_MN_START_DEVICE handler has work to do that concerns the transition from the
initial STOPPED state to the WORKING state. I can’t explain that yet because I need to first explain the ramifications of other
PnP requests on state transitions, IRP queuing, and IRP cancellation. So I’m going to concentrate for a while on the
configuration aspects of the PnP requests.
The I/O stack location’s Parameters union has a substructure named StartDevice that contains the configuration information
you pass to the StartDevice helper function. See Table 6-2.

Field Name Description


AllocatedResources Contains raw resource assignments
AllocatedResourcesTranslated Contains translated resource assignments
Table 6-2. Fields in the Parameters.StartDevice Substructure of an I/O Stack Location
Both AllocatedResources and AllocatedResourcesTranslated are instances of the same kind of data structure, called a
CM_RESOURCE_LIST. This seems like a very complicated data structure if you judge only by its declaration in WDM.H. As
used in a start device IRP, however, all that remains of the complication is a great deal of typing. The “lists” will have just one
entry, a CM_PARTIAL_RESOURCE_LIST that describes all of the I/O resources assigned to the device. You can use statements
like the following to access the two lists:

PCM_PARTIAL_RESOURCE_LIST raw, translated;
raw = &stack->Parameters.StartDevice
    .AllocatedResources->List[0].PartialResourceList;
translated = &stack->Parameters.StartDevice
    .AllocatedResourcesTranslated->List[0].PartialResourceList;

The only difference between the last two statements is the reference to either the AllocatedResources or
AllocatedResourcesTranslated member of the parameters structure.
The raw and translated resource lists are the logical arguments to send to the StartDevice helper function, by the way:

status = StartDevice(fdo, raw, translated);

There are two different lists of resources because I/O buses and the CPU can address the same physical hardware in different
ways. The raw resources contain numbers that are bus-relative, whereas the translated resources contain numbers that are
system-relative. Prior to the WDM, a kernel-mode driver might expect to retrieve raw resource values from the registry, the
Peripheral Component Interconnect (PCI) configuration space, or some other source, and to translate them by calling routines
such as HalTranslateBusAddress and HalGetInterruptVector. See, for example, Art Baker’s The Windows NT Device Driver
Book: A Guide for Programmers (Prentice Hall, 1997), pages 122-62. Both the retrieval and translation steps are done by the
PnP Manager now, and all a WDM driver needs to do is access the parameters of a start device IRP as I’m now describing.
What you actually do with the resource descriptions inside your StartDevice function is a subject for Chapter 7.
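To give just the flavor of what that chapter covers, here is a hedged sketch of the shape StartDevice might take. CM_PARTIAL_RESOURCE_DESCRIPTOR and the CmResourceTypeXxx constants are standard WDM declarations; real code would save what it finds (port bases, interrupt parameters, and so on) in the device extension rather than merely switching on the type:

NTSTATUS StartDevice(PDEVICE_OBJECT fdo, PCM_PARTIAL_RESOURCE_LIST raw,
    PCM_PARTIAL_RESOURCE_LIST translated)
{
    for (ULONG i = 0; i < translated->Count; ++i)
    {
        PCM_PARTIAL_RESOURCE_DESCRIPTOR desc = &translated->PartialDescriptors[i];
        switch (desc->Type)
        {
        case CmResourceTypePort:        // translated I/O port range:
            break;                      //   desc->u.Port.Start, desc->u.Port.Length
        case CmResourceTypeMemory:      // translated memory range:
            break;                      //   desc->u.Memory.Start, desc->u.Memory.Length
        case CmResourceTypeInterrupt:   // translated interrupt:
            break;                      //   desc->u.Interrupt.Vector, .Level, .Affinity
        }
    }
    return STATUS_SUCCESS;
}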

6.2.2 IRP_MN_STOP_DEVICE
The stop device request tells you to shut your device down so that the PnP Manager can reassign I/O resources. At the
hardware level, shutting down involves pausing or halting current activity and preventing further interrupts. At the software
level, it involves releasing the I/O resources you configured at start device time. Within the framework of the
dispatch/subdispatch architecture I’ve been illustrating, you might have a subdispatch function like this one:

NTSTATUS HandleStopDevice(PDEVICE_OBJECT fdo, PIRP Irp)
{
    <complicated stuff>
    StopDevice(fdo, oktouch);
    Irp->IoStatus.Status = STATUS_SUCCESS;
    return DefaultPnpHandler(fdo, Irp);
}


1. Right about here, you need to insert some more or less complicated code that concerns IRP queuing and cancellation. I’ll
show you the code that belongs in this spot further on in this chapter in “While the Device Is Stopped.”
2. In contrast with the start device case, in which we passed the request down and then did device-dependent work, here we
do our device-dependent stuff first and then pass the request down. The idea is that our hardware will be quiescent by the
time the lower layers see this request. I wrote a helper function named StopDevice to do the shutdown work. The second
argument indicates whether it will be OK for StopDevice to touch the hardware if it needs to. Refer to the sidebar
“Touching the Hardware When Stopping the Device” for an explanation of how to set this argument.
3. We always pass PnP requests down the stack. In this case, we don’t care what the lower layers do with the request, so we
can simply use the DefaultPnpHandler code to perform the mechanics.
The StopDevice helper function called in the preceding example is code you write that essentially reverses the configuration
steps you took in StartDevice. I’ll show you that function in the next chapter. One important fact about the function is that you
should code it in such a way that it can be called more than once for a single call to StartDevice. It’s not always easy for a PnP
IRP handler to know whether you’ve already called StopDevice, but it is easy to make StopDevice proof against duplicative
calls.

Touching the Hardware When Stopping the Device


In the skeleton of HandleStopDevice, I used an oktouch variable that I didn’t show you how to initialize. In the
scheme I’m teaching you in this book for writing a driver, the StopDevice function gets a BOOLEAN argument
that indicates whether it should be safe to address actual I/O operations to the hardware. The idea behind this
argument is that you might want to send certain instructions to your device as part of your shutdown protocol,
but there might be some reason why you can’t. You might want to tell your Personal Computer Memory Card
International Association (PCMCIA) modem to hang up the phone, for example, but there’s no point in trying if
the end user has already removed the modem card from the computer.

There’s no certain way to know whether your hardware is physically connected to the computer except by trying
to access it. Microsoft recommends, however, that if you succeeded in processing a START_DEVICE request,
you should go ahead and try to access your hardware when you process STOP_DEVICE and certain other PnP
requests. When I discuss how you track PnP state changes later in this chapter, I’ll honor this recommendation
by setting the oktouch argument to TRUE if we believe that the device is currently working and FALSE
otherwise.

6.2.3 IRP_MN_REMOVE_DEVICE
Recall that the PnP Manager calls the AddDevice function in your driver to notify you about an instance of the hardware you
manage and to give you an opportunity to create a device object. Instead of calling a function to do the complementary
operation, however, the PnP Manager sends you a PnP IRP with the minor function code IRP_MN_REMOVE_DEVICE. In
response to that, you’ll do the same things you did for IRP_MN_STOP_DEVICE to shut down your device, and then you’ll
delete the device object:

NTSTATUS HandleRemoveDevice(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    <complicated stuff>
    DeregisterAllInterfaces(pdx);
    StopDevice(fdo, oktouch);
    Irp->IoStatus.Status = STATUS_SUCCESS;
    NTSTATUS status = DefaultPnpHandler(fdo, Irp);
    RemoveDevice(fdo);
    return status;
}

This fragment looks similar to HandleStopDevice, with a couple of additions. DeregisterAllInterfaces will disable any device
interfaces you registered (probably in AddDevice) and enabled (probably in StartDevice), and it will release the memory
occupied by their symbolic link names. RemoveDevice will undo all the work you did inside AddDevice. For example:

VOID RemoveDevice(PDEVICE_OBJECT fdo)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    IoDetachDevice(pdx->LowerDeviceObject);
    IoDeleteDevice(fdo);
}

1. This call to IoDetachDevice balances the call AddDevice made to IoAttachDeviceToDeviceStack.
2. This call to IoDeleteDevice balances the call AddDevice made to IoCreateDevice. Once this function returns, you should
act as if the device object no longer exists. If your driver isn’t managing any other devices, it will shortly be unloaded
from memory too.
Note, by the way, that you don’t get a stop device request followed by a remove device request. The remove device request
implies a shutdown, so you do both pieces of work in reply.

6.2.4 IRP_MN_SURPRISE_REMOVAL
Sometimes the end user has the physical ability to remove a device without going through any user interface elements first. If
the system detects that such a surprise removal has occurred, or that the device appears to be broken, it sends the driver a PnP
request with the minor function code IRP_MN_SURPRISE_REMOVAL. It will later send an IRP_MN_REMOVE_DEVICE.
Unless you previously set the SurpriseRemovalOK flag while processing IRP_MN_QUERY_CAPABILITIES (as I’ll discuss in
Chapter 8), some platforms also post a dialog box to inform the user that it’s potentially dangerous to yank hardware out of the
computer.
In response to the surprise removal request, a device driver should disable any registered interfaces. This will give applications
a chance to close handles to your device if they’re on the lookout for the notifications I discuss later in “PnP Notifications.”
Then the driver should release I/O resources and pass the request down:

NTSTATUS HandleSurpriseRemoval(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    <complicated stuff>
    EnableAllInterfaces(pdx, FALSE);
    StopDevice(fdo, oktouch);
    Irp->IoStatus.Status = STATUS_SUCCESS;
    return DefaultPnpHandler(fdo, Irp);
}

Whence IRP_MN_SURPRISE_REMOVAL?
The surprise removal PnP notification doesn’t happen as a simple and direct result of the end user yanking the
device from the computer. Some bus drivers can know when a device disappears. For example, removing a
universal serial bus (USB) device generates an electronic signal that the bus driver notices. For many other
buses, however, there isn’t any signal to alert the bus driver. The PnP Manager therefore relies on other
methods to decide that a device has disappeared.

A function driver can signal the disappearance of its device (if it knows) by calling IoInvalidateDeviceState and
then returning any of the values PNP_DEVICE_FAILED, PNP_DEVICE_REMOVED, or PNP_DEVICE_DISABLED
from the ensuing IRP_MN_QUERY_PNP_DEVICE_STATE. You might want to do this in your own driver if—to give
one example of many—your interrupt service routines (ISRs) read all 1 bits from a status port that normally
returns a mixture of 1s and 0s. More commonly, a bus driver calls IoInvalidateDeviceRelations to trigger a
re-enumeration and then fails to report the newly missing device. It’s worth knowing that when the end user
removes a device while the system is hibernating or in another low-power state, when power is restored, the
driver receives a series of power management IRPs before it receives the IRP_MN_SURPRISE_REMOVAL
request.

What these facts mean, practically speaking, is that your driver should be able to cope with errors that might
arise from having your device suddenly not present.
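As a concrete illustration of the first mechanism, a function driver might record the failure and then report it when the PnP Manager requeries. This is only a sketch: the Pdo and DeviceFailed fields of the device extension are assumptions for illustration, and the caller of NoteDeviceFailure must be running at or below DISPATCH_LEVEL.

VOID NoteDeviceFailure(PDEVICE_EXTENSION pdx)
{
    pdx->DeviceFailed = TRUE;               // remembered for the upcoming query
    IoInvalidateDeviceState(pdx->Pdo);      // ask for an IRP_MN_QUERY_PNP_DEVICE_STATE
}

NTSTATUS HandleQueryPnpDeviceState(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    if (pdx->DeviceFailed)
        Irp->IoStatus.Information |= PNP_DEVICE_FAILED;
    Irp->IoStatus.Status = STATUS_SUCCESS;
    return DefaultPnpHandler(fdo, Irp);
}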

6.3 Managing PnP State Transitions


As I said at the outset of this chapter, WDM drivers need to track their devices through the state transitions diagrammed in
Figure 6-1. This state tracking also ties in with how you queue and cancel I/O requests. Cancellation in turn implicates the
global cancel spin lock, which is a performance bottleneck in a multi-CPU system. The standard model of IRP processing with
Microsoft queuing functions can’t solve all these interrelated problems. In this section, therefore, I’ll describe how my
DEVQUEUE object helps you cope with the complications Plug and Play creates.
Figure 6-3 illustrates the states of a DEVQUEUE. In the READY state, the queue accepts and forwards requests to your StartIo
routine in such a way that the device stays busy. In the STALLED state, however, the queue doesn’t forward IRPs to StartIo,
even when the device is idle. In the REJECTING state, the queue doesn’t even accept new IRPs. Figure 6-4 illustrates the flow
of IRPs through the queue.


Figure 6-3. States of a DEVQUEUE object.

Figure 6-4. Flow of IRPs through a DEVQUEUE.


Table 6-3 lists the support functions you can use with a DEVQUEUE. I discussed how to use InitializeQueue, StartPacket,
StartNextPacket, and CancelRequest in the preceding chapter. Now it’s time to discuss all the other functions.

Support Function Description


AbortRequests Aborts current and future requests
AllowRequests Undoes effect of previous AbortRequests
AreRequestsBeingAborted Are we currently aborting new requests?
CancelRequest Generic cancel routine
CheckBusyAndStall Checks for idle device and stalls requests in one atomic operation
CleanupRequests Cancels all requests for a given file object in order to service IRP_MJ_CLEANUP
GetCurrentIrp Determines which IRP is currently being processed by associated StartIo routine
InitializeQueue Initializes DEVQUEUE object
RestartRequests Restarts a stalled queue
StallRequests Stalls the queue
StartNextPacket Dequeues and starts the next request
StartPacket Starts or queues a new request
WaitForCurrentIrp Waits for current IRP to finish

Table 6-3. DEVQUEUE Service Routines


The real point of using a DEVQUEUE instead of one of the queue objects defined in the DDK is that a DEVQUEUE makes it
easier to manage the transitions between PnP states. In all of my sample drivers, the device extension contains a state variable
with the imaginative name state. I also define an enumeration named DEVSTATE whose values correspond to the PnP states.
When you initialize your device object in AddDevice, you’ll call InitializeQueue for each of your device queues and also
indicate that the device is in the STOPPED state:

NTSTATUS AddDevice(...)
{
    PDEVICE_EXTENSION pdx = ...;
    ...
    InitializeQueue(&pdx->dqReadWrite, StartIo);
    pdx->state = STOPPED;

After AddDevice returns, the system sends IRP_MJ_PNP requests to direct you through the various PnP states the device can
assume.

NOTE
If your driver uses GENERIC.SYS, GENERIC will initialize your DEVQUEUE object or objects for you. Just be sure
to give GENERIC the addresses of those objects in your call to InitializeGenericExtension.

6.3.1 Starting the Device


A newly initialized DEVQUEUE is in a STALLED state, such that a call to StartPacket will queue a request even when the
device is idle. You’ll keep the queue (or queues) in the STALLED state until you successfully process
IRP_MN_START_DEVICE, whereupon you’ll execute code like the following:

NTSTATUS HandleStartDevice(...)
{
    status = StartDevice(...);
    if (NT_SUCCESS(status))
    {
        pdx->state = WORKING;
        RestartRequests(&pdx->dqReadWrite, fdo);
    }
}

You record WORKING as the current state of your device, and you call RestartRequests for each of your queues to release any
IRPs that might have arrived between the time AddDevice ran and the time you received the IRP_MN_START_DEVICE
request.

6.3.2 Is It OK to Stop the Device?


The PnP Manager always asks your permission before sending you an IRP_MN_STOP_DEVICE. The query takes the form of
an IRP_MN_QUERY_STOP_DEVICE request that you can cause to succeed or fail as you choose. The query basically means,
“Would you be able to immediately stop your device if the system were to send you an IRP_MN_STOP_DEVICE in a few
nanoseconds?” You can handle this query in two slightly different ways. Here’s the first way, which is appropriate when your
device might be busy with an IRP that either finishes quickly or can be easily terminated in the middle:

NTSTATUS HandleQueryStop(PDEVICE_OBJECT fdo, PIRP Irp)
{
    Irp->IoStatus.Status = STATUS_SUCCESS;
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    if (pdx->state != WORKING)
        return DefaultPnpHandler(fdo, Irp);
    if (!OkayToStop(pdx))
        return CompleteRequest(Irp, STATUS_UNSUCCESSFUL, 0);
    StallRequests(&pdx->dqReadWrite);
    WaitForCurrentIrp(&pdx->dqReadWrite);
    pdx->state = PENDINGSTOP;
    return DefaultPnpHandler(fdo, Irp);
}

1. This statement handles a peculiar situation that can arise for a boot device: the PnP Manager might send you a
QUERY_STOP when you haven’t initialized yet. You want to ignore such a query, which is tantamount to saying yes.
2. At this point, you perform some sort of investigation to see whether it will be OK to revert to the STOPPED state. I’ll
discuss factors bearing on the investigation next.
3. StallRequests puts the DEVQUEUE in the STALLED state so that any new IRP just goes into the queue.
WaitForCurrentIrp waits until the current request, if there is one, finishes on the device. These two steps make the device
quiescent until we know whether the device is really going to stop or not. If the current IRP won’t finish quickly of its
own accord, you’ll do something (such as calling IoCancelIrp to force a lower-level driver to finish the current IRP) to
“encourage” it to finish; otherwise, WaitForCurrentIrp won’t return.
4. At this point, we have no reason to demur. We therefore record our state as PENDINGSTOP. Then we pass the request
down the stack so that other drivers can have a chance to accept or decline this query.
The other basic way of handling QUERY_STOP is appropriate when your device might be busy with a request that will take a
long time and can’t be stopped in the middle, such as a tape retension operation that can’t be stopped without potentially
breaking the tape. In this case, you can use the DEVQUEUE object’s CheckBusyAndStall function. That function returns TRUE
if the device is busy, whereupon you cause the QUERY_STOP to fail with STATUS_UNSUCCESSFUL. The function returns
FALSE if the device is idle, in which case it also stalls the queue. (The operations of checking the state of the device and
stalling the queue need to be protected by a spin lock, which is why I wrote this function in the first place.)
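A hedged sketch of that variant of the handler follows, assuming CheckBusyAndStall takes just the queue address, as the other DEVQUEUE routines in these fragments do:

NTSTATUS HandleQueryStop(PDEVICE_OBJECT fdo, PIRP Irp)
{
    Irp->IoStatus.Status = STATUS_SUCCESS;
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    if (pdx->state != WORKING)
        return DefaultPnpHandler(fdo, Irp);
    if (CheckBusyAndStall(&pdx->dqReadWrite))
        return CompleteRequest(Irp, STATUS_UNSUCCESSFUL, 0);    // long-running request in progress
    pdx->state = PENDINGSTOP;                                   // queue is now stalled
    return DefaultPnpHandler(fdo, Irp);
}
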
You can cause a stop query to fail for many reasons. Disk devices that are used for paging, for example, cannot be stopped.
Neither can devices that are used for storing hibernation or crash dump files. (You’ll know about these characteristics as a
result of an IRP_MN_DEVICE_USAGE_NOTIFICATION request, which I’ll discuss later in “Other Configuration
Functionality.”) Other reasons may also apply to your device.
Even if you have the query succeed, one of the drivers underneath you might cause it to fail for some reason. Even if all the
drivers have the query succeed, the PnP Manager might decide not to shut you down. In any of these cases, you’ll receive
another PnP request with the minor code IRP_MN_CANCEL_STOP_DEVICE to tell you that your device won’t be shut down.
You should then clear whatever state you set during the initial query:

NTSTATUS HandleCancelStop(PDEVICE_OBJECT fdo, PIRP Irp)
{
    Irp->IoStatus.Status = STATUS_SUCCESS;
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    if (pdx->state != PENDINGSTOP)
        return DefaultPnpHandler(fdo, Irp);
    NTSTATUS status = ForwardAndWait(fdo, Irp);
    pdx->state = WORKING;
    RestartRequests(&pdx->dqReadWrite, fdo);
    return CompleteRequest(Irp, status);
}

We first check to see whether a stop operation is even pending. Some higher-level driver might have vetoed a query that we
never saw, so we’d still be in the WORKING state. If we’re not in the PENDINGSTOP state, we simply forward the IRP.
Otherwise, we send the CANCEL_STOP IRP synchronously to the lower-level drivers. That is, we use our ForwardAndWait
helper function to send the IRP down the stack and await its completion. We wait for low-level drivers because we’re about to
resume processing IRPs, and the drivers might have work to do before we send them an IRP. We then change our state variable
to indicate that we’re back in the WORKING state, and we call RestartRequests to unstall the queues we stalled when we
caused the query to succeed.

6.3.3 While the Device Is Stopped


If, on the other hand, all device drivers have the query succeed and the PnP Manager decides to go ahead with the shutdown,
you’ll get an IRP_MN_STOP_DEVICE next. Your subdispatch function will look like this one:

NTSTATUS HandleStopDevice(PDEVICE_OBJECT fdo, PIRP Irp)
{
    Irp->IoStatus.Status = STATUS_SUCCESS;
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    if (pdx->state != PENDINGSTOP)
    {
        <complicated stuff>
    }
    StopDevice(fdo, pdx->state == WORKING);
    pdx->state = STOPPED;
    return DefaultPnpHandler(fdo, Irp);
}

1. We expect the system to send us a QUERY_STOP before it sends us a STOP, so we should already be in the
PENDINGSTOP state with all of our queues stalled. There is, however, a bug in Windows 98 such that we can sometimes
get a STOP (without a QUERY_STOP) instead of a REMOVE. You need to take some action at this point that causes you
to reject any new IRPs, but you mustn’t really remove your device object or do the other things you do when you really
receive a REMOVE request.
2. StopDevice is the helper function I’ve already discussed that deconfigures the device.
3. We now enter the STOPPED state. We’re in almost the same situation as we were when AddDevice was done. That is, all
queues are stalled, and the device has no I/O resources. The only difference is that we’ve left our registered interfaces
enabled, which means that applications won’t have received removal notifications and will leave their handles open.
Applications can also open new handles in this situation. Both aspects are just as they should be because the stop
condition won’t last long.
4. As I previously discussed, the last thing we do to handle IRP_MN_STOP_DEVICE is pass the request down to the lower
layers of the driver hierarchy.

6.3.4 Is It OK to Remove the Device?


Just as the PnP Manager asks your permission before shutting your device down with a stop device request, it also might ask
your permission before removing your device. This query takes the form of an IRP_MN_QUERY_REMOVE_DEVICE request
that you can, once again, cause to succeed or fail as you choose. And, just as with the stop query, the PnP Manager will use an
IRP_MN_CANCEL_REMOVE_DEVICE request if it changes its mind about removing the device.

NTSTATUS HandleQueryRemove(PDEVICE_OBJECT fdo, PIRP Irp)
{
    Irp->IoStatus.Status = STATUS_SUCCESS;
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    if (OkayToRemove(fdo))
    {
        StallRequests(&pdx->dqReadWrite);
        WaitForCurrentIrp(&pdx->dqReadWrite);
        pdx->prevstate = pdx->state;
        pdx->state = PENDINGREMOVE;
        return DefaultPnpHandler(fdo, Irp);
    }
    return CompleteRequest(Irp, STATUS_UNSUCCESSFUL, 0);
}

NTSTATUS HandleCancelRemove(PDEVICE_OBJECT fdo, PIRP Irp)
{
    Irp->IoStatus.Status = STATUS_SUCCESS;
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    if (pdx->state != PENDINGREMOVE)
        return DefaultPnpHandler(fdo, Irp);
    NTSTATUS status = ForwardAndWait(fdo, Irp);
    pdx->state = pdx->prevstate;
    RestartRequests(&pdx->dqReadWrite, fdo);
    return CompleteRequest(Irp, status);
}

1. This OkayToRemove helper function provides the answer to the question, “Is it OK to remove this device?” In general,
this answer includes some device-specific ingredients, such as whether the device holds a paging or hibernation file, and
so on.
2. Just as I showed you for IRP_MN_QUERY_STOP_DEVICE, you want to stall the request queue and wait for a short
period, if necessary, until the current request finishes.
3. If you look at Figure 6-1 carefully, you’ll notice that it’s possible to get a QUERY_REMOVE when you’re in either the
WORKING or the STOPPED state. The right thing to do if the current query is later cancelled is to return to the original
state. Hence, I have a prevstate variable in the device extension to record the prequery state.
4. We get the CANCEL_REMOVE request when someone either above or below us vetoes a QUERY_REMOVE. If we
never saw the query, we’ll still be in the WORKING state and don’t need to do anything with this IRP. Otherwise, we
need to forward it to the lower levels before we process it because we want the lower levels to be ready to process the
IRPs we’re about to release from our queues.
5. Here we undo the steps we took when we succeeded the QUERY_REMOVE. We revert to the previous state. We stalled
the queues when we handled the query and need to unstall them now.

6.3.5 Synchronizing Removal


It turns out that the I/O Manager can send you PnP requests simultaneously with other substantive I/O requests, such as
requests that involve reading or writing. It’s entirely possible, therefore, for you to receive an IRP_MN_REMOVE_DEVICE at
a time when you’re still processing another IRP. It’s up to you to prevent untoward consequences, and the standard way to do
that involves using an IO_REMOVE_LOCK object and several associated kernel-mode support routines.
The basic idea behind the standard scheme for preventing premature removal is that you acquire the remove lock each time
you start processing a request that you will pass down the PnP stack, and you release the lock when you’re done. Before you
remove your device object, you make sure that the lock is free. If not, you wait until all references to the lock are released.
Figure 6-5 illustrates the process.

Figure 6-5. Operation of an IO_REMOVE_LOCK.


To handle the mechanics of this process, you define a variable in the device extension:

struct DEVICE_EXTENSION {
    ...
    IO_REMOVE_LOCK RemoveLock;
    ...
};

You initialize the lock object during AddDevice:

NTSTATUS AddDevice(PDRIVER_OBJECT DriverObject, PDEVICE_OBJECT pdo)
{
    ...
    IoInitializeRemoveLock(&pdx->RemoveLock, 0, 0, 0);

The last three parameters to IoInitializeRemoveLock are, respectively, a tag value, an expected maximum lifetime for a lock,
and a maximum lock count, none of which is used in the free build of the operating system.
These preliminaries set the stage for what you do during the lifetime of the device object. Whenever you receive an I/O request
that you plan to forward down the stack, you call IoAcquireRemoveLock. IoAcquireRemoveLock will return
STATUS_DELETE_PENDING if a removal operation is under way. Otherwise, it will acquire the lock and return
STATUS_SUCCESS. Whenever you finish such an I/O operation, you call IoReleaseRemoveLock, which will release the lock
and might unleash a heretofore pending removal operation. In the context of some purely hypothetical dispatch function that
synchronously forwards an IRP, the code might look like this:

NTSTATUS DispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    NTSTATUS status = IoAcquireRemoveLock(&pdx->RemoveLock, Irp);
    if (!NT_SUCCESS(status))
        return CompleteRequest(Irp, status, 0);
    status = ForwardAndWait(fdo, Irp);
    if (!NT_SUCCESS(status))
    {
        IoReleaseRemoveLock(&pdx->RemoveLock, Irp);
        return CompleteRequest(Irp, status, 0);
    }
    ...
    IoReleaseRemoveLock(&pdx->RemoveLock, Irp);
    return CompleteRequest(Irp, <some code>, <info value>);
}


The second argument to IoAcquireRemoveLock and IoReleaseRemoveLock is just a tag value that a checked build of the
operating system can use to match up acquisition and release calls, by the way.
The calls to acquire and release the remove lock dovetail with additional logic in the PnP dispatch function and the remove
device subdispatch function. First DispatchPnp has to obey the rule about locking and unlocking the device, so it will contain
the following code, which I didn’t show you earlier in “IRP_MJ_PNP Dispatch Function”:

NTSTATUS DispatchPnp(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    NTSTATUS status = IoAcquireRemoveLock(&pdx->RemoveLock, Irp);
    if (!NT_SUCCESS(status))
        return CompleteRequest(Irp, status, 0);
    ...
    status = (*fcntab[fcn])(fdo, Irp);
    if (fcn != IRP_MN_REMOVE_DEVICE)
        IoReleaseRemoveLock(&pdx->RemoveLock, Irp);
    return status;
}

In other words, DispatchPnp locks the device, calls the subdispatch routine, and then (usually) unlocks the device afterward.
The subdispatch routine for IRP_MN_REMOVE_DEVICE has additional special logic that you also haven’t seen yet:

NTSTATUS HandleRemoveDevice(PDEVICE_OBJECT fdo, PIRP Irp)
{
    Irp->IoStatus.Status = STATUS_SUCCESS;
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    AbortRequests(&pdx->dqReadWrite, STATUS_DELETE_PENDING);
    DeregisterAllInterfaces(pdx);
    StopDevice(fdo, pdx->state == WORKING);
    pdx->state = REMOVED;
    NTSTATUS status = DefaultPnpHandler(fdo, Irp);
    IoReleaseRemoveLockAndWait(&pdx->RemoveLock, Irp);
    RemoveDevice(fdo);
    return status;
}

1. Windows 98/Me doesn’t send the SURPRISE_REMOVAL request, so this REMOVE IRP may be the first indication you
have that the device has disappeared. Calling StopDevice allows you to release all your I/O resources in case you didn’t
get an earlier IRP that caused you to release them. Calling AbortRequests causes you to complete any queued IRPs and to
start rejecting any new IRPs.
2. We pass this request to the lower layers now that we’ve done our work.
3. The PnP dispatch routine acquired the remove lock. We now call the special function IoReleaseRemoveLockAndWait to
release that lock reference and wait until all references to the lock are released. Once you call
IoReleaseRemoveLockAndWait, any subsequent call to IoAcquireRemoveLock will elicit a STATUS_DELETE_PENDING
status to indicate that device removal is under way.

NOTE
You’ll notice that the IRP_MN_REMOVE_DEVICE handler might block while an IRP finishes. This is certainly OK
in Windows 98/Me and Windows XP, which were designed with this possibility in mind—the IRP gets sent in the
context of a system thread that’s allowed to block. Some WDM functionality (a Microsoft developer even called
it “embryonic”) is present in OEM releases of Microsoft Windows 95, but you can’t block a remove device
request there. Consequently, if your driver needs to run in Windows 95, you need to discover that fact and avoid
blocking. That discovery process is left as an exercise for you.

It bears repeating that you need to use the remove lock only for an IRP that you pass down the PnP stack. If you have the
stamina, you can read the next section to understand exactly why this conclusion is true—and note that it differs from the
conventional wisdom that I and others have been espousing for several years. If someone sends you an IRP that you handle
entirely inside your own driver, you can rely on whoever sent you the IRP to make sure your driver remains in memory until
you both complete the IRP and return from your dispatch routine. If you send an IRP to someone outside your PnP stack, you’ll
use other means (such as a referenced file or device object) to keep the target driver in memory until it both completes the IRP
and returns from its dispatch routine.


6.3.6 Why Do I Need This @#$! Remove Lock, Anyway?


A natural question at this point is why, in the context of a robust and full-featured modern operating system, you even need to
worry about somebody unloading a driver when it knows, or should know, that it’s busy handling an IRP. This question is hard
to answer, but here goes.
The remove lock isn’t necessary to guard against having your device object removed out from under you while
you’re processing an IRP. Rather, it protects you from sending an IRP down your PnP stack to a lower device
object that no longer exists or that might cease to exist before the IRP finishes. To make this clear, I need to
explain rather fully how the PnP Manager and the Object Manager work together to keep drivers and device
objects around while they’re needed. I’m grossly oversimplifying here in order to emphasize the basic things you need to
understand.
First of all, every object that the Object Manager manages carries a reference count. When someone creates such an object, the
Object Manager initializes the reference count to 1. Thereafter, anyone can call ObReferenceObject to increment the reference
count and ObDereferenceObject to decrement it. For each type of object, there is a routine that you can call to destroy the
object. For example, IoDeleteDevice is the routine you call to delete a DEVICE_OBJECT. That routine never directly releases
the memory occupied by the object. Instead, it directly or indirectly calls ObDereferenceObject to release the original reference.
Only when the reference count drops to 0 will the Object Manager actually destroy the object.

NOTE
In Chapter 5, I advised you to take an extra reference to a file object or device object discovered via
IoGetDeviceObjectPointer around the call to IoCallDriver for an asynchronous IRP. The reason for the advice
may now be clear: you want to be sure the target driver for the IRP is pinned in memory until its dispatch
routine returns regardless of whether your completion routine releases the reference taken by
IoGetDeviceObjectPointer. Dang, but this is getting complicated!

IoDeleteDevice makes some checks before it releases the last reference to a device object. In both operating systems, it checks
whether the AttachedDevice pointer is NULL. This field in the device object points upward to the device object for the next
upward driver. This field is set by IoAttachDeviceToDeviceStack and reset by IoDetachDevice, which are functions that WDM
drivers call in their AddDevice and RemoveDevice functions, respectively.
You want to think about the entire PnP stack of device objects as being the target of IRPs that the I/O Manager and drivers
outside the stack send to “your” device. This is because the driver for the topmost device object in the stack is always first to
process any IRP. Before anyone sends an IRP to your stack, however, they will have a referenced pointer to this topmost device
object, and they won’t release the reference until after the IRP completes. So if a driver stack contains just one device object,
there will never be any danger of having a device object or driver code disappear while the driver is processing an IRP: the IRP
sender’s reference pins the device object in memory, even if someone calls IoDeleteDevice before the IRP completes, and the
device object pins the driver code in memory.
WDM driver stacks usually contain two or more device objects, so you have to wonder about the second and lower objects in a
stack. After all, whoever sends an IRP to the device has a reference only to the topmost device object, not to the objects lower
down in the stack. Imagine the following scenario, then. Someone sends an IRP_MJ_SOMETHING (a made-up major function
to keep us focused on the remove lock) to the topmost filter device object (FiDO), whose driver sends it down the stack to your
function driver. You plan to send this IRP down to the filter driver underneath you. But, at about the same time on another CPU,
the PnP Manager has sent your driver stack an IRP_MN_REMOVE_DEVICE request.
Before the PnP Manager sends REMOVE_DEVICE requests, it takes an extra reference to every device object in the stack.
Then it sends the IRP. Each driver passes the IRP down the stack and then calls IoDetachDevice followed by IoDeleteDevice.
At each level, IoDeleteDevice sees that AttachedDevice is not (yet) NULL and decides that the time isn’t quite right to
dereference the device object. When the driver at the next higher level calls IoDetachDevice, however, the time is right, and
the I/O Manager dereferences the device object. Without the PnP Manager’s extra reference, the object would then disappear,
and that might trigger unloading the driver at that level of the stack. Once the REMOVE_DEVICE request is complete, the PnP
Manager will release all the extra references. That will allow all but the topmost device object to disappear because only the
topmost object is protected by the reference owned by the sender of the IRP_MJ_SOMETHING.

IMPORTANT
Every driver I’ve ever seen or written processes REMOVE_DEVICE synchronously. That is, no driver ever pends
a REMOVE_DEVICE request. Consequently, the calls to IoDetachDevice and IoDeleteDevice at any level of the
PnP stack always happen after the lower-level drivers have already performed those calls. This fact doesn’t
impact our analysis of the remove lock because the PnP Manager won’t release its extra reference to the stack
until after REMOVE_DEVICE actually completes, which requires IoCompleteRequest to run to conclusion.

Can you see why the Microsoft folks who understand the PnP Manager deeply are fond of saying, “Game Over” at this point?
We’re going to trust whoever is above us in the PnP stack to keep our device object and driver code in memory until we’re
done handling the IRP_MJ_SOMETHING that I hypothesized. But we haven’t (yet) done anything to keep the next lower
device object and driver in memory. While we were getting ready to send the IRP down, the IRP_MN_REMOVE_DEVICE ran
to completion, and the lower driver is now gone!
And that’s the problem that the remove lock solves: we simply don’t want to pass an IRP down the stack if we’ve already
returned from handling an IRP_MN_REMOVE_DEVICE. Conversely, we don’t want to return from IRP_MN_REMOVE_DEVICE
(and thereby allow the PnP Manager to release what might be the last reference to the lower
device object) until we know the lower driver is done with all the IRPs that we’ve sent to it.
Armed with this understanding, let’s look again at an IRP-handling scenario in which the remove lock is helpful. This is an
example of my IRP-handling scenario 1 (pass down with completion routine) from Chapter 5:

NTSTATUS DispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    NTSTATUS status = IoAcquireRemoveLock(&pdx->RemoveLock, Irp);        // (A)
    if (!NT_SUCCESS(status))
        return CompleteRequest(Irp, status, 0);
    IoCopyCurrentIrpStackLocationToNext(Irp);
    IoSetCompletionRoutine(Irp,
        (PIO_COMPLETION_ROUTINE) CompletionRoutine, pdx, TRUE, TRUE, TRUE);
    return IoCallDriver(pdx->LowerDeviceObject, Irp);
}

NTSTATUS CompletionRoutine(PDEVICE_OBJECT fdo, PIRP Irp, PDEVICE_EXTENSION pdx)
{
    if (Irp->PendingReturned)
        IoMarkIrpPending(Irp);
    <desired completion processing>
    IoReleaseRemoveLock(&pdx->RemoveLock, Irp);                          // (B)
    return STATUS_SUCCESS;
}

In summary, we acquire the remove lock for this IRP in the dispatch routine, and we release it in the completion routine.
Suppose this IRP is racing an IRP_MN_REMOVE_DEVICE down the stack. If our HandleRemoveDevice function has gotten
to the point of calling IoReleaseRemoveLockAndWait before we get to point A, perhaps all the device objects in the stack are
teetering on the edge of extinction because the REMOVE_DEVICE may have finished long ago. If we’re the topmost device
object, somebody’s reference is keeping us alive. If we’re lower down the stack, the driver above us is keeping us alive. Either
way, it’s certainly OK for us to execute instructions. We’ll find that our call to IoAcquireRemoveLock returns
STATUS_DELETE_PENDING, so we’ll just complete the IRP and return.
Suppose instead that we win the race by calling IoAcquireRemoveLock before our HandleRemoveDevice function calls
IoReleaseRemoveLockAndWait. In this case, we’ll pass the IRP down the stack. IoReleaseRemoveLockAndWait will block until
our completion routine (at point B) releases the lock. At this exact instant, we fall back on the IRP sender’s reference or the
driver above us to keep us in memory long enough for our completion routine to return.
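For orientation, here is a hedged sketch of the HandleRemoveDevice side that this analysis keeps referring to. It is not the listing from this book’s sample code; it assumes that the shared DispatchPnp routine has already acquired the remove lock for this IRP, and it shows only the ordering of calls that matters for the discussion (aborting queues and releasing hardware resources would happen where the comment indicates):

NTSTATUS HandleRemoveDevice(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    // (abort queued IRPs and release hardware resources here)
    Irp->IoStatus.Status = STATUS_SUCCESS;
    IoSkipCurrentIrpStackLocation(Irp);
    NTSTATUS status = IoCallDriver(pdx->LowerDeviceObject, Irp);
    IoReleaseRemoveLockAndWait(&pdx->RemoveLock, Irp);   // blocks until IRPs we sent down (point B) release the lock
    IoDetachDevice(pdx->LowerDeviceObject);
    IoDeleteDevice(fdo);
    return status;
}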
At this point in the analysis, I have to raise an alarming point that everyone who writes WDM drivers or writes or lectures
about them, including me, has missed until now. Passing an IRP down without a completion routine is actually unsafe because
it allows us to send an IRP down to a driver that isn’t pinned in memory. Anytime you see a call to
IoSkipCurrentIrpStackLocation (there are 204 of them in the Windows XP DDK), your antennae should twitch. We’ve all been
getting away with this because some redundant protections are in place and because the coincidence of an
IRP_MN_REMOVE_DEVICE with some kind of problem IRP is very rare. Refer to the sidebar for a discussion.

The Redundant Guards Against Early Removal


As the text says, Windows XP contains some redundant protections against early removal of device objects. In
both Windows XP and Windows 2000, the PnP Manager won’t send an IRP_MN_REMOVE_DEVICE if any file
objects exist that point to any device object in the stack. Many IRPs are handle based in that they originate in
callers that hold a referenced pointer to a file object. Consequently, there is never a concern with these
handle-based IRPs that your lower device object might disappear. You can dispense with the remove lock
altogether for these IRPs if you trust all the drivers who send them to you to either have a referenced file object
or hold their own remove lock while they’re outstanding.

There is a large class of IRPs that device drivers never see because these IRPs involve file system operations on
volumes. Thus, worrying about what might happen as a device driver handles an
IRP_MJ_QUERY_VOLUME_INFORMATION, for example, isn't necessary.

Only a few IRPs aren’t handle based or aimed at file system drivers, and most of them carry their own built-in
safeguards. To get an IRP_MJ_SHUTDOWN, you have to specifically register with the I/O Manager by calling
IoRegisterShutdownNotification. IoDeleteDevice automatically deregisters you if you happen to forget, and you
won’t be getting REMOVE_DEVICE requests while shutdown notifications are in progress. (While we’re on the
subject, note these additional details about IRP_MJ_SHUTDOWN. Like every other IRP, this one will be sent first
to the topmost FiDO in the PnP stack if any driver in the stack has called IoRegisterShutdownNotification.
Furthermore, as many IRPs will be sent as there are drivers in the stack with active notification requests. Thus,
drivers should take care to do their shutdown processing only once and should pass this IRP down the stack
after doing their own shutdown processing.)

IRP_MJ_SYSTEM_CONTROL is another special case. The Windows Management Instrumentation (WMI)
subsystem uses this request to perform WMI query and set operations. Part of your StopDevice processing
ought to be deregistering with WMI, and the deregistration call doesn’t return until all of these IRPs have
drained through your device. After the deregistration call, you won’t get any more WMI requests.

The PnP Manager itself is the source of most IRP_MJ_PNP requests, and you can be sure that it won’t overlap
a REMOVE_DEVICE request with another PnP IRP. You can’t, however, be sure there’s no overlap with PnP IRPs
sent by other drivers, such as a QUERY_DEVICE_RELATIONS to get the physical device object (PDO) address or
a QUERY_INTERFACE to locate a direct-call interface.

Finally, there’s IRP_MJ_POWER, which is a potential problem because the Power Manager doesn’t lock an entire
device stack and doesn’t hold a file object pointer.

The window of vulnerability is actually pretty small. Consider the following fragment of dispatch routines in two drivers:

NTSTATUS DriverA_DispatchSomething(...)
{
    NTSTATUS status = IoAcquireRemoveLock(...);
    if (!NT_SUCCESS(status))
        return CompleteRequest(...);
    IoSkipCurrentIrpStackLocation(...);
    status = IoCallDriver(...);
    IoReleaseRemoveLock(...);
    return status;
}

NTSTATUS DriverB_DispatchSomething(...)
{
    return ??;
}

Driver A’s use of the remove lock protects Driver B until Driver B’s dispatch routine returns. Thus, if Driver B completes the
IRP or itself passes the IRP down using IoSkipCurrentIrpStackLocation, Driver B’s involvement with the IRP will certainly be
finished by the time Driver A is able to release the remove lock. If Driver B were to pend the IRP, Driver A wouldn’t be
holding the remove lock by the time Driver B got around to completing the IRP. We can assume, however, that Driver B will
have some mechanism in place for purging its queues of pending IRPs before returning from its own HandleRemoveDevice
function. Driver A won’t call IoDetachDevice or return from its own HandleRemoveDevice function until afterwards.
The only time there will be a problem is if Driver B passes the IRP down with a completion routine installed via the original
IoSetCompletionRoutine macro. Even here, if the lowest driver that handles this IRP does so correctly, its HandleRemoveDevice
function won’t return until the IRP is completed. We’ll have just a slim chance that Driver B could be unloaded before its
completion routine runs.
There is, unfortunately, no way for a driver to completely protect itself from being unloaded while processing an IRP. Any
scheme you or I can devise will inevitably risk executing at least one instruction (a return) after the system removes the driver
image from memory. You can, however, hope that the drivers above you minimize the risk by using the techniques I’ve
outlined here.

6.3.7 How the DEVQUEUE Works with PnP


In contrast with other examples in this book, I’m going to show you the full implementation of the DEVQUEUE object, even
though the source code is in the companion content. I’m making an exception in this case because I think an annotated listing
of the functions will make it easier for you to understand how to use it. We’ve already discussed the major routines in the
preceding chapter, so I can focus here on the routines that dovetail with IRP_MJ_PNP.

Stalling the Queue


Stalling the IRP queue involves two DEVQUEUE functions:

VOID NTAPI StallRequests(PDEVQUEUE pdq)
{
    InterlockedIncrement(&pdq->stallcount);          // (1)
}

BOOLEAN NTAPI CheckBusyAndStall(PDEVQUEUE pdq)
{
    KIRQL oldirql;
    KeAcquireSpinLock(&pdq->lock, &oldirql);         // (2)
    BOOLEAN busy = pdq->CurrentIrp != NULL;          // (3)
    if (!busy)
        InterlockedIncrement(&pdq->stallcount);      // (4)
    KeReleaseSpinLock(&pdq->lock, oldirql);
    return busy;
}

1. To stall requests, we just need to set the stall counter to a nonzero value. It’s unnecessary to protect the increment with a
spin lock because any thread that might be racing with us to change the value will also be using an interlocked increment
or decrement.
2. Since CheckBusyAndStall needs to operate as an atomic function, we first take the queue’s spin lock.
3. CurrentIrp being non-NULL is the signal that the device is busy handling one of the requests from this queue.
4. If the device is currently idle, this statement starts stalling the queue, thereby preventing the device from becoming busy
later on.
Recall that StartPacket and StartNextPacket don’t send IRPs to the queue’s StartIo routine while the stall counter is nonzero. In
addition, InitializeQueue initializes the stall counter to 1, so the queue begins life in the stalled state.
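
As a usage sketch, here is how a query-stop handler might call CheckBusyAndStall. The helper names CompleteRequest and DefaultPnpHandler match the ones used elsewhere in this chapter, while dqReadWrite is an assumed DEVQUEUE member of the device extension rather than a name taken from this listing:

NTSTATUS HandleQueryStop(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    if (CheckBusyAndStall(&pdx->dqReadWrite))               // busy: veto the stop
        return CompleteRequest(Irp, STATUS_UNSUCCESSFUL, 0);
    Irp->IoStatus.Status = STATUS_SUCCESS;                  // idle: the queue is now stalled
    return DefaultPnpHandler(fdo, Irp);
}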

Restarting the Queue


RestartRequests is the function that unstalls a queue. This function is quite similar to StartNextPacket, which I showed you in
Chapter 5.

VOID RestartRequests(PDEVQUEUE pdq, PDEVICE_OBJECT fdo)
{
    KIRQL oldirql;
    KeAcquireSpinLock(&pdq->lock, &oldirql);                          // (1)
    if (InterlockedDecrement(&pdq->stallcount) > 0)                   // (2)
    {
        KeReleaseSpinLock(&pdq->lock, oldirql);
        return;
    }
    while (!pdq->stallcount && !pdq->CurrentIrp && !pdq->abortstatus  // (3)
        && !IsListEmpty(&pdq->head))
    {
        PLIST_ENTRY next = RemoveHeadList(&pdq->head);
        PIRP Irp = CONTAINING_RECORD(next, IRP, Tail.Overlay.ListEntry);
        if (!IoSetCancelRoutine(Irp, NULL))
        {
            InitializeListHead(&Irp->Tail.Overlay.ListEntry);
            continue;
        }
        pdq->CurrentIrp = Irp;
        KeReleaseSpinLockFromDpcLevel(&pdq->lock);
        (*pdq->StartIo)(fdo, Irp);
        KeLowerIrql(oldirql);
        return;
    }
    KeReleaseSpinLock(&pdq->lock, oldirql);
}

1. We acquire the queue spin lock to prevent interference from a simultaneous invocation of StartPacket.
2. Here we decrement the stall counter. If it’s still nonzero, the queue remains stalled, and we return.
3. This loop duplicates a similar loop inside StartNextPacket. We need to duplicate the code here to accomplish all of this
function’s actions within one invocation of the spin lock.
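As a matching usage sketch, a handler for a subsequent IRP_MN_CANCEL_STOP_DEVICE (or a failed stop) would undo the stall created by the earlier query-stop; dqReadWrite is again an assumed member name:

    RestartRequests(&pdx->dqReadWrite, fdo);   // pairs with the StallRequests or CheckBusyAndStall call that stalled the queue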


NOTE
True confession: The first edition described a much simpler—and incorrect—implementation of
RestartRequests. A reader pointed out a race between the earlier implementation and StartPacket, which was
corrected on my Web site as shown here.

Awaiting the Current IRP


The handler for IRP_MN_STOP_DEVICE might need to wait for the current IRP, if any, to finish by calling
WaitForCurrentIrp:

VOID NTAPI WaitForCurrentIrp(PDEVQUEUE pdq)
{
    KeClearEvent(&pdq->evStop);                      // (1)
    ASSERT(pdq->stallcount != 0);                    // (2)
    KIRQL oldirql;
    KeAcquireSpinLock(&pdq->lock, &oldirql);         // (3)
    BOOLEAN mustwait = pdq->CurrentIrp != NULL;
    KeReleaseSpinLock(&pdq->lock, oldirql);
    if (mustwait)
        KeWaitForSingleObject(&pdq->evStop, Executive, KernelMode, FALSE, NULL);
}

1. StartNextPacket signals the evStop event each time it’s called. We want to be sure that the wait we’re about to perform
doesn’t complete because of a now-stale signal, so we clear the event before doing anything else.
2. It doesn’t make sense to call this routine without first stalling the queue. Otherwise, StartNextPacket will just start the
next IRP if there is one, and the device will become busy again.
3. If the device is currently busy, we’ll wait on the evStop event until someone calls StartNextPacket to signal that event. We
need to protect our inspection of CurrentIrp with the spin lock because, in general, testing a pointer for NULL isn’t an
atomic event. If the pointer is NULL now, it can’t change later because we’ve assumed that the queue is stalled.
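
Putting these pieces together, a stop-device handler might look roughly like the following hedged sketch. It assumes the preceding query-stop already stalled the queue via CheckBusyAndStall, and the dqReadWrite member and DefaultPnpHandler helper are the same assumed names used in the earlier sketches:

NTSTATUS HandleStopDevice(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    WaitForCurrentIrp(&pdx->dqReadWrite);   // the queue is already stalled; wait for the device to go idle
    // (release the device's I/O resources here)
    Irp->IoStatus.Status = STATUS_SUCCESS;
    return DefaultPnpHandler(fdo, Irp);
}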

Aborting Requests
Surprise removal of the device demands that we immediately halt every outstanding IRP that might try to touch the hardware.
In addition, we want to make sure that all further IRPs are rejected. The AbortRequests function helps with these tasks:

VOID NTAPI AbortRequests(PDEVQUEUE pdq, NTSTATUS status)
{
    pdq->abortstatus = status;
    CleanupRequests(pdq, NULL, status);
}

Setting abortstatus puts the queue in the REJECTING state so that all future IRPs will be rejected with the status value our
caller supplied. Calling CleanupRequests at this point—with a NULL file object pointer so that CleanupRequests will process
the entire queue—empties the queue.
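For example (a hedged sketch, with dqReadWrite once more standing in for whatever queues your driver owns, and STATUS_DELETE_PENDING offered as one plausible status), a surprise-removal handler might call:

    AbortRequests(&pdx->dqReadWrite, STATUS_DELETE_PENDING);   // queued IRPs fail now, and future IRPs will be rejected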
We don’t dare try to do anything with the IRP, if any, that’s currently active on the hardware. Drivers that don’t use the
hardware abstraction layer (HAL) to access the hardware—USB drivers, for example, which rely on the hub and
host-controller drivers—can count on another driver to cause the current IRP to fail. Drivers that use the HAL might, however,
need to worry about hanging the system or, at the very least, leaving an IRP in limbo because the nonexistent hardware can’t
generate the interrupt that would let the IRP finish. To deal with situations such as this, you call AreRequestsBeingAborted:

NTSTATUS AreRequestsBeingAborted(PDEVQUEUE pdq)
{
    return pdq->abortstatus;
}

It would be silly, by the way, to use the queue spin lock in this routine. Even if we captured the instantaneous value of
abortstatus in a thread-safe and multiprocessor-safe way, the value we return could become obsolete as soon as we released the
spin lock.
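
To illustrate the intended use, here is a hedged sketch of a StartIo routine that checks the abort status before touching the hardware. It relies on the assumed dqReadWrite member and the CompleteRequest helper from the earlier sketches, and it assumes StartNextPacket takes the queue and the device object in that order:

VOID StartIo(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    NTSTATUS abortstatus = AreRequestsBeingAborted(&pdx->dqReadWrite);
    if (abortstatus)
    {                                            // the device is gone: fail the IRP instead of programming hardware
        CompleteRequest(Irp, abortstatus, 0);
        StartNextPacket(&pdx->dqReadWrite, fdo);
        return;
    }
    // ...program the hardware for this request...
}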


NOTE
If your device might be removed in such a way that an outstanding request simply hangs, you should also have
some sort of watchdog timer running that will let you kill the IRP after a specified period of time. See the
“Watchdog Timers” section in Chapter 14.

Sometimes we need to undo the effect of a previous call to AbortRequests. AllowRequests lets us do that:

VOID NTAPI AllowRequests(PDEVQUEUE pdq)
{
    pdq->abortstatus = (NTSTATUS) 0;
}

6.4 Other Configuration Functionality


Up to this point, I’ve talked about the important concepts you need to know to write a hardware device driver. Now I’ll discuss
two less important minor function codes—IRP_MN_FILTER_RESOURCE_REQUIREMENTS and
IRP_MN_DEVICE_USAGE_NOTIFICATION—that you might need to process in a practical driver. Finally, I’ll mention how you can
register to receive notifications about PnP events that affect devices other than your own.

6.4.1 Filtering Resource Requirements


Sometimes the PnP Manager is misinformed about the resource requirements of your device. This can occur because of
hardware and firmware bugs, mistakes in the INF file for a legacy device, or other reasons. The system provides an escape
valve in the form of the IRP_MN_FILTER_RESOURCE_REQUIREMENTS request, which affords you a chance to examine
and possibly alter the list of resources before the PnP Manager embarks on the arbitration and assignment process that
culminates in your receiving a start device IRP.
When you receive a filter request, the FilterResourceRequirements substructure of the Parameters union in your stack location
points to an IO_RESOURCE_REQUIREMENTS_LIST data structure that lists the resource requirements for your device. In
addition, if any of the drivers above you have processed the IRP and modified the resource requirements, the
IoStatus.Information field of the IRP will point to a second IO_RESOURCE_REQUIREMENTS_LIST, which is the one from
which you should work. Your overall strategy will be as follows: If you want to add a resource to the current list of
requirements, you do so in your dispatch routine. Then you pass the IRP down the stack synchronously—that is, by using the
ForwardAndWait method you use with a start device request. When you regain control, you can modify or delete any of the
resource descriptions that appear in the list.
Here’s a brief and not very useful example that illustrates the mechanics of the filtering process:

NTSTATUS HandleFilterResources(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
    PIO_RESOURCE_REQUIREMENTS_LIST original = stack->Parameters
        .FilterResourceRequirements.IoResourceRequirementList;              // (1)
    PIO_RESOURCE_REQUIREMENTS_LIST filtered =
        (PIO_RESOURCE_REQUIREMENTS_LIST) Irp->IoStatus.Information;         // (2)
    PIO_RESOURCE_REQUIREMENTS_LIST source = filtered ? filtered : original; // (3)
    if (source->AlternativeLists != 1)                                      // (4)
        return DefaultPnpHandler(fdo, Irp);

    ULONG sizelist = source->ListSize;                                      // (5)
    PIO_RESOURCE_REQUIREMENTS_LIST newlist =
        (PIO_RESOURCE_REQUIREMENTS_LIST) ExAllocatePool(PagedPool,
        sizelist + sizeof(IO_RESOURCE_DESCRIPTOR));
    if (!newlist)
        return DefaultPnpHandler(fdo, Irp);
    RtlCopyMemory(newlist, source, sizelist);

    newlist->ListSize += sizeof(IO_RESOURCE_DESCRIPTOR);                    // (6)
    PIO_RESOURCE_DESCRIPTOR resource =
        &newlist->List[0].Descriptors[newlist->List[0].Count++];
    RtlZeroMemory(resource, sizeof(IO_RESOURCE_DESCRIPTOR));
    resource->Type = CmResourceTypeDevicePrivate;
    resource->ShareDisposition = CmResourceShareDeviceExclusive;
    resource->u.DevicePrivate.Data[0] = 42;

    Irp->IoStatus.Information = (ULONG_PTR) newlist;                        // (7)
    if (filtered && filtered != original)
        ExFreePool(filtered);

    NTSTATUS status = ForwardAndWait(fdo, Irp);                             // (8)
    if (NT_SUCCESS(status))
    {
        // stuff
    }

    Irp->IoStatus.Status = status;                                          // (9)
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return status;
}

1. The parameters for this request include a list of I/O resource requirements. These are derived from the device’s
configuration space, the registry, or wherever the bus driver happens to find them.
2. Higher-level drivers might have already filtered the resources by adding requirements to the original list. If so, they set
the IoStatus.Information field to point to the expanded requirements list structure.
3. If there’s no filtered list, we’ll extend the original list. If there’s a filtered list, we’ll extend that.
4. Theoretically, several alternative lists of requirements could exist, but dealing with that situation is beyond the scope of
this simple example.
5. We need to add any resources before we pass the request down the stack. First we allocate a new requirements list and
copy the old requirements into it.
6. Taking care to preserve the preexisting order of the descriptors, we add our own resource description. In this example,
we’re adding a resource that’s private to the driver.
7. We store the address of the expanded list of requirements in the IRP’s IoStatus.Information field, which is where
lower-level drivers and the PnP system will be looking for it. If we just extended an already filtered list, we need to
release the memory occupied by the old list.
8. We pass the request down using the same ForwardAndWait helper function that we used for IRP_MN_START_DEVICE.
If we weren’t going to modify any resource descriptors on the IRP’s way back up the stack, we could just call
DefaultPnpHandler here and propagate the returned status.
9. When we complete this IRP, whether we indicate success or failure, we must take care not to modify the Information
field of the I/O status block: it might hold a pointer to a resource requirements list that some driver—maybe even
ours!—installed on the way down. The PnP Manager will release the memory occupied by that structure when it’s no
longer needed.

6.4.2 Device Usage Notifications


Disk drivers (and the drivers for disk controllers) in particular sometimes need to know extrinsic facts about how they’re being
used by the operating system, and the IRP_MN_DEVICE_USAGE_NOTIFICATION request provides a means to gain that
knowledge. The I/O stack location for the IRP contains two parameters in the Parameters.UsageNotification substructure. See
Table 6-4. The InPath value (a Boolean) indicates whether the device is in the device path required to support that usage, and
the Type value indicates one of several possible special usages.

Parameter   Description
InPath      TRUE if device is in the path of the Type usage; FALSE if not
Type        Type of usage to which the IRP applies

Table 6-4. Fields in the Parameters.UsageNotification Substructure of an I/O Stack Location


In the subdispatch routine for the notification, you should have a switch statement (or other logic) that differentiates among the
notifications you know about. In most cases, you’ll pass the IRP down the stack. Consequently, a skeleton for the subdispatch
function is as follows:

NTSTATUS HandleUsageNotification(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
