Chapter 4
Synchronization
Microsoft Windows XP is a multitasking operating system that can run in a symmetric multiprocessor environment. It’s not my
purpose here to provide a rigorous description of the multitasking capabilities of Microsoft Windows XP; one good place to get
more information is David Solomon and Mark Russinovich’s Inside Windows 2000, Third Edition (Microsoft Press, 2000). All
we need to understand as driver writers is that our code executes in the context of one thread or another (and the thread context
can change from one invocation of our code to another) and that the exigencies of multitasking can yank control away from us
at practically any moment. Furthermore, true simultaneous execution of multiple threads is possible on a multiprocessor
machine. In general, we need to assume two worst-case scenarios:
1. The operating system can preempt any subroutine at any moment for an arbitrarily long period of time, so we cannot be
sure of completing critical tasks without interference or delay.
2. Even if we take steps to prevent preemption, code executing simultaneously on another CPU in the same computer can
interfere with our code—it’s even possible that the same set of instructions belonging to one of our programs could be
executing in parallel in the context of two different threads.
Windows XP gives you tools for solving these general synchronization problems. The system prioritizes the handling of
hardware and software interrupts with the interrupt request level (IRQL), and it offers a variety of synchronization primitives.
Some of these primitives are appropriate at times when you can safely block and unblock threads. One primitive, the spin lock,
allows you to synchronize access to shared resources even at times when thread blocking wouldn’t be allowed because of the
priority level at which a program runs.
Suppose your driver uses a static variable named lActiveRequests to count the requests it's currently handling, and that you
increment this variable when you receive a request and decrement it when you later complete the request:
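Here is a minimal sketch of the fragment in question (the routine name DispatchPnp appears later in this section; the body shown is only illustrative):

static LONG lActiveRequests;

NTSTATUS DispatchPnp(PDEVICE_OBJECT fdo, PIRP Irp)
  {
  ++lActiveRequests;
  ...                       // process the request
  --lActiveRequests;
  ...
  }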
I’m sure you recognize already that a counter such as this one ought not to be a static variable: it should be a member of your
device extension so that each device object has its own unique counter. Bear with me, and pretend that your driver always
manages only a single device. To make the example more meaningful, suppose finally that a function in your driver will be
called when it’s time to delete your device object. You might want to defer the operation until no more requests are outstanding,
so you might insert a test of the counter:
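A sketch of such a test (the routine name and the cleanup shown are illustrative; Chapter 6 presents the real solution):

NTSTATUS HandleRemoveDevice(PDEVICE_OBJECT fdo, PIRP Irp)
  {
  if (lActiveRequests == 0)
    IoDeleteDevice(fdo);    // safe only if no requests are outstanding
  else
    ...                     // somehow defer the deletion until the count reaches zero
  ...
  }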
This example describes a real problem, by the way, which we’ll tackle in Chapter 6 in our discussion of Plug and Play (PnP)
requests. The I/O Manager can try to remove one of our devices at a time when requests are active, and we need to guard
against that by keeping some sort of counter. I’ll show you in Chapter 6 how to use IoAcquireRemoveLock and some related
functions to solve the problem.
A horrible synchronization problem lurks in the code fragments I just showed you, but it becomes apparent only if you look
behind the increment and decrement operations inside DispatchPnp. On an x86 processor, the compiler might implement them
with instruction sequences like these:
; ++lActiveRequests;
mov eax, lActiveRequests
add eax, 1
mov lActiveRequests, eax
; --lActiveRequests;
mov eax, lActiveRequests
sub eax, 1
mov lActiveRequests, eax
To expose the synchronization problem, let’s consider first what might go wrong on a single CPU. Imagine two threads that are
both trying to advance through DispatchPnp at roughly the same time. We know they’re not both executing truly
simultaneously because we have only a single CPU for them to share. But imagine that one of the threads is executing near the
end of the function and manages to load the current contents of lActiveRequests into the EAX register just before the other
thread preempts it. Suppose lActiveRequests equals 2 at that instant. As part of the thread switch, the operating system saves
the EAX register (containing the value 2) as part of the outgoing thread’s context image somewhere in main memory.
NOTE
The point being made in the text isn’t limited to thread preemption that occurs as a result of a time slice
expiring. Threads can also involuntarily lose control because of page faults, changes in CPU affinity, or priority
changes instigated by outside agents. Think, therefore, of preemption as being an all-encompassing term that
includes all means of giving control of a CPU to another thread without explicit permission from the currently
running thread.
Now imagine that the other thread manages to get past the incrementing code at the beginning of DispatchPnp. It will
increment lActiveRequests from 2 to 3 (because the first thread never got to update the variable). If the first thread preempts
this other thread, the operating system will restore the first thread’s context, which includes the value 2 in the EAX register.
The first thread now proceeds to subtract 1 from EAX and store the result back in lActiveRequests. At this point,
lActiveRequests contains the value 1, which is incorrect. Somewhere down the road, we might prematurely delete our device
object because we’ve effectively lost track of one I/O request.
Solving this particular problem is easy on an x86 computer—we just replace the load/add/store and load/subtract/store
instruction sequences with atomic instructions:
; ++lActiveRequests;
inc lActiveRequests
; --lActiveRequests;
dec lActiveRequests
On an Intel x86, the INC and DEC instructions cannot be interrupted, so there will never be a case in which a thread can be
preempted in the middle of updating the counter. As it stands, though, this code still isn’t safe in a multiprocessor environment
because INC and DEC are implemented in several microcode steps. It’s possible for two different CPUs to be executing their
microcode just slightly out of step such that one of them ends up updating a stale value. The multi-CPU problem can also be
avoided in the x86 architecture by using a LOCK prefix:
; ++lActiveRequests;
lock inc lActiveRequests
; --lActiveRequests;
lock dec lActiveRequests
The LOCK instruction prefix locks out all other CPUs while the microcode for the current instruction executes, thereby
guaranteeing data integrity.
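In C, rather than in inline assembly, the same atomic updates are available on every platform through the interlocked service functions; a minimal sketch:

InterlockedIncrement(&lActiveRequests);    // atomic, multiprocessor-safe ++lActiveRequests
InterlockedDecrement(&lActiveRequests);    // atomic, multiprocessor-safe --lActiveRequests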
Not all synchronization problems have such an easy solution, unfortunately. The point of this example isn’t to demonstrate how
to solve one simple problem on one of the platforms where Windows XP runs but rather to illustrate the two sources of
difficulty: preemption of one thread by another in the middle of a state change and simultaneous execution of conflicting
state-change operations. We can avoid difficulty by judiciously using synchronization primitives, such as mutual exclusion
objects, to block other threads while our thread accesses shared data. At times when thread blocking is impermissible, we can
avoid preemption by using the IRQL priority scheme, and we can prevent simultaneous execution by judiciously using spin
locks.
4.2 Interrupt Request Level
Your StartIo routine, your DPC routines, and the other special routines that run at DISPATCH_LEVEL can access fields
in the device object and the device extension without interference from driver dispatch routines and one another. When one of
these routines is running, the rule stated earlier guarantees that no thread can preempt it on the same CPU to execute a driver
dispatch routine because the dispatch routine runs at a lower IRQL. Furthermore, no thread can preempt it to run another of
these special routines because that other routine will run at the same IRQL.
NOTE
Dispatch routine and DISPATCH_LEVEL unfortunately have similar names. Dispatch routines are so called
because the I/O Manager dispatches I/O requests to them. DISPATCH_LEVEL is so called because it’s the IRQL
at which the kernel’s thread dispatcher originally ran when deciding which thread to run next. (The thread
dispatcher runs at SYNCH_LEVEL, if you care. This is the same as DISPATCH_LEVEL on a uniprocessor machine,
if you really care.)
Between DISPATCH_LEVEL and PROFILE_LEVEL is room for various hardware interrupt levels. In general, each device that
generates interrupts has an IRQL that defines its interrupt priority vis-à-vis other devices. A WDM driver discovers the IRQL
for its interrupt when it receives an IRP_MJ_PNP request with the minor function code IRP_MN_START_DEVICE. The
device’s interrupt level is one of the many items of configuration information passed as a parameter to this request. We often
refer to this level as the device IRQL, or DIRQL for short. DIRQL isn’t a single request level. Rather, it’s the IRQL for the
interrupt associated with whichever device is under discussion at the time.
The other IRQL levels have meanings that sometimes depend on the particular CPU architecture. Since those levels are used
internally by the kernel, their meanings aren’t especially germane to the job of writing a device driver. The purpose of
APC_LEVEL, for example, is to allow the system to schedule an asynchronous procedure call (APC), which I’ll describe in
detail later in this chapter. Operations that occur at HIGH_LEVEL include taking a memory snapshot just prior to hibernating
the computer, processing a bug check, handling a totally spurious interrupt, and others. I’m not going to attempt to provide an
exhaustive list here because, as I said, you and I don’t really need to know all the details.
To summarize, drivers are normally concerned with three interrupt request levels:
PASSIVE_LEVEL, at which many dispatch routines and a few special routines execute
DISPATCH_LEVEL, at which StartIo and DPC routines execute
DIRQL, at which an interrupt service routine executes
NOTE
Driver dispatch routines usually execute at PASSIVE_LEVEL but not always. You can designate that you want to
receive IRP_MJ_POWER requests at DISPATCH_LEVEL by setting the DO_POWER_INRUSH flag, or by clearing
the DO_POWER_PAGABLE flag, in a device object. Sometimes a driver architecture requires that other drivers
be able to send certain IRPs at DISPATCH_LEVEL. The USB bus driver, for example, accepts data transfer
requests at DISPATCH_LEVEL or below. A standard serial-port driver accepts any read, write, or control
operation at or below DISPATCH_LEVEL.
If your dispatch routine queues the IRP by calling IoStartPacket, your next encounter with the request will be when the I/O
Manager calls your StartIo routine. This call occurs at DISPATCH_LEVEL because the system needs to access the queue of I/O
requests without interference from the other routines that are inserting and removing IRPs from the queue. As I’ll discuss later
in this chapter, queue access occurs under protection of a spin lock, and that carries with it execution at DISPATCH_LEVEL.
Later on, your device might generate an interrupt, whereupon your interrupt service routine will be called at DIRQL. It’s likely
that some registers in your device can’t safely be shared. If you access those registers only at DIRQL, you can be sure that no
one can interfere with your interrupt service routine (ISR) on a single-CPU computer. If other parts of your driver need to
access these crucial hardware registers, you would guarantee that those other parts execute only at DIRQL. The
KeSynchronizeExecution service function helps you enforce that rule, and I’ll discuss it in Chapter 7 in connection with
interrupt handling.
Still later, you might arrange to have a DPC routine called. DPC routines execute at DISPATCH_LEVEL because, among other
things, they need to access your IRP queue to remove the next request from a queue and pass it to your StartIo routine. You call
the IoStartNextPacket service routine to extract the next request from the queue, and it must be called at DISPATCH_LEVEL. It
might call your StartIo routine before returning. Notice how neatly the IRQL requirements dovetail here: queue access, the call
to IoStartNextPacket, and the possible call to StartIo are all required to occur at DISPATCH_LEVEL, and that’s the level at
which the system calls the DPC routine.
Although it’s possible for you to explicitly control IRQL (and I’ll explain how in the next section), there’s seldom any reason
to do so because of the correspondence between your needs and the level at which the system calls you. Consequently, you
don’t need to get hung up on which IRQL you’re executing at from moment to moment: it’s almost surely the correct level for
the work you’re supposed to do right then.
KIRQL oldirql;
ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
KeRaiseIrql(DISPATCH_LEVEL, &oldirql);
...
KeLowerIrql(oldirql);
1. KIRQL is the typedef name for an integer that holds an IRQL value. We’ll need a variable to hold the current IRQL, so
we declare it this way.
2. This ASSERT expresses a necessary condition for calling KeRaiseIrql: the new IRQL must be greater than or equal to the
current level. If this relation isn’t true, KeRaiseIrql will bugcheck (that is, report a fatal error via a blue screen of death).
3. KeRaiseIrql raises the current IRQL to the level specified by the first argument. It also saves the current IRQL at the
location pointed to by the second argument. In this example, we’re raising IRQL to DISPATCH_LEVEL and saving the
current level in oldirql.
4. After executing whatever code we desired to execute at elevated IRQL, we lower the request level back to its previous
value by calling KeLowerIrql and specifying the oldirql value previously returned by KeRaiseIrql.
After raising the IRQL, you should eventually restore it to the original value. Otherwise, various assumptions made by code
you call later or by the code that called you can later turn out to be incorrect. The DDK documentation says that you must
always call KeLowerIrql with the same value as that returned by the immediately preceding call to KeRaiseIrql, but this
information isn’t exactly right. The only rule that KeLowerIrql actually applies is that the new IRQL must be less than or equal
to the current one. You can lower the IRQL in steps if you want to.
It’s a mistake (and a big one!) to lower IRQL below whatever it was when a system routine called your driver, even if you raise
it back before returning. Such a break in synchronization might allow some activity to preempt you and interfere with a data
object that your caller assumed would remain inviolate.
4.3 Spin Locks
CAUTION
You can certainly avoid the deadlock that occurs when a CPU tries to acquire a spin lock it already owns by
following this rule: make sure that the subroutine that claims the lock releases it and never tries to claim it
twice, and then don’t call any other subroutine while you own the lock. There’s no policeman in the operating
system to ensure you don’t call other subroutines—it’s just an engineering rule of thumb that will help you avoid
an inadvertent mistake. The danger you’re guarding against is that you (or some maintenance programmer
who follows in your footsteps) might forget that you’ve already claimed a certain spin lock. I’ll tell you about an
ugly exception to this salutary rule in Chapter 5, when I discuss IRP cancel routines.
In addition, acquiring a spin lock raises the IRQL to DISPATCH_LEVEL automatically. Consequently, code that acquires a lock
must be in nonpaged memory and must not block the thread in which it runs. (There is an exception in Windows XP and later
systems. KeAcquireInterruptSpinLock raises the IRQL to the DIRQL for an interrupt and claims the spin lock associated with
the interrupt.)
As an obvious corollary of the previous fact, you can request a spin lock only when you’re running at or below
DISPATCH_LEVEL. Internally, the kernel is able to acquire spin locks at an IRQL higher than DISPATCH_LEVEL, but you
and I are unable to accomplish that feat.
Another fact about spin locks is that very little useful work occurs on a CPU that’s waiting for a spin lock. The spinning
happens at DISPATCH_LEVEL with interrupts enabled, so a CPU that’s waiting for a spin lock can service hardware interrupts.
But to avoid harming performance, you need to minimize the amount of work you do while holding a spin lock that some other
CPU is likely to want.
Two CPUs can simultaneously hold two different spin locks, by the way. This arrangement makes sense: you associate a spin
lock with a certain shared resource, or some collection of shared resources. There’s no reason to hold up processing related to
different resources protected by different spin locks.
As it happens, there are separate uniprocessor and multiprocessor kernels. The Windows XP setup program decides which
kernel to install after inspecting the computer. The multiprocessor kernel implements spin locks as I’ve just described. The
uniprocessor kernel realizes, however, that another CPU can’t be in the picture, so it implements spin locks a bit more simply.
On a uniprocessor system, acquiring a spin lock raises the IRQL to DISPATCH_LEVEL and does nothing else. Do you see how
you still get the synchronization benefit from claiming the so-called lock in this case? For some piece of code to attempt to
claim the same spin lock (or any other spin lock, actually, but that’s not the point here), it would have to be running at or below
DISPATCH_LEVEL, because that’s the only range of IRQLs at which you’re allowed to request a lock. But we already know that’s
impossible: once you’re at DISPATCH_LEVEL, you can’t be interrupted by any other activity that would run at the
same or a lower IRQL. Q., as we used to say in my high school geometry class, E.D.
KSPIN_LOCK QLock;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;
NTSTATUS AddDevice(...)
{
...
KeInitializeSpinLock(&pdx->QLock);
...
}
Elsewhere in your driver, say in the dispatch function for some type of IRP, you can claim (and quickly release) the lock
around some queue manipulation that you need to perform. Note that this function must be in nonpaged memory because it
executes for a period of time at an elevated IRQL.
NTSTATUS DispatchSomething(...)
{
KIRQL oldirql;
PDEVICE_EXTENSION pdx = ...;
KeAcquireSpinLock(&pdx->QLock, &oldirql);
KeReleaseSpinLock(&pdx->QLock, oldirql);
}
1. When KeAcquireSpinLock acquires the spin lock, it also raises IRQL to DISPATCH_LEVEL and returns the current (that
is, preacquisition) level in the variable to which the second argument points.
2. When KeReleaseSpinLock releases the spin lock, it also lowers IRQL back to the value specified in the second argument.
If you know you’re already executing at DISPATCH_LEVEL, you can save a little time by calling two special routines. This
technique is appropriate, for example, in DPC, StartIo, and other driver routines that execute at DISPATCH_LEVEL:
KeAcquireSpinLockAtDpcLevel(&pdx->QLock);
KeReleaseSpinLockFromDpcLevel(&pdx->QLock);
Windows XP also supports a more efficient variant called the in-stack queued spin lock, which you use this way:
KLOCK_QUEUE_HANDLE qh;
KeAcquireInStackQueuedSpinLock(&pdx->QLock, &qh);
KeReleaseInStackQueuedSpinLock(&qh);
1. The KLOCK_QUEUE_HANDLE structure is opaque—you’re not supposed to know what it contains, but you do have to
reserve storage for it. The best way to do that is to define an automatic variable (hence the in-stack part of the name).
2. Call KeAcquireInStackQueuedSpinLock instead of KeAcquireSpinLock to acquire the lock, and supply the address of the
KLOCK_QUEUE_HANDLE object as the second argument.
3. Call KeReleaseInStackQueuedSpinLock instead of KeReleaseSpinLock to release the lock.
The reason an in-stack queued spin lock is more efficient relates to the performance impact of a standard spin lock. With a
standard spin lock, each CPU that is contending for ownership constantly modifies the same memory location. Each
modification requires every contending CPU to reload the same dirty cache line. A queued spin lock, introduced for internal
use in Windows 2000, avoids this adverse effect by cleverly using interlocked exchange and compare-exchange operations to
track users and waiters for a lock. A waiting CPU continually reads (but does not write) a unique memory location. A CPU that
releases a lock alters the memory variable on which the next waiter is spinning.
Internal queued spin locks can’t be directly used by driver code because they rely on a fixed-size table of lock pointers to
which drivers don’t have access. Windows XP added the in-stack queued spin lock, which relies on an automatic variable
instead of the fixed-size table.
In addition to the two routines I showed you for acquiring and releasing this new kind of spin lock, you can also use two other
routines if you know you’re already executing at DISPATCH_LEVEL: KeAcquireInStackQueuedSpinLockAtDpcLevel and
KeReleaseInStackQueuedSpinLockFromDpcLevel. (Try spelling those names three times fast!)
NOTE
Because Windows versions earlier than XP don’t support the in-stack queued spin lock or interrupt spin lock
routines, you can’t directly call them in a driver intended to be binary portable between versions. The SPINLOCK
sample driver shows how to make a run-time decision to use the newer spin locks under XP and the old spin
locks otherwise.
4.4 Kernel Dispatcher Objects
NOTE
Microsoft uses the term highest-level driver primarily to distinguish between file system drivers and the storage
device drivers they call to do actual I/O. The file system driver is “highest level,” while the storage driver is not.
It would be easy to confuse this concept with the layering of WDM drivers, but it’s not the same. The way I think
of things is that all the WDM drivers for a given piece of hardware, including all the filter drivers, the function
driver, and the bus driver, are collectively either “highest level” or not. A filter driver has no business queuing
an IRP that, but for the intervention of the filter, would have flowed down the stack in the original thread
context. So if the thread context was nonarbitrary when the IRP got to the topmost filter device object (FiDO),
it should still be nonarbitrary in every lower dispatch routine.
Also recall from the discussion earlier in this chapter that you must not block a thread if you’re executing at or above
DISPATCH_LEVEL.
Having recalled these facts about thread context and IRQL, we can state a simple rule about when it’s OK to block a thread:
Block only the thread that originated the request you’re working on, and only when executing at IRQL strictly less than
DISPATCH_LEVEL.
Several of the dispatcher objects, and the so-called Executive Fast Mutex I’ll discuss later in this chapter, offer “mutual
exclusion” functionality. That is, they permit one thread to access a given shared resource without interference from other
threads. This is pretty much what a spin lock does, so you might wonder how to choose between synchronization methods. In
general, I think you should prefer to synchronize below DISPATCH_LEVEL if you can because that strategy allows a thread
that owns a mutual exclusion lock to cause page faults and to be preempted by other threads if the thread continues to hold the
lock for a long time. In addition, this strategy allows other CPUs to continue doing useful work, even though threads have
blocked on those CPUs to acquire the same lock. If any of the code that accesses a shared resource can run at
DISPATCH_LEVEL, though, you must use a spin lock because the DISPATCH_LEVEL code might interrupt code running at
lower IRQL.
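The service function in question is KeWaitForSingleObject. In sketch form, using the parameter names discussed below (the timeout argument can also be NULL):

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
NTSTATUS status = KeWaitForSingleObject(object, WaitReason, WaitMode, Alertable, &timeout);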
As suggested by the ASSERT, you must be executing at or below DISPATCH_LEVEL to even call this service routine.
In this call, object points to the object you want to wait on. Although this argument is typed as a PVOID, it should be a pointer
to one of the dispatcher objects listed in Table 4-1. The object must be in nonpaged memory—for example, in a device
extension structure or other data area allocated from the nonpaged pool. For most purposes, the execution stack can be
considered nonpaged.
WaitReason is a purely advisory value chosen from the KWAIT_REASON enumeration. No code in the kernel actually cares
what value you supply here, so long as you don’t specify WrQueue. (Internally, scheduler code bases some decisions on
whether a thread is currently blocked for this “reason.”) The reason a thread is blocked is saved in an opaque data structure,
though. If you knew more about that data structure and were trying to debug a deadlock of some kind, you could perhaps gain
clues from the reason code. The bottom line: always specify Executive for this parameter; there’s no reason to say anything
else.
WaitMode is one of the two values of the MODE enumeration: KernelMode or UserMode. Alertable is a simple Boolean value.
Unlike WaitReason, these parameters do make a difference in the way the system behaves by controlling whether the wait can
be terminated early in order to deliver asynchronous procedure calls of various kinds. I’ll explain these interactions in more
detail in “Thread Alerts and APCs” later in this chapter. Waiting in user mode also authorizes the Memory Manager to swap
your thread’s kernel-mode stack out. You’ll see examples in this book and elsewhere where drivers create event objects, for
instance, as automatic variables. A bug check would result if some other thread were to call KeSetEvent at elevated IRQL at a
time when the event object was absent from memory. The bottom line: you should probably always wait in KernelMode and
specify FALSE for the Alertable parameter.
The last parameter to KeWaitForSingleObject is the address of a 64-bit timeout value, expressed in 100-nanosecond units. A
positive number for the timeout is an absolute timestamp relative to the January 1, 1601, epoch of the system clock. You can
determine the current time by calling KeQuerySystemTime, and you can add a constant to that value. A negative number is an
interval relative to the current time. If you specify an absolute time, a subsequent change to the system clock alters the duration
of the timeout you might experience. That is, the timeout doesn’t expire until the system clock equals or exceeds whatever
absolute value you specify. In contrast, if you specify a relative timeout, the duration of the timeout you experience is
unaffected by changes in the system clock.
Specifying a zero timeout causes KeWaitForSingleObject to return immediately with a status code indicating whether the
object is in the signaled state. If you’re executing at DISPATCH_LEVEL, you must specify a zero timeout because blocking is
not allowed. Each kernel dispatcher object offers a KeReadStateXxx service function that allows you to determine the state of
the object. Reading the state isn’t completely equivalent to waiting for zero time, however: when KeWaitForSingleObject
discovers that the wait is satisfied, it performs the side effects that the particular object requires. In contrast, reading the state of
the object doesn’t perform the operations, even if the object is already signaled and a wait would be satisfied if it were
requested right now.
Specifying a NULL pointer for the timeout parameter is OK and indicates an infinite wait.
The return value indicates one of several possible results. STATUS_SUCCESS is the result you expect and indicates that the
wait was satisfied. That is, either the object was in the signaled state when you made the call to KeWaitForSingleObject or else
the object was in the not-signaled state and later became signaled. When the wait is satisfied in this way, operations might need
to be performed on the object. The nature of these operations depends on the type of the object, and I’ll explain them later in
this chapter in connection with discussing each type of object. (For example, a synchronization type of event will be reset after
your wait is satisfied.)
The return value STATUS_TIMEOUT indicates that the specified timeout occurred without the object reaching the signaled
state. If you specify a zero timeout, KeWaitForSingleObject returns immediately with either this code (indicating that the
object is not-signaled) or STATUS_SUCCESS (indicating that the object is signaled). This return value isn’t possible if you
specify a NULL timeout parameter pointer because you thereby request an infinite wait.
Two other return values are possible. STATUS_ALERTED and STATUS_USER_APC mean that the wait has terminated without
the object having been signaled because the thread has received an alert or a user-mode APC, respectively. I’ll discuss these
concepts a bit further on in “Thread Alerts and APCs.”
Note that STATUS_TIMEOUT, STATUS_ALERTED, and STATUS_USER_APC all pass the NT_SUCCESS test. Therefore,
don’t simply use NT_SUCCESS on the return code from KeWaitForSingleObject in the expectation that it will distinguish
between cases in which the object was signaled and cases in which the object was not signaled.
The other circumstance in which you can get the bogus return occurs if the thread you’re trying to block is
already blocked. How, you might well ask, could you be executing in the context of a thread that’s really
blocked? This situation happens in Windows 98/Me when someone blocks on a VxD-level object with the
BLOCK_SVC_INTS flag and the system later calls a function in your driver at what’s called event time. You can
nominally be in the context of the blocked thread, and you simply cannot block a second time on a WDM object.
In fact, I’ve even seen KeWaitForSingleObject return with the IRQL raised to DISPATCH_LEVEL in this
circumstance. As far as I know, there’s no workaround for the problem. Thankfully, it seems to occur only with
drivers for serial devices, in which there’s a crossover between VxD and WDM code.
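To wait on several objects at once, you call KeWaitForMultipleObjects instead; a sketch using the parameter names described below:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
NTSTATUS status = KeWaitForMultipleObjects(count, objects, WaitType, WaitReason,
  WaitMode, Alertable, &timeout, waitblocks);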
Here objects is the address of an array of pointers to dispatcher objects, and count is the number of pointers in the array. The
count must be less than or equal to the value MAXIMUM_WAIT_OBJECTS, which currently equals 64. The array, as well as
each of the objects to which the elements of the array point, must be in nonpaged memory. WaitType is one of the enumeration
values WaitAll or WaitAny and specifies whether you want to wait until all of the objects are simultaneously in the signaled
state or whether, instead, you want to wait until any one of the objects is signaled.
The waitblocks argument points to an array of KWAIT_BLOCK structures that the kernel will use to administer the wait
operation. You don’t need to initialize these structures in any way—the kernel just needs to know where the storage is for the
group of wait blocks that it will use to record the status of each of the objects during the wait. If you’re waiting for a small
number of objects (specifically, a number no bigger than THREAD_WAIT_OBJECTS, which currently equals 3), you can
supply NULL for this parameter. If you supply NULL, KeWaitForMultipleObjects uses a preallocated array of wait blocks that
lives in the thread object. If you’re waiting for more objects than this, you must provide nonpaged memory that’s at least count
* sizeof(KWAIT_BLOCK) bytes in length.
The remaining arguments to KeWaitForMultipleObjects are the same as the corresponding arguments to
KeWaitForSingleObject, and most return codes have the same meaning.
If you specify WaitAll, the return value STATUS_SUCCESS indicates that all the objects managed to reach the signaled state
simultaneously. If you specify WaitAny, the return value is numerically equal to the objects array index of the single object that
satisfied the wait. If more than one of the objects happens to be signaled, you’ll be told about one of them—maybe the
lowest-numbered of all the ones that are signaled at that moment, but maybe some other one. You can think of this value as
STATUS_WAIT_0 plus the array index. You can’t simply perform the usual NT_SUCCESS test of the returned status before
extracting the array index from the status code, though, because other possible return codes (including STATUS_TIMEOUT,
STATUS_ALERTED, and STATUS_USER_APC) would also pass the test. Use code like this:
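In sketch form (iSignalled is just an illustrative name):

NTSTATUS status = KeWaitForMultipleObjects(count, objects, WaitAny, Executive,
  KernelMode, FALSE, NULL, waitblocks);
if (status >= STATUS_WAIT_0 && status < STATUS_WAIT_0 + count)
  {
  ULONG iSignalled = (ULONG) (status - STATUS_WAIT_0);
  ...                       // objects[iSignalled] is the one that satisfied the wait
  }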
When KeWaitForMultipleObjects returns a status code equal to an object’s array index in a WaitAny case, it also performs the
operations required by that object. If more than one object is signaled and you specified WaitAny, the operations are performed
only for the one that’s deemed to satisfy the wait and whose index is returned. That object isn’t necessarily the first one in your
array that happens to be signaled.
ASSERT(KeGetCurrentIrql() == PASSIVE_LEVEL);
KeInitializeEvent(event, EventType, initialstate);
Event is the address of the event object. EventType is one of the enumeration values NotificationEvent and
SynchronizationEvent. A notification event has the characteristic that, when it is set to the signaled state, it stays signaled until
it’s explicitly reset to the not-signaled state. Furthermore, all threads that wait on a notification event are released when the
event is signaled. This is like a manual-reset event in user mode. A synchronization event, on the other hand, gets reset to the
not-signaled state as soon as a single thread gets released. This is what happens in user mode when someone calls SetEvent on
an auto-reset event object. The only operation performed on an event object by KeWaitXxx is to reset a synchronization event
to not-signaled. Finally, initialstate is TRUE to specify that the initial state of the event is to be signaled and FALSE to specify
that the initial state is to be not-signaled.
Table 4-2. Service Functions for Use with Kernel Event Objects
NOTE
In this series of sections on synchronization primitives, I’m repeating the IRQL restrictions that the DDK
documentation describes. In the current release of Microsoft Windows XP, the DDK is sometimes more
restrictive than the operating system actually is. For example, KeClearEvent can be called at any IRQL, not just
at or below DISPATCH_LEVEL. KeInitializeEvent can be called at any IRQL, not just at PASSIVE_LEVEL.
However, you should regard the statements in the DDK as being tantamount to saying that Microsoft might
someday impose the documented restriction, which is why I haven’t tried to report the true state of affairs.
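The function being described here is KeSetEvent; a sketch of the call:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
LONG wassignalled = KeSetEvent(event, boost, wait);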
As implied by the ASSERT, you must be running at or below DISPATCH_LEVEL to call this function. The event argument is a
pointer to the event object in question, and boost is a value to be added to a waiting thread’s priority if setting the event results
in satisfying someone’s wait. See the sidebar (“That Pesky Third Argument to KeSetEvent”) for an explanation of the Boolean
wait argument, which a WDM driver would almost never want to specify as TRUE. The return value is nonzero if the event
was already in the signaled state before the call and 0 if the event was in the not-signaled state.
A multitasking scheduler needs to artificially boost the priority of a thread that waits for I/O operations or synchronization
objects in order to avoid starving threads that spend lots of time waiting. This is because a thread that blocks for some reason
generally relinquishes its time slice and won’t regain the CPU until either it has a relatively higher priority than other eligible
threads or other threads that have the same priority finish their time slices. A thread that never blocks, however, gets to
complete its time slices. Unless a boost is applied to the thread that repeatedly blocks, therefore, it will spend a lot of time
waiting for CPU-bound threads to finish their time slices.
You and I won’t always have a good idea of what value to use for a priority boost. A good rule of thumb to follow is to specify
IO_NO_INCREMENT unless you have a good reason not to. If setting the event is going to wake up a thread that’s dealing
with a time-sensitive data flow (such as a sound driver), supply the boost that’s appropriate to that kind of device (such as
IO_SOUND_INCREMENT). The important thing is not to boost the waiter for a silly reason. For example, if you’re trying to
handle an IRP_MJ_PNP request synchronously—see Chapter 6—you’ll be waiting for lower-level drivers to handle the IRP
before you proceed, and your completion routine will be calling KeSetEvent. Since Plug and Play requests have no special
claim on the processor and occur only infrequently, specify IO_NO_INCREMENT, even for a sound card.
That Pesky Third Argument to KeSetEvent
The DDK has always sort of described what happens internally, but I’ve found the explanation confusing. I’ll try
to explain it in a different way so that you can see why you should always say FALSE for this parameter.
Internally, the kernel uses a dispatcher database lock to guard operations related to thread blocking, waking,
and scheduling. KeSetEvent needs to acquire this lock, and so do the KeWaitXxx routines. If you say TRUE for
the wait argument, KeSetEvent sets a flag so that KeWaitXxx will know you did so, and it returns to you without
releasing this lock. When you turn around and (immediately, please—you’re running at a higher IRQL than
every hardware device, and you own a spin lock that’s very frequently in contention) call KeWaitXxx, it needn’t
acquire the lock all over again. The net effect is that you’ll wake up the waiting thread and put yourself to sleep
without giving any other thread a chance to start running.
You can see, first of all, that a function that calls KeSetEvent with wait set to TRUE has to be in nonpaged
memory because it will execute briefly above DISPATCH_LEVEL. But it’s hard to imagine why an ordinary device
driver would even need to use this mechanism because it would almost never know better than the kernel which
thread ought to be scheduled next. The bottom line: always say FALSE for this parameter. In fact, it’s not clear
why the parameter has even been exposed to tempt us.
You can determine the current state of an event (at any IRQL) by calling KeReadStateEvent:
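A sketch of the call:

LONG signalled = KeReadStateEvent(event);    // nonzero means the event is signaled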
NOTE
KeReadStateEvent isn’t supported in Microsoft Windows 98/Me, even though the other KeReadStateXxx
functions described here are. The absence of support has to do with how events and other synchronization
primitives are implemented in Windows 98/Me.
You can determine the current state of an event and, immediately thereafter, place it in the not-signaled state by calling the
KeResetEvent function (at or below DISPATCH_LEVEL):
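That call looks like this:

LONG signalled = KeResetEvent(event);        // returns the previous state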
If you’re not interested in the previous state of the event, you can save a little time by calling KeClearEvent instead:
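The call is simply:

KeClearEvent(event);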
KeClearEvent is faster because it doesn’t need to capture the current state of the event before setting it to not-signaled. But
beware of calling KeClearEvent when another thread might be using the same event since there’s no good way to control the
races between you clearing the event and some other thread setting it or waiting on it.
KEVENT lock;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;
Enter your lightweight critical section by waiting on the event. Leave by setting the event.
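Here is a minimal sketch of the technique; it assumes the event was initialized as a signaled synchronization event (in AddDevice, say), which isn’t shown in the fragment above:

KeInitializeEvent(&pdx->lock, SynchronizationEvent, TRUE);   // "unowned" to begin with

// Enter the lightweight critical section:
KeWaitForSingleObject(&pdx->lock, Executive, KernelMode, FALSE, NULL);
...                                                          // touch the shared data
// Leave it again:
KeSetEvent(&pdx->lock, IO_NO_INCREMENT, FALSE);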
Use this trick only in a system thread, though, to prevent a user-mode call to NtSuspendThread from creating a
deadlock. (This deadlock can easily happen if a user-mode debugger is running on the same process.) If you’re running
in a user thread, you should prefer to use an executive fast mutex. Don’t use this trick at all for code that executes in the
paging path, as explained later in connection with the “unsafe” way of acquiring an executive fast mutex.
ASSERT(KeGetCurrentIrql() == PASSIVE_LEVEL);
KeInitializeSemaphore(semaphore, count, limit);
In this call, semaphore points to a KSEMAPHORE object in nonpaged memory. The count variable is the initial value of the
counter, and limit is the maximum value that the counter will be allowed to take on, which must be at least as large as the initial count.
A result from queuing theory says that whether customers wait in one common line or in separate lines for each server, the
average waiting time is the same in both cases, but the variation in waiting times is smaller with the single queue. (This is why queues in stores are
increasingly organized so that customers wait in a single line for the next available clerk.) This kind of semaphore allows you
to organize a set of software or hardware servers to take advantage of that theorem.
The owner (or one of the owners) of a semaphore releases its claim to the semaphore by calling KeReleaseSemaphore:
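The call has this shape (the parameter names match the discussion that follows):

LONG wassignalled = KeReleaseSemaphore(semaphore, boost, delta, wait);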
This operation adds delta, which must be positive, to the counter associated with semaphore, thereby putting the semaphore in
the signaled state and allowing other threads to be released. In most cases, you’ll specify 1 for this parameter to indicate that
one claimant of the semaphore is releasing its claim. The boost and wait parameters have the same import as the corresponding
parameters to KeSetEvent, discussed earlier. The return value is 0 if the previous state of the semaphore was not-signaled and
nonzero if the previous state was signaled.
KeReleaseSemaphore doesn’t allow you to increase the counter beyond the limit specified when you initialized the semaphore.
If you try, it doesn’t adjust the counter at all, and it raises an exception with the code
STATUS_SEMAPHORE_LIMIT_EXCEEDED. Unless someone has a structured exception handler to trap the exception, a bug
check will eventuate.
You can also interrogate the current state of a semaphore with this call:
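The call in question is KeReadStateSemaphore:

LONG signalled = KeReadStateSemaphore(semaphore);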
The return value is nonzero if the semaphore is signaled and 0 if the semaphore is not-signaled. You shouldn’t assume that the
return value is the current value of the counter—it could be any nonzero value if the counter is positive.
Having told you all this about how to use kernel semaphores, I feel I ought to tell you that I’ve never seen a driver that uses
one of them.
Table 4-4. Service Functions for Use with Kernel Mutex Objects
To create a mutex, you reserve nonpaged memory for a KMUTEX object and make the following initialization call:
ASSERT(KeGetCurrentIrql() == PASSIVE_LEVEL);
KeInitializeMutex(mutex, level);
where mutex is the address of the KMUTEX object, and level is a parameter originally intended to help avoid deadlocks when
your own code uses more than one mutex. Since the kernel currently ignores the level parameter, I’m not going to attempt to explain its original purpose.
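The query function whose return value is described next is KeReadStateMutex; a sketch of the call:

LONG state = KeReadStateMutex(mutex);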
The return value is 0 if the mutex is currently owned, nonzero if it’s currently unowned.
The thread that owns a mutex can release ownership and return the mutex to the signaled state with this function call:
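In sketch form:

LONG result = KeReleaseMutex(mutex, wait);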
The wait parameter means the same thing as the corresponding argument to KeSetEvent. The return value is always 0 to
indicate that the mutex was previously owned because, if this were not the case, KeReleaseMutex would have bugchecked (it
being an error for anyone but the owner to release a mutex).
Just for the sake of completeness, I want to mention a macro in the DDK named KeWaitForMutexObject. (See WDM.H.) It’s
defined simply as follows:
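The definition is a simple alias (this is essentially all there is to it in WDM.H):

#define KeWaitForMutexObject KeWaitForSingleObject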
Using this special name offers no benefit at all. You don’t even get the benefit of having the compiler insist that the first
argument be a pointer to a KMUTEX instead of any random pointer type.
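To use a kernel timer, you first set aside nonpaged storage for a KTIMER object and initialize it; a minimal sketch:

KTIMER timer;
KeInitializeTimer(&timer);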
At this point, the timer is in the not-signaled state and isn’t counting down—a wait on the timer would never be satisfied. To
start the timer counting, call KeSetTimer as follows:
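A sketch of the call (duetime is a LARGE_INTEGER, described next; the last argument is an optional DPC pointer, discussed later in this chapter):

BOOLEAN wascounting = KeSetTimer(&timer, duetime, NULL);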
The duetime value is a 64-bit time value expressed in 100-nanosecond units. If the value is positive, it’s an absolute time
relative to the same January 1, 1601, epoch used for the system timer. If the value is negative, it’s an interval relative to the
current time. If you specify an absolute time, a subsequent change to the system clock alters the duration of the timeout you
experience. That is, the timer doesn’t expire until the system clock equals or exceeds whatever absolute value you specify. In
contrast, if you specify a relative timeout, the duration of the timeout you experience is unaffected by changes in the system
clock. These are the same rules that apply to the timeout parameter to KeWaitXxx.
The return value from KeSetTimer, if TRUE, indicates that the timer was already counting down (in which case, our call to
KeSetTimer would have canceled it and started the count all over again).
At any time, you can determine the current state of a timer:
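That query is KeReadStateTimer:

BOOLEAN expired = KeReadStateTimer(&timer);   // TRUE once the timer has expired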
KeInitializeTimer and KeSetTimer are actually older service functions that have been superseded by newer functions. We could
have initialized the timer with this call:
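That is, with the extended initialization function, where NotificationTimer gives the same behavior as plain KeInitializeTimer:

KeInitializeTimerEx(&timer, NotificationTimer);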
We could also have used the extended version of the set timer function, KeSetTimerEx:
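A one-shot call through the extended function looks like this (the third argument is a period in milliseconds; zero means no periodic behavior):

KeSetTimerEx(&timer, duetime, 0, NULL);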
I’ll explain a bit further on in this chapter the purpose of the extra parameters in these extended versions of the service
functions.
Once the timer is counting down, it’s still considered to be not-signaled until the specified due time arrives. At that point, the
object becomes signaled, and all waiting threads are released. The system guarantees only that the expiration of the timer will
be noticed no sooner than the due time you specify. If you specify a due time with a precision finer than the granularity of the
system timer (which you can’t control), the timeout will be noticed later than the exact instant you specify. You can call
KeQueryTimeIncrement to determine the granularity of the system clock.
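To have a DPC routine run when a timer expires, you also initialize a KDPC object; a sketch using the names the next paragraph explains:

KDPC dpc;
KeInitializeDpc(&dpc, DpcRoutine, context);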
You can initialize the timer object by using either KeInitializeTimer or KeInitializeTimerEx, as you please. DpcRoutine is the
address of a deferred procedure call routine, which must be in nonpaged memory. The context parameter is an arbitrary 32-bit
value (typed as a PVOID) that will be passed as an argument to the DPC routine. The dpc argument is a pointer to a KDPC
object for which you provide nonpaged storage. (It might be in your device extension, for example.)
When we want to start the timer counting down, we specify the DPC object as one of the arguments to KeSetTimer or
KeSetTimerEx:
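That call looks like this:

KeSetTimer(&timer, duetime, &dpc);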
You could also use the extended form KeSetTimerEx if you wanted to. The only difference between this call and the one we
examined in the preceding section is that we’ve specified the DPC object address as an argument. When the timer expires, the
system will queue the DPC for execution as soon as conditions permit. This would be at least as soon as you’d be able to wake
up from a wait. Your DPC routine would have the following skeletal appearance:
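A minimal sketch (the routine name is illustrative; the last two arguments are system-supplied values that a timer DPC normally ignores):

VOID DpcRoutine(PKDPC dpc, PVOID context, PVOID sysarg1, PVOID sysarg2)
  {
  ...                       // work to do at DISPATCH_LEVEL when the timer expires
  }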
For what it’s worth, even when you supply a DPC argument to KeSetTimer or KeSetTimerEx, you can still call KeWaitXxx to
wait at PASSIVE_LEVEL or APC_LEVEL if you want. On a single-CPU system, the DPC would occur before the wait could
finish because it executes at a higher IRQL.
Synchronization Timers
Like event objects, timer objects come in both notification and synchronization flavors. A notification timer allows any number
of waiting threads to proceed once it expires. A synchronization timer, by contrast, allows only a single thread to proceed. Once
a thread’s wait is satisfied, the timer switches to the not-signaled state. To create a synchronization timer, you must use the
extended form of the initialization service function:
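For example:

KeInitializeTimerEx(&timer, SynchronizationTimer);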
SynchronizationTimer is one of the values of the TIMER_TYPE enumeration. The other value is NotificationTimer.
If you use a DPC with a synchronization timer, think of queuing the DPC as being an extra thing that happens when the timer
expires. That is, expiration puts the timer in the signaled state and queues a DPC. One thread can be released as a result of the
timer being signaled.
The only use I’ve ever found for a synchronization timer is when you want a periodic timer (see the next section).
Periodic Timers
So far, I’ve discussed only timers that expire exactly once. By using the extended set timer function, you can also request a
periodic timeout:
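The call has this shape:

BOOLEAN wascounting = KeSetTimerEx(&timer, duetime, period, dpc);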
Here period is a periodic timeout, expressed in milliseconds (ms), and dpc is an optional pointer to a KDPC object. A timer of
this kind expires once at the due time and periodically thereafter. To achieve exact periodic expiration, specify the same
relative due time as the interval. Specifying a zero due time causes the timer to immediately expire, whereupon the periodic
behavior takes over. It often makes sense to start a periodic timer in conjunction with a DPC object, by the way, because doing
so allows you to be notified without having to repeatedly wait for the timeout.
Be sure to call KeCancelTimer to cancel a periodic timer before the KTIMER object or the DPC routine disappears
from memory. It’s quite embarrassing to let the system unload your driver and, 10 nanoseconds later, call your
nonexistent DPC routine. Not only that, but it causes a bug check. These problems are so hard to debug that the Driver
Verifier makes a special check for releasing memory that contains an active KTIMER.
An Example
One use for kernel timers is to conduct a polling loop in a system thread dedicated to the task of repeatedly checking a device
for activity. Not many devices nowadays need to be served by a polling loop, but yours may be one of the few exceptions. I’ll
discuss this subject in Chapter 14, and the companion content includes a sample driver (POLLING) that illustrates all of the
concepts involved. Part of that sample is the following loop that polls the device at fixed intervals. The logic of the driver is
such that the loop can be broken by setting a kill event. Consequently, the driver uses KeWaitForMultipleObjects. The code is
actually a bit more complicated than the following fragment, which I’ve edited to concentrate on the part related to the timer:
KeInitializeTimerEx(&timer, SynchronizationTimer);
PVOID pollevents[] = {
  (PVOID) &pdx->evKill,
  (PVOID) &timer,
  };
C_ASSERT(arraysize(pollevents) <= THREAD_WAIT_OBJECTS);
LARGE_INTEGER duetime = {0};
KeSetTimerEx(&timer, duetime, 500, NULL);
while (TRUE)
  {
  status = KeWaitForMultipleObjects(arraysize(pollevents),
    pollevents, WaitAny, Executive, KernelMode, FALSE,
    NULL, NULL);
  if (status == STATUS_WAIT_0)
    break;                  // kill event set -- leave the loop
  ...                       // poll the device here
  }
KeCancelTimer(&timer);
1. Here we initialize a kernel timer. You must specify a SynchronizationTimer here, because a NotificationTimer stays in the
signaled state after the first expiration.
2. We’ll need to supply an array of dispatcher object pointers as one of the arguments to KeWaitForMultipleObjects, and
this is where we set that up. The first element of the array is the kill event that some other part of the driver might set
when it’s time for this system thread to exit. The second element is the timer object. The C_ASSERT statement that
follows this array verifies that we have few enough objects in our array that we can implicitly use the default array of
wait blocks in our thread object.
3. The KeSetTimerEx statement starts a periodic timer running. The duetime is 0, so the timer goes immediately into the
signaled state. It will expire every 500 ms thereafter.
4. Within our polling loop, we wait for the timer to expire or for the kill event to be set. If the wait terminates because of the
kill event, we leave the loop, clean up, and exit this system thread. If the wait terminates because the timer has expired,
we go on to the next step.
5. This is where our device driver would do something related to our hardware.
ASSERT(KeGetCurrentIrql() == PASSIVE_LEVEL);
LARGE_INTEGER duetime;
NTSTATUS status = KeDelayExecutionThread(WaitMode, Alertable, &duetime);
Here WaitMode, Alertable, and the returned status code have the same meaning as the corresponding parameters to KeWaitXxx,
and duetime is the same kind of timestamp that I discussed previously in connection with kernel timers. Note that this function
requires a pointer to a large integer for the timeout parameter, whereas other functions related to timers require the large
integer itself.
If your requirement is to delay for a very brief period of time (less than 50 microseconds), you can call
KeStallExecutionProcessor at any IRQL:
KeStallExecutionProcessor(nMicroSeconds);
The purpose of this delay is to allow your hardware time to prepare for its next operation before your program continues
executing. The delay might end up being significantly longer than you request because KeStallExecutionProcessor can be
preempted by activities that occur at a higher IRQL than the caller’s.
The kernel offers the PsCreateSystemThread and PsTerminateSystemThread service functions for creating and
controlling system threads. I’ll be discussing these routines later on in Chapter 14 from the perspective of how you can use
these functions to help you manage a device that requires periodic polling. For the sake of thoroughness, I want to mention
here that you can use a pointer to a kernel thread object in a call to KeWaitXxx to wait for the thread to complete. The thread
terminates itself by calling PsTerminateSystemThread.
Before you can wait for a thread to terminate, you need to first obtain a pointer to the opaque KTHREAD object that internally
represents that thread, which poses a bit of a problem. While running in the context of a thread, you can determine your own
KTHREAD easily:
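Namely:

PKTHREAD thread = KeGetCurrentThread();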
Unfortunately, when you call PsCreateSystemThread to create a new thread, you get back only an opaque HANDLE for the
thread. To get the KTHREAD pointer, you use an Object Manager service function:
HANDLE hthread;
PKTHREAD thread;
PsCreateSystemThread(&hthread, ...);
ObReferenceObjectByHandle(hthread, THREAD_ALL_ACCESS,
NULL, KernelMode, (PVOID*) &thread, NULL);
ZwClose(hthread);
ObReferenceObjectByHandle converts your handle to a pointer to the underlying kernel object. Once you have the pointer, you
can discard the handle by calling ZwClose. At some point, you need to release your reference to the thread object by making a
call to ObDereferenceObject:
ObDereferenceObject(thread);
APC_LEVEL (or at PASSIVE_LEVEL, but we’re concerned only with APC_LEVEL right now). An APC_LEVEL thread can
also be interrupted by any hardware device, following which a higher-priority thread might become eligible to run. In either
situation, the thread scheduler can then give control of the CPU to another thread, which might be running at PASSIVE_LEVEL
or APC_LEVEL. In effect, the IRQL levels PASSIVE_LEVEL and APC_LEVEL pertain to a thread, whereas the higher IRQLs
pertain to a CPU.
Kernel Threads
Sometimes you’ll create your own kernel-mode thread—when your device needs to be polled periodically, for example. In this
scenario, any waits performed will be in kernel mode because the thread runs exclusively in kernel mode.
NOTE
The bottom line: perform nonalertable waits unless you know you shouldn’t.
4.5 Other Kernel-Mode Synchronization Primitives
Table 4-6. Service Functions for Use with Executive Fast Mutexes
Compared with kernel mutexes, fast mutexes have the strengths and weaknesses summarized in Table 4-7. On the plus side, a
fast mutex is much faster to acquire and release if there’s no actual contention for it. On the minus side, a thread that acquires a
fast mutex will not be able to receive certain types of asynchronous procedure call, depending on exactly which functions you
call, and this constrains how you send IRPs to other drivers.
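To create a fast mutex, you reserve nonpaged storage for a FAST_MUTEX object and initialize it with ExInitializeFastMutex; a sketch of the call:

ExInitializeFastMutex(FastMutex);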
where FastMutex is the address of your FAST_MUTEX object. The mutex begins life in the unowned state. To acquire
ownership later on, call either ExAcquireFastMutex or ExAcquireFastMutexUnsafe.
The first of these functions waits for the mutex to become available, assigns ownership to the calling thread, and then raises the
current processor IRQL to APC_LEVEL. Raising the IRQL has the effect of blocking delivery of all APCs. The second of these
functions doesn’t change the IRQL.
You need to think about potential deadlocks if you use the “unsafe” function to acquire a fast mutex. A situation to
avoid is allowing user-mode code to suspend a thread in which you hold a mutex. That would deadlock other threads
that need the mutex. For this reason, the DDK recommends (and the Driver Verifier requires) that you forestall the
delivery of user-mode and normal kernel-mode APCs either by raising the IRQL to APC_LEVEL or by calling
KeEnterCriticalRegion before ExAcquireFastMutexUnsafe. (Thread suspension involves an APC, so user-mode code
can’t suspend your thread if you disallow user-mode APCs. Yes, I know the reasoning here is a bit of a stretch!)
Another possible deadlock can arise with a driver in the paging path—in other words, a driver that gets called to help the
memory manager process a page fault. Suppose you simply call KeEnterCriticalRegion and then ExAcquireFastMutexUnsafe.
Now suppose the system tries to execute a special kernel-mode APC in the same thread, which is possible because
KeEnterCriticalRegion doesn’t forestall special kernel APCs. The APC routine might page fault, which might then lead to you
being reentered and deadlocking on a second attempt to claim the same mutex. You avoid this situation by raising IRQL to
APC_LEVEL before acquiring the mutex in the first place or, more simply, by using KeAcquireFastMutex instead of
KeAcquireFastMutexUnsafe. The same problem can arise if you use a regular KMUTEX or a synchronization event, of course.
IMPORTANT
If you use ExAcquireFastMutex, you will be at APC_LEVEL. This means you can’t create any synchronous IRPs.
(The routines that do this must be called at PASSIVE_LEVEL.) Furthermore, you’ll deadlock if you try to wait for
a synchronous IRP to complete (because completion requires executing an APC, which can’t happen because of
the IRQL). In Chapter 5, I’ll discuss how to use asynchronous IRPs to work around this problem.
If you don’t want to wait if the mutex isn’t immediately available, use the “try to acquire” function:
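BOOLEAN acquired = ExTryToAcquireFastMutex(FastMutex);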
If the return value is TRUE, you now own the mutex. If it’s FALSE, someone else owns the mutex and has prevented you from
acquiring it.
To release control of a fast mutex and allow some other thread to claim it, call the release function corresponding to the way
you acquired the fast mutex:
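ExReleaseFastMutex(FastMutex);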
or
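ExReleaseFastMutexUnsafe(FastMutex);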
A fast mutex is fast because the acquisition and release steps are optimized for the usual case in which there’s no contention for
the mutex. The critical step in acquiring the mutex is to atomically decrement and test an integer counter that indicates how
many threads either own or are waiting for the mutex. If the test indicates that no other thread owns the mutex, no additional
work is required. If the test indicates that another thread does own the mutex, the current thread blocks on a synchronization
event that’s part of the FAST_MUTEX object. Releasing the mutex entails atomically incrementing and testing the counter. If
the test indicates that no thread is currently waiting, no additional work is required. If another thread is waiting, however, the
owner calls KeSetEvent to release one of the waiters.
Suppose there are two synchronization objects, A and B. It doesn’t matter what types of objects these are, and
they needn’t even be the same type. Now suppose we have two subroutines—I’ll call them Fred and Barney just
so I have names to work with. Subroutine Fred claims object A followed by object B. Subroutine Barney claims
B followed by A. This sets up a potential deadlock if Fred and Barney can be simultaneously active or if a thread
running one of those routines can be preempted by a thread running the other routine.
The deadlock arises, as you probably remember from studying this sort of thing in school, when two threads
manage to execute Fred and Barney at about the same time. The Fred thread gets object A, while the Barney
thread gets object B. Fred now tries to get object B, but can’t have it (Barney has it). Barney, on the other hand,
now tries to get object A, but can’t have it (Fred has it). Both threads are now deadlocked, waiting for the other
one to release the object each needs.
The easiest way to prevent this kind of deadlock is to always acquire objects such as A and B in the same order,
everywhere. The order in which you decide to acquire a set of resources is called the locking hierarchy. There
are other schemes, which involve conditional attempts to acquire resources combined with back-out loops, but
these are much harder to implement.
If you engage the Deadlock Detection option, the Driver Verifier will look for potential deadlocks resulting from
locking hierarchy violations involving spin locks, kernel mutexes, and executive fast mutexes.
The DDK documents another synchronization primitive that I didn’t discuss in this chapter: an ERESOURCE. File
system drivers use ERESOURCE objects extensively because they allow for shared and exclusive ownership.
Because file system drivers often have to use complex locking logic, the Driver Verifier doesn’t check the locking
hierarchy for an ERESOURCE.
InterlockedXxx Functions
InterlockedIncrement adds 1 to a long integer in memory and returns the postincrement value to you:
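LONG result = InterlockedIncrement(pLong);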
where pLong is the address of a variable typed as a LONG (that is, a long integer). Conceptually, the operation of the function
is equivalent to the statement return ++*pLong in C, but the implementation differs from that simple statement in order to
provide thread safety and multiprocessor safety. InterlockedIncrement guarantees that the integer is successfully incremented
even if code on other CPUs or in other eligible threads on the same CPU is simultaneously trying to alter the same variable. In
the nature of the operation, InterlockedIncrement cannot guarantee that the value it returns is still the value of the variable even
one machine cycle later because other threads or CPUs will be able to modify the variable as soon as the atomic increment
operation completes.
InterlockedDecrement is similar to InterlockedIncrement, but it subtracts 1 from the target variable and returns the
postdecrement value, just like the C statement return --*pLong but with thread safety and multiprocessor safety.
LONG target;
LONG result = InterlockedCompareExchange(&target, newval, oldval);
Here target is a long integer used both as input and output to the function, oldval is your guess about the current contents of the
target, and newval is the new value that you want installed in the target if your guess is correct. The function performs an
operation similar to that indicated in the following C code but does so via an atomic operation that’s both thread-safe and
multiprocessor-safe:
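LONG CompareExchange(PLONG ptarget, LONG newval, LONG oldval)
{                                   // conceptual equivalent only; the real routine is atomic
    LONG previous = *ptarget;
    if (previous == oldval)
        *ptarget = newval;          // install the new value only if the guess was right
    return previous;
}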
In other words, the function always returns the previous value of the target variable to you. In addition, if that previous value
equals oldval, it sets the target equal to the newval you specify. The function uses an atomic operation to do the compare and
exchange so that the replacement happens only if you’re correct in your guess about the previous contents.
You can also call the InterlockedCompareExchangePointer function to perform a similar sort of compare-and-exchange
operation with a pointer. This function is defined either as a compiler-intrinsic (that is, a function for which the compiler
supplies an inline implementation) or a real function call, depending on how wide pointers are on the platform for which
you’re compiling and on the ability of the compiler to generate inline code.
The last function in this class is InterlockedExchange, which simply uses an atomic operation to replace the value of an integer
variable and to return the previous value:
LONG value;
LONG oldval = InterlockedExchange(&value, newval);
As you might have guessed, there’s also an InterlockedExchangePointer that exchanges a pointer value (64-bit or 32-bit,
depending on the platform). Be sure to cast the target of the exchange operation to avoid a compiler error when building 64-bit
drivers:
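// A sketch; CurrentIrp stands for whatever pointer field you're exchanging
PVOID previous = InterlockedExchangePointer((PVOID*) &pdx->CurrentIrp, NULL);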
InterlockedOr, InterlockedAnd and InterlockedXor are new with the XP DDK. You can use them in drivers that will run on
earlier Windows versions because they’re actually implemented as compiler-intrinsic functions.
if (InterlockedExchange(&lock, 42) == 0)
{
sharedthing++;
lock = 0; // <== don't do this
}
This code will work fine on an Intel x86 computer, where every CPU sees memory writes in the same order. On another type
of CPU, though, there could be a problem. One CPU might actually change the memory variable lock to 0 before updating
memory for the increment statement. That behavior could allow two CPUs to simultaneously access sharedthing. This problem
could happen because of the way the CPU performs operations in parallel or because of quirks in the memory controller.
Consequently, you should rework the code to use an interlocked operation for both changes to lock:
if (InterlockedExchange(&lock, 42) == 0)
{
sharedthing++;
InterlockedExchange(&lock, 0);
}
ExInterlockedXxx Functions
Each of the ExInterlockedXxx functions requires that you create and initialize a spin lock before you call it. Note that the
operands of these functions must all be in nonpaged memory because the functions operate on the data at elevated IRQL.
ExInterlockedAddLargeInteger adds two 64-bit integers and returns the previous value
of the target:
LARGE_INTEGER value, increment;
KSPIN_LOCK spinlock;
LARGE_INTEGER prev = ExInterlockedAddLargeInteger(&value, increment, &spinlock);
Value is the target of the addition and one of the operands. Increment is an integer operand that’s added to the target. Spinlock
is a spin lock that you previously initialized. The return value is the target’s value before the addition. In other words, the
operation of this function is similar to the following function except that it occurs under protection of the spin lock:
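__int64 AddLargeInteger(__int64* pvalue, __int64 increment)
{                                   // conceptual equivalent; the real routine holds the spin lock
    __int64 prev = *pvalue;
    *pvalue += increment;
    return prev;                    // the preaddition value
}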
Note that the return value is the preaddition value, which contrasts with the postincrement return from InterlockedExchange
and similar functions. (Also, not all compilers support the __int64 integer data type, and not all computers can perform a 64-bit
addition operation using atomic instructions.)
ExInterlockedAddUlong is analogous to ExInterlockedAddLargeInteger except that it works with 32-bit unsigned integers:
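ULONG value, increment;
KSPIN_LOCK spinlock;                // previously initialized with KeInitializeSpinLock
ULONG prev = ExInterlockedAddUlong(&value, increment, &spinlock);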
This function likewise returns the preaddition value of the target of the operation.
ExInterlockedAddLargeStatistic is similar to ExInterlockedAddUlong in that it adds a 32-bit value to a 64-bit value:
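LARGE_INTEGER total;
ULONG increment;
ExInterlockedAddLargeStatistic(&total, increment);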
This new function is faster than ExInterlockedAddUlong because it doesn’t need to return the preincrement value of the
Addend variable. It therefore doesn’t need to employ a spin lock for synchronization. The atomicity provided by this function
is, however, only with respect to other callers of the same function. In other words, if you had code on one CPU calling
ExInterlockedAddLargeStatistic at the same time as code on another CPU was accessing the Addend variable for either reading
or writing, you could get inconsistent results. I can explain why this is so by showing you this paraphrase of the Intel x86
implementation of the function (not the actual source code):
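; a sketch of the idea, not the actual source code:
mov eax, Increment
lock add [Addend], eax      ; atomic add to the low-order 32 bits
lock adc [Addend+4], 0      ; atomic propagation of any carry into the high-order 32 bits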
This code works correctly for purposes of incrementing the Addend because the lock prefixes guarantee atomicity of each
addition operation and because no carries from the low-order 32 bits can ever get lost. The instantaneous value of the 64-bit
Addend isn’t always consistent, however, because an incrementer might be poised between the ADD and the ADC just at the
instant someone makes a copy of the complete 64-bit value. Therefore, even a caller of ExInterlockedCompareExchange64 on
another CPU could obtain an inconsistent value.
thread-safe and multiprocessor-safe pushdown stack, you should use an S-List. In both cases, to achieve thread safety and
multiprocessor safety, you will allocate and initialize a spin lock. The S-List might not actually use the spin lock, however,
because the presence of a sequence number might allow the kernel to implement it using just atomic compare-exchange sorts
of operations.
The support functions for performing interlocked access to list objects are similar, so I’ve organized this section along
functional lines. I’ll explain how to initialize all three kinds of lists. Then I’ll explain how to insert an item into all three kinds.
After that, I’ll explain how to remove items.
Initialization
You can initialize these lists as shown here:
LIST_ENTRY DoubleHead;
SINGLE_LIST_ENTRY SingleHead;
SLIST_HEADER SListHead;
InitializeListHead(&DoubleHead);
SingleHead.Next = NULL;
ExInitializeSListHead(&SListHead);
Don’t forget that you must also allocate and initialize a spin lock for each list. Furthermore, the storage for the list heads and
all the items you put into the lists must come from nonpaged memory because the support routines perform their accesses at
elevated IRQL. Note that the spin lock isn’t used during initialization of the list head because it doesn’t make any sense to
allow contention for list access before the list has been initialized.
Inserting Items
You can insert items at the head and tail of a doubly-linked list and at the head (only) of a singly-linked list or an S-List:
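// A sketch; item1, item2, and item3 stand for pointers to your own structures, each of
// which embeds a link field of the appropriate type
PLIST_ENTRY prevhead = ExInterlockedInsertHeadList(&DoubleHead, &item1->dlink, &spinlock);
PLIST_ENTRY prevtail = ExInterlockedInsertTailList(&DoubleHead, &item1->dlink, &spinlock);
PSINGLE_LIST_ENTRY prevtop = ExInterlockedPushEntryList(&SingleHead, &item2->slink, &spinlock);
PSLIST_ENTRY prevstop = ExInterlockedPushEntrySList(&SListHead, &item3->sllink, &spinlock);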
The return values are the addresses of the elements previously at the head (or tail) of the list in question. Note that the element
addresses you use with these functions are the addresses of list entry structures that are usually embedded in larger structures
of some kind, and you’ll need to use the CONTAINING_RECORD macro to recover the address of the surrounding structure.
Removing Items
You can remove items from the head of any of these lists:
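PLIST_ENTRY dentry = ExInterlockedRemoveHeadList(&DoubleHead, &spinlock);
PSINGLE_LIST_ENTRY sentry = ExInterlockedPopEntryList(&SingleHead, &spinlock);
PSLIST_ENTRY slentry = ExInterlockedPopEntrySList(&SListHead, &spinlock);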
The return values are NULL if the respective lists are empty. Be sure to test the return value for NULL before applying the
CONTAINING_RECORD macro to recover a containing structure pointer.
IRQL Restrictions
You can call the S-List functions only while running at or below DISPATCH_LEVEL. The ExInterlockedXxx functions for
accessing doubly-linked or singly-linked lists can be called at any IRQL so long as all references to the list use an
ExInterlockedXxx call. The reason for no IRQL restrictions is that the implementations of these functions disable interrupts,
which is tantamount to raising IRQL to the highest possible level. Once interrupts are disabled, these functions then acquire the
spin lock you've specified. Since no other code can gain control on the same CPU, and since no code on another CPU can acquire
the spin lock until the owner releases it, the list operations are safe no matter what IRQL the callers happen to be running at.
NOTE
The DDK documentation states this rule in an overly restrictive way for at least some of the ExInterlockedXxx
functions. It says that all callers must be running at some single IRQL less than or equal to the DIRQL of your
interrupt object. There is, in fact, no requirement that all callers be at the same IRQL because you can call the
functions at any IRQL. Likewise, no <= DIRQL restriction exists either, but there’s also no reason for the code
you and I write to raise IRQL higher than that.
It’s perfectly OK for you to use ExInterlockedXxx calls to access a singly-linked or doubly-linked list (but not an S-List) in
some parts of your code and to use the noninterlocked functions (InsertHeadList and so on) in other parts of your code if you
follow a simple rule. Before using a noninterlocked primitive, acquire the same spin lock that your interlocked calls use.
Furthermore, restrict list access to code running at or below DISPATCH_LEVEL. For example:
VOID Function1()
{
ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
KIRQL oldirql;
KeAcquireSpinLock(spinlock, &oldirql);
InsertHeadList(...);
RemoveTailList(...);
KeReleaseSpinLock(spinlock, oldirql);
}
VOID Function2()
{
ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
ExInterlockedInsertTailList(..., spinlock);
}
The first function must be running at or below DISPATCH_LEVEL because that’s a requirement of calling KeAcquireSpinLock.
The reason for the IRQL restriction on the interlocked calls in the second function is as follows: Suppose Function1 acquires
the spin lock in preparation for performing some list accesses. Acquiring the spin lock raises IRQL to DISPATCH_LEVEL.
Now suppose an interrupt occurs on the same CPU at a higher IRQL and Function2 gains control to use one of the
ExInterlockedXxx routines. The kernel will now attempt to acquire the same spin lock, and the CPU will deadlock. This
problem arises from allowing code running at two different IRQLs to use the same spin lock: Function1 is at
DISPATCH_LEVEL, and Function2 is—practically speaking, anyway—at HIGH_LEVEL when it tries to recursively acquire
the lock.
Chapter 5
5 The I/O Request Packet
The operating system uses a data structure known as an I/O request packet, or IRP, to communicate with a kernel-mode device
driver. In this chapter, I’ll discuss this important data structure and the means by which it’s created, sent, processed, and
ultimately destroyed. I’ll include a discussion of the relatively complex subject of IRP cancellation.
This chapter is rather abstract, I’m afraid, because I haven’t yet talked about any of the concepts that surround specific types of
I/O request packets (IRPs). You might, therefore, want to skim this chapter and refer back to it while you’re reading later
chapters. The last major section of this chapter contains a cookbook, if you will, that presents the bare-bones code for handling
IRPs in eight different scenarios. You can use the cookbook without understanding all the theory that this chapter contains.
Flags (ULONG) contains flags that a device driver can read but not directly alter. None of these flags are relevant to a
Windows Driver Model (WDM) driver.
AssociatedIrp (union) is a union of three possible pointers. The alternative that a typical WDM driver might want to access is
named AssociatedIrp.SystemBuffer. The SystemBuffer pointer holds the address of a data buffer in nonpaged kernel-mode
memory. For IRP_MJ_READ and IRP_MJ_WRITE operations, the I/O Manager creates this data buffer if the topmost device
object’s flags specify DO_BUFFERED_IO. For IRP_MJ_DEVICE_CONTROL operations, the I/O Manager creates this buffer
if the I/O control function code indicates that it should. (See Chapter 9.) The I/O Manager copies data sent by user-mode code
to the driver into this buffer as part of the process of creating the IRP. Such data includes the data involved in a WriteFile call
or the so-called input data for a call to DeviceIoControl. For read requests, the device driver fills this buffer with data; the I/O
Manager later copies the buffer back to the user-mode buffer. For control operations that specify METHOD_BUFFERED, the
driver places the so-called output data in this buffer, and the I/O Manager copies it to the user-mode output buffer.
IoStatus (IO_STATUS_BLOCK) is a structure containing two fields that drivers set when they ultimately complete a request.
IoStatus.Status will receive an NTSTATUS code, while IoStatus.Information is a ULONG_PTR that will receive an information
value whose exact content depends on the type of IRP and the completion status. A common use of the Information field is to
hold the total number of bytes transferred by an operation such as IRP_MJ_READ that transfers data. Certain Plug and Play
(PnP) requests use this field as a pointer to a structure that you can think of as the answer to a query.
RequestorMode will equal one of the enumeration constants UserMode or KernelMode, depending on where the original I/O
request originated. Drivers sometimes inspect this value to know whether to trust some parameters.
PendingReturned (BOOLEAN) is meaningful in a completion routine and indicates whether the next lower dispatch routine
returned STATUS_PENDING. This chapter contains a disagreeably long discussion of how to use this flag.
Cancel (BOOLEAN) is TRUE if IoCancelIrp has been called to cancel this request and FALSE if it hasn’t (yet) been called. IRP
cancellation is a relatively complex topic that I’ll discuss fully later on in this chapter (in “Cancelling I/O Requests”).
CancelIrql (KIRQL) is the interrupt request level (IRQL) at which the special cancel spin lock was acquired. You reference this
field in a cancel routine when you release the spin lock.
CancelRoutine (PDRIVER_CANCEL) is the address of an IRP cancellation routine in your driver. You use IoSetCancelRoutine
to set this field instead of modifying it directly.
UserBuffer (PVOID) contains the user-mode virtual address of the output buffer for an IRP_MJ_DEVICE_CONTROL request
for which the control code specifies METHOD_NEITHER. It also holds the user-mode virtual address of the buffer for read
and write requests, but a driver should usually specify one of the device flags DO_BUFFERED_IO or DO_DIRECT_IO and
should therefore not usually need to access the field for reads or writes. When handling a METHOD_NEITHER control
operation, the driver can create its own MDL using this address.
Tail.Overlay is a structure within a union that contains several members potentially useful to a WDM driver. Refer to Figure
5-2 for a map of the Tail union. In the figure, items at the same level as you read left to right are alternatives within a union,
while the vertical dimension portrays successive locations within a structure. Tail.Overlay.DeviceQueueEntry
(KDEVICE_QUEUE_ENTRY) and Tail.Overlay.DriverContext (PVOID[4]) are alternatives within an unnamed union within
Tail.Overlay. The I/O Manager uses DeviceQueueEntry as a linking field within the standard queue of requests for a device.
The cancel-safe queuing routines IoCsqXxx use the last entry in the DriverContext array. If these system usages don’t get in
your way, at moments when the IRP is not in some queue that uses this field and when you own the IRP, you can use the four
pointers in DriverContext in any way you please. Tail.Overlay.ListEntry (LIST_ENTRY) is available for you to use as a linking
field for IRPs in any private queue you choose to implement.
CurrentLocation (CHAR) and Tail.Overlay.CurrentStackLocation (PIO_STACK_LOCATION) aren’t documented for use by
drivers because support functions such as IoGetCurrentIrpStackLocation can be used instead. During debugging, however, it
might help you to realize that CurrentLocation is the index of the current I/O stack location and CurrentStackLocation is a
pointer to it.
NOTE
I’ll discuss the mechanics of creating IRPs a bit further on in this chapter. It helps to know right now that the
StackSize field of a DEVICE_OBJECT indicates how many locations to reserve for an IRP sent to that device’s
driver.
MajorFunction (UCHAR) is the major function code associated with this IRP. This code is a value such as IRP_MJ_READ that
corresponds to one of the dispatch function pointers in the MajorFunction table of a driver object. Because the code is in the
I/O stack location for a particular driver, it’s conceivable that an IRP could start life as an IRP_MJ_READ (for example) and be
transformed into something else as it progresses down the stack of drivers. I’ll show you examples in Chapter 12 of how a
USB driver changes the personality of a read or write request into an internal control operation to submit the request to the
USB bus driver.
MinorFunction (UCHAR) is a minor function code that further identifies an IRP belonging to a few major function classes.
IRP_MJ_PNP requests, for example, are divided into a dozen or so subtypes with minor function codes such as
IRP_MN_START_DEVICE, IRP_MN_REMOVE_DEVICE, and so on.
Parameters (union) is a union of substructures, one for each type of request that has specific parameters. The substructures
include, for example, Create (for IRP_MJ_CREATE requests), Read (for IRP_MJ_READ requests), and StartDevice (for the
IRP_MN_START_DEVICE subtype of IRP_MJ_PNP).
DeviceObject (PDEVICE_OBJECT) is the address of the device object that corresponds to this stack entry. IoCallDriver fills
in this field.
FileObject (PFILE_OBJECT) is the address of the kernel file object to which the IRP is directed. Drivers often use the
FileObject pointer to correlate IRPs in a queue with a request (in the form of an IRP_MJ_CLEANUP) to cancel all queued
IRPs in preparation for closing the file object.
NOTE
Throughout this chapter, I use the terms synchronous and asynchronous IRPs because those are the terms used
in the DDK. Knowledgeable developers in Microsoft wish that the terms threaded and nonthreaded had been
chosen because they better reflect the way drivers use these two types of IRP. As should become clear, you use
a synchronous, or threaded, IRP in a non-arbitrary thread that you can block while you wait for the IRP to finish.
You use an asynchronous, or nonthreaded, IRP in every other case.
The problem with this code is that the KeWaitForSingleObject call will deadlock: when the IRP completes, IoCompleteRequest
will schedule an APC in this thread. The APC routine, if it could run, would set the event. But because you’re already at
APC_LEVEL, the APC cannot run in order to set the event.
If you need to synchronize IRPs sent to another driver, consider the following alternatives:
Use a regular kernel mutex instead of an executive fast mutex. The regular mutex leaves you at PASSIVE_LEVEL and
doesn’t inhibit special kernel APCs.
Use KeEnterCriticalRegion to inhibit all but special kernel APCs, and then use ExAcquireFastMutexUnsafe to acquire the
mutex. This technique won’t work in the original release of Windows 98 because KeEnterCriticalRegion wasn’t
supported there. It will work on all later WDM platforms.
Use an asynchronous IRP. Signal an event in the completion routine. Refer to IRP-handling scenario 8 at the end of this
chapter for a code sample.
A final consideration in calling the two synchronous IRP routines is that you can’t create just any kind of IRP using these
routines. See Table 5-1 for the details. A common trick for creating another kind of synchronous IRP is to ask for an
IRP_MJ_SHUTDOWN, which has no parameters, and then alter the MajorFunction code in the first stack location.
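Here is a sketch of that trick; the event and I/O status block are assumed to have been prepared as usual, and IRP_MJ_Xxx stands
for whatever major function you actually need:

PIRP Irp = IoBuildSynchronousFsdRequest(IRP_MJ_SHUTDOWN, DeviceObject,
    NULL, 0, NULL, &event, &iosb);
PIO_STACK_LOCATION stack = IoGetNextIrpStackLocation(Irp);
stack->MajorFunction = IRP_MJ_Xxx;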
The first argument to IoCallDriver is the address of a device object that you’ve obtained somehow. Often you’re sending an
IRP to the driver under yours in the PnP stack. In that case, the DeviceObject in this fragment is the LowerDeviceObject you
saved in your device extension after calling IoAttachDeviceToDeviceStack. I’ll describe some other common ways of locating a
device object in a few paragraphs.
The I/O Manager initializes the stack location pointer in the IRP to a position one before the actual first location. Because the I/O stack is an
array of IO_STACK_LOCATION structures, you can think of the stack pointer as being initialized to point to the “-1” element,
which doesn’t exist. (In fact, the stack “grows” from high toward low addresses, but that detail shouldn’t obscure the concept
I’m trying to describe here.) We therefore ask for the “next” stack location when we want to initialize the first one.
As you can see, IoCallDriver simply advances the stack pointer and calls the appropriate dispatch routine in the driver for the
target device object. It returns the status code that that dispatch routine returns. Sometimes I see online help requests wherein
people attribute one or another unfortunate action to IoCallDriver. (For example, “IoCallDriver is returning an error code for
my IRP….”) As you can see, the real culprit is a dispatch routine in another driver.
IoGetDeviceObjectPointer
If you know the name of the device object, you can call IoGetDeviceObjectPointer as shown here:
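UNICODE_STRING devname;     // initialize with the target device's name, for example via RtlInitUnicodeString
PFILE_OBJECT FileObject;
PDEVICE_OBJECT DeviceObject;
NTSTATUS status = IoGetDeviceObjectPointer(&devname, FILE_READ_DATA,
    &FileObject, &DeviceObject);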
This function returns two pointers: one to a FILE_OBJECT and one to a DEVICE_OBJECT.
To help defeat elevation-of-privilege attacks, specify the most restricted access consistent with your needs. For example,
if you’ll just be reading data, specify FILE_READ_DATA.
When you create an IRP for a target you discover this way, you should set the FileObject pointer in the first stack location.
Furthermore, it’s a good idea to take an extra reference to the file object until after IoCallDriver returns. The following
fragment illustrates both these ideas:
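// A sketch; the IRP is assumed to have been built already
PIO_STACK_LOCATION stack = IoGetNextIrpStackLocation(Irp);
stack->FileObject = FileObject;        // let the target driver see the file object
ObReferenceObject(FileObject);         // extra reference held across the call
NTSTATUS status = IoCallDriver(DeviceObject, Irp);
ObDereferenceObject(FileObject);       // release the extra reference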
The reason you put the file object pointer in each stack location is that the target driver might be using fields in the file object
to record per-handle information. The reason you take an extra reference to the file object is that you’ll have code somewhere
in your driver that dereferences the file object in order to release your hold on the target device. (See the next paragraph.)
Should that code execute before the target driver’s dispatch routine returns, the target driver might be removed from memory
before its dispatch routine returns. The extra reference prevents that bad result.
NOTE
Removability of devices in a Plug and Play environment is the ultimate source of the early-unload problem
mentioned in the text. I discuss this problem in much greater detail in the next chapter. The upshot of that
discussion is that it’s your responsibility to avoid sending an IRP to a driver that might no longer be in memory
and to prevent the PnP manager from unloading a driver that’s still processing an IRP you’ve sent to that driver.
One aspect of how you fulfill that responsibility is shown in the text: take an extra reference to the file object
returned by IoGetDeviceObjectPointer around the call to IoCallDriver. In most drivers, you’ll probably need the
extra reference only when you’re sending an asynchronous IRP. In that case, the code that ordinarily
dereferences the file object is likely to be in some other part of your driver that runs asynchronously with the
call to IoCallDriver—say, in the completion routine you’re obliged to install for an asynchronous IRP. If you send
a synchronous IRP, you’re much more likely to code your driver in such a way that you don’t dereference the file
object until the IRP completes.
When you no longer need the device object, dereference the file object:
ObDereferenceObject(FileObject);
After making this call, don’t use either of the file or device object pointers.
IoGetDeviceObjectPointer performs several steps to locate the two pointers that it returns to you:
1. It uses ZwOpenFile to open a kernel handle to the named device object. Internally, this will cause the Object Manager to
create a file object and to send an IRP_MJ_CREATE to the target device. ZwOpenFile returns a file handle.
2. It calls ObReferenceObjectByHandle to get the address of the FILE_OBJECT that the handle represents. This address
becomes the FileObject return value.
3. It calls IoGetRelatedDeviceObject to get the address of the DEVICE_OBJECT to which the file object refers. This
address becomes the DeviceObject return value.
4. It calls ZwClose to close the handle.
Instead of naming a device object, the function driver for the target device might have registered a device
interface. I showed you the user-mode code for enumerating instances of registered interfaces in Chapter 2. I’ll
discuss the kernel-mode equivalent of that enumeration code in Chapter 6, when I discuss Plug and Play. The
upshot of that discussion is that you can obtain the symbolic link names for all the devices that expose a
particular interface. With a bit of effort, you can then locate the desired device object.
The reference that IoGetDeviceObjectPointer claims to the file object effectively pins the device object in memory too.
Releasing that reference indirectly releases the device object.
Based on this explanation of how IoGetDeviceObjectPointer works, you can see why it will sometimes fail with
STATUS_ACCESS_DENIED, even though you haven’t done anything wrong. If the target driver implements a “one handle
only” policy, and if a handle happens to be open, the driver will cause the IRP_MJ_CREATE to fail. That failure causes the
ZwOpenFile call to fail in turn. Note that you can expect this result if you try to locate a device object for a serial port or
SmartCard reader that happens to already be open.
Sometimes driver programmers decide they don’t want the clutter of two pointers to what appears to be basically the same
object, so they release the file object immediately after calling IoGetDeviceObjectPointer, as shown here:
status = IoGetDeviceObjectPointer(...);
ObReferenceObject(DeviceObject);
ObDereferenceObject(FileObject);
Referencing the device object pins it in memory until you dereference it. Dereferencing the file object allows the I/O Manager
to delete it right away.
Releasing the file object immediately might or might not be OK, depending on the target driver. Consider these fine points
before you decide to do it:
1. Dereferencing the file object will cause the I/O Manager to send an immediate IRP_MJ_CLEANUP to the target driver.
2. IRPs that the target driver queues will no longer be associated with a file object. When you eventually release the device
object reference, the target driver will probably not be able to cancel any IRPs you sent it that remain on its queues.
3. In many situations, the I/O Manager will also send an IRP_MJ_CLOSE to the target driver. (If you’ve opened a disk file,
the file system driver’s use of the system cache will probably cause the IRP_MJ_CLOSE to be deferred.) Many drivers,
including the standard driver for serial ports, will now refuse to process IRPs that you send them.
4. Instead of claiming an extra reference to the file object around calls to IoCallDriver, you’ll want to reference the device
object instead.
NOTE
I recommend avoiding an older routine named IoAttachDevice, which appears superficially to be a sort-of
combination of IoGetDeviceObjectPointer and IoAttachDeviceToDeviceStack. The older routine does its internal
ZwClose call after attaching your device object. Your driver will receive the resulting IRP_MJ_CLOSE. To handle
the IRP correctly, you must call IoAttachDevice in such a way that your dispatch routine has access to the
location you specify for the output DEVICE_OBJECT pointer. It turns out that IoAttachDevice sets your output
pointer before calling ZwClose and depends on you using it to forward the IRP_MJ_CLOSE to the target device.
This is the only example I’ve seen in many decades of programming where you’re required to use the return
value from a function before the function actually returns.
IoGetAttachedDeviceReference
To send an IRP to all the drivers in your own PnP stack, use IoGetAttachedDeviceReference, as shown here:
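PDEVICE_OBJECT tdo = IoGetAttachedDeviceReference(fdo);
// ... build and send the IRP to the top of the stack ...
IoCallDriver(tdo, Irp);
// ... then, when you no longer need the pointer: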
ObDereferenceObject(tdo);
This function returns the address of the topmost device object in your own stack and claims a reference to that object. Because
of the reference you hold, you can be sure that the pointer will remain valid until you release the reference. As discussed earlier,
you might also want to take an extra reference to the topmost device object until IoCallDriver returns.
NTSTATUS DispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp)   // the routine name is illustrative
{
    PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    ...
    return STATUS_Xxx;
}
1. You generally need to access the current stack location to determine parameters or to examine the minor function code.
2. You also generally need to access the device extension you created and initialized during AddDevice.
3. You’ll be returning some NTSTATUS code to IoCallDriver, which will propagate the code back to its caller.
Where I used an ellipsis in the foregoing prototypical dispatch function, a dispatch function has to choose between three
courses of action. It can complete the request immediately, pass the request down to a lower-level driver in the same driver
stack, or queue the request for later processing by other routines in this driver.
Completing an IRP
Someplace, sometime, someone must complete every IRP. You might want to complete an IRP in your dispatch routine in
cases like these:
If the request is erroneous in some easily determined way (such as a request to rewind a printer or to eject the keyboard), the
dispatch routine should cause the request to fail by completing it with an appropriate status code.
If the request calls for information that the dispatch function can easily determine (such as a control request asking for the
driver’s version number), the dispatch routine should provide the answer and complete the request with a successful status
code.
Mechanically, completing an IRP entails filling in the Status and Information members within the IRP’s IoStatus block and
calling IoCompleteRequest. The Status value is one of the codes defined by manifest constants in the DDK header file
NTSTATUS.H. Refer to Table 5-3 for an abbreviated list of status codes for common situations. The Information value
depends on what type of IRP you’re completing and on whether you’re causing the IRP to succeed or to fail. Most of the time,
when you’re causing an IRP to fail (that is, completing it with an error status of some kind), you’ll set Information to 0. When
you cause an IRP that involves data transfer to succeed, you ordinarily set the Information field equal to the number of bytes
transferred.
NOTE
Always be sure to consult the DDK documentation for the correct setting of IoStatus.Information for the IRP
you’re dealing with. In some flavors of IRP_MJ_PNP, for example, this field is used as a pointer to a data
structure that the PnP Manager is responsible for releasing. If you were to overstore the Information field with
0 when causing the request to fail, you would unwittingly cause a resource leak.
Because completing a request is something you do so often, I find it useful to have a helper routine to carry out the mechanics:
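NTSTATUS CompleteRequest(PIRP Irp, NTSTATUS status, ULONG_PTR info)
{                                          // a sketch of such a helper
    Irp->IoStatus.Status = status;
    Irp->IoStatus.Information = info;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return status;
}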
I defined this routine in such a way that it returns whatever status value you supply as its second argument. That’s because I’m
such a lazy typist: the return value allows me to use this helper whenever I want to complete a request and then immediately
return a status code. For example:
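return CompleteRequest(Irp, STATUS_INVALID_DEVICE_REQUEST, 0);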
You might notice that the Information argument to the CompleteRequest function is typed as a ULONG_PTR. In other words,
this value can be either a ULONG or a pointer to something (and therefore potentially 64 bits wide).
When you call IoCompleteRequest, you supply a priority boost value to be applied to whichever thread is currently waiting for
this request to complete. You normally choose a boost value that depends on the type of device, as suggested by the manifest
constant names listed in Table 5-4. The priority adjustment improves the throughput of threads that frequently wait for I/O
operations to complete. Events for which the end user is directly responsible, such as keyboard or mouse operations, result in
greater priority boosts in order to give preference to interactive tasks. Consequently, you want to choose the boost value with at
least some care. Don’t use IO_SOUND_INCREMENT for absolutely every operation a sound card driver finishes, for
example—it’s not necessary to apply this extraordinary priority increment to a get-driver-version control request.
So far, I’ve just explained how to call IoCompleteRequest. That function performs several tasks that you need to understand:
Calling completion routines that various drivers might have installed. I’ll discuss the important topic of I/O completion
routines later in this chapter.
Unlocking any pages belonging to Memory Descriptor List (MDL) structures attached to the IRP. An MDL will be used
for the buffer for an IRP_MJ_READ or IRP_MJ_WRITE for a device whose device object has the DO_DIRECT_IO flag
set. Control operations also use an MDL if the control code’s buffering method specifies one of the
METHOD_XX_DIRECT methods. I’ll discuss these issues more fully in Chapter 7 and Chapter 9, respectively.
Scheduling a special kernel APC to perform final cleanup on the IRP. This cleanup includes copying input data back to a
user buffer, copying the IRP’s ending status, and signaling whichever event the originator of the IRP might be waiting on.
The fact that completion processing includes an APC, and that the cleanup includes setting an event, imposes some
exacting requirements on the way a driver implements a completion routine, so I’ll also discuss this aspect of I/O
completion in more detail later.
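Recall that AddDevice will have attached your device object to the stack with a call along these lines (a sketch; pdx stands for
your device extension):

pdx->LowerDeviceObject = IoAttachDeviceToDeviceStack(fdo, pdo);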
where fdo is the address of your own device object and pdo is the address of the physical device object (PDO) at the bottom of
the device stack. IoAttachDeviceToDeviceStack returns to you the address of the device object immediately underneath yours.
When you decide to forward an IRP that you received from above, this is the device object you’ll specify in the eventual call to
IoCallDriver.
Before passing an IRP to another driver, be sure to remove any cancel routine that you might have installed for the IRP.
As I mentioned just a few paragraphs ago, you’ll probably fulfill this requirement without specifically worrying about it.
Your queue management code will zero the cancel routine pointer when it dequeues an IRP. If you never queued the IRP in the
first place, the driver above you will have made sure the cancel routine pointer was NULL. The Driver Verifier will make sure
that you don’t break this rule.
When you pass an IRP down, you have the additional responsibility of initializing the IO_STACK_LOCATION that the next
driver will use to obtain its parameters. One way of doing this is to perform a physical copy, like this:
IoCopyCurrentIrpStackLocationToNext(Irp);
status = IoCallDriver(pdx->LowerDeviceObject, Irp);
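When you don't plan to install a completion routine, a commonly seen shortcut is the following (a sketch of the usual idiom):

IoSkipCurrentIrpStackLocation(Irp);
status = IoCallDriver(pdx->LowerDeviceObject, Irp);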
IoSkipCurrentIrpStackLocation retards the IRP’s stack pointer by one position. IoCallDriver will immediately advance the
stack pointer. The net effect is to not change the stack pointer. When the next driver’s dispatch routine calls
IoGetCurrentIrpStackLocation, it will retrieve exactly the same IO_STACK_LOCATION pointer that we were working with,
and it will thereby process exactly the same request (same major and minor function codes) with the same parameters.
CAUTION
The version of IoSkipCurrentIrpStackLocation that you get when you use the Windows Me or Windows 2000
build environment in the DDK is a macro that generates two statements without surrounding braces. Therefore,
you mustn’t use it in a construction like this:
if (<expression>)
IoSkipCurrentIrpStackLocation(Irp); // <== don't do this!
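If you do need such a construction, enclose the call in braces so that both generated statements stay under the if:

if (<expression>)
{
    IoSkipCurrentIrpStackLocation(Irp);   // OK: the braces keep both generated statements together
}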
The explanation of why IoSkipCurrentIrpStackLocation works is so tricky that I thought an illustration might help. Figure 5-6
illustrates a situation in which three drivers are in a particular stack: yours (the function device object [FDO]) and two others
(an upper filter device object [FiDO] and the PDO). In the picture on the left, you see the relationship between stack locations,
parameters, and completion routines when we do the copy step with IoCopyCurrentIrpStackLocationToNext. In the picture on
the right, you see the same relationships when we use the IoSkipCurrentIrpStackLocation shortcut. In the right-hand picture,
the third and last stack location is fallow, but nobody gets confused by that fact.
NTSTATUS DispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp)    // a sketch; the names here are illustrative
{
    IoMarkIrpPending(Irp);                                  // see note 1
    StartPacket(&pdx->dqSomething, fdo, Irp, OnCancel);     // see note 2; OnCancel is an illustrative cancel routine
    return STATUS_PENDING;                                  // see note 3
}
1. Whenever we return STATUS_PENDING from a dispatch routine (as we’re about to do here), we make this call to help
the I/O Manager avoid an internal race condition. We must do this before we relinquish ownership of the IRP.
2. If our device is currently busy or stalled because of a PnP or Power event, StartPacket puts the request in a queue.
Otherwise, StartPacket marks the device as busy and calls our StartIo routine. I’ll describe the StartIo routine in the next
section. The last argument is the address of a cancel routine. I’ll discuss cancel routines later in this chapter.
3. We return STATUS_PENDING to tell our caller that we’re not done with this IRP yet.
It’s important not to touch the IRP once we call StartPacket. By the time that function returns, the IRP might have been
completed and the memory it occupies released. The pointer we have might, therefore, now be invalid.
A StartIo routine generally receives control at DISPATCH_LEVEL, meaning that it must not generate any page faults.
Your job in StartIo is to commence the IRP you’ve been handed. How you do this depends entirely on your device. Often you
will need to access hardware registers that are also used by your interrupt service routine (ISR) and, perhaps, by other routines
in your driver. In fact, sometimes the easiest way to commence a new operation is to store some state information in your
device extension and then fake an interrupt. Because either of these approaches needs to be carried out under the protection of
the same spin lock that protects your ISR, the correct way to proceed is to call KeSynchronizeExecution. For example:
VOID StartIo(...)
{
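    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    // A sketch: fdo is the device object argument elided above, and TransferFirst is
    // this driver's SynchCritSection routine (discussed next)
    KeSynchronizeExecution(pdx->InterruptObject, TransferFirst, (PVOID) pdx);
}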
The TransferFirst routine shown here is an example of the generic class of SynchCritSection routines, so called because
they’re synchronized with the ISR. I’ll discuss the SynchCritSection concept in more detail in Chapter 7.
In Windows XP and later systems, you can follow this template instead of calling KeSynchronizeExecution:
VOID StartIo(...)
{
KIRQL oldirql = KeAcquireInterruptSpinLock(pdx->InterruptObject);
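    // ... touch the hardware here, just as a SynchCritSection routine would ...
    KeReleaseInterruptSpinLock(pdx->InterruptObject, oldirql);
}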
Once StartIo gets the device busy handling the new request, it returns. You’ll see the request next when your device interrupts
to signal that it’s done with whatever transfer you started.
return TRUE;
}
The first argument of your ISR is the address of the interrupt object created by IoConnectInterrupt, but you’re unlikely to use
this argument. The second argument is whatever context value you specified in your original call to IoConnectInterrupt; it will
probably be the address of your device extension, as shown in this fragment.
I’ll discuss the duties of your ISR in detail in Chapter 7 in connection with reading and writing data, the subject to which
interrupt handling is most relevant. To carry on with this discussion of the standard model, I need to tell you that one of the
likely things for the ISR to do is to schedule a deferred procedure call (DPC). The purpose of the DPC is to let you do things,
such as calling IoCompleteRequest, that can’t be done at the rarified DIRQL at which your ISR runs. So you might have a line
of code like this one:
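IoRequestDpc(pdx->DeviceObject, NULL, pdx);  // a sketch; DeviceObject is assumed to be a back pointer in the device extension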
You’ll next see the IRP in the DPC routine you registered inside AddDevice with your call to IoInitializeDpcRequest. The
traditional name for that routine is DpcForIsr because it’s the DPC routine your ISR requests.
StartNextPacket(&pdx->dqSomething, fdo);
IoCompleteRequest(Irp, boost);
}
StartNextPacket removes the next IRP from your queue and sends it to StartIo.
IoCompleteRequest completes the IRP you specify as the first argument. The second argument specifies a priority boost for the
thread that has been waiting for this IRP. You’ll also fill in the IoStatus block within the IRP before calling IoCompleteRequest,
as I explained earlier, in the section “Completing an IRP.”
I’m not (yet) showing you how to determine which IRP has just completed. You might notice that the third argument to the
DPC is typed as a pointer to an IRP. This is because, once upon a time, people often specified an IRP address as one of the
context parameters to IoRequestDpc, and that value showed up here. Trying to communicate an IRP pointer from the function
that queues a DPC is unwise, though, because it’s possible for there to be just one call to the DPC routine for any number of
requests to queue that DPC. Accordingly, the DPC routine should develop the current IRP pointer based on whatever scheme
you happen to be using for IRP queuing.
The call to IoCompleteRequest is the end of this standard way of handling an I/O request. After that call, the I/O Manager (or
whichever entity created the IRP in the first place) owns the IRP once more. That entity will destroy the IRP and might unblock
a thread that has been waiting for the request to complete.
Completion Routines
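IoSetCompletionRoutine(Irp, CompletionRoutine, context,
    InvokeOnSuccess, InvokeOnError, InvokeOnCancel);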
Irp is the request whose completion you want to know about. CompletionRoutine is the address of the completion routine you
want called, and context is an arbitrary pointer-size value you want passed as an argument to the completion routine. The
InvokeOnXxx arguments are Boolean values indicating whether you want the completion routine called in three different
circumstances:
InvokeOnSuccess means you want the completion routine called when somebody completes the IRP with a status code
that passes the NT_SUCCESS test.
InvokeOnError means you want the completion routine called when somebody completes the IRP with a status code that
does not pass the NT_SUCCESS test.
InvokeOnCancel means you want the completion routine called when somebody calls IoCancelIrp before completing the
IRP. I worded this quite delicately: IoCancelIrp will set the Cancel flag in the IRP, and that’s the condition that gets tested
if you specify this argument. A cancelled IRP might end up being completed with STATUS_CANCELLED (which would
cause the NT_SUCCESS test to fail) or with any other status at all. If the IRP gets completed with an error and you
specified InvokeOnError, InvokeOnError by itself will cause your completion routine to be called. Conversely, if the IRP
gets completed without error and you specified InvokeOnSuccess, InvokeOnSuccess by itself will cause your completion
routine to be called. In these cases, InvokeOnCancel will be redundant. But if you left out one or the other (or both) of
InvokeOnSuccess or InvokeOnError, the InvokeOnCancel flag will let you see the eventual completion of an IRP whose
Cancel flag has been set, no matter which status is used for the completion.
At least one of these three flags must be TRUE. Note that IoSetCompletionRoutine is a macro, so you want to avoid arguments
that generate side effects. The three flag arguments and the function pointer, in particular, are each referenced twice by the
macro.
IoSetCompletionRoutine installs the completion routine address and context argument in the next
IO_STACK_LOCATION—that is, in the stack location in which the next lower driver will find its parameters. Consequently,
the lowest-level driver in a particular stack of drivers doesn’t dare attempt to install a completion routine. Doing so would be
pretty futile, of course, because—by definition of lowest-level driver—there’s no driver left to pass the request on to.
CAUTION
Recall that you are responsible for initializing the next I/O stack location before you call IoCallDriver. Do this
initialization before you install a completion routine. This step is especially important if you use
IoCopyCurrentIrpStackLocationToNext to initialize the next stack location because that function clears some
flags that IoSetCompletionRoutine sets.
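A completion routine has this prototype:

NTSTATUS CompletionRoutine(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context);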
It receives pointers to the device object and the IRP, and it also receives whichever context value you specified in the call to
IoSetCompletionRoutine. Completion routines can be called at DISPATCH_LEVEL in an arbitrary thread context but can also
be called at PASSIVE_LEVEL or APC_LEVEL. To accommodate the worst case (DISPATCH_LEVEL), completion routines
therefore need to be in nonpaged memory and must call only service functions that are callable at or below DISPATCH_LEVEL.
To accommodate the possibility of being called at a lower IRQL, however, a completion routine shouldn’t call functions such
as KeAcquireSpinLockAtDpcLevel that assume they’re at DISPATCH_LEVEL to start with.
There are really just two possible return values from a completion routine:
STATUS_MORE_PROCESSING_REQUIRED, which aborts the completion process immediately. The spelling of this
status code obscures its actual purpose, which is to short-circuit the completion of an IRP. Sometimes, a driver actually
does some additional processing on the same IRP. Other times, the flag just means, “Yo, IoCompleteRequest! Like, don’t
touch this IRP no more, dude!” Future versions of the DDK will therefore define an enumeration constant,
StopCompletion, that is numerically the same as STATUS_MORE_PROCESSING_REQUIRED but more evocatively
named. (Future printings of this book may also employ better grammar in describing the meaning to be ascribed to the
constant, at least if my editors get their way.)
Anything else, which allows the completion process to continue. Because any value besides
STATUS_MORE_PROCESSING_REQUIRED has the same meaning as any other, I usually just code STATUS_SUCCESS.
Future versions of the DDK will define STATUS_CONTINUE_COMPLETION and an enumeration constant, Con-
tinueCompletion, that are numerically the same as STATUS_SUCCESS.
I’ll have more to say about these return codes a bit further on in this chapter.
NOTE
The device object pointer argument to a completion routine is the value left in the I/O stack location’s
DeviceObject pointer. IoCallDriver ordinarily sets this value. People sometimes create an IRP with an extra stack
location so that they can pass parameters to a completion routine without creating an extra context structure.
Such a completion routine gets a NULL device object pointer unless the creator sets the DeviceObject field.
A. Your computer implodes, creating a gravitational singularity into which the universe instantaneously
collapses.
B. You receive the blue screen of death because you’re supposed to know better than to install a completion
routine in this situation.
C. IoCompleteRequest calls your completion routine. Unless the completion routine returns
STATUS_MORE_PROCESSING_REQUIRED, IoCompleteRequest then completes the IRP normally.
D. IoCompleteRequest doesn’t call your completion routine. It completes the IRP normally.
if (Irp->PendingReturned) IoMarkIrpPending(Irp);
Now we’ll explore the hard way to learn about IoMarkIrpPending. Some I/O Manager routines manage an
IRP with code that functions much as does this example:
KEVENT event;
IO_STATUS_BLOCK iosb;
KeInitializeEvent(&event, ...);
PIRP Irp = IoBuildDeviceIoControlRequest(..., &event, &iosb);
NTSTATUS status = IoCallDriver(SomeDeviceObject, Irp);
if (status == STATUS_PENDING)
{
KeWaitForSingleObject(&event, ...);
status = iosb.Status;
}
else
<cleanup IRP>
The key here is that, if the returned status is STATUS_PENDING, the entity that creates this IRP will wait on the event that was
specified in the call to IoBuildDeviceIoControlRequest. The same discussion applies to an IRP built by IoBuildSynchronousFsdRequest—the important factor is the conditional wait on the event.
So who, you might well wonder, signals that event? IoCompleteRequest does this signaling indirectly by scheduling an APC to
the same routine that performs the <cleanup IRP> step in the preceding pseudocode. That cleanup code will do many tasks,
including calling IoFreeIrp to release the IRP and KeSetEvent to set the event on which the creator might be waiting. For some
types of IRP, IoCompleteRequest will always schedule the APC. For other types of IRP, though, IoCompleteRequest will
schedule the APC only if the SL_PENDING_RETURNED flag is set in the topmost stack location. You don’t need to know
which types of IRP fall into these two categories because Microsoft might change the way this function works and invalidate
the deductions you might make if you knew. You do need to know, though, that IoMarkIrpPending is a macro whose only purpose
is to set SL_PENDING_RETURNED in the current stack location. Thus, if the dispatch routine in the topmost driver on the
stack does this:
return STATUS_PENDING;
things will work out nicely. (I’m violating my naming convention here to emphasize where this dispatch function lives.)
Because this dispatch routine returns STATUS_PENDING, the originator of the IRP will call KeWaitForSingleObject. Because
the dispatch routine sets the SL_PENDING_RETURNED flag, IoCompleteRequest will know to set the event on which the
originator waits.
But suppose the topmost driver merely passed the request down the stack, and the second driver pended the IRP:
return STATUS_PENDING;
}
Apparently, the second driver’s stack location contains the SL_PENDING_RETURNED flag, but the first driver’s does not.
IoCompleteRequest anticipates this situation, however, by propagating the SL_PENDING_RETURNED flag whenever it
unwinds a stack location that doesn’t have a completion routine associated with it. Because the top driver didn’t install a
completion routine, therefore, IoCompleteRequest will have set the flag in the topmost location, and it will have caused the
completion event to be signaled.
In another scenario, the topmost driver uses IoSkipCurrentIrpStackLocation instead of IoCopyCurrentIrpStackLocationToNext.
Here, everything works out by default. This is because the IoMarkIrpPending call in SecondDriverDispatchSomething sets the
flag in the topmost stack location to begin with.
Things get sticky if the topmost driver installs a completion routine:
return STATUS_PENDING;
}
Here IoCompleteRequest won’t propagate SL_PENDING_RETURNED into the topmost stack location. I’m not exactly sure
why the Windows NT designers decided not to do this propagation, but it’s a fact that they did so decide. Instead, just before
calling the completion routine, IoCompleteRequest sets the PendingReturned flag in the IRP to whichever value
SL_PENDING_RETURNED had in the immediately lower stack location. The completion routine must then take over the job
of setting SL_PENDING_RETURNED in its own location:
return STATUS_SUCCESS;
}
If you omit this step, you’ll find that threads deadlock waiting for someone to signal an event that’s destined never to be
signaled. So don’t omit this step.
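Putting these rules together, the standard completion routine for a driver that forwards IRPs and lets completion continue looks something like this sketch (names are illustrative):

NTSTATUS CompletionRoutine(PDEVICE_OBJECT fdo, PIRP Irp, PVOID context)
{
    if (Irp->PendingReturned)
        IoMarkIrpPending(Irp);      // take over propagation of SL_PENDING_RETURNED
    <any other postprocessing>
    return STATUS_SUCCESS;          // let IoCompleteRequest keep unwinding the stack
}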
Given the importance of the call to IoMarkIrpPending, driver programmers through the ages have tried to find other ways of
dealing with the problem. Here is a smattering of bad ideas.
The reason this is a bad idea is that the IRP might already be complete, and someone might already have called IoFreeIrp, by
the time IoCallDriver returns. You must treat the pointer as poison as soon as you give it away to a function that might
complete the IRP.
This is a bad idea if the next driver happens to complete the IRP in its dispatch routine and returns a nonpending status. In this
situation, IoCompleteRequest will cause all the completion cleanup to happen. When you return a nonpending status, the I/O
Manager routine that originated the IRP might call the same completion cleanup routine a second time. This leads to a
double-completion bug check.
Remember always to pair the call to IoMarkIrpPending with returning STATUS_PENDING. That is, do both or neither, but
never one without the other.
Bad Idea # 3—Call IoMarkIrpPending Regardless of the Return Code from the Completion Routine
In this example, the programmer forgot the qualification of the rule about when to make the call to IoMarkIrpPending from a
completion routine:
Irp->IoStatus.Status = status;
IoCompleteRequest(Irp, IO_NO_INCREMENT);
return status;
}
return STATUS_MORE_PROCESSING_REQUIRED;
}
What’s probably going on here is that the programmer wants to forward the IRP synchronously and then resume processing the
IRP after the lower driver finishes with it. (See IRP-handling scenario 7 at the end of this chapter.) That’s how you’re supposed
to handle certain PnP IRPs, in fact. This example can cause a double-completion bug check, though, if the lower driver
happens to return STATUS_PENDING. This is actually the same scenario as in the previous bad idea: your dispatch routine is
returning a nonpending status, but your stack frame has the pending flag set. People often get away with this bad idea, which
existed in the IRP_MJ_PNP handlers of many early Windows 2000 DDK samples, because no one ever posts a Plug and Play
IRP. (Therefore, PendingReturned is never set, and the incorrect call to IoMarkIrpPending never happens.)
A variation on this idea occurs when you create an asynchronous IRP of some kind. You’re supposed to provide a completion
routine to free the IRP, and you’ll necessarily return STATUS_MORE_PROCESSING_REQUIRED from that completion
routine to prevent IoCompleteRequest from attempting to do any more work on an IRP that has disappeared:
SOMETYPE SomeFunction()
{
PIRP Irp = IoBuildAsynchronousFsdRequest(...);
IoSetCompletionRoutine(Irp, MyCompletionRoutine, ...);
IoCallDriver(...);
}
The problem here is that there is no current stack location inside this completion routine! Consequently, IoMarkIrpPending
modifies a random piece of storage. Besides, it’s fundamentally silly to worry about setting a flag that IoCompleteRequest will
never inspect: you’re returning STATUS_MORE_PROCESSING_REQUIRED, which is going to cause IoCompleteRequest to
immediately return to its own caller without doing another single thing with your IRP.
Avoid both of these problems by remembering not to call IoMarkIrpPending from a completion routine that returns
STATUS_MORE_PROCESSING_REQUIRED.
return STATUS_SUCCESS;
}
This strategy isn’t so much bad as inefficient. If SL_PENDING_RETURNED is set in the topmost stack location,
IoCompleteRequest schedules a special kernel APC to do the work in the context of the originating thread. Generally speaking,
if a dispatch routine posts an IRP, the IRP will end up being completed in some other thread. An APC is needed to get back into
the original context in order to do some buffer copying. But scheduling an APC is relatively expensive, and it would be nice to
avoid the overhead if you’re still in the original thread. Thus, if your dispatch routine doesn’t actually return
STATUS_PENDING, you shouldn’t mark your stack frame pending.
But nothing really awful will happen if you implement this bad idea, in the sense that the system will keep working normally.
Note also that Microsoft might someday change the way completion cleanup happens, so don’t write your driver on the assumption that the mechanism described here will stay the same forever.
NOTE
The DDK documentation for IoSetCompletionRoutineEx suggests that it’s useful only for non-PnP drivers. As
discussed here, however, on many occasions a PnP driver might need to use this function to achieve full
protection from early unloading.
The DeviceObject parameter is a pointer to your own device object. IoSetCompletionRoutineEx takes an extra reference to this
object just before calling your completion routine, and it releases the reference when your completion routine returns. The
extra reference pins the device object and, more important, your driver, in memory. But because this function doesn’t exist in
Windows versions prior to XP, you need to consider carefully whether you want to go to the trouble of calling
MmGetSystemRoutineAddress (and loading a Windows 98/Me implementation of the same function) to dynamically link to this
routine if it happens to be available. It seems to me that there are five discrete situations to consider:
LIST_ENTRY IrpQueue;
BOOLEAN DeviceBusy;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;
NTSTATUS AddDevice(...)
{
InitializeListHead(&pdx->IrpQueue);
Then you can write two naive routines for queuing and dequeuing IRPs:
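A sketch of what those two routines might look like (the names NaiveStartPacket and NaiveStartNextPacket come from the discussion that follows, and StartIo stands for your own routine):

VOID NaiveStartPacket(PDEVICE_EXTENSION pdx, PIRP Irp)
{
    if (pdx->DeviceBusy)
        InsertTailList(&pdx->IrpQueue, &Irp->Tail.Overlay.ListEntry);
    else
    {
        pdx->DeviceBusy = TRUE;
        StartIo(pdx->DeviceObject, Irp);
    }
}

VOID NaiveStartNextPacket(PDEVICE_EXTENSION pdx)
{
    if (IsListEmpty(&pdx->IrpQueue))
        pdx->DeviceBusy = FALSE;
    else
    {
        PLIST_ENTRY next = RemoveHeadList(&pdx->IrpQueue);
        PIRP Irp = CONTAINING_RECORD(next, IRP, Tail.Overlay.ListEntry);
        StartIo(pdx->DeviceObject, Irp);
    }
}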
Just in case you happen to be working on an old driver that uses these obsolete routines, however, here’s how
they work. A dispatch routine would queue an IRP like this:
Your driver would have a single StartIo routine. Your DriverEntry routine would set the DriverStartIo field of the
driver object to point to this routine. If your StartIo routine completes IRPs, you would also call IoSetStart-
IoAttributes (in Windows XP or later) to help prevent excessive recursion into StartIo. IoStartPacket and
IoStartNextPacket call StartIo to process one IRP at a time. In other words, StartIo is the place where the I/O
manager serializes access to your hardware.
A DPC routine (see the later discussion of how DPC routines work) would complete the previous IRP and start
the next one using this code:
To provide for canceling a queued IRP, you would need to write a cancel routine. Illustrating that and the cancel
logic in StartIo is beyond the scope of this book.
In addition, you can rely on the CurrentIrp field of a DEVICE_OBJECT to always contain NULL or the address of
the IRP most recently sent (by IoStartPacket or IoStartNextPacket) to your StartIo routine.
Then your dispatch routine calls NaiveStartPacket, and your DPC routine calls NaiveStartNextPacket in the manner discussed
earlier in connection with the standard model.
There are many problems with this scheme, which is why I called it naive. The most basic problem is that your DPC routine
and multiple instances of your dispatch routine could all be simultaneously active on different CPUs. They would likely
conflict in trying to access the queue and the busy flag. You could address that problem by creating a spin lock and using it to
guard against the obvious races, as follows:
LIST_ENTRY IrpQueue;
KSPIN_LOCK IrpQueueLock;
BOOLEAN DeviceBusy;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;
NTSTATUS AddDevice(...)
{
InitializeListHead(&pdx->IrpQueue);
KeInitializeSpinLock(&pdx->IrpQueueLock);
VOID LessNaiveStartPacket(PDEVICE_EXTENSION pdx, PIRP Irp)
{
    KIRQL oldirql;
    KeAcquireSpinLock(&pdx->IrpQueueLock, &oldirql);
    if (pdx->DeviceBusy)
    {
        InsertTailList(&pdx->IrpQueue, &Irp->Tail.Overlay.ListEntry);
        KeReleaseSpinLock(&pdx->IrpQueueLock, oldirql);
    }
    else
    {
        pdx->DeviceBusy = TRUE;
        KeReleaseSpinLock(&pdx->IrpQueueLock, DISPATCH_LEVEL);
        StartIo(pdx->DeviceObject, Irp);
        KeLowerIrql(oldirql);
    }
}
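The dequeuing companion, which I’ll call LessNaiveStartNextPacket to match the discussion below, might look like this sketch:

VOID LessNaiveStartNextPacket(PDEVICE_EXTENSION pdx)
{
    KIRQL oldirql;
    KeAcquireSpinLock(&pdx->IrpQueueLock, &oldirql);
    if (IsListEmpty(&pdx->IrpQueue))
    {
        pdx->DeviceBusy = FALSE;
        KeReleaseSpinLock(&pdx->IrpQueueLock, oldirql);
    }
    else
    {
        PLIST_ENTRY next = RemoveHeadList(&pdx->IrpQueue);
        PIRP Irp = CONTAINING_RECORD(next, IRP, Tail.Overlay.ListEntry);
        KeReleaseSpinLock(&pdx->IrpQueueLock, DISPATCH_LEVEL);
        StartIo(pdx->DeviceObject, Irp);
        KeLowerIrql(oldirql);
    }
}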
Incidentally, we always want to call StartIo at a single IRQL. Because DPC routines are among the callers of
LessNaiveStartNextPacket, and they run at DISPATCH_LEVEL, we pick DISPATCH_LEVEL. That means we want to stay at
DISPATCH_LEVEL when we release the spin lock.
(You did remember that these two queue management routines need to be in nonpaged memory because they run at
DISPATCH_LEVEL, right?)
These queueing routines are actually almost OK, but they have one more defect and a shortcoming. The shortcoming is that we
need a way to stall a queue for the duration of certain PnP and Power states. IRPs accumulate in a stalled queue until someone
unstalls the queue, whereupon the queue manager can resume sending IRPs to a StartIo routine. The defect in the “less naive”
set of routines is that someone could decide to cancel an IRP at essentially any time. IRP cancellation complicates IRP queuing
logic so much that I’ve devoted the next major section to discussing it. Before we get to that, though, let me explain how to use
the queuing routines that I crafted to deal with all the problems.
DEVQUEUE dqReadWrite;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;
On the CD Code for the DEVQUEUE is part of GENERIC.SYS. In addition, if you use my WDMWIZ to create a
skeleton driver and don’t ask for GENERIC.SYS support, your skeleton project will include the files
DEVQUEUE.CPP and DEVQUEUE.H, which fully implement exactly the same object. I don’t recommend trying to
type this code from the book because the code from the companion content will contain even more features
than I can describe in the book. I also recommend checking my Web site (www.oneysoft.com) for updates and
corrections.
Figure 5-8 illustrates the IRP processing logic for a typical driver using DEVQUEUE objects. Each DEVQUEUE has its own
StartIo routine, which you specify when you initialize the object in AddDevice:
NTSTATUS AddDevice(...)
{
DriverObject->MajorFunction[IRP_MJ_READ] = DispatchReadWrite;
DriverObject->MajorFunction[IRP_MJ_WRITE] = DispatchReadWrite;
#pragma PAGEDCODE
#pragma LOCKEDCODE
Note that the cancel argument to StartPacket is not optional: you must supply a cancel routine, but you can see how simple that
routine will be.
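For example, a read/write dispatch routine and its cancel routine might be as simple as the following sketch (the routine names are illustrative, and remove-lock handling is omitted):

NTSTATUS DispatchReadWrite(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    IoMarkIrpPending(Irp);
    StartPacket(&pdx->dqReadWrite, fdo, Irp, OnCancelReadWrite);
    return STATUS_PENDING;
}

VOID OnCancelReadWrite(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    CancelRequest(&pdx->dqReadWrite, Irp);
}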
If you complete IRPs in a DPC routine, you’ll also call StartNextPacket:
StartNextPacket(&pdx->dqReadWrite, fdo);
}
If you complete IRPs in your StartIo routine, schedule a DPC to make the call to StartNextPacket in order to avoid excessive recursion into StartIo:
KDPC StartNextDpc;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;
NTSTATUS AddDevice(...)
{
KeInitializeDpc(&pdx->StartNextDpc,
(PKDEFERRED_ROUTINE) StartNextDpcRoutine, pdx);
VOID StartIo(...)
{
IoCompleteRequest(...);
KeInsertQueueDpc(&pdx->StartNextDpc, NULL, NULL);
}
In this example, StartIo calls IoCompleteRequest to complete the IRP it has just handled. Calling StartNextPacket directly
might lead to a recursive call to StartIo. After enough recursive calls, we’ll run out of stack. To avoid the potential stack
overflow, we queue the StartNextDpc DPC object and return. Because StartIo runs at DISPATCH_LEVEL, it won’t be possible
for the DPC routine to be called before StartIo returns. Therefore, StartNextDpcRoutine can call StartNextPacket without
worrying about recursion.
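The DPC routine itself can then be a one-liner; a sketch (it assumes the device extension records its own device object pointer, as the earlier fragments do):

VOID StartNextDpcRoutine(PKDPC dpc, PDEVICE_EXTENSION pdx, PVOID junk1, PVOID junk2)
{
    StartNextPacket(&pdx->dqReadWrite, pdx->DeviceObject);
}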
NOTE
If you were using the Microsoft queue routines IoStartPacket and IoStartNextPacket, you’d have a single
StartIo routine. Your DriverEntry routine would set the DriverStartIo pointer in the driver object to the address
of this routine. To avoid the recursion problem discussed in the text in Windows XP or later, you could call
IoSetStartIoAttributes.
NOTE
In their original incarnation, the cancel-safe queue functions weren’t appropriate when you wanted to use a
StartIo routine for actual I/O because they didn’t provide a way to set a CurrentIrp pointer and do a queue
operation inside one invocation of the queue lock. They were modified while I was writing this book to support
StartIo usage, but we didn’t have time to include an explanation of how to use the new features. I commend
you, therefore, to the DDK documentation.
Note also that the cancel-safe queue functions were first described in an XP release of the DDK. They are
implemented in a static library, however, and are therefore available for use on all prior platforms.
IO_CSQ IrpQueue;
LIST_ENTRY IrpQueueAnchor;
KSPIN_LOCK IrpQueueLock;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;
KeInitializeSpinLock(&pdx->IrpQueueLock);
InitializeListHead(&pdx->IrpQueueAnchor);
IoCsqInitialize(&pdx->IrpQueue, InsertIrp, RemoveIrp,
PeekNextIrp, AcquireLock, ReleaseLock, CompleteCanceledIrp);
It’s unnecessary and incorrect to call IoMarkIrpPending yourself because IoCsqInsertIrp does so automatically. As is true with
other queuing schemes, the IRP might be complete by the time IoCsqInsertIrp returns, so don’t touch the pointer afterwards.
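A dispatch routine that parks IRPs on the cancel-safe queue can therefore be very short; a sketch (names are illustrative):

NTSTATUS DispatchReadWrite(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    IoCsqInsertIrp(&pdx->IrpQueue, Irp, NULL);
    return STATUS_PENDING;
}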
To remove an IRP from the queue (say, in your I/O thread), use this code:
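Something along these lines, for instance:

PIRP Irp = IoCsqRemoveNextIrp(&pdx->IrpQueue, NULL);
if (Irp)
{
    <process and eventually complete the IRP>
}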
I’ll describe the PeekContext argument a bit further on. Note that the return value is NULL if no IRPs are on the queue. The
IRP you get back hasn’t been cancelled, and any future call to IoCancelIrp is guaranteed to do nothing more than set the
Cancel flag in the IRP.
You’ll also want to provide a dispatch routine for IRP_MJ_CLEANUP that will interact with the queue. I’ll show you code for
that purpose a bit later in this chapter.
#define GET_DEVICE_EXTENSION(csq) \
CONTAINING_RECORD(csq, DEVICE_EXTENSION, IrpQueue)
You supply callback routines for acquiring and releasing the lock you’ve decided to use for your queue. For example, if you
had settled on using a spin lock, you’d write these two routines:
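A sketch of those two callbacks, using the spin lock declared in the device extension above and the GET_DEVICE_EXTENSION macro shown later in this section:

VOID AcquireLock(PIO_CSQ csq, PKIRQL Irql)
{
    PDEVICE_EXTENSION pdx = GET_DEVICE_EXTENSION(csq);
    KeAcquireSpinLock(&pdx->IrpQueueLock, Irql);
}

VOID ReleaseLock(PIO_CSQ csq, KIRQL Irql)
{
    PDEVICE_EXTENSION pdx = GET_DEVICE_EXTENSION(csq);
    KeReleaseSpinLock(&pdx->IrpQueueLock, Irql);
}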
You don’t have to use a spin lock for synchronization, though. You can use a mutex, a fast mutex, or any other object that suits
your fancy.
When you call IoCsqInsertIrp, the I/O Manager locks your queue by calling your AcquireLock routine and then calls your
InsertIrp routine:
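A minimal InsertIrp simply appends the IRP to the list anchor in the device extension; a sketch:

VOID InsertIrp(PIO_CSQ csq, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = GET_DEVICE_EXTENSION(csq);
    InsertTailList(&pdx->IrpQueueAnchor, &Irp->Tail.Overlay.ListEntry);
}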
When you call IoCsqRemoveNextIrp, the I/O Manager locks your queue and calls your PeekNextIrp and RemoveIrp functions:
Tail.Overlay.ListEntry);
if (PeekContext && <NextIrp matches PeekContext>)
return NextIrp;
if (!PeekContext)
return NextIrp;
next = next->Flink;
}
return NULL;
}
The parameters to PeekNextIrp require a bit of explanation. Irp, if not NULL, is the predecessor of the first IRP you should
look at. If Irp is NULL, you should look at the IRP at the front of the list. PeekContext is an arbitrary parameter that you can
use for any purpose you want as a way for the caller of IoCsqRemoveNextIrp to communicate with PeekNextIrp. A common
convention is to use this argument to point to a FILE_OBJECT that’s the current subject of an IRP_MJ_CLEANUP. I wrote
this function so that a NULL value for PeekContext means, “Return the next IRP, period.” A non-NULL value means, “Return
the next value that matches PeekContext.” You define what it means to “match” the peek context.
The sixth and last callback function is this one, which the I/O Manager calls when an IRP needs to be cancelled:
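In its simplest form it just completes the IRP with STATUS_CANCELLED; a sketch:

VOID CompleteCanceledIrp(PIO_CSQ csq, PIRP Irp)
{
    Irp->IoStatus.Status = STATUS_CANCELLED;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
}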
IO_CSQ_IRP_CONTEXT RedContext;
IO_CSQ_IRP_CONTEXT BlueContext;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;
When you receive a “red” IRP, you specify the context structure in your call to IoCsqInsertIrp:
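For example (a sketch, using the context structures declared above):

IoCsqInsertIrp(&pdx->IrpQueue, RedIrp, &pdx->RedContext);

Later, when you want that particular IRP back, you call IoCsqRemoveIrp with the same context structure and complete the IRP yourself:

PIRP RedIrp = IoCsqRemoveIrp(&pdx->IrpQueue, &pdx->RedContext);
if (RedIrp)
{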
RedIrp->IoStatus.Status = STATUS_XXX;
RedIrp->IoStatus.Information = YYY;
IoCompleteRequest(RedIrp, IO_NO_INCREMENT);
}
IoCsqRemoveIrp will return NULL if the IRP associated with the context structure has already been cancelled.
Bear in mind the following caveats when using this mechanism:
It’s up to you to make sure that you haven’t previously parked an IRP using a particular context structure. IoCsqInsertIrp
is a VOID function and therefore has no way to tell you when you violate this rule.
You mustn’t touch an I/O buffer associated with a parked IRP because the IRP can be cancelled (and the I/O buffer
released!) at any time while it’s parked. You should remove the IRP from the queue before trying to use a buffer.
In this scheme, your own StartIo routine must also acquire and release the cancel spin lock to safely test the
Cancel flag in the IRP and to reset the CancelRoutine pointer to NULL.
Hardly anyone was able to craft queuing and cancel logic that approached being bulletproof using this original
scheme. Even the best algorithms actually have a residual flaw arising from a coincidence in IRP pointer values.
In addition, the fact that every driver in the system needed to use a single spin lock two or three times in the
normal execution path created a measurable performance problem. Consequently, Microsoft now recommends
that drivers either use the cancel-safe queue routines or else copy someone else’s proven queue logic. Neither
Microsoft nor I would recommend that you try to design your own queue logic with cancellation because getting
it right is very hard.
Nowadays, we handle the cancel races in one of two ways. We can implement our own IRP queue (or, more probably, cut and
paste someone else’s). Or, in certain kinds of drivers, we can use the IoCsqXxx family of functions. You don’t need to
understand how the IoCsqXxx functions handle IRP cancellation because Microsoft intends these functions to be a black box.
I’ll discuss in detail how my own DEVQUEUE handles cancellation, but I first need to tell you a bit more about the internal
workings of IoCancelIrp.
BOOLEAN IoCancelIrp(PIRP Irp)
{
    IoAcquireCancelSpinLock(&Irp->CancelIrql);                      // 1
    Irp->Cancel = TRUE;                                             // 2
    PDRIVER_CANCEL CancelRoutine = IoSetCancelRoutine(Irp, NULL);   // 3
    if (CancelRoutine)
    {
        PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
        (*CancelRoutine)(stack->DeviceObject, Irp);                 // 4
        return TRUE;
    }
    else
    {
        IoReleaseCancelSpinLock(Irp->CancelIrql);                   // 5
        return FALSE;
    }
}
1. IoCancelIrp first acquires the global cancel spin lock. As you know if you read the sidebar earlier, lots of old drivers
contend for the use of this lock in their normal IRP-handling path. New drivers hold this lock only briefly while handling
the cancellation of an IRP.
2. Setting the Cancel flag to TRUE alerts any interested party that IoCancelIrp has been called for this IRP.
3. IoSetCancelRoutine performs an interlocked exchange to simultaneously retrieve the existing CancelRoutine pointer and
set the field to NULL in one atomic operation.
4. IoCancelIrp calls the cancel routine, if there is one, without first releasing the global cancel spin lock. The cancel routine
must release the lock! Note also that the device object argument to the cancel routine comes from the current stack
location, where IoCallDriver is supposed to have left it.
5. If there is no cancel routine, IoCancelIrp itself releases the global cancel spin lock. Good idea, huh?
DEVQUEUE Internals—Initialization
The DEVQUEUE object has this declaration in my DEVQUEUE.H and GENERIC.H header files:
typedef struct _DEVQUEUE {
LIST_ENTRY head;
KSPIN_LOCK lock;
PDRIVER_START StartIo;
LONG stallcount;
PIRP CurrentIrp;
KEVENT evStop;
NTSTATUS abortstatus;
} DEVQUEUE, *PDEVQUEUE;
1. We use an ordinary (noninterlocked) doubly-linked list to queue IRPs. We don’t need to use an interlocked list because
we’ll always access it within the protection of our own spin lock.
2. This spin lock guards access to the queue and other fields in the DEVQUEUE structure. It also takes the place of the
global cancel spin lock for guarding nearly all of the cancellation process, thereby improving system performance.
3. Each queue has its own associated StartIo function that we call automatically in the appropriate places.
4. The stall counter indicates how many times somebody has requested that IRP delivery to StartIo be stalled. Initializing
the counter to 1 means that the IRP_MN_START_DEVICE handler must call RestartRequests to release an IRP. I’ll
discuss this issue more fully in Chapter 6.
5. The CurrentIrp field records the IRP most recently sent to the StartIo routine. Initializing this field to NULL indicates that
the device is initially idle.
6. We use this event when necessary to block WaitForCurrentIrp, one of the DEVQUEUE routines involved in handling
PnP requests. We’ll set the event inside StartNextPacket, which should always be called when the current IRP completes.
7. We reject incoming IRPs in two situations. The first situation occurs after we irrevocably commit to removing the device,
when we must start causing new IRPs to fail with STATUS_DELETE_PENDING. The second situation occurs during a
period of low power, when, depending on the type of device we’re managing, we might choose to cause new IRPs to fail
with the STATUS_DEVICE_POWERED_OFF code. The abortstatus field records the status code we should use in
rejecting IRPs in these situations.
In the steady state after all PnP initialization finishes, each DEVQUEUE will have a zero stallcount and abortstatus.
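InitializeQueue, which AddDevice calls, establishes the initial state of those fields; a sketch consistent with the declaration and the numbered notes above:

VOID InitializeQueue(PDEVQUEUE pdq, PDRIVER_START StartIo)
{
    InitializeListHead(&pdq->head);
    KeInitializeSpinLock(&pdq->lock);
    pdq->StartIo = StartIo;
    pdq->stallcount = 1;               // stalled until IRP_MN_START_DEVICE succeeds
    pdq->CurrentIrp = NULL;            // device starts out idle
    KeInitializeEvent(&pdq->evStop, NotificationEvent, FALSE);
    pdq->abortstatus = (NTSTATUS) 0;   // not rejecting IRPs yet
}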
I copied this source code directly from GENERIC.SYS and did some minor formatting for the sake of readability on the printed page. I also
removed some power management code from StartNextPacket because it would just confuse this presentation.
VOID StartPacket(PDEVQUEUE pdq, PDEVICE_OBJECT fdo, PIRP Irp, PDRIVER_CANCEL cancel)
{
    KIRQL oldirql;
    KeAcquireSpinLock(&pdq->lock, &oldirql);
    NTSTATUS abortstatus = pdq->abortstatus;
    if (abortstatus)
    {                                // refuse new IRPs while removing or powered down
        KeReleaseSpinLock(&pdq->lock, oldirql);
        Irp->IoStatus.Status = abortstatus;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
    }
    else if (pdq->CurrentIrp || pdq->stallcount)
    {                                // device busy or queue stalled (point 3)
        IoSetCancelRoutine(Irp, cancel);
        if (Irp->Cancel && IoSetCancelRoutine(Irp, NULL))
        {                            // cancelled before we installed the cancel routine
            KeReleaseSpinLock(&pdq->lock, oldirql);
            Irp->IoStatus.Status = STATUS_CANCELLED;
            IoCompleteRequest(Irp, IO_NO_INCREMENT);
        }
        else
        {
            InsertTailList(&pdq->head, &Irp->Tail.Overlay.ListEntry);
            KeReleaseSpinLock(&pdq->lock, oldirql);
        }
    }
    else
    {                                // device idle: send the IRP straight to StartIo (point 7)
        pdq->CurrentIrp = Irp;
        KeReleaseSpinLockFromDpcLevel(&pdq->lock);
        (*pdq->StartIo)(fdo, Irp);
        KeLowerIrql(oldirql);
    }
}
VOID StartNextPacket(PDEVQUEUE pdq, PDEVICE_OBJECT fdo)
{
    KIRQL oldirql;
    KeAcquireSpinLock(&pdq->lock, &oldirql);
    KeSetEvent(&pdq->evStop, 0, FALSE);      // wake anyone waiting for the current IRP
    pdq->CurrentIrp = NULL;
    while (!IsListEmpty(&pdq->head) && !pdq->stallcount && !pdq->abortstatus)
    {
        PLIST_ENTRY next = RemoveHeadList(&pdq->head);
        PIRP Irp = CONTAINING_RECORD(next, IRP, Tail.Overlay.ListEntry);
        if (!IoSetCancelRoutine(Irp, NULL))
        {                                    // a cancel routine is already running for this IRP;
            InitializeListHead(&Irp->Tail.Overlay.ListEntry);
            continue;                        // let it complete the IRP (see text)
        }
        pdq->CurrentIrp = Irp;
        KeReleaseSpinLockFromDpcLevel(&pdq->lock);
        (*pdq->StartIo)(fdo, Irp);
        KeLowerIrql(oldirql);
        return;
    }
    KeReleaseSpinLock(&pdq->lock, oldirql);
}
VOID CancelRequest(PDEVQUEUE pdq, PIRP Irp)
{
    KIRQL oldirql = Irp->CancelIrql;
    IoReleaseCancelSpinLock(DISPATCH_LEVEL);      // stay at DISPATCH_LEVEL for our own lock
    KeAcquireSpinLockAtDpcLevel(&pdq->lock);
    RemoveEntryList(&Irp->Tail.Overlay.ListEntry);   // point 16
    KeReleaseSpinLock(&pdq->lock, oldirql);
    Irp->IoStatus.Status = STATUS_CANCELLED;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
}
Now I’ll explain in detail how these functions work together to provide cancel-safe queuing. I’ll do this by describing a series
of scenarios that involve all of the code paths.
If the device is idle, the if test at point 3 fails and we once again go directly to point 7, where we send the IRP to StartIo. In
effect, we’re going to ignore the Cancel flag. This is fine so long as we process the IRP “relatively quickly,” which is an
engineering judgment. If we won’t process the IRP with reasonable dispatch, StartIo and the downstream logic for handling the
IRP should have code to detect the Cancel flag and to complete the IRP early.
with the situation: we simply initialize the linking field of the IRP as if it were the anchor of a list! The call to RemoveEntryList
at point 16 in CancelRequest will perform several motions with no net result to “remove” the IRP from the degenerate list it
now inhabits.
SomeFunction()
{
KEVENT event;
IO_STATUS_BLOCK iosb;
KeInitializeEvent(&event, ...);
PIRP Irp = IoBuildSynchronousFsdRequest(..., &event, &iosb);
NTSTATUS status = IoCallDriver(DeviceObject, Irp);
if (status == STATUS_PENDING)
{
LARGE_INTEGER timeout;
timeout.QuadPart = -5 * 10000000;
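The two waits that the next paragraphs call A and B look roughly like this (a sketch using the variables declared above; the five-second relative timeout comes from the preceding line):

    NTSTATUS waitstatus = KeWaitForSingleObject(&event,
        Executive, KernelMode, FALSE, &timeout);                 // wait A
    if (waitstatus == STATUS_TIMEOUT)
    {
        IoCancelIrp(Irp);
        KeWaitForSingleObject(&event,
            Executive, KernelMode, FALSE, NULL);                 // wait B
    }
    status = iosb.Status;
    }
}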
The first call (A) to KeWaitForSingleObject waits until one of two things happens. First, someone might complete the IRP, and
the I/O Manager’s cleanup code will then run and signal event.
Alternatively, the timeout might expire before anyone completes the IRP. In this case, KeWaitForSingleObject will return
STATUS_TIMEOUT. The IRP should now be completed quite soon in one of two paths. The first completion path is taken
when whoever was processing the IRP was really just about done when the timeout happened and has, therefore, already called
(or will shortly call) IoCompleteRequest. The other completion path is through the cancel routine that, we must assume, the
lower driver has installed. That cancel routine should complete the IRP. Recall that we have to trust other kernel-mode
components to do their jobs, so we have to rely on whomever we sent the IRP to complete it soon. Whichever path is taken, the
I/O Manager’s completion logic will set event and store the IRP’s ending status in iosb. The second call (B) to
KeWaitForSingleObject makes sure that the event and iosb objects don’t pass out of scope too soon. Without that second call,
we might return from this function, thereby effectively deleting event and iosb. The I/O Manager might then end up walking on
memory that belongs to some other subroutine.
The problem with the preceding code is truly minuscule. Imagine that someone manages to call IoCompleteRequest for this
IRP right around the same time we decide to cancel it by calling IoCancelIrp. Maybe the operation finishes shortly after the 5-second timeout terminates the first KeWaitForSingleObject, for example. IoCompleteRequest initiates a process that finishes
with a call to IoFreeIrp. If the call to IoFreeIrp were to happen before IoCancelIrp was done mucking about with the IRP, you
can see that IoCancelIrp could inadvertently corrupt memory when it touched the CancelIrql, Cancel, and CancelRoutine fields
of the IRP. It’s also possible, depending on the exact sequence of events, for IoCancelIrp to call a cancel routine, just before
someone clears the CancelRoutine pointer in preparation for completing the IRP, and for the cancel routine to be in a race with
the completion process.
It’s very unlikely that the scenario I just described will happen. But, as someone (James Thurber?) once said in connection with
the chances of being eaten by a tiger on Main Street (one in a million, as I recall), “Once is enough.” This kind of bug is almost
impossible to find, so you want to prevent it if you can. I’ll show you two ways of cancelling your own IRPs. One way is
appropriate for synchronous IRPs, the other for asynchronous IRPs.
Don’t Do This…
A once common but now deprecated technique for avoiding the tiger-on-main-street bug described in the text
relies on the fact that, in earlier versions of Windows, the call to IoFreeIrp happened in the context of an APC
in the thread that originates the IRP. You could make sure you were in that same thread, raise IRQL to
APC_LEVEL, check whether the IRP had been completed yet, and (if not) call IoCancelIrp. You could be sure of
blocking the APC and the problematic call to IoFreeIrp.
You shouldn’t rely on future releases of Windows always using an APC to perform the cleanup for a synchronous
IRP. Consequently, you shouldn’t rely on boosting IRQL to APC_LEVEL as a way to avoid a race between
IoCancelIrp and IoFreeIrp.
SomeFunction()
{
KEVENT event;
IO_STATUS_BLOCK iosb;
KeInitializeEvent(&event, ...);
PIRP Irp = IoBuildSynchronousFsdRequest(..., &event, &iosb);
IoSetCompletionRoutine(Irp, OnComplete, (PVOID) &event, TRUE, TRUE, TRUE);
NTSTATUS status = IoCallDriver(...);
if (status == STATUS_PENDING)
{
LARGE_INTEGER timeout;
timeout.QuadPart = -5 * 10000000;
The new code (chiefly the call to IoSetCompletionRoutine that installs our own completion routine) prevents the race. Suppose IoCallDriver returns STATUS_PENDING. In a normal case, the operation
will complete normally, and a lower-level driver will call IoCompleteRequest. Our completion routine gains control and signals
the event on which our mainline is waiting. Because the completion routine returns
STATUS_MORE_PROCESSING_REQUIRED, IoCompleteRequest will then stop working on this IRP. We eventually regain
control in our SomeFunction and notice that our wait (the one labeled A) terminated normally. The IRP hasn’t yet been cleaned
up, though, so we need to call IoCompleteRequest a second time to trigger the normal cleanup mechanism.
Now suppose we decide we want to cancel the IRP and that Thurber’s tiger is loose so we have to worry about a call to
IoFreeIrp releasing the IRP out from under us. Our first wait (labeled A) finishes with STATUS_TIMEOUT, so we perform a
second wait (labeled B). Our completion routine sets the event on which we’re waiting. It will also prevent the cleanup
mechanism from running by returning STATUS_MORE_PROCESSING_REQUIRED. IoCancelIrp can stomp away to its
heart’s content on our hapless IRP without causing any harm. The IRP can’t be released until the second call to
IoCompleteRequest from our mainline, and that can’t happen until IoCancelIrp has safely returned.
Notice that the completion routine in this example calls KeSetEvent only when the IRP’s PendingReturned flag is set to
indicate that the lower driver’s dispatch routine returned STATUS_PENDING. Making this step conditional is an optimization
that avoids the potentially expensive step of setting the event when SomeFunction won’t be waiting on the event in the first
place.
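The OnComplete routine installed above is therefore just this (a sketch; the context argument is the event whose address we passed to IoSetCompletionRoutine):

NTSTATUS OnComplete(PDEVICE_OBJECT fdo, PIRP Irp, PKEVENT pev)
{
    if (Irp->PendingReturned)
        KeSetEvent(pev, 0, FALSE);
    return STATUS_MORE_PROCESSING_REQUIRED;
}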
I want to mention one last fine point in connection with the preceding code. The call to IoCompleteRequest at the very end of
the subroutine will trigger a process that includes setting event and iosb so long as the IRP originally completed with a success
status. In the first edition, I had an additional call to KeWaitForSingleObject at this point to make sure that event and iosb
could not pass out of scope before the I/O Manager was done touching them. A reviewer pointed out that the routine that
references event and iosb will already have run by the time IoCompleteRequest returns; consequently, the additional wait is not
needed.
Initialize these fields just before you call IoCallDriver to launch the IRP:
pdx->TheIrp = Irp;
pdx->CancelFlag = 0;
IoSetCompletionRoutine(Irp,
(PIO_COMPLETION_ROUTINE) CompletionRoutine,
(PVOID) pdx, TRUE, TRUE, TRUE);
IoCallDriver(..., Irp);
If you decide later on that you want to cancel this IRP, do something like the following:
VOID CancelTheIrp(PDEVICE_EXTENSION pdx)
{
    PIRP Irp = (PIRP) InterlockedExchangePointer(&pdx->TheIrp, NULL);   // 1
    if (Irp)
    {
        IoCancelIrp(Irp);                                               // 2
        if (InterlockedExchange(&pdx->CancelFlag, 1))
            IoFreeIrp(Irp);                                             // 3
    }
}
This function dovetails with the completion routine you install for the IRP:
NTSTATUS CompletionRoutine(PDEVICE_OBJECT DeviceObject, PIRP Irp, PDEVICE_EXTENSION pdx)
{
    if (InterlockedExchangePointer(&pdx->TheIrp, NULL)                  // 4
        || InterlockedExchange(&pdx->CancelFlag, 1))                    // 5
        IoFreeIrp(Irp);                                                 // 6
return STATUS_MORE_PROCESSING_REQUIRED;
}
The basic idea underlying this deceptively simple code is that whichever routine sees the IRP last (either CompletionRoutine or
CancelTheIrp) will make the requisite call to IoFreeIrp, at point 3 or 6. Here’s how it works:
The normal case occurs when you don’t ever try to cancel the IRP. Whoever you sent the IRP to eventually completes it,
and your completion routine gets control. The first InterlockedExchangePointer (point 4) returns the non-NULL address
of the IRP. Since this is not 0, the compiler short-circuits the evaluation of the Boolean expression and executes the call to
IoFreeIrp. Any subsequent call to CancelTheIrp will find the IRP pointer set to NULL at point 1 and won’t do anything
else.
Another easy case to analyze occurs when CancelTheIrp is called long before anyone gets around to completing this IRP,
which means that we don’t have any actual race. At point 1, we nullify the TheIrp pointer. Because the IRP pointer was
previously not NULL, we go ahead and call IoCancelIrp. In this situation, our call to IoCancelIrp will cause somebody to
complete the IRP reasonably soon, and our completion routine runs. It sees TheIrp as NULL and goes on to evaluate the
second half of the Boolean expression. Whoever executes the InterlockedExchange on CancelFlag first will get back 0
and skip calling IoFreeIrp. Whoever executes it second will get back 1 and will call IoFreeIrp.
Now for the case we were worried about: suppose someone is completing the IRP right about the time CancelTheIrp
wants to cancel it. The worst that can happen is that our completion routine runs before we manage to call IoCancelIrp.
The completion routine sees TheIrp as NULL and therefore exchanges CancelFlag with 1. Just as in the previous case,
the routine will get 0 as the return value and skip the IoFreeIrp call. IoCancelIrp can safely operate on the IRP. (It will
presumably just return without calling a cancel routine because whoever completed this IRP will undoubtedly have set
the CancelRoutine pointer to NULL first.)
The appealing thing about the technique I just showed you is its elegance: we rely solely on interlocked operations and
therefore don’t need any potentially expensive synchronization primitives.
This code is almost the same as what I showed earlier for canceling your own synchronous IRP. The only difference is that this
example involves a dispatch routine, which must return a status code. As in the earlier example, we install our own completion
routine to prevent the completion process from running to its ultimate conclusion before we get past the point where we might
call IoCancelIrp.
You might notice that I didn’t say anything about whether the IRP itself was synchronous or asynchronous. This is because the
difference between the two types of IRP only matters to the driver that creates them in the first place. File system drivers must
make distinctions between synchronous and asynchronous IRPs with respect to how they call the system cache manager, but
device drivers don’t typically have this complication. What matters to a lower-level driver is whether it’s appropriate to block a
thread in order to handle an IRP synchronously, and that depends on the current IRQL and whether you’re in an arbitrary or a
nonarbitrary thread.
This code is similar to the code I showed earlier for cancelling your own asynchronous IRP. Here, however, allowing
IoCompleteRequest to finish completing the IRP takes the place of the call to IoFreeIrp we made when we were dealing with
our own IRP. If the completion routine is last on the scene, it returns STATUS_SUCCESS to allow IoCompleteRequest to finish
completing the IRP. If CancelTheIrp is last on the scene, it calls IoCompleteRequest to resume the completion processing that the completion routine cut short by returning STATUS_MORE_PROCESSING_REQUIRED.
File Objects
Ordinarily, just one driver (the function driver, in fact) in a device stack implements all three of the following
requests: IRP_MJ_CREATE, IRP_MJ_CLOSE, and IRP_MJ_CLEANUP. The I/O Manager creates a file object (a
regular kernel object) and passes it in the I/O stack to the dispatch routines for all three of these IRPs. Anybody
who sends an IRP to a device should have a pointer to the same file object and should insert that pointer into
the I/O stack as well. The driver that handles these three IRPs acts as the owner of the file object in some sense,
in that it’s the driver that’s entitled to use the FsContext and FsContext2 fields of the object. So your
DispatchCreate routine can put something into one of these context fields for use by other dispatch routines and
for eventual cleanup by your DispatchClose routine.
It’s easy to get confused about IRP_MJ_CLEANUP. In fact, programmers who have a hard time understanding IRP
cancellation sometimes decide (incorrectly) to just ignore this IRP. You need both cancel and cleanup logic in your driver,
though:
IRP_MJ_CLEANUP means a handle is being closed. You should purge all the IRPs that pertain to that handle.
The I/O Manager and other drivers cancel individual IRPs for a variety of reasons that have nothing to do with closing
handles.
One of the times the I/O Manager cancels IRPs is when a thread terminates. Threads often terminate because their parent
process is terminating, and the I/O Manager will also automatically close all handles that are still open when a process
terminates. The coincidence between this kind of cancellation and the automatic handle closing contributes to the
incorrect idea that a driver can get by with support for just one concept.
In this book, I’ll show you two ways of painlessly implementing support for IRP_MJ_CLEANUP, depending on whether
you’re using one of my DEVQUEUE objects or one of Microsoft’s cancel-safe queues.
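With a DEVQUEUE, the IRP_MJ_CLEANUP dispatch routine can be just a few lines; a sketch (the queue name matches the earlier examples):

NTSTATUS DispatchCleanup(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
    CleanupRequests(&pdx->dqReadWrite, stack->FileObject, STATUS_CANCELLED);
    Irp->IoStatus.Status = STATUS_SUCCESS;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return STATUS_SUCCESS;
}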
CleanupRequests will remove all IRPs from the queue that belong to the same file object and will complete those IRPs with
STATUS_CANCELLED. Note that you complete the IRP_MJ_CLEANUP request itself with STATUS_SUCCESS.
CleanupRequests contains a wealth of detail:
VOID CleanupRequests(PDEVQUEUE pdq, PFILE_OBJECT fop, NTSTATUS status)
{
    LIST_ENTRY cancellist;                                     // 1
    InitializeListHead(&cancellist);
    KIRQL oldirql;
    KeAcquireSpinLock(&pdq->lock, &oldirql);
    PLIST_ENTRY first = &pdq->head;
    PLIST_ENTRY next;
    for (next = first->Flink; next != first; )                 // 2
    {
        PIRP Irp = CONTAINING_RECORD(next, IRP, Tail.Overlay.ListEntry);
        PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);   // 3
        PLIST_ENTRY current = next;
        next = next->Flink;                                    // 4
        if (fop && stack->FileObject != fop)
            continue;                                          // not for this file object
        if (!IoSetCancelRoutine(Irp, NULL))                    // 5
            continue;                                          // someone is cancelling this IRP right now
        RemoveEntryList(current);                              // 6
        InsertTailList(&cancellist, current);
    }
    KeReleaseSpinLock(&pdq->lock, oldirql);                    // 7
    while (!IsListEmpty(&cancellist))
    {
        next = RemoveHeadList(&cancellist);
        PIRP Irp = CONTAINING_RECORD(next, IRP, Tail.Overlay.ListEntry);
        Irp->IoStatus.Status = status;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
    }
}
1. Our strategy will be to move the IRPs that need to be cancelled into a private queue under protection of the queue’s spin
lock. Hence, we initialize the private queue and acquire the spin lock before doing anything else.
2. This loop traverses the entire queue until we return to the list head. Notice the absence of a loop increment step—the
third clause in the for statement. I’ll explain in a moment why it’s desirable to have no loop increment.
3. If we’re being called to help out with IRP_MJ_CLEANUP, the fop argument is the address of a file object that’s about to
be closed. We’re supposed to isolate the IRPs that pertain to the same file object, which requires us to first find the stack
location.
4. If we decide to remove this IRP from the queue, we won’t thereafter have an easy way to find the next IRP in the main
queue. We therefore perform the loop increment step here.
5. This especially clever statement comes to us courtesy of Jamie Hanrahan. We need to worry that someone might be trying
to cancel the IRP that we’re currently looking at during this iteration. They could get only as far as the point where
CancelRequest tries to acquire the spin lock. Before getting that far, however, they necessarily had to execute the
statement inside IoCancelIrp that nullifies the cancel routine pointer. If we find that pointer set to NULL when we call
IoSetCancelRoutine, therefore, we can be sure that someone really is trying to cancel this IRP. By simply skipping the
IRP during this iteration, we allow the cancel routine to complete it later on.
6. Here’s where we take the IRP out of the main queue and put it in the private queue instead.
7. Once we finish moving IRPs into the private queue, we can release our spin lock. Then we cancel all the IRPs we moved.
DEVQUEUE dqReadWrite;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;
InitializeQueue(&pdx->dqReadWrite, StartIo);
IoInitializeDpcRequest(fdo, (PIO_DPC_ROUTINE) DpcForIsr);
if (!NT_SUCCESS(status))
return <status>;
PIRP Irp;
Irp = IoBuildAsynchronousFsdRequest(IRP_MJ_XXX, DeviceObject, ...);
-or-
Irp = IoAllocateIrp(DeviceObject->StackSize, FALSE);
PIO_STACK_LOCATION stack = IoGetNextIrpStackLocation(Irp);
stack->MajorFunction = IRP_MJ_XXX;
<additional initialization>
IoSetCompletionRoutine[Ex]([pdx->DeviceObject,] Irp,
(PIO_COMPLETION_ROUTINE) CompletionRoutine, pdx,
TRUE, TRUE, TRUE);
ObReferenceObject(DeviceObject);
IoCallDriver(DeviceObject, Irp);
ObDereferenceObject(DeviceObject);
}
{
<IRP cleanup -- see below>
IoFreeIrp(Irp);
The calls to IoAcquireRemoveLock and IoReleaseRemoveLock (the points labeled A) are necessary only if the device to which
you’re sending this IRP is the LowerDeviceObject in your PnP stack. The 42 is an arbitrary tag—it’s simply too complicated to
try to acquire the remove lock after the IRP exists just so we can use the IRP pointer as a tag in the debug build.
The calls to ObReferenceObject and ObDereferenceObject that precede and follow the call to IoCallDriver (the points labeled
B) are necessary only when you’ve used IoGetDeviceObjectPointer to obtain the DeviceObject pointer and when the
completion routine (or something it calls) will release the resulting reference to a device or file object.
You do not have both the A code and the B code—you have one set or neither.
If you use IoBuildAsynchronousFsdRequest to build an IRP_MJ_READ or IRP_MJ_WRITE, you have some relatively
complex cleanup to perform in the completion routine.
NTSTATUS CompletionRoutine(...)
{
PMDL mdl;
while ((mdl = Irp->MdlAddress))
{
Irp->MdlAddress = mdl->Next;
MmUnlockPages(mdl); // <== only if you earlier
// called MmProbeAndLockPages
IoFreeMdl(mdl);
}
IoFreeIrp(Irp);
<optional release of remove lock>
return STATUS_MORE_PROCESSING_REQUIRED;
}
if (!NT_SUCCESS(status))
return <status>;
PIRP Irp;
KEVENT event;
IO_STATUS_BLOCK iosb;
KeInitializeEvent(&event, NotificationEvent, FALSE);
Irp = IoBuildSynchronousFsdRequest(IRP_MJ_XXX,
DeviceObject, ..., &event, &iosb);
-or-
Irp = IoBuildDeviceIoControlRequest(IOCTL_XXX, DeviceObject,
..., &event, &iosb);
status = IoCallDriver(DeviceObject, Irp);
if (status == STATUS_PENDING)
{
KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
status = iosb.Status;
}
}
As in scenario 5, the calls to IoAcquireRemoveLock and IoReleaseRemoveLock (the points labeled A) are necessary only if the
device to which you’re sending this IRP is the LowerDeviceObject in your PnP stack. The 42 is an arbitrary tag—it’s simply
too complicated to try to acquire the remove lock after the IRP exists just so we can use the IRP pointer as a tag in the debug
build.
We’ll use this scenario frequently in Chapter 12 to send USB Request Blocks (URBs) synchronously down the stack. In the
examples we’ll study there, we’ll usually be doing this in the context of an IRP dispatch routine that independently acquires the
remove lock. Therefore, you won’t see the extra remove lock code in those examples.
You do not clean up after this IRP! The I/O Manager does it automatically.
Someone is sending you an IRP (as opposed to you creating the IRP yourself).
You’re running at PASSIVE_LEVEL in a nonarbitrary thread.
Your postprocessing for the IRP must be done at PASSIVE_LEVEL.
The caller of this routine needs to call IoCompleteRequest for this IRP and to acquire and release the remove lock. It’s
inappropriate for ForwardAndWait to contain the remove lock logic because the caller might not want to release the lock so
soon.
Note that the Windows XP DDK function IoForwardIrpSynchronously encapsulates these same steps.
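A sketch of ForwardAndWait and its helper completion routine, assuming the device extension records the lower device object in a field named LowerDeviceObject:

NTSTATUS ForwardAndWait(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    KEVENT event;
    KeInitializeEvent(&event, NotificationEvent, FALSE);
    IoCopyCurrentIrpStackLocationToNext(Irp);
    IoSetCompletionRoutine(Irp, (PIO_COMPLETION_ROUTINE) OnRequestComplete,
        (PVOID) &event, TRUE, TRUE, TRUE);
    IoCallDriver(pdx->LowerDeviceObject, Irp);
    KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
    return Irp->IoStatus.Status;
}

NTSTATUS OnRequestComplete(PDEVICE_OBJECT junk, PIRP Irp, PKEVENT pev)
{
    KeSetEvent(pev, 0, FALSE);                  // wake ForwardAndWait
    return STATUS_MORE_PROCESSING_REQUIRED;     // keep the IRP alive for the caller
}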
if (!NT_SUCCESS(status))
return <status>;
PIRP Irp;
Irp = IoBuildAsynchronousFsdRequest(IRP_MJ_XXX, DeviceObject, ...);
-or-
Irp = IoAllocateIrp(DeviceObject->StackSize, FALSE);
PIO_STACK_LOCATION stack = IoGetNextIrpStackLocation(Irp);
stack->MajorFunction = IRP_MJ_XXX;
<additional initialization>
KEVENT event;
KeInitializeEvent(&event, NotificationEvent, FALSE);
IoSetCompletionRoutine[Ex]([pdx->DeviceObject], Irp,
(PIO_COMPLETION_ROUTINE) CompletionRoutine,
&event, TRUE, TRUE, TRUE);
status = IoCallDriver(DeviceObject, Irp);
if (status == STATUS_PENDING)
KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
Chapter 6
6 Plug and Play for Function Drivers
The Plug and Play (PnP) Manager communicates information and requests to device drivers via I/O request packets (IRPs)
with the major function code IRP_MJ_PNP. This type of request was new with Microsoft Windows 2000 and the Windows
Driver Model (WDM): previous versions of Microsoft Windows NT required device drivers to do most of the work of
detecting and configuring their devices. Happily, WDM drivers can let the PnP Manager do that work. To work with the PnP
Manager, driver authors will have to understand a few relatively complicated IRPs.
Plug and Play requests play two roles in the WDM. In their first role, these requests instruct the driver when and how to
configure or deconfigure itself and the hardware. Table 6-1 lists the roughly two dozen minor functions that a PnP request can
designate. Only a bus driver handles the nine minor functions shown with an asterisk; a filter driver or function driver would
simply pass these IRPs down the stack. Of the remaining minor functions, three have special importance to a typical filter
driver or function driver. The PnP Manager uses IRP_MN_START_DEVICE to inform the function driver which I/O resources
it has assigned to the hardware and to instruct the function driver to do any necessary hardware and software setup so that the
device can function. IRP_MN_STOP_DEVICE tells the function driver to shut down the device. IRP_MN_REMOVE_DEVICE
tells the function driver to shut down the device and release the associated device object. I’ll discuss these three minor
functions in detail in this chapter and the next; along the way, I’ll also describe the purpose of the other unstarred minor
functions that a filter driver or function driver might need to handle.
Table 6-1. Minor Function Codes for IRP_MJ_PNP (* Indicates Handled Only by Bus Drivers)
A second and more complicated purpose of PnP requests is to guide the driver through a series of state transitions, as illustrated
in Figure 6-1. WORKING and STOPPED are the two fundamental states of the device. The STOPPED state is the initial state
of a device immediately after you create the device object. The WORKING state indicates that the device is fully operational.
Two of the intermediate states—PENDINGSTOP and PENDINGREMOVE—arise because of queries that all drivers for a
device must process before making the transition from WORKING. SURPRISEREMOVED occurs after the sudden and
unexpected removal of the physical hardware.
I introduced my DEVQUEUE queue management routines in the preceding chapter. The main reason for needing a custom
queuing scheme in the first place is to facilitate the PnP state transitions shown in Figure 6-1 and the power state transitions I’ll
discuss in Chapter 8. I’ll describe the DEVQUEUE routines that support these transitions in this chapter.
TIP
You can save yourself a lot of work by copying and using my GENERIC.SYS library. Instead of writing your own
elaborate dispatch function for IRP_MJ_PNP, simply delegate this IRP to GenericDispatchPnp. See the
Introduction for a table that lists the callback functions your driver supplies to perform device-specific
operations. I’ve used the same callback function names in this chapter. In addition, I’m basically using
GENERIC’s PnP handling code for all of the examples.
NTSTATUS DefaultPnpHandler(PDEVICE_OBJECT fdo, PIRP Irp)
{
    IoSkipCurrentIrpStackLocation(Irp);
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    return IoCallDriver(pdx->LowerDeviceObject, Irp);
}
1. All the parameters for the IRP, including the all-important minor function code, are in the stack location. Hence, we
obtain a pointer to the stack location by calling IoGetCurrentIrpStackLocation.
2. We expect the IRP’s minor function code to be one of those listed in Table 6-1.
3. A method of handling the two dozen possible minor function codes is to write a subdispatch function for each one we’re
going to handle and then to define a table of pointers to those subdispatch functions. Many of the entries in the table will
be DefaultPnpHandler. Subdispatch functions such as HandleStartDevice will take pointers to a device object and an IRP
as parameters and will return an NTSTATUS code.
4. If we get a minor function code we don’t recognize, it’s probably because Microsoft defined a new one in a release of the
DDK after the DDK with which we built our driver. The right thing to do is to pass the minor function code down the
stack by calling the default handler. By the way, arraysize is a macro in one of my own header files that returns the
number of elements in an array. It’s defined as #define arraysize(p) (sizeof(p)/sizeof((p)[0])).
5. This is the operative statement in the dispatch routine, in which we index the table of subdispatch functions and call the right one (a sketch of the whole routine follows this list).
6. The DefaultPnpHandler routine is essentially the ForwardAndForget function I showed in connection with IRP-handling
scenario 2 in the preceding chapter. We’re passing the IRP down without a completion routine and therefore use
IoSkipCurrentIrpStackLocation to retard the IRP stack pointer in anticipation that IoCallDriver will immediately advance
it.
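Taken together, those steps produce a dispatch routine along these lines (a sketch; only a few table entries are shown, and the only handler names taken from the text are HandleStartDevice and DefaultPnpHandler):

NTSTATUS DispatchPnp(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
    ULONG fcn = stack->MinorFunction;
    static NTSTATUS (*fcntab[])(PDEVICE_OBJECT, PIRP) = {
        HandleStartDevice,        // IRP_MN_START_DEVICE
        DefaultPnpHandler,        // IRP_MN_QUERY_REMOVE_DEVICE
        DefaultPnpHandler,        // IRP_MN_REMOVE_DEVICE (a real driver supplies a handler here)
        // ...one entry for each minor function listed in Table 6-1
    };
    if (fcn >= arraysize(fcntab))
        return DefaultPnpHandler(fdo, Irp);   // minor function we don't know about
    return (*fcntab[fcn])(fdo, Irp);
}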
Most programmers would probably place a switch statement in their DispatchPnp routine. You can simply
recompile your driver to conform to any reassignment of minor function codes. Recompilation will also
highlight—by producing compilation errors!—name changes that might signal functionality shifts. That
happened a time or two during the Microsoft Windows 98 and Windows 2000 betas, in fact. Furthermore, an
optimizing compiler should be able to use a jump table to produce slightly faster code for a switch statement
than for calls to subdispatch functions.
I think the choice between a switch statement and a table of function pointers is mostly a matter of taste, with
readability and modularity winning over efficiency in my own evaluation. You can avoid uncertainty during a
beta test by placing appropriate assertions in your code. For example, the HandleStartDevice function can
assert that stack->MinorFunction == IRP_MN_START_DEVICE. If you recompile your driver with each new beta
DDK, you’ll catch any number reassignments or name changes.
NOTE
I find it hard to give an abstract definition of the term I/O resource that isn’t circular (for example, a resource
used for I/O), so I’ll give a concrete one instead. The WDM encompasses four standard I/O resource types: I/O
ports, memory registers, direct memory access (DMA) channels, and interrupt requests.
When the PnP Manager detects hardware, it consults the registry to learn which filter drivers and function drivers will manage
the hardware. As I discussed in Chapter 2, the PnP Manager loads these drivers (if necessary—one or more of them might
already be present, having been called into memory on behalf of some other hardware) and calls their AddDevice functions.
The AddDevice functions, in turn, create device objects and link them into a stack. At this point, the stage is set for the PnP
Manager, working with all of the device drivers, to assign I/O resources.
The PnP Manager initially creates a list of resource requirements for each device and allows the drivers to filter that list. I’m
going to ignore the filtering step for now because not every driver will need to participate in this step. Given a list of
requirements, the PnP Manager can then assign resources so as to harmonize the potentially conflicting requirements of all the
hardware present on the system. Figure 6-2 illustrates how the PnP Manager can arbitrate between two different devices that
have overlapping requirements for an interrupt request number, for example.
6.2.1 IRP_MN_START_DEVICE
Once the resource assignments are known, the PnP Manager notifies each device by sending it a PnP request with the minor
function code IRP_MN_START_DEVICE. Filter drivers are typically not interested in this IRP, so they usually pass the request
down the stack by using the DefaultPnpHandler technique I showed you earlier in “IRP_MJ_PNP Dispatch Function.”
Function drivers, on the other hand, need to do a great deal of work on the IRP to allocate and configure additional software
resources and to prepare the device for operation. This work needs to be done, furthermore, at PASSIVE_LEVEL after the lower
layers in the device hierarchy have processed this IRP.
You might implement IRP_MN_START_DEVICE in a subdispatch routine—reached from the DispatchPnp dispatch routine
shown earlier—that has the following skeletal form (a fuller sketch appears after the numbered notes below):
Irp->IoStatus.Status = STATUS_SUCCESS;
1. The bus driver uses the incoming setting of IoStatus.Status to determine whether upper-level drivers have handled this
IRP. The bus driver makes a similar determination for several other minor functions of IRP_MJ_PNP. We therefore need
to initialize the Status field of the IRP to STATUS_SUCCESS before passing it down.
2. ForwardAndWait is the function I showed you in Chapter 5 in connection with IRP-handling scenario 7 (synchronous
pass down). The function returns a status code. If the status code denotes some sort of failure in the lower layers, we
propagate the code back to our own caller. Because our completion routine returned STATUS_MORE_PROCESSING_REQUIRED,
we halted the completion process inside IoCompleteRequest. Therefore, we have to complete the
request all over again, as shown here.
3. Our configuration information is buried inside the stack parameters. I’ll show you where a bit further on.
4. StartDevice is a helper routine you write to handle the details of extracting and dealing with configuration information. In
my sample drivers, I’ve placed it in a separate source module named READWRITE.CPP. I’ll explain shortly what
arguments you would pass to this routine besides the address of the device object.
5. EnableAllInterfaces enables all the device interfaces that you registered in your AddDevice routine. This step allows
applications to find your device when they use SetupDiXxx functions to enumerate instances of your registered interfaces.
6. Since ForwardAndWait short-circuited the completion process for the START_DEVICE request, we need to complete the
IRP a second time. In this example, I’m using an overloaded version of CompleteRequest that doesn’t change
IoStatus.Information, in accordance with the DDK rules for handling PnP requests.
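Assembling the six numbered points, the complete skeleton might look like this sketch. ForwardAndWait, StartDevice, EnableAllInterfaces, and the two-argument CompleteRequest are the helpers the text describes; the exact EnableAllInterfaces argument list is an assumption.
NTSTATUS HandleStartDevice(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;

    Irp->IoStatus.Status = STATUS_SUCCESS;        // (1) bus driver inspects this

    NTSTATUS status = ForwardAndWait(fdo, Irp);   // (2) synchronous pass-down
    if (!NT_SUCCESS(status))
        return CompleteRequest(Irp, status);      // propagate a lower-level failure

    PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);   // (3)
    status = StartDevice(fdo,                     // (4) configure the device
        stack->Parameters.StartDevice.AllocatedResources,
        stack->Parameters.StartDevice.AllocatedResourcesTranslated);

    if (NT_SUCCESS(status))
        EnableAllInterfaces(pdx, TRUE);           // (5) let applications find us

    return CompleteRequest(Irp, status);          // (6) complete the IRP a second time
}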
You might guess (correctly!) that the IRP_MN_START_DEVICE handler has work to do that concerns the transition from the
initial STOPPED state to the WORKING state. I can’t explain that yet because I need to first explain the ramifications of other
PnP requests on state transitions, IRP queuing, and IRP cancellation. So I’m going to concentrate for a while on the
configuration aspects of the PnP requests.
The I/O stack location’s Parameters union has a substructure named StartDevice that contains the configuration information
you pass to the StartDevice helper function. See Table 6-2.
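The statements compared in the next sentence would, under the chapter's conventions, look something like the following; raw and translated are illustrative local names.
PCM_RESOURCE_LIST raw =
    stack->Parameters.StartDevice.AllocatedResources;
PCM_RESOURCE_LIST translated =
    stack->Parameters.StartDevice.AllocatedResourcesTranslated;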
The only difference between the last two statements is the reference to either the AllocatedResources or
AllocatedResourcesTranslated member of the parameters structure.
The raw and translated resource lists are the logical arguments to send to the StartDevice helper function, by the way:
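// Sketch, assuming the raw and translated declarations shown above:
status = StartDevice(fdo, raw, translated);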
There are two different lists of resources because I/O buses and the CPU can address the same physical hardware in different
ways. The raw resources contain numbers that are bus-relative, whereas the translated resources contain numbers that are
system-relative. Prior to the WDM, a kernel-mode driver might expect to retrieve raw resource values from the registry, the
Peripheral Component Interconnect (PCI) configuration space, or some other source, and to translate them by calling routines
such as HalTranslateBusAddress and HalGetInterruptVector. See, for example, Art Baker’s The Windows NT Device Driver
Book: A Guide for Programmers (Prentice Hall, 1997), pages 122-62. Both the retrieval and translation steps are done by the
PnP Manager now, and all a WDM driver needs to do is access the parameters of a start device IRP as I’m now describing.
What you actually do with the resource descriptions inside your StartDevice function is a subject for Chapter 7.
6.2.2 IRP_MN_STOP_DEVICE
The stop device request tells you to shut your device down so that the PnP Manager can reassign I/O resources. At the
hardware level, shutting down involves pausing or halting current activity and preventing further interrupts. At the software
level, it involves releasing the I/O resources you configured at start device time. Within the framework of the
dispatch/subdispatch architecture I’ve been illustrating, you might have a subdispatch function like this one:
<complicated stuff>
StopDevice(fdo, oktouch);
Irp->IoStatus.Status = STATUS_SUCCESS;
return DefaultPnpHandler(fdo, Irp);
1. Right about here, you need to insert some more or less complicated code that concerns IRP queuing and cancellation. I’ll
show you the code that belongs in this spot further on in this chapter in “While the Device Is Stopped.”
2. In contrast with the start device case, in which we passed the request down and then did device-dependent work, here we
do our device-dependent stuff first and then pass the request down. The idea is that our hardware will be quiescent by the
time the lower layers see this request. I wrote a helper function named StopDevice to do the shutdown work. The second
argument indicates whether it will be OK for StopDevice to touch the hardware if it needs to. Refer to the sidebar
“Touching the Hardware When Stopping the Device” for an explanation of how to set this argument.
3. We always pass PnP requests down the stack. In this case, we don’t care what the lower layers do with the request, so we
can simply use the DefaultPnpHandler code to perform the mechanics.
The StopDevice helper function called in the preceding example is code you write that essentially reverses the configuration
steps you took in StartDevice. I’ll show you that function in the next chapter. One important fact about the function is that you
should code it in such a way that it can be called more than once for a single call to StartDevice. It’s not always easy for a PnP
IRP handler to know whether you’ve already called StopDevice, but it is easy to make StopDevice proof against duplicative
calls.
Touching the Hardware When Stopping the Device
There’s no certain way to know whether your hardware is physically connected to the computer except by trying
to access it. Microsoft recommends, however, that if you succeeded in processing a START_DEVICE request,
you should go ahead and try to access your hardware when you process STOP_DEVICE and certain other PnP
requests. When I discuss how you track PnP state changes later in this chapter, I’ll honor this recommendation
by setting the oktouch argument to TRUE if we believe that the device is currently working and FALSE
otherwise.
6.2.3 IRP_MN_REMOVE_DEVICE
Recall that the PnP Manager calls the AddDevice function in your driver to notify you about an instance of the hardware you
manage and to give you an opportunity to create a device object. Instead of calling a function to do the complementary
operation, however, the PnP Manager sends you a PnP IRP with the minor function code IRP_MN_REMOVE_DEVICE. In
response to that, you’ll do the same things you did for IRP_MN_STOP_DEVICE to shut down your device, and then you’ll
delete the device object:
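A minimal sketch of what that handler might look like at this stage of the chapter, before the state tracking and remove-lock refinements introduced later; the ordering of the calls is an assumption.
NTSTATUS HandleRemoveDevice(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    Irp->IoStatus.Status = STATUS_SUCCESS;

    StopDevice(fdo, TRUE);                        // same shutdown work as for STOP_DEVICE

    DeregisterAllInterfaces(pdx);                 // first addition discussed below

    NTSTATUS status = DefaultPnpHandler(fdo, Irp);   // pass the IRP down

    RemoveDevice(fdo);                            // second addition: undo AddDevice
    return status;
}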
This fragment looks similar to HandleStopDevice, with a couple of additions. DeregisterAllInterfaces will disable any device
interfaces you registered (probably in AddDevice) and enabled (probably in StartDevice), and it will release the memory
occupied by their symbolic link names. RemoveDevice will undo all the work you did inside AddDevice. For example:
VOID RemoveDevice(PDEVICE_OBJECT fdo)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    IoDetachDevice(pdx->LowerDeviceObject);
    IoDeleteDevice(fdo);
}
1. This call to IoDetachDevice balances the call AddDevice made to IoAttachDeviceToDeviceStack.
2. This call to IoDeleteDevice balances the call AddDevice made to IoCreateDevice. Once this function returns, you should
act as if the device object no longer exists. If your driver isn’t managing any other devices, it will shortly be unloaded
from memory too.
Note, by the way, that you don’t get a stop device request followed by a remove device request. The remove device request
implies a shutdown, so you do both pieces of work in reply.
6.2.4 IRP_MN_SURPRISE_REMOVAL
Sometimes the end user has the physical ability to remove a device without going through any user interface elements first. If
the system detects that such a surprise removal has occurred, or that the device appears to be broken, it sends the driver a PnP
request with the minor function code IRP_MN_SURPRISE_REMOVAL. It will later send an IRP_MN_REMOVE_DEVICE.
Unless you previously set the SurpriseRemovalOK flag while processing IRP_MN_QUERY_CAPABILITIES (as I’ll discuss in
Chapter 8), some platforms also post a dialog box to inform the user that it’s potentially dangerous to yank hardware out of the
computer.
In response to the surprise removal request, a device driver should disable any registered interfaces. This will give applications
a chance to close handles to your device if they’re on the lookout for the notifications I discuss later in “PnP Notifications.”
Then the driver should release I/O resources and pass the request down:
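A hedged sketch of such a handler, reusing the chapter's helper names; whether to record a distinct state value at this point is a detail taken up later, so the sketch leaves it out.
NTSTATUS HandleSurpriseRemoval(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    Irp->IoStatus.Status = STATUS_SUCCESS;

    DeregisterAllInterfaces(pdx);        // let applications know we're going away

    StopDevice(fdo, FALSE);              // release I/O resources; the hardware is presumably gone

    return DefaultPnpHandler(fdo, Irp);  // pass the request down
}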
Whence IRP_MN_SURPRISE_REMOVAL?
The surprise removal PnP notification doesn’t happen as a simple and direct result of the end user yanking the
device from the computer. Some bus drivers can know when a device disappears. For example, removing a
universal serial bus (USB) device generates an electronic signal that the bus driver notices. For many other
buses, however, there isn’t any signal to alert the bus driver. The PnP Manager therefore relies on other
methods to decide that a device has disappeared.
A function driver can signal the disappearance of its device (if it knows) by calling IoInvalidateDeviceState and
then returning any of the values PNP_DEVICE_FAILED, PNP_DEVICE_REMOVED, or PNP_DEVICE_DISABLED
from the ensuing IRP_MN_QUERY_PNP_DEVICE_STATE. You might want to do this in your own driver if—to give
one example of many—your interrupt service routines (ISRs) read all 1 bits from a status port that normally
returns a mixture of 1s and 0s. More commonly, a bus driver calls IoInvalidateDeviceRelations to trigger a
re-enumeration and then fails to report the newly missing device. It’s worth knowing that when the end user
removes a device while the system is hibernating or in another low-power state, the driver receives a series of
power management IRPs after power is restored and before it receives the IRP_MN_SURPRISE_REMOVAL
request.
What these facts mean, practically speaking, is that your driver should be able to cope with errors that might
arise from having your device suddenly not present.
NTSTATUS AddDevice(...)
{
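    // Hedged reconstruction of the part of AddDevice that matters for this
    // discussion; the dqReadWrite member, the InitializeQueue arguments, and
    // the STOPPED initial state are assumptions drawn from the surrounding text.
    <create the device object, attach it to the lower stack, register interfaces, and so on>
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    pdx->state = STOPPED;                           // no I/O resources yet
    InitializeQueue(&pdx->dqReadWrite, StartIo);    // the queue begins life stalled
    return STATUS_SUCCESS;
}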
After AddDevice returns, the system sends IRP_MJ_PNP requests to direct you through the various PnP states the device can
assume.
NOTE
If your driver uses GENERIC.SYS, GENERIC will initialize your DEVQUEUE object or objects for you. Just be sure
to give GENERIC the addresses of those objects in your call to InitializeGenericExtension.
NTSTATUS HandleStartDevice(...)
{
status = StartDevice(...);
if (NT_SUCCESS(status))
{
pdx->state = WORKING;
RestartRequests(&pdx->dqReadWrite, fdo);
}
}
You record WORKING as the current state of your device, and you call RestartRequests for each of your queues to release any
IRPs that might have arrived between the time AddDevice ran and the time you received the IRP_MN_START_DEVICE
request.
When the PnP Manager wants to reassign I/O resources, it first asks permission by sending IRP_MN_QUERY_STOP_DEVICE. You
might handle the query with a subdispatch function like this one:
NTSTATUS HandleQueryStop(PDEVICE_OBJECT fdo, PIRP Irp)
{
PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
if (pdx->state != WORKING)
return DefaultPnpHandler(fdo, Irp);
if (!OkayToStop(pdx))
return CompleteRequest(Irp, STATUS_UNSUCCESSFUL, 0);
StallRequests(&pdx->dqReadWrite);
WaitForCurrentIrp(&pdx->dqReadWrite);
pdx->state = PENDINGSTOP;
return DefaultPnpHandler(fdo, Irp);
}
1. This statement handles a peculiar situation that can arise for a boot device: the PnP Manager might send you a
QUERY_STOP when you haven’t initialized yet. You want to ignore such a query, which is tantamount to saying yes.
2. At this point, you perform some sort of investigation to see whether it will be OK to revert to the STOPPED state. I’ll
discuss factors bearing on the investigation next.
3. StallRequests puts the DEVQUEUE in the STALLED state so that any new IRP just goes into the queue.
WaitForCurrentIrp waits until the current request, if there is one, finishes on the device. These two steps make the device
quiescent until we know whether the device is really going to stop or not. If the current IRP won’t finish quickly of its
own accord, you’ll do something (such as calling IoCancelIrp to force a lower-level driver to finish the current IRP) to
“encourage” it to finish; otherwise, WaitForCurrentIrp won’t return.
4. At this point, we have no reason to demur. We therefore record our state as PENDINGSTOP. Then we pass the request
down the stack so that other drivers can have a chance to accept or decline this query.
The other basic way of handling QUERY_STOP is appropriate when your device might be busy with a request that will take a
long time and can’t be stopped in the middle, such as a tape retension operation that can’t be stopped without potentially
breaking the tape. In this case, you can use the DEVQUEUE object’s CheckBusyAndStall function. That function returns TRUE
if the device is busy, whereupon you cause the QUERY_STOP to fail with STATUS_UNSUCCESSFUL. The function returns
FALSE if the device is idle, in which case it also stalls the queue. (The operations of checking the state of the device and
stalling the queue need to be protected by a spin lock, which is why I wrote this function in the first place.)
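Under that approach, the middle of the query handler might look like this sketch (same queue and state names as before):
if (CheckBusyAndStall(&pdx->dqReadWrite))                   // atomically test-and-stall
    return CompleteRequest(Irp, STATUS_UNSUCCESSFUL, 0);   // device busy: veto the query
pdx->state = PENDINGSTOP;
return DefaultPnpHandler(fdo, Irp);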
You can cause a stop query to fail for many reasons. Disk devices that are used for paging, for example, cannot be stopped.
Neither can devices that are used for storing hibernation or crash dump files. (You’ll know about these characteristics as a
result of an IRP_MN_DEVICE_USAGE_NOTIFICATION request, which I’ll discuss later in “Other Configuration
Functionality.”) Other reasons may also apply to your device.
Even if you have the query succeed, one of the drivers underneath you might cause it to fail for some reason. Even if all the
drivers have the query succeed, the PnP Manager might decide not to shut you down. In any of these cases, you’ll receive
another PnP request with the minor code IRP_MN_CANCEL_STOP_DEVICE to tell you that your device won’t be shut down.
You should then clear whatever state you set during the initial query:
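The handler might look like the following sketch, which mirrors the CANCEL_REMOVE handler shown later; the name HandleCancelStop follows the chapter's naming pattern.
NTSTATUS HandleCancelStop(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    if (pdx->state != PENDINGSTOP)
        return DefaultPnpHandler(fdo, Irp);      // we never saw the query
    NTSTATUS status = ForwardAndWait(fdo, Irp);  // let lower drivers get ready first
    pdx->state = WORKING;
    RestartRequests(&pdx->dqReadWrite, fdo);     // unstall the queue
    return CompleteRequest(Irp, status);
}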
We first check to see whether a stop operation is even pending. Some higher-level driver might have vetoed a query that we
never saw, so we’d still be in the WORKING state. If we’re not in the PENDINGSTOP state, we simply forward the IRP.
Otherwise, we send the CANCEL_STOP IRP synchronously to the lower-level drivers. That is, we use our ForwardAndWait
helper function to send the IRP down the stack and await its completion. We wait for low-level drivers because we’re about to
resume processing IRPs, and the drivers might have work to do before we send them an IRP. We then change our state variable
to indicate that we’re back in the WORKING state, and we call RestartRequests to unstall the queues we stalled when we
caused the query to succeed.
If all the drivers allow the query to succeed and the PnP Manager goes ahead, you'll then receive IRP_MN_STOP_DEVICE,
which you handle by actually shutting the device down:
NTSTATUS HandleStopDevice(PDEVICE_OBJECT fdo, PIRP Irp)
{
    if (pdx->state != PENDINGSTOP)
    {                                     // no preceding QUERY_STOP
        <complicated stuff>
    }
    StopDevice(fdo, TRUE);
    pdx->state = STOPPED;
    return DefaultPnpHandler(fdo, Irp);
}
1. We expect the system to send us a QUERY_STOP before it sends us a STOP, so we should already be in the
PENDINGSTOP state with all of our queues stalled. There is, however, a bug in Windows 98 such that we can sometimes
get a STOP (without a QUERY_STOP) instead of a REMOVE. You need to take some action at this point that causes you
to reject any new IRPs, but you mustn’t really remove your device object or do the other things you do when you really
receive a REMOVE request.
2. StopDevice is the helper function I’ve already discussed that deconfigures the device.
3. We now enter the STOPPED state. We’re in almost the same situation as we were when AddDevice was done. That is, all
queues are stalled, and the device has no I/O resources. The only difference is that we’ve left our registered interfaces
enabled, which means that applications won’t have received removal notifications and will leave their handles open.
Applications can also open new handles in this situation. Both aspects are just as they should be because the stop
condition won’t last long.
4. As I previously discussed, the last thing we do to handle IRP_MN_STOP_DEVICE is pass the request down to the lower
layers of the driver hierarchy.
The PnP Manager likewise asks permission before removing a device by sending IRP_MN_QUERY_REMOVE_DEVICE. The
handler parallels HandleQueryStop:
NTSTATUS HandleQueryRemove(PDEVICE_OBJECT fdo, PIRP Irp)
{
PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
if (OkayToRemove(fdo))
{
StallRequests(&pdx->dqReadWrite);
WaitForCurrentIrp(&pdx->dqReadWrite);
pdx->prevstate = pdx->state;
pdx->state = PENDINGREMOVE;
return DefaultPnpHandler(fdo, Irp);
}
return CompleteRequest(Irp, STATUS_UNSUCCESSFUL, 0);
}
If someone above or below us vetoes the query, we get IRP_MN_CANCEL_REMOVE_DEVICE instead:
NTSTATUS HandleCancelRemove(PDEVICE_OBJECT fdo, PIRP Irp)
{
PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
if (pdx->state != PENDINGREMOVE)
return DefaultPnpHandler(fdo, Irp);
NTSTATUS status = ForwardAndWait(fdo, Irp);
pdx->state = pdx->prevstate;
RestartRequests(&pdx->dqReadWrite, fdo);
return CompleteRequest(Irp, status);
}
1. This OkayToRemove helper function provides the answer to the question, “Is it OK to remove this device?” In general,
this answer includes some device-specific ingredients, such as whether the device holds a paging or hibernation file, and
so on.
2. Just as I showed you for IRP_MN_QUERY_STOP_DEVICE, you want to stall the request queue and wait for a short
period, if necessary, until the current request finishes.
3. If you look at Figure 6-1 carefully, you’ll notice that it’s possible to get a QUERY_REMOVE when you’re in either the
WORKING or the STOPPED state. The right thing to do if the current query is later cancelled is to return to the original
state. Hence, I have a prevstate variable in the device extension to record the prequery state.
4. We get the CANCEL_REMOVE request when someone either above or below us vetoes a QUERY_REMOVE. If we
never saw the query, we’ll still be in the WORKING state and don’t need to do anything with this IRP. Otherwise, we
need to forward it to the lower levels before we process it because we want the lower levels to be ready to process the
IRPs we’re about to release from our queues.
5. Here we undo the steps we took when we succeeded the QUERY_REMOVE. We revert to the previous state. We stalled
the queues when we handled the query and need to unstall them now.
To guard against having your device object deleted while IRPs are still being processed, the I/O Manager provides the remove
lock. You declare an IO_REMOVE_LOCK object in your device extension and initialize it, typically in AddDevice:
struct DEVICE_EXTENSION {
IO_REMOVE_LOCK RemoveLock;
};
IoInitializeRemoveLock(&pdx->RemoveLock, 0, 0, 0);
The last three parameters to IoInitializeRemoveLock are, respectively, a tag value, an expected maximum lifetime for a lock,
and a maximum lock count, none of which is used in the free build of the operating system.
These preliminaries set the stage for what you do during the lifetime of the device object. Whenever you receive an I/O request
that you plan to forward down the stack, you call IoAcquireRemoveLock. IoAcquireRemoveLock will return
STATUS_DELETE_PENDING if a removal operation is under way. Otherwise, it will acquire the lock and return
STATUS_SUCCESS. Whenever you finish such an I/O operation, you call IoReleaseRemoveLock, which will release the lock
and might unleash a heretofore pending removal operation. In the context of some purely hypothetical dispatch function that
synchronously forwards an IRP, the code might look like this:
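NTSTATUS DispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;

    // Hedged reconstruction of the top half of this hypothetical routine; the
    // name DispatchSomething and the use of ForwardAndWait are assumptions.
    NTSTATUS status = IoAcquireRemoveLock(&pdx->RemoveLock, Irp);
    if (!NT_SUCCESS(status))                  // removal already under way
        return CompleteRequest(Irp, status, 0);

    status = ForwardAndWait(fdo, Irp);        // synchronously forward the IRP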
IoReleaseRemoveLock(&pdx->RemoveLock, Irp);
return CompleteRequest(Irp, <some code>, <info value>);
}
The second argument to IoAcquireRemoveLock and IoReleaseRemoveLock is just a tag value that a checked build of the
operating system can use to match up acquisition and release calls, by the way.
The calls to acquire and release the remove lock dovetail with additional logic in the PnP dispatch function and the remove
device subdispatch function. First DispatchPnp has to obey the rule about locking and unlocking the device, so it will contain
the following code, which I didn’t show you earlier in “IRP_MJ_PNP Dispatch Function”:
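    // A hedged sketch of the added logic; fcntab and fcn are the subdispatch
    // table and minor function code from the earlier discussion.
    NTSTATUS status = IoAcquireRemoveLock(&pdx->RemoveLock, Irp);
    if (!NT_SUCCESS(status))                        // removal already under way
        return CompleteRequest(Irp, status, 0);
    status = (*fcntab[fcn])(fdo, Irp);              // call the subdispatch routine
    if (fcn != IRP_MN_REMOVE_DEVICE)                // the REMOVE handler releases the lock itself
        IoReleaseRemoveLock(&pdx->RemoveLock, Irp);
    return status;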
In other words, DispatchPnp locks the device, calls the subdispatch routine, and then (usually) unlocks the device afterward.
The subdispatch routine for IRP_MN_REMOVE_DEVICE has additional special logic that you also haven’t seen yet:
NTSTATUS HandleRemoveDevice(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    Irp->IoStatus.Status = STATUS_SUCCESS;
    AbortRequests(&pdx->dqReadWrite, STATUS_DELETE_PENDING);
    DeregisterAllInterfaces(pdx);
    StopDevice(fdo, pdx->state == WORKING);
    pdx->state = REMOVED;
    NTSTATUS status = DefaultPnpHandler(fdo, Irp);
    IoReleaseRemoveLockAndWait(&pdx->RemoveLock, Irp);
    RemoveDevice(fdo);
    return status;
}
1. Windows 98/Me doesn’t send the SURPRISE_REMOVAL request, so this REMOVE IRP may be the first indication you
have that the device has disappeared. Calling StopDevice allows you to release all your I/O resources in case you didn’t
get an earlier IRP that caused you to release them. Calling AbortRequests causes you to complete any queued IRPs and to
start rejecting any new IRPs.
2. We pass this request to the lower layers now that we’ve done our work.
3. The PnP dispatch routine acquired the remove lock. We now call the special function IoReleaseRemoveLockAndWait to
release that lock reference and wait until all references to the lock are released. Once you call
IoReleaseRemoveLockAndWait, any subsequent call to IoAcquireRemoveLock will elicit a STATUS_DELETE_PENDING
status to indicate that device removal is under way.
NOTE
You’ll notice that the IRP_MN_REMOVE_DEVICE handler might block while an IRP finishes. This is certainly OK
in Windows 98/Me and Windows XP, which were designed with this possibility in mind—the IRP gets sent in the
context of a system thread that’s allowed to block. Some WDM functionality (a Microsoft developer even called
it “embryonic”) is present in OEM releases of Microsoft Windows 95, but you can’t block a remove device
request there. Consequently, if your driver needs to run in Windows 95, you need to discover that fact and avoid
blocking. That discovery process is left as an exercise for you.
It bears repeating that you need to use the remove lock only for an IRP that you pass down the PnP stack. If you have the
stamina, you can read the next section to understand exactly why this conclusion is true—and note that it differs from the
conventional wisdom that I and others have been espousing for several years. If someone sends you an IRP that you handle
entirely inside your own driver, you can rely on whoever sent you the IRP to make sure your driver remains in memory until
you both complete the IRP and return from your dispatch routine. If you send an IRP to someone outside your PnP stack, you’ll
use other means (such as a referenced file or device object) to keep the target driver in memory until it both completes the IRP
and returns from its dispatch routine.
NOTE
In Chapter 5, I advised you to take an extra reference to a file object or device object discovered via
IoGetDeviceObjectPointer around the call to IoCallDriver for an asynchronous IRP. The reason for the advice
may now be clear: you want to be sure the target driver for the IRP is pinned in memory until its dispatch
routine returns regardless of whether your completion routine releases the reference taken by
IoGetDeviceObjectPointer. Dang, but this is getting complicated!
IoDeleteDevice makes some checks before it releases the last reference to a device object. In both the Windows 98/Me and the
Windows 2000/XP implementations, it checks whether the AttachedDevice pointer is NULL. This field in the device object
points to the device object of the next higher driver. It is set by IoAttachDeviceToDeviceStack and reset by IoDetachDevice,
which are functions that WDM drivers call in their AddDevice and RemoveDevice functions, respectively.
You want to think about the entire PnP stack of device objects as being the target of IRPs that the I/O Manager and drivers
outside the stack send to “your” device. This is because the driver for the topmost device object in the stack is always first to
process any IRP. Before anyone sends an IRP to your stack, however, they will have a referenced pointer to this topmost device
object, and they won’t release the reference until after the IRP completes. So if a driver stack contains just one device object,
there will never be any danger of having a device object or driver code disappear while the driver is processing an IRP: the IRP
sender’s reference pins the device object in memory, even if someone calls IoDeleteDevice before the IRP completes, and the
device object pins the driver code in memory.
WDM driver stacks usually contain two or more device objects, so you have to wonder about the second and lower objects in a
stack. After all, whoever sends an IRP to the device has a reference only to the topmost device object, not to the objects lower
down in the stack. Imagine the following scenario, then. Someone sends an IRP_MJ_SOMETHING (a made-up major function
to keep us focused on the remove lock) to the topmost filter device object (FiDO), whose driver sends it down the stack to your
function driver. You plan to send this IRP down to the filter driver underneath you. But, at about the same time on another CPU,
the PnP Manager has sent your driver stack an IRP_MN_REMOVE_DEVICE request.
Before the PnP Manager sends REMOVE_DEVICE requests, it takes an extra reference to every device object in the stack.
Then it sends the IRP. Each driver passes the IRP down the stack and then calls IoDetachDevice followed by IoDeleteDevice.
At each level, IoDeleteDevice sees that AttachedDevice is not (yet) NULL and decides that the time isn’t quite right to
dereference the device object. When the driver at the next higher level calls IoDetachDevice, however, the time is right, and
the I/O Manager dereferences the device object. Without the PnP Manager’s extra reference, the object would then disappear,
and that might trigger unloading the driver at that level of the stack. Once the REMOVE_DEVICE request is complete, the PnP
Manager will release all the extra references. That will allow all but the topmost device object to disappear because only the
topmost object is protected by the reference owned by the sender of the IRP_MJ_SOMETHING.
IMPORTANT
Every driver I’ve ever seen or written processes REMOVE_DEVICE synchronously. That is, no driver ever pends
a REMOVE_DEVICE request. Consequently, the calls to IoDetachDevice and IoDeleteDevice at any level of the
PnP stack always happen after the lower-level drivers have already performed those calls. This fact doesn’t
impact our analysis of the remove lock because the PnP Manager won’t release its extra reference to the stack
until after REMOVE_DEVICE actually completes, which requires IoCompleteRequest to run to conclusion.
Can you see why the Microsoft folks who understand the PnP Manager deeply are fond of saying, “Game Over” at this point?
We’re going to trust whoever is above us in the PnP stack to keep our device object and driver code in memory until we’re
done handling the IRP_MJ_SOMETHING that I hypothesized. But we haven’t (yet) done anything to keep the next lower
device object and driver in memory. While we were getting ready to send the IRP down, the IRP_MN_REMOVE_DEVICE ran
to completion, and the lower driver is now gone!
And that’s the problem that the remove lock solves: we simply don’t want to pass an IRP down the stack if the
REMOVE_DEVICE might already have run to completion and taken the lower driver with it.
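The "point A" and "point B" callouts in the next two paragraphs refer to a dispatch routine that forwards the IRP asynchronously, roughly like the following sketch; the routine names and forwarding details are illustrative.
NTSTATUS CompletionRoutine(PDEVICE_OBJECT fdo, PIRP Irp, PVOID context);

NTSTATUS DispatchSomething(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
    NTSTATUS status = IoAcquireRemoveLock(&pdx->RemoveLock, Irp);   // <== point A
    if (!NT_SUCCESS(status))
        return CompleteRequest(Irp, status, 0);
    IoCopyCurrentIrpStackLocationToNext(Irp);
    IoSetCompletionRoutine(Irp, CompletionRoutine, pdx, TRUE, TRUE, TRUE);
    return IoCallDriver(pdx->LowerDeviceObject, Irp);
}

NTSTATUS CompletionRoutine(PDEVICE_OBJECT fdo, PIRP Irp, PVOID context)
{
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) context;
    if (Irp->PendingReturned)
        IoMarkIrpPending(Irp);
    IoReleaseRemoveLock(&pdx->RemoveLock, Irp);                     // <== point B
    return STATUS_SUCCESS;
}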
In summary, we acquire the remove lock for this IRP in the dispatch routine, and we release it in the completion routine.
Suppose this IRP is racing an IRP_MN_REMOVE_DEVICE down the stack. If our HandleRemoveDevice function has gotten
to the point of calling IoReleaseRemoveLockAndWait before we get to point A, perhaps all the device objects in the stack are
teetering on the edge of extinction because the REMOVE_DEVICE may have finished long ago. If we’re the topmost device
object, somebody’s reference is keeping us alive. If we’re lower down the stack, the driver above us is keeping us alive. Either
way, it’s certainly OK for us to execute instructions. We’ll find that our call to IoAcquireRemoveLock returns
STATUS_DELETE_PENDING, so we’ll just complete the IRP and return.
Suppose instead that we win the race by calling IoAcquireRemoveLock before our HandleRemoveDevice function calls
IoReleaseRemoveLockAndWait. In this case, we’ll pass the IRP down the stack. IoReleaseRemoveLockAndWait will block until
our completion routine (at point B) releases the lock. At this exact instant, we fall back on the IRP sender’s reference or the
driver above us to keep us in memory long enough for our completion routine to return.
At this point in the analysis, I have to raise an alarming point that everyone who writes WDM drivers or writes or lectures
about them, including me, has missed until now. Passing an IRP down without a completion routine is actually unsafe because
it allows us to send an IRP down to a driver that isn’t pinned in memory. Anytime you see a call to
IoSkipCurrentIrpStackLocation (there are 204 of them in the Windows XP DDK), your antennae should twitch. We’ve all been
getting away with this because some redundant protections are in place and because the coincidence of an
IRP_MN_REMOVE_DEVICE with some kind of problem IRP is very rare. Refer to the sidebar for a discussion.
There is a large class of IRP that device drivers never see because these IRPs involve file system operations on
volumes. Thus, worrying about what might happen as a device driver handles an
IRP_MJ_QUERY_VOLUME_INFORMATION, for example, isn’t practical.
Only a few IRPs aren’t handle based or aimed at file system drivers, and most of them carry their own built-in
safeguards. To get an IRP_MJ_SHUTDOWN, you have to specifically register with the I/O Manager by calling
IoRegisterShutdownNotification. IoDeleteDevice automatically deregisters you if you happen to forget, and you
won’t be getting REMOVE_DEVICE requests while shutdown notifications are in progress. (While we’re on the
subject, note these additional details about IRP_MJ_SHUTDOWN. Like every other IRP, this one will be sent first
to the topmost FiDO in the PnP stack if any driver in the stack has called IoRegisterShutdownNotification.
Furthermore, as many IRPs will be sent as there are drivers in the stack with active notification requests. Thus,
drivers should take care to do their shutdown processing only once and should pass this IRP down the stack
after doing their own shutdown processing.)
The PnP Manager itself is the source of most IRP_MJ_PNP requests, and you can be sure that it won’t overlap
a REMOVE_DEVICE request with another PnP IRP. You can’t, however, be sure there’s no overlap with PnP IRPs
sent by other drivers, such as a QUERY_DEVICE_RELATIONS to get the physical device object (PDO) address or
a QUERY_INTERFACE to locate a direct-call interface.
Finally, there’s IRP_MJ_POWER, which is a potential problem because the Power Manager doesn’t lock an entire
device stack and doesn’t hold a file object pointer.
The window of vulnerability is actually pretty small. Consider the following fragment of dispatch routines in two drivers:
NTSTATUS DriverA_DispatchSomething(...)
{
    <acquire the remove lock, send the IRP down to Driver B, and arrange to release the lock when the IRP completes>
}

NTSTATUS DriverB_DispatchSomething(...)
{
    <process the IRP in whatever way is appropriate>
    return ??;
}
Driver A’s use of the remove lock protects Driver B until Driver B’s dispatch routine returns. Thus, if Driver B completes the
IRP or itself passes the IRP down using IoSkipCurrentIrpStackLocation, Driver B’s involvement with the IRP will certainly be
finished by the time Driver A is able to release the remove lock. If Driver B were to pend the IRP, Driver A wouldn’t be
holding the remove lock by the time Driver B got around to completing the IRP. We can assume, however, that Driver B will
have some mechanism in place for purging its queues of pending IRPs before returning from its own HandleRemoveDevice
function. Driver A won’t call IoDetachDevice or return from its own HandleRemoveDevice function until afterwards.
The only time there will be a problem is if Driver B passes the IRP down with a completion routine installed via the original
IoSetCompletionRoutine macro. Even here, if the lowest driver that handles this IRP does so correctly, its HandleRemoveDevice
function won’t return until the IRP is completed. We’ll have just a slim chance that Driver B could be unloaded before its
completion routine runs.
There is, unfortunately, no way for a driver to completely protect itself from being unloaded while processing an IRP. Any
scheme you or I can devise will inevitably risk executing at least one instruction (a return) after the system removes the driver
image from memory. You can, however, hope that the drivers above you minimize the risk by using the techniques I’ve
outlined here.
Here, for reference, is how the DEVQUEUE stalling machinery works. StallRequests just bumps the stall counter:
VOID StallRequests(PDEVQUEUE pdq)
{
    InterlockedIncrement(&pdq->stallcount);
}
CheckBusyAndStall combines the busy test and the stall under the queue's spin lock:
BOOLEAN CheckBusyAndStall(PDEVQUEUE pdq)
{
    KIRQL oldirql;
    KeAcquireSpinLock(&pdq->lock, &oldirql);
    BOOLEAN busy = pdq->CurrentIrp != NULL;
    if (!busy)
        InterlockedIncrement(&pdq->stallcount);
    KeReleaseSpinLock(&pdq->lock, oldirql);
    return busy;
}
1. To stall requests, we just need to set the stall counter to a nonzero value. It’s unnecessary to protect the increment with a
spin lock because any thread that might be racing with us to change the value will also be using an interlocked increment
or decrement.
2. Since CheckBusyAndStall needs to operate as an atomic function, we first take the queue’s spin lock.
3. CurrentIrp being non-NULL is the signal that the device is busy handling one of the requests from this queue.
4. If the device is currently idle, this statement starts stalling the queue, thereby preventing the device from becoming busy
later on.
Recall that StartPacket and StartNextPacket don’t send IRPs to the queue’s StartIo routine while the stall counter is nonzero. In
addition, InitializeQueue initializes the stall counter to 1, so the queue begins life in the stalled state.
RestartRequests reverses the effect of StallRequests:
VOID RestartRequests(PDEVQUEUE pdq, PDEVICE_OBJECT fdo)
{
    KIRQL oldirql;
    KeAcquireSpinLock(&pdq->lock, &oldirql);
    if (InterlockedDecrement(&pdq->stallcount) > 0)
    {
        KeReleaseSpinLock(&pdq->lock, oldirql);
        return;
    }
    <a loop duplicated from StartNextPacket dequeues the next IRP, releases the spin lock, and sends the IRP to the StartIo routine>
}
1. We acquire the queue spin lock to prevent interference from a simultaneous invocation of StartPacket.
2. Here we decrement the stall counter. If it’s still nonzero, the queue remains stalled, and we return.
3. This loop duplicates a similar loop inside StartNextPacket. We need to duplicate the code here to accomplish all of this
function’s actions within one invocation of the spin lock.
NOTE
True confession: The first edition described a much simpler—and incorrect—implementation of
RestartRequests. A reader pointed out a race between the earlier implementation and StartPacket, which was
corrected on my Web site as shown here.
WaitForCurrentIrp waits, if necessary, for the request currently on the device to finish:
VOID WaitForCurrentIrp(PDEVQUEUE pdq)
{
KeClearEvent(&pdq->evStop);
ASSERT(pdq->stallcount != 0);
KIRQL oldirql;
KeAcquireSpinLock(&pdq->lock, &oldirql);
BOOLEAN mustwait = pdq->CurrentIrp != NULL;
KeReleaseSpinLock(&pdq->lock, oldirql);
if (mustwait)
KeWaitForSingleObject(&pdq->evStop, Executive, KernelMode, FALSE, NULL);
}
1. StartNextPacket signals the evStop event each time it’s called. We want to be sure that the wait we’re about to perform
doesn’t complete because of a now-stale signal, so we clear the event before doing anything else.
2. It doesn’t make sense to call this routine without first stalling the queue. Otherwise, StartNextPacket will just start the
next IRP if there is one, and the device will become busy again.
3. If the device is currently busy, we’ll wait on the evStop event until someone calls StartNextPacket to signal that event. We
need to protect our inspection of CurrentIrp with the spin lock because, in general, testing a pointer for NULL isn’t an
atomic event. If the pointer is NULL now, it can’t change later because we’ve assumed that the queue is stalled.
Aborting Requests
Surprise removal of the device demands that we immediately halt every outstanding IRP that might try to touch the hardware.
In addition, we want to make sure that all further IRPs are rejected. The AbortRequests function helps with these tasks:
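What follows is a minimal sketch rather than the actual listing; the abortstatus member and the CleanupRequests argument order are assumptions based on the description below.
VOID AbortRequests(PDEVQUEUE pdq, NTSTATUS status)
{
    // Entering the REJECTING state: StartPacket will now fail new IRPs with
    // the caller's status value.
    pdq->abortstatus = status;

    // A NULL file object pointer tells CleanupRequests to complete every IRP
    // currently in the queue, not just those belonging to one file object.
    CleanupRequests(pdq, NULL, status);
}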
Setting abortstatus puts the queue in the REJECTING state so that all future IRPs will be rejected with the status value our
caller supplied. Calling CleanupRequests at this point—with a NULL file object pointer so that CleanupRequests will process
the entire queue—empties the queue.
We don’t dare try to do anything with the IRP, if any, that’s currently active on the hardware. Drivers that don’t use the
hardware abstraction layer (HAL) to access the hardware—USB drivers, for example, which rely on the hub and
host-controller drivers—can count on another driver to cause the current IRP to fail. Drivers that use the HAL might, however,
need to worry about hanging the system or, at the very least, leaving an IRP in limbo because the nonexistent hardware can’t
generate the interrupt that would let the IRP finish. To deal with situations such as this, you call AreRequestsBeingAborted:
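Again a minimal sketch; the exact return convention is an assumption.
NTSTATUS AreRequestsBeingAborted(PDEVQUEUE pdq)
{
    // Nonzero (an error status) while the queue is rejecting IRPs; zero
    // (STATUS_SUCCESS) otherwise. No spin lock, for the reason given below.
    return pdq->abortstatus;
}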
It would be silly, by the way, to use the queue spin lock in this routine. Suppose we capture the instantaneous value of
abortstatus in a thread-safe and multiprocessor-safe way. The value we return can become obsolete as soon as we release the
spin lock.
NOTE
If your device might be removed in such a way that an outstanding request simply hangs, you should also have
some sort of watchdog timer running that will let you kill the IRP after a specified period of time. See the
“Watchdog Timers” section in Chapter 14.
Sometimes we need to undo the effect of a previous call to AbortRequests. AllowRequests lets us do that:
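A minimal sketch, assuming that the only bookkeeping AbortRequests changed, other than emptying the queue, was the abortstatus field:
VOID AllowRequests(PDEVQUEUE pdq)
{
    // Clearing abortstatus takes the queue out of the REJECTING state, so new
    // IRPs are queued (or started) normally again.
    pdq->abortstatus = 0;
}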
A driver can also take part in I/O resource assignment by handling IRP_MN_FILTER_RESOURCE_REQUIREMENTS, which is the
filtering step I set aside earlier. The fragments below, explained by the numbered notes that follow them (and pulled together
in a sketch after the notes), come from such a handler:
PIO_RESOURCE_REQUIREMENTS_LIST filtered =
(PIO_RESOURCE_REQUIREMENTS_LIST) Irp->IoStatus.Information;
if (source->AlternativeLists != 1)
return DefaultPnpHandler(fdo, Irp);
newlist->ListSize += sizeof(IO_RESOURCE_DESCRIPTOR);
PIO_RESOURCE_DESCRIPTOR resource =
&newlist->List[0].Descriptors[newlist->List[0].Count++];
RtlZeroMemory(resource, sizeof(IO_RESOURCE_DESCRIPTOR));
resource->Type = CmResourceTypeDevicePrivate;
resource->ShareDisposition = CmResourceShareDeviceExclusive;
resource->u.DevicePrivate.Data[0] = 42;
Irp->IoStatus.Status = status;
IoCompleteRequest(Irp, IO_NO_INCREMENT);
return status;
}
1. The parameters for this request include a list of I/O resource requirements. These are derived from the device’s
configuration space, the registry, or wherever the bus driver happens to find them.
2. Higher-level drivers might have already filtered the resources by adding requirements to the original list. If so, they set
the IoStatus.Information field to point to the expanded requirements list structure.
3. If there’s no filtered list, we’ll extend the original list. If there’s a filtered list, we’ll extend that.
4. Theoretically, several alternative lists of requirements could exist, but dealing with that situation is beyond the scope of
this simple example.
5. We need to add any resources before we pass the request down the stack. First we allocate a new requirements list and
copy the old requirements into it.
6. Taking care to preserve the preexisting order of the descriptors, we add our own resource description. In this example,
we’re adding a resource that’s private to the driver.
7. We store the address of the expanded list of requirements in the IRP’s IoStatus.Information field, which is where
lower-level drivers and the PnP system will be looking for it. If we just extended an already filtered list, we need to
release the memory occupied by the old list.
8. We pass the request down using the same ForwardAndWait helper function that we used for IRP_MN_START_DEVICE.
If we weren’t going to modify any resource descriptors on the IRP’s way back up the stack, we could just call
DefaultPnpHandler here and propagate the returned status.
9. When we complete this IRP, whether we indicate success or failure, we must take care not to modify the Information
field of the I/O status block: it might hold a pointer to a resource requirements list that some driver—maybe even
ours!—installed on the way down. The PnP Manager will release the memory occupied by that structure when it’s no
longer needed.
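Pulling the fragments and the nine numbered steps together, the whole handler might look something like this sketch. ForwardAndWait and DefaultPnpHandler are the chapter's helpers; the pool type, tag, and error handling are assumptions.
NTSTATUS HandleFilterResourceRequirements(PDEVICE_OBJECT fdo, PIRP Irp)
{
    PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);

    // (1, 2, 3) Use an upper filter's list from IoStatus.Information if there
    // is one; otherwise start from the bus driver's original list.
    PIO_RESOURCE_REQUIREMENTS_LIST filtered =
        (PIO_RESOURCE_REQUIREMENTS_LIST) Irp->IoStatus.Information;
    PIO_RESOURCE_REQUIREMENTS_LIST source = filtered ? filtered :
        stack->Parameters.FilterResourceRequirements.IoResourceRequirementList;

    // (4) Punt if there's more than one alternative list.
    if (source->AlternativeLists != 1)
        return DefaultPnpHandler(fdo, Irp);

    // (5) Allocate a copy with room for one more descriptor.
    ULONG newsize = source->ListSize + sizeof(IO_RESOURCE_DESCRIPTOR);
    PIO_RESOURCE_REQUIREMENTS_LIST newlist = (PIO_RESOURCE_REQUIREMENTS_LIST)
        ExAllocatePoolWithTag(PagedPool, newsize, 'RRIF');
    if (!newlist)
        return DefaultPnpHandler(fdo, Irp);
    RtlCopyMemory(newlist, source, source->ListSize);
    newlist->ListSize = newsize;

    // (6) Append a device-private descriptor after the existing ones.
    PIO_RESOURCE_DESCRIPTOR resource =
        &newlist->List[0].Descriptors[newlist->List[0].Count++];
    RtlZeroMemory(resource, sizeof(IO_RESOURCE_DESCRIPTOR));
    resource->Type = CmResourceTypeDevicePrivate;
    resource->ShareDisposition = CmResourceShareDeviceExclusive;
    resource->u.DevicePrivate.Data[0] = 42;

    // (7) Publish the new list; free a previously filtered list, if any.
    Irp->IoStatus.Information = (ULONG_PTR) newlist;
    if (filtered)
        ExFreePool(filtered);

    // (8, 9) Pass the IRP down synchronously, then complete it without
    // touching IoStatus.Information.
    NTSTATUS status = ForwardAndWait(fdo, Irp);
    Irp->IoStatus.Status = status;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return status;
}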
The parameters that accompany IRP_MN_DEVICE_USAGE_NOTIFICATION are as follows:
Parameter    Description
InPath       TRUE if the device is in the path of the Type usage; FALSE if not
Type         Type of usage to which the IRP applies