Kernel Extensions and Device Support Programming Concepts
SC23-4900-03
AIX 5L Version 5.3
Note
Before using this information and the product it supports, read the information in “Notices,” on page 359.
Chapter 13. FCP, iSCSI, and Virtual SCSI Client Subsystem 251
Programming FCP, iSCSI, and Virtual SCSI Client Device Drivers 251
FCP, iSCSI, and Virtual SCSI Client Subsystem Overview 273
Understanding FCP, iSCSI, and Virtual SCSI Client Asynchronous Event Handling 278
FCP, iSCSI, and Virtual SCSI Client Error Recovery 280
FCP, iSCSI, and Virtual SCSI Client Initiator-Mode Recovery When Not Command Tag Queuing 281
Initiator-Mode Recovery During Command Tag Queuing 282
A Typical Initiator-Mode FCP, iSCSI, and Virtual SCSI Client Driver Transaction Sequence 283
Understanding FCP, iSCSI, and Virtual SCSI Client Device Driver Internal Commands 284
Understanding the Execution of FCP, iSCSI, and Virtual SCSI Client Initiator I/O Requests 284
FCP, iSCSI, and Virtual SCSI Client Command Tag Queuing 285
Understanding the scsi_buf Structure 286
Other FCP, iSCSI, and Virtual SCSI Client Design Considerations 292
Required FCP, iSCSI, and Virtual SCSI Client Adapter Device Driver ioctl Commands 297
Related Information 299
Index 363
Highlighting
The following highlighting conventions are used in this book:
Case-Sensitivity in AIX
Everything in the AIX operating system is case-sensitive, which means that it distinguishes between
uppercase and lowercase letters. For example, you can use the ls command to list files. If you type LS, the
system responds that the command is "not found." Likewise, FILEA, FiLea, and filea are three distinct file
names, even if they reside in the same directory. To avoid causing undesirable actions to be performed,
always ensure that you use the correct case.
ISO 9000
ISO 9000 registered quality systems were used in the development and manufacturing of this product.
Related Publications
The following books contain additional information on kernel extension programming and the existing
kernel subsystems:
v Printers and printing
v Keyboard Technical Reference
v Operating system and device management
v AIX 5L Version 5.3 Technical Reference: Kernel and Subsystems Volume 1
v AIX 5L Version 5.3 Technical Reference: Kernel and Subsystems Volume 2
The term kernel extension applies to all routines added to the kernel, independent of their purpose. Kernel
extensions can be added at any time by a user with the appropriate privilege.
Kernel extensions run in the same mode as the kernel. That is, when the 64-bit kernel is used, kernel
extensions run in 64-bit mode. These kernel extensions must be compiled to produce a 64-bit object.
The following kernel-environment programming information is provided to assist you in programming kernel
extensions:
v “Understanding Kernel Extension Symbol Resolution”
v “Understanding Execution Environments” on page 5
v “Understanding Kernel Threads” on page 6
v “Using Kernel Processes” on page 8
v “Accessing User-Mode Data While in Kernel Mode” on page 12
v “Understanding Locking” on page 13
v “Understanding Exception Handling” on page 14
v “Using Kernel Extensions to Support 64–bit Processes” on page 19
A process executing in user mode can customize the kernel by using the sysconfig subroutine, if the
process has appropriate privilege. In this way, a user-mode process can load, unload, initialize, or
terminate kernel routines. Kernel configuration can also be altered by changing tunable system
parameters.
Kernel extensions can also customize the kernel by using kernel services to load, unload, initialize, and
terminate dynamically loaded kernel routines; to create and initialize kernel processes; and to define
interrupt handlers.
Note: Private kernel routines (or kernel services) execute in a privileged protection domain and can affect
the operation and integrity of the whole system. See “Kernel Protection Domain” on page 23 for
more information.
In an export file, symbols are listed one per line. These system calls are available to both 32- and 64-bit
processes. System calls are identified by using one of the syscall32, syscall64 or syscall3264 keywords
after the symbol name. Use syscall32 to make a system call available to 32-bit processes, syscall64 to
make a system call available to 64-bit processes, and syscall3264 to make a system call available to both
32- and 64-bit processes. For more information about export files, see ld Command in AIX 5L Version 5.3
Commands Reference, Volume 3.
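The keyword rule described above can be illustrated with a small user-space sketch. It is not part of the ld command or the loader: the function name classify_export_line and the export_kind enumeration are hypothetical, and the sketch assumes each line holds only a symbol name optionally followed by one keyword (it ignores any other syntax that a real export file may contain).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* How a symbol on one export-file line is made available (hypothetical names). */
enum export_kind {
    EXPORT_PLAIN_SYMBOL,  /* no keyword after the symbol name */
    EXPORT_SYSCALL32,     /* system call for 32-bit processes */
    EXPORT_SYSCALL64,     /* system call for 64-bit processes */
    EXPORT_SYSCALL3264    /* system call for both 32- and 64-bit processes */
};

/* Return nonzero if the n-character token at p is exactly the keyword kw. */
static int keyword_is(const char *p, size_t n, const char *kw)
{
    return n == strlen(kw) && strncmp(p, kw, n) == 0;
}

/* Classify one line of the assumed form "symbol [keyword]". */
enum export_kind classify_export_line(const char *line)
{
    const char *p;
    size_t n;

    assert(line != NULL);
    p = line;
    p += strspn(p, " \t");       /* skip leading blanks */
    p += strcspn(p, " \t\n");    /* skip the symbol name */
    p += strspn(p, " \t");       /* skip blanks before the keyword */
    n = strcspn(p, " \t\n");     /* length of the keyword, 0 if absent */

    /* Compare whole tokens so that syscall32 does not match syscall3264. */
    if (keyword_is(p, n, "syscall3264"))
        return EXPORT_SYSCALL3264;
    if (keyword_is(p, n, "syscall32"))
        return EXPORT_SYSCALL32;
    if (keyword_is(p, n, "syscall64"))
        return EXPORT_SYSCALL64;
    return EXPORT_PLAIN_SYMBOL;
}
```

For example, under these assumptions a line such as `my_call syscall3264` is classified as a system call for both process types, while a line containing only `my_service` is treated as an ordinary exported symbol. Both symbol names are made up for the example.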
When a new kernel extension is loaded by the sysconfig or kmod_load subroutine, any symbols
exported by the kernel extension are added to the kernel name space, and are available to all
subsequently loaded kernel extensions. Similarly, system calls exported by a kernel extension are
available to all user programs or shared objects subsequently loaded.
Note: Link-editing of a kernel extension should always be performed by using the ld command. Do not
use the compiler to create a kernel extension.
If a kernel extension depends on kernel services provided by other kernel extensions, an additional import
file must be specified when link-editing. An import file lists additional kernel services, with each service
listed on its own line. An import file must contain the line #!/unix before any services are listed. The same
file can be used both as an import file and an export file. The #!/unix line is ignored when a file is used
as an export file. For more information on import files, see ld command in AIX 5L Version 5.3 Commands
Reference, Volume 3.
The second restriction is imposed because, when they access a caller’s data, system calls with
parameters passed by reference access storage across a protection domain. The cross-domain memory
services performing these cross-memory operations support kernel processes as if they, too, accessed
storage across a protection domain. However, these services have no way to determine that the caller is in
the same protection domain when the caller is a user-mode process in kernel mode. For more information
on cross-domain memory services, see “Cross-Memory Kernel Services” on page 70.
Note: System calls must not be used by kernel extensions executing in the interrupt handler
environment.
System calls available to kernel extensions are listed in /usr/lib/kernex.imp, along with other kernel
services.
Loading Kernel Extensions: Normally, kernel extensions that provide new system calls or kernel
services only need to be loaded once. For these kernel extensions, loading should be performed by
specifying SYS_SINGLELOAD when calling the sysconfig function, or LD_SINGLELOAD when calling the
kmod_load function. If the specified kernel extension is already loaded, a second copy is not loaded.
Instead, a reference to the existing kernel extension is returned. The loader uses the specified pathname
to determine whether a kernel extension is already loaded. If multiple pathnames refer to the same kernel
extension, multiple copies can be loaded into the kernel.
If a kernel extension can support multiple instances of itself (particularly its data), it can be loaded multiple
times, by specifying SYS_KLOAD when calling the sysconfig function, or by not specifying
LD_SINGLELOAD when calling the kmod_load function. Either of these operations loads a new copy of
the kernel extension, even when one or more copies are already loaded. When this operation is used,
currently loaded routines bound to the old copy of the kernel extension continue to use the old copy.
Subsequently loaded routines use the most recently loaded copy of the kernel extension.
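The difference between the two loading modes can be sketched with a user-space model. This is not the loader's implementation: the model_loader structure, the model_load function, and the use of a table index as a stand-in for the returned reference are assumptions made for illustration. The one behavior taken from the text is that copies are matched by comparing the pathname strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_COPIES 8

/* Hypothetical table of loaded copies, keyed by the pathname used to load them. */
struct model_loader {
    const char *path[MAX_COPIES];
    int copies;
};

/*
 * Model of a load request. With single_load set (the SYS_SINGLELOAD or
 * LD_SINGLELOAD case), an existing copy loaded from the same pathname is
 * reused. Otherwise a new copy is always created.
 * Returns the index of the copy, or -1 if the table is full.
 */
int model_load(struct model_loader *ld, const char *path, int single_load)
{
    int i;

    assert(ld != NULL && path != NULL);
    if (single_load) {
        for (i = 0; i < ld->copies; i++) {
            /* Pathnames are compared as strings, not resolved to files. */
            if (strcmp(ld->path[i], path) == 0)
                return i;
        }
    }
    if (ld->copies == MAX_COPIES)
        return -1;
    ld->path[ld->copies] = path;
    return ld->copies++;
}
```

In this model, two single-load requests for the same pathname return the same copy, but a request that reaches the same file through a different pathname string creates a second copy, which mirrors the caveat about multiple pathnames.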
Unloading Kernel Extensions: Kernel extensions can be unloaded. For each kernel extension, the
loader maintains a use count and a load count. The use count indicates how many other object files have
referenced some exported symbol provided by the kernel extension. The load count indicates how many
explicit load requests have been made for each kernel extension.
When an explicit unload of a kernel extension is requested, the load count is decremented. If the load
count and the use count are both equal to 0, the kernel extension is unloaded, and the memory used by
the text and data of the kernel extension is freed.
If either the load count or use count is not equal to 0, the kernel extension is not unloaded. As processes
exit or other kernel extensions are unloaded, the use counts for referenced kernel extensions are
decremented. Even if the load and use counts become 0, the kernel extension may not be unloaded
immediately. In this case, the kernel extension’s exported symbols remain available for load-time binding
until another kernel extension is unloaded or the slibclean command is executed. At that time, the
loader unloads all modules that have both load and use counts of 0.
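The interaction of the two counts can be sketched as a small user-space model. The structure and function names are hypothetical, and the model simplifies the loader to three events: an explicit unload, a reference being dropped (for example, when a process exits), and a sweep that stands in for the cleanup triggered by another unload or by the slibclean command.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one kernel extension. */
struct model_ext {
    int load_count;  /* explicit load requests still outstanding */
    int use_count;   /* other objects referencing exported symbols */
    int resident;    /* 1 while text and data are still in memory */
};

/* Explicit unload: decrement the load count, free only if both counts are 0. */
void model_explicit_unload(struct model_ext *e)
{
    assert(e != NULL && e->load_count > 0);
    e->load_count--;
    if (e->load_count == 0 && e->use_count == 0)
        e->resident = 0;
}

/* A referencing object goes away: the count drops, but freeing is deferred. */
void model_drop_reference(struct model_ext *e)
{
    assert(e != NULL && e->use_count > 0);
    e->use_count--;
}

/* Sweep: free every resident extension whose counts are both 0. */
int model_sweep(struct model_ext *exts, int n)
{
    int i, freed = 0;

    assert(exts != NULL);
    for (i = 0; i < n; i++) {
        if (exts[i].resident && exts[i].load_count == 0 &&
            exts[i].use_count == 0) {
            exts[i].resident = 0;
            freed++;
        }
    }
    return freed;
}
```

In this model, an extension that is explicitly unloaded while still referenced stays resident. When the last reference is dropped, it remains resident with both counts at 0 until the next sweep frees it.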
Kernel extensions can also consist of several separately link-edited modules. This is particularly useful for
device drivers, where a kernel extension contains the top (pageable) half of the driver and a dependent
module contains the bottom (pinned) half of the driver. Using a dependent module also makes sense when
several kernel extensions use common routines. In both cases, the symbols exported by the dependent
modules are not added to the global kernel name space. Instead, these symbols are only available to the
kernel extension being loaded.
When link-editing a kernel extension that depends on another module, an import file should be specified
listing the symbols exported by the dependent module. Before any symbols are listed, the import file
should contain one of the following lines:
#! path/file
or
#! path/file(member)
While a kernel extension is being loaded, any dependent modules are only loaded a single time. This
allows modules to depend on each other in a complicated way, without causing multiple instances of a
module to be loaded.
Note: The loader uses the pathname of a module to determine whether it has already been loaded.
Another copy of the module can be loaded if different path names are used for the same module.
The symbols exported by dependent modules are not added to the kernel name space. These symbols
can only be used by a kernel extension and its other dependent modules. If another kernel extension is
loaded that uses the same dependent modules, these dependent modules will be loaded a second time.
This special treatment of archives only applies to an explicitly loaded kernel extension. If a kernel
extension depends on a member of another archive, the kernel extension must be link-edited with an
import file that specifies the member name.
Using Libraries
The operating system provides the following two libraries that can be used by kernel extensions:
v libcsys.a
v libsys.a
libcsys Library
The libcsys.a library contains a subset of subroutines found in the user-mode libc.a library that can be
used by kernel extensions. When using any of these routines, the header file /usr/include/sys/libcsys.h
should be included to obtain function prototypes, instead of the application header files, such as
/usr/include/string.h or /usr/include/stdio.h. The following routines are included in libcsys.a:
v atoi
v bcmp
v bcopy
v bzero
v memccpy
v memchr
v memcmp
v memcpy
v memmove
v memset
v ovbcopy
v strcat
v strchr
v strcmp
v strcpy
Note: In addition to these explicit subroutines, some additional functions are defined in libcsys.a. All
kernel extensions should be linked with libcsys.a by specifying -lcsys at link-edit time. The
library libc.a is intended for user-level code only. Do not link-edit kernel extensions with the -lc
flag.
libsys Library
The libsys.a library provides the following set of kernel services:
v d_align
v d_roundup
v timeout
v timeoutcf
v untimeout
When using these services, specify the -lsys flag at link-edit time.
User-provided Libraries
To simplify the development of kernel extensions, you can choose to split a kernel extension into
separately loadable modules. These modules can be used when linking kernel extensions in the same way
that they are used when developing user-level applications and shared objects. In particular, a kernel
module can be created as a shared object by linking with the -bM:SRE flag. The shared object can then
be used as an input file when linking a kernel extension. In addition, shared objects can be put into an
archive file, and the archive file can be listed on the command line when linking a kernel extension. In both
cases, the shared object will be loaded as a dependent module when the kernel extension is loaded.
A kernel extension runs in the process environment when invoked either by a user process in kernel mode
or by a kernel process. A kernel extension is executing in the interrupt environment when invoked as part
of an interrupt handler.
A kernel extension can determine in which environment it is called to run by calling the getpid or
thread_self kernel service. These services respectively return the process or thread identifier of the
current process or thread, or a value of -1 if called in the interrupt environment. Some kernel services can
be called in both environments, whereas others can only be called in the process environment.
A routine running in the process environment can sleep or be interrupted by routines executing in the
interrupt environment. A kernel routine that runs on behalf of a user-mode process can only invoke system
calls that have no parameters passed by reference. A kernel process, however, can use all system calls
listed in the System Calls Available to Kernel Extensions if necessary.
Interrupt Environment
A routine runs in the interrupt environment when called on behalf of an interrupt handler. A kernel routine
executing in this environment cannot request data that has been paged out of memory and therefore
cannot cause page faults by accessing pageable code or data. In addition, the kernel routine has a stack
of limited size, is not subject to replacement by another process, and cannot perform any function that
would cause it to sleep.
A routine in this environment is only interruptible either by interrupts that have priority more favored than
the current priority or by exceptions. These routines cannot use system calls and can use only kernel
services available in both the process and interrupt environments.
A process in kernel mode can also put itself into an environment similar to the interrupt environment. This
action, occurring when the interrupt priority is changed to a priority more favored than INTBASE, can be
accomplished by calling the i_disable or disable_lock kernel service. A kernel-mode process is
sometimes required to do this to serialize access to a resource shared by a routine executing in the
interrupt environment. When this is the case, the process operates under most of the same restrictions as
a routine executing in the interrupt environment. However, the process can sleep, wait, and use locking
kernel services (e_sleep, e_wait, e_sleepl, et_wait, lockl, and unlockl) if the event word or lock word is
pinned.
Routines executed in this environment can adversely affect system real-time performance and are
therefore limited to a specific maximum path length. Guidelines for the maximum path length are
determined by the interrupt priority at which the routines are executed. Understanding Interrupts provides
more information.
One process can have multiple threads, with each thread executing different code concurrently, while
sharing data and synchronizing much more easily than cooperating processes. Threads require fewer
system resources than processes, and can start more quickly.
Although threads can be scheduled, they exist in the context of their process. The following list indicates
what is managed at process level and shared among all threads within a process:
v Address space
v System resources, like files or terminals
v Signal list of actions.
The process remains the swappable entity. Only a few resources are managed at thread level, as
indicated in the following list:
A kernel thread is a kernel entity, like processes and interrupt handlers; it is the entity handled by the
system scheduler. A kernel thread runs in user mode environment when executing user functions or library
calls; it switches to kernel mode environment when executing system calls.
A kernel-only thread is a kernel thread that executes only in kernel mode environment. Kernel-only threads
are controlled by the kernel mode environment programmer through kernel services.
User mode programs can access user threads through a library (such as the libpthreads.a threads
library). User threads are part of a portable programming model. User threads are mapped to kernel
threads by the threads library, in an implementation dependent manner. The threads library uses a
proprietary interface to handle kernel threads. See Understanding Threads in AIX 5L Version 5.3 General
Programming Concepts: Writing and Debugging Programs for detailed information about the user
threads library and its implementation.
All threads discussed in this article are kernel threads, and the information applies only to the kernel mode
environment. Kernel threads cannot be accessed from the user mode environment, except through the
threads library.
These structures cannot be accessed directly by kernel extensions and device drivers. They are
encapsulated for portability reasons. Many fields that were previously in the user structure are now in the
uthread structure.
Other threads can be created, using a two-step procedure. The thread_create kernel service allocates
and initializes a new thread, and sets its state to idle. The kthread_start kernel service then starts the
thread, using the specified entry point routine.
A thread is terminated when it executes a return from its entry point, or when it calls the thread_terminate
kernel service. Its resources are automatically freed. If it is the last thread in the process, the process
ends.
Scheduling parameters can be changed using the thread_setsched kernel service. The process-oriented
setpri system call sets the priority of all the threads within a process. The process-oriented getpri system
call gets the priority of a thread in the process. The scheduling policy and priority of an individual thread
can be retrieved from the ti_policy and ti_pri fields of the thrdsinfo structure returned by the getthrds
system call.
The signal mask of a thread is handled by the limit_sigs and sigsetmask kernel services. The
kthread_kill kernel service can be used to direct a signal to a particular thread.
In the kernel environment, when a signal is received, no action is taken (no termination or handler
invocation), even for the SIGKILL signal. A thread in kernel environment, especially kernel-only threads,
must poll for signals so that signals can be delivered. Polling ensures the proper kernel-mode serialization.
For example, SIGKILL will not be delivered to a kernel-only thread that does not poll for signals.
Therefore, SIGKILL is not necessarily an effective means for terminating a kernel-only thread.
Signals whose actions are applied at generation time (rather than delivery time) have the same effect
regardless of whether the target is in kernel or user mode. A kernel-only thread can poll for unmasked
signals that are waiting to be delivered by calling the sig_chk kernel service. This service returns the
signal number of a pending signal that was not blocked or ignored. The thread then uses the signal
number to determine which action should be taken. The kernel does not automatically call signal handlers
for a thread in kernel mode as it does for user mode.
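The polling pattern can be sketched in user space as follows. This is a model, not kernel code: the sig_poll_fn callback stands in for the sig_chk kernel service, and run_worker, scripted_signal, and scripted_poll are hypothetical names introduced only for this example. The sketch assumes, as the text states, that the poll returns the number of a pending signal and that a return of 0 means nothing is pending.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Stand-in for sig_chk: returns 0 if nothing is pending, else a signal number. */
typedef int (*sig_poll_fn)(void *ctx);

/*
 * Hypothetical worker loop for a kernel-only thread. It polls before each
 * unit of work, because nothing else will interrupt it when a signal arrives.
 * Returns the signal that stopped the loop, or 0 if all units completed.
 */
int run_worker(sig_poll_fn poll, void *ctx, int units, int *units_done)
{
    int i;

    assert(poll != NULL && units_done != NULL);
    *units_done = 0;
    for (i = 0; i < units; i++) {
        int sig = poll(ctx);
        if (sig != 0)
            return sig;      /* the thread itself decides what to do next */
        (*units_done)++;     /* placeholder for one unit of real work */
    }
    return 0;
}

/* Scripted signal source used only to drive the demonstration. */
struct scripted_signal {
    int polls_before_signal; /* polls that report "nothing pending" */
    int signo;               /* signal reported afterward */
};

int scripted_poll(void *ctx)
{
    struct scripted_signal *s = ctx;

    assert(s != NULL);
    if (s->polls_before_signal > 0) {
        s->polls_before_signal--;
        return 0;
    }
    return s->signo;
}
```

A worker that never calls the poll function would run all of its units regardless of what is pending, which is why SIGKILL alone cannot be relied on to stop a kernel-only thread.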
See “Kernel Process Signal and Exception Handling” on page 11 for more information about signal
handling at process level.
A kernel process directly controls the kernel threads. Because kernel processes are always in the kernel
protection domain, threads within a kernel process are kernel-only threads. For more information on kernel
threads, see “Understanding Kernel Threads” on page 6.
A kernel process inherits the environment of its parent process (the one calling the creatp kernel service
to create it), but with some exceptions. The kernel process does not have a root directory or a current
directory when initialized. All uses of the file system functions must specify absolute path names.
Kernel processes created during phase 1 of system boot must not keep any long-term opens on files until
phase 2 of system boot or run time has been reached. This is because Base Operating System changes
root file systems between phase 1 and phase 2 of system boot. As a result, the system crashes if any files
are open at root file system transition time.
Cross-Memory Services
Kernel processes must be provided with a valid cross-memory descriptor to access address regions
outside the kernel global address space or kernel process address space. For example, if a kernel process
is to access data from a user-mode process, the system call that uses the kernel process must obtain a
cross-memory descriptor for the user-mode region to be accessed. Calling the xmattach or xmattach64
kernel service provides a descriptor that can then be made available to the kernel process.
The kernel process should then call the xmemin and xmemout kernel services to access the targeted
cross-memory data area. When the kernel process has completed its operation on the memory area, the
cross-memory descriptor must be detached by using the xmdetach kernel service.
The process is created with one kernel-only thread, called the initial thread. See Understanding Kernel
Threads to get more information about threads.
After the initp kernel service has completed the process initialization, the initial thread is placed on the run
queue. On the first dispatch of the newly initialized kernel process, it begins execution at the entry point
previously supplied to the initp kernel service. The initialization parameters were previously specified in
the call to the initp kernel service.
A kernel process terminates when it executes a return from its main entry routine. A process should never
exit without both freeing all dynamically allocated storage and releasing all locks owned by the kernel
process.
When kernel processes exit, the parent process (the one calling the creatp and initp kernel services to
create the kernel process) receives the SIGCHLD signal, which indicates the end of a child process.
However, it is sometimes undesirable for the parent process to receive the SIGCHLD signal due to ending
a process. In this case, the kproc can call the setpinit kernel service to designate the init process as its
parent. The init process cleans up the state of all its child processes that have become zombie processes.
A kernel process can also issue the setsid subroutine call to change its session. Signals and job control
affecting the parent process session do not affect the kernel process.
The kernel process can use the locking kernel services to serialize access to critical data structures. This
use of locks does not guarantee that the process will not be replaced, but it does ensure that another
process trying to acquire the lock waits until the kernel process owning the lock has released it.
Kernel processes must ensure that their maximum path lengths adhere to the specifications for interrupt
handlers when executing at an interrupt priority more favored than INTBASE. This ensures that system
real-time performance is not degraded.
Signals whose action is applied at generation time (rather than delivery time) have the same effect
regardless of whether the target is a kernel or user process. A kernel process can poll for unmasked
signals that are waiting to be delivered by calling the sig_chk kernel service. This service returns the
signal number of a pending signal that was not blocked or ignored. The kernel process then uses the
signal number to determine which action should be taken. The kernel does not automatically call signal
handlers for a kernel process as it does for user processes.
A kernel process should also use the exception-catching facilities (setjmpx and clrjmpx) available in
kernel mode to handle exceptions that can be caused during run time of the kernel process. Exceptions
received during the execution of a kernel process are handled the same as exceptions that occur in any
kernel-mode routine.
Unhandled exceptions that occur in kernel mode (in any user process while in kernel mode, in an interrupt
handler, or in a kernel process) result in a system crash. To avoid crashing the system due to unhandled
exceptions, kernel routines should use the setjmpx, clrjmpx, and longjmpx kernel services to handle
exceptions that might occur during run time. See “Understanding Exception Handling” on page 14
for more details on handling exceptions.
However, the error information returned from a kernel process system call must be accessed differently
than for a user process. A kernel process must use the getuerror kernel service to retrieve the system call
error information normally provided in the errno global variable for user-mode processes. In addition, the
kernel process can use the setuerror kernel service to set the error information to 0 before calling the
system call. The return code from the system call is handled the same for all processes.
Kernel processes can use only a restricted set of the base system calls. “System Calls Available to Kernel
Extensions” on page 35 lists system calls available to kernel processes.
Additional kernel services allow data transfer between user mode and kernel mode when a uio structure is
used, thereby describing the user-mode data area to be accessed. On the 32-bit kernel, all addresses
passed into these services, except for services whose names end in “64”, must be remapped. Following is a
list of services that typically are used between the file system and device drivers to perform device I/O:
The services ending in “64” are not supported in the 64-bit kernel, since all pointers are already 64 bits
wide. The services without the “64” can be used instead. To allow common source code to be used,
macros are provided in the sys/uio.h header file that redefine these special services to their general
counterparts when compiling in 64-bit mode.
In these circumstances, the kernel cross-memory services are required to provide the necessary access.
The xmattach and xmattach64 kernel services allow a cross-memory descriptor to be obtained for the
data area to be accessed. These services must be called in the process environment of the process
containing the data area.
Understanding Locking
The following information is provided to assist you in understanding locking.
Lockl Locks
The lockl locks (previously called conventional locks) are provided for compatibility only and should not be
used in new code: simple or complex locks should be used instead. These locks are used to protect a
critical section of code which accesses a resource such as a data structure or device, serializing access to
the resource. Every thread which accesses the resource must acquire the lock first, and release the lock
when finished.
A conventional lock has two states: locked or unlocked. In the locked state, a thread is currently executing
code in the critical section, and accessing the resource associated with the conventional lock. The thread
is considered to be the owner of the conventional lock. No other thread can lock the conventional lock
(and therefore enter the critical section) until the owner unlocks it; any thread attempting to do so must
wait until the lock is free. In the unlocked state, there are no threads accessing the resource or owning the
conventional lock.
Lockl locks are recursive and, unlike simple and complex locks, can be awakened by a signal.
Simple Locks
A simple lock provides exclusive-write access to a resource such as a data structure or device. Simple
locks are not recursive and have only two states: locked or unlocked.
Complex Locks
A complex lock can provide either shared or exclusive access to a resource such as a data structure or
device. Complex locks are not recursive by default (but can be made recursive) and have three states:
exclusive-write, shared-read, or unlocked.
If several threads perform read operations on the resource, they must first acquire the corresponding lock
in shared-read mode. Because no threads are updating the resource, it is safe for all to read it. Any thread
which writes to the resource must first acquire the lock in exclusive-write mode. This guarantees that no
other thread will read or write the resource while it is being updated.
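The three states and the rules for moving between them can be sketched as a user-space model. This is not the kernel's complex lock: the names are hypothetical, and the try-style functions return immediately with 0 where a real lock request would make the caller wait.

```c
#include <assert.h>
#include <stddef.h>

/* The three states named in the text (hypothetical identifiers). */
enum cl_state { CL_UNLOCKED, CL_SHARED_READ, CL_EXCLUSIVE_WRITE };

struct model_complex_lock {
    int readers;  /* number of holders in shared-read mode */
    int writer;   /* 1 while held in exclusive-write mode */
};

enum cl_state model_state(const struct model_complex_lock *l)
{
    assert(l != NULL);
    if (l->writer)
        return CL_EXCLUSIVE_WRITE;
    if (l->readers > 0)
        return CL_SHARED_READ;
    return CL_UNLOCKED;
}

/* Shared-read succeeds unless a writer holds the lock. Returns 1 on success. */
int model_try_read(struct model_complex_lock *l)
{
    if (model_state(l) == CL_EXCLUSIVE_WRITE)
        return 0;
    l->readers++;
    return 1;
}

/* Exclusive-write succeeds only when nobody holds the lock. */
int model_try_write(struct model_complex_lock *l)
{
    if (model_state(l) != CL_UNLOCKED)
        return 0;
    l->writer = 1;
    return 1;
}

void model_done_read(struct model_complex_lock *l)
{
    assert(l != NULL && l->readers > 0);
    l->readers--;
}

void model_done_write(struct model_complex_lock *l)
{
    assert(l != NULL && l->writer);
    l->writer = 0;
}
```

In the model, any number of readers can hold the lock at once, a writer is refused while any reader remains, and a reader is refused while the writer holds the lock.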
thread-thread: These critical sections must be protected (by using the locking kernel services) from
concurrent execution by multiple processes or threads.
thread-interrupt: These critical sections must be protected (by using the disable_lock and
unlock_enable kernel services) from concurrent execution by an interrupt handler
and a thread or process.
A hierarchy of locks exists. This hierarchy is imposed by software convention, but is not enforced by the
system. The lockl kernel_lock variable, which is the global kernel lock, has the coarsest granularity.
Other types of locks have finer granularity. The following list shows the ordering of locks based on
granularity:
v The kernel_lock global kernel lock
Note: Avoid using the kernel_lock global kernel lock variable in new code. It is only included for
compatibility purposes.
v File system locks (private to file systems)
v Device driver locks (private to device drivers)
v Private fine-granularity locks
Locks should generally be released in the reverse order from which they were acquired; all locks must be
released before a kernel process or thread exits. Kernel mode processes do not receive any signals while
they hold any lock.
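Because the hierarchy is a software convention that the system does not enforce, a debugging aid can check it. The following user-space sketch is one possible checker, not an AIX facility. All names are hypothetical, and the model makes two assumptions that are stricter than the text: a lock may be acquired only if it is finer-grained than the most recently acquired one, and releases must occur in exactly the reverse order.

```c
#include <assert.h>
#include <stddef.h>

/* Granularity levels from the list above, coarsest first. */
enum lock_level {
    LVL_KERNEL_LOCK = 0,
    LVL_FILE_SYSTEM = 1,
    LVL_DEVICE_DRIVER = 2,
    LVL_PRIVATE = 3
};

#define MAX_HELD 4

/* Locks currently held by one thread, in acquisition order. */
struct held_locks {
    enum lock_level level[MAX_HELD];
    int count;
};

/* Record an acquisition. Returns 1 if it respects the ordering, else 0. */
int order_acquire(struct held_locks *h, enum lock_level lvl)
{
    assert(h != NULL);
    if (h->count == MAX_HELD)
        return 0;
    if (h->count > 0 && lvl <= h->level[h->count - 1])
        return 0;  /* not finer than the lock acquired last */
    h->level[h->count++] = lvl;
    return 1;
}

/* Record a release. Returns 1 only if lvl was the most recent acquisition. */
int order_release(struct held_locks *h, enum lock_level lvl)
{
    assert(h != NULL);
    if (h->count == 0 || h->level[h->count - 1] != lvl)
        return 0;
    h->count--;
    return 1;
}

/* A thread may exit only after it has released every lock. */
int order_may_exit(const struct held_locks *h)
{
    assert(h != NULL);
    return h->count == 0;
}
```

In this model, a thread that holds a file system lock and a private lock is refused a device driver lock, must release the private lock first, and may exit only after both locks are released.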
The computer hardware generally uses the same mechanism to report both interrupts and exceptions. The
machine saves and modifies some of the event’s state and forces a branch to a particular location. When
decoding the reason for the machine interrupt, the interrupt handler determines whether the event is an
interrupt or an exception, then processes the event accordingly.
Exception Processing
When an exception occurs, the current instruction stream cannot continue. If you ignore the exception, the
results of executing the instruction may become undefined. Further execution of the program may cause
unpredictable results. The kernel provides a default exception-handling mechanism by which an instruction
stream (a process- or interrupt-level program) can specify what action is to be taken when an exception
occurs. Exceptions are handled differently depending on whether they occurred while executing in kernel
mode or user mode.
This stacking mechanism allows the execution point and context of a process or interrupt handler to be
restored to a point in the process or interrupt handler, at the point of return from the setjmpx kernel
service. When execution returns to this point, the return code from setjmpx kernel service indicates the
type of exception that occurred so that the process or interrupt handler state can be fully restored.
Appropriate retry or recovery operations are then invoked by the software performing the operation.
When an exception occurs, the kernel first-level exception handler gets control. The first-level exception
handler determines what type of exception has occurred and saves information necessary for handling the
specific type of exception. For an I/O exception, the first-level handler also re-enables programmed
I/O operations.
The first-level exception handler then modifies the saved context of the interrupted process or interrupt
handler. It does so to execute the longjmpx kernel service when the first-level exception handler returns
to the interrupted process or interrupt handler.
The longjmpx kernel service executes in the environment of the code that caused the exception and
restores the current context from the topmost jump buffer on the stack of saved contexts. As a result, the
state of the process or interrupt handler that caused the exception is restored to the point of the return
from the setjmpx kernel service. (The return code, nevertheless, indicates that an exception has
occurred.)
The process or interrupt handler software should then check the return code and invoke exception
handling code to restore fully the state of the process or interrupt handler. Additional information about the
exception can be obtained by using the getexcept kernel service.
When a kernel routine that has established an exception handler completes normally, it must remove its
exception handler from the stack (by using the clrjmpx kernel service) before returning to its caller.
Note: When the longjmpx kernel service invokes an exception handler, that handler’s entry is
automatically removed from the stack.
This state can be restored later by calling the longjmpx kernel service, which accomplishes the following
tasks:
v Reloads the registers (including TOC and stack pointers)
v Enables or disables to the correct interrupt level
v Conditionally acquires or releases the kernel-mode lock
v Forces a branch back to the point of original return from the setjmpx kernel service
The setjmpx kernel service takes the address of a jump buffer (a label_t structure) as an explicit
parameter. This structure can be defined anywhere including on the stack (as an automatic variable). After
noting the state data in the jump buffer, the setjmpx kernel service pushes the buffer onto the top of a
stack that is maintained in the machine-state save structure.
The longjmpx kernel service is used to return to the point in the code at which the setjmpx kernel service
was called. Specifically, the longjmpx kernel service returns to the most recently created jump buffer, as
indicated by the top of the stack anchored in the machine-state save structure.
The parameter to the longjmpx kernel service is an exception code that is passed to the resumed
program as the return code from the setjmpx kernel service. The resumed program tests this code to
determine the conditions under which the setjmpx kernel service is returning. If the setjmpx kernel
service has just saved its jump buffer, the return code is 0. If an exception has occurred, the program is
entered by a call to the longjmpx kernel service, which passes along a return code that is not equal to 0.
Note: Only the resources listed here are saved by the setjmpx kernel service and restored by the
longjmpx kernel service. Other resources, in particular segment registers, are not restored. A call
to the longjmpx kernel service, by definition, returns to an earlier point in the program. The
program code must free any resources that are allocated between the call to the setjmpx kernel
service and the call to the longjmpx kernel service.
If the exception handler stack is empty when the longjmpx kernel service is issued, there is no place to
jump to and the system default exception-handling mechanism is used. If the stack is not empty, the
context that is defined by the topmost jump buffer is reloaded and resumed. The topmost buffer is then
removed from the stack.
The clrjmpx kernel service removes the top element from the stack as placed there by the setjmpx kernel
service. The caller to the clrjmpx kernel service is expected to know exactly which jump buffer is being
removed. This should have been established earlier in the code by a call to the setjmpx kernel service.
Accordingly, the address of the buffer is required as a parameter to the clrjmpx kernel service. It can then
perform consistency checking by asserting that the address passed is indeed the address of the top stack
element.
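The interplay of the three services can be modeled in user space. The following is a minimal sketch, not kernel code. It uses the standard C setjmp and longjmp routines, and the jump_entry structure and the push_entry, clear_entry, raise_exception, and guarded_operation functions are hypothetical stand-ins for label_t, setjmpx, clrjmpx, and longjmpx. The exception code travels in a global variable only to keep the sketch simple.

```c
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stand-in for label_t: a saved context plus a stack link. */
struct jump_entry {
    jmp_buf env;
    struct jump_entry *prev;
};

static struct jump_entry *top;  /* top of the modeled jump-buffer stack */
static int pending_code;        /* exception code carried to the handler */

/* Models the push done by setjmpx; the caller then saves context with setjmp. */
void push_entry(struct jump_entry *e)
{
    e->prev = top;
    top = e;
}

/* Models clrjmpx: succeeds only if e really is the top element. */
int clear_entry(struct jump_entry *e)
{
    if (top != e)
        return -1;
    top = e->prev;
    return 0;
}

/* Number of entries currently on the modeled stack. */
int stack_depth(void)
{
    int n = 0;

    for (struct jump_entry *e = top; e != NULL; e = e->prev)
        n++;
    return n;
}

/* Models longjmpx: pops the topmost entry and resumes it.
 * Returns -1 only when the stack is empty, which is where the kernel
 * would fall back to its default exception handling. */
int raise_exception(int code)
{
    struct jump_entry *e = top;

    if (e == NULL)
        return -1;
    top = e->prev;
    pending_code = code;
    longjmp(e->env, 1);
    return 0;   /* not reached */
}

/* Returns 0 on normal completion, or the exception code that was raised. */
int guarded_operation(int fail_code)
{
    struct jump_entry entry;

    push_entry(&entry);
    if (setjmp(entry.env) != 0)
        return pending_code;    /* entry was already popped by raise_exception */

    if (fail_code != 0)
        raise_exception(fail_code);

    clear_entry(&entry);        /* normal path: remove our own entry */
    return 0;
}
```

In this model the exception path never calls clear_entry, because raise_exception has already removed the entry before resuming it. The normal path removes the entry explicitly.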
Note: An interrupt handler context is newly created each time the interrupt handler is invoked. As a result,
exception handlers for interrupt handlers must be registered (by calling the setjmpx kernel service)
each time the interrupt handler is invoked. Otherwise, an exception detected during execution of the
interrupt handler will be handled by the default handler.
If, on the other hand, an exception does occur (that is, the return code from setjmpx kernel service is
nonzero), the jump buffer is automatically removed from the list of jump buffers. In this case, a call to the
clrjmpx kernel service for the jump buffer must not be performed.
Care must also be taken in defining variables that are used after the context save (the call to the setjmpx
service), and then again by the exception handler. Sensitive variables of this nature must be restored to
their correct value by the exception handler when an exception occurs.
Note: If the last value of the variable is desired at exception time, the variable data type must be
declared as "volatile."
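The same rule applies to the standard C setjmp and longjmp routines, which makes it easy to illustrate outside the kernel. The sketch below is a user-space analogy only. The progress_at_exception and fail names are hypothetical, and the assumption is that setjmpx imposes the same constraint on automatic variables as the note describes.

```c
#include <setjmp.h>

static jmp_buf env;

/* Stand-in for an operation that raises an exception. */
static void fail(void)
{
    longjmp(env, 1);
}

/* Returns the last step reached before the exception was raised. */
int progress_at_exception(void)
{
    volatile int step = 0;  /* volatile: the stored value must survive the jump */

    if (setjmp(env) != 0)
        return step;        /* sees the last value written before the jump */

    step = 1;
    step = 2;
    fail();
    return -1;              /* not reached */
}
```

Without the volatile qualifier, the compiler is free to keep step in a register, and the value observed after the jump is indeterminate.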
Exception handling is concluded in one of two ways. Either a registered exception handler handles the
exception and continues from the saved context, or the default exception handler is reached by exhausting
the stack of jump buffers.
Exception Codes
The /usr/include/sys/except.h file contains a list of code numbers corresponding to the various types of
hardware exceptions. When an exception handler is invoked (the return from the setjmpx kernel service is
not equal to 0), it is the responsibility of the handler to test the code to ensure that the exception is one
the routine can handle. If it is not an expected code, the exception handler must:
v Release any resources that would not otherwise be freed (buffers, segment registers, storage acquired
using the xmalloc routines)
v Call the longjmpx kernel service, passing it the exception code as a parameter
Thus, when an exception handler does not recognize the exception for which it has been invoked, it
passes the exception on to the next most recent exception handler. This continues until an exception
handler is reached that recognizes the code and can handle it. Eventually, if no exception handler can
handle the exception, the stack is exhausted and the system default action is taken.
In this manner, a component can allocate resources (after calling the setjmpx kernel service to establish
an exception handler) and be assured that the resources will later be released. This ensures the exception
handler gets a chance to release those resources regardless of what events occur before the instruction
stream (a process- or interrupt-level code) is terminated.
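The pass-along behavior can be sketched in user space with standard C setjmp and longjmp. This is an illustrative model under stated assumptions, not kernel code: the EXC_INNER and EXC_OUTER codes, the jump_entry structure, and the push_entry, clear_entry, raise_exception, inner, and outer functions are all hypothetical. malloc and free stand in for kernel storage services, and abort stands in for the system default action.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical exception codes, chosen only for this sketch. */
enum { EXC_INNER = 1, EXC_OUTER = 2 };

struct jump_entry {
    jmp_buf env;
    struct jump_entry *prev;
};

static struct jump_entry *top;  /* modeled stack of exception handlers */
static int pending_code;        /* code passed to the resumed handler */
int live_buffers;               /* buffers allocated and not yet freed */

void push_entry(struct jump_entry *e) { e->prev = top; top = e; }

void clear_entry(struct jump_entry *e) { if (top == e) top = e->prev; }

/* Models longjmpx: resume the most recent handler, or give up. */
void raise_exception(int code)
{
    struct jump_entry *e = top;

    if (e == NULL)
        abort();                /* stand-in for the system default action */
    top = e->prev;
    pending_code = code;
    longjmp(e->env, 1);
}

/* Handles EXC_INNER only; any other code is cleaned up after and passed on. */
int inner(int code_to_raise)
{
    struct jump_entry entry;
    void *volatile buf = NULL;  /* volatile: read again after the jump */

    push_entry(&entry);
    if (setjmp(entry.env) != 0) {
        if (buf != NULL) {      /* release what this routine acquired */
            free(buf);
            live_buffers--;
        }
        if (pending_code == EXC_INNER)
            return -1;          /* recognized: handled here */
        raise_exception(pending_code);  /* not recognized: pass it on */
    }

    buf = malloc(64);
    if (buf != NULL)
        live_buffers++;
    if (code_to_raise != 0)
        raise_exception(code_to_raise);

    if (buf != NULL) {
        free(buf);
        live_buffers--;
    }
    clear_entry(&entry);
    return 0;
}

/* Handles EXC_OUTER only. */
int outer(int code_to_raise)
{
    struct jump_entry entry;
    int rc;

    push_entry(&entry);
    if (setjmp(entry.env) != 0) {
        if (pending_code == EXC_OUTER)
            return -10;         /* recognized: handled here */
        raise_exception(pending_code);
    }
    rc = inner(code_to_raise);
    clear_entry(&entry);
    return rc;
}
```

Calling outer(EXC_OUTER) in this model first enters the handler in inner, which frees its buffer and raises the code again. The handler in outer then recognizes the code, and no buffer is leaked.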
By coding the exception handler to recognize what exception codes it can process rather than encoding
this knowledge in the stack entries, a powerful and simple-to-use mechanism is created. Each handler
Exceptions generated by hardware use one of the codes in the /usr/include/sys/except.h file. However,
the longjmpx kernel service can be invoked by any kernel component, and any integer can serve as the
exception code. A mechanism similar to the old-style setjmp and longjmp kernel services can be
implemented on top of the setjmpx/longjmpx stack by using exception codes outside the range of those
used for hardware exceptions.
To implement this old-style mechanism, a unique set of exception codes is needed. These codes must not
conflict with either the pre-assigned hardware codes or codes used by any other component. A simple way
to get such codes is to use the addresses of unique objects as code values.
For example, a program that establishes an exception handler might compare the exception code to the
address of its own entry point. Later on in the calling sequence, after any number of intervening calls to
the setjmpx kernel service by other programs, a program can issue a call to the longjmpx kernel service
and pass the address of the agreed-on function descriptor as the code. This code is only recognized by a
single exception handler. All the intervening ones just clean up their resources and pass the code to the
longjmpx kernel service again.
Addresses of functions are not the only possibilities for unique code numbers. For example, addresses of
external variables can also be used. By using unique, system-wide addresses, the problem of code-space
collision is transformed into a problem of external-name collision. This problem is easier to solve, and is
routinely solved whenever the system is built. By comparison, pre-assigning exception numbers by using
#define statements in a header file is a much more cumbersome and error-prone method.
When a SLIH determines that a hardware interrupt should actually be considered a synchronous
exception, it sets up the machine-state save to invoke the longjmpx kernel service, and then returns. The
FLIH then resumes the instruction stream at the entry to the longjmpx service.
The longjmpx service then invokes the top exception handler on the stack or takes the system default
action as previously described.
The uexadd and uexdel kernel services allow system-wide user-mode exception handlers to be added
and removed.
The most recently registered exception handler is the first called. If it cannot handle the exception, the next
most recent handler on the list is called, and this second handler attempts to handle the exception. If this
attempt fails, successive handlers are tried, until the default handler is called, which generates the signal.
Additional information about the exception can be obtained by using the getexcept kernel service.
System calls can be made available to 32- or 64-bit processes, selectively. If an application invokes a
system call that is not exported to processes running in the current mode, the call will fail.
A 32-bit kernel extension that supports 64-bit applications on AIX 4.3 cannot be used to support 64-bit
applications on AIX 5.1 and beyond, because of a potential incompatibility with data types. Therefore, one
of the following three techniques must be used to indicate that a 32-bit kernel extension can be used with
64-bit applications:
v The module type of the kernel extension module can be set to LT, using the ld command with the
-bM:LT flag
v If sysconfig is used to load a kernel extension, the SYS_64L flag can be logically ORed with the
SYS_SINGLELOAD or SYS_KLOAD requests.
v If kmod_load is used to load a kernel extension, the LD_64L flag can be specified
If none of these techniques is used, a kernel extension will still load, but 64-bit programs with calls to one
of the exported system calls will not execute.
The first aspect is the use of kernel services for working with the 64-bit user address space. The 64-bit
services for examining and manipulating the 64-bit address space are as_att64, as_det64, as_geth64,
as_puth64, as_seth64, and as_getsrval64. The services for copying data to or from 64-bit address
spaces are copyin64, copyout64, copyinstr64, fubyte64, fuword64, subyte64, and suword64. The
service for doing cross-memory attaches to memory in a 64-bit address space is xmattach64. The
services for creating real memory mappings are rmmap_create64 and rmmap_remove64. The major
difference between all these services and their 32-bit counterparts is that they use 64-bit user addresses
rather than 32-bit user addresses.
The service for determining whether a process (and its address space) is 32-bit or 64-bit is IS64U.
The second aspect of supporting 64-bit applications on the 32-bit kernel is taking 64-bit user data pointers
and using the pointers directly or transforming 64-bit pointers into 32-bit pointers which can be used in the
kernel. If the types of the parameters passed to a system call are all 32 bits or smaller when compiled in
64-bit mode, no additional work is required. However, if 64-bit data, long or pointers, are passed to a
system call, the function must reconstruct the full 64-bit values.
When a 64-bit process makes a system call in the 32-bit kernel, the system call handler saves the
high-order 32 bits of each parameter and converts the parameters to 32-bit values. If the full 64-bit value is
needed, the get64bitparm service should be called. This service converts a 32-bit parameter and a
0-based parameter number into a 64-bit long long value.
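The arithmetic behind this reconstruction can be sketched as follows. This is a user-space model only: the saved_high array and the truncate_parm and rebuild_parm functions are hypothetical stand-ins for the system call handler's save area and for the get64bitparm service, whose internal implementation is not described here.

```c
#include <stdint.h>

#define MAX_PARMS 8     /* system call parameters must fit in eight registers */

/* Hypothetical save area for the high-order word of each parameter. */
static uint32_t saved_high[MAX_PARMS];

/* Models the system call handler: keep the high-order 32 bits aside
 * and hand the low-order 32 bits to the system call function. */
uint32_t truncate_parm(uint64_t full, int parmnum)
{
    saved_high[parmnum] = (uint32_t)(full >> 32);
    return (uint32_t)full;
}

/* Models get64bitparm: combine a 32-bit parameter and its 0-based
 * parameter number into the original 64-bit value. */
uint64_t rebuild_parm(uint32_t low, int parmnum)
{
    return ((uint64_t)saved_high[parmnum] << 32) | low;
}
```

The parameter number matters because each parameter has its own saved high-order word. Passing the wrong number would pair the low-order word with the wrong high-order word.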
These 64-bit values can be manipulated directly by using services such as copyin64, or mapped to a
32-bit value, by calling as_remap64. In this way, much of the kernel does not have to deal with 64-bit
addresses. Services such as copyin will correctly transform a 32-bit value back into a 64-bit value before
referencing user space.
It is also possible to obtain the 64-bit value from a 32-bit pointer by calling as_unremap64. Both
as_remap64 and as_unremap64 are prototyped in /usr/include/sys/remap.h.
In order to port an existing 32-bit kernel extension to the 64-bit kernel environment, source code must be
modified to be type-safe under LP64. This means ensuring that data types are used in a consistent
fashion. Source code is incorrect for the 64-bit environment if it assumes that pointers, long, and int are
all the same size.
In addition, the use of system-derived types must be examined whenever values are passed from an
application to the kernel. For example, size_t is a system-derived type whose size depends on the
compilation mode, and key_t is a system-derived type that is 64 bits in the 64-bit kernel environment, and
32 bits otherwise.
In cases where 32-bit and 64-bit versions of a kernel extension are to be generated from a single source
base, the kernel extension must be made type-safe for both the LP64 and ILP32 data models. To facilitate
this, the sys/types.h and sys/inttypes.h header files contain fixed-width system-derived types, constants,
and macros. For example, the int8_t, int16_t, int32_t, int64_t fixed-width types are provided along with
constants that specify their maximum values.
Function Prototypes
Function prototypes are more important in the 64-bit programming environment than the 32-bit
programming environment, because the default return value of an undeclared function is int. If a function
prototype is missing for a function returning a pointer, the compiler will convert the returned value to an int
by setting the high-order word to 0, corrupting the value. In addition, function prototypes allow the compiler
to do more type checking, regardless of the compilation mode.
When compiled in 64-bit mode, system header files define full function prototypes for all kernel services
provided by the 64-bit kernel. By defining the __FULL_PROTO macro, function prototypes are provided in
32-bit mode as well. It is recommended that function prototypes be provided by including the system
header files, instead of providing a prototype in a source file.
Compiler Options
To compile a kernel extension in 64-bit mode, the -q64 flag must be used. To check for missing function
prototypes, -qinfo=pro can be specified. To compile in ANSI mode, use the -qlanglvl=ansi flag. When this
flag is used, additional error checking will be performed by the compiler. To link-edit a kernel extension, the
-b64 option must be used with the ld command.
Conditional Compilation
When compiling in 64-bit mode, the compiler automatically defines the macro __64BIT__. Kernel
extensions should always be compiled with the _KERNEL macro defined, and if sys/types.h is included,
Kernel extensions should not be compiled with the _KERNSYS macro defined. If this macro is defined, the
resulting kernel extension will not be supported, and binary compatibility will not be assured with future
releases.
Temporary Attachment
The 64-bit kernel provides kernel extensions with the capability to temporarily attach virtual memory
segments to the kernel space for the current thread of kernel mode execution. This capability is also
available on the 32-bit kernel, and is provided through the vm_att and vm_det services.
A total of four concurrent temporary attaches will be supported under a single thread of execution.
Global Regions
The 64-bit kernel provides kernel extensions with the capability to create global regions within the kernel
address space. Once created, a region is globally accessible to all kernel code until it is destroyed.
Regions may be created with unique characteristics, for example, page protection, that suit kernel
extension requirements and are different from the global virtual memory allocated from the kernel_heap.
Global regions are also useful for kernel extensions that in the past have organized their data around
virtual memory segments and require sizes and alignments that are inappropriate for the kernel heap.
Under the 64-bit kernel, this memory can be provided through global regions rather than separate virtual
memory segments, thus avoiding the complexity and performance cost of temporarily attaching virtual
memory segments.
Global regions are created and destroyed with the vm_galloc and vm_gfree kernel services.
Once a kernel extension has been updated to support the new 64-bit ABI, there are two ways to indicate
that the kernel extension can be used by 64-bit processes again. The first way uses a linker flag to mark
the module as a ported kernel extension. Use the -bM:LT linker flag to mark the module in this manner.
The second way requires changing the sysconfig or kmod_load call used to load the kernel extension.
When the SYS_64L flag is passed to sysconfig, or the LD_64L flag is passed to kmod_load, the
specified kernel extension will be allowed to export 64-bit system calls.
Kernel extensions in the 64-bit kernel are always assumed to support the 64-bit ABI. The module type,
specified by the -bM linker flag, as well as the SYS_64L and LD_64L flags are always ignored when the
64-bit kernel is running.
32-bit device drivers cannot be used by 64-bit applications unless the DEV_64L flag is set in the d_opts
field. The DEV_64BIT flag is ignored, and in the 64-bit kernel, DEV_64L is ignored as well.
Related Information
Chapter 15, “Serial Direct Access Storage Device Subsystem,” on page 311
Subroutine References
The setpri subroutine, sysconfig subroutine in AIX 5L Version 5.3 Technical Reference: Base Operating
System and Extensions Volume 2.
Commands References
The ar command in AIX 5L Version 5.3 Commands Reference, Volume 1.
Technical References
The clrjmpx kernel service, copyin kernel service, copyinstr kernel service, copyout kernel service,
creatp kernel service, disable_lock kernel service, e_sleep kernel service, e_sleepl kernel service,
e_wait kernel service, et_wait kernel service, fubyte kernel service, fuword kernel service, getexcept
kernel service, i_disable kernel service, i_enable kernel service, i_init kernel service, initp kernel service,
lockl kernel service, longjmpx kernel service, setjmpx kernel service, setpinit kernel service, sig_chk
kernel service, subyte kernel service, suword kernel service, uiomove kernel service, unlockl kernel
service, ureadc kernel service, uwritec kernel service, uexadd kernel service, uexdel kernel service,
xmalloc kernel service, xmattach kernel service, xmdetach kernel service, xmemin kernel service,
xmemout kernel service in AIX 5L Version 5.3 Technical Reference: Kernel and Subsystems Volume 1.
The uio structure in AIX 5L Version 5.3 Technical Reference: Kernel and Subsystems Volume 1.
The distinction between a system call and an ordinary function call is only important in the kernel
programming environment. User-mode application programs are not usually aware of this distinction.
Operating system functions are made available to the application program in the form of programming
libraries. A set of library functions found in a library such as libc.a can have functions that perform some
user-mode processing and then internally start a system call. In other cases, the system call can be
directly exported by the library without any user-space code. For more information on programming
libraries, see “Using Libraries” on page 4.
Operating system functions available to application programs can be split or moved between user-mode
functions and kernel-mode functions as required for different releases or machine platforms. Such
movement does not affect the application program. Chapter 1, “Kernel Environment,” on page 1 provides
more information on how to use system calls in the kernel environment.
When a program is running in the user protection domain, the processor executes instructions in the
problem state, and the program does not have direct access to kernel data.
Programming errors in the code running in the kernel protection domain can cause the operating system to
fail. In particular, a process’s user data cannot be accessed directly, but must be accessed using the
copyin and copyout kernel services, or their variants. These routines protect the kernel from improperly
supplied user data addresses.
Application programs can gain controlled access to kernel data by making system calls. Functions that
directly or indirectly invoke system calls are typically provided by programming libraries, which give
applications access to operating system functions.
The system loader maintains a table of the functions that are used for each system call.
The system call runs within the calling thread, but with more privilege because system calls run in the
kernel protection domain. After the function implementing the system call has performed the requested
action, control returns to the system call handler. If the ut_error field in the uthread structure has a
non-zero value, the value is copied to the application’s thread-specific errno variable. If a signal is
pending, signal processing takes place, which can result in an application’s signal handler being invoked. If
no signals are pending, the system call handler restores the state of the calling thread, which is resumed
in the user protection domain. For more information on protection domains, see “Understanding Protection
Domains” on page 23.
First, system calls cannot have floating-point parameters. In fact, the operating system does not preserve
the contents of floating-point registers when a system call is preempted by another thread, so system calls
cannot use any floating-point operations.
Second, a system call in the 32-bit kernel cannot return a long long value to a 32-bit application. In
32-bit mode, long long values are returned in a pair of general purpose registers, GPR3 and GPR4. Only
GPR3 is preserved by the system call handler before it returns to the application. A system call in the
32-bit kernel can return a 64-bit value to a 64-bit application, but the saveretval64 kernel service must
be used.
Third, since a system call runs on its own stack, the number of arguments that can be passed to a system
call is limited. The operating system linkage conventions specify that up to eight general purpose registers
are used for parameter passing. If more parameters exist than will fit in eight registers, the remaining
parameters are passed in the stack. Because a system call does not have direct access to the
application’s stack, all parameters for system calls must fit in eight registers.
Some parameters are passed in multiple registers. For example, 32-bit applications pass long long
parameters in two registers, and structures passed by value can require multiple registers, depending on
the structure size. The writer of a system call should be familiar with the way parameters are passed by
the compiler and ensure that the 8-register limit is not exceeded. For more information on parameter
calling conventions, see Subroutine Linkage Convention in Assembler Language Reference.
Finally, because 32- and 64-bit applications are supported by both the 32- and 64-bit kernels, the data
model used by the kernel does not always match the data model used by the application. When the data
models do not match, the system call might have to perform extra processing before parameters can be
used.
Regardless of whether the 32-bit or 64-bit kernel is running, the interface that is provided by the kernel to
applications must be identical. This simplifies the development of applications and libraries, because their
behavior does not depend on the mode of the kernel. On the other hand, system calls might need to know
the mode of the calling process. The IS64U macro can be used to determine if the caller of a system call
is a 64-bit process. For more information on the IS64U macro, see IS64U Kernel Service in AIX 5L
Version 5.3 Technical Reference: Kernel and Subsystems Volume 1.
The ILP32 and LP64 data models differ in the way that pointers and long and long long parameters are
treated when used in structures or passed as functional parameters. The following tables summarize the
differences.
System calls using these types must take the differing data models into account. The treatment of these
types depends on whether they are used as parameters or in structures passed as parameters by value or
by reference.
Signed long Parameters: To convert a 32-bit signed long parameter to a 64-bit value, the 32-bit value
must be sign extended. The LONG32TOLONG64 macro is provided for this operation. It converts a 32-bit
signed value into a 64-bit signed value, as shown in this example:
syscall1(long incr)
{
    /* If the caller is a 32-bit process, convert
     * 'incr' to a signed, 64-bit value.
     */
    if (!IS64U)
        incr = LONG32TOLONG64(incr);
    .
    .
    .
}
If a parameter can be either a pointer or a symbolic constant, special handling is needed. For example, if
-1 is passed as a pointer argument to indicate a special case, comparing the pointer to -1 will fail, as will
unconditionally sign-extending the parameter value. Code similar to the following should be used:
syscall2(void *ptr)
{
    /* If caller is a 32-bit process,
     * check for special parameter value.
     */
    if (!IS64U && (LONG32TOLONG64(ptr) == -1))
        ptr = (void *)-1;
    .
    .
    .
}
Similar treatment is required when an unsigned long parameter is interpreted as a signed value.
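The conversions described above can be illustrated with plain integer arithmetic. The sketch below is a user-space model, not the LONG32TOLONG64 macro itself. The sign_extend_32 and is_minus_one functions are hypothetical, and the model assumes that a parameter from a 32-bit caller arrives in a 64-bit register with the high-order word set to 0.

```c
#include <stdint.h>

/* Sign-extends the low-order 32 bits of a register image to 64 bits. */
int64_t sign_extend_32(uint64_t reg)
{
    uint32_t low = (uint32_t)reg;

    if (low & 0x80000000u)
        return (int64_t)low - 4294967296LL;     /* subtract 2^32 */
    return (int64_t)low;
}

/* Returns 1 if a pointer-sized parameter carries the special value -1. */
int is_minus_one(uint64_t reg, int caller_is_64bit)
{
    if (caller_is_64bit)
        return reg == UINT64_MAX;       /* full-width comparison */
    return sign_extend_32(reg) == -1;   /* 32-bit caller: extend first */
}
```

A 32-bit caller that passes -1 produces the register image 0x00000000FFFFFFFF, which is not equal to a 64-bit -1 until it is sign extended. This is why the check depends on the mode of the caller.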
long long Parameters: A 32-bit application passes a long long parameter in two registers, while a
64-bit kernel system call uses a single register for a long long parameter value.
The system call function prototype cannot match the function prototype used by the application. Instead,
each long long parameter should be replaced by a pair of uintptr_t parameters. Subsequent parameters
should be replaced with uintptr_t parameters as well. When the caller is a 32-bit process, a single 64-bit
value will be constructed from two consecutive parameters. This operation can be performed using the
INTSTOLLONG macro. For a 64-bit caller, a single parameter is used directly.
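syscall3(
    uintptr_t L1,
    uintptr_t L2,
    uintptr_t L3,
    uintptr_t L4,
    uintptr_t L5)
{
    long len1;
    long len2;
    int size;

    if (!IS64U) {
        len1 = INTSTOLLONG(L1, L2);
        len2 = INTSTOLLONG(L3, L4);
        size = (int)L5;
    }
    else {
        len1 = (long)L1;
        len2 = (long)L2;
        size = (int)L3;
    }
    .
    .
    .
}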
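The shift in parameter positions is easier to see with concrete register contents. The following user-space sketch models five parameter registers as an array. The decoded_args structure and the join_words and decode_args functions are hypothetical, and the sketch assumes that the first register of each pair holds the high-order word, which is not stated in this section.

```c
#include <stdint.h>

/* The three logical arguments: two long long values and an int. */
struct decoded_args {
    int64_t len1;
    int64_t len2;
    int size;
};

/* Joins two 32-bit register values into one 64-bit value.
 * Assumption: the first register holds the high-order word. */
static int64_t join_words(uint64_t high, uint64_t low)
{
    uint64_t bits = ((high & 0xFFFFFFFFu) << 32) | (low & 0xFFFFFFFFu);

    return (int64_t)bits;   /* two's complement conversion assumed */
}

/* Decodes the register image according to the mode of the caller. */
struct decoded_args decode_args(const uint64_t regs[5], int caller_is_64bit)
{
    struct decoded_args out;

    if (!caller_is_64bit) {
        /* 32-bit caller: each long long occupies two registers. */
        out.len1 = join_words(regs[0], regs[1]);
        out.len2 = join_words(regs[2], regs[3]);
        out.size = (int)regs[4];
    } else {
        /* 64-bit caller: each long long occupies one register. */
        out.len1 = (int64_t)regs[0];
        out.len2 = (int64_t)regs[1];
        out.size = (int)regs[2];
    }
    return out;
}
```

The same logical arguments therefore appear in registers 1 through 5 for a 32-bit caller and in registers 1 through 3 for a 64-bit caller. This is why every parameter after the first long long must also be declared as uintptr_t.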
If a 64-bit parameter is an address, the system call might not be able to use the address directly. Instead,
it might be necessary to map the 64-bit address into a 32-bit address, which can be passed to various
kernel services.
For example, suppose that the first and third parameters of a system call are 64-bit values. The full
parameter values are obtained as shown:
#include <sys/types.h>
syscall4(char *str, int fd, long count)
{
    ptr64 str64;
    int64 count64;
    if (IS64U)
    {
        /* get 64-bit address. */
        str64 = get64bitparm(str, 0);
The get64bitparm kernel service must not be used when the caller is a 32-bit process, nor should it be
used when the parameter type is an int or smaller. In these cases, the system call parameter can be used
directly. For example, the fd parameter in the previous example can be used directly.
A system call can use a 64-bit address to access user-space memory by calling one of the 64-bit
data-movement kernel services, such as copyin64, copyout64, or copyinstr64. Alternatively, if the user
address is to be passed to kernel services that expect 32-bit addresses, the 64-bit address should be
mapped to a 32-bit address.
Consider a system call that takes a path name and a buffer pointer as parameters. This system call will
use the path name to obtain information about the file, and use the buffer pointer to return the information.
Because pathname is passed to the lookupname kernel service, which takes a 32-bit pointer, the
pathname parameter must be mapped. The buffer address can be used directly. For example:
int syscall5 (
    char *pathname,
    char *buffer)
{
    ptr64 upathname;
    ptr64 ubuffer;
    struct vnode *vp;
    struct cred *crp;
    crp = crref();
    rc = lookupname(pathname, USR, L_SEARCH, NULL, &vp, crp);
    getinfo(vp, &local_buffer);
The function prototype for the get64bitparm kernel service is found in the sys/remap.h header file. To
allow common code to be written, the get64bitparm kernel service is defined as a macro when compiling
in 64-bit mode. The macro simply returns the specified parameter value, as this value is already a full
64-bit value.
In some cases, a system call or kernel service will need to obtain the original 64-bit address from the
32-bit mapped address. The as_unremap64 kernel service is used for this purpose.
The saveretval64 kernel service allows a 32-bit system call to return a 64-bit value to a 64-bit application.
This kernel service takes a single long long parameter, saves the low-order word (passed in GPR4) in a
save area for the current thread, and returns the original parameter. Depending on the return type of the
system call function, this value can be returned to the system call handler, or the high-order word of the
full 64-bit return value can be returned.
After the system call function returns to the system call handler, the original 64-bit return value will be
reconstructed in GPR3, and returned to the application. If the saveretval64 kernel service is not called by
the system call, the high-order word of GPR3 is zeroed before returning to the application. For example:
void * syscall6 (
    int arg)
{
    if (IS64U) {
        ptr64 rc = f(arg);
        saveretval64(rc);           /* Save low-order word */
        return (void *)(rc >> 32);  /* Return high-order word as
                                     * 32-bit address */
    }
    else {
        return (void *)f(arg);
    }
}
Structure Reshaping
Structure reshaping allows system calls to support both 32- and 64-bit applications using a single system
call interface and using code that is predominately common to both application types.
Structure reshaping requires defining more than one version of a structure. One version of the structure is
used internally by the system call to process the request. The other version should use size-invariant
types, so that the layout of the structure fields matches the application’s view of the structures. When a
structure is copied in from user space, the application-view structure definition is used. The structure is
reshaped by copying each field of the application’s structure to the kernel’s structure, converting the fields
as required. A similar conversion is performed on structures that are being returned to the caller.
Structure reshaping is used for structures whose size and layout as seen by an application differ from the
size and layout as seen by the system call. If the system call uses a structure definition with fields big
enough for both 32- and 64-bit applications, the system call can use this structure, independent of the
mode of the caller.
While reshaping requires two versions of a structure, only one version is public and visible to the end user.
This version is the natural structure, which can also be used by the system call if reshaping is not needed.
The private version should only be defined in the source file that performs the reshaping. The following
example demonstrates the techniques for passing structures to system calls that are running in the 64-bit
kernel and how a structure can be reshaped:
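syscall7(struct foo *f)
{
    struct foo f1;      /* Natural structure used by the system call */
    struct foo32 f2;    /* Private view of a 32-bit caller's structure */
    if (IS64U()) {
        copyin(&f1, f, sizeof(f1));
    }
    else {
        copyin(&f2, f, sizeof(f2));
        f1.a = f2.a;    /* Reshape field by field */
        f1.b = f2.b;
    }
    /* Common structure f1 used from now on. */
    .
    .
    .
}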
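The field-by-field conversion can be illustrated with a small user-space sketch. The layouts below are assumptions made for illustration only: the manual's actual definitions of struct foo and struct foo32 are not shown in this section, so the sketch uses its own names (sketch_foo, sketch_foo32) and fixed-width types to stand in for a natural structure and a size-invariant 32-bit view.

```c
#include <stdint.h>

/* Assumed natural (kernel-side) layout, for illustration only. */
struct sketch_foo {
    int64_t  a;     /* signed quantity */
    uint64_t b;     /* holds an address-sized value */
};

/* Assumed size-invariant view matching a 32-bit application's layout. */
struct sketch_foo32 {
    int32_t  a;
    uint32_t b;
};

/* Reshape: copy each field, widening as required. */
static void sketch_reshape(struct sketch_foo *dst, const struct sketch_foo32 *src)
{
    dst->a = src->a;    /* sign-extends the 32-bit value */
    dst->b = src->b;    /* zero-extends the 32-bit value */
}
```

Using fixed-width types in the private view keeps its layout the same no matter which data model the compiling code uses.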
Dual Implementation: The dual implementation approach involves separate code paths for calls from
32-bit applications and calls from 64-bit applications. Similar to reshaping, the system call code defines a
private view of the application’s structure. With dual implementations, the function syscall7 could be
rewritten as:
syscall8(struct foo *f)
{
    struct foo f1;
    struct foo32 f2;
    if (IS64U()) {
        copyin(&f1, f, sizeof(f1));
        /* Code for 64-bit process uses f1 */
        .
        .
        .
    }
    else {
        copyin(&f2, f, sizeof(f2));
        /* Code for 32-bit process uses f2 */
        .
        .
        .
    }
}
Dual implementation is most appropriate when the structures are so large that the overhead of reshaping
would affect the performance of the system call.
Passing Structures by Value: When structures are passed by value, the structure is loaded into as
many parameter registers as are needed. When the data model of an application and the data model of
the kernel extension differ, the values in the registers cannot be used directly. Instead, the registers must
be stored in a temporary variable. For example:
Because system calls can be preempted, access to global data must be serialized. Kernel locking
services, such as simple_lock and simple_unlock, are frequently used to serialize access to kernel data.
A thread can be preempted even when it owns a lock. If multiple locks are obtained by system calls, a
technique must be used to prevent multiple threads from deadlocking. One technique is to define a lock
hierarchy. A system call must never return while holding a lock. For more information on locking, see
“Understanding Locking” on page 13.
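One way to picture a lock hierarchy is as a rank assigned to every lock, with the rule that locks are only acquired in increasing rank order. The following is a user-space sketch of that bookkeeping only. It does not call simple_lock or simple_unlock, all names are hypothetical, and it tracks a single caller, whereas a real implementation would keep this state per thread.

```c
#include <stdbool.h>

#define SKETCH_MAX_HELD 8

/* Ranks of the locks currently held, in acquisition order. */
static int sketch_held_ranks[SKETCH_MAX_HELD];
static int sketch_held_count = 0;

/* Allow the acquisition only if the rank is higher than every held rank. */
static bool sketch_ranked_acquire(int rank)
{
    if (sketch_held_count == SKETCH_MAX_HELD)
        return false;
    if (sketch_held_count > 0 &&
        rank <= sketch_held_ranks[sketch_held_count - 1])
        return false;               /* would violate the hierarchy */
    sketch_held_ranks[sketch_held_count++] = rank;
    return true;
}

/* This sketch requires release in reverse order of acquisition. */
static bool sketch_ranked_release(int rank)
{
    if (sketch_held_count == 0 ||
        sketch_held_ranks[sketch_held_count - 1] != rank)
        return false;
    sketch_held_count--;
    return true;
}

/* A system call must not return while this is false. */
static bool sketch_no_locks_held(void)
{
    return sketch_held_count == 0;
}
```

Requiring reverse-order release is a simplification chosen for the sketch; the hierarchy itself only constrains the order of acquisition.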
The sleep kernel service, provided for compatibility, also supports the PCATCH and SWAKEONSIG
options to control the response to a signal during the sleep function.
Previously, the kernel automatically saved context on entry to the system call handler. As a result, any long
(interruptible) sleep not specifying the PCATCH option returned control to the saved context when a signal
interrupted the wait. The system call handler then set the errno global variable to EINTR and returned a
return code of -1 from the system call.
The kernel, however, requires each system call that can directly or indirectly issue a sleep call without the
PCATCH option to set up a saved context using the setjmpx kernel service. This is done to avoid
overhead for system calls that handle waits terminated by signals. Using the setjmpx service, the system
can set up a saved context, which sets the system call return code to a -1 and the ut_error field to
EINTR, if a signal interrupts a long wait not specifying return-from-signal.
It is probably faster and more robust to specify return-from-signal on all long waits and use the return
code to control the system call return.
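The saved-context behavior can be approximated in user space with the standard C setjmp and longjmp functions. This is only an analogy: setjmpx and longjmpx are kernel services with their own semantics, and every name in the sketch is hypothetical. It shows a system call body that sets up a context, a wait that is interrupted, and the resulting -1 return with EINTR.

```c
#include <setjmp.h>
#include <errno.h>

static jmp_buf sketch_saved_context;

/* Models a long wait that a signal interrupts: control goes back to
 * the saved context instead of returning normally. */
static void sketch_interrupted_sleep(void)
{
    longjmp(sketch_saved_context, 1);
}

/* Models a system call that sets up a saved context before sleeping. */
static int sketch_syscall_with_context(int *error_out)
{
    if (setjmp(sketch_saved_context) != 0) {
        *error_out = EINTR;     /* reached only through longjmp */
        return -1;
    }
    sketch_interrupted_sleep();
    return 0;                   /* not reached in this sketch */
}
```

A wait that reports the interruption through its return code avoids the jump entirely, which is the alternative the text recommends.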
An initial context is set up for each process by the initp kernel service for kernel processes and by the
fork subroutine for user processes. The process terminates if that context is resumed.
The default exception handler generates a signal if the process is in a state where signals can be
delivered immediately. Otherwise, the default exception handler generates a system dump.
The exception handler returns a value that can specify any of the following:
- The process should resume with the instruction that caused the exception.
- The process should return to the saved context that is on the top of the stack of contexts.
- The exception handler did not handle the exception.
If the exception handler did not handle the exception, then the next exception handler in the stack of
contexts is called. If none of the stacked exception handlers handle the exception, the kernel performs
default exception handling. The setjmpx and longjmpx kernel services help implement exception
handlers.
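The way control passes down the stack of exception handlers can be sketched as a simple loop. The code below is a user-space illustration with hypothetical names and result codes. It does not reflect the kernel's actual data structures, only the rule that the next handler is tried until one handles the exception, with default handling as the fallback.

```c
#include <stddef.h>

enum sketch_exc_result {
    SKETCH_EXC_RESUME,      /* resume at the instruction that faulted */
    SKETCH_EXC_RETURN,      /* return to the saved context on top of the stack */
    SKETCH_EXC_UNHANDLED    /* this handler did not handle the exception */
};

typedef enum sketch_exc_result (*sketch_exc_handler)(int exception);

/* handlers[count - 1] is the top of the stack. Returns the first result
 * other than SKETCH_EXC_UNHANDLED and sets *used_default to 1 when no
 * handler accepts the exception. */
static enum sketch_exc_result sketch_dispatch(const sketch_exc_handler *handlers,
                                              size_t count, int exception,
                                              int *used_default)
{
    size_t i;
    *used_default = 0;
    for (i = count; i > 0; i--) {
        enum sketch_exc_result r = handlers[i - 1](exception);
        if (r != SKETCH_EXC_UNHANDLED)
            return r;
    }
    *used_default = 1;      /* default exception handling would run here */
    return SKETCH_EXC_UNHANDLED;
}

/* Example handlers used to exercise the dispatcher. */
static enum sketch_exc_result sketch_handles_only_7(int e)
{
    return e == 7 ? SKETCH_EXC_RESUME : SKETCH_EXC_UNHANDLED;
}

static enum sketch_exc_result sketch_handles_nothing(int e)
{
    (void)e;
    return SKETCH_EXC_UNHANDLED;
}
```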
This restriction does not apply to kernel processes. User-mode data access services can distinguish
between kernel processes and user-mode processes in kernel mode. As a result, these services can
access the referenced data areas correctly when the caller is a kernel process.
Kernel processes cannot call the fork or exec system calls, among others. A list of the base operating
system calls available to system calls or other routines in kernel mode is provided in “System Calls
Available to Kernel Extensions” on page 35.
Most data accessed by system calls is pageable by default. This includes the system call code, static data,
dynamically allocated data, and stack. As a result, a system call can be preempted in two ways:
- By a more favored process, or by an equally favored process when a time slice has been exhausted
- By losing control of the processor when it page faults
In the latter case, even less-favored processes can run while the system call is waiting for the paging I/O
to complete.
Before actually calling the system call function, the system call handler sets the ut_error field to 0. Upon
return from the system call function, the system call handler copies the value found in ut_error into the
thread-specific errno variable if ut_error was nonzero. After setting the errno variable, the system call
handler returns to user mode with the return code provided by the system call function.
Kernel-mode callers of system calls must be aware of this return code convention and use the getuerror
kernel service to obtain the error value when an error indication is returned by the system call. When
system calls are nested, the system call function called by the system call handler can return the error
value provided by the nested system call function or can replace this value with a new one by using the
setuerror kernel service.
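The error-return convention can be modeled in a few lines of user-space C. In this sketch, sketch_ut_error stands in for the ut_error field, and sketch_setuerror and sketch_getuerror stand in for the setuerror and getuerror kernel services. All of these are hypothetical stand-ins, not the real interfaces.

```c
#include <errno.h>

/* Hypothetical stand-in for the per-thread ut_error field. */
static int sketch_ut_error;

static void sketch_setuerror(int e) { sketch_ut_error = e; }
static int  sketch_getuerror(void)  { return sketch_ut_error; }

/* Example system call body: fails with EINVAL on a negative argument. */
static int sketch_double_syscall(int arg)
{
    if (arg < 0) {
        sketch_setuerror(EINVAL);
        return -1;
    }
    return arg * 2;
}

/* Mimics the handler: clear the field, call the function, and copy the
 * field to errno only if it ended up nonzero. */
static int sketch_call_handler(int (*fn)(int), int arg)
{
    int rc;
    sketch_ut_error = 0;
    rc = fn(arg);
    if (sketch_ut_error != 0)
        errno = sketch_ut_error;
    return rc;
}
```

Because errno is written only when the field is nonzero, a successful call leaves a previously set errno value untouched.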
Related Information
“Handling Signals While in a System Call” on page 32
Subroutine References
The fork subroutine in AIX 5L Version 5.3 Technical Reference: Base Operating System and Extensions
Volume 1.
Logical file system     Provides support for the system call interface.
Physical file system    Manages permanent storage of data.
The interface between the physical and logical file systems is the virtual file system interface. This
interface allows support for multiple concurrent instances of physical file systems, each of which is called a
file system implementation. The file system implementation can support storing the file data in the local
node or at a remote node. For more information on the virtual file system interface, see “Understanding the
Virtual File System Interface” on page 41.
The virtual file system interface is usually referred to as the v-node interface. The v-node structure is the
key element in communication between the virtual file system and the layers that call it. For more
information on v-nodes, see “Understanding Virtual Nodes (V-nodes)” on page 40.
Both the virtual and logical file systems exist across all of this operating system family’s platforms.
A consistent view of file system implementations is made possible by the virtual file system abstraction.
This abstraction specifies the set of file system operations that an implementation must include in order to
carry out logical file system requests. Physical file systems can differ in how they implement these
predefined operations, but they must present a uniform interface to the logical file system. A list of file
system operators can be found at “Requirements for a File System Implementation” on page 41. For more
information on the virtual file system, see “Virtual File System Overview” on page 40.
Each set of predefined operations implemented constitutes a virtual file system. As such, a single physical
file system can appear to the logical file system as one or more separate virtual file systems.
Virtual file system operations are available at the logical file system level through the virtual file system
switch. This array contains one entry for each virtual file system, with each entry holding entry point
addresses for separate operations. Each file system type has a set of entries in the virtual file system
switch.
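A switch of this kind is essentially an array of operation tables indexed by file system type. The following user-space sketch shows the dispatch pattern only. The structure, operation names, and return values are hypothetical and much simpler than the real virtual file system operations.

```c
/* Hypothetical operation table; the real set of operations is larger. */
struct sketch_vfs_ops {
    int (*op_open)(const char *path);
    int (*op_close)(int handle);
};

static int sketch_local_open(const char *path)  { (void)path; return 1; }
static int sketch_local_close(int handle)       { (void)handle; return 0; }
static int sketch_remote_open(const char *path) { (void)path; return 2; }
static int sketch_remote_close(int handle)      { (void)handle; return 0; }

enum { SKETCH_LOCAL = 0, SKETCH_REMOTE = 1, SKETCH_NTYPES };

/* The switch: one entry per file system type, each holding entry points. */
static const struct sketch_vfs_ops sketch_switch[SKETCH_NTYPES] = {
    [SKETCH_LOCAL]  = { sketch_local_open,  sketch_local_close  },
    [SKETCH_REMOTE] = { sketch_remote_open, sketch_remote_close },
};

/* The logical layer dispatches through the switch without knowing how
 * the selected file system implements the operation. */
static int sketch_lfs_open(int type, const char *path)
{
    if (type < 0 || type >= SKETCH_NTYPES)
        return -1;
    return sketch_switch[type].op_open(path);
}
```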
The logical file system and the virtual file system switch support other operating system file-system access
semantics. This does not mean that only other operating system file systems can be supported. It does
mean, however, that a file system implementation must be designed to fit into the logical file system
model. Operations or information requested from a file system implementation need be performed only to
the extent possible.
Logical file system can also refer to the tree of known path names in force while the system is running. A
virtual file system that is mounted onto the logical file system tree itself becomes part of that tree. In fact, a
A virtual file system can also be viewed as a subset of the logical file system tree, that part belonging to a
single file system implementation. A virtual file system can be physical (the instantiation of a physical file
system), remote, or strictly logical. In the latter case, for example, a virtual file system need not actually be
a true file system or entail any underlying physical storage device.
A virtual file system mount point grafts a virtual file system subtree onto the logical file system tree. This
mount point ties together a mounted-over v-node (virtual node) and the root of the virtual file system
subtree. A mounted-over, or stub, v-node points to a virtual file system, and the mounted VFS points to the
v-node it is mounted over.
A v-node is either created or used again for every reference made to a file by path name. When a user
attempts to open or create a file, if the VFS containing the file already has a v-node representing that file,
a use count in the v-node is incremented and the existing v-node is used. Otherwise, a new v-node is
created.
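The reuse-or-create rule can be sketched as a lookup that bumps a use count on a hit and allocates on a miss. This is a user-space illustration with a hypothetical sketch_vnode type and a plain integer file identifier. The real v-node structure and lookup path are more involved.

```c
#include <stdlib.h>

struct sketch_vnode {
    int file_id;                    /* identifies the file within the VFS */
    int use_count;
    struct sketch_vnode *next;
};

/* Reuse an existing v-node and increment its use count, or create a new
 * one with a count of 1. Returns NULL only if allocation fails. */
static struct sketch_vnode *sketch_vn_get(struct sketch_vnode **list, int file_id)
{
    struct sketch_vnode *vp;

    for (vp = *list; vp != NULL; vp = vp->next) {
        if (vp->file_id == file_id) {
            vp->use_count++;
            return vp;
        }
    }
    vp = malloc(sizeof(*vp));
    if (vp == NULL)
        return NULL;
    vp->file_id = file_id;
    vp->use_count = 1;
    vp->next = *list;
    *list = vp;
    return vp;
}
```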
Each file system implementation is responsible for allocating and destroying g-nodes. The g-node then
serves as the interface between the logical file system and the file system implementation. Calls to the file
system implementation serve as requests to perform an operation on a specific g-node.
A g-node is needed, in addition to the file system i-node, because some file system implementations may
not include the concept of an i-node. Thus the g-node structure substitutes for whatever structure the file
system implementation may have used to uniquely identify a file system object.
The logical file system relies on the file system implementation to provide valid data for the following fields
in the g-node:
The header files contain the structure definitions for the key components of the virtual file system
abstraction. Understanding the contents of these files and the relationships between them is essential to
an understanding of virtual file systems. The following are the necessary header files:
- sys/vfs.h
- sys/gfs.h
- sys/vnode.h
- sys/vmount.h
Follow these steps to configure a file system:
1. A user-level routine must call the sysconfig subroutine requesting that the code for the virtual file
system be loaded.
2. The user-level routine must then request, again by calling the sysconfig subroutine, that the virtual file
system be configured. The name of a VFS-specific configuration routine must be specified.
3. The virtual file system-specific configuration routine calls the gfsadd kernel service to have the new file
system added to the gfs table. The gfs table that the configuration routine passes to the gfsadd
kernel service contains a pointer to an initialization routine. This routine is then called to do any further
virtual file system-specific initialization.
4. The file system is now operational.
Related Information
“Logical File System Kernel Services” on page 66
“Understanding Data Structures and Header Files for Virtual File Systems” on page 42
List of Virtual File System Operations in AIX 5L Version 5.3 Technical Reference: Kernel and Subsystems
Volume 1.
Subroutine References
The mntctl subroutine, mount subroutine, sysconfig subroutine in AIX 5L Version 5.3 Technical
Reference: Base Operating System and Extensions Volume 1.
Files References
The vmount.h file in AIX 5L Version 5.3 Files Reference.
Technical References
The gfsadd kernel service, gfsdel kernel service in AIX 5L Version 5.3 Technical Reference: Kernel and
Subsystems Volume 1.
Callers of kernel services execute in kernel mode. They therefore share with the kernel the responsibility
for ensuring that system integrity is not compromised.
For a list of system calls that kernel extensions are allowed to use, see “System Calls Available to Kernel
Extensions” on page 35.
bawrite     Writes the specified buffer’s data without waiting for I/O to complete.
bdwrite     Releases the specified buffer after marking it for delayed write.
bflush      Flushes all write-behind blocks on the specified device from the buffer cache.
binval      Invalidates all of the specified device’s blocks in the buffer cache.
blkflush    Flushes the specified block if it is in the buffer cache.
bread       Reads the specified block’s data into a buffer.
breada      Reads in the specified block and then starts I/O on the read-ahead block.
brelse      Frees the specified buffer.
bwrite      Writes the specified buffer’s data.
clrbuf      Sets the memory for the specified buffer structure’s buffer to all zeros.
getblk      Assigns a buffer to the specified block.
geteblk     Allocates a free buffer.
geterror    Determines the completion status of the buffer.
purblk      Purges the specified block from the buffer cache.
In addition to the mbuf kernel services, the following macros are available for use with mbufs:
d_align         Provides needed information to align a buffer with a processor cache line.
d_cflush        Flushes the processor and I/O controller (IOCC) data caches when using the long term
                DMA_WRITE_ONLY mapping of DMA buffers approach to the bus device DMA.
d_map_clear     Deallocates resources previously allocated on a d_map_init call.
d_map_disable   Disables DMA for the specified handle.
d_map_enable    Enables DMA for the specified handle.
d_map_init      Allocates and initializes resources for performing DMA with PCI and ISA devices.
For example, if each port of a 4-port Ethernet adapter can be assigned to a different LPAR, and each port
is a PCI function, then each port would be a PE. In this case, granularity of a PE is a PCI function. A PE is
not necessarily the same as a PCIe Endpoint, and the two should not be confused. A bridge or switch
above the PE where the EEH state is maintained forms a PE domain. This bridge or switch is called the
Top of PE domain. EEH recovery is performed according to the PE domain and is carried out by the Top
of PE domain as directed by the software (operating system and device drivers).
Several PCI functions in one or more adapters that belong to the same EEH recovery domain are called a
shared EEH domain. This has been typically limited to a multifunction adapter, in which the functions on
the adapter are recovered together. Because a shared EEH domain supports any number of PCI functions
to be included in it, including the functions on different adapters, its function is more general than a
multifunction adapter. For present purposes, the multifunction model is referred to as the shared EEH
model.
In the LPAR environment, a PE domain is the same as a shared EEH domain and includes all PCI
functions in the PE domain. In other words, if multiple functions belong to a shared EEH domain, they
cannot be individual PEs because the EEH recovery can only affect the LPAR to which the PE belongs.
The bridges can be of different types, such as PCI-to-PCI or PCIX-to-PCI, and so on. A switch is a PCIe
switch, which is logically a collection of bridges. The exact type of the bridge or switch is not important to
this programming model. These details are handled by the hardware and firmware. A bridged
single-function adapter is treated like a bridged multifunction adapter for the purposes of the EEH
programming model.
The device drivers for all these types of adapters use the same EEH kernel services to drive the error
recovery except for the registration service. A nonbridged single-function adapter calls the eeh_init()
registration service function. Adapters in a shared EEH domain call the eeh_init_multifunc() function. This
includes any bridged or nonbridged multifunction adapters and bridged single-function adapters. Although
a nonbridged single-function adapter typically calls eeh_init(), it can choose the shared EEH model and
call eeh_init_multifunc() instead. Regardless of the number of functions and bridges, the device drivers
should always use the shared EEH model and call eeh_init_multifunc(). The PCIe device drivers are
required to use the shared EEH model. Although the same services are used by the single and shared
EEH adapter drivers, the error recovery models are different. In addition, any time there are intermediate
The error recovery is performed by resetting the PCI bus between the Top of PE domain and the PE under
it, and then reconfiguring any intermediate bridges. The basic steps in error detection and recovery are as
follows:
- An adapter driver suspects an error on the card when it receives some invalid values from one or more
  locations in its I/O or memory spaces.
- The driver then confirms the existence of the error by calling EEH kernel services. After the error state
  is confirmed, the slot is declared frozen.
- After the slot is frozen, all further activities to the card are suspended until the error is recovered. For
  example, new read/write requests are blocked or failed.
- The driver attempts to recover the slot by toggling the reset line. After three attempts to recover, the
  driver declares the slot unusable (or dead). If the slot is reset successfully, normal operations resume.
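The last step above, a bounded number of reset attempts followed by declaring the slot dead, can be sketched as a small loop. This is a user-space illustration with hypothetical names. It does not call any EEH kernel service, and the reset itself is modeled by a callback supplied by the caller.

```c
#include <stdbool.h>

enum sketch_slot_state { SKETCH_SLOT_NORMAL, SKETCH_SLOT_FROZEN, SKETCH_SLOT_DEAD };

#define SKETCH_MAX_RESET_ATTEMPTS 3     /* the text mentions three attempts */

/* try_reset is a hypothetical callback that returns true when the reset
 * succeeds. The slot returns to normal on the first success and is
 * declared dead once all attempts have failed. */
static enum sketch_slot_state sketch_recover(bool (*try_reset)(void *), void *ctx)
{
    int attempt;

    for (attempt = 0; attempt < SKETCH_MAX_RESET_ATTEMPTS; attempt++) {
        if (try_reset(ctx))
            return SKETCH_SLOT_NORMAL;
    }
    return SKETCH_SLOT_DEAD;
}

/* Example callback: fails until the counter in ctx reaches zero. */
static bool sketch_reset_after_failures(void *ctx)
{
    int *remaining = ctx;

    if (*remaining == 0)
        return true;
    (*remaining)--;
    return false;
}
```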
The key difference between the single-function and shared EEH models is that in the shared EEH model, there is
a need for coordination among different driver instances controlling the same PE domain. For example, a
PE domain can include a physical device on a single slot. The driver instances controlling each function on
the device require coordination. In addition, there are steps in the recovery process that need to be carried
out only once. So among all registered drivers in a shared EEH domain, one is chosen to be the master.
The drivers follow a state machine. The EEH kernel services are implemented so that they present an
EEH recovery state machine to the device drivers. It is the master driver’s responsibility to drive the state
machine. The section titled Shared EEH Programming Model, which follows, contains more details on how
a master driver is determined. Many details are hidden from the device drivers for simplicity. Because the
shared EEH model is more flexible and extensible, it is recommended for new device drivers.
In the single-function model, the device drivers are responsible for driving their own error recovery. In other
words, they are responsible for implementing their own state machine. Every time EEH recovery is
extended in some way at the hardware or firmware level, there is probably a code and testing impact on
the single-function implementations. As previously described, a single-function adapter should still use the
shared EEH model. In that case, all the messages from the EEH kernel services are sent to just one driver
instance.
Note: This is the usual recovery sequence. If any of the services fail, the EEH_DD_DEAD message is
broadcast asking the drivers to mark their adapters unavailable (for example, the drivers might
have to perform some cleanup work and mark their internal states appropriately). The master
driver must call eeh_slot_error() to create an AIX error log and mark the adapter permanently
unavailable.
There are two special scenarios that a driver developer needs to be aware of:
1. If a driver receives either an EEH_DD_SUSPEND or an EEH_DD_DEAD message, it can return an EEH_BUSY
return code from its callback routine instead of an EEH_SUCC return code. If EEH kernel services
receive an EEH_BUSY return code, they wait for some time and then call the same
driver again. This process continues until EEH kernel services receive a different return code. The
call is repeated because some drivers need more time to clean up before recovery can continue.
Cleanup includes activities such as killing a kproc or notifying a user-level application.
2. If eeh_enable_dma() and eeh_enable_pio() cannot succeed due to platform state restrictions, the
service returns an EEH_FAIL return code followed by an EEH_DD_DEAD message unless you take action.
To avoid receiving an EEH_FAIL return code, the driver must supply the
EEH_ENABLE_NO_SUPPORT_RC flag when it calls the eeh_init_multifunc() kernel service. If the
EEH_ENABLE_NO_SUPPORT_RC flag is supplied, eeh_enable_pio() and eeh_enable_dma() return
the EEH_NO_SUPPORT return code, which indicates to the drivers that they cannot collect debug data but
can continue with the next step in recovery. For more information, see eeh_read_slot_state.
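The busy-retry behavior in the first scenario can be modeled as a loop that keeps calling a callback until it stops reporting busy. The sketch below runs in user space and uses its own hypothetical return codes rather than the real EEH constants. The wait between calls is omitted.

```c
enum sketch_rc { SKETCH_SUCC, SKETCH_BUSY, SKETCH_FAIL };

/* Call the driver callback repeatedly while it reports busy. Returns the
 * first non-busy code and stores the number of calls made in *calls.
 * A real implementation would wait between calls. */
static enum sketch_rc sketch_deliver(enum sketch_rc (*cb)(void *), void *arg,
                                     int *calls)
{
    enum sketch_rc rc;

    *calls = 0;
    do {
        rc = cb(arg);
        (*calls)++;
    } while (rc == SKETCH_BUSY);
    return rc;
}

/* Example callback: reports busy a fixed number of times, then succeeds,
 * as a driver still cleaning up might do. */
static enum sketch_rc sketch_busy_n_times(void *arg)
{
    int *left = arg;

    if (*left > 0) {
        (*left)--;
        return SKETCH_BUSY;
    }
    return SKETCH_SUCC;
}
```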
The EEH kernel services that you can use are listed in the following table:
Note: eeh_init() and eeh_init_multifunc() are the only exported kernel services. All other kernel services
are called using function pointers in the eeh_handle structure.
Callback Routine
The (*callback_ptr)() function prototype is defined as:
long eeh_callback(
    unsigned long long cmd,   /* EEH messages */
    void *arg,                /* Pointer to dd defined argument */
    unsigned long flag)       /* DD defined flag */
- cmd – contains a kernel and driver message
- arg – is a cookie to a target device driver that is usually a pointer to the adapter structure
- flag – can be either just EEH_MASTER or EEH_MASTER ORed with EEH_DD_PIO_ENABLED
  EEH_MASTER
      Indicates that the target device driver is the EEH_MASTER.
  EEH_DD_PIO_ENABLED
      Set only with the EEH_DD_DEBUG message to indicate that the PIO is enabled.
When eeh_init_multifunc() is called, the callback routines are registered. When eeh_clear() is called, the
callback routines are unregistered. The callback routines are necessary for EEH kernel services recovery.
They coordinate shared EEH domain driver instances. For more information on how this coordination is
done, see “Enhanced I/O Error Handling Kernel Services” on page 48.
The shared EEH domain drivers are expected to handle the following EEH kernel services messages:
EEH_DD_SUSPEND
    Notifies all the device drivers on a slot that an EEH kernel services event occurred. The slot is
    either frozen or temporarily unavailable. Because an EEH kernel services event occurred, the
    device drivers suspend operations. Then, the EEH_MASTER driver either enables PIO or resets
    the slot.
EEH_DD_DEBUG
    Notifies all drivers on a slot that they can now gather debug data from the devices. The device
    drivers log errors by calling the eeh_slot_error() function and passing in the gathered debug data.
    This message is sent when the EEH_MASTER calls the eeh_enable_pio() function. On the
    callback routine, the flag argument is set to EEH_DD_PIO_ENABLED.
EEH_DD_DEAD
    Notifies all drivers on a slot that the slot reached an unrecoverable state and the slot is no longer
    usable. This message is sent anytime EEH kernel services fail because of hardware or firmware
The device drivers define their own messages based on the contents of the sys/eeh.h file.
The eeh_callback() functions are scheduled to start sequentially at INTIODONE priority. They are not
started in any specific order. For more information, see eeh_broadcast.
For compatibility support of other file systems and block special file support, the buffer cache services
serve two important purposes:
- They ensure that multiple processes accessing the same block of the same device in multiprogrammed
  fashion maintain a consistent view of the data in the block.
- They increase the efficiency of the system by keeping in-memory copies of blocks that are frequently
  accessed.
The Buffer Cache services use the buf structure or buffer header as their main data-tracking mechanism.
Each buffer header contains a pair of pointers that maintains a doubly-linked list of buffers associated with
a particular block device. An additional pair of pointers maintains a doubly-linked list of blocks available for
use again on another operation. Buffers that have I/O in progress or that are busy for other purposes do
not appear in this available list.
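The two pairs of pointers can be pictured with a simplified header. The following user-space sketch uses a hypothetical sketch_buf type, not the real buf structure, and implements only the available list. A buffer leaves that list when it becomes busy and rejoins it on release.

```c
#include <stddef.h>

/* Hypothetical, simplified buffer header; the real buf structure differs. */
struct sketch_buf {
    struct sketch_buf *dev_next, *dev_prev;   /* per-device list links */
    struct sketch_buf *av_next,  *av_prev;    /* available-list links */
    int busy;
};

/* Initialize a circular list head for the available list. */
static void sketch_av_init(struct sketch_buf *head)
{
    head->av_next = head->av_prev = head;
}

/* Release: mark the buffer free and append it to the tail of the list. */
static void sketch_av_release(struct sketch_buf *head, struct sketch_buf *bp)
{
    bp->busy = 0;
    bp->av_prev = head->av_prev;
    bp->av_next = head;
    head->av_prev->av_next = bp;
    head->av_prev = bp;
}

/* Take: unlink the buffer from the available list and mark it busy. */
static void sketch_av_take(struct sketch_buf *bp)
{
    bp->av_prev->av_next = bp->av_next;
    bp->av_next->av_prev = bp->av_prev;
    bp->av_next = bp->av_prev = NULL;
    bp->busy = 1;
}
```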
Kernel buffers are discussed in more detail in Introduction to Kernel Buffers in AIX 5L Version 5.3
Technical Reference: Kernel and Subsystems Volume 1.
See “Block I/O Kernel Services” on page 45 for a list of these services.
In either case, the buffer and the corresponding device block are made busy. Other processes attempting
to access the buffer must wait until it becomes free. The getblk service is used when:
- A block is about to be rewritten totally.
The breada kernel service is used to perform read-ahead I/O and is similar to the bread service except
that an additional parameter specifies the number of the block on the same device to be read
asynchronously after the requested block is available. The brelse kernel service makes the specified
buffer available again to other processes.
The bawrite service is an asynchronous version of the bwrite service and does not wait for I/O
completion. This service is normally used when the overlap of processing and device I/O activity is
desired.
The bdwrite service does not start any I/O operations, but marks the buffer as a delayed write and
releases it to the free list. Later, when the buffer is obtained from the free list and found to contain data
from some other block, the data is written out to the correct device before the buffer is used. The bdwrite
service is used when it is undetermined if the write is needed immediately.
For example, the bdwrite service is called when the last byte of the write operation associated with a
block special file falls short of the end of a block. The bdwrite service is called on the assumption that
another write will soon occur that will use the same block again. On the other hand, as the end of a block
is passed, the bawrite service is called, because it is assumed the block will not be accessed again soon.
Therefore, the I/O processing can be started as soon as possible.
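The choice described in this example depends on whether a write ends exactly at a block boundary. The following sketch captures that decision for the final block of a write in user-space C. The function and enumeration names are hypothetical, and it does not call the bdwrite or bawrite services.

```c
#include <stddef.h>

enum sketch_write_kind {
    SKETCH_DELAYED,     /* corresponds to choosing a delayed write */
    SKETCH_ASYNC        /* corresponds to choosing an asynchronous write */
};

/* Decide how to write the last block touched by a write operation.
 * offset is the starting byte offset, len the number of bytes written
 * (greater than zero), and blksize the block size (greater than zero). */
static enum sketch_write_kind sketch_choose_write(size_t offset, size_t len,
                                                  size_t blksize)
{
    size_t end = offset + len;

    /* Ending on a block boundary means the block is complete, so start
     * the I/O now. Otherwise expect another write to the same block. */
    return (end % blksize == 0) ? SKETCH_ASYNC : SKETCH_DELAYED;
}
```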
Note that the getblk and bread services dedicate the specified block to the caller while making other
processes wait, whereas the brelse, bwrite, bawrite, or bdwrite services must eventually be called to
free the block for use by other processes.
Understanding Interrupts
Each hardware interrupt has an interrupt level, trigger, and interrupt priority.
Interrupt Level
The interrupt level defines the source of the interrupt and is often referred to as the interrupt source. There
are basically two types of interrupt levels: system and bus. The bus interrupts are generated by the
devices on the buses (such as PCI, ISA, VDEVICE, and PCI-E). Examples of system interrupts are the
timer and Environmental and Power Off Warning (EPOW).
The interrupt level of a bus or system interrupt is one of the resources managed by the respective
configuration methods.
Interrupt Trigger
There are two types of trigger mechanisms: level-triggered interrupts and edge-triggered interrupts. All ISA
and VDEVICE interrupts are edge-triggered. PCI/PCIX and PCI-E buses define two types of interrupts,
Level Signalled Interrupts (LSI) and Message Signalled Interrupts (MSI). LSIs are level-triggered, and MSIs
are edge-triggered. PCI/PCIX device drivers in AIX must handle only level-triggered interrupts even though
edge-triggered interrupts using MSIs are supported by PCIX. Similarly, even though PCI-E supports LSI
A key difference between the edge-triggered and level-triggered interrupts is interrupt sharing.
Level-triggered interrupts can be shared. Edge-triggered interrupts cannot be shared. Because they cannot
be shared, edge-triggered interrupt handlers should pass the INTR_EDGE flag on the i_init() kernel
service.
Another difference between the edge-triggered and level-triggered interrupts is in issuing the End of
Interrupt (EOI). For level-triggered interrupts, the AIX kernel issues the EOI. For ISA edge-triggered
interrupts, the AIX kernel also issues EOI. However, for the VDEVICE and PCI-E edge-triggered interrupts,
the device driver must issue the EOI before returning from the interrupt handler. The VDEVICE and PCI-E
drivers can use the i_eoi() kernel service to accomplish this.
Note: During the processing of i_eoi(), there is a brief period of time in which a newly arrived interrupt
could also be issued an EOI. Therefore, it is necessary to check for additional work between a call
to i_eoi() and the return from the interrupt handler.
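The reason for the extra check can be shown with a toy device model. In the user-space sketch below, all names are hypothetical and no kernel service is called. The late_arrivals parameter models interrupts that arrive while the EOI is being processed, and the handler makes a second pass so that their work is not left pending.

```c
/* Hypothetical device model: pending counts work items awaiting service. */
struct sketch_dev {
    int pending;
    int serviced;
    int eoi_issued;
};

/* Service every work item that is currently pending. */
static void sketch_service_all(struct sketch_dev *d)
{
    while (d->pending > 0) {
        d->pending--;
        d->serviced++;
    }
}

/* Model the EOI; late_arrivals is work that shows up during the EOI. */
static void sketch_issue_eoi(struct sketch_dev *d, int late_arrivals)
{
    d->eoi_issued++;
    d->pending += late_arrivals;
}

/* Handler shape suggested by the note: service, EOI, then check again. */
static void sketch_edge_handler(struct sketch_dev *d, int late_arrivals)
{
    sketch_service_all(d);
    sketch_issue_eoi(d, late_arrivals);
    sketch_service_all(d);      /* check for additional work after the EOI */
}
```

Without the second pass, the late work item in this model would remain pending with no further interrupt to trigger its processing.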
Interrupt Priorities
The interrupt priority defines which of a set of pending interrupts is serviced first. INTMAX is the most
favored interrupt priority and INTBASE is the least favored interrupt priority. The interrupt priorities for bus
interrupts range from INTCLASS0 to INTCLASS3. The rest of the interrupt priorities are reserved for the
base kernel. Interrupts that cannot be serviced within the time limits specified for bus interrupts qualify as
off-level interrupts.
A device’s interrupt priority is selected based on two criteria: its maximum interrupt latency requirements
and the device driver’s interrupt execution time. The interrupt latency requirement is the maximum time
within which an interrupt must be serviced. (If it is not serviced in this time, some event is lost or
performance is seriously degraded.) The interrupt execution time is the number of machine cycles required
by the device driver to service the interrupt. Interrupts with a short interrupt latency time must have a short
interrupt service time.
The general rule for interrupt service times is based on the following interrupt priority table:
See “Interrupt Management Kernel Services” on page 46 for a list of these services.
A device driver allocates and initializes DMA-related resources with the d_map_init service and frees the
resources with the d_map_clear service. Each time a DMA mapping needs to be established, the driver
calls the d_map_page or d_map_list service.
The d_map_page and d_map_list services map DMA buffers in the bus memory. In other words, given a
set of DMA buffer addresses, a corresponding set of bus addresses is returned to the driver. The driver
programs its device with the bus addresses and sets it up to start the DMA. When the DMA is complete:
- The device generates an interrupt that is handled by the driver.
- If no more DMA will be done to the buffers, the driver unmaps the DMA buffers with the
  d_unmap_page or d_unmap_list services.
Data Structures
The d_map_init service returns a d_handle_t structure to the caller upon a successful completion. Only
the d_map_init service is an exported kernel service. All other DMA kernel services are called through the
function pointers in the d_handle_t structure (see the sys/dma.h system file).
struct d_handle {
    uint id;                    /* identifier for this device */
    uint flags;                 /* device capabilities */
#ifdef __64BIT_KERNEL
    /* pointer to d_map_page routine */
    int (*d_map_page)(d_handle_t, int, caddr_t, ulong *, struct xmem *);
    /* pointer to d_unmap_page routine */
    void (*d_unmap_page)(d_handle_t, ulong *);
    /* pointer to d_map_list routine */
    int (*d_map_list)(d_handle_t, int, int, dio_t, dio_t);
    /* pointer to d_unmap_list routine */
    void (*d_unmap_list)(d_handle_t, dio_t);
    /* pointer to d_map_slave routine */
    int (*d_map_slave)(d_handle_t, int, int, dio_t, uint);
    /* pointer to d_unmap_slave routine */
    int (*d_unmap_slave)(d_handle_t);
    /* pointer to d_map_disable routine */
    int (*d_map_disable)(d_handle_t);
    /* pointer to d_map_enable routine */
    int (*d_map_enable)(d_handle_t);
    /* pointer to d_map_clear routine */
    void (*d_map_clear)(d_handle_t);
    /* pointer to d_sync_mem routine */
    int (*d_sync_mem)(d_handle_t, dio_t);
#else
    int (*d_map_page)();        /* pointer to d_map_page routine */
    void (*d_unmap_page)();     /* pointer to d_unmap_page routine */
    int (*d_map_list)();        /* pointer to d_map_list routine */
    void (*d_unmap_list)();     /* pointer to d_unmap_list routine */
    int (*d_map_slave)();       /* pointer to d_map_slave routine */
    int (*d_unmap_slave)();     /* pointer to d_unmap_slave routine */
    int (*d_map_disable)();     /* pointer to d_map_disable routine */
    int (*d_map_enable)();      /* pointer to d_map_enable routine */
    void (*d_map_clear)();      /* pointer to d_map_clear routine */
    int (*d_sync_mem)();        /* pointer to d_sync_mem routine */
#endif
};
The following dio and d_iovec structures are used to define the scatter and gather lists used by the
d_map_list, d_unmap_list, and d_map_slave services (see the sys/dma.h system file).
struct dio {
    int32long64_t total_iovecs; /* total available iovec entries */
    int32long64_t used_iovecs;  /* number of used iovecs */
    int32long64_t bytes_done;   /* count of bytes processed */
    int32long64_t resid_iov;    /* number of iovec that couldn’t be */
                                /* fully mapped due to NORES,DIOFULL */
                                /* vec = &dvec[resid_iov] */
    struct d_iovec *dvec;       /* pointer to list of d_iovecs */
};
struct d_iovec {
    caddr_t iov_base;           /* base memory address */
    int32long64_t iov_len;      /* length of transfer for this area */
    struct xmem *xmp;           /* cross memory pointer for this address */
};
The following dio_64 and d_iovec_64 structures are used to define the scatter and gather lists used by
the d_map_list and d_unmap_list services when the DMA_ENABLE_64 flag is set on the d_map_init
call. These are not used under the 64-bit kernel and kernel extension environment because the dio and
d_iovec data structures are naturally 64-bit capable in that environment. (For more information, see the
sys/dma.h system file.)
struct dio_64 {
    int total_iovecs;           /* total available iovec entries */
    int used_iovecs;            /* number of used iovecs */
    int bytes_done;             /* count of bytes processed */
    int resid_iov;              /* number of iovec that couldn’t be */
                                /* fully mapped due to NORES,DIOFULL */
                                /* vec = &dvec[resid_iov] */
    struct d_iovec_64 *dvec;    /* pointer to list of d_iovecs */
};
struct d_iovec_64 {
    unsigned long long iov_base; /* base memory address */
    int iov_len;                 /* length of transfer for this area */
    struct xmem *xmp;            /* cross memory pointer for this address */
};
The following macros are provided in sys/dma.h for device drivers in order to call the DMA kernel services
cleanly:
#define D_MAP_INIT(bid, flags, bus_flags, channel) \
d_map_init(bid, flags, bus_flags, channel)
#define D_UNMAP_SLAVE(handle) \
(handle->d_unmap_slave != NULL) ? \
(handle->d_unmap_slave)(handle) : DMA_SUCC
Note: The driver does not need a dio bus list for calls to d_map_slave because the address
generation for slaves is hidden.
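The D_UNMAP_SLAVE macro shows the general shape of these wrappers: test the function pointer in the handle, call through it if it is present, and otherwise report success. The following user-space sketch illustrates only that pattern. The demo_handle structure, the DEMO_SUCC value, and the DEMO_UNMAP_SLAVE macro are hypothetical stand-ins, not the definitions in sys/dma.h.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SUCC 0 /* hypothetical stand-in for DMA_SUCC */

struct demo_handle;
typedef struct demo_handle *demo_handle_t;

/* Hypothetical, cut-down stand-in for struct d_handle. */
struct demo_handle {
    int calls;                          /* counts calls, for illustration */
    int (*unmap_slave)(demo_handle_t);  /* may be NULL on some platforms */
};

/* Same shape as D_UNMAP_SLAVE. The whole expansion is parenthesized so
 * that the macro can be used inside a larger expression. */
#define DEMO_UNMAP_SLAVE(h) \
    (((h)->unmap_slave != NULL) ? ((h)->unmap_slave)(h) : DEMO_SUCC)

/* Example routine that a platform might install in the handle. */
static int demo_unmap_slave_impl(demo_handle_t h)
{
    assert(h != NULL);
    h->calls++;
    return DEMO_SUCC;
}
```

With a NULL pointer in the handle, DEMO_UNMAP_SLAVE evaluates to DEMO_SUCC without calling anything; with demo_unmap_slave_impl installed, the call goes through the pointer.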
Typically, a device driver provides a dio structure that contains only one virtual buffer and one length in
the list. If the virtual buffer length spans many pages, the bus address list contains multiple entries that
reflect the physical locations of the virtually contiguous buffer. The driver can provide multiple virtual
buffers in the virtual list. The driver can then place many buffer requests in one I/O operation.
The device driver is responsible for allocating the storage for all the dio lists it needs. For more
information, see the DIO_INIT and DIO_FREE macros in the sys/dma.h header file. The driver must have
at least two dio structures. One is needed for passing in the virtual list. Another is needed to accept the
resulting bus list. The driver can have many dio lists if it plans to have multiple outstanding I/O commands
to its device. The length of each list is dependent on the use of the device and driver. The virtual list
needs as many elements as the device could place in one operation at the same time. A formula for
estimating how many elements the bus address list needs is the sum, over all virtual buffers, of the buffer
length divided by the page size, plus 2 for each buffer. Or,
sum [i=0 to n] ((vlist[i].length / PSIZE) + 2).
This formula handles the worst case, in which, for a virtually contiguous buffer that spans multiple pages,
each physical page is discontiguous and neither the starting nor the ending address is page-aligned.
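As a sketch of the estimate above, the following helper applies the formula to an array of virtual buffer lengths. The function name and parameters are illustrative only and do not come from sys/dma.h.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Worst-case number of bus-list entries for a virtual list, computed as
 * the sum of ((length / psize) + 2) over all virtual buffers.
 * lens[] holds the byte length of each virtual buffer and psize is the
 * I/O page size in bytes (for example, 4096).
 */
static size_t estimate_bus_iovecs(const size_t *lens, size_t n, size_t psize)
{
    size_t total = 0;
    size_t i;

    assert(psize > 0);
    for (i = 0; i < n; i++)
        total += (lens[i] / psize) + 2;
    return total;
}
```

For two buffers of 4096 and 10000 bytes with a 4096-byte page, the estimate is (1 + 2) + (2 + 2) = 7 entries. The 10000-byte buffer can touch four pages when it starts near the end of a page, which is why 2 is added for each buffer.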
If the d_map_list service runs out of space when filling in the dio bus list, a DMA_DIOFULL error is
returned to the device driver and the bytes_done field of the dio virtual list is set to the number of bytes
successfully mapped in the bus list. This byte count is guaranteed to be a multiple of the minxfer field
provided to the d_map_list or d_map_slave services. Also, the resid_iov field of the virtual list is set to
the index of the first d_iovec entry that represents the remainder of iovecs that could not be mapped.
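One way a driver might use bytes_done and resid_iov after a partial mapping is to work out where the unmapped remainder begins. The sketch below assumes that bytes_done counts bytes from the start of the virtual list and that resid_iov names the first entry that was not fully mapped; this is an interpretation of the field descriptions above. The demo_iovec and demo_dio types are simplified, hypothetical stand-ins for d_iovec and dio.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for d_iovec and dio. */
struct demo_iovec {
    char  *iov_base;
    size_t iov_len;
};

struct demo_dio {
    size_t total_iovecs;
    size_t used_iovecs;
    size_t bytes_done;
    size_t resid_iov;
    struct demo_iovec *dvec;
};

/*
 * After a partial mapping, compute the unmapped tail of the iovec at
 * index vlist->resid_iov. On success, *base and *len describe that tail
 * and 0 is returned. Returns -1 if the fields are inconsistent.
 */
static int demo_remainder(const struct demo_dio *vlist, char **base, size_t *len)
{
    size_t before = 0;
    size_t i, off;

    assert(vlist != NULL && base != NULL && len != NULL);
    if (vlist->resid_iov >= vlist->used_iovecs)
        return -1;
    /* Bytes held by the entries that precede the residual entry. */
    for (i = 0; i < vlist->resid_iov; i++)
        before += vlist->dvec[i].iov_len;
    if (vlist->bytes_done < before)
        return -1;
    off = vlist->bytes_done - before;
    if (off > vlist->dvec[vlist->resid_iov].iov_len)
        return -1;
    *base = vlist->dvec[vlist->resid_iov].iov_base + off;
    *len  = vlist->dvec[vlist->resid_iov].iov_len - off;
    return 0;
}
```

A driver could start the mapped part of the transfer and later resubmit a virtual list that begins at the returned base and length, followed by any entries after resid_iov.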
If d_map_list or d_map_slave encounters an access violation on a page within the virtual list, then a
DMA_NOACC error is returned to the device driver and the bytes_done field of the dio virtual list is set to
the number of bytes that preceded the faulting iovec. In this case, the resid_iov field is set to the index of
the d_iovec entry that encountered the violation. From this information, the driver can determine which
virtual buffer contained the faulting page and fail that request back to the originator.
Note: If the DMA_NOACC error is returned, the bytes_done count is not guaranteed to be a multiple of
the minxfer field provided to the d_map_list or d_map_slave services, and no partial mapping is
done. For slaves, setup of the address generation hardware is not done. For masters, the bus list is
undefined. If the driver desires a partial transfer, it must make another call to the mapping service
with the dio list adjusted to not include the faulting buffer.
If either the d_map_list or d_map_slave services run out of resources when mapping a transfer, a
DMA_NORES error is returned to the device driver. In this case, the bytes_done field of the dio virtual list
is set to the number of bytes that were successfully mapped in the bus list. This byte count is guaranteed
to be a multiple of the minxfer field provided to the d_map_list or d_map_slave services. Also, the
resid_iov field of the virtual list is set to the index of the first d_iovec of the remaining iovecs that could
not be mapped. The device driver can:
• Initiate a partial transfer on its device and leave the remainder on its device queue
Note: If the DMA_ENABLE_64 flag is indicated on the d_map_init call, the programming model is the
same with one exception. The dio_64 and d_iovec_64 structures are used in addition to 64-bit
address fields on d_map_page and d_unmap_page calls.
Fields of dio
The only field of the bus list that a device driver modifies is the total_iovecs field to indicate how many
elements are available in the list. The device driver never changes any of the other fields in the bus list.
The device driver uses the bus list to set up its device for the transfer. The bus list is provided to the
d_unmap_list service to unmap the transfer. The d_map_list service sets the used_iovecs field to
indicate how many elements it filled out. The device driver sets up all of the fields in the virtual list except
for the bytes_done and resid_iov fields. These fields are set by the mapping service.
Using DMA_CONTIGUOUS
The DMA_CONTIGUOUS flag in d_map_init is the preferred way for the drivers to ask for contiguous bus
addresses. The other way is the old model of drivers explicitly using rmalloc() to guarantee contiguous
allocation during boot. However, with the advent of PCI Hot Plug devices, an rmalloc reservation made at boot
does not cover a device that is added after boot. If a PowerPC® driver determines that the device was dynamically added, the
driver can use the DMA_CONTIGUOUS flag to ensure that a contiguous list of bus addresses is
generated because no prior resources were reserved with rmalloc.
Using DMA_NO_ZERO_ADDR
DMA_NO_ZERO_ADDR is supplied on d_map_init to prevent d_map_page and d_map_list from giving
out bus address zero to this d_handle. Because many off-the-shelf PCI devices are not tested for bus
address of zero, such devices might not work. Striking out bus address 0 causes a driver’s mappable
memory to shrink by one I/O page (4 KB). On some systems, using the flag would cause d_map_init to
fail even if there is not an error condition. In such a case, the driver should call d_map_init without the
flag and then check the bus address to see whether zero falls in its range of addresses. The driver can do
this by mapping all of its range and checking for address 0. Such a check should be done at the driver
initialization time. If bus address 0 is assigned to the driver, it can leave it mapped for the life of the driver
and unmap all other addresses. This guarantees that address 0 is not assigned to it again.
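The check described above can be sketched as a scan over the bus addresses that the driver received when it mapped its whole range. The function name and the array representation are assumptions made for illustration; a real driver would obtain the addresses from the d_map_page or d_map_list services.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Given the bus addresses returned when the driver mapped its whole
 * range at initialization time, return the index of the entry that
 * received bus address 0, or -1 if no entry did.
 */
static long find_zero_bus_addr(const unsigned long *bus_addrs, size_t n)
{
    size_t i;

    assert(bus_addrs != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (bus_addrs[i] == 0)
            return (long)i;
    return -1;
}
```

If an index is returned, the driver would leave that one mapping in place for its lifetime and unmap every other entry. If -1 is returned, it can unmap them all.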
Using DMA_MAXMIN_MAPSPACE
The DMA_MAXMIN_MAPSPACE flag is supplied on the d_map_init service in order to prevent the
d_map_init service from allocating more DMA space for the DMA handle than what the device driver
requests.
The device drivers request a specific amount of DMA space by specifying a DMA_MAXMIN_* flag.
Because some device drivers support non-page-aligned DMA transfers of the specified DMA_MAXMIN_*,
by default the d_map_init service allocates at least one additional page of DMA space.
Device drivers that use the DMA_MAXMIN_MAPSPACE flag cannot support non-page-aligned DMA
transfers of the specified DMA_MAXMIN_*. The DMA_MAXMIN_MAPSPACE flag indicates that the
DMA_MAXMIN_* flag value represents the amount of mappable address space the device driver requires,
rather than the maximum transfer value.
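The reason an extra page is needed by default can be seen with a little arithmetic: a transfer that does not start on a page boundary touches one more page than an aligned transfer of the same length. The helper below is an illustrative sketch; its name and parameters are not part of the DMA services.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of I/O pages touched by a transfer of len bytes that starts at
 * byte address addr, for a page size of psize bytes.
 */
static size_t pages_spanned(unsigned long addr, size_t len, size_t psize)
{
    unsigned long first, last;

    assert(len > 0 && psize > 0);
    first = addr / psize;
    last  = (addr + len - 1) / psize;
    return (size_t)(last - first + 1);
}
```

With 4096-byte pages, an 8192-byte transfer starting at address 0 spans 2 pages, but the same transfer starting at address 100 spans 3. A driver that sets DMA_MAXMIN_MAPSPACE gives up that third page and therefore must keep transfers of the maximum size page-aligned.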
dd_start_io:
dd_finish_io:
dd_unconfigure:
call "D_MAP_CLEAR(handle)"
The DMA_BYPASS flag lets a device driver bypass the access checking functionality of these services.
This should only be used for global system buffers such as mbufs or other command, control, and status
buffers used by a device driver. Also, the DMA buffers must be pinned before the DMA transfer begins and
can only be unpinned after the DMA transfer is complete.
Starting with AIX 5.2, only ISA slave devices are supported (ISA masters are not supported). For such ISA
slave devices, the PCI-to-ISA bridge acts as the PCI master and initiates DMA on behalf of the ISA slave
devices. Because the PCI devices are always master, d_map_slave and d_unmap_slave services cannot
be used by PCI device drivers. By the same token, the DMA_SLAVE flag cannot be supplied on
d_map_init by a PCI device driver. If DMA_SLAVE is used by a PCI driver, d_map_init() returns
DMA_FAIL.
The following information is provided to assist you in learning more about kernel services:
• “Kernel Extension Loading and Unloading Services”
• “Other Kernel Extension and Device Driver Management Services”
• “List of Kernel Extension and Device Driver Management Kernel Services” on page 63
The kmod_load and kmod_unload services can be used to dynamically alter the set of routines loaded into
the kernel based on system configuration and application demand. Subsystems and device drivers can
use these services to load large, seldom-used routines on demand.
Some kernel extensions might be sensitive to the settings of base kernel runtime configurable parameters
that are found in the var structure defined in the /usr/include/sys/var.h file. These parameters can be set
automatically during system boot or at runtime by a privileged user. Kernel extensions can register or
unregister a configuration notification routine with the cfgnadd and cfgndel kernel services. Each time the
sysconfig subroutine is used to change base kernel tunable parameters found in the var structure, each
registered configuration notification routine is called.
The prochadd and prochdel kernel services allow kernel extensions to be notified when any process in
the system has a state transition, such as being created, exiting, or being swapped in or swapped out.
The uexadd and uexdel kernel services give kernel extensions the capability to intercept user-mode
exceptions. A user-mode exception handler can use this capability to dynamically reassign access to
single-use resources or to clean up after some particular user-mode error. The associated uexblock and
uexclear services can be used by these handlers to block and resume process execution when handling
these exceptions.
The pio_assist and getexcept kernel services are used by device drivers to obtain detailed information
about exceptions that occur during I/O bus access. The getexcept service can also be used by any
exception handler requiring more information about an exception that has occurred. The selreg kernel
service is used by file select operations to register unsatisfied asynchronous poll or select event requests
with the kernel. The selnotify kernel service provides the same functionality as the selwakeup service
found on other operating systems.
The getuerror and setuerror services allow kernel extensions to read or set the ut_error field for the
current thread. This field can be used to pass an error code from a system call function to an application
program, because kernel extensions do not have direct access to the application’s errno variable.
cfgnadd Registers a notification routine to be called when system-configurable variables are changed.
cfgndel Removes a notification routine for receiving broadcasts of changes to system configurable
variables.
devdump Calls a device driver dump-to-device routine.
devstrat Calls a block device driver’s strategy routine.
devswadd Adds a device entry to the device switch table.
devswchg Alters a device switch entry point in the device switch table.
devswdel Deletes a device driver entry from the device switch table.
devswqry Checks the status of a device switch entry in the device switch table.
getexcept Allows kernel exception handlers to retrieve additional exception information.
getuerror Allows kernel extensions to read the ut_error field for the current thread.
iostadd Registers an I/O statistics structure used for updating I/O statistics reported by the iostat
subroutine.
iostdel Removes the registration of an I/O statistics structure used for maintaining I/O statistics on a
particular device.
kmod_entrypt Returns a function pointer to a kernel module’s entry point.
kmod_load Loads an object file into the kernel or queries for an object file already loaded.
kmod_unload Unloads a kernel object file.
pio_assist Provides a standardized programmed I/O exception handling mechanism for all routines
performing programmed I/O.
prochadd Adds a system-wide process state-change notification routine.
prochdel Deletes a process state-change notification routine.
selreg Registers an asynchronous poll or select request with the kernel.
selnotify Wakes up processes waiting in a poll or select subroutine or the fp_poll kernel service.
setuerror Allows kernel extensions to set the ut_error field for the current thread.
uexadd Adds a system-wide exception handler for catching user-mode process exceptions.
uexblock Makes the currently active kernel thread not runnable when called from a user-mode
exception handler.
uexclear Makes a kernel thread blocked by the uexblock service runnable again.
uexdel Deletes a previously added system-wide user-mode exception handler.
Simple Locks
Simple locks are exclusive-write, non-recursive locks that protect thread-thread or thread-interrupt critical
sections. Simple locks are preemptable, meaning that a kernel thread can be preempted by another,
higher priority kernel thread while it holds a simple lock. The simple lock kernel services are:
On a multiprocessor system, simple locks that protect thread-interrupt critical sections must be used in
conjunction with interrupt control in order to serialize execution both within the executing processor and
between different processors. On a uniprocessor system interrupt control is sufficient; there is no need to
use locks. The following kernel services provide appropriate locking calls for the system on which they are
executed:
disable_lock Raises the interrupt priority, and locks a simple lock if necessary.
unlock_enable Unlocks a simple lock if necessary, and restores the interrupt priority.
Using the disable_lock and unlock_enable kernel services to protect thread-interrupt critical sections
(instead of calling the underlying interrupt control and locking kernel services directly) ensures that
multiprocessor-safe code does not make unnecessary locking calls on uniprocessor systems.
Simple locks are spin locks; a kernel thread that attempts to acquire a simple lock may spin (busy-wait:
repeatedly execute instructions which do nothing) if the lock is not free. The table shows the behavior of
kernel threads and interrupt handlers that attempt to acquire a busy simple lock.
Note: On uniprocessor systems, the maximum spin threshold is set to one, meaning that a kernel
thread will never spin waiting for a lock.
A simple lock that protects a thread-interrupt critical section must never be held across a sleep, otherwise
the interrupt could spin for the duration of the sleep, as shown in the table. This means that such a routine
must not call any external services that might result in a sleep. In general, using any kernel service which
is callable from process level may result in a sleep, as can accessing unpinned data. These restrictions do
not apply to simple locks that protect thread-thread critical sections.
The lock word of a simple lock must be located in pinned memory if simple locking services are called with
interrupts disabled.
By default, complex locks are not recursive (they cannot be acquired in exclusive-write mode multiple
times by a single thread). A complex lock can become recursive through the lock_set_recursive kernel
service. A recursive complex lock is not freed until lock_done is called once for each time that the lock
was locked.
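The release rule for a recursive lock can be illustrated with a small single-threaded model that tracks only the owner and the nesting depth. This is not the AIX complex lock implementation and it provides no real mutual exclusion. The demo_rlock type and its functions are hypothetical names used for illustration.

```c
#include <assert.h>

/* Hypothetical model of a recursive exclusive lock: owner plus depth. */
struct demo_rlock {
    int owner;  /* id of the owning thread, or -1 when the lock is free */
    int depth;  /* number of acquisitions not yet released */
};

static void demo_rlock_init(struct demo_rlock *l)
{
    l->owner = -1;
    l->depth = 0;
}

/* Returns 1 if acquired (first time or recursively), 0 if another
 * thread currently holds the lock. */
static int demo_rlock_acquire(struct demo_rlock *l, int tid)
{
    if (l->depth > 0 && l->owner != tid)
        return 0;
    l->owner = tid;
    l->depth++;
    return 1;
}

/* Models one release call. Returns 1 only when this call actually
 * frees the lock, that is, when it matches the outermost acquisition. */
static int demo_rlock_done(struct demo_rlock *l, int tid)
{
    assert(l->depth > 0 && l->owner == tid);
    if (--l->depth == 0) {
        l->owner = -1;
        return 1;
    }
    return 0;
}
```

A thread that acquires the model lock twice must release it twice before another thread can acquire it, which mirrors the rule that lock_done must be called once for each time the lock was locked.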
Complex locks are not spin locks; a kernel thread that attempts to acquire a complex lock may spin briefly
(busy-wait: repeatedly execute instructions which do nothing) if the lock is not free. The table shows the
behavior of kernel threads that attempt to acquire a busy complex lock:
Note:
1. On uniprocessor systems, the maximum spin threshold is set to one, meaning that a kernel
thread will never spin waiting for a lock.
2. The concept of a single owner does not apply to a lock held in shared-read mode.
Lockl Locks
Note: Lockl locks (previously called conventional locks) are only provided to ensure compatibility with
existing code. New code should use simple or complex locks.
Lockl locks are exclusive-access and recursive locks. The lockl lock kernel services are:
Atomic Operations
Atomic operations are sequences of instructions that guarantee atomic accesses and updates of shared
single word variables. This means that atomic operations cannot protect accesses to complex data
structures in the way that locks can, but they provide a very efficient way of serializing access to a single
word.
Single word variables accessed by atomic operations must be aligned on a full word boundary, and must
be located in pinned memory if atomic operation kernel services are called with interrupts disabled.
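As a user-space analogy only, the following sketch uses the C11 stdatomic.h facilities to show an atomic update of a single word and a check of the word-alignment requirement. It does not use the AIX atomic operation kernel services, and the function names are hypothetical. A 4-byte word is assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Returns nonzero if p is aligned on a full (4-byte) word boundary. */
static int is_word_aligned(const void *p)
{
    return ((uintptr_t)p % 4u) == 0;
}

/* Atomically adds delta to *word and returns the value it held before
 * the addition. */
static int demo_fetch_and_add(atomic_int *word, int delta)
{
    assert(is_word_aligned(word));
    return atomic_fetch_add(word, delta);
}
```

An update like this serializes access to one counter or flag word without a lock. Protecting several related fields together still requires a simple or complex lock.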
The Logical File System services are one component of the logical file system, which provides the
functions required to map system call requests to virtual file system requests. The logical file system is
responsible for resolution of file names and file descriptors. It tracks all open files in the system using the
file table. The Logical File System services are lower level entry points into the system call support within
the logical file system.
Routines in the kernel that must access data stored in files or that must set up paths to devices are the
primary users of these services. This occurs most commonly in device drivers, where a lower level device
driver must be accessed or where the device requires microcode to be downloaded. Use of the Logical
File System services is not, however, restricted to these cases.
A process can use the Logical File System services to establish access to a file or device by calling:
• The fp_open service with a path name to the file or device it must access.
These three services return a file pointer that is needed to call the other Logical File System services. The
other services provide the functions that are provided by the corresponding system calls.
Other Considerations
The Logical File System services are available only in the process environment.
In addition, calling the fp_open service at certain times can cause a deadlock. The lookup on the file
name must acquire file system locks. If the process is already holding any lock on a component of the
path, the process will be deadlocked. Therefore, do not use the fp_open service when the process is
already executing an operation that holds file system locks on the requested path. The operations most
likely to cause this condition are those that create files.
These kernel services are defined in the adspace.h and ioacc.h header files.
For a list of PIO macros, see Programmed I/O Services in Understanding the Diagnostic Subsystem for
AIX.
The following information is provided to assist you in learning more about memory kernel services:
• Memory Management Kernel Services
• ldata Kernel Services
• Memory Pinning Kernel Services
• User Memory Access Kernel Services
• Virtual Memory Management Kernel Services
• Cross-Memory Kernel Services
init_heap Initializes a new heap to be used with kernel memory management services.
xmalloc Allocates memory.
xmfree Frees allocated memory.
Services are provided to allow kernel subsystems and extensions to create, destroy, and grow ldata pools.
There are two advantages to using ldata kernel services over raw xmalloc calls:
1. Because the memory allocated by ldata kernel services is backed by local node memory, it is faster to
read and write the ldata region on that node.
2. ldata elements can be allocated from the interrupt environment, whereas the xmalloc kernel service cannot be
called from the interrupt environment. There is, however, an upper limit on a given ldata pool: the
maximum number of elements requested at ldata creation time.
The ldata services are supported on the 32-bit and 64-bit kernels, in both uniprocessor and multiprocessor
environments.
ldata_create Creates a SRAD-aware pinned storage element pool (ldata pool) and returns its handle.
ldata_destroy Destroys a ldata pool created by ldata_create.
ldata_grow Expands the count of available pinned storage elements contained within a ldata pool.
ldata_alloc Allocates a pinned storage element from a ldata pool.
ltpin Pins the address range in the system (kernel) space and frees the page space for the
associated pages.
ltunpin Unpins the address range in system (kernel) address space and reallocates paging
space for the specified region.
pin Pins the address range in the system (kernel) space.
pincode Pins the code and data associated with a loaded object module.
pinu Pins the specified address range in user or system memory.
unpin Unpins the address range in system (kernel) address space.
unpincode Unpins the code and data associated with a loaded object module.
unpinu Unpins the specified address range in user or system memory.
xmempin Pins the specified address range in user or system memory, given a valid cross-memory
descriptor.
xmemunpin Unpins the specified address range in user or system memory, given a valid
cross-memory descriptor.
Note: pinu and unpinu are only available on the 32-bit kernel. Because of this limitation, it is
recommended that xmempin and xmemunpin be used in place of pinu and unpinu.
Note: The copyin64, copyout64, copyinstr64, fubyte64, fuword64, subyte64, and suword64 kernel
services are defined as macros when compiling kernel extensions on the 64-bit kernel. The macros
invoke the corresponding kernel services without the "64" suffix.
as_att, as_att64 Selects, allocates, and maps a specified region in the current user address space.
Note: as_att, as_det, as_geth, as_getsrval, as_seth, getadsp, lo_att, and lo_det are supported only on
the 32-bit kernel.
Cross-memory services provide the xmemdma or xmemdma64 service to prepare a page for DMA
processing. The xmemdma or xmemdma64 service returns the real address of the page for use in
preparing DMA address lists. When the DMA transfer is completed, the xmemdma or xmemdma64
service must be called again to unhide the page.
The xmemdma64 service is identical to xmemdma, except that xmemdma64 returns a 64-bit real
address. The xmemdma64 service can be called from the process or interrupt environments. It is also
present on the 32-bit platform to allow a single device driver or kernel extension binary to work on 32-bit or
64-bit platforms with no change and no run-time checks.
Data movement by DMA or an interrupt handler requires that the pages remain in memory. This is ensured
by pinning the data areas using the xmempin service. This can only be done under a process, because
the memory pinning services page-fault on pages not present in memory.
The xmemunpin service unpins pinned pages. This can be done by an interrupt handler if the data area is
in the global kernel address space. It must be done under a process if the data area is in user process
space.
Note: xmattach, xmattach64, and xmemdma are supported only on the 32-bit kernel. xmemdma64 is
supported on both the 32-bit and 64-bit kernels.
The following aspects of the virtual memory manager interface are discussed:
• Virtual Memory Objects
• Addressing Data
• Moving Data to or from a Virtual Memory Object
• Data Flushing
• Discarding Data
• Protecting Data
• Executable Data
• Installing Pager Backends
• Referenced Routines
File systems use virtual memory objects so that the files can be referenced using a mapped file access
method. The mapped file access method represents the data through a virtual memory object, and allows
the virtual memory manager to handle page faults on the mapped file. When a page fault occurs, the
virtual memory manager calls the services supplied by the service provider (such as a virtual file system)
to get and put pages. A data provider (such as a file system) maintains any data structures necessary to
map between the virtual memory object offset and external storage addressing.
The data provider creates a virtual memory object when it has a request for access to the data. It deletes
the virtual memory object when it has no more clients referencing the data in the virtual memory object.
The vms_create service is called to create virtual memory objects. The vms_delete service is called to
delete virtual memory objects.
Addressing Data
Data in a virtual memory object is made addressable in user or kernel processes through the shmat
subroutine. A kernel extension uses the vm_att kernel service to select and allocate a region in the current
(per-process kernel) address space.
The per-process kernel address space initially sees only global kernel memory and the per-process kernel
data. The vm_att service allows kernel extensions to allocate additional regions. However, this augmented
per-process kernel address space does not persist across system calls. The additional regions must be
re-allocated with each entry into the kernel protection domain.
The vm_att service takes as an argument a virtual memory handle representing the virtual memory object
and the access capability to be used. The vm_handle service constructs the virtual memory handles.
When the kernel extension has finished processing the data mapped into the current address space, it
should call the vm_det service to deallocate the region and remove access.
The vm_move and vm_uiomove kernel services move data between a virtual memory object and a buffer
specified in a uio structure. This allows data providers (such as a file system) to move data to or from a
specified buffer to a designated offset in a virtual memory object. These services are similar to the uiomove
service, but the trusted buffer is replaced by the virtual memory object, which need not be currently
addressable.
Data Flushing
A kernel extension can initiate the writing of a data area to external storage with the vm_write kernel
service, if it has addressability to the data area. The vm_writep kernel service can be used if the virtual
memory object is not currently addressable.
Discarding Data
The pages specified by a data range can be released from the underlying virtual memory object by calling
the vm_release service. The virtual memory manager deallocates any associated paging space slots. A
subsequent reference to data in the range results in a page fault.
A virtual memory data provider can release a specified range of pages in a virtual memory object by
calling the vm_releasep service. The virtual memory object need not be addressable for this call.
Protecting Data
The vm_protectp service can change the storage protect keys in a page range in one client storage
virtual memory object. This only acts on the resident pages. The pages are referred to through the virtual
memory object. They do not need to be addressable in the current address space. A client file system data
provider uses this protection to detect stores of in-memory data, so that mapped files can be extended by
storing into them beyond their current end of file.
Executable Data
If the data moved is to become executable, any data remaining in processor cache must be guaranteed to
be moved from cache to memory. This is because the retrieval of the instruction does not need to use the
data cache. The vm_cflush service performs this operation.
For a local device, the device strategy routine is required. A call to the vm_mount service is used to
identify the device (through a dev_t value) to the virtual memory manager.
For a remote data provider, the routine required is a strategy routine, which is specified in the vm_mount
service. These strategy routines must run as interrupt-level routines. They must not page fault, and they
cannot sleep waiting for locks.
When access to a remote data provider or a local device is removed, the vm_umount service must be
called to remove the device entry from the virtual memory manager’s paging device table.
Referenced Routines
The virtual memory manager exports these routines to kernel extensions:
as_att Selects, allocates, and maps a region in the specified address space for the
specified virtual memory object.
as_det Unmaps and deallocates a region in the specified address space that was mapped
with the as_att kernel service.
as_geth Obtains a handle to the virtual memory object for the specified address given in
the specified address space. The virtual memory object is protected.
as_getsrval Obtains a handle to the virtual memory object for the specified address given in
the specified address space.
as_puth Indicates that no more references will be made to a virtual memory object that was
obtained using the as_geth kernel service.
as_seth Maps a specified region in the specified address space for the specified virtual
memory object.
getadsp Obtains a pointer to the current process’s address space structure for use with the
as_att and as_det kernel services.
vm_cflush Flushes cache lines for a specified address range.
vm_release Releases page frames and paging space slots for the specified address range.
vm_write Initiates page-out for an address range.
Note: as_att, as_det, as_geth, as_getsrval, as_seth, and getadsp are supported only on the 32-bit
kernel.
The following Memory-Pinning kernel services also support address space operations. They are the pin,
pinu, unpin, and unpinu services.
as_remap64 Maps a 64-bit address to a 32-bit address that can be used by the 32-bit kernel.
as_unremap64 Returns the original 64-bit address associated with a 32-bit mapped address.
rmmap_create64 Defines an effective address to real address translation region for either 64-bit or 32-bit
effective addresses.
rmmap_remove64 Destroys an effective address to real address translation region.
xmattach64 Attaches to a user buffer for cross-memory operations.
copyin64 Copies data between user and kernel memory.
copyout64 Copies data between user and kernel memory.
copyinstr64 Copies data between user and kernel memory.
fubyte64 Retrieves a byte of data from user memory.
fuword64 Retrieves a word of data from user memory.
subyte64 Stores a byte of data in user memory.
suword64 Stores a word of data in user memory.
The Message Queue services can be called only from the process environment because they prevent the
caller from specifying kernel buffers. These services can be used as an Interprocess Communication
mechanism to other kernel processes or user-mode processes. See Kernel Extension and Device Driver
Management Services for more information on the functions that these services provide.
There are four Message Queue services available from the kernel:
The if_attach and if_detach services add and remove network interfaces from the Network
Interface List. Protocols search this list to determine an appropriate interface on which to transmit a
packet.
Protocols use the add_input_type and del_input_type services to notify network interface drivers that the
protocol is available to handle packets of a certain type. The Network Interface Driver uses the
find_input_type service to distribute packets to a protocol.
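The registration and dispatch relationship described above can be pictured with a small user-space model. This is only a sketch: the model_ names, the fixed-size table, and the handler signature are assumptions made for illustration and are not the kernel's own interfaces or data structures.

```c
#include <stddef.h>

/* Hypothetical user-space model; names and layout are illustrative only
 * and do not reflect the kernel's own data structures. */
#define MODEL_MAX_INPUT_TYPES 8

typedef int (*model_input_handler_t)(const void *pkt, size_t len);

struct model_input_entry {
    unsigned short        type;     /* packet type claimed by a protocol */
    model_input_handler_t handler;  /* NULL marks a free slot */
};

static struct model_input_entry model_table[MODEL_MAX_INPUT_TYPES];

/* Protocol side: announce that `handler` accepts packets of `type`.
 * Returns 0 on success, -1 if the type is taken or the table is full. */
int model_add_input_type(unsigned short type, model_input_handler_t handler)
{
    int free_slot = -1;

    if (handler == NULL)
        return -1;
    for (int i = 0; i < MODEL_MAX_INPUT_TYPES; i++) {
        if (model_table[i].handler == NULL) {
            if (free_slot < 0)
                free_slot = i;
        } else if (model_table[i].type == type) {
            return -1;
        }
    }
    if (free_slot < 0)
        return -1;
    model_table[free_slot].type = type;
    model_table[free_slot].handler = handler;
    return 0;
}

/* Protocol side: withdraw a previously registered type. */
int model_del_input_type(unsigned short type)
{
    for (int i = 0; i < MODEL_MAX_INPUT_TYPES; i++) {
        if (model_table[i].handler != NULL && model_table[i].type == type) {
            model_table[i].handler = NULL;
            return 0;
        }
    }
    return -1;
}

/* Driver side: hand a received packet to whichever protocol claimed its
 * type. Returns the handler's result, or -1 if no protocol claimed it. */
int model_find_input_type(unsigned short type, const void *pkt, size_t len)
{
    for (int i = 0; i < MODEL_MAX_INPUT_TYPES; i++) {
        if (model_table[i].handler != NULL && model_table[i].type == type)
            return model_table[i].handler(pkt, len);
    }
    return -1;
}

/* Example protocol input routine: reports how many bytes it accepted. */
int model_sample_input(const void *pkt, size_t len)
{
    (void)pkt;
    return (int)len;
}
```

In this model a protocol would call model_add_input_type once at initialization, and the driver's receive path would call model_find_input_type for every incoming packet.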
The add_netisr and del_netisr services add and delete network software interrupt handlers. Address
families add and delete themselves from the Address Family Domain switch table by using the
add_domain_af and del_domain_af services. The Address Family Domain switch table is a list of all
available protocols that can be used in the socket subroutine.
The Address Family Domain and Network Interface Device Driver services are:
add_domain_af Adds an address family to the Address Family domain switch table.
add_input_type Adds a new input type to the Network Input table.
add_netisr Adds a network software interrupt service to the Network Interrupt table.
del_domain_af Deletes an address family from the Address Family domain switch table.
del_input_type Deletes an input type from the Network Input table.
del_netisr Deletes a network software interrupt service routine from the Network Interrupt table.
find_input_type Finds the given packet type in the Network Input Interface switch table and distributes
the input packet according to the table entry for that type.
if_attach Adds a network interface to the network interface list.
if_detach Deletes a network interface from the network interface list.
ifunit Returns a pointer to the ifnet structure of the requested interface.
schednetisr Schedules or invokes a network software interrupt service routine.
The interface address services accept a destination address or network and return an associated interface
address. Protocols use these services to determine if an address is on a directly connected network.
add_netopt This macro adds a network option structure to the list of network options.
del_netopt This macro deletes a network option structure from the list of network options.
net_attach Opens a communications I/O device handler.
net_detach Closes a communications I/O device handler.
net_error Handles errors for communication network interface drivers.
net_sleep Sleeps on the specified wait channel.
net_start Starts network IDs on a communications I/O device handler.
net_start_done Starts the done notification handler for communications I/O device handlers.
net_wakeup Wakes up all sleepers waiting on the specified wait channel.
net_xmit Transmits data using a communications I/O device handler.
net_xmit_trace Traces transmit packets. This kernel service was added for those network interfaces that
do not use the net_xmit kernel service to trace transmit packets.
The thread_setsched service is used to control the scheduling parameters, priority and scheduling policy,
of a thread.
The thread-specific uthread structure is also encapsulated. The getuerror and setuerror kernel services
should be used to access the ut_error field. The thread_self kernel service should be used to get the
current thread’s ID.
For more information concerning use of these services, see “Handling Exceptions While in a System Call”
on page 33.
Signal Management
Signals can be posted either to a kernel process or to a kernel thread. The pidsig service posts a signal
to a specified kernel process; the kthread_kill service posts a signal to a specified kernel thread. A thread
uses the sig_chk service to poll for signals delivered to the kernel process or thread in kernel mode.
For more information about signal management, see “Kernel Process Signal and Exception Handling” on
page 11.
The et_wait and et_post kernel services support single waiter event notification by using mutually agreed
upon event control bits for the kernel thread being posted. There are a limited number of control bits
available for use by kernel extensions. If the kernel_lock is owned by the caller of the et_wait service, it
is released and acquired again upon wakeup.
The following kernel services support a shared event notification mechanism that allows for multiple
threads to be waiting on the shared event.
e_assert_wait e_wakeup
e_block_thread e_wakeup_one
e_clear_wait e_wakeup_w_result
e_sleep_thread e_wakeup_w_sig
These services support an unlimited number of shared events (by using caller-supplied event words). The
following list indicates methods to wait for an event to occur:
- Calling e_assert_wait and e_block_thread successively; the first call puts the thread on the event
  queue, the second blocks the thread. Between the two calls, the thread can do any job, like releasing
  several locks. If only one lock, or no lock at all, needs to be released, one of the two other methods
  should be preferred.
- Calling e_sleep_thread; this service releases a simple or a complex lock, and blocks the thread. The
  lock can be automatically reacquired at wakeup.
The e_clear_wait service can be used by a thread or an interrupt handler to wake up a specified thread,
or by a thread that called e_assert_wait to remove itself from the event queue without blocking when
calling e_block_thread. The other wakeup services are event-based. The e_wakeup and
e_wakeup_w_result services wake up every thread sleeping on an event queue; whereas the
e_wakeup_one service wakes up only the most favored thread. The e_wakeup_w_sig service posts a
signal to every thread sleeping on an event queue, waking up all the threads whose sleep is interruptible.
The e_sleep and e_sleepl kernel services are provided for code that was written for previous releases of
the operating system. Threads that have called one of these services are woken up by the e_wakeup,
e_wakeup_one, e_wakeup_w_result, e_wakeup_w_sig, or e_clear_wait kernel services. If the caller of
the e_sleep service owns the kernel lock, it is released before waiting and is acquired again upon
wakeup. The e_sleepl service provides the same function as the e_sleep service except that a
caller-specified lock is released and acquired again instead of the kernel_lock.
The panic kernel service is called when a catastrophic failure occurs and the system can no longer
operate. The panic service performs a system dump. The system dump captures data areas that are
registered in the Master Dump Table. The kernel and kernel extensions use the dmp_ctl kernel service to
add and delete entries in the Master Dump Table, and record dump routine failures.
The errsave and errlast kernel services are called to record an entry in the system error log when a
hardware or software failure is detected.
The trcgenk and trcgenkt kernel services are used along with the trchook subroutine to record selected
system events in the event-tracing facility.
The register_HA_handler and unregister_HA_handler kernel services are used to register high
availability event handlers for kernel extensions that need to be aware of events such as processor
deallocation.
One of the RAS features is a service that monitors for excessive periods of interrupt disablement on a
processor, and logs these events to the error log. The disablement_checking_suspend and
disablement_checking_resume services exempt a code segment from this detection.
The Timer and Time-of-Day kernel services are divided into the following categories:
- Time-of-Day services
- Fine Granularity Timer services
- Timer services for compatibility
- Watchdog Timer services
delay Suspends the calling process for the specified number of timer ticks.
talloc Allocates a timer request block before starting a timer request.
tfree Deallocates a timer request block.
tstart Submits a timer request.
tstop Cancels a pending timer request.
For more information about using the Fine Granularity Timer services, see “Using Fine Granularity Timer
Services and Structures” on page 83.
w_clear Removes a watchdog timer from the list of watchdog timers known to the kernel.
w_init Registers a watchdog timer with the kernel.
w_start Starts a watchdog timer.
w_stop Stops a watchdog timer.
The Watchdog timer services can be used for noncritical times having a one-second resolution. The
timeout service can be used for noncritical times having a clock-tick resolution.
The itimerstruc_t t.it value substructure should be used to store time information for both absolute and
incremental timers. The T_ABSOLUTE absolute request flag is defined in the sys/timer.h file. It should be
ORed into the t->flags field if an absolute timer request is desired.
The T_LOWRES flag causes the system to round the t->timeout value to the next timer timeout. It should
be ORed into the t->flags field. The timeout is always rounded to a larger value. Because the system
maintains a 10 ms interval timer, T_LOWRES will never cause more than 10 ms to be added to a timeout.
The advantage of using T_LOWRES is that it prevents an extra interrupt from being generated.
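The rounding behavior of T_LOWRES can be illustrated with plain arithmetic. The following is a minimal sketch, not kernel code: it assumes an expiry time expressed in nanoseconds on a tick grid that starts at zero, and the names TICK_NS and lowres_round_up are invented for the example.

```c
#include <stdint.h>

/* 10 ms interval timer, as stated in the text, expressed in nanoseconds. */
#define TICK_NS 10000000LL

/* Round a non-negative expiry time up to the next tick boundary.
 * A value already on a boundary is returned unchanged, so the amount
 * added is always less than one tick (10 ms). */
int64_t lowres_round_up(int64_t timeout_ns)
{
    int64_t rem = timeout_ns % TICK_NS;

    return (rem == 0) ? timeout_ns : timeout_ns + (TICK_NS - rem);
}
```

Because the rounded value coincides with a tick that the system services anyway, no separate interrupt is needed for the request, which is the saving the text describes.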
The t->timeout and t->flags fields must be set or reset before each call to the tstart kernel service.
The argument to the func completion handler routine is the address of the trb structure, not the contents
of the t_union field.
The t->func timer function is called on an interrupt level. Therefore, code for this routine must follow
conventions for interrupt handlers.
If the requested service cannot be performed, the kernel service returns an error value.
Drivers which were written for uniprocessor systems do not check the return values of these kernel
services and are not multiprocessor-safe. Such drivers can still run as funnelled device drivers.
Most functions involved in the writing of a file system are specific to that file system type. But a limited
number of functions must be performed in a consistent manner across the various file system types to
enable the logical file system to operate independently of the file system type.
Related Information
Chapter 1, “Kernel Environment,” on page 1
Understanding File Descriptors in AIX 5L Version 5.3 General Programming Concepts: Writing and
Debugging Programs.
Subroutine References
The msgctl subroutine, msgget subroutine, msgsnd subroutine, msgxrcv subroutine in AIX 5L Version
5.3 Technical Reference: Base Operating System and Extensions Volume 1.
The trchook subroutine in AIX 5L Version 5.3 Technical Reference: Base Operating System and
Extensions Volume 2.
Technical References
The talloc kernel service, tfree kernel service, tstart kernel service, tstop kernel service in AIX 5L Version
5.3 Technical Reference: Kernel and Subsystems Volume 1.
In contrast, asynchronous I/O (AIO) operations run in the background and do not block user applications.
This improves performance, because I/O operations and applications processing can run simultaneously.
Using AIO usually improves your I/O throughput, especially when you are storing data in raw logical
volumes (as opposed to Journaled file systems). The actual performance, however, depends on how many
server processes are running that handle the I/O requests.
Many applications, such as databases and file servers, take advantage of the ability to overlap processing
and I/O. These AIO operations use various kinds of devices and files. Additionally, multiple AIO operations
can run at the same time on one or more devices or files.
Each AIO request has a corresponding control block in the application’s address space. When an AIO
request is made, a handle is established in the control block. This handle is used to retrieve the status and
the return values of the request.
Applications use the aio_read and aio_write subroutines to perform the I/O. Control returns to the
application from the subroutine, as soon as the request has been queued. The application can then
continue processing while the disk operation is being performed.
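As an illustration of this flow, the following sketch queues one read with the POSIX AIO interfaces and then waits for it to finish. It assumes the default (POSIX) definitions from the aio.h header; the function name read_with_aio is invented for the example, and error handling is kept to a minimum.

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to len bytes from the start of path using POSIX AIO.
 * Returns the number of bytes read, or -1 on failure. */
ssize_t read_with_aio(const char *path, void *buf, size_t len)
{
    struct aiocb cb;
    const struct aiocb *const list[1] = { &cb };
    ssize_t nread;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    memset(&cb, 0, sizeof(cb));     /* clear the control block first */
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) {       /* returns once the request is queued */
        close(fd);
        return -1;
    }

    /* The application is free to do other work here while the I/O runs. */

    while (aio_error(&cb) == EINPROGRESS)
        (void)aio_suspend(list, 1, NULL);   /* may return early on a signal */

    nread = aio_return(&cb);        /* final status; -1 if the read failed */
    close(fd);
    return nread;
}
```

The control block must stay valid until aio_return has been called, which is why the sketch keeps it on the stack of the function that also waits for completion.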
A kernel process (kproc), called a server, is in charge of each request from the time it is taken off the
queue until it completes. The number of servers limits the number of disk I/O operations that can be in
progress in the system simultaneously.
The default values are minservers=1 and maxservers=10. In systems that seldom run applications that use
AIO, this is usually adequate. For environments with many disk drives and key applications that use AIO,
the default is far too low. The result of a deficiency of servers is that disk I/O seems much slower than it
should be. Not only do requests spend inordinate lengths of time in the queue, but the low ratio of servers
to disk drives means that the seek-optimization algorithms have too few requests to work with for each
drive.
Note: AIO does not work if the control block or buffer is created using mmap (mapping segments).
In AIX 5.2 there are two AIO subsystems. The original AIX AIO, now called LEGACY AIO, has the same
function names as the Portable Operating System Interface (POSIX) compliant POSIX AIO. The major
differences between the two involve different parameter passing. Both subsystems are defined in the
/usr/include/sys/aio.h file. The _AIO_AIX_SOURCE macro is used to distinguish between the two
versions.
Note: The _AIO_AIX_SOURCE macro used in the /usr/include/sys/aio.h file must be defined when
using this file to compile an AIO application with the LEGACY AIO function definitions. The default
compile using the aio.h file is for an application with the new POSIX AIO definitions. To use the
LEGACY AIO function definitions, do the following in the source file:
#define _AIO_AIX_SOURCE
#include <sys/aio.h>
If there is at least one outstanding I/O to a local disk when the wait process is running, the time is
classified as waiting for I/O. Unless AIO is being used by the process, an I/O request to disk causes the
calling process to block (or sleep) until the request is complete. After a process’s I/O request completes, it
is placed on the run queue.
A wa value consistently over 25 percent might indicate that the disk subsystem is not balanced properly, or
it might be the result of a disk-intensive workload.
Note: AIO does not relieve an overly busy disk drive. Using the iostat command with an interval and
count value, you can determine if any disks are overly busy. Monitor the %tm_act column for each
disk drive on the system. On some systems, a %tm_act of 35.0 or higher for one disk can cause
noticeably slower performance. The relief for this case could be to move data from more busy to
less busy disks, but simply having AIO does not relieve an overly busy disk problem.
SMP Systems
For SMP systems, the us, sy, id and wa columns are only averages over all processors. But remember
that the I/O wait statistic per processor is not really a processor-specific statistic; it is a global statistic. An
I/O wait is distinguished from idle time only by the state of a pending I/O. If there is any pending disk I/O,
and the processor is not busy, then it is an I/O wait time. Disk I/O is not tracked by processors, so when
there is any I/O wait, all processors get charged (assuming they are all equally idle).
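The accounting rule in this paragraph can be written down as a small model. This is a sketch of the rule as the text states it, not of the kernel's accounting code; the enum and function names are invented for the example.

```c
#include <stddef.h>

enum cpu_tick_state { TICK_BUSY, TICK_IOWAIT, TICK_IDLE };

/* Classify one clock tick on one processor. The pending disk I/O count
 * is global, because disk I/O is not tracked per processor. */
enum cpu_tick_state classify_tick(int cpu_is_busy, unsigned pending_disk_io)
{
    if (cpu_is_busy)
        return TICK_BUSY;
    return (pending_disk_io > 0) ? TICK_IOWAIT : TICK_IDLE;
}

/* Count how many processors are charged with I/O wait for this tick. */
size_t count_iowait(const int *cpu_is_busy, size_t ncpu,
                    unsigned pending_disk_io)
{
    size_t n = 0;

    for (size_t i = 0; i < ncpu; i++) {
        if (classify_tick(cpu_is_busy[i], pending_disk_io) == TICK_IOWAIT)
            n++;
    }
    return n;
}
```

With four processors of which one is busy, a single pending disk I/O is enough for the model to charge the other three with I/O wait, which is why a high wa value on an SMP system does not identify a particular processor.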
To determine how many LEGACY AIO Servers are currently running, type the following information on the
command line:
pstat -a | egrep ' aioserver' | wc -l
If the disk drives that are being accessed asynchronously are using either the Journaled File System (JFS)
or the Enhanced Journaled File System (JFS2), all I/O is routed through the AIOs kprocs.
If the disk drives that are being accessed asynchronously are using a form of raw logical volume
management, then the disk I/O is not routed through the AIOs kprocs. In that case the number of servers
that are running is not relevant.
However, if you want to confirm that an application that uses raw logical volumes is taking advantage of
AIO, you can disable the fast path option using System Management Interface Tool (SMIT). When this
option is disabled, even raw I/O is forced through the AIOs kprocs. At that point, the pstat command listed
in the preceding discussion works. Do not run the system with this option disabled for any length of time.
The option provides a way to confirm that the application is working with AIO and raw logical volumes.
At releases earlier than AIX 4.3, the fast path is enabled by default and cannot be disabled.
Note: In some environments, you might see more than 80 AIOs kprocs running. If so, consider the
third rule that follows.
3. A third suggestion is to take statistics using vmstat -s before any high I/O activity begins, and again at
the end. Check the field iodone. From this you can determine how many physical I/Os are being
handled in a given wall clock period. Then increase the maximum number of servers and see if you
can get more activity or event completions (iodones) in the same time period.
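The comparison suggested here is simple arithmetic on two readings of the iodone counter. The sketch below assumes the two counter values and the elapsed wall clock time have already been collected by hand from vmstat -s; the function names are invented for the example.

```c
/* Completions per second between two readings of the iodone counter.
 * Returns -1.0 if the inputs are not usable. */
double iodone_rate(unsigned long long before, unsigned long long after,
                   double elapsed_seconds)
{
    if (elapsed_seconds <= 0.0 || after < before)
        return -1.0;
    return (double)(after - before) / elapsed_seconds;
}

/* Percentage change in completion rate after changing the maximum number
 * of servers. A positive result means more completions per second. */
double iodone_gain_percent(double base_rate, double new_rate)
{
    if (base_rate <= 0.0)
        return 0.0;
    return (new_rate - base_rate) / base_rate * 100.0;
}
```

For instance, a counter that moves from 1000 to 7000 over 60 seconds corresponds to 100 completions per second; repeating the measurement under the same workload after raising the maximum number of servers shows whether the change helped.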
Prerequisites
To make use of AIO, the following fileset must be installed:
bos.rte.aio
You must also make the aio0 or posix_aio0 device available using SMIT.
smit chgaio
smit chgposixaio
or
smit aio
smit posixaio
Due to the signed 32-bit definition of aio_offset, the default AIO interfaces are limited to an offset of 2 GB
minus 1. To overcome this limitation, a new AIO control block with a signed 64-bit offset field and a new
set of AIO interfaces have been defined. These 64-bit definitions end with "64".
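The limit follows directly from the range of a signed 32-bit integer. As a small illustration, the following check (with an invented function name) reports whether a given file offset can be expressed through the default interfaces or needs the 64-bit ones.

```c
#include <stdint.h>

/* Returns 1 if the offset fits in a signed 32-bit aio_offset field,
 * that is, if it lies between 0 and 2 GB minus 1 (2147483647). */
int fits_default_aio_offset(int64_t offset)
{
    return offset >= 0 && offset <= INT32_MAX;
}
```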
The large offset-enabled AIO interfaces are available under the _LARGE_FILES compilation environment
and under the _LARGE_FILE_API programming environment. For further information, see Writing
Programs That Access Large Files in AIX 5L Version 5.3 General Programming Concepts: Writing and
Debugging Programs.
Under the _LARGE_FILES compilation environment, AIO applications written to the default interfaces
interpret the following redefinitions:
For information on using the _LARGE_FILES environment, see Porting Applications to the Large File
Environment in AIX 5L Version 5.3 General Programming Concepts: Writing and Debugging Programs.
In the _LARGE_FILE_API environment, the 64-bit application programming interfaces (APIs) are visible.
This environment requires recoding of applications to the new 64-bit API name. For further information on
using the _LARGE_FILE_API environment, see Using the 64-Bit File System Subroutines in AIX 5L
Version 5.3 General Programming Concepts: Writing and Debugging Programs.
Nonblocking I/O
After issuing an I/O request, the user application can proceed without being blocked while the I/O
operation is in progress. The I/O operation occurs while the application is running. Specifically, when the
application issues an I/O request, the request is queued. The application can then resume running before
the I/O operation starts.
To manage AIO, each AIO request has a corresponding control block in the application’s address space.
This control block contains the control and status information for the request. It can be used again when
the I/O operation finishes.
Subroutine Purpose
aio_cancel or aio_cancel64 Cancels one or more outstanding AIO requests.
aio_error or aio_error64 Retrieves the error status of an AIO request.
aio_fsync Synchronizes asynchronous files.
lio_listio or lio_listio64 Initiates a list of AIO requests with a single call.
aio_nwait Suspends the calling process until n AIO requests are completed.
aio_read or aio_read64 Reads asynchronously from a file.
aio_return or aio_return64 Retrieves the return status of an AIO request.
aio_suspend or aio_suspend64 Suspends the calling process until one or more AIO requests finishes.
aio_write or aio_write64 Writes asynchronously to a file.
Note: Priority among the I/O requests is available only for POSIX AIO.
For files that support seek operations, seeking can be done as part of the asynchronous read or write
operations. The whence and offset fields are provided in the control block of the request to set the seek
parameters. The seek pointer is updated when the asynchronous read or write call returns.
If the application closes a file, or calls the _exit or exec subroutines while it has some outstanding I/O
requests, the requests are canceled. If they cannot be canceled, the application is blocked until the
requests finish. When a process calls the fork subroutine, its AIO is not inherited by the child process.
One fundamental limitation in AIO is page hiding. When an unbuffered (raw) AIO is issued, the page that
contains the user buffer is hidden during the actual I/O operation. This ensures cache consistency.
However, the application can access the memory locations that fall within the same page as the user
buffer. This can cause the application to block as a result of a page fault. To alleviate this, allocate
page-aligned buffers and do not touch a buffer until the I/O request that uses it finishes.
Attention: Raising the server PRIORITY (by decreasing this numeric value) is not recommended
because system hangs or crashes can occur if the priority of the AIO servers is favored too much.
There is little to be gained by making big priority changes.
PUSER and PRI_SCHED are defined in the /usr/include/sys/pri.h file.
64-bit Enhancements
AIO has been enhanced to support 64-bit enabled applications. On 64-bit platforms, both 32-bit and 64-bit
AIO can occur simultaneously.
The aiocb structure, the fundamental data structure associated with all AIO operations, has changed. The
element of this struct, aio_return, is now defined as ssize_t. Previously, it was defined as an int. AIO
supports large files by default. An application compiled in 64-bit mode can do AIO to a large file without
any additional #define specifications or special opening of those files.
Extended AIOCB
LEGACY AIO supports functionality that is not available in POSIX AIO. To extend the LEGACY AIOCB, the
aio_reqprio and aio_fp fields are deprecated, and the following new fields are introduced:
Field Version
aio_version All versions.
aio_priority AIOCBX_VERS1
aio_cache_hint AIOCBX_VERS1
aio_iocpfd AIOCBX_VERS2
A new flag, AIO_EXTENDED, has also been added to the aio_flags field. If the AIO_EXTENDED flag is
not set, LEGACY AIO completely ignores any new extended fields. If the AIO_EXTENDED flag is set
within aio_flags, and the aio_version field contains a value greater than 0 and less than or equal to
AIOCBX_VERSION, all extended fields with a version indicated in the preceding table that are less than or
equal to the version number specified in the AIOCB are in force. Future extensions to the legacy AIOCB
structure will use new version values and introduce new extended fields beyond what is currently defined
within the AIOCB structure.
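The versioning rule can be restated as a small predicate. This is a sketch of the rule as described, not the LEGACY AIO implementation: the SK_ constants are placeholders whose numeric values are assumptions (the real definitions come from the system header), and the text does not say what happens when aio_version is out of range, so the sketch simply treats the extended fields as not in force in that case.

```c
/* Placeholder values for illustration only; the real constants are
 * defined by the system header and may differ. */
#define SK_AIO_EXTENDED    0x1
#define SK_AIOCBX_VERS1    1
#define SK_AIOCBX_VERS2    2
#define SK_AIOCBX_VERSION  SK_AIOCBX_VERS2

/* Returns 1 if an extended field introduced at field_version is honored
 * for a control block with the given aio_flags and aio_version. */
int extended_field_in_force(unsigned aio_flags, int aio_version,
                            int field_version)
{
    if (!(aio_flags & SK_AIO_EXTENDED))
        return 0;                   /* flag not set: fields are ignored */
    if (aio_version <= 0 || aio_version > SK_AIOCBX_VERSION)
        return 0;                   /* assumption: out of range means off */
    return field_version <= aio_version;
}
```

Under this reading, a control block that sets the flag and a version of AIOCBX_VERS1 gets aio_priority and aio_cache_hint honored, but not aio_iocpfd.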
Except for the aio_version field, all extended fields are required to ignore a value of 0 (zero). A user of
any extended field must ensure that all other unused extended fields are initialized to zero. Use either the
bzero or memset function on the entire AIOCB structure prior to setting any field in the structure.
Field Version
aio_priority AIOCBX_VERS1
aio_cache_hint AIOCBX_VERS1
The aio_priority and aio_cache_hint values take effect only on a 64-bit kernel under the following
conditions:
- The file descriptor being operated on by AIO belongs to the raw character interface of an LVM logical
  volume.
- The LVM logical volume resides on a device that supports I/O priorities and cache hints.
The aio_read, aio_read64, aio_write, aio_write64, lio_listio, and lio_listio64 interfaces are all
compatible with an extended AIOCB. Other interfaces (such as aio_cancel) ignore the extended fields.
The valid values for aio_priority and aio_cache_hint are described in the sys/extendio.h file. The
aio_priority must be either IOPRIORITY_UNSET (0) or a value from 1 to 15. Lower I/O priority values are
considered to be more important than higher values. For example, a value of 1 is considered highest
priority and a value of 15 is considered lowest priority. The aio_cache_hint must be either
CH_AGE_OUT_FAST or CH_PAGE_WRITE. These cache hint values are mutually exclusive. If CH_AGE_OUT_FAST is
set, the I/O buffer can be aged out quickly from the storage device buffer cache. This is useful in situations
where the application is already caching the I/O buffer and redundant caching within the storage layer can
be avoided. If CH_PAGE_WRITE is set, the I/O buffer is written only to the storage device cache and not to
the disk.
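The constraints on these two fields can be expressed as simple checks. The sketch below uses placeholder constants: only the value 0 for IOPRIORITY_UNSET comes from the text, while the bit values chosen for the two cache hints are assumptions, and the real definitions should be taken from the sys/extendio.h file. A cache hint of 0 is accepted as "not set", in line with the rule that extended fields ignore a value of zero.

```c
#define SK_IOPRIORITY_UNSET  0     /* value given in the text */
#define SK_CH_AGE_OUT_FAST   0x1   /* placeholder value */
#define SK_CH_PAGE_WRITE     0x2   /* placeholder value */

/* A priority is valid if it is unset (0) or in the range 1 to 15. */
int sk_priority_is_valid(int prio)
{
    return prio == SK_IOPRIORITY_UNSET || (prio >= 1 && prio <= 15);
}

/* A cache hint is valid if it is unset or exactly one of the two hints;
 * the hints are mutually exclusive. */
int sk_cache_hint_is_valid(unsigned hint)
{
    return hint == 0 || hint == SK_CH_AGE_OUT_FAST ||
           hint == SK_CH_PAGE_WRITE;
}

/* Returns 1 if priority a is more important than priority b.
 * Lower values are more important; both must be set (1 to 15). */
int sk_more_important(int a, int b)
{
    if (a < 1 || a > 15 || b < 1 || b > 15)
        return 0;
    return a < b;
}
```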
Field Version
aio_iocpfd AIOCBX_VERS2
A limitation of the AIO interface that is used in a threaded environment is that aio_nwait() collects
completed I/O requests for ALL threads in the same process. In other words, one thread collects
completed I/O requests that are submitted by another thread. Another problem is that multiple threads
cannot invoke the collection routines (such as aio_nwait()) at the same time. If one thread issues
aio_nwait() while another thread is calling it, the second aio_nwait() returns EBUSY. This limitation can
affect I/O performance when many I/Os must run at the same time and a single thread cannot run fast
enough to collect all the completed I/Os.
Using I/O completion ports with AIO requests provides the capability for an application to capture results of
various AIO operations on a per-thread basis in a multithreaded environment. This functionality provides
threads with a method of receiving completion status for only the AIO requests initiated by the thread.
The IOCP subsystem only provides completion status by generating completion packets for AIO requests.
The I/O cannot be submitted for regular files through IOCP.
The application must associate a file with a completion port using the CreateIoCompletionPort IOCP
routine. The file can be associated with multiple completion ports, and a completion port can have multiple
files associated with it. When making the association, the application must use an application-defined
CompletionKey to differentiate between AIO completion packets and socket completion packets. The
application can use different CompletionKeys to differentiate among individual files (or in any other
manner) as necessary.
The application must also associate AIO requests with the same completion port as the corresponding file.
It does this by initializing the aio_iocpfd of the AIOCB with the file descriptor of the completion port. An
AIOCB can be associated with only one completion port, but a completion port can have multiple AIOCBs
associated with it. The association between a completion port and an AIOCB must be done before the
request is made. This is accomplished using an AIO routine, such as aio_write, aio_read, or lio_listio. If
the value in the aio_iocpfd field is not a valid completion port file descriptor, the attempt to start the
request fails and no I/O is performed.
An association must be made directly between a completion port and an AIOCB. For example, if you want
to call lio_listio(), each AIOCB in the lio_listio chain must be associated individually prior to the call. It is
not necessary to have all AIOCBs in the chain associated with a completion port.
After an association is made, it remains until the application explicitly clears it by using a value of 0 for the
aio_iocpfd field, or the AIOCB is destroyed. A completion packet is created only when I/O completes for
an AIOCB that has been associated with a completion port.
A summary of the steps that an application takes to use I/O completion ports with AIO requests is as
follows:
1. Opens a regular file for I/O.
2. Calls the CreateIoCompletionPort routine to create an I/O completion port (IOCP), using the file
descriptor for the regular file and an application-defined CompletionKey, which is used to differentiate
AIO requests from socket I/O. The CreateIoCompletionPort function returns an IOCP file descriptor
that corresponds to the newly created IOCP.
3. Allocates and clears (using the bzero function) an AIO control block. Indicates that I/O completion
ports are to be used with AIO requests by setting the AIO_EXTENDED flag of the AIOCB's aio_flags
field. Also sets the aio_version field to a value of AIOCBX_VERS2 or higher.
4. Associates the AIO request with the IOCP by initializing the aio_iocpfd field in the AIOCB to contain
the IOCP file descriptor returned by the CreateIoCompletionPort routine.
5. Starts the AIO request using existing AIO interfaces. Multiple requests can be started using the
lio_listio interface.
6. Calls the GetQueuedCompletionStatus function with the IOCP file descriptor to collect the results of
the completed AIO requests on a particular IOCP. The application provides the address of a pointer in
the LPOVERLAPPED argument to GetQueuedCompletionStatus, so the corresponding AIOCB
pointer can be returned. Details of the AIO request can be determined by examining the returned
AIOCB.
7. After all I/O is complete, the application is responsible for closing all file descriptors.
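Steps 3 and 4 of this sequence can be sketched against a stand-in structure. The struct sk_aiocb type below is a hypothetical mirror that contains only the three fields named in the text; it is not the real AIOCB layout, and the SK_ constants are placeholder values. The calls to CreateIoCompletionPort and GetQueuedCompletionStatus are deliberately left out because their exact signatures are not given here.

```c
#include <stddef.h>
#include <string.h>

/* Placeholder values for illustration only. */
#define SK_AIO_EXTENDED  0x1
#define SK_AIOCBX_VERS2  2

/* Hypothetical stand-in for the extended AIOCB; the real structure has
 * many more fields and a different layout. */
struct sk_aiocb {
    unsigned aio_flags;
    int      aio_version;
    int      aio_iocpfd;
};

/* Prepare a control block for use with an I/O completion port.
 * iocp_fd is the descriptor returned when the port was created.
 * Returns 0 on success, -1 on bad arguments. */
int sk_prepare_for_iocp(struct sk_aiocb *cb, int iocp_fd)
{
    /* A value of 0 in aio_iocpfd means "no association", so it is
     * rejected here as a port descriptor. */
    if (cb == NULL || iocp_fd <= 0)
        return -1;

    memset(cb, 0, sizeof(*cb));         /* step 3: clear every field */
    cb->aio_flags  |= SK_AIO_EXTENDED;  /* step 3: enable extended fields */
    cb->aio_version = SK_AIOCBX_VERS2;  /* step 3: version with aio_iocpfd */
    cb->aio_iocpfd  = iocp_fd;          /* step 4: associate with the port */
    return 0;
}
```

Clearing the whole structure before setting any field keeps the unused extended fields at zero, which is what the rules for the extended AIOCB require.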
Related Information
Subroutine References
The aio_cancel or aio_cancel64 subroutine, aio_error or aio_error64 subroutine, aio_read or
aio_read64 subroutine, aio_return or aio_return64 subroutine, aio_suspend or aio_suspend64
Commands References
The chdev command in AIX 5L Version 5.3 Commands Reference, Volume 1.
System users cannot operate devices until device configuration occurs. To configure devices, the Device
Configuration Subsystem is available.
Also, in this operating system, hardware components such as buses, adapters, and enclosures (including
racks, drawers, and expansion boxes) are considered devices.
Each device is classified into functional classes, functional subclasses and device types (for example,
printer class, parallel subclass, 4201 Proprinter type). These classifications are maintained in the device
configuration databases with all other device information.
High-level Commands Maintain (add, delete, view, change) configured devices within the system.
These commands manage all of the configuration functions and are performed
by invoking the appropriate device methods for the device being configured.
These commands call device methods and low-level commands.
Data that is used by the three levels is maintained in the Configuration database. The database is
managed as object classes by the Object Data Manager (ODM). All information relevant to support the
device configuration process is stored in the configuration database.
The database has two components: the Predefined database and the Customized database. The
Predefined database contains configuration data for all devices that could possibly be supported by the
system. The Customized database contains configuration data for the devices actually defined or
configured in that particular system.
The Configuration manager (cfgmgr command) performs the configuration of a system’s devices
automatically when the system is booted. This high-level program can also be invoked through the system
keyboard to perform automatic device configuration. The configuration manager command configures
devices as specified by Configuration rules.
High-Level Perspective
From a high-level, user-oriented perspective, device configuration comprises the following basic tasks:
• Adding a device to the system
• Deleting a device from the system
• Changing the attributes of a device
• Showing information about a device
From a high-level, system-oriented perspective, device configuration provides the basic task of automatic
device configuration: running the configuration manager program.
A set of high-level commands accomplishes all of these tasks during run time: chdev, mkdev, lsattr,
lsconn, lsdev, lsparent, rmdev, and cfgmgr. High-level commands can invoke device methods and
low-level commands.
“Understanding Device States” on page 103 discusses possible device states and how the various
methods affect device state changes.
Low-Level Perspective
Beneath the device methods is a set of low-level library routines that can be directly called by device
methods as well as by high-level configuration programs.
Predefined database Contains information about all possible types of devices that can be defined for
the system.
Customized database Describes all devices currently defined for use in the system. Items are referred
to as device instances.
ODM Device Configuration Object Classes in AIX 5L Version 5.3 Technical Reference: Kernel and
Subsystems Volume 2 provides access to the object classes that make up the Predefined and Customized
databases.
Devices must be defined in the database for the system to make use of them. For a device to be in the
Defined state, the Configuration database must contain a complete description of it. This information
includes items such as the device driver name, the device major and minor numbers, the device method
names, the device attributes, connection information, and location information.
High-level device commands invoke methods and allow the user to add, delete, show, and change devices
and their associated attributes.
When a specific device is defined through its define method, the information from the Predefined database
for that type of device is used to create the information describing the specific device instance. This
specific device instance information is then stored in the Customized database. For more information on
define methods, see Writing a Define Method in AIX 5L Version 5.3 Technical Reference: Kernel and
Subsystems Volume 2.
The process of configuring a device is often highly device-specific. The configure method for a kernel
device must:
• Load the device’s driver into the kernel.
• Pass the device dependent structure (DDS) describing the device instance to the driver. For more
information on DDS, see “Device Dependent Structure (DDS) Overview” on page 107.
• Create a special file for the device in the /dev directory. For more information, see Special Files in AIX
5L Version 5.3 Files Reference.
Many devices do not have device drivers. For this type of device, the configured state is less
meaningful. However, the device still has a Configure method, which either simply marks the device as
configured or performs more complex operations to determine whether any devices are attached to it.
The configuration process requires that a device be defined or configured before a device attached to it
can be defined or configured. At system boot time, the Configuration Manager first configures the system
device. The remaining devices are configured by traversing down the parent-child connections layer by
layer. The Configuration Manager then configures any pseudo-devices that need to be configured.
Devices in the system are organized in clusters of tree structures known as nodes. Each tree is a logical
subsystem by itself. For example, the system node consists of all the physical devices in the system. The
top of the node is the system device. Below the bus and connected to it are the adapters. The bottom of
the hierarchy contains devices to which no other devices are connected. Most pseudo-devices, including
low-function terminal (LFT) and pseudo-terminal (pty) devices, are organized as separate tree structures
or nodes.
Devices Graph
See “Understanding Device Dependencies and Child Devices” on page 105 for more information.
Configuration Rules
Each rule in the Configuration Rules (Config_Rules) object class specifies a program name that the
Configuration Manager must execute. These programs are typically the configuration programs for the
devices at the top of the nodes. When these programs are invoked, the names of the next lower-level
devices that need to be configured are returned.
The Configuration Manager configures the next lower-level devices by invoking the configuration methods
for those devices. In turn, those configuration methods return a list of to-be-configured device names. The
process is repeated until no more device names are returned. As a result, all devices in the same node
are configured in transverse order. The following are different types of rules:
• Phase 1
• Phase 2
• Service
The system boot process is divided into two phases. In each phase, the Configuration Manager is invoked.
During phase 1, the Configuration Manager is called with a -f flag, which specifies that phase = 1 rules are
to be executed. This results in the configuration of base devices into the system, so that the root file
system can be used. During phase 2, the Configuration Manager is called with a -s flag, which specifies
that phase = 2 rules are to be executed. This results in the configuration of the rest of the devices into the
system.
For more information on the booting process, see Understanding System Boot Processing in Operating
system and device management.
If device names are returned from the program invoked, the Configuration Manager finishes traversing the
node tree before it invokes the next program. Note that some program names might not be associated
with any devices, but they must be included to configure the system.
In phase 2, the Configuration Manager configures the remaining devices using the configuration database
from the root file system. During this phase, different rules are used, depending on whether the system
was booted in normal mode or in service mode. If the system is booted in service mode, the rules for
service mode are used. Otherwise, the phase 2 rules are used.
The Configuration Manager can also be invoked during run time to configure all the detectable devices
that might have been turned off at system boot or added after the system boot. In this case, the
Configuration Manager uses the phase 2 rules.
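The choice of rule set described above can be summarized in a few lines of C. This is only an illustrative sketch: the enumeration names and the select_rule_set function are hypothetical and are not part of the Config_Rules object class or of the cfgmgr command.

```c
#include <stdbool.h>

/* Hypothetical names for this sketch; not the Config_Rules schema. */
enum rule_set    { RULES_PHASE1, RULES_PHASE2, RULES_SERVICE };
enum cfg_context { CTX_BOOT_PHASE1, CTX_BOOT_PHASE2, CTX_RUN_TIME };

/* Return the rule set the Configuration Manager uses in a given context.
 * service_mode only matters during boot phase 2. */
enum rule_set select_rule_set(enum cfg_context ctx, bool service_mode)
{
    switch (ctx) {
    case CTX_BOOT_PHASE1:
        return RULES_PHASE1;                 /* base devices, root file system */
    case CTX_BOOT_PHASE2:
        return service_mode ? RULES_SERVICE : RULES_PHASE2;
    case CTX_RUN_TIME:
    default:
        return RULES_PHASE2;                 /* run-time invocation */
    }
}
```

For example, select_rule_set(CTX_BOOT_PHASE2, true) yields RULES_SERVICE, while a run-time invocation always yields RULES_PHASE2.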
Devices are organized into a set of functional classes at the highest level. From a user’s point of view, all
devices belonging to the same class perform the same functions. For example, all printer devices basically
perform the same function of generating printed output.
However, devices within a class can have different interfaces. A class can therefore be partitioned into a
set of functional subclasses in which devices belonging to the same subclass have similar interfaces. For
example, serial printers and parallel printers form two subclasses of printer devices.
Finally, a device subclass is a collection of device types. All devices belonging to the same device type
share the same manufacturer’s model name and number. For example, 3812-2 (model 2 Pageprinter) and
4201 (Proprinter II) printers represent two types of printers.
Devices of the same device type can be managed by different drivers if the type belongs to more than one
subclass. For example, the 4201 printer belongs to both the serial interface and parallel interface
subclasses of the printer class, although there are different drivers for the two interfaces. However, a
device of a particular class, subclass, and type can be managed by only one device driver.
Invoking Methods
One device method can invoke another device method. For instance, a Configure method for a device
may need to invoke the Define method for child devices. The Change method can invoke the Unconfigure
and Configure methods. To ensure proper operation, a method that invokes another method must always
use the odm_run_method subroutine.
Example Methods
See the /usr/samples directory for example device method source code. These source code excerpts are
provided for example purposes only. The examples do not function as written.
The parameters that are passed into the methods as well as the exit codes returned must both satisfy the
requirements for each type of method. Additionally, some methods must write information to the stdout
and stderr files.
These interfaces are defined for each of the device methods in the individual articles on writing each
method.
To better understand how these interfaces work, one needs to understand, at least superficially, the flow of
operations through the Configuration Manager and the run-time configuration commands.
Configuration Manager
The Configuration Manager begins by invoking a Node Configuration program listed in one of the rules in
the Configuration Rules (Config_Rules) object class. A node is a group of devices organized into a tree
structure representing the various interconnections of the devices. The Node Configuration program is
responsible for starting the configuration process for a node. It does this by querying the Customized
database to see if the device at the top of the node is represented in the database. If so, the program
writes the logical name of the device to the stdout file and then returns to the Configuration Manager.
The Configuration Manager intercepts the Configure method’s stdout file to retrieve the names of the
children. It then invokes, one at a time, the Configure methods for each child device. Each of these
Configure methods operates as described for the parent device. For example, it might simply exit when
complete, or write to its stdout file a list of additional device names to be configured and then exit. The
Configuration Manager will continue to intercept the device names written to the stdout file and to invoke
the Configure methods for those devices until the Configure methods for all the devices have been run
and no more names are written to the stdout file.
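The loop described above can be modeled with a simple work queue. The following sketch is an illustration under stated assumptions, not the Configuration Manager's implementation: the sketch_device table stands in for the Customized database, and sketch_configure stands in for a Configure method whose stdout output is the list of child names. All identifiers are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory stand-in for a node: each entry names a device
 * and its parent (NULL for the device at the top of the node). */
struct sketch_device {
    const char *name;
    const char *parent;
};

/* Stand-in for a Configure method: places the names of the children of
 * `name` into out[] (a real method writes them to stdout). Returns the
 * number of names produced, never more than cap. */
static size_t sketch_configure(const struct sketch_device *devs, size_t ndevs,
                               const char *name, const char **out, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < ndevs && n < cap; i++) {
        if (devs[i].parent != NULL && strcmp(devs[i].parent, name) == 0)
            out[n++] = devs[i].name;
    }
    return n;
}

/* Start from the top device, run its Configure method, queue every name it
 * returns, and repeat until no more names are returned. order[] doubles as
 * the queue and as the record of the configuration order. Returns the
 * number of devices configured. */
size_t sketch_configure_node(const struct sketch_device *devs, size_t ndevs,
                             const char *top, const char **order, size_t cap)
{
    size_t head = 0, tail = 0;

    if (cap == 0)
        return 0;
    order[tail++] = top;
    while (head < tail) {
        const char *current = order[head++];
        tail += sketch_configure(devs, ndevs, current,
                                 order + tail, cap - tail);
    }
    return tail;
}
```

With a table containing sys0, bus0 (child of sys0), scsi0 and ent0 (children of bus0), and hdisk0 (child of scsi0), the sketch configures the devices layer by layer: sys0, bus0, scsi0, ent0, hdisk0.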
mkdev The mkdev command is invoked to define or configure, or define and configure, devices at run time. If
just defining a device, the mkdev command invokes the Define method for the device. The Define
method creates the customized device instance in the Customized Devices (CuDv) object class and
writes the name assigned to the device to the stdout file. The mkdev command intercepts the device
name written to the stdout file by the Define method to learn the name of the device. If user-specified
attributes are supplied with the -a flag, the mkdev command then invokes the Change method for the
device.
If defining and configuring a device, the mkdev command invokes the Define method, gets the name
written to the stdout file with the Define method, invokes the Change method for the device if
user-specified attributes were supplied, and finally invokes the device’s Configure method.
If only configuring a device, the device must already exist in the CuDv object class and its name must
be specified to the mkdev command. In this case, the mkdev command simply invokes the Configure
method for the device.
chdev The chdev command is used to change the characteristics, or attributes, of a device. The device must
already exist in the CuDv object class, and the name of the device must be supplied to the chdev
command. The chdev command simply invokes the Change method for the device.
rmdev The rmdev command can be used to undefine or unconfigure, or unconfigure and undefine, a device.
In all cases, the device must already exist in the CuDv object class and the name of the device must
be supplied to the rmdev command. The rmdev command then invokes the Undefine method, the
Unconfigure method, or the Unconfigure method followed by the Undefine method, depending on the
function requested by the user.
cfgmgr The cfgmgr command can be used to configure all detectable devices that did not get configured at
boot time. This might occur if the devices had been powered off at boot time. The cfgmgr command is
the Configuration Manager and operates in the same way at run time as it does at boot time. The boot
time operation is described in Device Configuration Manager Overview.
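The order in which the mkdev and rmdev commands invoke device methods, as described in the entries above, can be expressed as a small planning function. This is a sketch for illustration only; the function and enumeration names are hypothetical and do not correspond to any interface of the commands themselves.

```c
#include <stdbool.h>
#include <stddef.h>

/* Device methods named in the text. */
enum method { M_DEFINE, M_CHANGE, M_CONFIGURE, M_UNCONFIGURE, M_UNDEFINE };

/* Methods mkdev invokes, in order. `define` and `configure` select the
 * requested function; `has_attrs` is true when attributes were supplied
 * with the -a flag. Writes into seq[] (room for 3) and returns the count. */
size_t mkdev_sequence(bool define, bool configure, bool has_attrs,
                      enum method seq[3])
{
    size_t n = 0;

    if (define) {
        seq[n++] = M_DEFINE;
        if (has_attrs)
            seq[n++] = M_CHANGE;    /* Change runs only after a Define */
    }
    if (configure)
        seq[n++] = M_CONFIGURE;
    return n;
}

/* Methods rmdev invokes, in order: Unconfigure precedes Undefine. */
size_t rmdev_sequence(bool unconfigure, bool undefine, enum method seq[2])
{
    size_t n = 0;

    if (unconfigure)
        seq[n++] = M_UNCONFIGURE;
    if (undefine)
        seq[n++] = M_UNDEFINE;
    return n;
}
```

Defining and configuring a device with user-specified attributes therefore produces the sequence Define, Change, Configure, while configuring an already defined device produces Configure alone.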
Defined Represented in the Customized database, but neither configured nor available for use in the
system.
Available Configured and available for use.
Undefined Not represented in the Customized database.
The Define method is responsible for creating a device instance in the Customized database and setting
the state to Defined. The Configure method performs all operations necessary to make the device usable
and then sets the state to Available.
The Change method usually does not change the state of the device. If the device is in the Defined state,
the Change method applies all changes to the database and leaves the device defined. If the device is in
the Available state, the Change method attempts to apply the changes to both the database and the actual
device, while leaving the device available. However, if an error occurs when applying the changes to the
actual device, the Change method might need to unconfigure the device, thus changing the state to
Defined.
Any Unconfigure method you write must perform the operations necessary to make a device unusable.
Basically, this method undoes the operations performed by the Configure method and sets the device state
to Defined. Finally, the Undefine method actually deletes all information for a device instance from the
Customized database, thus reverting the instance to the Undefined state.
The Stopped state is an optional state that some devices require. A device that supports this state needs
Start and Stop methods. The Stop method changes the state from Available to Stopped. The Start method
changes it from Stopped back to Available.
Note: Start and stop methods are only supported on the inet0 device.
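The state changes described above form a small state machine, sketched here in C. The enumeration and function names are hypothetical. The sketch models only the transitions stated in the text and rejects everything else; the Change method is omitted because it usually leaves the state unchanged, and the Stop and Start transitions apply only to devices that support the Stopped state.

```c
/* Device states and methods from the text; the identifiers are ours. */
enum dev_state  { ST_UNDEFINED, ST_DEFINED, ST_AVAILABLE, ST_STOPPED };
enum dev_method { DM_DEFINE, DM_CONFIGURE, DM_UNCONFIGURE, DM_UNDEFINE,
                  DM_STOP, DM_START };

/* Apply a method to a device state. Returns 0 and updates *state on a
 * valid transition, or -1 (leaving *state unchanged) when the method does
 * not apply in the current state. */
int apply_method(enum dev_state *state, enum dev_method method)
{
    enum dev_state from, to;

    switch (method) {
    case DM_DEFINE:      from = ST_UNDEFINED; to = ST_DEFINED;   break;
    case DM_CONFIGURE:   from = ST_DEFINED;   to = ST_AVAILABLE; break;
    case DM_UNCONFIGURE: from = ST_AVAILABLE; to = ST_DEFINED;   break;
    case DM_UNDEFINE:    from = ST_DEFINED;   to = ST_UNDEFINED; break;
    case DM_STOP:        from = ST_AVAILABLE; to = ST_STOPPED;   break;
    case DM_START:       from = ST_STOPPED;   to = ST_AVAILABLE; break;
    default:             return -1;
    }
    if (*state != from)
        return -1;
    *state = to;
    return 0;
}
```

A device that starts in ST_UNDEFINED cannot be configured directly: apply_method returns -1 for DM_CONFIGURE until DM_DEFINE has moved the device to ST_DEFINED.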
To add a currently unsupported device to your system, you might need to:
• Modify the Predefined database
• Add appropriate device methods
• Add a device driver
• Use installp procedures
To describe the device, you must add one object to the PdDv object class to indicate the class, subclass,
and device type. You must also add one object to the PdAt object class for each device attribute, such as
interrupt level or block size. Finally, if the device is an intermediate device, you must add objects to the
PdCn object class: one object for each different connection location on the intermediate device.
You can use the odmadd Object Data Manager (ODM) command from the command line or in a shell
script to populate the necessary Predefined object classes from stanza files.
For example, if you have a serial printer that closely resembles a printer supported by the system, and the
system’s device driver for serial printers works on your printer, you can add the device as a printer
of type osp (other serial printer). If such a generic device type supports your device, you do not need to
provide additional system software.
When adding a device that closely resembles devices already supported, you might be able to use one of
the methods of the already supported device. For example, if you are adding a new type of SCSI disk
whose interfaces are identical to supported SCSI disks, the existing methods for SCSI disks may work. If
so, all you need to do is populate the Predefined database with information describing the new SCSI disk,
which will be similar to information describing a supported SCSI disk.
If you need instructions on how to write a device method, see Writing a Device Method.
The second method represents a logical connection. A device method can add an object identifying both a
dependent device and the device upon which it depends to the Customized Dependency (CuDep) object
class. The dependent device is considered to have a dependency, and the depended-upon device is
These two types of dependencies differ significantly. The configuration process uses parent-child
dependencies at boot time to configure all devices that make up a node. The CuDep dependency is
usually only used by a device’s Configure method to record the names of the devices on which it depends.
For device methods, the parent-child relationship is the more important. Parent-child relationships affect
device-method activities in these ways:
• A parent device cannot be unconfigured if it has a configured child.
• A parent device cannot be undefined if it has a defined or configured child.
• A child device cannot be defined if the parent is not defined or configured.
• A child device cannot be configured if the parent is not configured.
• A parent device’s configuration cannot be changed if it has a configured child. This guarantees that the
information about the parent that the child’s device driver might be using remains valid.
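The five rules above translate directly into predicates that a device method could evaluate before acting. The following is a minimal sketch using a simplified three-state model; the pc_ identifiers are hypothetical, and a real method would obtain the parent and child states from the CuDv object class.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified states for this sketch. */
enum pc_state { PC_UNDEFINED, PC_DEFINED, PC_CONFIGURED };

/* True if any child is in state s. */
static bool pc_any_child(const enum pc_state *children, size_t n,
                         enum pc_state s)
{
    for (size_t i = 0; i < n; i++) {
        if (children[i] == s)
            return true;
    }
    return false;
}

/* A parent cannot be unconfigured if it has a configured child. */
bool pc_may_unconfigure_parent(const enum pc_state *children, size_t n)
{
    return !pc_any_child(children, n, PC_CONFIGURED);
}

/* A parent cannot be undefined if it has a defined or configured child. */
bool pc_may_undefine_parent(const enum pc_state *children, size_t n)
{
    return !pc_any_child(children, n, PC_DEFINED) &&
           !pc_any_child(children, n, PC_CONFIGURED);
}

/* A child cannot be defined unless the parent is defined or configured. */
bool pc_may_define_child(enum pc_state parent)
{
    return parent != PC_UNDEFINED;
}

/* A child cannot be configured unless the parent is configured. */
bool pc_may_configure_child(enum pc_state parent)
{
    return parent == PC_CONFIGURED;
}

/* A parent's configuration cannot be changed if it has a configured child. */
bool pc_may_change_parent(const enum pc_state *children, size_t n)
{
    return !pc_any_child(children, n, PC_CONFIGURED);
}
```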
However, when a device is listed as a dependency of another device in the CuDep object class, the only
effect is to prevent the depended-upon device from being undefined. The name of the dependency is
important to the dependent device. If the depended-upon device were allowed to be undefined, a third
device could be defined and assigned the same name.
Writers of Unconfigure and Change methods for a depended-upon device need not be concerned with whether
the device is listed as a dependency. If the depended-upon device is actually open by the other device,
the Unconfigure and Change operations will fail because the device is busy. But if the depended-upon
device is not currently open, the Unconfigure or Change operations can be performed without affecting the
dependent device.
The possible parent-child connections are defined in the Predefined Connection (PdCn) object class. Each
predefined device type that can be a parent device is represented in this object class. There is an object
for each connection location (such as slots or ports) describing the subclass of devices that can be
connected at that location. The subclass is used to identify each device because it indicates the devices’
connection type (for example, SCSI or rs232).
There is no corresponding predefined object class describing the possible CuDep dependencies. A device
method can be written so that it already knows what the dependencies are. If predefined data is required,
it can be added as predefined attributes for the dependent device in the Predefined Attribute (PdAt) object
class.
When a customized device instance is created by a Define method, its attributes assume the default
values. As a result, no objects are added to the CuAt object class for the device. If an attribute for the
device is changed from the default value by the Change method, an object to describe the attribute’s
current value is added to the CuAt object class for the attribute. If the attribute is subsequently changed
back to the default value, the Change method deletes the CuAt object for the attribute.
Any device methods that need the current attribute values for a device must access both the PdAt and
CuAt object classes. If an attribute appears in the CuAt object class, then the associated object identifies
the current value. Otherwise, the default value from the PdAt attribute object identifies the current value.
Any method you write must be able to handle the following four scenarios:
• If the new value differs from the default value and no object currently exists in the CuAt object class,
the method must add an object to the CuAt object class to identify the new value.
• If the new value differs from the default value and an object already exists in the CuAt object class, the
method must update the CuAt object with the new value.
• If the new value is the same as the default value and an object exists in the CuAt object class, the
method must delete the CuAt object for the attribute.
• If the new value is the same as the default value and no object exists in the CuAt object class, the
method does not need to do anything.
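The four scenarios can be modeled without the ODM by treating the customized value as an optional override of the default. The sketch below is a simplified illustration: the sk_custom_attr structure is a hypothetical stand-in for a CuAt object, and the default value passed in stands in for the PdAt object. It is not the ODM interface.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SK_VALUE_LEN 32

/* Hypothetical stand-in for one attribute's CuAt object. */
struct sk_custom_attr {
    bool present;                 /* true if a customized object exists */
    char value[SK_VALUE_LEN];     /* customized value when present */
};

enum sk_action { SK_ADDED, SK_UPDATED, SK_DELETED, SK_NOTHING };

/* Apply a new value and report which of the four scenarios occurred. */
enum sk_action sk_set_attr(const char *default_value,
                           struct sk_custom_attr *cu, const char *new_value)
{
    bool is_default = strcmp(new_value, default_value) == 0;

    if (!is_default) {
        /* Scenarios 1 and 2: add the object, or update the existing one. */
        enum sk_action action = cu->present ? SK_UPDATED : SK_ADDED;
        snprintf(cu->value, sizeof cu->value, "%s", new_value);
        cu->present = true;
        return action;
    }
    if (cu->present) {
        /* Scenario 3: back to the default, so delete the object. */
        cu->present = false;
        cu->value[0] = '\0';
        return SK_DELETED;
    }
    return SK_NOTHING;            /* Scenario 4: nothing to do. */
}

/* Current value: the customized value if present, else the default. */
const char *sk_current_value(const char *default_value,
                             const struct sk_custom_attr *cu)
{
    return cu->present ? cu->value : default_value;
}
```

With a default of "512", setting "1024" adds an object, setting "2048" updates it, and setting "512" again deletes it, after which sk_current_value returns the default.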
Your methods can use the getattr and putattr subroutines to get and modify attributes. The getattr
subroutine checks both the PdAt and CuAt object classes before returning an attribute to you. It always
returns the information in the form of a CuAt object even if returning the default value from the PdAt object
class.
A device’s DDS is built each time the device is configured. The Configure method can fill in the DDS with
fixed values, computed values, and information from the Configuration database. Most of the information
from the Configuration database usually comes from the attributes for the device in the Customized
Attribute (CuAt) object class, but can come from any of the object classes. Information from the database
for the device’s parent device or parent’s parent device can also be included. The DDS is passed to the
device driver with the SYS_CFGDD flag of the sysconfig subroutine, which calls the device driver’s
ddconfig subroutine with the CFG_INIT command.
Many Change methods simply invoke the device’s Unconfigure method, apply changes to the database,
and then invoke the device’s Configure method. This process satisfies the two stipulated conditions, because
the Unconfigure method, and thus the change, fails if the device has Available or Stopped children.
Also, if the device has a device driver, its Unconfigure method terminates the device instance, and its
Configure method rebuilds the DDS and passes it to the driver.
When building a DDS for a device connected to an adapter card, you will typically need the following
adapter information:
slot number Obtained from the connwhere descriptor of the adapter’s Customized Device (CuDv)
object.
bus resources Obtained from attributes for the adapter in the Customized Attribute (CuAt) or Predefined
Attribute (PdAt) object classes. These include attributes for bus interrupt levels, interrupt
priorities, bus memory addresses, bus I/O addresses, and DMA arbitration levels.
The following attribute must be obtained for the adapter’s parent bus device:
bus_id Identifies the I/O bus. This field is needed by the device driver to access the I/O bus.
Note: The getattr device configuration subroutine should be used whenever attributes are obtained from
the Configuration database. This subroutine returns the Customized attribute value if the attribute is
represented in the Customized Attribute object class. Otherwise, it returns the default value from the
Predefined Attribute object class.
Finally, a DDS generally includes the device’s logical name. This is used by the device driver to identify
the device when logging an error for the device.
Example of DDS
The following example provides a guide for using DDS format.
/* Device DDS */
struct device_dds {
    /* Bus information */
    ulong  bus_id;             /* I/O bus id */
    ushort us_type;            /* Bus type, i.e. BUS_MICRO_CHANNEL */
    /* Adapter information */
    int    slot_num;           /* Slot number */
    ulong  io_addr_base;       /* Base bus i/o address */
    int    bus_intr_lvl;       /* bus interrupt level */
    int    intr_priority;      /* System interrupt priority */
    int    dma_lvl;            /* DMA arbitration level */
    /* Device specific information */
    int    block_size;         /* Size of block in bytes */
    int    abc_attr;           /* The abc attribute */
    int    xyz_attr;           /* The xyz attribute */
    char   resource_name[16];  /* Device logical name */
};
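As a rough illustration of how a Configure method might fill in such a structure, the following sketch builds a reduced DDS from attribute values. Everything here is an assumption made for the example: the sketch_dds structure is a subset of the fields above, the lookup callback stands in for the getattr subroutine, the attribute names other than bus_id and the device names abc0, xyz0, and bus0 are invented, and the sample values are made up.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Reduced DDS for this sketch: a subset of the fields in the example. */
struct sketch_dds {
    unsigned long bus_id;        /* from the adapter's parent bus */
    int  slot_num;               /* from the adapter's connwhere descriptor */
    int  bus_intr_lvl;           /* from the adapter's attributes */
    int  block_size;             /* from the device's own attributes */
    char resource_name[16];      /* device logical name */
};

/* Hypothetical attribute source: returns the current value of `attr` for
 * `device` as text, or NULL if unknown. Stands in for getattr. */
typedef const char *(*sketch_lookup_fn)(const char *device, const char *attr);

/* Made-up sample data standing in for Configuration database content. */
const char *sketch_sample_lookup(const char *device, const char *attr)
{
    static const char *const table[][3] = {
        { "bus0", "bus_id",       "0x820c0020" },
        { "abc0", "bus_intr_lvl", "5" },
        { "xyz0", "block_size",   "512" },
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(table[i][0], device) == 0 &&
            strcmp(table[i][1], attr) == 0)
            return table[i][2];
    }
    return NULL;
}

/* Parse an attribute as an unsigned number; 0 on success, -1 on failure. */
static int sketch_get_number(sketch_lookup_fn lookup, const char *device,
                             const char *attr, unsigned long *out)
{
    const char *text = lookup(device, attr);
    char *end = NULL;

    if (text == NULL || *text == '\0')
        return -1;
    *out = strtoul(text, &end, 0);      /* base 0: decimal or 0x... hex */
    return (*end == '\0') ? 0 : -1;
}

/* Fill *dds for `device`, attached to `adapter`, whose parent is `bus`.
 * Returns 0 on success; on failure returns -1 and leaves *dds untouched. */
int sketch_build_dds(struct sketch_dds *dds, const char *device,
                     const char *adapter, const char *bus,
                     const char *adapter_connwhere, sketch_lookup_fn lookup)
{
    unsigned long bus_id, intr_lvl, block_size;

    if (sketch_get_number(lookup, bus, "bus_id", &bus_id) != 0 ||
        sketch_get_number(lookup, adapter, "bus_intr_lvl", &intr_lvl) != 0 ||
        sketch_get_number(lookup, device, "block_size", &block_size) != 0)
        return -1;

    memset(dds, 0, sizeof *dds);
    dds->bus_id = bus_id;
    dds->slot_num = atoi(adapter_connwhere);  /* assumes a numeric descriptor */
    dds->bus_intr_lvl = (int)intr_lvl;
    dds->block_size = (int)block_size;
    snprintf(dds->resource_name, sizeof dds->resource_name, "%s", device);
    return 0;
}
```

A call such as sketch_build_dds(&dds, "xyz0", "abc0", "bus0", "3", sketch_sample_lookup) fills the structure from the sample data. On AIX, the completed DDS would then be passed to the driver through the sysconfig subroutine, which this sketch does not attempt.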
bootlist Alters the list of boot devices seen by ROS when the machine boots.
lscfg Displays diagnostic information about a device.
restbase Reads the base customized information from the boot image and restores it into the Device
Configuration database used during system boot phase 1.
savebase Saves information about base customized devices in the Device Configuration Database onto the
boot device.
Machine Device Driver, Loading a Device Driver in AIX 5L Version 5.3 Technical Reference: Kernel and
Subsystems Volume 2.
Writing a Define Method, Writing a Configure Method, Writing a Change Method, Writing an Unconfigure
Method, Writing an Undefine Method, Writing Optional Start and Stop Methods, How Device Methods
Return Errors, Device Methods for Adapter Cards: Guidelines in AIX 5L Version 5.3 Technical Reference:
Kernel and Subsystems Volume 2
Configuration Rules (Config_Rules) Object Class, Customized Dependency (CuDep) Object Class,
Customized Devices (CuDv) Object Class, Predefined Attribute (PdAt) Object Class, Predefined
Connection (PdCn) Object Class, Adapter-Specific Considerations For the Predefined Devices (PdDv)
Object Class, Adapter-Specific Considerations For the Predefined Attributes (PdAt) Object Class,
Predefined Devices Object Class, ODM Device Configuration Object Classes in AIX 5L Version 5.3
Technical Reference: Kernel and Subsystems Volume 2.
Subroutine References
The getattr subroutine, ioctl subroutine, odm_run_method subroutine, putattr subroutine in AIX 5L
Version 5.3 Technical Reference: Base Operating System and Extensions Volume 1.
The sysconfig subroutine in AIX 5L Version 5.3 Technical Reference: Base Operating System and
Extensions Volume 2.
Commands References
The cfgmgr command, chdev command in AIX 5L Version 5.3 Commands Reference, Volume 1.
Technical References
The SYS_CFGDD sysconfig operation in AIX 5L Version 5.3 Technical Reference: Base Operating
System and Extensions Volume 1.
The ddconfig device driver entry point in AIX 5L Version 5.3 Technical Reference: Kernel and
Subsystems Volume 1.
The Communication I/O Subsystem consists of one or more physical device handlers (PDHs) that control
various communication adapters. The interface to the physical device handlers can support any number of
processes, the limit being device-dependent.
Note: A PDH, as used for the Communications I/O, provides both the device head role for interfacing
to users, and the device handler role for performing I/O to the device.
A communications PDH is a special type of multiplexed character device driver. Information common to all
communications device handlers is discussed here. Additionally, individual communications PDHs have
their own adapter-specific sets of information. Refer to the following to learn more about the adapter types:
• Serial Optical Link Device Handler Overview
Each adapter type requires a device driver. Each PDH can support one or more adapters of the same
type.
There are two interfaces a user can use to access a PDH. One is from a user-mode process (application
space), and the other is from a kernel-mode process (within the kernel).
ddconfig Performs configuration functions for a device handler. Supported the same way as the common
ddconfig entry point.
ddmpx Allocates or deallocates a channel for a multiplexed device handler. Supported the same way as the
common ddmpx device handler entry point.
ddopen Performs data structure allocation and initialization for a communications PDH. Supported the same
way as the common ddopen entry point. Time-consuming tasks, such as port initialization and
connection establishment, are deferred until the (CIO_START) ddioctl call is issued. A PDH can
support multiple users of a single port.
ddclose Frees up system resources used by the specified communications device until they are needed
again. Supported the same way as the common ddclose entry point.
ddwrite Queues a message for transmission or blocks until the message can be queued. The ddwrite entry
point can attempt to queue a transmit request (nonblocking) or wait for it to be queued (blocking),
depending on the setting of the DNDELAY flag. The caller has the additional option of requesting an
asynchronous acknowledgment when the transmission actually completes.
ddread Returns a message of data to a user-mode process. Supports blocking or nonblocking reads
depending on the setting of the DNDELAY flag. A blocking read request does not return to the caller
until data is available. A nonblocking read returns with a message of data if it is immediately
available. Otherwise, it returns a length of 0 (zero).
ddselect Checks to see if a specified event or events have occurred on the device for a user-mode process.
Supported the same way as the common ddselect entry point.
ddioctl Performs the special I/O operations requested in an ioctl subroutine. Supported the same way as the
common ddioctl entry point. In addition, a communications PDH must support the following four
options:
• CIO_START
• CIO_HALT
• CIO_QUERY
• CIO_GET_STAT
Individual PDHs can add additional commands. Hardware initialization and other time-consuming activities,
such as call establishment, are performed during the CIO_START operation.
PDHs and kernel-mode processes require a set of utilities for obtaining and returning mbuf structures from
a buffer pool.
User-mode processes receive a status block whenever they request a CIO_GET_STAT operation. A
user-mode process can wait for the next available status block by issuing a ddselect entry point with the
specified POLLPRI event.
A kernel-mode process receives a status block through the stat_fn procedure. This procedure is specified
when the device is opened with the ddopen entry point.
Status blocks contain a code field and possible options. The code field indicates the type of status block
code (for example, CIO_START_DONE). A status block’s options depend on the block code. The C
structure of a status block is defined in the /usr/include/sys/comio.h file.
CIO_START_DONE
This block is provided by the device handler when the CIO_START operation completes:
option[0] The CIO_OK or CIO_HARD_FAIL status/exception code from the common or device-dependent
list.
option[1] The low-order two bytes are filled in with the netid field. This field is passed when the CIO_START
operation is invoked.
option[2] Device-dependent.
option[3] Device-dependent.
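A caller that receives a CIO_START_DONE block typically checks option[0] and extracts the netid from option[1]. The sketch below shows one way to do this. The structure layout and the constant values are placeholders assumed for illustration; on AIX, use the definitions in the /usr/include/sys/comio.h file instead.

```c
/* Hypothetical stand-in for the status block; the real definition is in
 * /usr/include/sys/comio.h and should be used on AIX. */
struct sketch_status_block {
    unsigned long code;          /* type of status block */
    unsigned long option[4];     /* meaning depends on the block code */
};

/* Placeholder values for this sketch only; not the comio.h values. */
#define SK_CIO_START_DONE 1UL
#define SK_CIO_OK         0UL

/* Extract the netid carried in the low-order two bytes of option[1]. */
unsigned short sketch_netid(const struct sketch_status_block *sb)
{
    return (unsigned short)(sb->option[1] & 0xFFFFUL);
}

/* Return 1 if the block reports a successful CIO_START, otherwise 0. */
int sketch_start_succeeded(const struct sketch_status_block *sb)
{
    return sb->code == SK_CIO_START_DONE && sb->option[0] == SK_CIO_OK;
}
```

Masking with 0xFFFF discards the upper bytes of option[1], so a value of 0xABCD1234 yields a netid of 0x1234.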
CIO_HALT_DONE
This block is provided by the device handler when the CIO_HALT operation completes:
option[0] The CIO_OK status/exception code from the common or device-dependent list.
option[1] The low-order two bytes are filled in with the netid field. This field is passed when the CIO_START
operation is invoked.
option[2] Device-dependent.
option[3] Device-dependent.
CIO_TX_DONE
The following block is provided when the physical device handler (PDH) is finished with a transmit request
for which acknowledgment was requested:
option[0] The CIO_OK or CIO_TIMEOUT status/exception code from the common or device-dependent list.
option[1] The write_id field specified in the write_extension structure passed in the ext parameter to the
ddwrite entry point.
option[2] For a kernel-mode process, indicates the mbuf pointer for the transmitted frame.
option[3] Device-dependent.
CIO_NULL_BLK
This block is returned whenever a status block is requested but there are none available:
CIO_LOST_STATUS
This block is returned once after one or more status blocks are lost due to status queue overflow. The
CIO_LOST_STATUS block provides the following:
tsclose Resets the PCI MPQP device to a known state and returns system resources back to the
system on the last close for that adapter. The port no longer transmits or receives data.
tsconfig Provides functions for initializing and terminating the PCI MPQP device handler and
adapter.
tsioctl Provides the following functions for controlling the PCI MPQP device:
CIO_START
Initiates a session with the PCI MPQP device handler.
CIO_HALT
Ends a session with the PCI MPQP device handler.
CIO_QUERY
Reads the counter values accumulated by the PCI MPQP device handler.
CIO_GET_STAT
Gets the status of the current PCI MPQP adapter and device handler.
MP_CHG_PARMS
Permits the data link control (DLC) to change certain profile parameters after the
PCI MPQP device has been started.
tsopen Opens a channel on the PCI MPQP device for transmitting and receiving data.
tsmpx Provides allocation and deallocation of a channel.
tsread Provides the means for receiving data from the PCI MPQP device.
tsselect Provides the means for determining which specified events have occurred on the PCI
MPQP device.
tswrite Provides the means for transmitting data to the PCI MPQP device.
For control frames that only contain control characters, the message type is returned and no data is
transferred from the board. For example, if an ACK0 was received, the message type MP_ACK0 is returned
in the status field of the extension block. In addition, a NULL pointer for the receive buffer is returned. If
an error occurs, the error status is logged by the device driver. Unlogged buffer overrun errors are an
exception.
Note: In BSC communications, the caller receives either a message type or an error status.
If status and data information are available, but no extension block is provided, the read operation returns
the data, but not the status information.
Note: Errors, such as buffer overflow errors, can occur during the read data operation. In these cases, the
return value is the byte count. Therefore, status should be checked even if no errno global value is
returned.
The /dev/op0, /dev/op1, ..., /dev/opn special files provide a diagnostic interface to the serial link adapters
and the serial optical channel converters. Each special file corresponds to a single optical port that can
only be opened in Diagnostic mode. A diagnostic open allows the diagnostic ioctls to be used, but normal
reads and writes are not allowed. A port that is open in this manner cannot be opened with the /dev/ops0
special file. In addition, if the port has already been opened with the /dev/ops0 special file, attempting to
open a /dev/opx special file will fail unless a forced diagnostic open is used.
Entry Points
The SOL device handler interface consists of the following entry points:
sol_close Resets the device to a known state and frees system resources.
sol_config Provides functions to initialize and terminate the device handler, and query the vital product
data (VPD).
sol_fastwrt Provides the means for kernel-mode users to transmit data to the SOL device driver.
sol_ioctl Provides various functions for controlling the device. The valid sol_ioctl operations are:
CIO_GET_FASTWRT
Gets attributes needed for the sol_fastwrt entry point.
CIO_GET_STAT
Gets the device status.
CIO_HALT
Halts the device.
CIO_QUERY
Queries device statistics.
CIO_START
Starts the device.
IOCINFO
Provides I/O character information.
SOL_CHECK_PRID
Checks whether a processor ID is connected.
SOL_GET_PRIDS
Gets connected processor IDs.
sol_mpx Provides allocation and deallocation of a channel.
sol_open Initializes the device handler and allocates the required system resources.
sol_read Provides the means for receiving data.
sol_select Determines if a specified event has occurred on the device.
sol_write Provides the means for transmitting data.
Device Description
slc (serial link chip) There are two serial link adapters in each COMBO chip. The slc
device is automatically detected and configured by the system.
otp (optic two-port card) Also known as the serial optical channel converter (SOCC). There
is one SOCC possible for each slc. The otp device is
automatically detected and configured by the system.
op (optic port) There are two optic ports per otp. The op device is automatically
detected and configured by the system.
ops (optic port subsystem) This is a logical device. There is only one created at any time.
The ops device requires some additional configuration initially,
and is then automatically configured from that point on. The
/dev/ops0 special file is created when the ops device is
configured. The ops device cannot be configured when the
processor ID is set to -1.
Note: If your system uses serial optical link to make a direct, point-to-point connection to another system
or systems, special conditions apply. You must start interfaces on two systems at approximately the
same time, or a method error occurs. If you wish to connect to at least one machine on which the
interface has already been started, this is not necessary.
Processor ID This is the address by which other machines connected by means of the optical
link address this machine. The processor ID can be any value in the range of 1 to
254. To avoid a conflict on the network, this value is initially set to -1, which is not
valid, and the ops device cannot be configured.
Note: If you are using TCP/IP over the serial optical link, the processor ID must
be the same as the low-order octet of the IP address. It is not possible to
successfully configure TCP/IP if the processor ID does not match.
Receive Queue Size This is the maximum number of packets that is queued for a user-mode caller.
The default value is 30 packets. Any integer in the range from 30 to 150 is valid.
Status Queue Size This is the maximum number of status blocks that will be queued for a user-mode
caller. The default value is 10. Any integer in the range from 3 to 20 is valid.
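The attribute ranges above, and the TCP/IP rule that the processor ID must equal the low-order octet of the IP address, can be checked before configuration. The following is a sketch under stated assumptions: the demo_ops_config structure and all demo_ names are hypothetical and are not part of the SOL device driver interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the three ops attributes described above. */
struct demo_ops_config {
    int prid;          /* processor ID, valid range 1 to 254 */
    int rcv_queue;     /* receive queue size, valid range 30 to 150 */
    int status_queue;  /* status queue size, valid range 3 to 20 */
};

/* True when every attribute lies inside its documented range. */
bool demo_ops_config_valid(const struct demo_ops_config *cfg)
{
    assert(cfg != NULL);
    return cfg->prid >= 1 && cfg->prid <= 254 &&
           cfg->rcv_queue >= 30 && cfg->rcv_queue <= 150 &&
           cfg->status_queue >= 3 && cfg->status_queue <= 20;
}

/* True when the processor ID equals the low-order octet of the IPv4
 * address. The address is given as a 32-bit value in host byte order,
 * for example 192.168.1.7 is 0xC0A80107. */
bool demo_prid_matches_ip(const struct demo_ops_config *cfg, uint32_t ipv4)
{
    assert(cfg != NULL);
    return demo_ops_config_valid(cfg) &&
           (uint32_t)cfg->prid == (ipv4 & 0xFFu);
}
```

With the initial processor ID of -1, demo_ops_config_valid returns false, which mirrors the rule that the ops device cannot be configured until a valid ID is set.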
The standard SMIT interface is available for setting these attributes, listing the serial optical channel
converters, handling the initial configuration of the ops device, generating a trace report, generating an
error report, and configuring TCP/IP.
The ATM LANE device driver emulates the operation of Standard Ethernet, IEEE 802.3 Ethernet, and
IEEE 802.5 Token Ring LANs. It encapsulates each LAN packet and transfers its LAN data over an ATM
network at up to OC12 speeds (622 megabits per second). This data can also be bridged transparently to
a traditional LAN with ATM/LAN bridges such as the IBM® 2216.
There is always at least one ATM switch and a possibility of additional switches, bridges, or concentrators.
In support of Ethernet jumbo frames, LE Clients can be configured with maximum frame size values
greater than 1516 bytes. Supported forum values are: 1516, 4544, 9234, and 18190.
Incoming Add Party requests are supported for the Control Distribute and Multicast Forward Virtual Circuits
(VCs). This allows multiple LE clients to be open concurrently on the same ELAN without additional
hardware.
LANE and MPOA are both enabled for IPV4 TCP checksum offload. Transmit offload is automatically
enabled when it is supported by the adapter. Receive offload is configured by using the rx_checksum
attribute. The NDD_CHECKSUM_OFFLOAD device driver flag is set to indicate general offload capability
and also indicates that transmit offload is operational.
Transmit offload of IP-fragmented TCP packets is not supported. Transmit packets that MPOA needs to
fragment are offloaded in the MPOA software, instead of in the adapter. UDP offloading is also not
supported.
The ATM LANE device driver is a dynamically loadable device driver. Each LE Client or MPOA Client is
configurable by the operator, and the LANE driver is loaded into the system as part of that configuration
process. If an LE Client or MPOA Client has already been configured, the LANE driver is automatically
reloaded at reboot time as part of the system configuration process.
The interface to the ATM LANE device driver is through kernel services known as Network Services.
Interfacing to the ATM LANE device driver is achieved by calling the device driver’s entry points for
opening the device, closing the device, transmitting data, and issuing device control commands, just as
you would interface to any of the Common Data Link Interface (CDLI) LAN device drivers.
The ATM LANE device driver interfaces with all hardware-level ATM device drivers that support CDLI, ATM
Call Management, and ATM Signaling.
Entries are required for the Local LE Client’s LAN MAC Address field and possibly the LES ATM
Address or LECS ATM Address fields, depending on the support provided at the server. If the server
accepts the well-known ATM address for LECS, the value of the Automatic Configuration via LECS field
can be set to Yes, and the LES and LECS ATM Address fields can be left blank. If the server does not
support the well-known ATM address for LECS, an ATM address must be entered for either LES (manual
configuration) or LECS (automatic configuration). All other configuration attribute values are optional. If
used, you can accept the defaults for ease-of-use.
addl_drvr Specifies the CDLI demultiplexer being used by the LE Client. The value set by the
ATM LANE device driver is /usr/lib/methods/cfgdmxtok for Token Ring emulation
and /usr/lib/methods/cfgdmxeth for Ethernet. This is not an operator-configurable
attribute.
addl_stat Specifies the routine being used by the LE client to generate device-specific statistics
for the entstat and tokstat commands. The values set by the ATM LANE device
driver are:
- /usr/sbin/atmle_ent_stat
- /usr/sbin/atmle_tok_stat
The LANE device driver does an asynchronous open. It starts the process of attaching the device to the
network, sets the NDD_UP flag in the ndd_flags field, and returns 0. The network attachment continues in
the background where it is driven by network activity and system timers.
If the connection is successful, the NDD_RUNNING flag is set in the ndd_flags field, and an
NDD_CONNECTED status block is sent. The ns_alloc routine returns at this time.
If the device connection fails, the NDD_LIMBO flag is set in the ndd_flags field, and an
NDD_LIMBO_ENTRY status block is sent.
If the device is eventually connected, the NDD_LIMBO flag is disabled, and the NDD_RUNNING flag is
set in the ndd_flags field. Both NDD_CONNECTED and NDD_LIMBO_EXIT status blocks are sent.
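The flag transitions of the asynchronous open can be modeled as a small state step. This is only an illustrative model: the DEMO_ bit values are invented for the example and do not match the real NDD_ definitions, and demo_open_step is a hypothetical helper, not a driver routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag bits; the real NDD_* values come from AIX headers. */
#define DEMO_NDD_UP       0x1u
#define DEMO_NDD_RUNNING  0x2u
#define DEMO_NDD_LIMBO    0x4u

/* Bitmask describing which status blocks a step would send. */
#define DEMO_ST_CONNECTED    0x1u
#define DEMO_ST_LIMBO_ENTRY  0x2u
#define DEMO_ST_LIMBO_EXIT   0x4u

enum demo_event { DEMO_OPEN, DEMO_CONNECT_OK, DEMO_CONNECT_FAIL };

/* Apply one event to the flags and return the status blocks to send. */
unsigned demo_open_step(uint32_t *flags, enum demo_event ev)
{
    unsigned sent = 0;

    assert(flags != NULL);
    switch (ev) {
    case DEMO_OPEN:             /* open returns 0 with NDD_UP set */
        *flags |= DEMO_NDD_UP;
        break;
    case DEMO_CONNECT_OK:       /* connected, possibly after limbo */
        if (*flags & DEMO_NDD_LIMBO) {
            *flags &= ~DEMO_NDD_LIMBO;
            sent |= DEMO_ST_LIMBO_EXIT;
        }
        *flags |= DEMO_NDD_RUNNING;
        sent |= DEMO_ST_CONNECTED;
        break;
    case DEMO_CONNECT_FAIL:     /* connection failed, enter limbo */
        *flags |= DEMO_NDD_LIMBO;
        sent |= DEMO_ST_LIMBO_ENTRY;
        break;
    }
    return sent;
}
```

Feeding the model an open, a failed connection, and then a successful one yields NDD_LIMBO_ENTRY first and then both NDD_CONNECTED and NDD_LIMBO_EXIT, matching the sequence described above.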
The device will not be detached from the network until the device’s transmit queue is allowed to drain.
Data Transmission
The atmle_output function transmits data using the network device.
If the destination address in the packet is a broadcast address, the M_BCAST flag in the
p_mbuf->m_flags field should be set prior to entering this routine. A broadcast address is defined as
FF.FF.FF.FF.FF.FF (hex) for both Ethernet and Token Ring and C0.00.FF.FF.FF.FF (hex) for Token Ring.
If the destination address in the packet is a multicast or group address, the M_MCAST flag in the
p_mbuf->m_flags field should be set prior to entering this routine. A multicast or group address is defined
as any nonindividual address other than a broadcast address.
The device driver keeps statistics based on the M_BCAST and M_MCAST flags.
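A caller can decide which flag to set by classifying the destination address first. The sketch below assumes the destination is available as six bytes in the wire format of the emulated LAN; the demo_ names are hypothetical. For Token Ring it also assumes the usual noncanonical convention that the individual/group indicator is the most significant bit of the first byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum demo_lan  { DEMO_ETHERNET, DEMO_TOKEN_RING };
enum demo_cast { DEMO_UNICAST, DEMO_BROADCAST, DEMO_MULTICAST };

/* Classify a 6-byte destination address for the emulated LAN type. */
enum demo_cast demo_classify(enum demo_lan lan, const unsigned char dst[6])
{
    static const unsigned char all_ones[6] =
        { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };
    static const unsigned char tr_bcast[6] =
        { 0xC0, 0x00, 0xFF, 0xFF, 0xFF, 0xFF };
    bool group;

    assert(dst != NULL);
    if (memcmp(dst, all_ones, 6) == 0)
        return DEMO_BROADCAST;
    if (lan == DEMO_TOKEN_RING && memcmp(dst, tr_bcast, 6) == 0)
        return DEMO_BROADCAST;

    /* Ethernet (canonical): I/G is the least significant bit of byte 0.
     * Token Ring (noncanonical, assumed): I/G is the most significant
     * bit of byte 0. */
    group = (lan == DEMO_ETHERNET) ? (dst[0] & 0x01) != 0
                                   : (dst[0] & 0x80) != 0;
    return group ? DEMO_MULTICAST : DEMO_UNICAST;
}
```

A result of DEMO_BROADCAST corresponds to setting M_BCAST, and DEMO_MULTICAST to setting M_MCAST, in the p_mbuf->m_flags field before the output routine is called.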
Token Ring LANE emulates a duplex device. If a Token Ring packet is transmitted with a destination
address that matches the LAN MAC address of the local LE Client, the packet is received. This is also
true for Token Ring packets transmitted to a broadcast address, enabled functional address, or an
enabled group address. Ethernet LANE, on the other hand, emulates a simplex device and does not
receive its own broadcast or multicast transmit packets.
Data Reception
When the LANE device driver receives a valid packet from a network ATM device driver, the LANE device
driver calls the nd_receive function that is specified in the ndd_t structure of the network device. The
nd_receive function is part of a CDLI network demuxer. The packet is passed to the nd_receive function
in mbufs.
The LANE device driver passes one packet to the nd_receive function at a time.
The device driver sets the M_BCAST flag in the p_mbuf->m_flags field when a packet is received that
has an all-stations broadcast destination address. This address value is defined as FF.FF.FF.FF.FF.FF
(hex) for both Token Ring and Ethernet and is defined as C0.00.FF.FF.FF.FF (hex) for Token Ring.
The device driver sets the M_MCAST flag in the p_mbuf->m_flags field when a packet is received that
has a nonindividual address that is different than an all-stations broadcast address.
Asynchronous Status
When a status event occurs on the device, the LANE device driver builds the appropriate status block and
calls the nd_status function that is specified in the ndd_t structure of the network device. The nd_status
function is part of a CDLI network demuxer.
The following status blocks are defined for the LANE device driver:
Hard Failure
When an error occurs within the internal operation of the ATM LANE device driver, it is considered
unrecoverable. If the device was operational at the time of the error, the NDD_LIMBO and
NDD_RUNNING flags are disabled, and the NDD_DEAD flag is set in the ndd_flags field, and a hard
failure status block is generated.
Note: While the device driver is in this recovery logic, the network connections might not be fully
functional. The device driver notifies users when the device is fully functional by way of an
NDD_LIMBO_EXIT asynchronous status block.
When a general error occurs during operation of the device, this status block is generated.
ATMLE_MIB_GET
This control requests the LANE device driver’s current ATM LAN Emulation MIB statistics.
The user should pass in the address of an atmle_mibs_t structure as defined in
/usr/include/sys/atmle_mibs.h. The driver returns EINVAL if the buffer area is smaller than the required structure.
The ndd_flags field can be checked to determine the current state of the LANE device.
ATMLE_MIB_QUERY
This control requests the LANE device driver’s ATM LAN Emulation MIB support structure.
The device driver does not support any variables for read_write or write only. If the syntax of a member of
the structure is some integer type, the level of support flag is stored in the whole field, regardless of the
size of the field. For those fields defined as character arrays, the value is returned only in the first byte in
the field.
NDD_CLEAR_STATS
This control requests all the statistics counters kept by the LANE device driver to be zeroed.
NDD_DISABLE_ADDRESS
This command disables the receipt of packets destined for a multicast/group address; and for Token Ring,
it disables the receipt of packets destined for a functional address. For Token Ring, the functional address
indicator (bit 0, the most significant bit of byte 2) indicates whether the address is a functional address (the
bit is a 0) or a group address (the bit is a 1).
In all cases, the length field value is required to be 6. Any other value causes the LANE device driver to
return EINVAL.
Functional Address: The reference counts are decremented for those bits in the functional address that
are enabled (set to 1). If the reference count for a bit goes to zero, the bit is disabled in the functional
address mask for this LE Client.
If no functional addresses are active after receipt of this command, the TOK_RECEIVE_FUNC flag in the
ndd_flags field is reset. If no functional or multicast/group addresses are active after receipt of this
command, the NDD_ALTADDRS flag in the ndd_flags field is reset.
If no functional or multicast/group addresses are active after receipt of this command, the
NDD_ALTADDRS flag in the ndd_flags field is reset. Additionally for Token Ring, if no multicast/group
address is active after receipt of this command, the TOK_RECEIVE_GROUP flag in the ndd_flags field is
reset.
NDD_DISABLE_MULTICAST
The NDD_DISABLE_MULTICAST command disables the receipt of all packets with unregistered multicast
addresses, and only receives those packets whose multicast addresses were registered using the
NDD_ENABLE_ADDRESS command. The arg and length parameters are not used. The
NDD_MULTICAST flag in the ndd_flags field is reset only after the reference count for multicast
addresses has reached zero.
NDD_ENABLE_ADDRESS
The NDD_ENABLE_ADDRESS command enables the receipt of packets destined for a multicast/group
address; and additionally for Token Ring, it enables the receipt of packets destined for a functional
address. For Ethernet, the address is entered in canonical format, which is left-to-right byte order with the
I/G (Individual/Group) indicator as the least significant bit of the first byte. For Token Ring, the address
format is entered in noncanonical format, which is left-to-right bit and byte order and has a functional
address indicator. The functional address indicator (the most significant bit of byte 2) indicates whether the
address is a functional address (the bit value is 0) or a group address (the bit value is 1).
In all cases, the length field value is required to be 6. Any other length value causes the LANE device
driver to return EINVAL.
For example, if function G is assigned a functional address of C0.00.00.08.00.00 (hex), and function M is
assigned a functional address of C0.00.00.00.00.40 (hex), then ring station Y, whose node contains
function G and M, would have a mask of C0.00.00.08.00.40 (hex). Ring station Y would receive packets
addressed to either function G or M or to an address like C0.00.00.08.00.48 (hex) because that address
contains bits specified in the mask.
Note: The LANE device driver forces the first 2 bytes of the functional address to be C0.00 (hex). In
addition, bits 6 and 7 of byte 5 of the functional address are forced to 0.
The NDD_ALTADDRS and TOK_RECEIVE_FUNC flags in the ndd_flags field are set.
Because functional addresses are encoded in a bit-significant format, reference counts are kept on each of
the 31 least significant bits of the address. Reference counts are not kept on the 17 most significant bits
(the C0.00 (hex) of the functional address and the functional address indicator bit).
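The per-bit reference counting and the mask example above can be sketched as follows. This is an illustration under stated assumptions, not the driver's implementation: the demo_func_filter structure and the demo_ functions are hypothetical, and the forcing of individual address bits to 0 mentioned in the note is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_FUNC_BITS 31   /* the 31 least significant address bits */

struct demo_func_filter {
    uint32_t mask;                    /* current functional address mask */
    unsigned refcnt[DEMO_FUNC_BITS];  /* one reference count per bit */
};

/* Extract the 31 low-order bits of a 6-byte functional address, that
 * is, bytes 2 to 5 without the functional address indicator bit. */
uint32_t demo_func_bits(const unsigned char addr[6])
{
    assert(addr != NULL);
    return (((uint32_t)addr[2] << 24) | ((uint32_t)addr[3] << 16) |
            ((uint32_t)addr[4] << 8)  |  (uint32_t)addr[5]) & 0x7FFFFFFFu;
}

/* Enable: count one more reference on every bit set in 'bits'. */
void demo_func_enable(struct demo_func_filter *f, uint32_t bits)
{
    assert(f != NULL);
    for (int i = 0; i < DEMO_FUNC_BITS; i++) {
        uint32_t bit = (uint32_t)1 << i;
        if (bits & bit) {
            f->refcnt[i]++;
            f->mask |= bit;
        }
    }
}

/* Disable: drop one reference; clear a bit when its count hits zero. */
void demo_func_disable(struct demo_func_filter *f, uint32_t bits)
{
    assert(f != NULL);
    for (int i = 0; i < DEMO_FUNC_BITS; i++) {
        uint32_t bit = (uint32_t)1 << i;
        if ((bits & bit) && f->refcnt[i] > 0 && --f->refcnt[i] == 0)
            f->mask &= ~bit;
    }
}

/* A packet is accepted when its address shares a bit with the mask. */
int demo_func_match(const struct demo_func_filter *f, uint32_t bits)
{
    assert(f != NULL);
    return (f->mask & bits) != 0;
}
```

Enabling C0.00.00.08.00.00 and C0.00.00.00.00.40 leaves the low-order mask bits at 0x00080040, so an address such as C0.00.00.08.00.48 matches. Enabling the same address twice requires two disables before its bit is cleared.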
Multicast/Group Address: A multicast/group address table is used by the LANE device driver to store
address filters for incoming multicast/group packets. If the LANE device driver is unable to allocate kernel
memory when attempting to add a multicast/group address to the table, the address is not added and
ENOMEM is returned.
If the LANE device driver is successful in adding a multicast/group address, the NDD_ALTADDRS flag in
the ndd_flags field is set. Additionally for Token Ring, the TOK_RECEIVE_GROUP flag is set, and the
first 2 bytes of the group address are forced to be C0.00 (hex).
NDD_ENABLE_MULTICAST
The NDD_ENABLE_MULTICAST command enables the receipt of packets with any multicast (or group)
address. The arg and length parameters are not used. The NDD_MULTICAST flag in the ndd_flags field
is set.
NDD_DEBUG_TRACE
This control requests a LANE or MPOA driver to toggle the current state of its debug_trace configuration
flag.
This control is available to the operator through the LANE Ethernet entstat -t or LANE Token Ring tokstat
-t commands, or through the MPOA mpcstat -t command. The current state of the debug_trace
configuration flag is displayed in the output of each command as follows:
- For the entstat and tokstat commands, NDD_DEBUG_TRACE is enabled only if you see Driver
Flags: Debug.
- For the mpcstat command, you see Debug Trace: Enabled.
NDD_GET_ALL_STATS
This control requests all current LANE statistics, based on both the generic LAN statistics and the ATM
LANE protocol in progress.
For Ethernet, pass in the address of an ent_ndd_stats_t structure as defined in the file
/usr/include/sys/cdli_entuser.h.
For Token Ring, pass in the address of a tok_ndd_stats_t structure as defined in the file
/usr/include/sys/cdli_tokuser.h.
The ndd_flags field can be checked to determine the current state of the LANE device.
NDD_GET_STATS
This control requests the current generic LAN statistics based on the LAN protocol being emulated.
For Ethernet, pass in the address of an ent_ndd_stats_t structure as defined in the file
/usr/include/sys/cdli_entuser.h.
For Token Ring, pass in the address of a tok_ndd_stats_t structure as defined in the file
/usr/include/sys/cdli_tokuser.h.
The ndd_flags field can be checked to determine the current state of the LANE device.
NDD_MIB_ADDR
This control requests the current receive addresses that are enabled on the LANE device driver. The
following address types are returned, up to the amount of memory specified to accept the address list:
- Local LAN MAC Address
- Broadcast Address FF.FF.FF.FF.FF.FF (hex)
- Broadcast Address C0.00.FF.FF.FF.FF (hex) (returned for Token Ring only)
- Functional Address Mask (returned for Token Ring only, and only if at least one functional address has been enabled)
- Multicast/Group Address 1 through n (returned only if at least one multicast/group address has been enabled)
NDD_MIB_GET
This control requests the current MIB statistics based on whether the LAN being emulated is Ethernet or
Token Ring.
If Token Ring, pass in the address of a token_ring_all_mib_t structure as defined in the file
/usr/include/sys/tokenring_mibs.h.
The driver returns EINVAL if the buffer area is smaller than the required structure.
The ndd_flags field can be checked to determine the current state of the LANE device.
NDD_MIB_QUERY
This control requests the LANE device driver’s MIB support structure based on whether the LAN being
emulated is Ethernet or Token Ring.
If Token Ring, pass in the address of a token_ring_all_mib_t structure as defined in the file
/usr/include/sys/tokenring_mibs.h.
The driver returns EINVAL if the buffer area is smaller than the required structure.
Tracing can be disabled through SMIT or with the trcstop command. After trace is stopped, the results
can be formatted into readable text with the trcrpt command.
trcrpt > /tmp/trc.out
Only one MPOA Client is established per node. This MPC can support multiple ATM ports, containing LE
Clients/Servers and MPOA Servers. The key requirement is that, for this MPC to create shortcut paths,
each remote target node must also support an MPOA Client and must be directly accessible using the matrix
of switches representing the ATM network.
A user with root authority can add this MPOA Client using the smit mpoa_panel fast path, or click
Devices —> Communication —> ATM Adapter —> Services —> Multi-Protocol Over ATM (MPOA).
Configuration help text is also available within MPOA Client SMIT to aid in making any modifications to
attribute default values.
auto_cfg Auto Configuration with LEC/LECS. Specifies whether the MPOA Client is to be
automatically configured using LANE Configuration Server (LECS). Select Yes if a
primary LE Client is used to obtain the MPOA configuration attributes, which overrides
any manual or default values.
The default value is No (manual configuration). The attribute values are:
Yes - auto configuration
No - manual configuration
debug_trace Specifies whether this MPOA Client should keep a real time debug log within the
kernel and allow full system trace capability. Select Yes to enable full tracing
capabilities for this MPOA Client. Select No for optimal performance when minimal
tracing is desired.
The default is Yes (full tracing capability).
fragment Enables MPOA fragmentation and specifies whether fragmentation should be
performed on packets that exceed the maximum transmission unit (MTU) returned in
the MPOA Resolution Reply. Select Yes to have outgoing packets fragmented as
needed. Select No to avoid having outgoing packets fragmented. Selecting No causes
outgoing packets to be sent down the LANE path when fragmentation must be
performed. Incoming packets are always fragmented as needed even if No has been
selected. The default value is Yes.
hold_down_time Failed resolution request retry Hold Down Time (in seconds). Specifies the length of
time to wait before reinitiating a failed address resolution attempt. This value is
normally set to a value greater than retry_time_max. This attribute correlates to ATM
Forum MPC Configuration parameter MPC-p6.
The default value is 160 seconds.
init_retry_time Initial Request Retry Time (in seconds). Specifies the length of time to wait before
sending the first retry of a request that does not receive a response. This attribute
correlates to ATM Forum MPC Configuration parameter MPC-p4.
The default value is 5 seconds.
retry_time_max Maximum Request Retry Time (in seconds). Specifies the maximum length of time to
wait when retrying requests that have not received a response. Each retry duration
after the initial retry is doubled (2x) until the retry duration reaches this Maximum
Request Retry Time. All subsequent retries wait this maximum value. This attribute
correlates to ATM Forum MPC Configuration parameter MPC-p5.
The default value is 40 seconds.
sc_setup_count Shortcut Setup Frame Count. This attribute is used in conjunction with sc_setup_time
to determine when to establish a shortcut path. After the MPC has forwarded at least
sc_setup_count packets to the same target within a period of sc_setup_time, the MPC
attempts to create a shortcut VCC. This attribute correlates to ATM Forum MPC
Configuration parameter MPC-p1.
The default value is 10 packets.
sc_setup_time Shortcut Setup Frame Time (in seconds). This attribute is used in conjunction with
sc_setup_count above to determine when to establish a shortcut path. After the MPC
has forwarded at least sc_setup_count packets to the same target within a period of
sc_setup_time, the MPC attempts to create a shortcut VCC. This attribute correlates
to ATM Forum MPC Configuration parameter MPC-p2.
The default value is 1 second.
vcc_inact_time VCC Inactivity Timeout value (in minutes). Specifies the maximum length of time to
keep a shortcut VCC enabled when there is no send or receive activity on that VCC.
The default value is 20 minutes.
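The interaction between init_retry_time and retry_time_max can be illustrated with a small helper that returns the wait before a given retry. This is a sketch of the doubling rule as described in the attribute text; the demo_retry_delay function is hypothetical and is not part of the MPOA Client.

```c
#include <assert.h>

/* Wait, in seconds, before retry number n (n = 1 is the first retry).
 * The first retry waits init_retry_time; each later retry doubles the
 * previous wait until retry_time_max is reached, then stays there. */
unsigned demo_retry_delay(unsigned n, unsigned init_retry_time,
                          unsigned retry_time_max)
{
    unsigned delay = init_retry_time;

    assert(n >= 1);
    assert(init_retry_time >= 1 && retry_time_max >= init_retry_time);

    while (--n > 0 && delay < retry_time_max) {
        /* Clamp instead of doubling past the maximum. */
        delay = (delay > retry_time_max / 2) ? retry_time_max : delay * 2;
    }
    return delay;
}
```

With the default values of 5 and 40 seconds, successive retries wait 5, 10, 20, and 40 seconds, and every retry after that waits 40 seconds. A hold_down_time greater than retry_time_max, such as the default of 160 seconds, then applies before a failed resolution is started again.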
Tracing can be disabled through SMIT or with the trcstop command. After trace is stopped, the results
can be formatted into readable text with the trcrpt command.
trcrpt > /tmp/trc.out
For more information, see entstat Command, lecstat Command, mpcstat Command, and tokstat Command
in AIX 5L Version 5.3 Commands Reference.
The FDDI device driver is a dynamically loadable device driver. The device driver is automatically loaded
into the system at device configuration time as part of the configuration process.
The interface to the device is through the kernel services known as Network Services.
Interfacing to the device driver is achieved by calling the device driver’s entry points for opening the
device, closing the device, transmitting data, doing a remote dump, and issuing device control commands.
The device is initialized. When the resources have been successfully allocated, the device is attached to
the network.
If the station is not connected to another running station, the device driver opens, but is unable to transmit
Logical Link Control (LLC) packets. When in this mode, the device driver sets the
CFDDI_NDD_LLC_DOWN flag (defined in /usr/include/sys/cdli_fddiuser.h). When the adapter is able to
make a connection with at least one other station this flag is cleared and LLC packets can be transmitted.
The device is not detached from the network until the device’s transmit queue is allowed to drain.
Data Transmission
The fddi_output function transmits data using the network device.
The FDDI device driver supports up to three mbufs for each packet. It cannot gather from more than three
locations to a packet.
The FDDI device driver does not accept user-memory mbufs. It uses bcopy on small frames, which does
not work on user memory.
The driver requires that the entire MAC header be in a single mbuf.
The driver will not accept chained frames of different types. The user should not send Logical Link Control
(LLC) and station management (SMT) frames in the same call to output.
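A caller can check several of these transmit constraints before calling the output routine. The sketch below uses a hypothetical, stripped-down demo_mbuf structure in place of the kernel mbuf, and takes the MAC header length as a parameter because the size of cfddi_hdr_t is not given here. The rule about mixing LLC and SMT frames is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an mbuf; the kernel structure has many
 * more fields and different names. */
struct demo_mbuf {
    struct demo_mbuf *next;  /* next buffer of the same packet */
    size_t len;              /* bytes of data in this buffer */
    bool user_memory;        /* true if the data is in user memory */
};

#define DEMO_FDDI_MAX_MBUFS 3

/* True when the packet satisfies the constraints listed above. */
bool demo_fddi_packet_ok(const struct demo_mbuf *m, size_t mac_hdr_len)
{
    int count = 0;

    assert(m != NULL);
    /* The entire MAC header must sit in the first mbuf. */
    if (m->len < mac_hdr_len)
        return false;
    for (; m != NULL; m = m->next) {
        if (m->user_memory)                 /* no user-memory mbufs */
            return false;
        if (++count > DEMO_FDDI_MAX_MBUFS)  /* at most three mbufs */
            return false;
    }
    return true;
}
```

A packet that fails the check would need to be copied into kernel memory or coalesced into fewer buffers before it is handed to the driver.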
The user needs to fill the frame out completely before calling the output routine. The MAC header for an
FDDI packet is defined by the cfddi_hdr_t structure defined in /usr/include/sys/cdli_fddiuser.h. The first
Data Reception
When the FDDI device driver receives a valid packet from the network device, the FDDI device driver calls
the nd_receive function that is specified in the ndd_t structure of the network device. The nd_receive
function is part of a CDLI network demuxer. The packet is passed to the nd_receive function in mbufs.
For FDDI the type of data in an error log is the same for every error log. Only the specifics and the title of
the error log change. Information that follows includes an example of an error log and a list of error log
entries.
FILE NAME
line: 332 file: fddiintr_b.c
POS REGISTERS
F48E D317 3CC7 0008
SOURCE ADDRESS
4000 0000 0000
ATTACHMENT CLASS
0000 0001
SELF TESTS
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000
The 8fc8 Token-Ring device driver is a dynamically loadable device driver. The device driver automatically
loads into the system at device configuration time as part of the configuration process.
The interface to the device is through the kernel services known as Network Services.
Interfacing to the device driver is achieved by calling the device driver’s entry points for opening the
device, closing the device, transmitting data, doing a remote dump, and issuing device control commands.
The Token-Ring device driver interfaces with the Token-Ring High-Performance Network Adapter (8fc8). It
provides a Micro Channel-based connection to a Token-Ring network. The adapter is IEEE 802.5
compatible and supports both 4 and 16 megabit per second networks. The adapter supports only a
Shielded Twisted-Pair (STP) Token-Ring connection.
The Token Ring device driver does an asynchronous open. It starts the process of attaching the device to
the network, sets the NDD_UP flag in the ndd_flags field, and returns 0. The network attachment will
continue in the background where it is driven by device activity and system timers.
Note: The Network Services ns_alloc routine that calls this open routine causes the open to be
synchronous. It waits until the NDD_RUNNING flag is set in the ndd_flags field or 60 seconds have
passed.
If the connection is successful, the NDD_RUNNING flag will be set in the ndd_flags field and an
NDD_CONNECTED status block will be sent. The ns_alloc routine will return at this time.
If the device connection fails, the NDD_LIMBO flag will be set in the ndd_flags field and an
NDD_LIMBO_ENTRY status block will be sent.
If the device is eventually connected, the NDD_LIMBO flag will be turned off and the NDD_RUNNING flag
will be set in the ndd_flags field. Both NDD_CONNECTED and NDD_LIMBO_EXIT status blocks will be
sent.
The device will not be detached from the network until the device’s transmit queue is allowed to drain.
Data Transmission
The tok_output function transmits data using the network device.
The device driver does not support mbufs from user memory (which have the M_EXT flag set).
If the destination address in the packet is a broadcast address, the M_BCAST flag in the p_mbuf->m_flags
field should be set prior to entering this routine. A broadcast address is defined as 0xFFFF FFFF FFFF or
0xC000 FFFF FFFF. If the destination address in the packet is a multicast address the M_MCAST flag in
the p_mbuf->m_flags field should be set prior to entering this routine. A multicast address is defined as a
non-individual address other than a broadcast address. The device driver will keep statistics based upon
the M_BCAST and M_MCAST flags.
Data Reception
When the Token-Ring device driver receives a valid packet from the network device, the Token-Ring
device driver calls the nd_receive function that is specified in the ndd_t structure of the network device.
The nd_receive function is part of a CDLI network demuxer. The packet is passed to the nd_receive
function in mbufs.
The Token-Ring device driver passes one packet to the nd_receive function at a time.
The device driver sets the M_BCAST flag in the p_mbuf->m_flags field when a packet is received that has
an all-stations broadcast address. This address is defined as 0xFFFF FFFF FFFF or 0xC000 FFFF FFFF.
The device driver sets the M_MCAST flag in the p_mbuf->m_flags field when a packet is received that has
a non-individual address that is different than the all-stations broadcast address.
The adapter does not pass invalid packets to the device driver.
Asynchronous Status
When a status event occurs on the device, the Token-Ring device driver builds the appropriate status
block and calls the nd_status function that is specified in the ndd_t structure of the network device. The
nd_status function is part of a CDLI network demuxer.
The following status blocks are defined for the Token-Ring device driver.
Hard Failure
When a hard failure has occurred on the Token-Ring device, the following status blocks can be returned
by the Token-Ring device driver. One of these status blocks indicates that a fatal error occurred.
NDD_PIO_FAIL: When a PIO error occurs, it is retried 3 times. If the error still occurs, it is considered
unrecoverable and this status block is generated.
TOK_RECOVERY_THRESH: When most network errors occur, they are retried. Some errors are retried
with no limit and others have a recovery threshold. Errors that have a recovery threshold and fail all the
retries specified by the recovery threshold are considered unrecoverable and generate the following status
block:
Note: While the device driver is in this recovery logic, the device might not be fully functional. The
device driver will notify users when the device is fully functional by way of an NDD_LIMBO_EXIT
asynchronous status block.
NDD_ADAP_CHECK: When an adapter check has occurred, this status block is generated.
NDD_AUTO_RMV: When an internal hardware error following the beacon automatic-removal process
has been detected, this status block is generated.
NDD_CMD_FAIL: The device has detected an error in a command the device driver issued to it.
NDD_TX_ERROR: The device has detected an error in a packet given to the device.
NDD_TX_TIMEOUT: The device has detected an error in a packet given to the device.
TOK_ADAP_INIT: When the initialization of the device fails, this status block is generated.
TOK_RING_SPEED: When an error code of 0x27 (physical insertion, ring beaconing) occurs during open
of the device, this status block is generated.
TOK_RMV_ADAP: The device has received a remove ring station MAC frame indicating that a network
management function had directed this device to get off the ring.
TOK_WIRE_FAULT: When an error code of 0x11 (lobe media test, function failure) occurs during open of
the device, this status block is generated.
Ring Beaconing: When the Token-Ring device has detected a beaconing condition (or the ring has
recovered from one), the following status block is generated by the Token-Ring device driver:
Device Connected
When the device is successfully connected to the network the following status block is returned by the
device driver:
NDD_GET_STATS
The user should pass in the tok_ndd_stats_t structure as defined in /usr/include/sys/cdli_tokuser.h. The
driver will fail a call with a buffer smaller than the structure.
The statistics that are returned contain statistics obtained from the device. If the device is inoperable, the
statistics that are returned will not contain the current device statistics. The copy of the ndd_flags field
can be checked to determine the state of the device.
NDD_MIB_QUERY
The arg parameter specifies the address of the token_ring_all_mib_t structure. This structure is defined in
the /usr/include/sys/tokenring_mibs.h file.
The device driver does not support any variables for read_write or write only. If the syntax of a member of
the structure is some integer type, the level of support flag will be stored in the whole field, regardless of
the size of the field. For those fields defined as character arrays, the value will be returned only in the first
byte in the field.
NDD_MIB_GET
The arg parameter specifies the address of the token_ring_all_mib_t structure. This structure is defined in
the /usr/include/sys/tokenring_mibs.h file.
If the device is inoperable, the upstream field of the Dot5Entry_t structure will be zero instead of containing
the nearest active upstream neighbor (NAUN). Also the statistics that are returned contain statistics
obtained from the device. If the device is inoperable, the statistics that are returned will not contain the
current device statistics. The copy of the ndd_flags field can be checked to determine the state of the
device.
NDD_ENABLE_ADDRESS
This command enables the receipt of packets with a functional or a group address. The functional address
indicator (bit 0, the MSB, of byte 2) indicates whether the address is a functional address (the bit is a 0)
or a group address (the bit is a 1). The length field is not used because the address must be 6 bytes in
length.
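The indicator test described above amounts to examining one bit. The following sketch assumes the address is held as a 6-byte array with byte 0 first; the enum and function names are hypothetical and are not taken from the driver headers.

```c
#include <stdint.h>

enum tok_addr_kind { TOK_ADDR_FUNCTIONAL, TOK_ADDR_GROUP };

/* Bit 0 (the MSB) of byte 2 is the functional address indicator:
 * 0 selects a functional address, 1 selects a group address. */
static enum tok_addr_kind tok_classify_addr(const uint8_t addr[6])
{
    return (addr[2] & 0x80) ? TOK_ADDR_GROUP : TOK_ADDR_FUNCTIONAL;
}
```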
Functional Address: The specified address is ORed with the currently specified functional addresses
and the resultant address is set as the functional address for the device. Functional addresses are
encoded in a bit-significant format, thereby allowing multiple individual groups to be designated by a single
address.
The Token-Ring network architecture provides bit-specific functional addresses for widely-used functions,
such as configuration report server. Ring stations use functional address masks to identify these functions.
For example, if function G is assigned a functional address of 0xC000 0008 0000, and function M is
assigned a functional address of 0xC000 0000 0040, then ring station Y, whose node contains function G
and M, would have a mask of 0xC000 0008 0040. Ring station Y would receive packets addressed to
either function G or M or to an address like 0xC000 0008 0048 because that address contains bits
specified in the mask.
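The function G and function M example can be worked through in code. The sketch below holds each 48-bit address in a 64-bit integer; the helper names and the FUNC_PREFIX and FUNC_BITS constants are hypothetical, and the adapter-specific forcing of individual bits is not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

#define FUNC_PREFIX 0xC00000000000ULL /* first 2 bytes of every functional address */
#define FUNC_BITS   0x00007FFFFFFFULL /* the 31 bit-significant function bits */

/* OR a new functional address into the station's current mask. */
static uint64_t func_mask_add(uint64_t mask, uint64_t func_addr)
{
    return FUNC_PREFIX | ((mask | func_addr) & FUNC_BITS);
}

/* A functional destination is accepted when it shares at least one
 * function bit with the station's mask. */
static bool func_mask_accepts(uint64_t mask, uint64_t dest)
{
    /* Must carry the 0xC000 prefix with the indicator bit clear. */
    if ((dest & ~FUNC_BITS) != FUNC_PREFIX)
        return false;
    return (dest & mask & FUNC_BITS) != 0;
}
```

Adding 0xC000 0008 0000 and then 0xC000 0000 0040 to an empty mask yields 0xC000 0008 0040, and a destination of 0xC000 0008 0048 is accepted because it shares bits with that mask.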
Note: The device forces the first 2 bytes of the functional address to be 0xC000. In addition, bits 6 and 7
of byte 5 of the functional address are forced to a 0 by the device.
Because functional addresses are encoded in a bit-significant format, reference counts are kept on each of
the 31 least significant bits of the address. Reference counts are not kept on the 17 most significant bits
(the 0xC000 of the functional address and the functional address indicator bit).
Group Address: If no group address is currently enabled, the specified address is set as the group
address for the device. The group address will not be set and EINVAL will be returned if a group address
is currently enabled.
The device forces the first 2 bytes of the group address to be 0xC000.
The NDD_ALTADDRS and TOK_RECEIVE_GROUP flags in the ndd_flags field are set.
NDD_DISABLE_ADDRESS
This command disables the receipt of packets with a functional or a group address. The functional address
indicator (bit 0, the MSB, of byte 2) indicates whether the address is a functional address (the bit is a 0)
or a group address (the bit is a 1). The length field is not used because the address must be 6 bytes in
length.
Functional Address: The reference counts are decremented for those bits in the functional address that
are a one (on). If the reference count for a bit goes to zero, the bit will be "turned off" in the functional
address for the device.
If no functional addresses are active after receipt of this command, the TOK_RECEIVE_FUNC flag in the
ndd_flags field is reset. If no functional or group addresses are active after receipt of this command, the
NDD_ALTADDRS flag in the ndd_flags field is reset.
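The per-bit reference counting used for functional addresses can be sketched as follows. The structure and function names are hypothetical; only the 31 bit-significant bits are passed in, and the real driver would also program the adapter and update the ndd_flags field.

```c
#include <stdint.h>

#define FUNC_NBITS 31

struct func_addr_state {
    uint32_t refcnt[FUNC_NBITS]; /* one counter per functional bit */
    uint32_t active;             /* bits currently enabled on the device */
};

/* Enable: count each requested bit and turn it on. */
static uint32_t func_enable(struct func_addr_state *s, uint32_t bits)
{
    for (int i = 0; i < FUNC_NBITS; i++) {
        if (bits & (1u << i)) {
            s->refcnt[i]++;
            s->active |= 1u << i;
        }
    }
    return s->active;
}

/* Disable: a bit is turned off only when its count reaches zero. */
static uint32_t func_disable(struct func_addr_state *s, uint32_t bits)
{
    for (int i = 0; i < FUNC_NBITS; i++) {
        if ((bits & (1u << i)) && s->refcnt[i] > 0 && --s->refcnt[i] == 0)
            s->active &= ~(1u << i);
    }
    return s->active;
}
```

With this scheme, two users that enable the same bit must both disable it before the device stops receiving packets for that function.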
Group Address: If the group address that is currently enabled is specified, receipt of packets with a
group address is disabled. If a different address is specified, EINVAL will be returned.
If no group address is active after receipt of this command, the TOK_RECEIVE_GROUP flag in the
ndd_flags field is reset. If no functional or group addresses are active after receipt of this command, the
NDD_ALTADDRS flag in the ndd_flags field is reset.
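The single group address rule for this driver can be modeled with a small amount of state. The names below are hypothetical, and returning EINVAL when no group address is enabled at all is an assumption; the text specifies EINVAL only for a second enable and for a mismatched disable. The forcing of the first 2 bytes to 0xC000 is not modeled.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct group_state {
    bool     enabled; /* true while a group address is set */
    uint64_t addr;    /* the 48-bit group address in use */
};

/* Only one group address can be enabled at a time. */
static int group_enable(struct group_state *g, uint64_t addr)
{
    if (g->enabled)
        return EINVAL;
    g->enabled = true;
    g->addr = addr;
    return 0;
}

/* Disable succeeds only for the address that is currently enabled.
 * Assumption: disabling when nothing is enabled also returns EINVAL. */
static int group_disable(struct group_state *g, uint64_t addr)
{
    if (!g->enabled || g->addr != addr)
        return EINVAL;
    g->enabled = false;
    return 0;
}
```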
NDD_MIB_ADDR
The following addresses are returned:
• Device Physical Address (or alternate address specified by user)
• Broadcast Address 0xFFFF FFFF FFFF
• Broadcast Address 0xC000 FFFF FFFF
• Functional Address (only if a user specified a functional address)
• Group Address (only if a user specified a group address)
NDD_CLEAR_STATS
The counters kept by the device will be zeroed.
NDD_GET_ALL_STATS
The arg parameter specifies the address of the mon_all_stats_t structure. This structure is defined in the
/usr/include/sys/cdli_tokuser.h file.
The statistics that are returned contain statistics obtained from the device. If the device is inoperable, the
statistics that are returned will not contain the current device statistics. The copy of the ndd_flags field
can be checked to determine the state of the device.
The 8fa2 Token-Ring device driver is a dynamically loadable device driver. The device driver is
automatically loaded into the system at device configuration time as part of the configuration process.
The interface to the device is through the kernel services known as Network Services.
Interfacing to the device driver is achieved by calling the device driver’s entry points for opening the
device, closing the device, transmitting data, doing a remote dump, and issuing device control commands.
The Token-Ring device driver interfaces with the Token-Ring High-Performance Network Adapter (8fa2). It
provides a Micro Channel-based connection to a Token-Ring network. The adapter is IEEE 802.5
compatible and supports both 4 and 16 megabit per second networks. The adapter supports only an RJ-45
connection.
The Token-Ring device driver does a synchronous open. The device will be initialized at this time. When
the resources have been successfully allocated, the device will start the process of attaching the device to
the network.
If the connection is successful, the NDD_RUNNING flag will be set in the ndd_flags field and a
NDD_CONNECTED status block will be sent.
If the device connection fails, the NDD_LIMBO flag will be set in the ndd_flags field and a
NDD_LIMBO_ENTRY status block will be sent.
If the device is eventually connected, the NDD_LIMBO flag will be turned off and the NDD_RUNNING flag
will be set in the ndd_flags field. Both NDD_CONNECTED and NDD_LIMBO_EXIT status blocks will be
sent.
The device will not be detached from the network until the device’s transmit queue is allowed to drain.
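The flag and status-block transitions for a connection attempt can be summarized as a small state function. This is a sketch only: the EX_ flag values and status identifiers are hypothetical placeholders for the real NDD definitions, and sending NDD_LIMBO_ENTRY only on the first failure is an assumption.

```c
#include <stdint.h>

/* Hypothetical values standing in for the real ndd_flags bits. */
#define EX_NDD_RUNNING 0x1u
#define EX_NDD_LIMBO   0x2u

/* Hypothetical identifiers for the status blocks to send. */
enum ex_status { EX_CONNECTED = 1, EX_LIMBO_ENTRY = 2, EX_LIMBO_EXIT = 4 };

/* Apply the result of a connection attempt to the flags word and
 * return a bitmask of the status blocks that would be sent. */
static unsigned ex_connect_result(uint32_t *flags, int connected)
{
    unsigned sent = 0;

    if (connected) {
        if (*flags & EX_NDD_LIMBO) {      /* leaving recovery */
            *flags &= ~EX_NDD_LIMBO;
            sent |= EX_LIMBO_EXIT;
        }
        *flags |= EX_NDD_RUNNING;
        sent |= EX_CONNECTED;
    } else if (!(*flags & EX_NDD_LIMBO)) { /* first failure: enter recovery */
        *flags |= EX_NDD_LIMBO;
        sent |= EX_LIMBO_ENTRY;
    }
    return sent;
}
```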
Data Transmission
The tok_output function transmits data using the network device.
The device driver does not support mbufs from user memory (which have the M_EXT flag set).
If the destination address in the packet is a broadcast address, the M_BCAST flag in the
p_mbuf->m_flags field should be set prior to entering this routine. A broadcast address is defined as
0xFFFF FFFF FFFF or 0xC000 FFFF FFFF. If the destination address in the packet is a multicast address
the M_MCAST flag in the p_mbuf->m_flags field should be set prior to entering this routine. A multicast
address is defined as a non-individual address other than a broadcast address. The device driver will keep
statistics based upon the M_BCAST and M_MCAST flags.
If a packet is transmitted with a destination address which matches the adapter’s address, the packet will
be received. This is true for the adapter’s physical address, broadcast addresses (0xC000 FFFF FFFF or
0xFFFF FFFF FFFF), enabled functional addresses, or an enabled group address.
Data Reception
When the Token-Ring device driver receives a valid packet from the network device, the Token-Ring
device driver calls the nd_receive function that is specified in the ndd_t structure of the network device.
The nd_receive function is part of a CDLI network demuxer. The packet is passed to the nd_receive
function in mbufs.
The device driver will set the M_BCAST flag in the p_mbuf->m_flags field when a packet is received that
has an all-stations broadcast address. This address is defined as 0xFFFF FFFF FFFF or 0xC000 FFFF
FFFF.
The device driver will set the M_MCAST flag in the p_mbuf->m_flags field when a packet is received
that has a non-individual address different from the all-stations broadcast address.
The adapter will not pass invalid packets to the device driver.
Asynchronous Status
When a status event occurs on the device, the Token-Ring device driver builds the appropriate status
block and calls the nd_status function that is specified in the ndd_t structure of the network device. The
nd_status function is part of a CDLI network demuxer.
The following status blocks are defined for the Token-Ring device driver.
Hard Failure
When a hard failure has occurred on the Token-Ring device, the following status blocks can be returned
by the Token-Ring device driver. One of these status blocks indicates that a fatal error occurred.
NDD_PIO_FAIL
A PIO error is retried 3 times. If the error persists, it is considered unrecoverable and the
following status block is generated:
NDD_HARD_FAIL
A transmit error is retried. If the error is unrecoverable, the following status block is
generated:
NDD_ADAP_CHECK
When an adapter check occurs, the following status block is generated:
NDD_DUP_ADDR
When the device detects a duplicate address on the network, the following status block is
generated:
NDD_CMD_FAIL
When the device detects an error in a command that the device driver issued, the following
status block is generated:
TOK_RING_SPEED
When a ring speed error occurs while the device is being opened, the following status block is
generated:
Note: While the device driver is in this recovery logic, the device might not be fully functional. The device
driver will notify users when the device is fully functional by way of an NDD_LIMBO_EXIT
asynchronous status block.
Device Connected
When the device is successfully connected to the network, the following status block is
returned by the device driver:
Functional Address
The specified address is ORed with the currently specified functional addresses and the resultant address
is set as the functional address for the device. Functional addresses are encoded in a bit-significant
format, thereby allowing multiple individual groups to be designated by a single address.
The Token-Ring network architecture provides bit-specific functional addresses for widely used functions,
such as configuration report server. Ring stations use functional address masks to identify these functions.
For example, if function G is assigned a functional address of 0xC000 0008 0000, and function M is
assigned a functional address of 0xC000 0000 0040, then ring station Y, whose node contains function G
and M, would have a mask of 0xC000 0008 0040. Ring station Y would receive packets addressed to
either function G or M or to an address like 0xC000 0008 0048 because that address contains bits
specified in the mask.
The NDD_ALTADDRS and TOK_RECEIVE_FUNC flags in the ndd_flags field are set.
Because functional addresses are encoded in a bit-significant format, reference counts are kept on each of
the 31 least significant bits of the address.
Group Address
The device supports 256 general group addresses. Promiscuous mode is turned on when more than 256
group addresses need to be set. The device driver maintains a reference count on
this operation.
The NDD_ALTADDRS and TOK_RECEIVE_GROUP flags in the ndd_flags field are set.
NDD_DISABLE_ADDRESS
This command disables the receipt of packets with a functional or a group address. The functional
address indicator (bit 0, the MSB, of byte 2) indicates whether the address is a functional address
(the bit is a 0) or a group address (the bit is a 1). The length field is not used because the address
must be 6 bytes in length.
If no functional addresses are active after receipt of this command, the TOK_RECEIVE_FUNC flag in the
ndd_flags field is reset. If no functional or group addresses are active after receipt of this command, the
NDD_ALTADDRS flag in the ndd_flags field is reset.
Group Address
If the number of group addresses enabled is less than 256, the driver sends a command to the device to
disable the receipt of the packets with the specified group address. Otherwise, the driver just deletes the
group address from the group address table.
If there are fewer than 256 group addresses enabled after the receipt of this command, promiscuous
mode is turned off.
If no group address is active after receipt of this command, the TOK_RECEIVE_GROUP flag in the
ndd_flags field is reset. If no functional or group addresses are active after receipt of this command, the
NDD_ALTADDRS flag in the ndd_flags field is reset.
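The interaction between the 256-address hardware limit and promiscuous mode can be sketched with a counter. This is a simplified model under stated assumptions: the names are hypothetical, only the number of enabled group addresses is tracked (not the table itself), and the thresholds follow the wording above literally (on when more than 256 are needed, off when fewer than 256 remain).

```c
#include <stdbool.h>

#define GROUP_HW_MAX 256 /* group addresses the device can filter itself */

struct group_table {
    unsigned count;       /* group addresses currently enabled */
    bool     promiscuous; /* true while the device receives everything */
};

/* Enable one more group address. */
static void group_add(struct group_table *t)
{
    t->count++;
    if (t->count > GROUP_HW_MAX)
        t->promiscuous = true;  /* hardware filter is full */
}

/* Disable one group address. */
static void group_remove(struct group_table *t)
{
    if (t->count == 0)
        return;
    t->count--;
    if (t->count < GROUP_HW_MAX)
        t->promiscuous = false; /* hardware filtering suffices again */
}
```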
NDD_PRIORITY_ADDRESS
The driver returns the address of the device driver’s priority transmit routine.
NDD_MIB_ADDR
The driver will return at least three addresses: device physical address (or alternate address
specified by user) and two broadcast addresses (0xFFFF FFFF FFFF and 0xC000 FFFF FFFF).
Additional addresses specified by the user, such as functional address and group addresses,
might also be returned.
NDD_CLEAR_STATS
The counters kept by the device are zeroed.
NDD_GET_ALL_STATS
The arg parameter specifies the address of the mon_all_stats_t structure. This structure is
defined in the /usr/include/sys/cdli_tokuser.h file.
The statistics returned include statistics obtained from the device. If the device is inoperable, the
statistics returned do not contain the current device statistics. The copy of the ndd_flags field can
be checked to determine the state of the device.
Trace Points and Error Log Templates for 8fa2 Token-Ring Device Driver
The Token-Ring device driver has four trace points. The IDs are defined in the
/usr/include/sys/cdli_tokuser.h file.
The interface to the device is through the kernel services known as Network Services. Interfacing to the
device driver is achieved by calling the device driver’s entry points to perform the following actions:
• Opening the device
• Closing the device
• Transmitting data
• Performing a remote dump
• Issuing device control commands
The PCI Token-Ring High Performance Device Driver (14101800) interfaces with the PCI Token-Ring
High-Performance Network Adapter (14101800). The adapter is IEEE 802.5 compatible and supports both
4 and 16 Mbps networks. The adapter supports only an RJ-45 connection.
The PCI Token-Ring Device Driver (14103e00) interfaces with the PCI Token-Ring Network Adapter
(14103e00). The adapter is IEEE 802.5 compatible and supports both 4 and 16 Mbps networks. The
adapter supports both an RJ-45 and a 9 Pin connection.
Configuration Parameters
The following configuration parameter is supported by all PCI Token-Ring Device Drivers:
Ring Speed
The device driver supports a user-configurable parameter that indicates if the token-ring is to run
at 4 or 16 Mbps.
The device driver supports a user-configurable parameter that selects the ring speed of the
adapter. There are three options for the ring speed: 4, 16, or autosense.
1. If 4 is selected, the device driver opens the adapter at 4 Mbps. It returns an error if the ring
speed does not match the network speed.
2. If 16 is selected, the device driver opens the adapter at 16 Mbps. It returns an error if the
ring speed does not match the network speed.
3. If autosense is selected, the adapter guarantees a successful open, and the speed used to
open depends on the following:
• If the adapter is opened on an existing network, the speed is determined by the ring speed
of the network.
• If the device is opened on a new network and the adapter is new, 16 Mbps is used. If
the adapter has previously opened successfully, the ring speed is determined by the speed of the last
successful open.
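The three ring speed options can be expressed as a single selection function. This is a sketch under stated assumptions: the enum and function names are hypothetical, a network_speed of 0 stands for a new ring with no established speed, a last_good of 0 stands for an adapter that has never opened successfully, and a fixed setting on a new ring is assumed to succeed.

```c
enum ring_cfg { RING_CFG_AUTO = 0, RING_CFG_4 = 4, RING_CFG_16 = 16 };

/* Return the speed (4 or 16) used to open the adapter, or -1 when a
 * fixed setting does not match the speed of an existing ring. */
static int ring_open_speed(enum ring_cfg cfg, int network_speed, int last_good)
{
    if (cfg == RING_CFG_AUTO) {
        if (network_speed != 0)
            return network_speed;           /* follow the existing ring */
        return last_good != 0 ? last_good   /* reuse last successful speed */
                              : 16;         /* new adapter on a new ring */
    }
    if (network_speed != 0 && network_speed != (int)cfg)
        return -1;                          /* fixed speed mismatch */
    return (int)cfg;
}
```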
Software Transmit Queue
The device driver supports a user-configurable transmit queue that can be set to store between 32
and 2048 transmit request pointers. Each transmit request pointer corresponds to a transmit
request that might be for several buffers of data.
Receive Queue
The device driver supports a user-configurable receive queue that can be set to store between 32
and 160 receive buffers. These buffers are mbuf clusters into which the device writes the received
data.
In addition, the following configuration parameters are supported by the PCI Token-Ring High Performance
Device Driver (14101800):
Priority Data Transmission
The device driver supports a user option to request priority transmission of the data packets.
Software Priority Transmit Queue
The device driver supports a user-configurable priority transmit queue that can be set to store
between 32 and 160 transmit request pointers. Each transmit request pointer corresponds to a
transmit request that might be for several buffers of data.
If the connection is successful, the NDD_RUNNING flag is set in the ndd_flags field, and an
NDD_CONNECTED status block is sent.
If the device connection fails, the NDD_LIMBO flag is set in the ndd_flags field, and an
NDD_LIMBO_ENTRY status block is sent.
If the device is eventually connected, the NDD_LIMBO flag is turned off, and the NDD_RUNNING flag is
set in the ndd_flags field. Both NDD_CONNECTED and NDD_LIMBO_EXIT status blocks are sent.
The device is not detached from the network until the device’s transmit queue is allowed to drain.
Data Transmission
The device drivers do not support mbuf structures from user memory that have the M_EXT flag set.
If the destination address in the packet is a broadcast address, the M_BCAST flag in the p_mbuf->m_flags
field must be set prior to entering this routine. A broadcast address is defined as 0xFFFF FFFF FFFF or
0xC000 FFFF FFFF. If the destination address in the packet is a multicast address, the M_MCAST flag in
the p_mbuf->m_flags field must be set prior to entering this routine. A multicast address is defined as a
non-individual address other than a broadcast address. The device driver keeps statistics based on the
M_BCAST and M_MCAST flags.
If a packet is transmitted with a destination address that matches the adapter’s address, the packet is
received. This is true for the adapter’s physical address, broadcast addresses (0xC000 FFFF FFFF or
0xFFFF FFFF FFFF), enabled functional addresses, or an enabled group address.
Data Reception
When the Token-Ring device driver receives a valid packet from the network device, the Token-Ring
device driver calls the nd_receive() function specified in the ndd_t structure of the network device. The
nd_receive() function is part of a CDLI network demuxer. The packet is passed to the nd_receive()
function in the mbuf structures.
The Token-Ring device driver passes only one packet to the nd_receive() function at a time.
The device driver sets the M_BCAST flag in the p_mbuf->m_flags field when a packet that has an
all-stations broadcast address is received. This address is defined as 0xFFFF FFFF FFFF or 0xC000
FFFF FFFF.
The device driver sets the M_MCAST flag in the p_mbuf->m_flags field when a packet is received that has
a non-individual address that is different from the all-stations broadcast address.
The adapter does not pass invalid packets to the device driver.
Asynchronous Status
When a status event occurs on the device, the Token-Ring device driver builds the appropriate status
block and calls the nd_status() function specified in the ndd_t structure of the network device. The
nd_status() function is part of a CDLI network demuxer.
The following status blocks are defined for the Token-Ring device driver.
Note: While the device driver is in this recovery logic, the device might not be fully functional. The device
driver notifies users when the device is fully functional by way of an NDD_LIMBO_EXIT
asynchronous status block:
Following is a list of trace hooks and the location of definition files for the existing Token-Ring device drivers.
The PCI Token-Ring High Performance Device Driver (14101800): Definition File:
/sys/cdli_tokuser.h
Error Logging
PCI Token-Ring High Performance Device Driver (14101800): The error IDs for the PCI Token-Ring
High Performance Device Driver (14101800) are as follows:
ERRID_STOK_ADAP_CHECK
The microcode on the device performs a series of diagnostic checks when the device is idle.
These checks can find errors, and they are reported as adapter checks. If the device is connected
to the network when this error occurs, the device driver goes into Network Recovery Mode in an
attempt to recover from the error. The device is temporarily unavailable during the recovery
procedure. User intervention is not required for this error unless the problem persists.
ERRID_STOK_ADAP_OPEN
The device driver was unable to open the device. The device driver goes into Network Recovery Mode
in an attempt to recover from the error. The device is temporarily unavailable during the recovery
procedure. User intervention is not required for this error unless the problem persists.
ERRID_STOK_AUTO_RMV
An internal hardware error following the beacon automatic removal process was detected. The
device driver goes into Network Recovery Mode in an attempt to recover from the error. The
device is temporarily unavailable during the recovery procedure. User intervention is not required
for this error unless the problem persists.
ERRID_STOK_RING_SPEED
The ring speed (or ring data rate) is probably wrong. Contact the network administrator to
determine the speed of the ring. The device driver only retries twice at 2-minute intervals after this
error log entry is generated.
ERRID_STOK_DMAFAIL
The device detected a DMA error in a TX or RX operation. The device driver goes into Network
Recovery Mode in an attempt to recover from the error. The device is temporarily unavailable
during the recovery procedure. User intervention is not required unless the problem persists.
ERRID_STOK_BUS_ERR
The device detected a Micro Channel bus error. The device driver goes into Network Recovery
PCI Token-Ring Device Driver (14103e00): The error IDs for the PCI Token-Ring Device Driver
(14103e00) are as follows:
ERRID_CSTOK_ADAP_CHECK
The microcode on the device performs a series of diagnostic checks when the device is idle on
initialization. These checks can find errors, which are reported as adapter checks. If the device was
connected to the network when this error occurred, the device driver will go into Network Recovery
Mode in an attempt to recover from the error. The device is temporarily unavailable during the
recovery procedure. After this error log entry has been generated, the device driver will retry 3
times with no delay between retries. User intervention is not required for this error unless the
problem persists.
ERRID_CSTOK_ADAP_OPEN
The device driver was unable to open the device. The device driver will go into Network Recovery
Mode in an attempt to recover from this error. The device is temporarily unavailable during the
recovery procedure. The device driver will retry indefinitely with a 30-second delay between retries
to recover. User intervention is not required for this error unless the problem persists.
ERRID_CSTOK_AUTO_RMV
An internal hardware error following the beacon automatic removal process has been detected.
The device driver will go into Network Recovery Mode in an attempt to recover from the error. The
device is temporarily unavailable during the recovery procedure. User intervention is not required
for this error unless the problem persists.
The following information is provided about each of the Ethernet device drivers:
• Configuration Parameters
• Interface Entry Points
• Asynchronous Status
• Device Control Operations
• Trace
• Error Logging
For each Ethernet device, the interface to the device driver is achieved by calling the entry points for
opening, closing, transmitting data, and issuing device control commands.
There are a number of Ethernet device drivers in use. All drivers provide PCI-based connections to an
Ethernet network, and support both Standard and IEEE 802.3 Ethernet Protocols.
The PCI Ethernet Adapter Device Driver (22100020) supports the PCI Ethernet BNC/RJ-45 Adapter
(feature 2985) and the PCI Ethernet BNC/AUI Adapter (feature 2987), as well as the integrated Ethernet
port on certain systems.
The 10/100 Mbps Ethernet PCI Adapter Device Driver (23100020) supports the 10/100 Mbps Ethernet PCI
Adapter (feature 2968) and the Four Port 10/100 Mbps Ethernet PCI Adapter (features 4951 and 4961), as
well as the integrated Ethernet port on certain systems.
The 10/100 Mbps Ethernet PCI Adapter II Device Driver (1410ff01) supports the 10/100 Mbps Ethernet
PCI Adapter II (feature 4962), as well as the integrated Ethernet port on certain systems.
The Gigabit Ethernet-SX PCI Adapter Device Driver (14100401) supports the Gigabit Ethernet-SX PCI
Adapter (feature 2969) and the 10/100/1000 Base-T Ethernet Adapter (feature 2975).
The Gigabit Ethernet-SX PCI-X Adapter Device Driver (14106802) supports the Gigabit Ethernet-SX PCI-X
Adapter (feature 5700).
The 10/100/1000 Base-TX Ethernet PCI-X Adapter Device Driver (14106902) supports the 10/100/1000
Base-TX Ethernet PCI-X Adapter (feature 5701).
The 2-Port Gigabit Ethernet-SX PCI-X Adapter Device Driver (14108802) supports the 2-Port Gigabit
Ethernet-SX PCI-X Adapter (feature 5707).
The 2-Port 10/100/1000 Base-TX PCI-X Adapter Device Driver (14108902) supports the 2-Port
10/100/1000 Base-TX PCI-X Adapter (feature 5706).
The 10 Gigabit Ethernet-SR PCI-X Adapter Device Driver (1410ba02) supports the 10 Gigabit Ethernet-SR
PCI-X Adapter (feature 5718).
The 10 Gigabit Ethernet-LR PCI-X Adapter Device Driver (1410bb02) supports the 10 Gigabit Ethernet-LR
PCI-X Adapter (feature 5719).
The 10 Gigabit Ethernet-SR PCI-X 2.0 DDR Adapter Device Driver supports the 10 Gigabit Ethernet-SR
PCI-X 2.0 DDR Adapter (feature 5721).
The 10 Gigabit Ethernet-LR PCI-X 2.0 DDR Adapter Device Driver supports the 10 Gigabit Ethernet-LR
PCI-X 2.0 DDR Adapter (feature 5722).
The Gigabit Ethernet-SX Adapter Device Driver (14101403) supports the eServer BladeCenter JS21
Gigabit Ethernet-SX Adapter.
The 4-Port 10/100/1000 Base-TX Ethernet PCI-X Adapter Device Driver (14101103) supports the 4-Port
10/100/1000 Base-TX PCI-X Adapter (feature 5740).
Configuration Parameters
The following configuration parameter is supported by all Ethernet device drivers:
Alternate Ethernet Addresses
The device drivers support the device’s hardware address as the network address or an alternate
network address configured through software. When an alternate address is used, any valid
Individual Address can be used. The least significant bit of an Individual Address must be set to
zero. A multicast address cannot be defined as a network address. Two configuration parameters
are provided: one to specify the alternate Ethernet address and one to enable it.
Note: If autonegotiation is selected, the remote link device must also be set to autonegotiate or
the link might not function properly.
Inter Packet Gap
The 10/100 Mbps Ethernet PCI Adapter Device Driver (23100020) supports a user-configurable
inter packet gap for the adapter. The inter packet gap attribute controls the aggressiveness of the
adapter on the network. A small number increases the aggressiveness of the adapter, and a large
number decreases the aggressiveness (and increases the fairness) of the adapter. A small number
(more aggressive) could cause the adapter to capture the network by forcing other less aggressive
nodes to defer. A larger number (less aggressive) might cause the adapter to defer more often
than normal. If the statistics for other nodes on the network show a large number of collisions and
deferrals, then try increasing this number. The default is 96, which results in an IPG of 9.6
microseconds for 10 Mbps and 0.96 microseconds for 100 Mbps media speed. Each unit of bit rate
introduces an IPG of 100 nsec at 10 Mbps, and 10 nsec at 100 Mbps media speed.
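The inter packet gap arithmetic can be checked with a short calculation. The function name is hypothetical; the sketch covers only the two media speeds discussed above and returns 0 for any other value.

```c
/* Convert an inter packet gap attribute value to nanoseconds.
 * Each unit adds 100 ns at 10 Mbps and 10 ns at 100 Mbps. */
static unsigned long ipg_nsec(unsigned ipg_units, unsigned mbps)
{
    unsigned long ns_per_unit;

    if (mbps == 10)
        ns_per_unit = 100;
    else if (mbps == 100)
        ns_per_unit = 10;
    else
        return 0; /* speed not covered by this sketch */

    return ipg_units * ns_per_unit;
}
```

For the default value of 96, this gives 9600 ns (9.6 microseconds) at 10 Mbps and 960 ns (0.96 microseconds) at 100 Mbps, matching the figures above.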
Link Polling Timer
The 10/100 Mbps Ethernet PCI Adapter Device Driver (23100020) implements a polling function
(Enable Link Polling) that periodically queries the adapter to determine whether the Ethernet link
is up or down. The Enable Link Polling attribute is disabled by default. If this function is enabled,
the link polling timer value indicates how often the driver should poll the adapter for link status.
This value can range from 100 to 1000 milliseconds. If the adapter’s link goes down, the device
driver clears its NDD_RUNNING flag. When the device driver finds that the link has come back
up, it sets the NDD_RUNNING flag again. For this to work, protocol layer implementations, such
as EtherChannel, need notification when the link goes down. Enable the
Enable Link Polling attribute to obtain this notification. Because of the additional PIO calls that
the device driver makes, enabling this attribute can decrease the performance of this adapter.
Note: If autonegotiation is selected, the remote link device must also be set to autonegotiate or
the link might not function properly.
Link Polling Timer
The 10/100 Mbps Ethernet PCI Adapter II Device Driver (1410ff01) implements a polling function
that periodically queries the adapter to determine whether the Ethernet link is up or down. If this
function is enabled, the link polling timer value indicates how often the driver should poll the
adapter for link status. This value can range from 100 to 1000 milliseconds.
Checksum Offload
The 10/100 Mbps Ethernet PCI Adapter II Device Driver (1410ff01) supports the capability of the
adapter to calculate TCP checksums in hardware. If this capability is enabled, the TCP checksum
calculation is performed on the adapter instead of the host, which can increase system
performance. Possible values are Yes and No.
Transmit TCP Resegmentation Offload
The 10/100 Mbps Ethernet PCI Adapter II Device Driver (1410ff01) supports the capability of the
adapter to perform resegmentation of transmitted TCP segments in hardware. This capability
enables the host to use TCP segments that are larger than the actual MTU size of the Ethernet
link, which can increase system performance. Possible values are Yes and No.
IPsec Offload
The 10/100 Mbps Ethernet PCI Adapter II Device Driver (1410ff01) supports the capability of the
adapter to perform IPsec cryptographic algorithms for data encryption and authentication in
hardware. This capability enables the host to offload processor-intensive cryptographic processing
to the adapter, which can increase system performance. Possible values are Yes and No.
Note: The mbuf describing a frame to be transmitted contains a flag that indicates whether the
adapter should calculate the checksum for the frame.
Media Speed
The Gigabit Ethernet-SX PCI Adapter Device Driver (14100401) supports a user-configurable
media speed only for the IBM 10/100/1000 Base-T Ethernet PCI adapter (feature 2975). For the
Gigabit Ethernet-SX PCI Adapter (feature 2969), the only possible choice is autonegotiation. The
media speed attribute indicates the speed at which the adapter attempts to operate. The available
speeds are 10 Mbps half-duplex, 10 Mbps full-duplex, 100 Mbps half-duplex, 100 Mbps full-duplex
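and autonegotiation, with a default of autonegotiation. Select autonegotiate when the adapter
should use autonegotiation across the network to determine the speed. When the network does not
support autonegotiation, select the specific speed.
Note: The autonegotiation setting must be selected in order for the adapter to run at 1000 Mbps.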
Enable Hardware Transmit TCP Resegmentation
Setting this attribute to Yes indicates that the adapter should perform TCP resegmentation on
transmitted TCP segments. This capability allows TCP/IP to send larger datagrams to the adapter,
which can increase performance. If No is specified, TCP resegmentation is not performed.
Note: The default values for the Gigabit Ethernet-SX PCI Adapter Device Driver (14100401) configuration
parameters were chosen for optimal performance, and should not be changed unless IBM
recommends a change.
The following configuration parameters for the Gigabit Ethernet-SX PCI Adapter Device Driver (14100401)
are not accessible using the SMIT interface, and can only be modified using the chdev command line
interface:
stat_ticks
The number of microseconds that the adapter waits before updating the adapter statistics (through
a DMA write) and generating an interrupt to the host. Valid values range from 1000-1000000. The
default value is 1000000.
receive_ticks
The number of microseconds that the adapter waits before updating the receive return ring
producer index (through a DMA write) and generating an interrupt to the host. Valid values range
from 0-1000. The default value is 50.
receive_bds
The number of receive buffers that the adapter transfers to host memory before updating the
receive return ring producer index (through a DMA write) and generating an interrupt to the host.
Valid values range from 0-128. The default value is 6.
tx_done_ticks
The number of microseconds that the adapter waits before updating the send consumer index
(through a DMA write) and generating an interrupt to the host. Valid values range from 0-1000000.
The default value is 1000000.
tx_done_count
The number of transmit buffers that the adapter transfers from host memory before updating the
send consumer index (through a DMA write) and generating an interrupt to the host. Valid values
range from 0-128. The default value is 64.
receive_proc
When this number of receive buffer descriptors is processed by the device driver (or all packets
are received), the device driver returns this number of receive buffer descriptors to the adapter
through an MMIO write. Valid values range from 1-64. The default value is 16.
rxdesc_count
When this number of receive buffer descriptors is processed by the device driver (or all packets
were received), the device driver exits the rx_handler() routine and continues processing other
adapter events, such as transmit completions and adapter status changes. Valid values range from
1-1000000. The default value is 1000.
slih_hog
The number of adapter events (such as receive completions, transmit completions and adapter
status changes) processed by the device driver per interrupt. Valid values range from 1-1000000.
The default value is 10.
copy_bytes
When the number of data bytes in a transmit mbuf exceeds this value, the device driver maps the
Note: The mbuf structure that describes a transmitted frame contains a flag that indicates
whether the adapter should perform TCP resegmentation for the frame.
Enable Hardware Transmit and Receive Checksum
Setting this attribute to the Yes value indicates that the adapter calculates the checksum for
transmitted and received TCP frames. If you specify the No value, the checksum is calculated by
the appropriate software.
Note: The mbuf structure that describes a transmitted frame contains a flag that indicates
whether the adapter should calculate the checksum for the frame.
The following configuration parameters for the Gigabit Ethernet-SX PCI-X Adapter Device Driver
(14106802) are not accessible using the SMIT interface, and can only be modified using the chdev
command line interface:
rx_hog
When this number of receive buffer descriptors is processed by the device driver (or all packets
Note: The mbuf structure that describes a transmitted frame contains a flag that indicates
whether the adapter should perform TCP resegmentation for the frame.
Enable Hardware Transmit and Receive Checksum
Setting this attribute to the Yes value indicates that the adapter calculates the checksum for
transmitted and received TCP frames. If you specify the No value, the checksum is calculated by
the appropriate software.
Note: The mbuf describing a frame to be transmitted contains a flag that indicates whether the
adapter should calculate the TCP checksum for the frame.
The following configuration parameters for the 10/100/1000 Base-T Ethernet PCI-X Adapter Device Driver
(14106902) are not accessible using the SMIT interface, and can only be modified using the chdev
command line interface:
rx_hog
When this number of receive buffer descriptors is processed by the device driver (or all packets
were received), the device driver exits the rx_handler() routine and continues processing other
adapter events (such as transmit completions and adapter status changes). Valid values range
from 1 - 1000000. The default value is 1000.
slih_hog
The number of adapter events (such as receive completions, transmit completions, and adapter
status changes) processed by the device driver per interrupt. Valid values range from 1 - 1000000.
The default value is 10.
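The effect of a processing limit such as rx_hog can be pictured as a bounded loop. The following C sketch is a simplified model, not the driver's rx_handler() routine: the function names are hypothetical, and the sketch assumes that the limit applies to each invocation of the receive handler.

```c
#include <stdint.h>

/* Hypothetical model of a bounded receive loop; not the driver's rx_handler().
 * pending is the number of received packets waiting in the ring, and rx_hog
 * is the configured limit. Returns how many descriptors are processed before
 * the handler exits to service other adapter events. */
uint32_t rx_batch_size(uint32_t pending, uint32_t rx_hog)
{
    return pending < rx_hog ? pending : rx_hog;
}

/* Number of passes through the receive handler needed to drain the ring. */
uint32_t rx_passes_needed(uint32_t pending, uint32_t rx_hog)
{
    if (rx_hog == 0)
        return 0;                            /* 0 is outside the valid range */
    return (pending + rx_hog - 1) / rx_hog;  /* ceiling division */
}
```

Under this model, a lower limit returns control to transmit completions and status changes sooner, at the cost of more passes through the receive handler under heavy load.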
copy_bytes
When the number of data bytes in a transmit mbuf exceeds this value, the device driver maps the
mbuf data area into DMA memory and updates the transmit descriptor so that it points to this
DMA memory area. When the number of data bytes in a transmit mbuf does not exceed this
value, the data is copied from the mbuf into a preallocated transmit buffer that is already mapped
into DMA memory. The device driver also attempts to coalesce transmit data in an mbuf chain into
a single preallocated transmit buffer until the total transmit data size exceeds that of the
preallocated buffer (2048 bytes). Valid values range from 64 - 2048. The default value is 2048.
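The copy_bytes threshold amounts to a choice between two transmit paths. The following C sketch shows only that comparison; the enumeration and function names are hypothetical, because the internals of the device driver are not shown in this document.

```c
#include <stddef.h>

/* Hypothetical names used for illustration only. */
enum tx_path {
    TX_COPY_TO_PREALLOCATED,  /* copy data into a buffer already mapped for DMA */
    TX_MAP_MBUF               /* map the mbuf data area itself for DMA */
};

/* Selects the transmit path for one mbuf, following the rule in the text:
 * data longer than copy_bytes is mapped, anything else is copied. */
enum tx_path choose_tx_path(size_t mbuf_data_len, size_t copy_bytes)
{
    return mbuf_data_len > copy_bytes ? TX_MAP_MBUF : TX_COPY_TO_PREALLOCATED;
}
```

For example, with the default value of 2048, an mbuf holding exactly 2048 bytes is copied, and an mbuf holding 2049 bytes is mapped.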
delay_open
Setting this attribute to Yes causes the adapter device driver to delay its open completion until the
Ethernet link status is determined to be either up or down. This prevents applications from sending
data before the Ethernet link is established. Commands such as ifconfig, however, might take
longer to complete, especially when an active Ethernet link is not present. Valid values are Yes
and No. The default value is No.
compat_mode
Setting this attribute to Yes forces the adapter to implement an early version of the IEEE 802.3z
autonegotiation protocol. Use the Yes value only if the adapter is unable to establish a link with
your older Gigabit Ethernet-TX adapters or switches. Valid values are Yes and No. The default
value is No.
Note: If this option is enabled, the adapter cannot establish a link with newer Gigabit Ethernet-TX
hardware. Enable this option only if you cannot establish a link using autonegotiation, but
can force a link at a slower speed (for example, 100 full-duplex).
Note: The mbuf structure that describes a transmitted frame contains a flag that indicates
whether the adapter should perform TCP resegmentation for the frame.
Enable Hardware Transmit and Receive Checksum
Setting this attribute to the Yes value indicates that the adapter calculates the checksum for
transmitted and received TCP frames. If you specify the No value, the checksum is calculated by
the appropriate software.
Note: The mbuf structure that describes a transmitted frame contains a flag that indicates
whether the adapter should calculate the TCP checksum for the frame.
Enable Failover Mode
This attribute indicates the requested failover configuration for the port. Possible values are
primary, backup, and disable. The primary value indicates that the port is to act as the primary
port in a failover configuration for a 2-Port Gigabit adapter. The backup value indicates that the
port is to act as the backup port in a failover configuration for a 2-Port Gigabit adapter. The
disable value indicates that the port is not a member of a failover configuration. The default
value for failover is disable.
The following configuration parameters for the 2-Port Gigabit Ethernet-SX PCI-X Adapter Device Driver
(14108802) are not accessible using the SMIT interface, and can only be modified using the chdev
command line interface:
rx_hog
When this number of receive buffer descriptors is processed by the device driver (or all packets
were received), the device driver exits the rx_handler() routine and continues processing other
adapter events (such as transmit completions and adapter status changes). Valid values range
from 1 - 1000000. The default value is 1000.
slih_hog
The number of adapter events (such as receive completions, transmit completions, and adapter
status changes) processed by the device driver per interrupt. Valid values range from 1 - 1000000.
The default value is 10.
copy_bytes
When the number of data bytes in a transmit mbuf exceeds this value, the device driver maps the
mbuf data area into DMA memory and updates the transmit descriptor so that it points to this
DMA memory area. When the number of data bytes in a transmit mbuf does not exceed this
value, the data is copied from the mbuf into a preallocated transmit buffer that is already mapped
Note: The mbuf structure that describes a transmitted frame contains a flag that indicates
whether the adapter should perform TCP resegmentation for the frame.
Enable Hardware Transmit and Receive Checksum
Setting this attribute to the Yes value indicates that the adapter calculates the checksum for
transmitted and received TCP frames. If you specify the No value, the checksum is calculated by
the appropriate software.
Note: The mbuf structure that describes a transmitted frame contains a flag that indicates
whether the adapter should calculate the TCP checksum for the frame.
Failover Mode (failover)
This attribute indicates the requested failover configuration for the port. Possible values are
primary, backup, and disable. The primary value indicates that the port is to act as the primary
port in a failover configuration for a 2-Port Gigabit adapter. The backup value indicates that the
port is to act as the backup port in a failover configuration for a 2-Port Gigabit adapter. The
disable value indicates that the port is not a member of a failover configuration. The default
value for failover is disable.
The following configuration parameters for the 2-Port 10/100/1000 Base-TX PCI-X Adapter Device Driver
(14108902) are not accessible using the SMIT interface, and can only be modified using the chdev
command line interface:
rx_hog
When this number of receive buffer descriptors is processed by the device driver (or all packets
were received), the device driver exits the rx_handler() routine and continues processing other
adapter events (such as transmit completions and adapter status changes). Valid values range
from 1 - 1000000. The default value is 1000.
slih_hog
The number of adapter events (such as receive completions, transmit completions, and adapter
status changes) processed by the device driver per interrupt. Valid values range from 1 - 1000000.
The default value is 10.
copy_bytes
When the number of data bytes in a transmit mbuf exceeds this value, the device driver maps the
mbuf data area into DMA memory and updates the transmit descriptor so that it points to this
DMA memory area. When the number of data bytes in a transmit mbuf does not exceed this
value, the data is copied from the mbuf into a preallocated transmit buffer that is already mapped
into DMA memory. The device driver also attempts to coalesce transmit data in an mbuf chain into
a single preallocated transmit buffer until the total transmit data size exceeds that of the
preallocated buffer (2048 bytes). Valid values range from 64 - 2048. The default value is 2048.
delay_open
Setting this attribute to Yes causes the adapter device driver to delay its open completion until the
Ethernet link status is determined to be either up or down. This prevents applications from sending
data before the Ethernet link is established. Commands such as ifconfig, however, might take
longer to complete, especially when an active Ethernet link is not present. Valid values are Yes
and No. The default value is No.
failback
This attribute is used in conjunction with Failover Mode. If Failover Mode is enabled, setting this
attribute to Yes causes the adapter to automatically fail back to the primary port if the primary port
recovers. Valid values are Yes and No. The default value is Yes.
failback_delay
This attribute is used in conjunction with the failback attribute. If failback is enabled, the
failback_delay attribute specifies the number of seconds that the adapter waits before failing back
to the primary port, after the primary port recovers. This delay is useful for ensuring that the
primary port has fully recovered and for allowing switch protocols (such as Spanning Tree
Protocol) to complete. Valid values range from 0 - 300 seconds. Setting the failback_delay
attribute to 0 seconds disables the delay timer, causing failback to occur immediately. The default
value is 15 seconds.
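The interaction of the failback and failback_delay attributes can be summarized as a single decision. The following C sketch is a model of the behavior described above, not adapter or driver code; the structure and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two attributes. */
struct failback_cfg {
    bool     failback;        /* failback attribute: Yes = true (default) */
    uint32_t failback_delay;  /* seconds, 0 - 300 (default 15) */
};

/* Returns true when the modeled adapter would switch back to the primary
 * port. secs_since_recovery counts seconds since the primary port recovered. */
bool should_fail_back(const struct failback_cfg *cfg,
                      bool primary_recovered,
                      uint32_t secs_since_recovery)
{
    if (!cfg->failback || !primary_recovered)
        return false;
    /* A delay of 0 makes this true as soon as the primary port recovers. */
    return secs_since_recovery >= cfg->failback_delay;
}
```

With the default settings, the model does not fail back 10 seconds after recovery but does fail back at 15 seconds.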
Note: If this option is enabled, the adapter cannot establish a link with newer Gigabit Ethernet-TX
hardware. Enable this option only if you cannot establish a link using autonegotiation, but
can force a link at a slower speed (for example, 100 full-duplex).
Note: The mbuf structure that describes a transmitted frame contains a flag that indicates
whether the adapter should calculate the checksum for the frame.
Note: The mbuf describing a frame to be transmitted contains a flag that indicates whether the
adapter should calculate the checksum for the frame.
Media Speed
The Gigabit Ethernet-SX Adapter Device Driver (e414a816) supports a user-configurable media
speed for 1000 Mbps full-duplex and autonegotiation. The media speed attribute indicates the
speed at which the adapter attempts to operate. Select autonegotiate when the adapter should
use autonegotiation across the network to determine the speed. When the network does not
support autonegotiation, select the specific speed.
Note: The default values for the Gigabit Ethernet-SX Adapter Device Driver (e414a816) configuration
parameters were chosen for optimal performance, and should not be changed unless IBM
recommends a change.
The following configuration parameters for the Gigabit Ethernet-SX Adapter Device Driver (e414a816) are
not accessible using the SMIT interface, and can only be modified using the chdev command line
interface:
stat_ticks
The number of microseconds that the adapter waits before updating the adapter statistics (through
a DMA write) and generating an interrupt to the host. Valid values range from 1000-1000000. The
default value is 1000000.
receive_ticks
The number of microseconds that the adapter waits before updating the receive return ring
producer index (through a DMA write) and generating an interrupt to the host. Valid values range
from 0-1000. The default value is 50.
receive_bds
The number of receive buffers that the adapter transfers to host memory before updating the
receive return ring producer index (through a DMA write) and generating an interrupt to the host.
Valid values range from 0-128. The default value is 6.
tx_done_ticks
The number of microseconds that the adapter waits before updating the send consumer index
(through a DMA write) and generating an interrupt to the host. Valid values range from 0-1000000.
The default value is 1000000.
tx_done_count
The number of transmit buffers that the adapter transfers from host memory before updating the
send consumer index (through a DMA write) and generating an interrupt to the host. Valid values
range from 0-128. The default value is 64.
receive_proc
When this number of receive buffer descriptors is processed by the device driver (or all packets
are received), the device driver returns this number of receive buffer descriptors to the adapter
through an MMIO write. Valid values range from 1-64. The default value is 16.
rxdesc_count
When this number of receive buffer descriptors is processed by the device driver (or all packets
were received), the device driver exits the rx_handler() routine and continues processing other
adapter events, such as transmit completions and adapter status changes. Valid values range from
1-1000000. The default value is 1000.
slih_hog
The number of adapter events (such as receive completions, transmit completions and adapter
status changes) processed by the device driver per interrupt. Valid values range from 1-1000000.
The default value is 10.
copy_bytes
When the number of data bytes in a transmit mbuf exceeds this value, the device driver maps the
mbuf data area into DMA memory and updates the transmit descriptor such that it points to this
DMA memory area. When the number of data bytes in a transmit mbuf does not exceed this value,
the data is copied from the mbuf into a preallocated transmit buffer which is already mapped into
DMA memory. The device driver also attempts to coalesce transmit data in an mbuf chain into a
single preallocated transmit buffer, until the total transmit data size exceeds that of the
preallocated buffer (2048 bytes). Valid values range from 64-2048. The default value is 2048.
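Because these parameters are set from the command line, it can be useful to check a value against its documented range before applying it. The following C sketch collects the ranges and defaults listed above for the e414a816 device driver into a table. The table and the attr_value_ok function are illustrative only; they are not a driver interface or an ODM interface.

```c
#include <stddef.h>
#include <string.h>

/* Valid ranges and defaults as listed above for the e414a816 driver.
 * The table and lookup function are illustrative, not a system API. */
struct attr_range {
    const char *name;
    long min, max, dflt;
};

static const struct attr_range attr_table[] = {
    { "stat_ticks",    1000, 1000000, 1000000 },
    { "receive_ticks",    0,    1000,      50 },
    { "receive_bds",      0,     128,       6 },
    { "tx_done_ticks",    0, 1000000, 1000000 },
    { "tx_done_count",    0,     128,      64 },
    { "receive_proc",     1,      64,      16 },
    { "rxdesc_count",     1, 1000000,    1000 },
    { "slih_hog",         1, 1000000,      10 },
    { "copy_bytes",      64,    2048,    2048 },
};

/* Returns 1 if value is within the documented range for name,
 * 0 if it is out of range, and -1 if the name is not in the table. */
int attr_value_ok(const char *name, long value)
{
    size_t i;
    for (i = 0; i < sizeof attr_table / sizeof attr_table[0]; i++) {
        if (strcmp(attr_table[i].name, name) == 0)
            return value >= attr_table[i].min && value <= attr_table[i].max;
    }
    return -1;
}
```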
Note: The mbuf describing a frame to be transmitted contains a flag that indicates whether the
adapter should calculate the checksum for the frame.
Media Speed
The Gigabit Ethernet-SX Adapter Device Driver (14101403) supports a user-configurable media
speed for 1000 Mbps full-duplex and autonegotiation. The media speed attribute indicates the
speed at which the adapter attempts to operate. Select autonegotiate when the adapter should
use autonegotiation across the network to determine the speed. When the network does not
support autonegotiation, select the specific speed.
Note: The default values for the Gigabit Ethernet-SX Adapter Device Driver (14101403) configuration
parameters were chosen for optimal performance, and should not be changed unless IBM
recommends a change.
The following configuration parameters for the Gigabit Ethernet-SX Adapter Device Driver (14101403) are
not accessible using the SMIT interface, and can only be modified using the chdev command line
interface:
receive_ticks
The number of microseconds that the adapter waits before updating the receive return ring
producer index (through a DMA write) and generating an interrupt to the host. Valid values range
from 0 - 1000. The default value is 50.
receive_bds
The number of receive buffers that the adapter transfers to host memory before updating the
receive return ring producer index (through a DMA write) and generating an interrupt to the host.
Valid values range from 0 - 128. The default value is 6.
tx_done_ticks
The number of microseconds that the adapter waits before updating the send consumer index
(through a DMA write) and generating an interrupt to the host. Valid values range from 0 -
1000000. The default value is 1000000.
tx_done_count
The number of transmit buffers that the adapter transfers from host memory before updating the
send consumer index (through a DMA write) and generating an interrupt to the host. Valid values
range from 0 - 128. The default value is 64.
receive_proc
When this number of receive buffer descriptors is processed by the device driver (or all packets
are received), the device driver returns this number of receive buffer descriptors to the adapter
through an MMIO write. Valid values range from 1 - 64. The default value is 16.
rx_hog
When this number of receive buffer descriptors is processed by the device driver (or all packets
were received), the device driver exits the rx_handler() routine and continues processing other
adapter events (such as transmit completions and adapter status changes). Valid values range
from 1 - 1000000. The default value is 1000.
slih_hog
The number of adapter events (such as receive completions, transmit completions and adapter
status changes) processed by the device driver per interrupt. Valid values range from 1 - 1000000.
The default value is 10.
copy_bytes
When the number of data bytes in a transmit mbuf exceeds this value, the device driver maps the
mbuf data area into DMA memory and updates the transmit descriptor so that it points to this
DMA memory area. When the number of data bytes in a transmit mbuf does not exceed this
value, the data is copied from the mbuf into a preallocated transmit buffer that is already mapped
into DMA memory. The device driver also attempts to coalesce transmit data in an mbuf chain into
a single preallocated transmit buffer until the total transmit data size exceeds that of the
preallocated buffer (2048 bytes). Valid values range from 64 - 2048. The default value is 2048.
delay_open
Setting this attribute to Yes causes the adapter device driver to delay its open completion until the
Ethernet link status is determined to be either up or down. This prevents applications from sending
data before the Ethernet link is established. Commands such as ifconfig, however, might take
longer to complete, especially when an active Ethernet link is not present. Valid values are Yes
and No. The default value is No.
Note: 1000 Mbps half-duplex is not a valid value. The IEEE 802.3z specification dictates that the
gigabit speeds for half-duplex must be autonegotiated for copper (TX)-based adapters.
Select autonegotiation if this speed is required.
Transmit Jumbo Frames
Setting this attribute to the Yes value indicates that frames up to 9018 bytes in length can be
transmitted on this adapter. If you specify the No value, the maximum size of frames transmitted is
1518 bytes. Frames up to 9018 bytes in length can always be received on this adapter.
Transmit TCP Resegmentation Offload
The device driver supports the capability of the adapter to perform resegmentation of transmitted
TCP segments in hardware. This capability enables the host to use TCP segments that are larger
than the actual MTU size of the Ethernet link, which can increase system performance. Possible
values are Yes and No.
Enable Hardware Checksum Offload
Setting this attribute to the Yes value indicates that the adapter calculates the checksum for
transmitted and received TCP frames. If you specify the No value, the checksum is calculated by
the appropriate software.
Note: The mbuf structure that describes a transmitted frame contains a flag that indicates
whether the adapter should calculate the checksum for the frame.
Gigabit Backward Compatibility
Older gigabit TX equipment might not be able to communicate with this adapter. If the adapter is
unable to communicate with your older gigabit equipment, enabling this option forces the adapter
to implement the IEEE 802.3z standard incorrectly. Enable this option only in that case.
Note: Enabling this option forces the adapter to implement the IEEE 802.3z standard incorrectly. If this
option is enabled, the adapter cannot communicate with newer equipment. Enable this
option only if you cannot obtain a link using autonegotiation, but can force a link at a slower
speed (for example, 100 full-duplex).
Failover Mode (failover)
This attribute indicates the requested failover configuration for the port. Possible values are
primary, backup, and disable. The primary value indicates the port is to act as the primary port in
The device driver issues commands to start the initialization of the device. The state of the device is now
OPEN_PENDING. The device driver invokes the open process for the device. The open process involves
a sequence of events that are necessary to initialize and configure the device. The device driver performs
these events in order, making sure that each step has finished executing on the adapter
before the next step begins. Any error during this sequence of events causes the open to fail. The
device driver requires about 2 seconds to open the device. When the whole sequence of events is done,
the device driver verifies the open status and then returns to the caller of the open with a return code that
indicates open success or open failure.
After the device has been successfully configured and connected to the network, the device driver sets the
device state to OPENED and turns on the NDD_RUNNING flag in the NDD flags field. If the open is
unsuccessful, both the NDD_UP and NDD_RUNNING flags in the NDD flags field are off and a
nonzero error code is returned to the caller.
The device will not be detached from the network until the device’s transmit queue drains. That is, the
close entry point will not return until all packets have been transmitted or timed out. If the device is
inoperable at the time of the close, the device’s transmit queue does not have to drain.
At the beginning of the close entry point, the device state is set to CLOSE_PENDING. The
NDD_RUNNING flag in the ndd_flags field is turned off. After all outstanding transmits have completed, the
device driver starts a sequence of operations to deactivate the adapter and to free resources. Before
the close entry point returns to the caller, the device state is set to CLOSED.
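The open and close processing described above can be viewed as a small state machine. The following C sketch models only the state and flag transitions. The state names follow the text, but the enumeration, the structure, and the function names are hypothetical stand-ins, and the sketch assumes that a failed open leaves the device in the CLOSED state, which the text does not state explicitly.

```c
#include <stdbool.h>

/* State names follow the text; everything else here is a hypothetical
 * stand-in rather than the driver's own definitions. */
enum dev_state { CLOSED, OPEN_PENDING, OPENED, CLOSE_PENDING };

struct dev_model {
    enum dev_state state;
    bool ndd_running;        /* models the NDD_RUNNING flag */
};

/* Models the open entry point; init_ok stands in for the outcome of the
 * adapter initialization sequence. Returns 0 on success, nonzero on failure. */
int model_open(struct dev_model *d, bool init_ok)
{
    d->state = OPEN_PENDING;
    if (!init_ok) {
        d->state = CLOSED;       /* assumption: failed open returns to CLOSED */
        d->ndd_running = false;
        return -1;               /* placeholder for the real error code */
    }
    d->state = OPENED;
    d->ndd_running = true;
    return 0;
}

/* Models the close entry point after the transmit queue has drained. */
void model_close(struct dev_model *d)
{
    d->state = CLOSE_PENDING;
    d->ndd_running = false;
    /* ... deactivate the adapter and free resources ... */
    d->state = CLOSED;
}
```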
Data Transmission
The output entry point transmits data using the specified network device.
The data to be transmitted is passed into the device driver by way of mbuf structures. The first mbuf
structure in the chain must be of M_PKTHDR format. Multiple mbuf structures can be used to hold the
frame. Link the mbuf structures using the m_next field of the mbuf structure.
Multiple packets can be transmitted in one request by chaining the mbufs through the m_nextpkt field of the
mbuf structure. The m_pkthdr.len field must be set to the total length of the packet. The device driver
does not support mbufs from user memory (which have the M_EXT flag set).
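The two kinds of chaining can be easy to confuse: m_next links the mbufs of one packet, and m_nextpkt links separate packets. The following C sketch walks both links and checks the m_pkthdr.len requirement. It uses a simplified stand-in structure (ex_mbuf) that models only the fields named in the text; real kernel code would use the mbuf structure from the system headers, and the function name is hypothetical.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel mbuf; only the fields named in the
 * text are modeled. */
struct ex_mbuf {
    struct ex_mbuf *m_next;       /* next mbuf of the same packet */
    struct ex_mbuf *m_nextpkt;    /* first mbuf of the next packet */
    int             m_len;        /* data bytes in this mbuf */
    struct { int len; } m_pkthdr; /* total packet length (first mbuf only) */
};

/* Returns 1 if every packet in the chain has m_pkthdr.len equal to the
 * sum of m_len over its m_next chain, otherwise 0. */
int chain_lengths_ok(const struct ex_mbuf *pkt)
{
    for (; pkt != NULL; pkt = pkt->m_nextpkt) {
        int total = 0;
        const struct ex_mbuf *m;
        for (m = pkt; m != NULL; m = m->m_next)
            total += m->m_len;
        if (total != pkt->m_pkthdr.len)
            return 0;
    }
    return 1;
}
```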
On successful transmit requests, the device driver is responsible for freeing all the mbufs associated with
the transmit request. If the device driver returns an error, the caller is responsible for the mbufs. If any of
the chained packets can be transmitted, the transmit is considered successful and the device driver is
responsible for all of the mbufs in the chain.
For packets that are shorter than the minimum Ethernet frame size (60 bytes, excluding the frame check
sequence), the device driver pads them by adjusting the transmit length to the adapter so they can be
transmitted as valid Ethernet packets.
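The padding rule reduces to raising the transmit length to the minimum. In the following C sketch, the macro and function names are hypothetical; only the 60-byte figure comes from the text.

```c
#include <stddef.h>

#define EX_ENT_MIN_LEN 60  /* minimum length cited in the text */

/* Hypothetical helper: length handed to the adapter for a frame of len bytes. */
size_t tx_length(size_t len)
{
    return len < EX_ENT_MIN_LEN ? EX_ENT_MIN_LEN : len;
}
```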
Users are not notified by the device driver about the status of the transmission. Various statistics about
data transmission are kept by the driver in the ndd structure. These statistics are part of the data returned
by the NDD_GET_STATS control operation.
Data Reception
When the Ethernet device drivers receive a valid packet from the network device, the device drivers call
the nd_receive function that is specified in the ndd_t structure of the network device. The nd_receive
function is part of a CDLI network demultiplexer. The packet is passed to the nd_receive function in the
form of an mbuf.
The Ethernet device drivers can pass multiple packets to the nd_receive function by chaining the packets
together using the m_nextpkt field of the mbuf structure. The m_pkthdr.len field must be set to the total
length of the packet. If the destination address in the packet is a broadcast address, the M_BCAST flag in the
m_flags field should be set. If the destination address in the packet is a multicast address, the M_MCAST flag
in the m_flags field should be set.
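The following C sketch shows one way to derive these flags from a received frame. It examines the destination address (the first six bytes of an Ethernet frame), which is the address that identifies broadcast and multicast frames. The flag values and the function name are placeholders; real code takes M_BCAST and M_MCAST from the system mbuf header, where the values might differ.

```c
#include <stdint.h>

/* Placeholder flag values for illustration only. */
#define EX_M_BCAST 0x1
#define EX_M_MCAST 0x2

/* Returns the flags for a received frame. frame must point to at least
 * the six destination-address bytes at the start of the frame. */
int rx_addr_flags(const uint8_t *frame)
{
    int i, all_ones = 1;

    for (i = 0; i < 6; i++)
        if (frame[i] != 0xff)
            all_ones = 0;
    if (all_ones)
        return EX_M_BCAST;      /* ff:ff:ff:ff:ff:ff */
    if (frame[0] & 0x01)        /* group bit of the first octet */
        return EX_M_MCAST;
    return 0;                   /* individual (unicast) address */
}
```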
The device driver initially configures the device to discard all invalid frames. A frame is considered to
be invalid for the following reasons:
• The packet is too short.
• The packet is too long.
• The packet contains a CRC error.
• The packet contains an alignment error.
If the asynchronous status for receiving invalid frames has been issued to the device driver, the device
driver configures the device to receive bad packets as well as good packets. Whenever a bad packet is
received by the driver, an asynchronous status block NDD_BAD_PKTS is created and delivered to the
Various statistics about data reception on the device are kept by the driver in the ndd structure. These
statistics are part of the data returned by the NDD_GET_STATS and NDD_GET_ALL_STATS control
operations.
There is no specified entry point for this function. The device informs the device driver of a received
packet using an interrupt. Upon determining that the interrupt was the result of a packet reception, the
device driver’s interrupt handler invokes the rx_handler completion routine to perform the tasks mentioned
above.
Asynchronous Status
When a status event occurs on the device, the Ethernet device drivers build the appropriate status block
and call the nd_status function that is specified in the ndd_t structure of the network device. The
nd_status function is part of a CDLI network demultiplexer.
The following status blocks are defined for the Ethernet device drivers.
Note: The following device drivers support the Device Connected status block:
• Gigabit Ethernet-SX PCI Adapter Device Driver (14100401)
• Gigabit Ethernet-SX PCI-X Adapter Device Driver (14106802)
• 10/100/1000 Base-T Ethernet PCI-X Adapter Device Driver (14106902)
• 2-Port Gigabit Ethernet-SX PCI-X Adapter (14108802)
• 2-Port 10/100/1000 Base-TX PCI-X Adapter (14108902)
• 4-Port 10/100/1000 Base-TX PCI-X Adapter (14101103)
• 10 Gigabit Ethernet-SR PCI-X 2.0 DDR Adapter (1410eb02)
• 10 Gigabit Ethernet-LR PCI-X 2.0 DDR Adapter (1410ec02)
The PCI Ethernet Device Driver (22100020) supports the Bad Packets status block.
Bad Packets
When a bad packet has been received by a device driver (and a user has requested bad
packets), the following status block is returned by the device driver.
code Set to NDD_BAD_PKTS.
option[0]
Specifies the error status of the packet. These error numbers are defined in
<sys/cdli_entuser.h>.
option[1]
Pointer to the mbuf containing the bad packet.
option[]
The remainder of the status block can be used to return additional status information by
the device driver.
Note: The user does not own the mbuf containing the bad packet. The user must copy the mbuf
(and the status block information if necessary). The device driver frees the mbuf upon
return from the nd_status function.
Device Connected
When the device is successfully connected to the network, the following status block is returned by
the device driver.
The arg and length parameters specify the address and length in bytes of the area where the statistics are
to be written. The length specified must be the exact length of the general and media-specific statistics.
Note: The ndd_speclen field in the ndd_t structure plus the length of the ndd_genstats_t structure is
the required length. The device-specific statistics might change with each new release of the
operating system, but the general and media-specific statistics are not expected to change.
The user should pass in the ent_ndd_stats_t structure as defined in sys/cdli_entuser.h. The driver fails
a call with a buffer smaller than the structure.
The statistics that are returned contain statistics obtained from the device. If the device is inoperable, the
statistics that are returned do not contain the current device statistics. The copy of the ndd_flags field can
be checked to determine the state of the device.
The arg parameter specifies the address of the Ethernet_all_mib structure. This structure is defined in the
/usr/include/sys/Ethernet_mibs.h file.
If the device supports the RFC 1229 receive address object, the corresponding variable is set to the
number of receive addresses currently active.
The arg parameter specifies the address of the Ethernet_all_mib structure. This structure is defined in the
/usr/include/sys/Ethernet_mibs.h file.
The device driver verifies that the address is a valid multicast address. If the address is not a valid
multicast address, the operation fails with an EINVAL error. If the address is valid, the driver adds it to its
multicast table and enables the multicast filter on the adapter. The driver keeps a reference count for each
individual address. Whenever a duplicate address is registered, the driver simply increments the reference
count of that address in its multicast table; no update of the adapter’s filter is needed. There is a hardware
limitation on the number of multicast addresses in the filter.
The device driver verifies that the address is a valid multicast address. If the address is not a valid
multicast address, the operation fails with an EINVAL error. The device driver makes sure that the
multicast address can be found in its multicast table. Whenever a match is found, the driver decrements
the reference count of that individual address in its multicast table. If the reference count becomes 0, the
driver deletes the address from the table and updates the multicast filter on the adapter.
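The reference-counted multicast table described above can be sketched as follows. This is a simplified model rather than any driver's actual implementation: the table size MCAST_MAX, the structure and function names, the ENOSPC return for a full table, and the EINVAL return for an address that is not in the table are all assumptions made for illustration. The filter_updates counter stands in for reprogramming the adapter's filter.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define MCAST_MAX 4   /* hypothetical hardware filter limit */

struct mcast_entry { uint8_t addr[6]; int refcnt; };
struct mcast_table {
    struct mcast_entry e[MCAST_MAX];
    int filter_updates;   /* counts simulated adapter filter updates */
};

/* A multicast address has the group bit set in its first octet. */
static int is_multicast(const uint8_t a[6]) { return a[0] & 0x01; }

static struct mcast_entry *mcast_find(struct mcast_table *t,
                                      const uint8_t a[6])
{
    int i;
    for (i = 0; i < MCAST_MAX; i++)
        if (t->e[i].refcnt > 0 && memcmp(t->e[i].addr, a, 6) == 0)
            return &t->e[i];
    return NULL;
}

static int mcast_add(struct mcast_table *t, const uint8_t a[6])
{
    struct mcast_entry *m;
    int i;

    if (!is_multicast(a))
        return EINVAL;
    if ((m = mcast_find(t, a)) != NULL) {
        m->refcnt++;              /* duplicate: no filter update needed */
        return 0;
    }
    for (i = 0; i < MCAST_MAX; i++) {
        if (t->e[i].refcnt == 0) {
            memcpy(t->e[i].addr, a, 6);
            t->e[i].refcnt = 1;
            t->filter_updates++;  /* new address: update the filter */
            return 0;
        }
    }
    return ENOSPC;                /* assumed error when the table is full */
}

static int mcast_del(struct mcast_table *t, const uint8_t a[6])
{
    struct mcast_entry *m;

    if (!is_multicast(a))
        return EINVAL;
    if ((m = mcast_find(t, a)) == NULL)
        return EINVAL;            /* assumed error for an unknown address */
    assert(m->refcnt > 0);
    if (--m->refcnt == 0)
        t->filter_updates++;      /* last reference gone: update filter */
    return 0;
}
```

Registering the same address twice and deleting it once leaves the address in the filter; only the second delete removes it.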
The statistics that are returned contain statistics obtained from the device. If the device is inoperable, the
statistics that are returned do not contain the current device statistics. The copy of the ndd_flags field can
be checked to determine the state of the device.
When the device driver is running in promiscuous mode, all network traffic is passed to the network
demultiplexer. When the Ethernet device driver receives a valid packet from the network device, the
Ethernet device driver calls the nd_receive function that is specified in the ndd_t structure of the network
device. The NDD_PROMISC flag in the ndd_flags field is set. Promiscuous mode applies to valid packets
only. See the NDD_ADD_STATUS command for information about how to request support
for bad packets.
The device driver maintains a reference count on this operation. The device driver increments the
reference count for each operation. When this reference count is equal to one, the device driver issues
commands to enable the promiscuous mode. If the reference count is greater than one, the device driver
does not issue any commands to enable the promiscuous mode.
The device driver maintains a reference count on this operation. The device driver decrements the
reference count for each operation. When the reference count is not equal to zero, the device driver does
not issue commands to disable the promiscuous mode. After the reference count for this operation is equal
to zero, the device driver issues commands to disable the promiscuous mode.
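The enable and disable reference counting described above can be sketched as follows. The structure and function names are hypothetical, and the hw_commands counter merely stands in for commands issued to the adapter.

```c
#include <assert.h>

struct promisc_state {
    int refcnt;       /* outstanding enable requests */
    int hw_enabled;   /* 1 while the adapter is in promiscuous mode */
    int hw_commands;  /* counts simulated commands sent to the adapter */
};

/* Enable request: only the first one touches the hardware. */
static void promisc_on(struct promisc_state *p)
{
    if (++p->refcnt == 1) {
        p->hw_enabled = 1;
        p->hw_commands++;
    }
}

/* Disable request: only the last one touches the hardware. */
static void promisc_off(struct promisc_state *p)
{
    assert(p->refcnt > 0);
    if (--p->refcnt == 0) {
        p->hw_enabled = 0;
        p->hw_commands++;
    }
}
```

With two users, the adapter stays in promiscuous mode until both have issued the disable operation.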
The NDD_DISABLE_ADAPTER operation is used by Etherchannel to disable the adapter so that it cannot
transmit or receive data. During this operation the NDD_RUNNING and NDD_LIMBO flags are cleared
and the adapter is reset. The arg and len parameters are not used.
The NDD_ENABLE_ADAPTER operation is used by Etherchannel to return the adapter to a running state
so it can transmit and receive data. During this operation the adapter is started and the NDD_RUNNING
flag is set. The arg and len parameters are not used.
The NDD_SET_LINK_STATUS operation is used by Etherchannel to pass the driver a function pointer and
argument that will subsequently be called by the driver whenever the link status changes. The arg
parameter contains a pointer to a ndd_sls_t structure, and the len parameter contains the length of the
ndd_sls_t structure.
The NDD_SET_MAC_ADDR operation is used by Etherchannel to set the adapter’s MAC address at
runtime. The MAC address set by this ioctl is valid until another NDD_SET_MAC_ADDR call is made with
a new address or until the adapter is closed. If the adapter is closed, the previously configured MAC
address is used again. The MAC address configured with the ioctl supersedes any alternate address that
might have been configured.
The arg argument is char [6], representing the MAC address that is configured on the adapter. The len
argument is 6.
Trace
For LAN device drivers, trace points enable error monitoring as well as tracking packets as they move
through the driver. The drivers issue trace points for some or all of the following conditions:
v Beginning and ending of main functions in the main path
v Error conditions
v Beginning and ending of each function that is tracking buffers outside of the main path
v Debugging purposes (These trace points are only enabled when the driver is compiled with -DDEBUG
turned on, and therefore the driver can contain as many of these trace points as necessary.)
Following is a list of trace hooks (and location of definition file) for the existing Ethernet device drivers.
Transmit -2A4
Receive -2A5
Other -2A6
Transmit -2E6
Receive -2E7
Other -2E8
Transmit -470
Receive -471
Other -472
Transmit -2EA
Receive -2EB
Other -2EC
Transmit -473
Receive -474
Other -475
Transmit -598
Receive -599
Other -59A
The device driver also has the following trace points to support the netpmon program:
Transmit -5B2
Receive -5B3
Other -5B4
Transmit -4a1
Receive -4a2
Other -4a3
The device driver also has the following trace points to support the netpmon program:
WQUE
An output packet has been queued for transmission.
WEND
The output of a packet is complete.
RDAT
An input packet has been received by the device driver.
Transmit -4C5
Receive -4C6
Other -4C7
Error Logging
For error logging information, see “Error Logging” on page 320.
Related Information
“Common Communications Status and Exception Codes” on page 113.
Error Logging Overview in AIX 5L Version 5.3 General Programming Concepts: Writing and Debugging
Programs.
Status Blocks for the Serial Optical Link Device Driver, Sense Data for the Serial Optical Link Device
Driver in AIX 5L Version 5.3 Technical Reference: Kernel and Subsystems Volume 2.
Subroutine References
The readx subroutine in AIX 5L Version 5.3 Technical Reference: Base Operating System and Extensions
Volume 2.
Commands References
The entstat Command in AIX 5L Version 5.3 Commands Reference, Volume 1.
The lecstat Command, mpcstat Command in AIX 5L Version 5.3 Commands Reference, Volume 3.
Technical References
The ddwrite entry point, ddselect entry point in AIX 5L Version 5.3 Technical Reference: Kernel and
Subsystems Volume 2.
The mpconfig Multiprotocol (MPQP) Device Handler Entry Point, mpwrite Multiprotocol (MPQP)
Device Handler Entry Point, mpread Multiprotocol (MPQP) Device Handler Entry Point, mpmpx
Multiprotocol (MPQP) Device Handler Entry Point, mpopen Multiprotocol (MPQP) Device Handler
Entry Point, mpselect Multiprotocol (MPQP) Device Handler Entry Point, mpclose Multiprotocol
(MPQP) Device Handler Entry Point, mpioctl Multiprotocol (MPQP) Device Handler Entry Point in
AIX 5L Version 5.3 Technical Reference: Kernel and Subsystems Volume 2.
The program interface to the input device drivers is described in the inputdd.h header file. This header file
is available as part of the bos.adt.graphics fileset.
The close subroutine is used to remove a channel created by the open subroutine call.
ioctl Subroutines
The ioctl operations provide run-time services. The special files support the following ioctl operations:
v Keyboard
v Mouse
v Tablet
v GIO (Graphics I/O) Adapter
v Dials
v LPFK
Keyboard
IOCINFO Returns the devinfo structure.
KSQUERYID Queries the keyboard device identifier.
KSQUERYSV Queries the keyboard service vector.
KSREGRING Registers the input ring.
KSRFLUSH Flushes the input ring.
KSLED Sets and resets the keyboard LEDs.
KSCFGCLICK Configures the clicker.
KSVOLUME Sets the alarm volume.
KSALARM Sounds the alarm.
KSTRATE Sets the repeat rate.
KSTDELAY Sets the delay before repeat.
KSKAP Enables and disables the keep-alive poll.
KSKAPACK Acknowledges the keep-alive poll.
KSDIAGMODE Enables and disables the diagnostics mode.
Mouse
IOCINFO Returns the devinfo structure.
MQUERYID Queries the mouse device identifier.
MREGRING Registers the input ring.
MRFLUSH Flushes the input ring.
MTHRESHOLD Sets the mouse reporting threshold.
MRESOLUTION Sets the mouse resolution.
MSCALE Sets the mouse scale.
MSAMPLERATE Sets the mouse sample rate.
Tablet
IOCINFO Returns the devinfo structure.
TABQUERYID Queries the tablet device identifier.
TABREGRING Registers the input ring.
TABFLUSH Flushes the input ring.
TABCONVERSION Sets the tablet conversion mode.
TABRESOLUTION Sets the tablet resolution.
TABORIGIN Sets the tablet origin.
TABSAMPLERATE Sets the tablet sample rate.
TABDEADZONE Sets the tablet dead zones.
Dials
IOCINFO Returns the devinfo structure.
DIALREGRING Registers the input ring.
DIALRFLUSH Flushes the input ring.
DIALSETGRAND Sets the dial granularity.
LPFK
IOCINFO Returns the devinfo structure.
LPFKREGRING Registers the input ring.
LPFKRFLUSH Flushes the input ring.
Input Ring
Data is obtained from graphic input devices by way of a circular First-In First-Out (FIFO) queue or input
ring, rather than with a read subroutine call. The memory address of the input ring is registered with an
ioctl (or fp_ioctl) subroutine call. The program that registers the input ring is the owner of the ring and is
responsible for allocating, initializing, and freeing the storage associated with the ring. The same input ring
can be shared by multiple devices.
The input ring consists of the input ring header followed by the reporting area. The input ring header
contains the reporting area size, the head pointer, the tail pointer, the overflow flag, and the notification
type flag. Before registering an input ring, the ring owner must ensure that the head and tail pointers
contain the starting address of the reporting area. The overflow flag must also be cleared and the size field
set equal to the number of bytes in the reporting area. After the input ring has been registered, the owner
can modify only the head pointer and the notification type flag.
Data stored on the input ring is structured as one or more event reports. Event reports are placed at the
tail of the ring by the graphic input device drivers. Previously queued event reports are taken from the
head of the input ring by the owner of the ring. The input ring is empty when the head and tail locations
are the same. An overflow condition exists if placement of an event on the input ring would overwrite data
that has not been processed. Following an overflow, new event reports are not placed on the input ring
until the input ring is flushed via an ioctl subroutine or service vector call.
The owner of the input ring is notified when an event is available for processing via a SIGMSG signal or
via callback if the channel was created by an fp_open subroutine call. The notification type flag in the
input ring header specifies whether the owner should be notified each time an event is placed on the ring
or only when an event is placed on an empty ring.
Note: Event report structures are placed on the input-ring without spacing. Data wraps from the end to the
beginning of the reporting area. A report can be split on any byte boundary into two non-contiguous
sections.
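The head and tail handling described above, including a report that wraps around the end of the reporting area, can be sketched as follows. The structure layout and field names are hypothetical (the real header is defined in inputdd.h), the reporting area is kept tiny to force wrapping, and ring_write only simulates what a device driver would do. Because an empty ring is defined as head equal to tail, this sketch keeps one byte free so that a full ring is never mistaken for an empty one; that detail is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical ring layout for illustration only. */
struct ring {
    size_t size;             /* bytes in the reporting area */
    unsigned char *head;     /* next byte the owner will read */
    unsigned char *tail;     /* next byte the driver will write */
    int overflow;            /* set when an event could not be queued */
    unsigned char area[16];  /* reporting area */
};

/* Owner initialization required before the ring is registered. */
static void ring_init(struct ring *r)
{
    r->size = sizeof r->area;
    r->head = r->tail = r->area;
    r->overflow = 0;
}

/* Bytes queued between head and tail, allowing for wrap-around. */
static size_t ring_used(const struct ring *r)
{
    assert(r->size > 0);
    return (r->tail >= r->head) ? (size_t)(r->tail - r->head)
                                : r->size - (size_t)(r->head - r->tail);
}

/* Driver side (simulated): append a report at the tail, or set the
 * overflow flag if it would overwrite unprocessed data. */
static int ring_write(struct ring *r, const void *in, size_t len)
{
    size_t first = (size_t)(r->area + r->size - r->tail);

    if (r->overflow || len > r->size - 1 - ring_used(r)) {
        r->overflow = 1;
        return -1;
    }
    if (first > len)
        first = len;
    memcpy(r->tail, in, first);
    memcpy(r->area, (const unsigned char *)in + first, len - first);
    r->tail = r->area + ((size_t)(r->tail - r->area) + len) % r->size;
    return 0;
}

/* Owner side: copy len bytes from the head, rejoining a report that is
 * split across the end of the reporting area, then advance the head. */
static int ring_read(struct ring *r, void *out, size_t len)
{
    size_t first = (size_t)(r->area + r->size - r->head);

    if (len > ring_used(r))
        return -1;
    if (first > len)
        first = len;
    memcpy(out, r->head, first);
    memcpy((unsigned char *)out + first, r->area, len - first);
    r->head = r->area + ((size_t)(r->head - r->area) + len) % r->size;
    return 0;
}
```

The owner only ever modifies the head pointer here, which matches the rule that the tail, size, and overflow fields belong to the driver after registration.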
Keyboard
ID Specifies the report identifier.
Length Specifies the report length.
Time stamp Specifies the system time (in milliseconds).
Key position code Specifies the key position code.
Key scan code Specifies the key scan code.
Status flags Specifies the status flags.
LPFK
ID Specifies the report identifier.
Length Specifies the report length.
Time stamp Specifies the system time (in milliseconds).
Number of key pressed Specifies the number of the key pressed.
Dials
ID Specifies the report identifier.
Length Specifies the report length.
Time stamp Specifies the system time (in milliseconds).
Number of dial changed Specifies the number of the dial changed.
Delta change Specifies delta dial rotation.
Format 1:
v Status: Specifies the extended button status.
v Delta Wheel: Specifies the delta wheel movement.
Format 2:
v Button Status: Specifies the button status.
v Delta X: Specifies the delta mouse motion along the X axis.
v Delta Y: Specifies the delta mouse motion along the Y axis.
v Delta Wheel: Specifies the delta wheel movement.
The address of the service vector is obtained with the fp_ioctl subroutine call during a non-critical period.
The kernel extension can later invoke the service using an indirect call as follows:
where:
v The service vector is a pointer to the service vector obtained by the KSQUERYSV fp_ioctl subroutine
call.
v The ServiceNumber parameter is defined in the inputdd.h file.
v The DeviceNumber parameter specifies the major and minor numbers of the keyboard.
v The Arg parameter points to a ksalarm structure for alarm requests and a uint variable for SAK enable
and disable requests. The Arg parameter is NULL for flush queue requests.
If successful, the function returns a value of 0. Otherwise, the function returns an error number
defined in the errno.h file. Flush-queue and enable/disable-SAK requests are always processed, but alarm
requests are ignored if the kernel extension’s channel is inactive.
The following example uses the service vector to sound the alarm:
/* pinned data structures */
/* This example assumes that pinning is done elsewhere. */
int (**ksvtbl) ();
struct ksalarm alarm;
dev_t devno;
/* critical section */
/* sound alarm for 1 second using service vector */
alarm.duration = 128;
alarm.frequency = 100;
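The indirect call through a service vector can also be modeled outside the kernel. The following self-contained sketch uses a mock vector so that the dispatch can be exercised on its own. The service numbers, the devno_t and alarm_req types, and all function names are hypothetical stand-ins; the real service numbers and the ksalarm structure are defined in the inputdd.h file.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical service numbers; the real ones come from inputdd.h. */
enum { SVC_ALARM, SVC_SAK, SVC_FLUSH, SVC_COUNT };

typedef unsigned long devno_t;   /* stand-in for dev_t */
struct alarm_req {               /* stand-in for struct ksalarm */
    unsigned duration;
    unsigned frequency;
};

typedef int (*service_fn)(devno_t dev, void *arg);

static unsigned last_duration;   /* records what the mock alarm saw */

static int mock_alarm(devno_t dev, void *arg)
{
    (void)dev;
    if (arg == NULL)
        return EINVAL;
    last_duration = ((struct alarm_req *)arg)->duration;
    return 0;
}

static int mock_sak(devno_t dev, void *arg)
{
    (void)dev;
    return (arg != NULL) ? 0 : EINVAL;
}

static int mock_flush(devno_t dev, void *arg)
{
    (void)dev;
    (void)arg;                   /* flush requests take a NULL argument */
    return 0;
}

static service_fn mock_vector[SVC_COUNT] = {
    mock_alarm, mock_sak, mock_flush
};

/* Indirect call: index the vector by service number, then call through
 * the function pointer with the device number and argument. */
static int call_service(service_fn *vec, int number, devno_t dev, void *arg)
{
    assert(vec != NULL && number >= 0 && number < SVC_COUNT);
    return (*vec[number])(dev, arg);
}
```

The point of the vector is that the caller needs no sleep-capable subroutine call at the time of the request; it only dereferences a pointer obtained earlier.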
The low function terminal (lft) interface is a pseudo-device driver that interfaces with device drivers for the
system keyboard and display adapters. The lft interface adheres to all standard requirements for
pseudo-device drivers and has all the entry points and configuration code as required by the device driver
architecture. This section gives a high-level description of the various configuration methods and entry
points provided by the lft interface.
All the device drivers controlled by the lft interface are also used by AIXwindows. Consequently, along with
the functions required for the tty subsystem interface, the lft interface provides the functions required by
AIXwindows interfaces with display device driver adapters.
Configuration
The lft interface uses the common define, undefine, and unconfiguration methods standard for most
devices.
Note: The lft interface does not support any change method for dynamically changing the lft configuration.
Instead, use the -P flag with the chdev command. The changes become effective the next time the
lft interface is configured.
The configuration process for the lft opens all display device drivers. To define the default display and
console, select the default display and console during the console configuration process. If a graphics
display is chosen as the system console, it automatically becomes the default display. The lft interface
displays text on the default display.
The configuration process for the lft interface queries the ODM database for the available fonts and
software keyboard map for the current session.
Terminal Emulation
The lft interface is a stream-based tty subsystem. The lft interface provides VT100 (or IBM 3151) terminal
emulation for the standard part of the ANSI 3.64 data stream. All line discipline handling is performed in
the layers above the lft interface. The lft interface does not support virtual terminals.
The lft interface supports multiple fonts to handle the different screen sizes and resolutions necessary in
providing a 25x80 character display on various display adapters.
Note: The lft interface provides ioctl support to set and change the default display.
The display drivers might enqueue font requests through the font process started during lft initialization. The
font process pins and unpins the requested fonts for DMA to the display adapter.
The display device drivers provide all the standard interfaces (such as config, initialize, terminate, and so
forth) required in any AIX 4.1 (or later) device drivers. The only device switch table entries supported are
open, close, config, and ioctl. All other device switch table entries are set to nodev. In addition, the display
device drivers provide a set of ioctls for use by AIXwindows and diagnostics to perform device specific
functions such as get bus access, bus memory address, DMA operations, and so forth.
Note: Previously, the high functional terminal interface provided AIXwindows with the gsc_handle. This
handle is used in all of the aixgsc system calls. The RCM provides this service for the lft interface.
To ensure that lft can recover the display in case AIXwindows should terminate abnormally, AIXwindows
issues the ioctl to RCM after opening the pseudo-device. RCM passes on the ioctl to the lft. This way, the
close function in RCM is invoked (because AIXwindows is the only application that has opened RCM), and
RCM notifies the lft interface to start reusing the display. To support this communication, the RCM provides
the required ioctl support.
Diagnostics
Diagnostics and other applications that require access to the graphics adapter use the AIXwindows-to-lft
interface.
Related Information
National Language Support Overview, Setting National Language Support for Devices, Locales in
Operating system and device management
Commands References
The iconv command in AIX 5L Version 5.3 Commands Reference, Volume 3.
The following topics describe how the logical volume device driver (LVDD) interacts with physical volumes:
v “Direct Access Storage Devices (DASDs)”
v “Physical Volumes”
v “Understanding the Logical Volume Device Driver” on page 208
v “Understanding Logical Volumes and Bad Blocks” on page 211
A removable storage device is any storage device defined by the person who administers your system
during system configuration to be an optional part of the system DASD. The removable storage device can
be removed from the system at any time during normal operation. As long as the device is logically
unmounted first, the operating system does not detect an error.
The following types of devices are not considered DASD and are not supported by the logical volume
manager (LVM):
v Diskettes
v CD-ROM (compact disk read-only memory)
v DVD-ROM (DVD read-only memory)
v WORM (write-once read-many)
For a description of the block level, see “DASD Device Block Level Description” on page 311.
Physical Volumes
A logical volume is a portion of a physical volume viewed by the system as a volume. Logical records are
records defined in terms of the information they contain rather than physical attributes.
A physical volume is a DASD structured for requests at the physical level, that is, the level at which a
processing unit can request device-independent operations on a physical block address basis. A physical
volume is composed of the following:
v A device-dependent reserved area
v A variable number of physical blocks that serve as DASD descriptors
v An integral number of partitions, each containing a fixed number of physical blocks
When performing I/O at a physical level, no bad-block relocation is supported. Bad blocks are not hidden
at this level as they are at the logical level. Typical operations at the physical level are
read-physical-block and write-physical-block. For more information on bad blocks, see “Understanding
Logical Volumes and Bad Blocks” on page 211.
block A contiguous, 512-byte region of a physical volume that corresponds in size to a DASD sector.
The number of blocks in a partition, as well as the number of partitions in a given physical volume, is
fixed when the physical volume is installed in a volume group. Every physical volume in a volume group
has exactly the same partition size. There is no restriction on the types of DASDs (for example, Small
Computer Systems Interface (SCSI), Enhanced Small Device Interface (ESDI), or Intelligent Peripheral
Interface (IPI)) that can be placed in a given volume group.
Note: A given physical volume must be assigned to a volume group before that physical volume can be
used by the LVM.
Note: Sector numbering applies to user-accessible data sectors only. Spare sectors and
Customer-Engineer (CE) sectors are not included. CE sectors are reserved for use by diagnostic
test routines or microcode.
The 128-sector reserved area of a physical volume includes a boot record, the bad-block directory, the
LVM record, and the mirror write consistency (MWC) record. The boot record consists of one sector
containing information that allows the read-only storage (ROS) to boot the system. A description of the boot
record can be found in the /usr/include/sys/bootrecord.h file.
The boot record also contains the pv_id field. This field is a 64-bit number uniquely identifying a physical
volume. This identifier might be assigned by the manufacturer of the physical volume. However, if a
physical volume is part of a volume group, the pv_id field will be assigned by the LVM.
The bad-block directory records the blocks on the physical volume that have been diagnosed as unusable.
The structure of the bad-block directory and its entries can be found in the /usr/include/sys/bbdir.h file.
The LVM record consists of one sector and contains information used by the LVM when the physical
volume is a member of the volume group. The LVM record is described in the /usr/include/lvmrec.h file.
The volume group descriptor area (VGDA) is divided into the following:
v The volume group header. This header contains general information about the volume group and a time
stamp used to verify the consistency of the VGDA.
v A list of logical volume entries. The logical volume entries describe the states and policies of logical
volumes. This list defines the maximum number of logical volumes allowed in the volume group. The
maximum is specified when a volume group is created.
v A list of physical volume entries. The size of the physical volume list is variable because the number of
entries in the partition map can vary for each physical volume. For example, a 200 MB physical volume
with a partition size of 1 MB has 200 partition map entries.
v A name list. This list contains the special file names of each logical volume in the volume group.
v A volume group trailer. This trailer contains an ending time stamp for the volume group descriptor area.
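The partition map arithmetic in the physical volume entry above can be checked with a short sketch. The function names are hypothetical, and the sketch ignores the reserved area and descriptor blocks, as the 200 MB example in the list does.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 512u   /* bytes per physical block */

/* Number of whole partitions, and so partition map entries, that fit
 * on a physical volume. Reserved areas are ignored in this sketch. */
static uint64_t partition_map_entries(uint64_t pv_bytes, uint64_t pp_bytes)
{
    assert(pp_bytes > 0 && pp_bytes % BLOCK_SIZE == 0);
    return pv_bytes / pp_bytes;
}

/* Fixed number of physical blocks contained in each partition. */
static uint64_t blocks_per_partition(uint64_t pp_bytes)
{
    assert(pp_bytes % BLOCK_SIZE == 0);
    return pp_bytes / BLOCK_SIZE;
}
```

For a 1 MB partition size, each partition holds 2048 blocks of 512 bytes.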
When a volume group is varied online, a majority (also called a quorum) of VGDAs must be present to
perform recovery operations unless you have specified the force flag. (The vary-on operation, performed
by using the varyonvg command, makes a volume group available to the system.) See Logical volume
storage in Operating system and device management for introductory information about the vary-on
process and quorums.
A volume group with only one physical volume must contain two copies of the physical volume group
descriptor area. For any volume group containing more than one physical volume, there are at least three
on-disk copies of the volume group descriptor area. The default placement of these areas on the physical
volume is as follows:
v For the first physical volume installed in a volume group, two copies of the volume group descriptor
area are placed on the physical volume.
v For the second physical volume installed in a volume group, one copy of the volume group descriptor
area is placed on the physical volume.
v For the third physical volume installed in a volume group, one copy of the volume group descriptor area
is placed on the physical volume. The second copy is removed from the first volume.
v For additional physical volumes installed in a volume group, one copy of the volume group descriptor
area is placed on the physical volume.
When a vary-on operation is performed, a majority of copies of the volume group descriptor area must be
able to come online before the vary-on operation is considered successful. A quorum ensures that at least
one copy of the volume group descriptor areas available to perform recovery was also one of the volume
group descriptor areas that were online during the previous vary-off operation. If not, the consistency of
the volume group descriptor area cannot be ensured.
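The default placement rules and the majority test can be sketched as follows. The function names are hypothetical, and "majority" is read here as strictly more than half of the copies, which is an interpretation of the text rather than a statement of the LVM implementation.

```c
#include <assert.h>

/* VGDA copies held by the physical volume at position pv_index
 * (0 = first installed) in a volume group of pv_count volumes,
 * following the default placement described above. */
static int vgda_copies_on_pv(int pv_index, int pv_count)
{
    assert(pv_count >= 1 && pv_index >= 0 && pv_index < pv_count);
    if (pv_index == 0 && pv_count <= 2)
        return 2;   /* first volume keeps two copies until a third joins */
    return 1;
}

/* Total number of on-disk VGDA copies in the volume group. */
static int vgda_total(int pv_count)
{
    int i, total = 0;

    for (i = 0; i < pv_count; i++)
        total += vgda_copies_on_pv(i, pv_count);
    return total;
}

/* Quorum: strictly more than half of the copies must be online. */
static int has_quorum(int online_copies, int total_copies)
{
    return 2 * online_copies > total_copies;
}
```

In a two-volume group, losing the first volume leaves one copy of three and no quorum, while losing the second leaves two of three and the vary-on can proceed.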
The volume group status area (VGSA) contains the status of all physical volumes in the volume group.
This status is limited to active or missing. The VGSA also contains the state of all allocated physical
partitions (PPs). A PP changes from active to stale when it belongs to a logical partition (LP) that has
multiple copies, or mirrors, and is no longer consistent with its peers in the LP. This inconsistency can be
caused by a write error or by not having a physical volume available when the LP is written to or updated.
A PP changes from stale to active after a successful resynchronization of the LP. A resynchronization
operation issues resynchronization requests starting at the beginning of the LP and proceeding
sequentially through its end. The LVDD reads from an active partition in the LP and then writes that data
to any stale partition in the LP. When the entire LP has been traversed, the partition state is changed from
stale to active.
Note: If a write error occurs in a stale partition while a resynchronization is in progress, that partition
remains stale.
If all stale partitions in an LP encounter write errors, the resynchronization operation is ended for this LP
and must be restarted from the beginning.
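The resynchronization rules above can be modeled with a small simulation. This is only a sketch: the structure, the names, and the fail_at field (which injects a simulated write error at a given block) are hypothetical, and the actual read and write of data is omitted.

```c
#include <assert.h>

#define MAX_COPIES 3

enum pp_state { PP_ACTIVE, PP_STALE };

struct lp {
    int ncopies;
    enum pp_state state[MAX_COPIES];
    int fail_at[MAX_COPIES];  /* simulated: block whose write fails, -1 = none */
};

/* Walk the LP sequentially from its first block to its last. Returns 0
 * when the pass completes, or -1 when every stale partition has hit a
 * write error and the operation must be restarted from the beginning. */
static int resync_lp(struct lp *lp, int nblocks)
{
    int failed[MAX_COPIES] = {0};
    int c, blk, stale = 0, active = 0, nfailed = 0;

    for (c = 0; c < lp->ncopies; c++) {
        if (lp->state[c] == PP_STALE)
            stale++;
        else
            active++;
    }
    assert(active > 0);       /* an active partition is needed as source */
    if (stale == 0)
        return 0;

    for (blk = 0; blk < nblocks; blk++) {
        /* read blk from an active partition (omitted), then write it
         * to each stale partition that has not yet failed */
        for (c = 0; c < lp->ncopies; c++) {
            if (lp->state[c] != PP_STALE || failed[c])
                continue;
            if (lp->fail_at[c] == blk) {
                failed[c] = 1;
                nfailed++;
            }
        }
        if (nfailed == stale)
            return -1;        /* all stale partitions hit write errors */
    }

    /* Entire LP traversed: partitions without errors become active. */
    for (c = 0; c < lp->ncopies; c++)
        if (lp->state[c] == PP_STALE && !failed[c])
            lp->state[c] = PP_ACTIVE;
    return 0;
}
```

A partition that suffers a write error during the pass stays stale, even when another stale partition in the same LP is resynchronized successfully.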
The vary-on operation uses the information in the VGSA to initialize the LVDD data structures when the
volume group is brought online.
Attention: Each logical volume has a control block located in the first 512 bytes. Data begins in the
second 512-byte block. Care must be taken when reading and writing directly to the logical volume, such
as when using applications that write to raw logical volumes, because the control block is not protected
from such writes. If the control block is overwritten, commands that use the control block will use default
information instead.
Character I/O requests are performed by issuing a read or write request on a /dev/rlvn character special
file for a logical volume. The read or write is processed by the file system SVC handler, which calls the
LVDD ddread or ddwrite entry point. The ddread or ddwrite entry point transforms the character request
into a block request. This is done by building a buffer for the request and calling the LVDD ddstrategy
entry point.
Block I/O requests are performed by issuing a read or write on a block special file /dev/lvn for a logical
volume. These requests go through the SVC handler to the bread or bwrite block I/O kernel services.
These services build buffers for the request and call the LVDD ddstrategy entry point. The LVDD
ddstrategy entry point then translates the logical address to a physical address (handling bad block
relocation and mirroring) and calls the appropriate physical disk device driver.
On completion of the I/O, the physical disk device driver calls the iodone kernel service on the device
interrupt level. This service then calls the LVDD I/O completion-handling routine. Once this is completed,
the LVDD calls the iodone service again to notify the requester that the I/O is completed.
The LVDD is logically split into top and bottom halves. The top half contains the ddopen, ddclose,
ddread, ddwrite, ddioctl, and ddconfig entry points. The bottom half contains the ddstrategy entry point,
Data Structures
The interface to the ddstrategy entry point is one or more logical buf structures in a list. The logical buf
structure is defined in the /usr/include/sys/buf.h file and contains all needed information about an I/O
request, including a pointer to the data buffer. The ddstrategy entry point associates one or more (if
mirrored) physical buf structures (or pbufs) with each logical buf structure and passes them to the
appropriate physical device driver.
The pbuf structure is a standard buf structure with some additional fields. The LVDD uses these fields to
track the status of the physical requests that correspond to each logical I/O request. A pool of pinned pbuf
structures is allocated and managed by the LVDD.
There is one device switch entry for each volume group defined on the system. Each volume group entry
contains a pointer to the volume group data structure describing it.
ddopen Called by the file system when a logical volume is mounted, to open the logical volume specified.
ddclose Called by the file system when a logical volume is unmounted, to close the logical volume specified.
ddconfig Initializes data structures for the LVDD.
ddread Called by the read subroutine to translate character I/O requests to block I/O requests. This entry
point verifies that the request is on a 512-byte boundary and is a multiple of 512 bytes in length.
Most of the time a request will be sent down as a single request to the LVDD ddstrategy entry point
which handles logical block I/O requests. However, the ddread routine might divide very large
requests into multiple requests that are passed to the LVDD ddstrategy entry point.
If the ext parameter is set (called by the readx subroutine), the ddread entry point passes this
parameter to the LVDD ddstrategy routine in the b_options field of the buffer header.
ddwrite Called by the write subroutine to translate character I/O requests to block I/O requests. The LVDD
ddwrite routine performs the same processing for a write request as the LVDD ddread routine does
for read requests.
ddioctl Supports the following operations:
CACLNUP
Causes the mirror write consistency (MWC) cache to be written to all physical volumes
(PVs) in a volume group.
IOCINFO, XLATE
Return LVM configuration information and PP status information.
LV_INFO
Provides information about a logical volume.
PBUFCNT
Increases the number of physical buffer headers (pbufs) in the LVM pbuf pool.
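The 512-byte check that the ddread and ddwrite entry points apply to character I/O requests can be illustrated with a short sketch. This is not the LVDD source; the names demo_request_is_block_aligned, demo_block_count, and DEMO_BLOCK_SIZE are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name; the text only states the 512-byte rule. */
#define DEMO_BLOCK_SIZE 512u

/* Returns 1 if the request starts on a 512-byte boundary and its
 * length is a multiple of 512 bytes, otherwise 0. */
int demo_request_is_block_aligned(unsigned long long offset, size_t length)
{
    return (offset % DEMO_BLOCK_SIZE == 0) && (length % DEMO_BLOCK_SIZE == 0);
}

/* Number of 512-byte blocks covered by an already validated request. */
size_t demo_block_count(unsigned long long offset, size_t length)
{
    assert(demo_request_is_block_aligned(offset, length));
    return length / DEMO_BLOCK_SIZE;
}
```

A request that fails the first check would be rejected before any block I/O request is built.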
The bottom half of the LVDD runs on interrupt levels and, as a result, is not permitted to page fault. The
bottom half of the LVDD is divided into the following three layers:
v Strategy layer
v Scheduler layer
v Physical layer
Each logical I/O request passes down through these three layers before reaching the physical disk
device driver. When the I/O is complete, the request returns up through the layers so that each layer
can perform its I/O completion processing. Finally, control returns to the original requester.
Strategy Layer
The strategy layer deals only with logical requests. The ddstrategy entry point is called with one or more
logical buf structures. A list of buf structures for requests that are not blocked is passed to the second
layer, the scheduler.
Scheduler Layer
The scheduler layer schedules physical requests for logical operations and handles mirroring and the
MWC cache. For each logical request the scheduler layer schedules one or more physical requests. These
requests involve translating logical addresses to physical addresses, handling mirroring, and calling the
LVDD physical layer with a list of physical requests.
When a physical I/O operation is complete for one phase or mirror of a logical request, the scheduler
initiates the next phase (if there is one). If no more I/O operations are required for the request, the
scheduler calls the strategy termination routine. This routine notifies the originator that the request has
been completed.
The scheduler also handles the MWC cache for the volume group. If a logical volume is using mirror write
consistency, then requests for this logical volume are held within the scheduling layer until the MWC cache
blocks can be updated on the target physical volumes. When the MWC cache blocks have been updated,
the request proceeds with the physical data write operations.
When MWC is being used, system performance can be adversely affected. This is caused by the
overhead of logging or journalling that a write request is active in one or more logical track groups (LTGs)
(128K, 256K, 512K or 1024K). This overhead applies to mirrored writes only. It is necessary to guarantee
data consistency between mirrors, particularly if the system crashes before the write to all mirrors has
been completed.
Mirror write consistency can be turned off for an entire logical volume. It can also be inhibited on a request
basis by turning on the NO_MWC flag as defined in the /usr/include/sys/lvdd.h file.
Physical Layer
The physical layer of the LVDD handles startup and termination of the physical request. The physical layer
calls a physical disk device driver’s ddstrategy entry point with a list of buf structures linked together. In
turn, the physical layer is called by the iodone kernel service when each physical request is completed.
This layer also performs bad-block relocation and detection/correction of bad blocks, when necessary.
These details are hidden from the other two layers.
Note: For write requests, the LVDD attempts to hardware-relocate the bad block. If this is
unsuccessful, then the block is software-relocated. For read requests, the information is
recorded and the block is relocated on the next write request to that block.
v For a successful request that generated an excessive number of retries, the device driver can return
good data. To indicate this situation it must set the following:
– The b_error field is set to ESOFT; this is defined in the /usr/include/sys/errno.h file.
– The b_flags field has the B_ERROR flag set to on.
– The b_resid field is set to a count indicating the first block in the request that had excessive retries.
This block is then relocated.
v The physical disk device driver needs to accept a request of one block with HWRELOC (defined in the
/usr/include/sys/lvdd.h file) set to on in the b_options field. This indicates that the device driver is to
perform a hardware relocation on this request. If the device driver does not support hardware relocation
the following should be set:
– The b_error field is set to EIO; this is defined in the /usr/include/sys/errno.h file.
– The b_flags field has the B_ERROR flag set on.
– The b_resid field is set to a count indicating the first block in the request that has excessive retries.
v The physical disk device driver should support the system dump interface as defined.
v The physical disk device driver must support write verification on an I/O request. Requests for write
verification are made by setting the b_options field to WRITEV. This value is defined in the
/usr/include/sys/lvdd.h file.
If bad blocks exist in a physical request, the request is split into pieces. The first piece contains any blocks
up to the relocated block. The second piece contains the relocated block (the relocated address is
specified in the bad-block entry of the defects directory). The third piece contains any blocks after the
relocated block to the end of the request or to the next relocated block. These separate pieces are
processed sequentially until the entire request has been satisfied.
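The splitting of a physical request around relocated blocks can be sketched as follows. This is an illustration under stated assumptions, not the LVDD implementation: the structures demo_defect and demo_piece and the function demo_split_request are hypothetical, and the defect list is assumed to be sorted by bad-block address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of the defects directory. */
struct demo_defect {
    unsigned long bad_block;    /* original (defective) block address */
    unsigned long reloc_block;  /* address the block was relocated to */
};

/* One sequentially processed piece of a split physical request. */
struct demo_piece {
    unsigned long start;  /* first block address actually accessed */
    unsigned long count;  /* number of blocks in this piece */
};

/* Split [start, start + count) around relocated blocks.
 * defects[] must be sorted by bad_block. Returns the number of pieces
 * written to out[], or 0 if out[] is too small. */
size_t demo_split_request(unsigned long start, unsigned long count,
                          const struct demo_defect *defects, size_t ndefects,
                          struct demo_piece *out, size_t max_out)
{
    unsigned long next = start;
    unsigned long end = start + count;
    size_t n = 0;

    assert(out != NULL);
    for (size_t i = 0; i < ndefects; i++) {
        unsigned long bad = defects[i].bad_block;
        if (bad < next || bad >= end)
            continue;                   /* defect outside the remaining range */
        if (bad > next) {               /* blocks up to the relocated block */
            if (n == max_out)
                return 0;
            out[n++] = (struct demo_piece){ next, bad - next };
        }
        if (n == max_out)               /* the relocated block itself */
            return 0;
        out[n++] = (struct demo_piece){ defects[i].reloc_block, 1 };
        next = bad + 1;
    }
    if (next < end) {                   /* blocks after the last relocated block */
        if (n == max_out)
            return 0;
        out[n++] = (struct demo_piece){ next, end - next };
    }
    return n;
}
```

For a 10-block request starting at block 100 with block 104 relocated, the sketch yields three pieces: blocks 100 to 103, the relocated block, and blocks 105 to 109.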
When a bad block is detected during I/O, the physical disk device driver sets the error fields in the buf
structure to indicate that there was a media surface error. The physical layer of the LVDD then initiates
any bad-block processing that must be done.
If the operation was a nonmirrored read, the block is not relocated because the data in the relocated block
is not initialized until a write is performed to the block. To support this delayed relocation, an entry for the
bad block is put into the LVDD defects directory and into the bad-block directory on disk. These entries
contain no relocated block address and the status for the block is set to indicate that relocation is desired.
On each I/O request, the physical layer checks whether there are any bad blocks in the request. If the
request is a write and contains a block that is in a relocation-desired state, the request is sent to the
physical disk device driver with safe hardware relocation requested. If the request is a read, a read of the
known defective block is attempted.
If the operation was a read operation in a mirrored LP, a request to read one of the other mirrors is
initiated. If the second read is successful, then the read is turned into a write request and the physical disk
device driver is called with safe hardware relocation specified to fix the bad mirror.
If the hardware relocation fails or the device does not support safe hardware relocation, the physical layer
of the LVDD attempts software relocation. At the end of each volume is a reserved area used by the LVDD
as a pool of relocation blocks. When a bad block is detected and the disk device driver is unable to
relocate the block, the LVDD picks the next unused block in the relocation pool and writes to this new
location. A new entry is added to the LVDD defects directory in memory (and to the bad-block directory on
disk) that maps the bad-block address to the new relocation block address. Any subsequent I/O requests
to the bad-block address are routed to the relocation address.
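A minimal sketch of the in-memory side of software relocation follows. It assumes a small fixed-size directory and a contiguous relocation pool; the structure demo_defect_dir and both functions are hypothetical, and the update of the bad-block directory on disk is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_DEFECTS 8   /* hypothetical size, for illustration only */

struct demo_defect_dir {
    unsigned long bad[DEMO_MAX_DEFECTS];    /* bad-block addresses */
    unsigned long reloc[DEMO_MAX_DEFECTS];  /* matching relocation addresses */
    size_t used;                            /* entries in use */
    unsigned long pool_next;                /* next unused block in the pool */
    unsigned long pool_end;                 /* one past the last pool block */
};

/* Record a software relocation: take the next unused pool block and map
 * the bad block to it. Returns 0 on success, -1 if the pool or the
 * directory is exhausted. */
int demo_sw_relocate(struct demo_defect_dir *dir, unsigned long bad_block,
                     unsigned long *new_block)
{
    assert(dir != NULL && new_block != NULL);
    if (dir->used == DEMO_MAX_DEFECTS || dir->pool_next >= dir->pool_end)
        return -1;
    dir->bad[dir->used] = bad_block;
    dir->reloc[dir->used] = dir->pool_next;
    *new_block = dir->pool_next++;
    dir->used++;
    return 0;
}

/* Route a block address: relocated blocks map to their pool block,
 * all other blocks map to themselves. */
unsigned long demo_route_block(const struct demo_defect_dir *dir,
                               unsigned long block)
{
    assert(dir != NULL);
    for (size_t i = 0; i < dir->used; i++)
        if (dir->bad[i] == block)
            return dir->reloc[i];
    return block;
}
```

Every subsequent request would pass its block addresses through a lookup such as demo_route_block before the physical I/O is issued.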
Attention: Formatting a fixed disk deletes any data on the disk. Format a fixed disk only when
absolutely necessary and preferably after backing up all data on the disk.
If you need to format a fixed disk completely (including reinitializing any bad blocks), use the formatting
function supplied by the diag command. (The diag command typically, but not necessarily, writes over all
data on a fixed disk. Refer to the documentation that comes with the fixed disk to determine the effect of
formatting with the diag command.)
Related Information
Subroutine References
The readx subroutine, write subroutine in AIX 5L Version 5.3 Technical Reference: Base Operating
System and Extensions Volume 2.
Files Reference
The lvdd special file in AIX 5L Version 5.3 Files Reference.
The bread kernel service, bwrite kernel service, iodone kernel service in AIX 5L Version 5.3 Technical
Reference: Kernel and Subsystems Volume 1.
“Adding a New Printer Type to Your System” provides general instructions for adding an undefined printer.
To add an undefined printer, you modify an existing printer definition. Undefined printers fall into two
categories:
v Printers that closely emulate a supported printer. You can use SMIT or the virtual printer commands to
make the changes you need.
v Printers that do not emulate a supported printer or that emulate several data streams. It is simpler to
make the necessary changes for these printers by editing the printer colon file. See Adding a Printer
Using the Printer Colon File in Printers and printing.
“Adding an Unsupported Device to the System” on page 104 offers an overview of the major steps
required to add an unsupported device of any type to your system.
Example of Print Formatter in Printers and printing shows how the print formatter interacts with the printer
formatter subroutines.
Steps for adding a new printer-specific formatter to the printer backend are discussed in Adding a Printer
Formatter to the Printer Backend. Example of Print Formatter in Printers and printing shows how print
formatters can interact with the printer formatter subroutines.
Note: These instructions apply to the addition of a new printer definition to the system, not to the addition
of a physical printer device itself. For information on adding a new printer device, refer to device
If the printer being added does not emulate a supported printer or if it emulates several data streams, you
need to make more changes to the printer definition. It is simpler to make the necessary changes for
these printers by editing the printer colon file. See Adding a Printer Using the Printer Colon File in Printers
and printing.
For example, assume that you created a new file based on the existing 4201-3 printer. The customized file
for the 4201-3 printer contains the following template that the printer formatter uses to initialize the printer:
%I[ez,em,eA,cv,eC,eO,cp,cc, . . .
The formatter fills in the string as directed by this template and sends the resulting sequence of
commands to the 4201-3 printer. Specifically, this generates a string of escape sequences that initialize the
printer and set such parameters as vertical and horizontal spacing and page length. You would construct a
similar command string to properly initialize the new printer and put it into 4201-emulation mode. Although
many of the escape sequences might be the same, at least one will be different: the escape sequence that
is the command to put the printer into the specific printer-emulation mode. Assume that you added an ep
attribute that specifies the string to initialize the printer to 4201-3 emulation mode, as follows:
\033\012\013
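To see which bytes such a string represents, the following sketch decodes backslash-octal notation into raw bytes. It assumes the attribute value uses the octal form exactly as printed above; the function demo_decode_octal is hypothetical and is not part of the printer backend.

```c
#include <assert.h>
#include <stddef.h>

/* Decode backslash-octal escapes (for example "\033") into raw bytes.
 * Characters that are not part of an escape are copied unchanged.
 * Returns the number of bytes written, or 0 if out[] is too small. */
size_t demo_decode_octal(const char *src, unsigned char *out, size_t max_out)
{
    size_t n = 0;

    assert(src != NULL && out != NULL);
    while (*src != '\0') {
        unsigned int value;
        if (src[0] == '\\' && src[1] >= '0' && src[1] <= '7') {
            int digits = 0;
            value = 0;
            src++;                      /* skip the backslash */
            while (digits < 3 && *src >= '0' && *src <= '7') {
                value = value * 8 + (unsigned int)(*src - '0');
                src++;
                digits++;
            }
        } else {
            value = (unsigned char)*src++;
        }
        if (n == max_out)
            return 0;
        out[n++] = (unsigned char)value;
    }
    return n;
}
```

Applied to the string above, the sketch produces the three bytes 0x1B (ESC), 0x0A, and 0x0B.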
You must create a virtual printer for each printer-emulation mode you want to use. See Real and Virtual
Printers in Printers and printing.
Typically, to add a new printer definition to the database, you first modify an existing printer definition and
then create a customized printer definition in the Customized Printer Directory.
Once you have added the new customized printer definition to the directory, the mkvirprt command uses
it to present the new printer as a choice for printer addition and selection. Because the new printer
definition is a customized printer definition, it appears in the list of printers under the name of the original
printer from which it was customized.
A totally new printer must be added as a predefined printer definition in the /usr/lib/lpd/pio/predef
directory. If the user chooses to work with printers once this new predefined printer definition is added to
the Predefined Printer Directory, the mkvirprt command can then list all the printers in that directory. The
added printer appears on the list of printers given to the user as if it had been supported all along. Specific
information about this printer can then be extended, added, modified, or deleted, as necessary.
Printer Support in Printers and printing lists the supported printer types and names of representative
printers.
Embedded references and logic are defined with escape sequences that are placed at appropriate
locations in the attribute string. The first character of each escape sequence is always the % character.
This character indicates the beginning of an escape sequence. The second character (and sometimes
subsequent characters) define the operation to be performed. The remainder of the characters (if any) in
the escape sequence are operands to be used in performing the specified operation.
The escape sequences that can be specified in an attribute string are based on the terminfo
parameterized string escape sequences for terminals. These escape sequences have been modified and
extended for printers.
Any attribute value (integer variable, string variable, or string constant) can be referenced by any attribute
string. Consequently, it is important that the formatter ensures that the values for all the integer variables
and string variables defined to the piogetvals subroutine are kept current.
The formatter must not assume that the particular attribute string whose name it specifies to the piogetstr
or piocmdout subroutine does not reference certain variables. The attribute string is retrieved from the
database that is external to the formatter. The values in the database represented by the string can be
changed to reference additional variables without the formatter’s knowledge.
Related Information
Printers and printing
Subroutine References
The piocmdout subroutine, piogetstr subroutine, piogetvals subroutine in AIX 5L Version 5.3 Technical
Reference: Base Operating System and Extensions Volume 1.
Commands References
The mkvirprt command in AIX 5L Version 5.3 Commands Reference, Volume 3.
This section frequently refers to both a SCSI device driver and a SCSI adapter device driver. These two
distinct device drivers work together in a layered approach to support attachment of a range of SCSI
devices. The SCSI adapter device driver is the lower device driver of the pair, and the SCSI device driver
is the upper device driver.
The SCSI adapter device driver manages the SCSI bus but not the SCSI devices. It can send and receive
SCSI commands, but it cannot interpret the contents of the commands. The lower driver also provides
recovery and logging for errors related to the SCSI bus and system I/O hardware. Management of the
device specifics is left to the SCSI device driver. The interface of the two drivers allows the upper driver to
communicate with different SCSI bus adapters without requiring special code paths for each adapter.
The SCSI device driver also provides recovery and logging for errors related to the SCSI device it controls.
The operating system provides several kernel services allowing the SCSI device driver to communicate
with SCSI adapter device driver entry points without having the actual name or address of those entry
points. The description contained in “Logical File System Kernel Services” on page 66 can provide more
information.
When writing a new SCSI adapter device driver, the writer must know which mode or modes must be
supported to meet the requirements of the SCSI adapter and any interfaced SCSI device drivers. When a
SCSI adapter device driver is added so that a new SCSI adapter works with all existing SCSI device
drivers, both initiator-mode and target-mode must be supported in the SCSI adapter device driver.
Initiator-Mode Support
The interface between the SCSI device driver and the SCSI adapter device driver for initiator-mode
support (that is, the attached device acts as a target) is accessed through calls to the SCSI adapter device
driver open, close, ioctl, and strategy routines. I/O requests are queued to the SCSI adapter device
driver through calls to its strategy entry point.
Communication between the SCSI device driver and the SCSI adapter device driver for a particular
initiator I/O request is made through the sc_buf structure, which is passed to and from the strategy routine
in the same way a standard driver uses a struct buf structure.
Target-Mode Support
The interface between the SCSI device driver and the SCSI adapter device driver for target-mode support
(that is, the attached device acts as an initiator) is accessed through calls to the SCSI adapter device
driver open, close, and ioctl subroutines. Buffers that contain data received from an attached initiator
device are passed from the SCSI adapter device driver to the SCSI device driver, and back again, in
tm_buf structures.
Communication between the SCSI adapter device driver and the SCSI device driver for a particular data
transfer is made by passing the tm_buf structures by pointer directly to routines whose entry points have
been previously registered. This registration occurs as part of the sequence of commands the SCSI device
driver executes using calls to the SCSI adapter device driver when the device driver opens a target-mode
device instance.
A SCSI device driver can register a particular device instance for receiving asynchronous event status by
calling the SCIOEVENT ioctl operation for the SCSI-adapter device driver. When an event covered by the
SCIOEVENT ioctl operation is detected by the SCSI adapter device driver, it builds an sc_event_info
structure and passes a pointer to that structure to the previously registered asynchronous
event-handler routine. The fields in the structure are filled in by the SCSI adapter device driver
as follows:
id For initiator mode, this is set to the SCSI ID of the attached SCSI target device. For
target mode, this is set to the SCSI ID of the attached SCSI initiator device.
lun For initiator mode, this is set to the SCSI LUN of the attached SCSI target device. For
target mode, this is set to 0.
Note: Reserved fields should be set to 0 by the SCSI adapter device driver.
The information reported in the sc_event_info.events field does not queue to the SCSI device driver, but
is instead reported as one or more flags as they occur. Because the data does not queue, the SCSI
adapter device driver writer can use a single sc_event_info structure and pass it one at a time, by pointer,
to each asynchronous event handler routine for the appropriate device instance. After determining for
which device the events are being reported, the SCSI device driver must copy the sc_event_info.events
field into local space and must not modify the contents of the rest of the sc_event_info structure.
Because the event status is optional, the SCSI device driver writer determines what action is necessary to
take upon receiving event status. The writer may decide to save the status and report it back to the calling
application, or the SCSI device driver or application level program can take error recovery actions.
The unrecoverable adapter command failure event is not necessarily a fatal condition, but it can indicate
that the adapter is not functioning properly. Possible actions by the application program include:
v Ending of the session with the device in the near future
v Ending of the session after multiple (two or more) such events
v Attempting to continue the session indefinitely
The SCSI Bus Reset detection event is mainly intended as information only, but may be used by the
application to perform further actions, if necessary.
Because the event handling routine is running on the hardware interrupt level, the SCSI device driver must
be careful to limit operations in that routine. Processing should be kept to a minimum. In particular, if any
error recovery actions are performed, it is recommended that the event-handling routine set state or status
flags only and allow a process level routine to perform the actual operations.
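The division of work between the event-handling routine and the process level can be sketched as follows. This is an illustration only: the structures, the flag values, and the handler signature are hypothetical stand-ins, not the definitions from the system headers. Only the id, lun, and events field names are taken from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real definitions come from system headers. */
struct demo_event_info {
    unsigned char id;   /* SCSI ID of the attached device */
    unsigned char lun;  /* SCSI LUN (0 for target mode) */
    int events;         /* one or more event flags */
};

#define DEMO_EVT_CMD_FAILED 0x01  /* hypothetical flag values */
#define DEMO_EVT_BUS_RESET  0x02

struct demo_device {
    unsigned char id;
    unsigned char lun;
    volatile int pending_events;  /* read later at process level */
};

/* Interrupt-level handler: copy the events field and return at once. */
void demo_async_handler(struct demo_device *dev,
                        const struct demo_event_info *info)
{
    assert(dev != NULL && info != NULL);
    if (info->id != dev->id || info->lun != dev->lun)
        return;                           /* event is for another device */
    dev->pending_events |= info->events;  /* local copy; info is not modified */
}

/* Process-level routine: fetch and clear the saved flags before
 * performing any recovery. A real driver would disable interrupts
 * around this read-modify-write. */
int demo_take_events(struct demo_device *dev)
{
    int events;

    assert(dev != NULL);
    events = dev->pending_events;
    dev->pending_events = 0;
    return events;
}
```

The handler does no recovery work itself; it only records which events occurred so that a process-level routine can act on them later.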
The SCSI device driver must be careful to disable interrupts at the correct level in places where the SCSI
device driver’s lower execution priority routines manipulate variables that are also modified by the
event-handling routine. To allow the SCSI device driver to disable at the correct level, the SCSI adapter
device driver writer must provide a configuration database attribute that defines the interrupt class, or
priority, it runs on. This attribute must be named intr_priority so that the SCSI device driver configuration
method knows which attribute of the parent adapter to query. The SCSI device driver configuration method
should then pass this interrupt priority value to the SCSI device driver along with other configuration data
for the device instance.
The SCSI device driver writer must follow any other general system rules for writing a routine that must
execute in an interrupt environment. For example, the routine must not attempt to sleep or wait on I/O
operations. It can perform wakeups to allow the process level to handle those operations.
Because the SCSI device driver copies the information from the sc_event_info.events field on each call
to its asynchronous event-handling routine, there is no resource to free and no information that must be
passed back later to the SCSI adapter device driver.
The SCSI adapter device driver should never retry a SCSI command on error after the command has
successfully been given to the adapter. The consequences for retrying a SCSI command at this point
range from minimal to catastrophic, depending upon the type of device. Commands for certain devices
The first transaction passed to the SCSI adapter device driver during error recovery must include a special
flag. This SC_RESUME flag in the sc_buf.flags field must be set to inform the SCSI adapter device driver
that the SCSI device driver has recognized the fatal error and is beginning recovery operations. Any
transactions passed to the SCSI adapter device driver, after the fatal error occurs and before the
SC_RESUME transaction is issued, should be flushed; that is, returned with an error type of ENXIO
through an iodone call.
Note: If a SCSI device driver continues to pass transactions to the SCSI adapter device driver after the
SCSI adapter device driver has flushed the queue, these transactions are also flushed with an error
return of ENXIO through the iodone service. This gives the SCSI device driver a positive indication
of all transactions flushed.
If the SCSI device driver is executing a gathered write operation, the error-recovery information mentioned
previously is still valid, but the caller must restore the contents of the sc_buf.resvdw1 field and the uio
struct that the field pointed to before attempting the retry. The retry must occur from the SCSI device
driver’s process level; it cannot be performed from the caller’s iodone subroutine. Also, additional return
codes of EFAULT and ENOMEM are possible in the sc_buf.bufstruct.b_error field for a gathered write
operation.
When the SCSI adapter driver’s queue is halted, the SCSI device driver can get sense data from a device
by setting the SC_RESUME flag in the sc_buf.flags field and the SC_NO_Q flag in the sc_buf.q_tag_msg
field of the request-sense sc_buf. This action notifies the SCSI adapter driver that this is an error-recovery
transaction and should be sent to the device while the remainder of the queue for the device remains
halted. When the request sense completes, the SCSI device driver needs to either clear or resume the
SCSI adapter driver’s queue for this device.
The SCSI device driver can notify the SCSI adapter driver to clear its halted queue by sending a
transaction with the SC_Q_CLR flag in the sc_buf.flags field. This transaction must not contain a SCSI
command because it is cleared from the SCSI adapter driver’s queue without being sent to the adapter.
However, this transaction must have the SCSI ID field (sc_buf.scsi_command.scsi_id) and the LUN fields
(sc_buf.scsi_command.scsi_cmd.lun and sc_buf.lun) filled in with the device’s SCSI ID and logical unit
number (LUN). If addressing LUNs 8 - 31, the sc_buf.lun field should be set to the logical unit number
and the sc_buf.scsi_command.scsi_cmd.lun field should be zeroed out. See the descriptions of these
fields for further explanation. Upon receiving an SC_Q_CLR transaction, the SCSI adapter driver flushes
all transactions for this device and sets their sc_buf.bufstruct.b_error fields to ENXIO. The SCSI device
If the SCSI device driver wants the SCSI adapter driver to resume its halted queue, it must send a
transaction with the SC_Q_RESUME flag set in the sc_buf.flags field. This transaction can contain an
actual SCSI command, but it is not required. However, this transaction must have the
sc_buf.scsi_command.scsi_id, sc_buf.scsi_command.scsi_cmd.lun, and the sc_buf.lun fields filled in with
the device’s SCSI ID and logical unit number. See the description of these fields for further details. If this
is the first transaction issued by the SCSI device driver after receiving the error (indicating that the adapter
driver’s queue is halted), then the SC_RESUME flag must be set as well as the SC_Q_RESUME flag.
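How the addressing fields and flags of a queue-control transaction might be filled in is sketched below. The structure demo_sc_buf is a flattened, hypothetical model of the sc_buf fields named in the text, and the flag values are invented for the example. The handling of LUNs 8 - 31 follows the text; setting both LUN fields for LUNs 0 - 7 is an assumption of this sketch. Setting the resume flag on the first transaction after the error follows the general error-recovery rule stated earlier.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the sc_buf fields named in the text. */
struct demo_sc_buf {
    unsigned char scsi_id;  /* models sc_buf.scsi_command.scsi_id */
    unsigned char cmd_lun;  /* models sc_buf.scsi_command.scsi_cmd.lun */
    unsigned char lun;      /* models sc_buf.lun */
    unsigned int flags;     /* models sc_buf.flags */
};

#define DEMO_RESUME   0x01  /* hypothetical flag values */
#define DEMO_Q_CLR    0x02
#define DEMO_Q_RESUME 0x04

/* Fill in addressing. For LUNs 8 - 31 only the lun field carries the LUN
 * and the command LUN is zeroed; for LUNs 0 - 7 this sketch assumes both
 * fields carry the LUN. */
void demo_set_address(struct demo_sc_buf *scb, unsigned char id,
                      unsigned char lun)
{
    assert(scb != NULL && lun < 32);
    scb->scsi_id = id;
    scb->lun = lun;
    scb->cmd_lun = (lun < 8) ? lun : 0;
}

/* Build a queue-control transaction. first_after_error is nonzero when
 * this is the first transaction issued since the error was reported. */
void demo_build_queue_ctl(struct demo_sc_buf *scb, unsigned char id,
                          unsigned char lun, int clear, int first_after_error)
{
    assert(scb != NULL);
    memset(scb, 0, sizeof *scb);
    demo_set_address(scb, id, lun);
    scb->flags = clear ? DEMO_Q_CLR : DEMO_Q_RESUME;
    if (first_after_error)
        scb->flags |= DEMO_RESUME;
}
```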
In the SCSI subsystem, an error during a send command does not affect future target-mode data
reception. Future send commands continue to be processed by the SCSI adapter device driver and queue
up, as necessary, after the data with the error. The SCSI device driver continues processing the send
command data, satisfying user read requests as usual except that the error status is returned for the
appropriate user request. Any error recovery or synchronization procedures the user requires for a
target-mode received-data error must be implemented in user-supplied software.
The only special requirement for commands with short data-phase transfers (less than or equal to 256
bytes) is that the SCSI device driver must have pinned the memory being transferred into or out of system
memory pages. However, due to system hardware considerations, additional precautions must be taken for
data transfers into system memory pages when the transfers are larger than 256 bytes. The problem is
that any system memory area with a DMA data operation in progress causes the entire memory page that
contains it to become inaccessible.
As a result, a SCSI device driver that initiates an internal command with more than 256 bytes must have
preallocated and pinned an area whose size is some multiple of the system page size. The driver must not
place in this area any other data areas that it may need to access while I/O is being performed into or out
of that page. Memory pages so allocated must be avoided by the device driver from the moment the
transaction is passed to the adapter device driver until the device driver iodone routine is called for the
transaction (and for any other transactions to those pages).
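The sizing rule can be expressed as simple arithmetic. In this sketch the page size is passed as a parameter because it is system-dependent; the function names are hypothetical, and pinning and allocation are outside the scope of the example.

```c
#include <assert.h>
#include <stddef.h>

/* Round a transfer length up to a whole number of pages. */
size_t demo_page_multiple(size_t length, size_t page_size)
{
    assert(page_size > 0);
    return ((length + page_size - 1) / page_size) * page_size;
}

/* Nonzero when the transfer exceeds the 256-byte short-transfer limit
 * and therefore needs a dedicated page-multiple buffer. */
int demo_needs_dedicated_pages(size_t length)
{
    return length > 256;
}
```

With a 4096-byte page, a 300-byte internal command would therefore reserve one full page for its data.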
The SCSI device driver can send only one sc_buf structure per call to the SCSI adapter device driver.
Thus, the sc_buf.bufstruct.av_forw pointer should be null when given to the SCSI adapter device driver,
which indicates that this is the only request. The SCSI device driver can queue multiple sc_buf requests
by making multiple calls to the SCSI adapter device driver strategy routine.
To enhance SCSI bus performance, the SCSI device driver should consolidate multiple queued requests
when possible into a single SCSI command. To enable the SCSI adapter device driver to handle
the required scatter and gather operations, the sc_buf.bp field should always point to the first buf structure
entry for the spanned transaction. A null-terminated list of additional struct buf entries should be chained
from the first field through the buf.av_forw field to give the SCSI adapter device driver enough information
to perform the DMA scatter and gather operations required. This information must include at least the
buffer’s starting address, length, and cross-memory descriptor.
The spanned requests should always be for requests in either the read or write direction but not both,
because the SCSI adapter device driver must be given a single SCSI command to handle the requests.
The spanned request should always consist of complete I/O requests (including the additional struct buf
entries). The SCSI device driver should not attempt to use partial requests to reach the maximum transfer
size.
The maximum transfer size is actually adapter-dependent. The IOCINFO ioctl operation can be used to
discover the SCSI adapter device driver’s maximum allowable transfer size. To ease the design,
implementation, and testing of components that might need to interact with multiple SCSI-adapter device
If a transfer size larger than the supported maximum is attempted, the SCSI adapter device driver returns
a value of EINVAL in the sc_buf.bufstruct.b_error field.
Due to system hardware requirements, the SCSI device driver must consolidate only commands that are
memory page-aligned at both their starting and ending addresses. Specifically, this applies to the
consolidation of inner memory buffers. The ending address of the first buffer and the starting address of all
subsequent buffers should be memory page-aligned. However, the starting address of the first memory
buffer and the ending address of the last do not need to be aligned so.
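The alignment rule for consolidation can be checked with a short walk over the buffer chain. The structure demo_buf is a hypothetical stand-in that keeps only an address, a length, and the av_forw link; the page size is a parameter because it is system-dependent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the buf fields this check needs. */
struct demo_buf {
    uintptr_t addr;            /* starting address of the data buffer */
    size_t length;             /* buffer length in bytes */
    struct demo_buf *av_forw;  /* next buffer in the chain, or NULL */
};

/* Returns 1 if the chain may be consolidated into one spanned command:
 * every buffer except the first must start page-aligned, and every
 * buffer except the last must end page-aligned. */
int demo_can_consolidate(const struct demo_buf *first, size_t page_size)
{
    assert(first != NULL && page_size > 0);
    for (const struct demo_buf *bp = first; bp != NULL; bp = bp->av_forw) {
        if (bp != first && bp->addr % page_size != 0)
            return 0;          /* inner start not page-aligned */
        if (bp->av_forw != NULL && (bp->addr + bp->length) % page_size != 0)
            return 0;          /* inner end not page-aligned */
    }
    return 1;
}
```

A device driver could run such a check before chaining a queued request onto an existing spanned transaction, and issue the request separately when the check fails.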
The purpose of consolidating transactions is to decrease the number of SCSI commands and bus phases
required to perform the required operation. The time required to maintain the simple chain of buf structure
entries is significantly less than the overhead of multiple (even two) SCSI bus transactions.
Fragmented Commands
Single I/O requests larger than the maximum transfer size must be divided into smaller requests by the
SCSI device driver. For calls to a SCSI device driver’s character I/O (read/write) entry points, the uphysio
kernel service can be used to break up these requests. For a fragmented command such as this, the
sc_buf.bp field should be null so that the SCSI adapter device driver uses only the information in the
sc_buf structure to prepare for the DMA operation.
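The arithmetic behind fragmenting a large request is shown in the following sketch. It only computes fragment counts and lengths; it does not model the uphysio kernel service, and the function names are hypothetical. The maximum transfer size is a parameter because it is adapter-dependent.

```c
#include <assert.h>
#include <stddef.h>

/* Number of fragments needed for a request of total bytes. */
size_t demo_fragment_count(size_t total, size_t max_xfer)
{
    assert(max_xfer > 0);
    return (total + max_xfer - 1) / max_xfer;
}

/* Length in bytes of fragment number index (0-based). */
size_t demo_fragment_length(size_t total, size_t max_xfer, size_t index)
{
    size_t offset;

    assert(max_xfer > 0 && index < demo_fragment_count(total, max_xfer));
    offset = index * max_xfer;
    return (total - offset < max_xfer) ? total - offset : max_xfer;
}
```

For a 10000-byte request and a 4096-byte maximum transfer size, the sketch yields three fragments of 4096, 4096, and 1808 bytes.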
The gathered write commands, accessed through the sc_buf.resvd1 field, differ from the spanned
commands, accessed through the sc_buf.bp field, in several ways:
v Gathered write commands can transfer data regardless of address alignment, whereas spanned
commands must be memory page-aligned in address and length, making small transfers difficult.
v Gathered write commands can be implemented either in software (which requires the extra step of
copying the data to temporary buffers) or hardware. Spanned commands can be implemented in system
hardware due to address-alignment requirements. As a result, spanned commands are potentially faster
to run.
v Gathered write commands are not able to handle read requests. Spanned commands can handle both
read and write requests.
v Gathered write commands can be initiated only on the process level, but spanned commands can be
initiated on either the process or interrupt level.
If any of these conditions are not met, the gathered write commands do not succeed and the
sc_buf.bufstruct.b_error is set to EINVAL.
To support SCSI adapter device drivers that perform the gathered write commands in software, additional
return values in the sc_buf.bufstruct.b_error field are possible when gathered write commands are
unsuccessful.
Note: The gathered write command facility is optional for both the SCSI device driver and the SCSI
adapter device driver. Attempting a gathered write command to a SCSI adapter device driver that
does not support gathered write can cause a system crash. Therefore, any SCSI device driver must
issue a SCIOGTHW ioctl operation to the SCSI adapter device driver before using gathered writes. A
SCSI adapter device driver that supports gathered writes must support the SCIOGTHW ioctl as well.
The ioctl returns a successful return code if gathered writes are supported. If the ioctl fails, the SCSI
device driver must not attempt a gathered write. Typically, a SCSI device driver places the
SCIOGTHW call in its open routine for device instances that it will send gathered writes to.
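The probe described in this note might look like the following. This is a sketch under stated assumptions: the function-pointer type stands in for whatever kernel call the device driver uses to reach the adapter driver's ioctl entry point, SKETCH_SCIOGTHW is a placeholder for the real command code, and the two sketch_adapter_ functions are test doubles, not real drivers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the call that issues an ioctl to the
 * adapter driver; returns 0 on success, nonzero on failure. */
typedef int (*sketch_ioctl_fn)(int cmd, void *arg);

/* Placeholder command code; the real SCIOGTHW value comes from the
 * system headers. */
#define SKETCH_SCIOGTHW 1

struct sketch_dev {
    int gathered_write_ok;   /* recorded once at open time */
};

/* Called from the device driver's open routine. Records whether the
 * adapter driver accepts gathered writes. */
void sketch_probe_gathered(struct sketch_dev *dev, sketch_ioctl_fn adapter_ioctl)
{
    assert(dev != NULL && adapter_ioctl != NULL);
    /* arg is null because SCIOGTHW takes no input parameter. */
    dev->gathered_write_ok = (adapter_ioctl(SKETCH_SCIOGTHW, NULL) == 0);
}

/* Test double: an adapter driver that supports gathered writes. */
int sketch_adapter_supports(int cmd, void *arg)
{
    (void)arg;
    return cmd == SKETCH_SCIOGTHW ? 0 : -1;
}

/* Test double: an adapter driver that fails the operation. */
int sketch_adapter_rejects(int cmd, void *arg)
{
    (void)cmd;
    (void)arg;
    return -1;
}
```

The device driver would then consult gathered_write_ok before every gathered write and fall back to another transfer method when it is 0.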
SCSI command tag queuing refers to queuing multiple commands to a SCSI device. Queuing to the SCSI
device can improve performance because the device itself determines the most efficient way to order and
process commands. SCSI devices that support command tag queuing can be divided into two classes:
those that clear their queues on error and those that do not. Devices that do not clear their queues on
error resume processing of queued commands when the error condition is cleared, typically by receiving
the next command. Devices that do clear their queues flush all commands currently outstanding.
Command tag queuing requires the SCSI adapter, the SCSI device, the SCSI device driver, and the SCSI
adapter driver to support this capability. For a SCSI device driver to queue multiple commands to a SCSI
device (that supports command tag queuing), it must be able to provide at least one of the following
values in the sc_buf.q_tag_msg: SC_SIMPLE_Q, SC_HEAD_OF_Q, or SC_ORDERED_Q. The SCSI disk
device driver and SCSI adapter driver do support this capability. This implementation provides some
queuing-specific changeable attributes for disks that can queue commands. With this information, the disk
device driver attempts to queue to the disk, first by queuing commands to the adapter driver. The SCSI
adapter driver then queues these commands to the adapter, providing that the adapter supports command
tag queuing. If the SCSI adapter does not support command tag queuing, then the SCSI adapter driver
sends only one command at a time to the SCSI adapter and so multiple commands are not queued to the
SCSI disk.
Note: Commands with the value of SC_NO_Q for the q_tag_msg field (except for request sense
commands) should not be queued to a device whose queue contains a command with another
value for q_tag_msg. If commands with the SC_NO_Q value (except for request sense) are
sent to the device, then the SCSI device driver must make sure that no active commands are
using different values for q_tag_msg. Similarly, the SCSI device driver must also make sure that
a command with a q_tag_msg value of SC_ORDERED_Q, SC_HEAD_OF_Q, or SC_SIMPLE_Q is
not sent to a device that has a command with the q_tag_msg field of SC_NO_Q.
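The rule in this note can be sketched as a check the device driver makes before sending each command. The enum values and the bookkeeping structure are hypothetical; the real q_tag_msg constants come from the system headers, and a real driver would track active commands in its own way.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder tag values standing in for SC_NO_Q, SC_SIMPLE_Q,
 * SC_HEAD_OF_Q, and SC_ORDERED_Q. */
enum sketch_q_tag { SK_NO_Q, SK_SIMPLE_Q, SK_HEAD_OF_Q, SK_ORDERED_Q };

/* Hypothetical per-device bookkeeping kept by the device driver. */
struct sketch_q_state {
    unsigned active_untagged;  /* active commands sent with SK_NO_Q */
    unsigned active_tagged;    /* active commands sent with any other tag */
};

/* Returns 1 if a command with this tag may be sent now without mixing
 * untagged and tagged commands at the device. Request sense commands
 * sent with SK_NO_Q are exempt, as the note states. */
int sketch_may_send(const struct sketch_q_state *q, enum sketch_q_tag tag,
                    int is_request_sense)
{
    assert(q != NULL);
    if (tag == SK_NO_Q) {
        if (is_request_sense)
            return 1;
        return q->active_tagged == 0;
    }
    return q->active_untagged == 0;
}
```

When the check returns 0, the driver would typically hold the command on its own pending list until the conflicting commands complete.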
12. The flags field contains bit flags sent from the SCSI device driver to the SCSI adapter device driver.
The following flags are defined:
SC_RESUME
When set, means the SCSI adapter device driver should resume transaction queuing for this
ID/LUN. Error recovery is complete after a SCIOHALT operation, check condition, or severe
SCSI bus error. This flag is used to restart the SCSI adapter device driver following a
reported error.
SC_DELAY_CMD
When set, means the SCSI adapter device driver should delay sending this command
(following a SCSI reset or BDR to this device) by at least the number of seconds specified to
the SCSI adapter device driver in its configuration information. For SCSI devices that do not
require this function, this flag should not be set.
SC_Q_CLR
When set, means the SCSI adapter driver should clear its transaction queue for this ID/LUN.
The transaction containing this flag setting does not require an actual SCSI command in the
sc_buf because it is flushed back to the SCSI device driver with the rest of the transactions
for this ID/LUN. However, this transaction must have the SCSI ID field
(sc_buf.scsi_command.scsi_id) and the LUN fields (sc_buf.scsi_command.scsi_cmd.lun and
sc_buf.lun) filled in with the device’s SCSI ID and logical unit number (LUN). This flag is
valid only during error recovery of a check condition or command terminated at a command
tag queuing device when the SC_DID_NOT_CLR_Q flag is set in the sc_buf.adap_q_status
field.
Note: When addressing LUNs 8 - 31, be sure to see the description of the sc_buf.lun field
within the sc_buf structure.
SC_Q_RESUME
When set, means that the SCSI adapter driver should resume its halted transaction queue for
this ID/LUN. The transaction containing this flag setting does not require an actual SCSI
Note: When addressing LUNs 8 - 31, be sure to see the description of the sc_buf.lun field
within the sc_buf structure.
The SC_DIAGNOSTIC option gives the caller an exclusive open to the selected device. This option
requires appropriate authority to run. If the caller attempts to initiate this system call without the proper
authority, the SCSI device driver should return a value of -1, with the errno global variable set to a value
of EPERM. The SC_DIAGNOSTIC option may be run only if the device is not already opened for normal
operation. If an open with this option is attempted when the device is already opened, or if an openx call with
the SC_DIAGNOSTIC option is already in progress, a return value of -1 should be passed, with the errno
global variable set to a value of EACCES. Once successfully opened with the SC_DIAGNOSTIC flag, the
SCSI device driver is placed in Diagnostic mode for the selected device.
Once successfully opened, the device is placed in Exclusive Access mode. If another caller tries to do any
type of open, a return value of -1 is passed, with the errno global variable set to a value of EACCES.
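The open-time checks for the SC_DIAGNOSTIC option can be summarized in a short sketch. The state structure and function name are hypothetical, and the order in which the checks are made is an assumption; the text specifies only which errno value belongs to which condition.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-device open state kept by the SCSI device driver. */
struct sketch_open_state {
    int normal_opens;   /* number of normal-mode opens in effect */
    int diag_open;      /* 1 while opened with the diagnostic option */
};

/* Returns 0 on success, or the errno value the driver should report
 * together with a return value of -1. */
int sketch_open(struct sketch_open_state *s, int want_diag, int has_authority)
{
    assert(s != NULL);
    if (s->diag_open)
        return EACCES;          /* exclusive access is already held */
    if (want_diag) {
        if (!has_authority)
            return EPERM;       /* caller lacks the required authority */
        if (s->normal_opens > 0)
            return EACCES;      /* already opened for normal operation */
        s->diag_open = 1;
        return 0;
    }
    s->normal_opens++;
    return 0;
}
```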
The remaining options for the ext parameter are reserved for future requirements.
Implementation note: The following table shows how the various combinations of ext options should be
handled in the SCSI device driver.
When the SCSI adapter device driver receives an SCIOSTOPTGT ioctl operation, it must forcibly free any
receive data buffers that have been queued to the SCSI device driver for this device and have not been
returned to the SCSI adapter device driver through the buffer free routine. The SCSI device driver is
responsible for making sure all the receive data buffers are freed before calling the SCIOSTOPTGT ioctl
operation. However, the SCSI adapter device driver must check that this is done, and, if necessary,
forcibly free the buffers. The buffers must be freed because those not freed result in memory areas being
permanently lost to the system (until the next boot).
To allow the SCSI adapter device driver to free buffers that are sent to the SCSI device driver but never
returned, it must track which tm_bufs are currently queued to the SCSI device driver. Tracking tm_bufs
requires the SCSI adapter device driver to violate the general SCSI rule, which states the SCSI adapter
device driver should not modify the tm_bufs structure while it is queued to the SCSI device driver. This
exception to the rule is necessary because it is never acceptable not to free memory allocated from the
system.
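The tracking described above can be sketched with one ownership flag per receive buffer. This is an illustration only: the structure is a hypothetical stand-in for whatever the adapter driver records for each tm_buf, and the real free path would also return the memory to the adapter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-buffer record maintained by the adapter driver. */
struct sketch_rbuf {
    int held_by_dd;   /* 1 while queued to the SCSI device driver */
};

/* Adapter driver hands buffer i to the device driver's receive routine. */
void sketch_pass_up(struct sketch_rbuf *bufs, size_t i)
{
    assert(bufs != NULL);
    bufs[i].held_by_dd = 1;
}

/* Device driver returns buffer i through the buffer-free routine. */
void sketch_buf_free(struct sketch_rbuf *bufs, size_t i)
{
    assert(bufs != NULL);
    bufs[i].held_by_dd = 0;
}

/* SCIOSTOPTGT handling: reclaim every buffer the device driver never
 * returned. Returns the number of buffers that were forcibly freed. */
size_t sketch_stop_target(struct sketch_rbuf *bufs, size_t n)
{
    size_t forced = 0;
    assert(bufs != NULL);
    for (size_t i = 0; i < n; i++) {
        if (bufs[i].held_by_dd) {
            bufs[i].held_by_dd = 0;
            forced++;
        }
    }
    return forced;
}
```

A nonzero result from sketch_stop_target indicates that the device driver did not meet its responsibility to free all buffers before issuing the stop.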
Internally, the devsw table has entry points for the ddconfig, ddopen, ddclose, dddump, ddioctl, and
ddstrategy routines. The SCSI device drivers pass their SCSI commands to the SCSI adapter device
driver by calling the SCSI adapter device driver ddstrategy routine. (This routine is unavailable to other
operating system programs due to the lack of a block-device special file.)
Access to the SCSI adapter device driver’s ddconfig, ddopen, ddclose, dddump, ddioctl, and
ddstrategy entry points by the SCSI device drivers is performed through the kernel services provided.
These include such services as fp_opendev, fp_close, fp_ioctl, devdump, and devstrategy.
Note: SCSI adapter-device-driver writers should be aware that system services providing interrupt and
timer services are unavailable for use in the dump routine. Kernel DMA services are assumed to be
available for use by the dump routine. The SCSI adapter device driver should be designed to ignore
extra DUMPINIT and DUMPSTART commands to the dddump entry point.
The DUMPQUERY option should return a minimum transfer size of 0 bytes, and a maximum transfer size
equal to the maximum transfer size supported by the SCSI adapter device driver.
Calls to the SCSI adapter device driver DUMPWRITE option should use the arg parameter as a pointer to
the sc_buf structure to be processed. Using this interface, a SCSI write command can be run on a
previously started (opened) target device. The uiop parameter is ignored by the SCSI adapter device
driver during the DUMPWRITE command. Spanned, or consolidated, commands are not supported using
the DUMPWRITE option. Gathered write commands are also not supported using the DUMPWRITE
option. No queuing of sc_buf structures is supported during dump processing because the dump routine
runs essentially as a subroutine call from the caller’s dump routine. Control is returned when the entire
sc_buf structure has been processed.
Attention: Both adapter-device-driver and device-driver writers should be aware that any error
occurring during the DUMPWRITE option causes the operation to be considered unsuccessful. Therefore, no error recovery is
employed during the DUMPWRITE. Return values from the call to the dddump routine indicate the
specific nature of the failure.
Successful completion of the selected operation is indicated by a 0 return value to the subroutine.
Unsuccessful completion is indicated by a return code set to one of the following values for the errno
global variable. The various sc_buf status fields, including the b_error field, are not set by the SCSI
adapter device driver at completion of the DUMPWRITE command. Error logging is, of necessity, not
supported during the dump.
v An errno value of EINVAL indicates that a request that was not valid was passed to the SCSI adapter
device driver, such as to attempt a DUMPSTART command before successfully executing a DUMPINIT
command.
v An errno value of EIO indicates that the SCSI adapter device driver was unable to complete the
command due to a lack of required resources or an I/O error.
v An errno value of ETIMEDOUT indicates that the adapter did not respond with completion status before
the passed command time-out value expired.
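The command-ordering rules for the dddump entry point can be sketched as a small state machine. The command codes are placeholders for the real DUMPINIT, DUMPSTART, and DUMPWRITE values, and rejecting a DUMPWRITE on a dump that has not been started is an assumption of this sketch rather than something the text states.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Placeholder command codes; the real values come from the system
 * headers. */
enum sketch_dump_cmd { SK_DUMPINIT, SK_DUMPSTART, SK_DUMPWRITE };

struct sketch_dump_state {
    int inited;    /* DUMPINIT has succeeded */
    int started;   /* DUMPSTART has succeeded */
};

/* Returns 0 on success or an errno value on failure. */
int sketch_dddump(struct sketch_dump_state *s, enum sketch_dump_cmd cmd)
{
    assert(s != NULL);
    switch (cmd) {
    case SK_DUMPINIT:
        s->inited = 1;         /* extra DUMPINIT commands are ignored */
        return 0;
    case SK_DUMPSTART:
        if (!s->inited)
            return EINVAL;     /* DUMPSTART before DUMPINIT */
        s->started = 1;        /* extra DUMPSTART commands are ignored */
        return 0;
    case SK_DUMPWRITE:
        if (!s->started)
            return EINVAL;     /* assumption: write needs a started dump */
        return 0;              /* a real driver issues the write here */
    }
    return EINVAL;
}
```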
The SCSI target-mode interface is intended to be used with the SCSI initiator-mode interface to provide
the equivalent of a full-duplex communications path between processor type devices. Both communicating
devices must support target-mode and initiator-mode. To work with the SCSI subsystem in this manner, an
attached device’s target-mode and initiator-mode interfaces must meet certain minimum requirements:
v The device’s target-mode interface must be capable of receiving and processing at least the following
SCSI commands:
– send
– request sense
– inquiry
The data returned by the inquiry command must set the peripheral device type field to processor
device. The device should support the vendor and product identification fields. Additional functional
SCSI requirements, such as SCSI message support, must be addressed by examining the detailed
functional specification of the SCSI initiator that the target-mode device is attached to.
v The attached device’s initiator mode interface must be capable of sending the following SCSI
commands:
– send
– request sense
In addition, the inquiry command should be supported by the attached initiator if it needs to identify
SCSI target devices. Additional functional SCSI requirements, such as SCSI message support, must be
addressed by examining the detailed functional specification of the SCSI target that the initiator-mode
device is attached to.
SCSI target mode in the SCSI subsystem does not attempt to implement any receive-data protocol, with
the exception of actions taken to prevent an application from excessive receive-data-buffer usage. Any
protocol required to maintain or otherwise manage the communications of data must be implemented in
user-supplied programs. The only delays in receiving data are those inherent in the SCSI subsystem and
the hardware environment in which it operates.
The SCSI target mode is capable of simultaneously receiving data from all attached SCSI IDs using SCSI
send commands. In target-mode, the host adapter is assumed to act as a single SCSI Logical Unit
Number (LUN) at its assigned SCSI ID. Therefore, only one logical connection is possible between each
attached SCSI initiator on the SCSI Bus and the host adapter. The SCSI subsystem is designed to be fully
capable of simultaneously sending SCSI commands in initiator-mode while receiving data in target-mode.
In both cases, the combination of the SCSI adapter device driver and the SCSI adapter must be capable
of stopping the flow of data from the initiator device.
For adapters allowing a shared pool of buffers to be used for all attached initiators’ data transfers, an
additional problem can result. If any single initiator instance is allowed to transfer data continually, the
entire shared pool of buffers can fill up. These filled-up buffers prevent other initiator instances from
transferring data. To solve this problem, the combination of the SCSI adapter device driver and the host
SCSI adapter must stop the flow of data from a particular initiator ID on the bus. This could include
disconnecting during the data phase for a particular ID but allowing other IDs to continue data transfer.
This could begin when the number of tm_buf structures on a target-mode instance’s tm_buf queue equals
the number of buffers allocated for this device. When a threshold percentage of the number of buffers is
processed and returned to the SCSI adapter device driver’s buffer-free routine, the ID can be enabled
again for the continuation of data transfer.
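One possible form of this per-initiator flow control is sketched below. The structure, the function names, and the resume_pct field are hypothetical; the text leaves the actual threshold percentage to the adapter driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flow-control state for one initiator ID. */
struct sketch_flow {
    unsigned allocated;    /* buffers allocated for this device */
    unsigned queued;       /* tm_bufs currently held by the device driver */
    unsigned returned;     /* buffers returned since the ID was disabled */
    unsigned resume_pct;   /* assumed re-enable threshold, in percent */
    int enabled;           /* 1 while the initiator may transfer data */
};

/* Called when a filled buffer is queued to the device driver. */
void sketch_on_receive(struct sketch_flow *f)
{
    assert(f != NULL && f->queued < f->allocated);
    f->queued++;
    if (f->queued == f->allocated) {
        f->enabled = 0;    /* all buffers in use: stop this ID */
        f->returned = 0;
    }
}

/* Called from the buffer-free routine for each returned buffer. */
void sketch_on_free(struct sketch_flow *f)
{
    assert(f != NULL && f->queued > 0);
    f->queued--;
    if (!f->enabled) {
        f->returned++;
        if (f->returned * 100 >= f->allocated * f->resume_pct)
            f->enabled = 1;    /* threshold reached: resume this ID */
    }
}
```

Because the state is kept per initiator ID, stopping one initiator in this way leaves the others free to continue transferring data.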
First, because the receive buffer routine is running on the hardware interrupt level, the SCSI device driver
must limit operations in order to limit routine processing time. In particular, the data copy, which occurs
because the data is queued ahead of the user read request, must not occur in the receive buffer routine.
Data copying in this routine will adversely affect system response time. Data copy is best performed in a
Second, the receive buffer routine is called at the SCSI adapter device driver hardware interrupt level, so
care must be taken when disabling interrupts. They must be disabled to the correct level in places in the
SCSI device driver’s lower run priority routines, which manipulate variables also modified in the receive
buffer routine. To allow the SCSI device driver to disable to the correct level, the SCSI adapter
device-driver writer must provide a configuration database attribute, named intr_priority, that defines the
interrupt class, or priority, that the adapter runs on. The SCSI device-driver configuration method should
pass this attribute to the SCSI device driver along with other configuration data for the device instance.
Third, the SCSI device-driver writer must follow any other general system rules for writing a routine that
must run in an interrupt environment. For example, the routine must not attempt to sleep or wait on I/O
operations. It can perform wake-up calls to allow the process level to handle those operations.
After the tm_buf structure has been passed to the SCSI device driver receive buffer routine, the SCSI
device driver is considered to be responsible for it. Responsibilities include processing the data and any
error conditions and also maintaining the next pointer for chained tm_buf structures. The SCSI device
driver’s responsibilities for the tm_buf structures end when it passes the structure back to the SCSI
adapter device driver.
Until the tm_buf structure is again passed to the SCSI device driver receive buffer routine, the SCSI
adapter device driver is considered responsible for it. The SCSI adapter device-driver writer must be
aware that during the time the SCSI device driver is responsible for the tm_buf structure, it is still possible
for the SCSI adapter device driver to access the structure’s contents. Access is possible because only one
copy of the structure is in memory, and only a pointer to the structure is passed to the SCSI device driver.
Note: Under no circumstances should the SCSI adapter device driver access the structure or modify its
contents while the SCSI device driver is responsible for it, or the other way around.
It is recommended that the SCSI device-driver writer implement a threshold level to wake up the process
level with available tm_buf structures. This way, processing for some of the buffers, including copying the
data to the user buffer, can be overlapped with time spent waiting for more data. It is also recommended
the writer implement a threshold level for these buffers to handle cases where the send command data
length exceeds the aggregate receive-data buffer space. A suggested threshold level is 25% of the
device’s total buffers. That is, when 25% or more of the number of buffers allocated for this device is
queued and no end to the send command is encountered, the SCSI device driver receive buffer routine
should wake the process level to process these buffers.
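The suggested 25% threshold reduces to a one-line test in the receive buffer routine. This sketch assumes the routine knows how many buffers are queued and whether the current buffer ends a send command; treating the end of a send command as an unconditional wake-up is an assumption.

```c
#include <assert.h>

/* Returns 1 if the receive buffer routine should wake the process level.
 * queued is the number of tm_bufs waiting; total is the number of
 * buffers allocated for the device. */
int sketch_should_wake(unsigned queued, unsigned total, int end_of_send)
{
    assert(total > 0);
    if (end_of_send)
        return 1;                 /* a complete send can satisfy a read */
    return queued * 4 >= total;   /* 25% or more of the buffers queued */
}
```

The multiplication avoids integer division, so the test stays exact when the buffer count is not a multiple of four.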
Note: Reserved fields must not be modified by the SCSI device driver, unless noted otherwise.
Nonreserved fields can be modified, except where noted otherwise.
1. The tm_correlator field is an optional field for the SCSI device driver. This field is a copy of the field
with the same name that was passed by the SCSI device driver in the SCIOSTARTTGT ioctl. The
SCSI device driver should use this field to speed the search for the target-mode device instance the
tm_buf structure is associated with. Alternatively, the SCSI device driver can combine the
tm_buf.user_id and tm_buf.adap_devno fields to find the associated device.
2. The adap_devno field is the device major and minor numbers of the adapter instance on which this
target mode device is defined. This field can be used to find the particular target-mode instance the
tm_buf structure is associated with.
Note: The SCSI device driver must not modify this field.
3. The data_addr field is the kernel space address where the data begins for this buffer.
4. The data_len field is the length of valid data in the buffer starting at the tm_buf.data_addr location in
memory.
5. The user_flag field is a set of bit flags that can be set to communicate information about this data
buffer to the SCSI device driver. Except where noted, one or more of the following flags can be set:
TM_HASDATA
Set to indicate a valid tm_buf structure.
TM_MORE_DATA
Set if more data is coming (that is, more tm_buf structures) for a particular send command.
This is only possible for adapters that support spanning the send command data across
multiple receive buffers. This flag cannot be used with the TM_ERROR flag.
TM_ERROR
Set if any error occurred on a particular send command. This flag cannot be used with the
TM_MORE_DATA flag.
6. The user_id field is set to the SCSI ID of the initiator that sent the data to this target mode instance. If
more than one adapter is used for target mode in this system, this ID might not be unique. Therefore,
this field must be used in combination with the tm_buf.adap_devno field to find the target-mode
instance this ID is associated with.
Note: The SCSI device driver must not modify this field.
7. The status_validity field contains the following bit flag:
SC_ADAPTER_ERROR
Indicates the tm_buf.general_card_status is valid.
8. The general_card_status field is a returned status field that gives a broad indication of the class of
error encountered by the adapter. This field is valid when its status-validity bit is set in the
tm_buf.status_validity field. The definition of this field is the same as that found in the sc_buf
structure definition, except the SC_CMD_TIMEOUT value is not possible and is never returned for a
target-mode transfer.
9. The next field is a tm_buf pointer that is either null, meaning this is the only or last tm_buf structure,
or else contains a non-null pointer to the next tm_buf structure.
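Items 1, 2, and 6 describe how the SCSI device driver locates the target-mode instance for a tm_buf structure. A lookup keyed on both the adap_devno and user_id values might be sketched as follows; the instance table and the field types are hypothetical, and a driver that stores an index or pointer in tm_correlator could skip the search entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record the device driver keeps per target-mode instance. */
struct sketch_tm_instance {
    unsigned long adap_devno;   /* adapter major and minor numbers */
    unsigned char user_id;      /* SCSI ID of the initiator */
};

/* Linear search keyed on both fields, because a SCSI ID alone might not
 * be unique when several adapters are used for target mode. Returns the
 * matching instance, or NULL if there is none. */
struct sketch_tm_instance *
sketch_find_instance(struct sketch_tm_instance *tab, size_t n,
                     unsigned long adap_devno, unsigned char user_id)
{
    assert(tab != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (tab[i].adap_devno == adap_devno && tab[i].user_id == user_id)
            return &tab[i];
    }
    return NULL;
}
```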
Each send command, no matter how short its data length, requires its own tm_buf structure. For host
SCSI adapters capable of spanning multiple receive-data buffers with data from a single send command,
the SCSI adapter device driver must set the TM_MORE_DATA flag in the tm_buf.user_flag fields of all
but the final tm_buf structure holding data for the send command. The SCSI device driver must be
designed to support the TM_MORE_DATA flag. Using this flag, the target-mode SCSI device driver can
associate multiple buffers with the single transfer they represent. The end of a send command will be the
boundary used by the SCSI device driver to satisfy a user read request.
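The use of the TM_MORE_DATA flag to find the end of a send command can be sketched as a walk over the tm_buf chain. The structure below is a reduced, hypothetical stand-in for tm_buf, and the flag bit values are placeholders for the real definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder flag bits standing in for TM_HASDATA and TM_MORE_DATA. */
#define SK_TM_HASDATA   0x1u
#define SK_TM_MORE_DATA 0x2u

/* Hypothetical, reduced stand-in for the tm_buf structure. */
struct sketch_tm_buf {
    unsigned user_flag;
    size_t data_len;
    struct sketch_tm_buf *next;
};

/* Sums data_len over the buffers of the first send command in the chain.
 * Sets *complete to 1 if a buffer without SK_TM_MORE_DATA was reached,
 * or to 0 if the chain ended while more data was still expected. */
size_t sketch_first_send_len(const struct sketch_tm_buf *b, int *complete)
{
    size_t total = 0;
    assert(complete != NULL);
    *complete = 0;
    for (; b != NULL; b = b->next) {
        total += b->data_len;
        if (!(b->user_flag & SK_TM_MORE_DATA)) {
            *complete = 1;   /* final buffer of this send command */
            break;
        }
    }
    return total;
}
```

A read routine could call a function of this kind to decide whether enough buffers have arrived to satisfy a user read request.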
The SCSI adapter device driver is responsible for sending the tm_buf structures for a particular initiator
SCSI ID to the SCSI device driver in the order they were received. The SCSI device driver is responsible
for processing these tm_buf structures in the order they were received. There is no particular ordering
implied in the processing of simultaneous send commands from different SCSI IDs, as long as the data
from an individual SCSI ID’s send command is processed in the order it was received.
The pointer to the tm_buf structure chain is passed by the SCSI adapter device driver to the SCSI device
driver’s receive buffer routine. The address of this routine is registered with the SCSI adapter device driver
by the SCSI device driver using the SCIOSTARTTGT ioctl. The duties of the receive buffer routine include
queuing the tm_buf structures and waking up a process-level routine (typically the SCSI device driver’s
read routine) to process the received data.
When the process-level SCSI device driver routine finishes processing one or more tm_buf structures, it
passes them to the SCSI adapter device driver’s buffer-free routine. The address of this routine is
registered with the SCSI device driver in an output field in the structure passed to the SCSI adapter device
driver SCIOSTARTTGT ioctl operation. The buffer-free routine must be a pinned routine the SCSI device
driver can directly access. The buffer-free routine is typically called directly from the SCSI device driver
buffer-handling routine. The SCSI device driver chains one or more tm_buf structures by using the next
field (a null value for the last tm_buf next field ends the chain). It then passes a pointer, which points to
the head of the chain, to the SCSI adapter device driver buffer-free routine. These tm_buf structures must
all be for the same target-mode instance. Also, the SCSI device driver must not modify the tm_buf.user_id
or tm_buf.adap_devno field.
The SCSI adapter device driver takes the tm_buf structures passed to its buffer-free routine and attempts
to make the described receive buffers available to the adapter for future data transfers. Because it is
desirable to keep as many buffers as possible available to the adapter, the SCSI device driver should pass
Every SCSI adapter device driver must support the IOCINFO ioctl operation. The structure to be returned
to the caller is the devinfo structure, including the scsi union definition for the SCSI adapter, which can be
found in the /usr/include/sys/devinfo.h file. The SCSI device driver should request the IOCINFO ioctl
operation (probably during its open routine) to get the maximum transfer size of the adapter.
Note: The SCSI adapter device driver ioctl operations can only be called from the process level. They
cannot be run from a call on any more favored priority levels. Attempting to call them from a more
favored priority level can result in a system crash.
Except where noted otherwise, the arg parameter for each of the ioctl operations described here must
contain a long integer. In this field, the least significant byte is the SCSI LUN and the next least significant
byte is the SCSI ID value. (The upper two bytes are reserved and should be set to 0.) This provides the
information required to allocate or deallocate resources and perform SCSI bus operations for the ioctl
operation requested.
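The layout of the arg parameter can be shown with a pair of packing helpers. The function names are hypothetical; only the byte positions are taken from the description above.

```c
#include <assert.h>

/* Packs a SCSI ID and LUN into the arg layout: LUN in the least
 * significant byte, SCSI ID in the next byte, upper two bytes zero. */
long sketch_pack_idlun(unsigned char scsi_id, unsigned char lun)
{
    return ((long)scsi_id << 8) | (long)lun;
}

/* Extracts the SCSI ID from a packed arg value. */
unsigned char sketch_arg_id(long arg)
{
    return (unsigned char)((arg >> 8) & 0xff);
}

/* Extracts the SCSI LUN from a packed arg value. */
unsigned char sketch_arg_lun(long arg)
{
    return (unsigned char)(arg & 0xff);
}
```

For example, SCSI ID 5 and LUN 2 pack to the value 0x0502.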
In the SCIOCMD call, adapter is a file descriptor and iocmd is an sc_passthru structure as defined in the
/usr/include/sys/scsi.h header file. The SCSI ID and LUN should be placed in the sc_passthru
parameter block.
The SCSI status byte and the adapter status bytes are returned through the sc_passthru
structure. If the SCIOCMD operation returns a value of -1 and the errno global variable is set to a
nonzero value, the requested operation has failed. In this case, the caller should evaluate the
returned status bytes to determine why the operation failed and what recovery actions should be
taken.
If a SCIOCMD operation fails because a field in the sc_passthru structure has an invalid value,
then the subroutine will return a value of -1 and set the errno global variable to EINVAL. In
addition the einval_arg field will be set to the field number (starting with 1 for the version field) of
the field that had an invalid value. A value of 0 for the einval_arg field indicates no additional
information on the failure is available.
The devinfo structure defines the maximum transfer size for the command. If an attempt is made
to transfer more than the maximum, a value of -1 is returned and the errno global variable set to a
value of EINVAL. Refer to the Small Computer System Interface (SCSI) Specification for the
applicable device to get request sense information.
If the FCP SCIOCMD ioctl operation completes successfully, then the adap_set_flags field might
have the SC_RET_ID flag set. This flag is set only if the world_wide_name and node_name
fields were provided in the ioctl call and the FC adapter driver detects that the scsi_id field of this
device has changed. The scsi_id field will contain the new scsi_id value.
The version field of the scsi_passthru structure can be set to the value of SC_VERSION_2 in
/usr/include/sys/scsi.h or SCSI_VERSION_2 in /usr/include/sys/scsi_buf.h, and the user can
provide the following fields:
v variable_cdb_ptr - pointer to a buffer that contains the SCSI cdb variable.
v variable_cdb_length - the length of the variable cdb to which the variable_cdb_ptr points.
When the SCIOCMD ioctl request with the version field set to SCSI_VERSION_2 completes and the
device did not fully satisfy the request, the residual field indicates left over data. If the request
completes successfully, the residual field indicates the device does not have all the requested
data. If the request did not complete successfully, check the status_validity to see whether a
valid SCSI bus problem exists. If a valid SCSI bus problem exists, the residual field indicates the
number of bytes by which the device failed to complete the request.
For more information, see SCIOCMD SCSI Adapter Device Driver ioctl Operation in AIX 5L
Version 5.3 Technical Reference: Kernel and Subsystems Volume 2.
SCIOHALT
This operation halts outstanding transactions to this ID/LUN device and causes the SCSI adapter
device driver to stop accepting transactions for this device. This situation remains in effect until the
SCSI device driver sends another transaction with the SC_RESUME flag set (in the sc_buf.flags
field) for this ID/LUN combination. The SCIOHALT ioctl operation causes the SCSI adapter device
driver to fail the command in progress, if any, as well as all queued commands for the device with
a return value of ENXIO in the sc_buf.bufstruct.b_error field. If an SCIOSTART operation has
not been previously issued, this command fails.
The following values for the errno global variable are supported:
0 Indicates successful completion.
Note: In normal system operation, this command should not be issued, as it would force the
device to drop a SCSI reservation another initiator (and, hence, another system) might
have. If an SCIOSTART operation has not been previously issued, this command is
unsuccessful.
The following values for the errno global variable are supported:
0 Indicates successful completion.
EIO Indicates an unrecovered I/O error occurred.
EINVAL
Indicates that the selected SCSI ID and LUN have not been started.
ETIMEDOUT
Indicates that the command did not complete.
SCIOGTHW
This operation is only supported by SCSI adapter device drivers that support gathered write
commands. The purpose of the operation is to indicate support for gathered writes to SCSI device
drivers that intend to use this facility. If the SCSI adapter device driver does not support gathered
write commands, it must fail the operation. The SCSI device driver should call this operation from
its open routine for a particular device instance. If the operation is unsuccessful, the SCSI device
driver should not attempt to run a gathered write command.
The arg parameter to the SCIOGTHW operation is set to null by the caller to indicate that no input
parameter is passed.
The following values for the errno global variable are supported:
0 Indicates successful completion and in particular that the adapter driver supports gathered
writes.
EINVAL
Indicates that the SCSI adapter device driver does not support gathered writes.
Only a kernel process or device driver can call these ioctls. If attempted by a user process, the ioctl will
fail, and the errno global variable will be set to EPERM.
Only a kernel process or device driver can invoke these ioctls. If attempted by a user process, the ioctl will
fail, and the errno global variable will be set to EPERM.
The event registration performed by this ioctl operation is allowed once per device session. Only the first
SCIOEVENT ioctl operation is accepted after the device session is opened. Succeeding SCIOEVENT ioctl
operations will fail, and the errno global variable will be set to EINVAL. The event registration is canceled
automatically when the device session is closed.
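The once-per-session rule for event registration can be sketched as follows. The session structure and function names are hypothetical; a real adapter driver would also record the registered callback and the ID and LUN it applies to.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-session state kept by the adapter driver. */
struct sketch_session {
    int event_registered;   /* 1 after the first accepted SCIOEVENT */
};

/* Handles an SCIOEVENT request. Returns 0 on success, or EINVAL if an
 * event registration already exists for this session. */
int sketch_scioevent(struct sketch_session *s)
{
    assert(s != NULL);
    if (s->event_registered)
        return EINVAL;
    s->event_registered = 1;
    return 0;
}

/* Closing the session cancels the registration automatically. */
void sketch_session_close(struct sketch_session *s)
{
    assert(s != NULL);
    s->event_registered = 0;
}
```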
The arg parameter to the SCIOEVENT ioctl operation should be set to the address of an sc_event_struct
structure, which is defined in the /usr/include/sys/scsi.h file. The following parameters are supported:
id The caller sets id to the SCSI ID of the attached SCSI target device for initiator-mode.
For target-mode, the caller sets the id to the SCSI ID of the attached SCSI initiator
device.
lun The caller sets the lun field to the SCSI LUN of the attached SCSI target device for
initiator-mode. For target-mode, the caller sets the lun field to 0.
The following values for the errno global variable are supported:
Related Information
Logical File System Kernel Services
Technical References
The following reference articles can be found in AIX 5L Version 5.3 Technical Reference: Kernel and
Subsystems Volume 2:
v scdisk SCSI Device Driver
v scsidisk SCSI Device Driver
v SCSI Adapter Device Driver
v SCIOCMD SCSI Adapter Device Driver ioctl Operation
v SCIODIAG (Diagnostic) SCSI Adapter Device Driver ioctl Operation
v SCIODNLD (Download) SCSI Adapter Device Driver ioctl Operation
v SCIOEVENT (Event) SCSI Adapter Device Driver ioctl Operation
v SCIOGTHW (Gathered Write) SCSI Adapter Device Driver ioctl Operation
v SCIOHALT (HALT) SCSI Adapter Device Driver ioctl Operation
v SCIOINQU (Inquiry) SCSI Adapter Device Driver ioctl Operation
v SCIOREAD (Read) SCSI Adapter Device Driver ioctl Operation
v SCIORESET (Reset) SCSI Adapter Device Driver ioctl Operation
v SCIOSTART (Start SCSI) SCSI Adapter Device Driver ioctl Operation
v SCIOSTARTTGT (Start Target) SCSI Adapter Device Driver ioctl Operation
v SCIOSTOP (Stop Device) SCSI Adapter Device Driver ioctl Operation
v SCIOSTOPTGT (Stop Target) SCSI Adapter Device Driver ioctl Operation
v SCIOSTUNIT (Start Unit) SCSI Adapter Device Driver ioctl Operation
v SCIOTRAM (Diagnostic) SCSI Adapter Device Driver ioctl Operation
v SCIOTUR (Test Unit Ready) SCSI Adapter Device Driver ioctl Operation
The adapter device driver is designed to shield you from having to communicate directly with the system
I/O hardware. This gives you the ability to successfully write a device driver without having a detailed
knowledge of the system hardware. You can look at the subsystem as a two-tiered structure in which the
adapter device driver is the bottom or supporting layer. As a programmer, you need only worry about the
upper layer. This chapter only discusses writing a device driver, because the adapter device driver is
already provided.
The adapter device driver, or lower layer, is responsible only for the communications to and from the bus,
and any error logging and recovery. The upper layer is responsible for all of the device-specific
commands. The device driver should handle all commands directed towards its specific device by building
the necessary sequence of I/O requests to the adapter device driver in order to properly communicate with
the device.
These I/O requests contain the commands that are needed by the device. One important aspect to note is
that the device driver cannot access any of the adapter resources and should never try to pass the
commands directly to the adapter, since it has absolutely no knowledge of the meaning of those
commands.
The device driver should also process the various required reservations and releases needed for the
device. The device driver is notified through the iodone kernel service once the adapter has completed
the processing of the command. The device driver should then notify its calling process that the request
has completed processing through the iodone kernel service.
A device driver does not need to access the diagnostic commands. Commands received from the device
driver through the strategy routine of the adapter are processed from a queue. Once the command has
completed, the device driver is notified through the iodone kernel service.
The adapter device driver does not contain the ddread and ddwrite entry points, but does contain the
ddconfig, ddopen, ddclose, dddump, and ddioctl entry points.
Therefore, the adapter device driver’s entry in the kernel devsw table contains only those entries plus an
additional ddstrategy entry point. This ddstrategy routine is the path that the device driver uses to pass
commands to the adapter device driver. Access to these entry points is possible through the following kernel
services:
v fp_open
v fp_close
v devdump
v fp_ioctl
v devstrat
The FCP adapter is accessed by the device driver through the /dev/fscsi# special files, where # indicates
ascending numbers 0, 1, 2, and so on. The adapter is designed so that multiple devices on the same
adapter can be accessed at the same time.
The iSCSI adapter is accessed by the device driver through the /dev/iscsin special files, where n indicates
ascending numbers 0, 1, 2, and so on. The adapter is designed so that multiple devices on the same
adapter can be accessed at the same time.
The Virtual SCSI Client adapter is accessed by the device driver through the /dev/vscsiX special files,
where X indicates ascending numbers 0, 1, 2, and so on. The adapter is designed such that multiple
devices on the same adapter can be accessed at the same time.
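The naming scheme for the three kinds of adapter special files can be sketched as a small helper. This is an illustration only: the ex_adapter_kind enumeration and the ex_adapter_path function are hypothetical names that do not come from any AIX header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical enum for this sketch; not taken from any AIX header. */
enum ex_adapter_kind { EX_FCP, EX_ISCSI, EX_VSCSI };

/*
 * Build the special-file path for adapter instance n, for example
 * "/dev/fscsi0". Returns 0 on success, or -1 if the kind is unknown
 * or the buffer is too small.
 */
int ex_adapter_path(enum ex_adapter_kind kind, unsigned n,
                    char *buf, size_t len)
{
    const char *prefix;
    int written;

    assert(buf != NULL);
    switch (kind) {
    case EX_FCP:   prefix = "/dev/fscsi"; break;
    case EX_ISCSI: prefix = "/dev/iscsi"; break;
    case EX_VSCSI: prefix = "/dev/vscsi"; break;
    default:       return -1;
    }
    written = snprintf(buf, len, "%s%u", prefix, n);
    return (written < 0 || (size_t)written >= len) ? -1 : 0;
}
```

A device driver would typically hand the resulting path to the fp_open kernel service when it opens the adapter.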
For additional information on spanned and gathered write commands, see “Understanding the Execution of
FCP, iSCSI, and Virtual SCSI Client Initiator I/O Requests” on page 284.
scsi_buf Structure
The I/O requests made from the device driver to the adapter device driver are completed through the use
of the scsi_buf structure, which is defined in the /usr/include/sys/scsi_buf.h header file. This structure,
which is similar to the buf structure in other drivers, is passed between the two subsystem drivers through
the strategy routine. The following is a brief description of the fields contained in the scsi_buf structure:
v Reserved fields should be set to a value of 0, except where noted.
v The bufstruct field contains a copy of the standard buf buffer structure that documents the I/O request.
Included in this structure, for example, are the buffer address, byte count, and transfer direction. The
b_work field in the buf structure is reserved for use by the adapter device driver. The current definition
of the buf structure is in the /usr/include/sys/buf.h include file.
v The bp field points to the original buffer structure received by the device driver from the caller, if any.
This can be a chain of entries in the case of spanned transfers (commands that transfer data from or to
more than one system-memory buffer). A null pointer indicates a nonspanned transfer. The null value
specifically tells the adapter device driver that all the information needed to perform the DMA data
transfer is contained in the bufstruct fields of the scsi_buf structure.
v The scsi_command field, defined as a scsi_cmd structure, contains, for example, the SCSI command
length, SCSI command, and a flag variable:
– The scsi_length field is the number of bytes in the actual SCSI command. This is normally 6, 10, 12,
or 16 (decimal).
– The FCP_flags field contains the following bit flags:
v The adapter_status field is an output parameter that is valid when its status_validity bit is nonzero.
The scsi_buf.bufstruct.b_error field should be set to EIO anytime the adapter_status field is valid.
This field contains generic adapter card status. It is intentionally general in coverage so that it can
report error status from any typical adapter.
If an error is detected during execution of a command, and the error prevented the command from
actually being sent to the transport layer by the adapter, then the error should be processed or
recovered, or both, by the adapter device driver.
If it is recovered successfully by the adapter device driver, the error is logged, as appropriate, but is not
reflected in the adapter_status byte. If the error cannot be recovered by the adapter device driver, the
appropriate adapter_status bit is set and the scsi_buf structure is returned to the device driver for
further processing.
If an error is detected after the command was actually sent to the device, then it should be processed
or recovered, or both, by the device driver.
For error logging, the adapter device driver logs transport layer and adapter-related conditions, and the
device driver logs device-related errors. In the following description, a capital letter (A) after the error
name indicates that the adapter device driver handles error logging. A capital letter (H) indicates that the
device driver handles error logging.
Some of the following error conditions indicate a device failure. Others are transport layer or
adapter-related.
SCSI_HOST_IO_BUS_ERR (A)
The system I/O transport layer generated or detected an error during a DMA or Programmed
I/O (PIO) transfer.
SCSI_TRANSPORT_FAULT (H)
The transport protocol or hardware was unsuccessful.
SCSI_CMD_TIMEOUT (H)
The command timed out before completion.
SCSI_NO_DEVICE_RESPONSE (H)
The target device did not respond to selection phase.
SCSI_ADAPTER_HDW_FAILURE (A)
The adapter indicated an onboard hardware failure.
SCSI_ADAPTER_SFW_FAILURE (A)
The adapter indicated microcode failure.
SCSI_FUSE_OR_TERMINAL_PWR (A)
The adapter indicated a blown terminator fuse or bad termination.
SCSI_TRANSPORT_RESET (A)
The adapter indicated the transport layer has been reset.
SCSI_WW_NAME_CHANGE (A)
The adapter indicated the device at this SCSI ID has a new world wide name.
SCSI_TRANSPORT_BUSY (A)
The adapter indicated the transport layer is busy.
SCSI_TRANSPORT_DEAD (A)
The adapter indicated the transport layer is currently inoperative and is likely to remain this way
for an extended time.
v The add_status field contains additional device status. For devices, this field contains the Response
code returned.
v When the device driver queues multiple transactions to a device, the adap_q_status field indicates
whether or not the adapter driver has cleared its queue for this device after an error has occurred. The
Note: Commands with the value of SC_NO_Q for the q_tag_msg field (except for request sense
commands) should not be queued to a device whose queue contains a command with another
value for q_tag_msg. If commands with the SC_NO_Q value (except for request sense) are sent to
the device, then the device driver must make sure that no active commands are using different
values for q_tag_msg. Similarly, the device driver must also make sure that a command with a
q_tag_msg value of SC_ORDERED_Q, SC_HEAD_Q, or SC_SIMPLE_Q is not sent to a device that has a
command with the q_tag_msg field of SC_NO_Q.
v The flags field contains bit flags sent from the device driver to the adapter device driver. The following
flags are defined:
SC_RESUME
When set, means the adapter device driver should resume transaction queuing for this ID/LUN.
Error recovery is complete after a SCIOLHALT operation, check condition, or severe transport
error. This flag is used to restart the adapter device driver following a reported error.
SC_DELAY_CMD
When set, means the adapter device driver should delay sending this command (following a
reset or BDR to this device) by at least the number of seconds specified to the adapter device
driver in its configuration information. For devices that do not require this function, this flag
should not be set.
SC_Q_CLR
When set, means the adapter driver should clear its transaction queue for this ID/LUN. The
transaction containing this flag setting does not require an actual command in the scsi_buf
because it is flushed back to the device driver with the rest of the transactions for this ID/LUN.
However, this transaction must have the SCSI ID field (scsi_buf.scsi_id) and the LUN field
(scsi_buf.lun_id) filled in with the device’s SCSI ID and LUN. This flag is valid only during error
recovery of a check condition or command terminated at a command tag queuing device when
the SC_DID_NOT_CLR_Q flag is set in the scsi_buf.adap_q_status field.
SC_Q_RESUME
When set, means that the adapter driver should resume its halted transaction queue for this
ID/LUN. The transaction containing this flag setting does not require an actual command to be
sent to the adapter driver. However, this transaction must have the SCSI ID field
(scsi_buf.scsi_id) and the LUN field (scsi_buf.lun_id) filled in with the device’s SCSI ID and
logical unit number (LUN). If the transaction containing this flag setting is the first issued by the
device driver after it receives an error (indicating that the adapter driver’s queue is halted), then
the SC_RESUME flag must be set also.
SC_CLEAR_ACA
When set, means the SCSI adapter driver should issue a Clear ACA task management request
for this ID/LUN. This flag should be used in conjunction with either the SC_Q_CLR or
SC_Q_RESUME flags to clear or resume the SCSI adapter driver’s queue for this device. If
neither of these flags is used, then this transaction is treated as if the SC_Q_RESUME flag is
also set. The transaction containing the SC_CLEAR_ACA flag setting does not require an
actual SCSI command in the scsi_buf. If this transaction contains a SCSI command then it will
be processed depending on whether SC_Q_CLR or SC_Q_RESUME is set. This transaction
must have the SCSI ID field (scsi_buf.scsi_id) and the LUN field (scsi_buf.lun_id) filled in
with the device’s SCSI ID and LUN. This flag is valid only during error recovery of a check
condition or command terminated at a command tag queuing device.
SC_TARGET_RESET
When set, means the SCSI adapter driver should issue a Target Reset task management
request for this ID/LUN. This flag should be used in conjunction with the SC_Q_CLR flag.
The transaction containing this flag setting does allow an actual command to be sent to the
adapter driver. However, this transaction must have the SCSI ID field (scsi_buf.scsi_id) filled in
with the device’s SCSI ID. If the transaction containing this flag setting is the first issued by the
device driver after it receives an error (indicating that the adapter driver’s queue is halted), then
the SC_RESUME flag must be set also.
SC_LUN_RESET
When set, means the SCSI adapter driver should issue a Lun Reset task management request
for this ID/LUN. This flag should be used in conjunction with the SC_Q_CLR flag. The
transaction containing this flag setting does allow an actual command to be sent to the adapter
driver. However, this transaction must have the SCSI ID field (scsi_buf.scsi_id) and the
LUN field (scsi_buf.lun_id) filled in with the device’s SCSI ID and logical unit number (LUN). If
the transaction containing this flag setting is the first issued by the device driver after it receives
an error (indicating that the adapter driver’s queue is halted), then the SC_RESUME flag must
be set also.
v The dev_flags field contains additional values sent from the FCP device driver to the FCP adapter
device driver. This field is not used for iSCSI or Virtual SCSI device drivers. The following values are
defined:
FC_CLASS1
When set, this tells the SCSI adapter driver that it should issue this request as a Fibre Channel
Class 1 request. If the SCSI adapter driver does not support this class, then it will fail the
scsi_buf with an error of EINVAL. If no Fibre Channel Class is specified in the scsi_buf then the
SCSI adapter will default to a Fibre Channel Class.
FC_CLASS2
When set, this tells the SCSI adapter driver that it should issue this request as a Fibre Channel
Class 2 request. If the SCSI adapter driver does not support this class, then it will fail the
scsi_buf with an error of EINVAL. If no Fibre Channel Class is specified in the scsi_buf then the
SCSI adapter will default to a Fibre Channel Class.
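The rules for the SC_Q_CLR, SC_Q_RESUME, SC_CLEAR_ACA, and SC_RESUME flags described earlier in this list can be sketched as follows. This is a minimal illustration under stated assumptions: the bit values are placeholders (the real definitions are in /usr/include/sys/scsi_buf.h), and ex_scsi_buf and ex_prepare_recovery are hypothetical stand-ins, not part of the subsystem. The sketch adds SC_RESUME only in the SC_Q_RESUME case, because that is the case for which the text states the requirement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit values for illustration only. A real driver takes the
 * definitions from /usr/include/sys/scsi_buf.h. */
#define SC_RESUME     0x01u
#define SC_Q_CLR      0x02u
#define SC_Q_RESUME   0x04u
#define SC_CLEAR_ACA  0x08u

/* Reduced stand-in for struct scsi_buf with only the fields used here. */
struct ex_scsi_buf {
    uint64_t scsi_id;
    uint64_t lun_id;
    unsigned flags;
};

/*
 * Prepare a recovery transaction that carries no command for one ID/LUN.
 * clear_queue selects SC_Q_CLR; otherwise SC_Q_RESUME is used.
 */
void ex_prepare_recovery(struct ex_scsi_buf *sb, uint64_t scsi_id,
                         uint64_t lun_id, int clear_queue, int clear_aca,
                         int first_after_error)
{
    assert(sb != NULL);
    sb->scsi_id = scsi_id;      /* both fields must identify the device */
    sb->lun_id = lun_id;
    sb->flags = clear_queue ? SC_Q_CLR : SC_Q_RESUME;
    if (clear_aca)
        sb->flags |= SC_CLEAR_ACA;
    /* SC_RESUME accompanies SC_Q_RESUME on the first transaction
     * issued after an error is reported. */
    if (!clear_queue && first_after_error)
        sb->flags |= SC_RESUME;
}
```

Choosing SC_Q_RESUME whenever SC_Q_CLR is not requested mirrors the statement that an SC_CLEAR_ACA transaction with neither flag is treated as if SC_Q_RESUME were set.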
The adapter driver’s strategy routine validates all of the information contained in the scsi_buf structure
and also performs any necessary queuing of the transaction request. If no queuing is necessary, the
adapter driver’s start subroutine is called.
When an interrupt occurs, the adapter driver’s interrupt routine fills in the status_validity field and the
appropriate scsi_status or adapter_status field of the scsi_buf structure. The bufstruct.b_resid field is
also filled in with the value of nontransferred bytes. The adapter driver’s interrupt routine then passes this
newly filled in scsi_buf structure to the iodone kernel service, which then signals the device driver’s
iodone subroutine. The adapter driver’s start routine is also called from the interrupt routine to process
any additional transactions on the queue.
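The bookkeeping that the interrupt routine performs before it calls the iodone kernel service can be sketched as follows. The structures are reduced, hypothetical stand-ins for the buf and scsi_buf structures, and the status_validity bit values are placeholders. Setting b_error to EIO when scsi_status is valid is an assumption that mirrors the rule stated for adapter_status.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Placeholder values; the real status_validity bits are defined in
 * /usr/include/sys/scsi_buf.h. */
#define SC_SCSI_ERROR     0x01u   /* scsi_status holds valid data */
#define SC_ADAPTER_ERROR  0x02u   /* adapter_status holds valid data */

/* Reduced stand-ins for struct buf and struct scsi_buf. */
struct ex_buf {
    int      b_error;
    unsigned b_bcount;   /* bytes requested */
    unsigned b_resid;    /* bytes not transferred */
};
struct ex_scsi_buf {
    struct ex_buf bufstruct;
    unsigned char status_validity;
    unsigned char scsi_status;
    unsigned      adapter_status;
};

/* Record the outcome of one command, as the interrupt routine would
 * before handing the structure to the iodone kernel service. */
void ex_record_completion(struct ex_scsi_buf *sb, unsigned adapter_status,
                          unsigned char scsi_status, unsigned bytes_done)
{
    assert(sb != NULL && bytes_done <= sb->bufstruct.b_bcount);
    sb->status_validity = 0;
    sb->scsi_status = 0;
    sb->adapter_status = 0;
    sb->bufstruct.b_error = 0;
    if (adapter_status != 0) {
        sb->status_validity |= SC_ADAPTER_ERROR;
        sb->adapter_status = adapter_status;
        sb->bufstruct.b_error = EIO;  /* required when adapter_status is valid */
    }
    if (scsi_status != 0) {
        sb->status_validity |= SC_SCSI_ERROR;
        sb->scsi_status = scsi_status;
        sb->bufstruct.b_error = EIO;  /* assumed to mirror the rule above */
    }
    sb->bufstruct.b_resid = sb->bufstruct.b_bcount - bytes_done;
}
```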
The device driver’s iodone routine should then process all of the applicable fields in the queued scsi_buf
structure for any errors and attempt error recovery if necessary. The device driver should then dequeue
the scsi_buf structure and then pass a pointer to the structure back to the iodone kernel service so that it
can notify the originator of the request.
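One decision the device driver’s iodone routine makes is whether it, rather than the adapter device driver, should log a reported adapter_status condition. The following sketch encodes the (A) and (H) markings from the adapter_status list earlier in this section. The numeric values and the reduced ex_scsi_buf structure are placeholders for illustration; the real definitions are in the scsi_buf.h header file.

```c
#include <assert.h>
#include <stddef.h>

#define SC_ADAPTER_ERROR 0x02u   /* placeholder value */

/* Placeholder numbering; only the names follow the list in the text. */
enum ex_adapter_status {
    SCSI_HOST_IO_BUS_ERR = 1,
    SCSI_TRANSPORT_FAULT,
    SCSI_CMD_TIMEOUT,
    SCSI_NO_DEVICE_RESPONSE,
    SCSI_ADAPTER_HDW_FAILURE,
    SCSI_ADAPTER_SFW_FAILURE,
    SCSI_FUSE_OR_TERMINAL_PWR,
    SCSI_TRANSPORT_RESET,
    SCSI_WW_NAME_CHANGE,
    SCSI_TRANSPORT_BUSY,
    SCSI_TRANSPORT_DEAD
};

/* Reduced stand-in for struct scsi_buf. */
struct ex_scsi_buf {
    unsigned char status_validity;
    unsigned      adapter_status;
};

/*
 * Returns 1 when the condition is marked (H), so the device driver logs
 * it. Returns 0 when it is marked (A) or when adapter_status is not valid.
 */
int ex_device_driver_logs(const struct ex_scsi_buf *sb)
{
    assert(sb != NULL);
    if (!(sb->status_validity & SC_ADAPTER_ERROR))
        return 0;
    switch (sb->adapter_status) {
    case SCSI_TRANSPORT_FAULT:
    case SCSI_CMD_TIMEOUT:
    case SCSI_NO_DEVICE_RESPONSE:
        return 1;
    default:
        return 0;
    }
}
```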
FCP, iSCSI, and Virtual SCSI Client Adapter Device Driver Routines
This section describes the following routines:
v config
v open
v close
v openx
v strategy
v ioctl
v start
v interrupt
config Routine
The config routine performs all of the processing needed to configure, unconfigure, and read Vital Product
Data (VPD) for the adapter. When this routine is called to configure an adapter, it performs the required
checks and building of data structures needed to prepare the adapter for the processing of requests.
When asked to unconfigure or terminate an adapter, this routine deallocates any structures defined for the
adapter and marks the adapter as unconfigured. This routine can also be called to return the Vital Product
Data for the adapter, which contains information that is used to identify the serial number, change level, or
part number of the adapter.
open Routine
The open routine establishes a connection between a special file and a file descriptor. This file descriptor
is the link to the special file that is the access point to a device and is used by all subsequent calls to
perform I/O requests to the device. Interrupts are enabled and any data structures needed by the adapter
driver are also initialized.
close Routine
The close routine marks the adapter as closed and disables all future interrupts, which causes the driver
to reject all future requests to this adapter.
openx Routine
The openx routine allows a process with the proper authority to open the adapter in diagnostic mode. If
the adapter is already open in either normal or diagnostic mode, the openx subroutine has a return value
of -1. Improper authority results in an errno value of EPERM, while an already open error results in an
errno value of EACCES. If the adapter is in diagnostic mode, only the close and ioctl routines are allowed.
All other routines return a value of -1 and an errno value of EACCES.
While in diagnostics mode, the adapter can run diagnostics, run wrap tests, and download microcode. The
openx routine is called with an ext parameter that contains the adapter mode and the SC_DIAGNOSTIC
value, both of which are defined in the sys/scsi.h header file.
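The return values described for the openx routine can be modeled with a small state machine. This is a sketch only: the ex_ names are hypothetical, and the order in which authority and open state are checked is an assumption, because the text does not specify it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum ex_mode  { EX_CLOSED, EX_NORMAL, EX_DIAGNOSTIC };
enum ex_entry { EX_CLOSE, EX_IOCTL, EX_STRATEGY, EX_DUMP };

struct ex_adapter { enum ex_mode mode; };

/* Try to enter diagnostic mode. Returns 0, or -1 with *err set. */
int ex_open_diagnostic(struct ex_adapter *ap, int has_authority, int *err)
{
    assert(ap != NULL && err != NULL);
    if (!has_authority) {           /* order of the two checks is assumed */
        *err = EPERM;
        return -1;
    }
    if (ap->mode != EX_CLOSED) {    /* already open, normal or diagnostic */
        *err = EACCES;
        return -1;
    }
    ap->mode = EX_DIAGNOSTIC;
    *err = 0;
    return 0;
}

/* In diagnostic mode only the close and ioctl routines are accepted. */
int ex_entry_allowed(const struct ex_adapter *ap, enum ex_entry entry,
                     int *err)
{
    assert(ap != NULL && err != NULL);
    if (ap->mode == EX_DIAGNOSTIC &&
        entry != EX_CLOSE && entry != EX_IOCTL) {
        *err = EACCES;
        return -1;
    }
    *err = 0;
    return 0;
}
```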
strategy Routine
The strategy routine is the link between the device driver and the adapter device driver for all normal I/O
requests. Whenever the device driver receives a call, it builds an scsi_buf structure with the correct
parameters and then passes it to this routine, which in turn queues up the request if necessary. Each
request on the pending queue is then processed by building the necessary commands required to carry
out the request. When the command has completed, the device driver is notified through the iodone
kernel service.
ioctl Routine
The ioctl routine allows various diagnostic and nondiagnostic adapter operations. Operations include the
following:
v IOCINFO
start Routine
The start routine is responsible for checking all pending queues and issuing commands to the adapter.
When a command is issued to the adapter, the scsi_buf is converted into the adapter-specific request
needed for the scsi_buf. At this time, the bufstruct.b_addr for the scsi_buf will be mapped for DMA.
When the adapter-specific request is completed, the adapter will be notified of this request.
interrupt Routine
The interrupt routine is called whenever the adapter posts an interrupt. When this occurs, the interrupt
routine will find the scsi_buf corresponding to this interrupt. The buffer for the scsi_buf will be unmapped
from DMA. If an error occurred, the status_validity, scsi_status, and adapter_status fields will be set
accordingly. The bufstruct.b_resid field will be set with the number of nontransferred bytes. The interrupt
handler then runs the iodone kernel service against the scsi_buf, which will send the scsi_buf back to
the device driver which originated it.
IOCINFO for FCP Adapters
This operation lets an FCP device driver obtain important information about an FCP adapter, including the
adapter’s SCSI ID, the maximum data transfer size in bytes, and the FC topology to which the adapter is
connected. By knowing the maximum data transfer size, an FCP device driver can control several different
devices on several different adapters. This operation returns a devinfo structure as defined in the
sys/devinfo.h header file with the device type DD_BUS and subtype DS_FCP. The following is an
example of a call to obtain the information:
rc = fp_ioctl(fp, IOCINFO, &infostruct, NULL);
where fp is a pointer to a file structure and infostruct is a devinfo structure. A non-zero rc value indicates
an error. Note that the devinfo structure is a union of several structures and that fcp is the structure that
applies to the adapter. For example, the maximum transfer size value is contained in the
infostruct.un.fcp.max_transfer variable and the card ID is contained in infostruct.un.fcp.scsi_id.
For iSCSI adapters, the call takes the same form: fp is a pointer to a file structure, infostruct is a devinfo
structure, and a non-zero rc value indicates an error. Within the devinfo union, iscsi is the structure that
applies to the adapter. For example, the maximum transfer size value is contained in the
infostruct.un.iscsi.max_transfer variable.
For Virtual SCSI Client adapters, the call again takes the same form. Within the devinfo union, vscsi is the
structure that applies to the adapter. For example, the maximum transfer size value is contained in the
infostruct.un.vscsi.max_transfer variable.
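One plausible use of the max_transfer value is to work out how many adapter requests a large transfer needs. The following sketch assumes a reduced, hypothetical stand-in for the devinfo structure that contains only the union members named above. The ex_requests_needed helper is not part of any AIX interface.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for struct devinfo: only the union members named in
 * the text are shown, and the layout is hypothetical. */
struct ex_devinfo {
    union {
        struct { unsigned long long scsi_id; unsigned max_transfer; } fcp;
        struct { unsigned max_transfer; } iscsi;
        struct { unsigned max_transfer; } vscsi;
    } un;
};

/* Number of adapter requests needed to move nbytes when no single
 * request may exceed max_transfer bytes. */
unsigned long ex_requests_needed(unsigned long nbytes, unsigned max_transfer)
{
    assert(max_transfer > 0);
    return nbytes / max_transfer + (nbytes % max_transfer != 0);
}
```

For an FCP adapter the second argument would be infostruct.un.fcp.max_transfer as returned by the IOCINFO call.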
SCIOLSTART
This operation opens a logical path to the device and causes the adapter device driver to allocate and
initialize all of the data areas needed for the device. The SCIOLSTOP operation should be issued when
those data areas are no longer needed. This operation should be issued before any nondiagnostic
operation except for IOCINFO. The following is a typical call:
rc = fp_ioctl(fp, SCIOLSTART, &sciolst, NULL);
For iSCSI adapters, the version field of the scsi_sciolst structure must be set to the value of SCSI_VERSION_1
(defined in the /usr/include/sys/scsi_buf.h file). In addition, iSCSI adapters require the caller to set the
following fields:
v lun_id to the device’s LUN ID
v parms.iscsi.name to the device’s iSCSI target name
v parms.iscsi.iscsi_ip_addr to the device’s IPv4 or IPv6 address
v parms.iscsi.port_num to the device’s TCP port number
If the iSCSI SCIOLSTART ioctl operation completes successfully, then the adap_set_flags field should
have the SCIOL_RET_ID_ALIAS flag set and the scsi_id field should contain a SCSI ID alias that can be used for
subsequent ioctl calls to this device other than SCIOLSTART.
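The field setup for an iSCSI SCIOLSTART call, and the check for a returned SCSI ID alias, might look like the following sketch. The ex_sciolst structure is a reduced, hypothetical stand-in for the scsi_sciolst structure: its layout, its field sizes, the textual form of the IP address, and the constant values are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SCSI_VERSION_1      1      /* placeholder value */
#define SCIOL_RET_ID_ALIAS  0x01u  /* placeholder value */

/* Reduced stand-in for struct scsi_sciolst; layout is hypothetical. */
struct ex_sciolst {
    unsigned short version;
    unsigned       adap_set_flags;
    uint64_t       scsi_id;
    uint64_t       lun_id;
    struct {
        struct {
            char     name[256];
            char     iscsi_ip_addr[64];  /* textual form, sketch only */
            unsigned port_num;
        } iscsi;
    } parms;
};

/* Fill in the fields an iSCSI SCIOLSTART caller must set.
 * Returns 0, or -1 if a string does not fit. */
int ex_fill_iscsi_start(struct ex_sciolst *s, uint64_t lun_id,
                        const char *target_name, const char *ip_addr,
                        unsigned port_num)
{
    assert(s != NULL && target_name != NULL && ip_addr != NULL);
    if (strlen(target_name) >= sizeof s->parms.iscsi.name ||
        strlen(ip_addr) >= sizeof s->parms.iscsi.iscsi_ip_addr)
        return -1;
    memset(s, 0, sizeof *s);
    s->version = SCSI_VERSION_1;
    s->lun_id = lun_id;
    strcpy(s->parms.iscsi.name, target_name);       /* length checked above */
    strcpy(s->parms.iscsi.iscsi_ip_addr, ip_addr);  /* length checked above */
    s->parms.iscsi.port_num = port_num;
    return 0;
}

/* After a successful call, fetch the SCSI ID alias if one was returned.
 * Returns 1 and stores the alias, or 0 if the flag is not set. */
int ex_get_id_alias(const struct ex_sciolst *s, uint64_t *alias)
{
    assert(s != NULL && alias != NULL);
    if (!(s->adap_set_flags & SCIOL_RET_ID_ALIAS))
        return 0;
    *alias = s->scsi_id;
    return 1;
}
```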
For Virtual SCSI adapters, the version field of the scsi_sciolst structure must be set to the value of
SCSI_VERSION_1 (defined in the /usr/include/sys/scsi_buf.h file). In addition, Virtual SCSI adapters
require the caller to set the lun_id field to the Logical Unit Id (LUN) of the device being started.
For AIX 5.2 with 5200-01 and later, if the FCP SCIOLSTART ioctl operation completes successfully, and
the adap_set_flags field has the SCIOL_DYNTRK_ENABLED flag set, then Dynamic Tracking of FC
Devices has been enabled for this device.
All FC adapter ioctl calls for AIX 5.2 with 5200-01 and later should set the version field to
SCSI_VERSION_1 if indicated in the ioctl structure comments in the header files. The world_wide_name
and node_name fields of all SCSI_VERSION_1 ioctl structures should also be set. This is especially
important if dynamic tracking has been enabled on this adapter. With dynamic tracking, the FC adapter
driver can recover from scsi_id changes of FC devices while devices are online. Because the scsi_id can
change, use of the world_wide_name and node_name fields is necessary to ensure communication with
the intended device.
Failure to use a SCSI_VERSION_1 ioctl structure for SCIOLSTART when dynamic tracking is enabled can
produce undesired results and temporarily disable dynamic tracking for a given device. If a target has at
least one LUN activated by SCIOLSTART with the version field set to SCSI_VERSION_1, then a
SCSI_VERSION_0 SCIOLSTART fails. If this is the first LUN activated by SCIOLSTART on this target and
the version field is set to SCSI_VERSION_0, then an error log of type INFO is generated and dynamic
tracking is temporarily disabled for this target until a corresponding SCSI_VERSION_0 SCIOLSTOP is
issued.
The version field for all ioctl structures should be set consistently. For example, if an SCIOLSTART
operation is performed with the version field set to SCSI_VERSION_1, but the SCIOLINQU or
SCIOLSTOP ioctl operations have the version field set to SCSI_VERSION_0, then the ioctl call fails if
dynamic tracking is enabled because the version fields do not match.
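The version rules for SCIOLSTART with dynamic tracking enabled can be expressed as a small classification function. This is a sketch of the rules as stated here, not of the adapter driver’s actual logic. The ex_ names and the constant values are hypothetical, and combinations that the text does not describe are reported as unspecified rather than guessed. Letting mismatched versions through when dynamic tracking is disabled is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define SCSI_VERSION_0 0   /* placeholder values */
#define SCSI_VERSION_1 1

enum ex_start_effect {
    EX_START_OK,             /* no side effect described */
    EX_START_OK_DYNTRK_OFF,  /* INFO entry logged, tracking suspended */
    EX_START_FAILS,
    EX_START_UNSPECIFIED     /* combination not described in the text */
};

/* LUNs currently started on one target, counted per structure version. */
struct ex_target {
    unsigned v0_luns;
    unsigned v1_luns;
};

/* Classify a SCIOLSTART request against a target's current state. */
enum ex_start_effect ex_classify_start(const struct ex_target *t, int version)
{
    assert(t != NULL);
    if (version == SCSI_VERSION_1)
        return t->v0_luns == 0 ? EX_START_OK : EX_START_UNSPECIFIED;
    if (t->v1_luns > 0)
        return EX_START_FAILS;           /* V0 start after a V1 start */
    if (t->v0_luns == 0)
        return EX_START_OK_DYNTRK_OFF;   /* first LUN started with V0 */
    return EX_START_UNSPECIFIED;
}

/* Later ioctl calls must use the same version as SCIOLSTART when
 * dynamic tracking is enabled. Returns 1 if the call can proceed. */
int ex_version_matches(int start_version, int ioctl_version,
                       int dyntrk_enabled)
{
    return !dyntrk_enabled || start_version == ioctl_version;
}
```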
If the FCP SCIOLSTART ioctl operation completes successfully, then the adap_set_flags field might have
the SCIOL_RET_ID_ALIAS flag set. This flag is set only if the world_wide_name field was provided in
the ioctl call and the FC adapter driver detects that the scsi_id field of this device has changed. The
scsi_id field contains the new scsi_id value.
If the caller of SCIOLSTART is a kernel extension, then the SCIOL_RET_HANDLE flag can be set in the
adap_set_flags field along with the kernext_handle field. In this case the kernext_handle field can be
used for scsi_buf structures issued to the adapter driver for this device.
A nonzero return value indicates an error has occurred and all operations to this SCSI/LUN pair should
cease because the device is either already started or failed the start operation. Possible errno values are:
SCIOLSTOP
This operation closes a logical path to the device and causes the adapter device driver to deallocate all
data areas that were allocated by the SCIOLSTART operation. This operation should only be issued after
a successful SCIOLSTART operation to a device. The following is a typical call:
rc = fp_ioctl(fp, SCIOLSTOP, &sciolst, NULL);
A non-zero return value indicates an error has occurred. Possible errno values are:
For FCP adapters, the version field of the scsi_sciolst structure must be set to the value of
SCSI_VERSION_1, which is defined in the /usr/include/sys/scsi_buf.h file. In addition, the following fields
can be set:
v world_wide_name - The caller can set the world_wide_name field to the World Wide Name of the
attached target device. If Dynamic Tracking of FC devices is enabled, the world_wide_name field
must be set to ensure communication with the device because the scsi_id field of a device can change
after dynamic tracking events.
For Virtual SCSI adapters, the version field of the scsi_sciolst structure must be set to the value of
SCSI_VERSION_1 (defined in the /usr/include/sys/scsi_buf.h file). In addition, Virtual SCSI adapters
require the caller to set the lun_id field to the Logical Unit Id (LUN) of the device being stopped.
SCIOLEVENT
This operation lets a device driver register a particular device instance for receiving asynchronous event
status by calling the SCIOLEVENT ioctl operation for the adapter device driver. When an event covered by
the SCIOLEVENT ioctl operation is detected by the adapter device driver, it builds an scsi_event_info
structure and passes a pointer to the structure to the asynchronous event-handler routine entry point
that was previously registered.
The information reported in the scsi_event_info.events field does not queue to the device driver, but is
instead reported as one or more flags as they occur. Because the data does not queue, the adapter
device driver writer can use a single scsi_event_info structure and pass it one at a time, by pointer, to
each asynchronous event handler routine for the appropriate device instance. After determining for which
device the events are being reported, the device driver must copy the scsi_event_info.events field into
local space and must not modify the contents of the rest of the scsi_event_info structure.
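The handling rule above (find the device instance, copy the events flags into local space, and leave the rest of the structure untouched) can be sketched as follows. The ex_event_info and ex_device structures are reduced, hypothetical stand-ins, and the handler signature is illustrative rather than the registered entry-point prototype.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-in for struct scsi_event_info; layout is hypothetical. */
struct ex_event_info {
    uint64_t scsi_id;
    uint64_t lun_id;
    unsigned events;          /* one or more event flags */
};

/* Per-device state kept by the device driver (hypothetical). */
struct ex_device {
    uint64_t scsi_id;
    uint64_t lun_id;
    unsigned pending_events;  /* local copy of the reported flags */
};

/*
 * Record an asynchronous event for the matching device instance.
 * Returns the index of that device, or -1 if none matches. The info
 * pointer is const because the handler must not modify the structure.
 */
int ex_async_event(struct ex_device *devs, size_t ndevs,
                   const struct ex_event_info *info)
{
    size_t i;

    assert(devs != NULL && info != NULL);
    for (i = 0; i < ndevs; i++) {
        if (devs[i].scsi_id == info->scsi_id &&
            devs[i].lun_id == info->lun_id) {
            /* OR keeps flags from earlier events not yet processed. */
            devs[i].pending_events |= info->events;
            return (int)i;
        }
    }
    return -1;
}
```

Accumulating with OR instead of overwriting is a design choice of this sketch: because events do not queue, a second event could otherwise erase a flag the driver has not yet acted on.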
Because the event status is optional, the device driver writer determines what action is necessary to take
upon receiving event status. The writer might decide to save the status and report it back to the calling
application, or the device driver or application level program can take error recovery actions.
This operation should only be issued after a successful SCIOLSTART operation to a device. The following
is a typical call:
rc = fp_ioctl(fp, SCIOLEVENT, &scevent, NULL);
A non-zero return value indicates an error has occurred. Possible errno values are:
For FCP adapters, the version field of the scsi_event_struct structure must be set to the value of
SCSI_VERSION_1, which is defined in the /usr/include/sys/scsi_buf.h file. In addition, the following fields
can be set:
v world_wide_name - The caller can set the world_wide_name field to the World Wide Name of the
attached target device. If Dynamic Tracking of FC devices is enabled, the world_wide_name field
must be set to ensure communication with the device because the scsi_id field of a device can change
after dynamic tracking events.
v node_name - The caller can set the node_name field to the Node Name of the attached target device.
If the world_wide_name field and the version field are set to SCSI_VERSION_1 but the node_name field
is not set, the scsi_id field is used for device lookup instead of the world_wide_name. If Dynamic
Tracking of FC devices is enabled, the node_name field must be set to ensure communication with
the device because the scsi_id field of a device can change after dynamic tracking events.
If the FCP SCIOLEVENT ioctl operation completes successfully, then the adap_set_flags field might have
the SC_RET_ID flag set. This flag is set only if the world_wide_name and node_name fields were
provided in the ioctl call and the FC adapter driver detects that the scsi_id field of this device has
changed. The scsi_id field contains the new scsi_id value.
SCIOLINQU
This operation issues an inquiry command to a device and is used to aid in device configuration. The
following is a typical call:
rc = ioctl(adapter, SCIOLINQU, &inquiry_block);
where adapter is a file descriptor and inquiry_block is a scsi_inquiry structure as defined in the
/usr/include/sys/scsi_buf.h header file. The FCP ID or iSCSI device’s SCSI ID alias, and LUN should be
placed in the scsi_inquiry parameter block. The SC_ASYNC flag should not be set on the first call to this
operation and should only be set if a bus fault has occurred. Possible errno values are:
EIO A system error has occurred. Consider retrying the operation several times, because
another attempt might be successful.
EFAULT A user process copy has failed.
EINVAL The device is not opened.
EACCES The adapter is in diagnostics mode.
ENOMEM A memory request has failed.
ETIMEDOUT The command has timed out. Consider retrying the operation several times, because
another attempt might be successful.
ENODEV The device is not responding. Possibly no LUNs exist on the present FCP ID.
ENOCONNECT A bus fault has occurred and the operation should be retried with the SC_ASYNC flag set
in the scsi_inquiry structure. In the case of multiple retries, this flag should be set only on
the last retry.
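The retry guidance in the errno descriptions above can be sketched as a loop that retries on EIO, ETIMEDOUT, and ENOCONNECT, and sets the SC_ASYNC flag only on the last attempt after a bus fault has been seen. The SC_ASYNC value, the fallback ENOCONNECT value, the reduced ex_inquiry structure, and the function-pointer interface are assumptions made so that the sketch can be compiled outside the kernel. The ex_scripted_issue routine is a scripted stand-in for the real SCIOLINQU call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#ifndef ENOCONNECT
#define ENOCONNECT 50      /* AIX errno; placeholder value elsewhere */
#endif
#define SC_ASYNC 0x08u     /* placeholder value */

struct ex_inquiry { unsigned flags; };  /* stand-in for scsi_inquiry */

/* Issues one inquiry; returns 0 or an errno value. */
typedef int (*ex_issue_fn)(struct ex_inquiry *inq);

/* Retry an inquiry up to max_tries times; returns the final result. */
int ex_inquiry_retry(ex_issue_fn issue, struct ex_inquiry *inq,
                     int max_tries)
{
    int err = EIO;
    int bus_fault = 0;
    int n;

    assert(issue != NULL && inq != NULL && max_tries > 0);
    for (n = 1; n <= max_tries; n++) {
        inq->flags &= ~SC_ASYNC;         /* never set on the first call */
        if (bus_fault && n == max_tries)
            inq->flags |= SC_ASYNC;      /* only on the last retry */
        err = issue(inq);
        if (err == ENOCONNECT)
            bus_fault = 1;
        else if (err != EIO && err != ETIMEDOUT)
            break;                       /* success, or not retryable */
    }
    return err;
}

/* Scripted stand-in for the real ioctl call, used to exercise the loop. */
static int ex_script[8];
static int ex_calls;
static unsigned ex_last_flags;

int ex_scripted_issue(struct ex_inquiry *inq)
{
    ex_last_flags = inq->flags;
    return ex_calls < 8 ? ex_script[ex_calls++] : 0;
}
```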
For FCP adapters, the version field of the scsi_inquiry structure must be set to the value of
SCSI_VERSION_1, which is defined in the /usr/include/sys/scsi_buf.h file. In addition, the following fields
can be set:
v world_wide_name - The caller can set the world_wide_name field to the World Wide Name of the
attached target device. If Dynamic Tracking of FC devices is enabled, the world_wide_name field
must be set to ensure communication with the device because the scsi_id field of a device can change
after dynamic tracking events.
v node_name - The caller can set the node_name field to the Node Name of the attached target device.
If the world_wide_name field and the version field are set to SCSI_VERSION_1 but the node_name field
is not set, the scsi_id field is used for device lookup instead of the world_wide_name. If Dynamic
Tracking of FC devices is enabled, the node_name field must be set to ensure communication with
the device because the scsi_id field of a device can change after dynamic tracking events.
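The lookup rule stated for the world_wide_name and node_name fields can be sketched as follows. The ex_ioctl_id structure is a hypothetical reduction of the identification fields shared by the SCSI_VERSION_1 ioctl structures. Treating a value of 0 as "not set", and falling back to scsi_id for SCSI_VERSION_0 structures, are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCSI_VERSION_0 0   /* placeholder values */
#define SCSI_VERSION_1 1

/* Hypothetical reduction of the identification fields. */
struct ex_ioctl_id {
    unsigned short version;
    uint64_t scsi_id;
    uint64_t world_wide_name;   /* 0 means "not set" in this sketch */
    uint64_t node_name;         /* 0 means "not set" in this sketch */
};

enum ex_lookup { EX_BY_SCSI_ID, EX_BY_WORLD_WIDE_NAME };

/* Decide which field identifies the device for this request. */
enum ex_lookup ex_lookup_key(const struct ex_ioctl_id *id)
{
    assert(id != NULL);
    if (id->version == SCSI_VERSION_1 &&
        id->world_wide_name != 0 && id->node_name != 0)
        return EX_BY_WORLD_WIDE_NAME;
    /* node_name missing: the scsi_id field is used for the lookup. */
    return EX_BY_SCSI_ID;
}
```

Because the scsi_id field can change after dynamic tracking events, a caller that relies on dynamic tracking would want this function to return EX_BY_WORLD_WIDE_NAME for every request it builds.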
When the SCIOLINQU ioctl request with the version field set to SCSI_VERSION_2 completes and the device
did not fully satisfy the request, the residual field indicates left over data. If the request completes
successfully, the residual field indicates the device does not have all the requested data. If the request did
not complete successfully, check the status_validity to see whether a valid SCSI bus problem exists. If a
valid SCSI bus problem exists, the residual field indicates the number of bytes by which the device failed
to complete the request.
SCIOLSTUNIT
This operation issues a start unit command to a device and is used to aid in device configuration. The
following is a typical call:
rc = ioctl(adapter, SCIOLSTUNIT, &start_block);
where adapter is a file descriptor and start_block is a scsi_startunit structure as defined in the
/usr/include/sys/scsi_buf.h header file. The FCP ID or iSCSI device’s SCSI ID alias, and LUN should be
placed in the scsi_startunit parameter block. The start_flag field designates the start option, which when
set to true, makes the device available for use. When this field is set to false, the device is stopped.
The SC_ASYNC flag should not be set on the first call to this operation and should only be set if a bus
fault has occurred. The immed_flag field supports overlapping start operations to several devices on the
adapter. When this field is set to false, status is returned only when the operation has completed. When
this field is set to true, status is returned as soon as the device receives the command. The SCIOLTUR
operation can then be issued to check on completion of the operation on a particular device.
Note that when the FCP or iSCSI adapter issues simultaneous start operations, it is important to allow a delay
of 10 seconds between successive SCIOLSTUNIT operations to devices sharing a common
power supply, because damage to the system or devices can occur if this precaution is not followed.
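One way a configuration method might honor the 10-second spacing is to record when the last start operation was issued to a given power supply group and compute the remaining wait before the next one. The following is a minimal sketch under that assumption; the function name and the use of plain second counters are hypothetical and are not part of any AIX interface.

```c
/* Gap the text asks for between successive SCIOLSTUNIT operations to
 * devices that share a power supply. */
#define DEMO_START_GAP_SECONDS 10L

/* Returns how many seconds the caller should still wait before issuing
 * the next start operation. Both arguments are times in seconds taken
 * from the same clock. */
static long demo_seconds_to_wait(long last_start, long now)
{
    long elapsed = now - last_start;

    if (elapsed < 0)
        return DEMO_START_GAP_SECONDS;   /* clock moved backward: wait the full gap */
    if (elapsed >= DEMO_START_GAP_SECONDS)
        return 0;                        /* the gap has already passed */
    return DEMO_START_GAP_SECONDS - elapsed;
}
```

For example, if the previous start was issued at second 100 and the current time is second 103, the caller would wait 7 more seconds before the next SCIOLSTUNIT call to a device on the same supply.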
Possible errno values are:
EIO A system error has occurred. Consider retrying the operation several times, because another
attempt might be successful.
EFAULT A user process copy has failed.
EINVAL The device is not opened.
EACCES The adapter is in diagnostics mode.
ENOMEM A memory request has failed.
ETIMEDOUT The command has timed out. Consider retrying the operation several times, because another
attempt might be successful.
ENODEV The device is not responding. Possibly no LUNs exist on the present FCP ID.
ENOCONNECT A bus fault has occurred. Try the operation again with the SC_ASYNC flag set in the
scsi_startunit structure. In the case of multiple retries, this flag should be set only on the last
retry.
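The SC_ASYNC guidance (never on the first call, only after a bus fault, and only on the last retry) can be expressed as a small predicate that a caller consults before each attempt. This is a sketch under stated assumptions: the function name and the 0-based attempt counter are hypothetical, and the ENOCONNECT fallback value is only a placeholder for systems whose errno.h does not define it.

```c
#include <errno.h>
#include <stdbool.h>

#ifndef ENOCONNECT
#define ENOCONNECT 50 /* placeholder so the sketch builds where errno.h lacks it */
#endif

/* Decides whether the SC_ASYNC flag should be set for a given attempt.
 * attempt      - 0 for the first call, 1 for the first retry, and so on
 * max_attempts - total number of calls the caller is willing to make
 * prev_errno   - errno from the previous attempt (ignored when attempt is 0) */
static bool demo_use_sc_async(int attempt, int max_attempts, int prev_errno)
{
    if (attempt == 0)
        return false;                    /* never on the first call */
    if (prev_errno != ENOCONNECT)
        return false;                    /* only after a bus fault */
    return attempt == max_attempts - 1;  /* only on the last retry */
}
```

With three attempts allowed, the predicate is true only for the third call, and only when the second call failed with ENOCONNECT.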
For FCP adapters, the version field of the scsi_startunit structure must be set to the value of
SCSI_VERSION_1, which is defined in the /usr/include/sys/scsi_buf.h file. In addition, the following fields
can be set:
v world_wide_name - The caller can set the world_wide_name field to the World Wide Name of the
attached target device. If Dynamic Tracking of FC devices is enabled, the world_wide_name field
must be set to ensure communication with the device because the scsi_id field of a device can change
after dynamic tracking events.
v node_name - The caller can set the node_name field to the Node Name of the attached target device.
If the world_wide_name field and the version field are set to SCSI_VERSION_1 but the node_name field
is not set, the scsi_id field is used for device lookup instead of the world_wide_name. If Dynamic
Tracking of FC devices is enabled, the node_name field must be set to ensure communication with
the device because the scsi_id field of a device can change after dynamic tracking events.
Chapter 13. FCP, iSCSI, and Virtual SCSI Client Subsystem 265
If the FCP SCIOLSTUNIT ioctl operation completes successfully, then the adap_set_flags field might
have the SC_RET_ID flag set. This flag is set only if the world_wide_name and node_name fields were
provided in the ioctl call and the FC adapter driver detects that the scsi_id field of this device has
changed. The scsi_id field contains the new scsi_id value.
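A caller that caches SCSI IDs might refresh its copy whenever the returned-ID flag is present after a successful ioctl call. The following is a minimal sketch of that check. The demo_ioctl_result structure and the DEMO_RET_ID_FLAG bit are hypothetical stand-ins; the real flag value and structure layouts are defined in the /usr/include/sys/scsi_buf.h file.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit standing in for the returned-ID flag named in the text. */
#define DEMO_RET_ID_FLAG 0x1u

/* Hypothetical stand-in for the two output fields discussed above. */
struct demo_ioctl_result {
    uint32_t adap_set_flags; /* flags set by the adapter driver */
    uint64_t scsi_id;        /* new ID when the flag is set */
};

/* Call only after the ioctl has completed successfully.
 * Returns true and updates *cached_id when the driver reported a new ID. */
static bool demo_refresh_scsi_id(const struct demo_ioctl_result *r,
                                 uint64_t *cached_id)
{
    if ((r->adap_set_flags & DEMO_RET_ID_FLAG) == 0)
        return false;        /* ID unchanged; keep the cached value */
    *cached_id = r->scsi_id;
    return true;
}
```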
SCIOLTUR
This operation issues a Test Unit Ready command to a device and aids in device configuration. The
following is a typical call:
rc = ioctl(adapter, SCIOLTUR, &ready_struct);
where adapter is a file descriptor and ready_struct is a scsi_ready structure as defined in the
/usr/include/sys/scsi_buf.h header file. The FCP ID or iSCSI device’s SCSI ID alias, and LUN should be
placed in the scsi_ready parameter block. The SC_ASYNC flag should not be set on the first call to this
operation and should only be set if a bus fault has occurred. The status of the device can be determined
by evaluating the two output fields: status_validity and scsi_status. Possible errno values are:
EIO A system error has occurred. Consider retrying the operation several (around three) times,
because another attempt might be successful. If an EIO error occurs and the status_validity
field is set to SC_FCP_ERROR, then the scsi_status field has a valid value and should be inspected.
If the status_validity field is zero and remains so on successive retries, then an unrecoverable
error has occurred with the device.
If the status_validity field is SC_FCP_ERROR and the scsi_status field contains a Check Condition
status, then the SCIOLTUR operation should be retried after several seconds.
If after successive retries, the Check Condition status remains, the device should be considered
inoperable.
EFAULT A user process copy has failed.
EINVAL The device is not opened.
EACCES The adapter is in diagnostics mode.
ENOMEM A memory request has failed.
ETIMEDOUT The command has timed out. Consider retrying the operation several times, because another
attempt might be successful.
ENODEV The device is not responding and possibly no LUNs exist on the present target.
ENOCONNECT A bus fault has occurred and the operation should be retried with the SC_ASYNC flag set in the
scsi_ready structure. In the case of multiple retries, this flag should be set only on the last
retry.
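The EIO handling described for SCIOLTUR amounts to a small decision table over the status_validity and scsi_status output fields. The sketch below shows one possible encoding. The DEMO_ constants are placeholders for the real SC_FCP_ERROR and Check Condition definitions, the retry limit is supplied by the caller (the text suggests around three), and direct comparison of scsi_status without any masking is an assumption of this example.

```c
/* Placeholder values; the real definitions come from the AIX header files. */
#define DEMO_SC_FCP_ERROR    0x01u
#define DEMO_CHECK_CONDITION 0x02u

enum demo_tur_action {
    DEMO_TUR_RETRY_NOW,         /* retry the operation */
    DEMO_TUR_RETRY_AFTER_DELAY, /* retry after several seconds */
    DEMO_TUR_INSPECT_STATUS,    /* scsi_status is valid; caller must examine it */
    DEMO_TUR_GIVE_UP            /* unrecoverable error or inoperable device */
};

/* Chooses the next step after SCIOLTUR fails with EIO.
 * retries_done counts the retries already made. */
static enum demo_tur_action demo_tur_after_eio(unsigned status_validity,
                                               unsigned scsi_status,
                                               int retries_done,
                                               int max_retries)
{
    int exhausted = retries_done >= max_retries;

    if (status_validity == 0)
        return exhausted ? DEMO_TUR_GIVE_UP : DEMO_TUR_RETRY_NOW;

    if (status_validity == DEMO_SC_FCP_ERROR &&
        scsi_status == DEMO_CHECK_CONDITION)
        return exhausted ? DEMO_TUR_GIVE_UP : DEMO_TUR_RETRY_AFTER_DELAY;

    return DEMO_TUR_INSPECT_STATUS;
}
```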
For FCP adapters, the version field of the scsi_ready structure must be set to the value of
SCSI_VERSION_1, which is defined in the /usr/include/sys/scsi_buf.h file. In addition, the following fields
can be set:
v world_wide_name - The caller can set the world_wide_name field to the World Wide Name of the
attached target device. If Dynamic Tracking of FC devices is enabled, the world_wide_name field
must be set to ensure communication with the device because the scsi_id field of a device can change
after dynamic tracking events.
v node_name - The caller can set the node_name field to the Node Name of the attached target device.
If the world_wide_name field and the version field are set to SCSI_VERSION_1 but the node_name field
is not set, the scsi_id field is used for device lookup instead of the world_wide_name. If Dynamic
Tracking of FC devices is enabled, the node_name field must be set to ensure communication with
the device because the scsi_id field of a device can change after dynamic tracking events.
For Virtual SCSI adapters, the version field of the scsi_sciolst structure must be set to the value of
SCSI_VERSION_1 (defined in the /usr/include/sys/scsi_buf.h file). In addition, Virtual SCSI adapters
If the FCP SCIOLTUR ioctl operation completes successfully, then the adap_set_flags field might have
the SC_RET_ID flag set. This flag is set only if the world_wide_name and node_name fields were
provided in the ioctl call and the FC adapter driver detects that the scsi_id field of this device has
changed. The scsi_id field contains the new scsi_id value.
SCIOLREAD
This operation issues a read command to a device and is used to aid in device configuration. The
following is a typical call:
rc = ioctl(adapter, SCIOLREAD, &readblk);
where adapter is a file descriptor and readblk is a scsi_readblk structure as defined in the
/usr/include/sys/scsi_buf.h header file. The FCP ID or iSCSI device’s SCSI ID alias, and the LUN should
be placed in the scsi_readblk parameter block. The SC_ASYNC flag should not be set on the first call to
this operation and should only be set if a bus fault has occurred. Possible errno values are:
EIO A system error has occurred. Consider retrying the operation several times, because another
attempt might be successful.
EFAULT A user process copy has failed.
EINVAL The device is not opened.
EACCES The adapter is in diagnostics mode.
ENOMEM A memory request has failed.
ETIMEDOUT The command has timed out. Consider retrying the operation several times, because another
attempt might be successful.
ENODEV The device is not responding. Possibly no LUNs exist on the present target.
ENOCONNECT A bus fault has occurred and the operation should be retried with the SC_ASYNC flag set in
the scsi_readblk structure. In the case of multiple retries, this flag should be set only on the
last retry.
For FCP adapters, the version field of the scsi_readblk structure must be set to the value of
SCSI_VERSION_1, which is defined in the /usr/include/sys/scsi_buf.h file. In addition, the following fields
can be set:
v world_wide_name - The caller can set the world_wide_name field to the World Wide Name of the
attached target device. If Dynamic Tracking of FC devices is enabled, the world_wide_name field
must be set to ensure communication with the device because the scsi_id field of a device can change
after dynamic tracking events.
v node_name - The caller can set the node_name field to the Node Name of the attached target device.
If the world_wide_name field and the version field are set to SCSI_VERSION_1 but the node_name field
is not set, the scsi_id field is used for device lookup instead of the world_wide_name. If Dynamic
Tracking of FC devices is enabled, the node_name field must be set to ensure communication with
the device because the scsi_id field of a device can change after dynamic tracking events.
When the SCIOLREAD ioctl request with the version field set to SCSI_VERSION_2 completes and the
device did not fully satisfy the request, the residual field indicates the leftover data. If the request completes
successfully, the residual field indicates that the device does not have all the requested data. If the request did
not complete successfully, check the status_validity field to see whether a valid SCSI bus problem exists. If a
valid SCSI bus problem exists, the residual field indicates the number of bytes by which the device failed
to complete the request.
If the FCP SCIOLREAD ioctl operation completes successfully, then the adap_set_flags field might have
the SC_RET_ID flag set. This flag is set only if the world_wide_name and node_name fields were
provided in the ioctl call and the FC adapter driver detects that the scsi_id field of this device has
changed. The scsi_id field contains the new scsi_id value.
SCIOLRESET
If the SCIOLRESET_LUN_RESET flag is not set in the flags field of the scsi_sciolst structure, then this operation
causes a device to release all reservations, clear all current commands, and return to an initial state by
issuing a Target Reset, which resets all LUNs associated with the specified FCP ID or iSCSI device’s
SCSI ID alias. If the SCIOLRESET_LUN_RESET flag is set in the flags field of the scsi_sciolst structure, then this
operation causes an FCP device to release all reservations, clear all current commands, and return to an
initial state by issuing a LUN Reset, which resets only the specified LUN associated with the specified FCP
ID or iSCSI device’s SCSI ID alias.
A reserve command should be issued after the SCIOLRESET operation to prevent other initiators from
claiming the device. Note that because a certain amount of time exists between a reset and reserve
command, it is still possible for another initiator to successfully reserve a particular device. The following is
a typical call:
rc = fp_ioctl(fp, SCIOLRESET, &sciolst);
A nonzero return value indicates an error has occurred. Possible errno values are:
For FCP adapters, the version field of the scsi_sciolst structure must be set to the value of
SCSI_VERSION_1, which is defined in the /usr/include/sys/scsi_buf.h file. In addition, the following fields
can be set:
v world_wide_name - The caller can set the world_wide_name field to the World Wide Name of the
attached target device. If Dynamic Tracking of FC devices is enabled, the world_wide_name field
must be set to ensure communication with the device because the scsi_id field of a device can change
after dynamic tracking events.
v node_name - The caller can set the node_name field to the Node Name of the attached target device.
If the world_wide_name field and the version field are set to SCSI_VERSION_1 but the node_name field
is not set, the scsi_id field is used for device lookup instead of the world_wide_name. If Dynamic
Tracking of FC devices is enabled, the node_name field must be set to ensure communication with
the device because the scsi_id field of a device can change after dynamic tracking events.
If the FCP SCIOLRESET ioctl operation completes successfully, then the adap_set_flags field might have
the SCIOL_RET_ID_ALIAS flag set. This flag is set only if the world_wide_name and node_name fields
were provided in the ioctl call and the FC adapter driver detects that the scsi_id field of this device has
changed. The scsi_id field contains the new scsi_id value.
SCIOLHALT
After the SCIOLHALT operation is sent, the device driver must set the SC_RESUME flag in the next
scsi_buf structure sent to the adapter device driver, or all subsequent scsi_buf structures sent are
ignored.
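A device driver might track the resume requirement with a single pending flag that is raised when the halt is issued and consumed by the next scsi_buf structure it builds. The following is a minimal sketch of that bookkeeping. The demo_queue_state structure and the DEMO_SC_RESUME bit are hypothetical; the real SC_RESUME value and the scsi_buf field that carries it are defined in the /usr/include/sys/scsi_buf.h file.

```c
#include <stdbool.h>

/* Placeholder bit standing in for the real SC_RESUME flag. */
#define DEMO_SC_RESUME 0x4u

/* Hypothetical per-device bookkeeping kept by the device driver. */
struct demo_queue_state {
    bool resume_pending; /* true from the halt until the next request is built */
};

/* Record that a halt was sent for this device. */
static void demo_note_halt_sent(struct demo_queue_state *q)
{
    q->resume_pending = true;
}

/* Returns the flags for the next request. The resume bit is added
 * exactly once, on the first request built after a halt. */
static unsigned demo_flags_for_next_request(struct demo_queue_state *q,
                                            unsigned flags)
{
    if (q->resume_pending) {
        flags |= DEMO_SC_RESUME;
        q->resume_pending = false;
    }
    return flags;
}
```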
The adapter also performs normal error recovery procedures during this command. The following is a
typical call:
rc = fp_ioctl(fp, SCIOLHALT, &sciolst);
A nonzero return value indicates an error has occurred. Possible errno values are:
For FCP adapters, the version field of the scsi_sciolst structure must be set to the value of
SCSI_VERSION_1, which is defined in the /usr/include/sys/scsi_buf.h file. In addition, the following fields
can be set:
v world_wide_name - The caller can set the world_wide_name field to the World Wide Name of the
attached target device. If Dynamic Tracking of FC devices is enabled, the world_wide_name field
must be set to ensure communication with the device because the scsi_id field of a device can change
after dynamic tracking events.
v node_name - The caller can set the node_name field to the Node Name of the attached target device.
If the world_wide_name field and the version field are set to SCSI_VERSION_1 but the node_name field
is not set, the scsi_id field is used for device lookup instead of the world_wide_name. If Dynamic
Tracking of FC devices is enabled, the node_name field must be set to ensure communication with
the device because the scsi_id field of a device can change after dynamic tracking events.
If the FCP SCIOLHALT ioctl operation completes successfully, then the adap_set_flags field might have
the SCIOL_RET_ID_ALIAS flag set. This flag is set only if the world_wide_name and node_name fields
were provided in the ioctl call and the FC adapter driver detects that the scsi_id field of this device has
changed. The scsi_id field contains the new scsi_id value.
SCIOLCMD
After the SCSI device has been successfully started using SCIOLSTART, the SCIOLCMD ioctl operation
provides the means for issuing any SCSI command to the specified device. The SCSI adapter driver
performs no error recovery or logging on failures of this ioctl operation. The following is a typical call:
rc = ioctl(adapter, SCIOLCMD, &iocmd);
where adapter is a file descriptor and iocmd is a scsi_iocmd structure as defined in the
/usr/include/sys/scsi_buf.h header file. The SCSI ID or iSCSI device’s SCSI ID alias, and LUN ID should
be placed in the scsi_iocmd parameter block.
The SCSI status byte and the adapter status bytes are returned using the scsi_iocmd structure. If the
SCIOLCMD operation returns a value of -1 and the errno global variable is set to a nonzero value, the
requested operation has failed. In this case, the caller should evaluate the returned status bytes to
determine why the operation failed and what recovery actions should be taken.
The devinfo structure defines the maximum transfer size for the command. If an attempt is made to
transfer more than the maximum, a value of -1 is returned and the errno global variable is set to a value of
EINVAL. Refer to the Small Computer System Interface (SCSI) Specification for the applicable device to
get request sense information. Possible errno values are:
EIO A system error has occurred. Consider retrying the operation several (around three) times, because
another attempt might be successful. If an EIO error occurs and the status_validity field is set to
SC_SCSI_ERROR, then the scsi_status field has a valid value and should be inspected.
If the status_validity field is zero and remains so on successive retries then an unrecoverable error
has occurred with the device.
If the status_validity field is SC_SCSI_ERROR and the scsi_status field contains a Check Condition
status, then a SCSI request sense should be issued using the SCIOLCMD ioctl to recover the
sense data.
EFAULT A user process copy has failed.
EINVAL The device is not opened.
EACCES The adapter is in diagnostics mode.
ENOMEM A memory request has failed.
ETIMEDOUT The command has timed out; the operation did not complete before the time-out value was
exceeded. Consider retrying the operation several times, because another attempt might be successful.
ENODEV The device is not responding.
For FCP adapters, the version field of the scsi_iocmd structure must be set to the value of
SCSI_VERSION_1, which is defined in the /usr/include/sys/scsi_buf.h file. In addition, the following fields
can be set:
v world_wide_name - The caller can set the world_wide_name field to the World Wide Name of the
attached target device. If Dynamic Tracking of FC devices is enabled, the world_wide_name field
must be set to ensure communication with the device because the scsi_id field of a device can change
after dynamic tracking events.
v node_name - The caller can set the node_name field to the Node Name of the attached target device.
If the world_wide_name field and the version field are set to SCSI_VERSION_1 but the node_name field
is not set, the scsi_id field is used for device lookup instead of the world_wide_name. If Dynamic
Tracking of FC devices is enabled, the node_name field must be set to ensure communication with
the device because the scsi_id field of a device can change after dynamic tracking events.
The version field of the scsi_iocmd structure can be set to the value of SCSI_VERSION_2, and the user
can provide the following fields:
v variable_cdb_ptr - pointer to a buffer that contains the SCSI variable CDB.
v variable_cdb_length - the length of the variable CDB to which the variable_cdb_ptr field points.
When the SCIOLCMD ioctl request with the version field set to SCSI_VERSION_2 completes and the device
did not fully satisfy the request, the residual field indicates the leftover data. If the request completes
successfully, the residual field indicates that the device does not have all the requested data. If the request did
not complete successfully, check the status_validity field to see whether a valid SCSI bus problem exists. If a
valid SCSI bus problem exists, the residual field indicates the number of bytes by which the device failed
to complete the request.
If the FCP SCIOLCMD ioctl operation completes successfully, then the adap_set_flags field might have
the SC_RET_ID flag set. This flag is set only if the world_wide_name and node_name fields were
provided in the ioctl call and the FC adapter driver detects that the scsi_id field of this device has
changed. The scsi_id field contains the new scsi_id value.
SCIOLNMSRV
Note: SCIOLNMSRV is specific to FCP.
This operation issues a query name server request to find all SCSI devices and is used to aid in SCSI
device configuration. The following is a typical call:
rc = ioctl(adapter, SCIOLNMSRV, &nmserv);
where adapter is a file descriptor and nmserv is a scsi_nmserv structure as defined in the
/usr/include/sys/scsi_buf.h header file. The caller of this ioctl must allocate a buffer to be referenced by
the scsi_id_list field. In addition, the caller must set the list_len field to indicate the size of the buffer in
bytes.
On successful completion, the num_ids field indicates the number of SCSI IDs returned in the current list.
If more IDs were found than could be placed in the list, then the adapter driver updates the list_len
field to indicate the length of buffer needed to receive all SCSI IDs.
EIO A system error has occurred. Consider retrying the operation several times, because another
attempt might be successful.
EFAULT A user process copy has failed.
EINVAL The physical configuration does not support this request.
ENOMEM A memory request has failed.
ETIMEDOUT The command has timed out. Consider retrying the operation several times, because another
attempt might be successful.
ENODEV The device is not responding. Possibly no LUNs exist on the present target.
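Because the adapter driver reports the buffer length it needs through the list_len field, a caller can start with a small buffer and repeat the SCIOLNMSRV query with a larger one. The sketch below shows that pattern. Everything prefixed with demo_ is hypothetical: the structure contains only the three fields named in the text, the 64-bit ID element size is an assumption, the callback stands in for the ioctl call, and demo_fake_driver exists only so that the sketch can be exercised away from an AIX system.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in with only the fields named in the text. */
struct demo_nmserv {
    uint64_t *scsi_id_list; /* caller-allocated buffer (element size assumed) */
    uint32_t  list_len;     /* in: buffer bytes; out: bytes needed if too small */
    uint32_t  num_ids;      /* out: number of IDs placed in the list */
};

/* Stands in for ioctl(adapter, SCIOLNMSRV, &nmserv); returns 0 or -1. */
typedef int (*demo_nmsrv_fn)(struct demo_nmserv *);

/* Repeats the query with a larger buffer until every ID fits.
 * On success the caller owns ns->scsi_id_list and must free it. */
static int demo_query_all_ids(demo_nmsrv_fn query, struct demo_nmserv *ns,
                              uint32_t initial_len)
{
    uint32_t len = initial_len ? initial_len : (uint32_t)sizeof(uint64_t);

    ns->scsi_id_list = NULL;
    for (int pass = 0; pass < 4; pass++) {
        uint64_t *buf = realloc(ns->scsi_id_list, len);
        if (buf == NULL)
            break;
        ns->scsi_id_list = buf;
        ns->list_len = len;
        ns->num_ids = 0;
        if (query(ns) != 0)
            break;
        if (ns->list_len <= len)
            return 0;       /* every ID fit in the buffer */
        len = ns->list_len; /* the driver asked for a larger buffer */
    }
    free(ns->scsi_id_list);
    ns->scsi_id_list = NULL;
    return -1;
}

/* Fake driver for demonstration only: pretends that five IDs exist. */
static int demo_fake_driver(struct demo_nmserv *ns)
{
    const uint32_t total = 5;
    uint32_t room = ns->list_len / (uint32_t)sizeof(uint64_t);
    uint32_t n = room < total ? room : total;

    for (uint32_t i = 0; i < n; i++)
        ns->scsi_id_list[i] = 0x100u + i;
    ns->num_ids = n;
    if (room < total)
        ns->list_len = total * (uint32_t)sizeof(uint64_t);
    return 0;
}
```

The pass limit guards against a fabric whose device count keeps growing between queries; a real caller would choose its own limit and error handling.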
SCIOLQWWN
Note: SCIOLQWWN is specific to FCP.
This operation issues a request to find the SCSI ID of a device for the specified world wide name. The
following is a typical call:
rc = ioctl(adapter, SCIOLQWWN, &qrywwn);
where adapter is a file descriptor and qrywwn is a scsi_qry_wwn structure as defined in the
/usr/include/sys/scsi_buf.h header file. The caller of this ioctl must specify the device’s world wide name
in the world_wide_name field. On successful completion, the scsi_id field is returned with the SCSI ID of
the device with this world wide name.
EIO A system error has occurred. Consider retrying the operation several times, because another
attempt might be successful.
EFAULT A user process copy has failed.
EINVAL The physical configuration does not support this request.
ENOMEM A memory request has failed.
ETIMEDOUT The command has timed out. Consider retrying the operation several times, because another
attempt might be successful.
ENODEV The device is not responding. Possibly no LUNs exist on the present FCP ID.
SCIOLPAYLD
The SCIOLPAYLD ioctl is not supported by the Virtual SCSI adapter driver.
This operation provides the means for issuing a transport payload to the specified device. The SCSI
adapter driver performs no error recovery or logging on failures of this ioctl operation. The following is a
typical call:
rc = ioctl(adapter, SCIOLPAYLD, &payld);
where adapter is a file descriptor and payld is a scsi_trans_payld structure as defined in the
/usr/include/sys/scsi_buf.h header file. The SCSI ID or SCSI ID alias should be placed in the
scsi_trans_payld structure. In addition, the user must allocate a payload buffer referenced by the payld_buffer field
and a response buffer referenced by the response_buffer field. The fields payld_size and
response_size specify the size in bytes of the payload buffer and response buffer, respectively. In addition,
the caller can also set payld_type (for FC this is the FC-4 type), and payld_ctl (for FC this is the router
control field).
If the SCIOLPAYLD operation returns a value of -1 and the errno global variable is set to a nonzero
value, the requested operation has failed. In this case, the caller should evaluate the returned status bytes
to determine why the operation failed and what recovery actions should be taken.
SCIOLCHBA
The SCIOLCHBA ioctl is not supported by the Virtual SCSI adapter driver.
When the device has been successfully opened, the SCIOLCHBA operation provides the means for
issuing one or more common host bus adapter (HBA) API commands to the adapter. The FC adapter
driver performs full error recovery on failures of this operation.
The arg parameter contains the address of a scsi_chba structure, which is defined in the
/usr/include/sys/scsi_buf.h file.
The cmd field in the scsi_chba structure determines the common HBA API operation that is performed.
If the SCIOLCHBA operation fails, the subroutine returns a value of -1 and sets the errno global variable
to a nonzero value. In this case, the caller should evaluate the returned status bytes to determine why the
operation was unsuccessful and what recovery actions should be taken.
If a SCIOLCHBA operation fails because a field in the scsi_chba structure has an invalid value, the
subroutine returns a value of -1 and sets the errno global variable to EINVAL.
SCIOLPASSTHRUHBA
When the device has been successfully opened, the SCIOLPASSTHRUHBA operation provides the
means for issuing passthru commands to the adapter. The FC adapter driver performs full error recovery
on failures of this operation.
The arg parameter contains the address of a scsi_passthru_hba structure, which is defined in the
/usr/include/sys/scsi_buf.h file.
The cmd field in the scsi_passthru_hba structure determines the type of passthru operation to be
performed.
If the SCIOLPASSTHRUHBA operation fails, the subroutine returns a value of -1 and sets the errno global
variable to a nonzero value. In this case, the caller should evaluate the returned status bytes to determine
why the operation was unsuccessful and what recovery actions should be taken.
The adapter device driver manages the transport layer but not the devices. It can send and receive
commands, but it cannot interpret the contents of the command. The lower driver also provides recovery
and logging for errors related to the transport layer and system I/O hardware. Management of the device
specifics is left to the device driver. The interface of the two drivers supports communication between the
upper driver and the different transport layer adapters without requiring special code paths for each
adapter.
The device driver also provides recovery and logging for errors related to the device that it controls.
The operating system provides several kernel services that let the device driver communicate with adapter
device driver entry points without having the actual name or address of those entry points. See “Logical
File System Kernel Services” on page 66 for more information.
Communication between Devices
When two devices communicate, one assumes the initiator-mode role, and the other assumes the
target-mode role. The initiator-mode device generates the command, which requests an operation, and the
target-mode device receives the command and acts. It is possible for a device to perform both roles
simultaneously.
When writing a new adapter device driver, the writer must know which mode or modes must be supported
to meet the requirements of the adapter and any interfaced device drivers.
Initiator-Mode Support
The interface between the device driver and the adapter device driver for initiator-mode support (that is,
the attached device acts as a target) is accessed through calls to the adapter device driver open, close,
ioctl, and strategy subroutines. I/O requests are queued to the adapter device driver through calls to its
strategy entry point.
Communication between the device driver and the adapter device driver for a particular initiator I/O
request is made through the scsi_buf structure, which is passed to and from the strategy subroutine in
the same way a standard driver uses a struct buf structure.
Fast Failure of I/O is controlled by a new fscsi device attribute, fc_err_recov. The default setting for this
attribute is delayed_fail, which is the I/O failure behavior seen in previous versions of AIX. To enable Fast
I/O Failure, set this attribute to fast_fail, as shown in the example:
chdev -l fscsi0 -a fc_err_recov=fast_fail
In this example, the fscsi device instance is fscsi0. Fast fail logic is called when the adapter driver
receives an indication from the switch that there has been a link event involving a remote storage device
port by way of a Registered State Change Notification (RSCN) from the switch.
Fast I/O Failure is useful in situations where multipathing software is used. Setting the fc_err_recov
attribute to fast_fail can decrease the I/O fail times because of link loss between the storage device and
switch. This would support faster failover to alternate paths.
In single-path configurations, especially configurations with a single path to a paging device, the
delayed_fail default setting is recommended.
If any of these requirements is not met, the fscsi device logs an error log of type INFO indicating that one
of these requirements is not met and that Fast I/O Failure is not enabled.
If dynamic tracking of FC devices is enabled, the FC adapter driver detects when the Fibre Channel
N_Port ID of a device changes. The FC adapter driver then reroutes traffic destined for that device to the
new address while the devices are still online. Events that can cause an N_Port ID to change include
moving a cable between a switch and storage device from one switch port to another, connecting two
separate switches using an inter-switch link (ISL), and possibly rebooting a switch.
Dynamic tracking of FC devices is controlled by a new fscsi device attribute, dyntrk. The default setting
for this attribute is no. To enable dynamic tracking of FC devices, set this attribute to dyntrk=yes, as shown
in the example:
chdev -l fscsi0 -a dyntrk=yes
In this example, the fscsi device instance is fscsi0. Dynamic tracking logic is called when the adapter
driver receives an indication from the switch that there has been a link event involving a remote storage
device port.
Note: The sn_location attribute might not be displayed, so running the lsattr command on an hdisk,
for example, might not show the attribute even though it could be present in ODM.
v The FC device drivers can track devices on a SAN fabric, which is a fabric as seen from a single host
bus adapter, if the N_Port IDs on the fabric stabilize within about 15 seconds. If cables are not reseated
or N_Port IDs continue to change after the initial 15 seconds, I/O failures could result.
v Devices are not tracked across host bus adapters. Devices only track if they remain visible from the
same HBA that they were originally connected to.
For example, if device A is moved from one location to another on fabric A attached to host bus adapter
A (in other words, its N_Port on fabric A changes), the device is seamlessly tracked without any user
intervention, and I/O to this device can continue.
However, if a device A is visible from HBA A but not from HBA B, and device A is moved from the fabric
attached to HBA A to the fabric attached to HBA B, device A is not accessible on fabric A nor on fabric
B. User intervention would be required to make it available on fabric B by running the cfgmgr
command. The AIX device instance on fabric A would no longer be usable, and a new device instance
on fabric B would be created. This device would have to be added manually to volume groups,
multipath device instances, and so on. This is essentially the same as removing a device from fabric A
and adding a new device to fabric B.
v No dynamic tracking can be performed for FC dump devices while an AIX system dump is in progress.
In addition, dynamic tracking is not supported when booting or running the cfgmgr command. SAN
changes should not be made while any of these operations are in progress.
v After devices are tracked, ODM might contain stale information because Small Computer System
Interface (SCSI) IDs in ODM no longer reflect actual SCSI IDs on the SAN. ODM remains in this state
until cfgmgr is run manually or the system is rebooted (provided all drivers, including any third party FC
SCSI target drivers, are dynamic-tracking capable). If cfgmgr is run manually, cfgmgr must be run on
all affected fscsi devices. This can be accomplished by running cfgmgr without any options, or by
running cfgmgr on each fscsi device individually.
Note: Running cfgmgr at run time to recalibrate the SCSI IDs might not update the SCSI ID in ODM
for a storage device if the storage device is currently opened, such as when volume groups are
varied on. The cfgmgr command must be run on devices that are not opened or the system
must be rebooted to recalibrate the SCSI IDs. Stale SCSI IDs in ODM have no adverse effect on
the FC drivers, and recalibration of SCSI IDs in ODM is not necessary for the FC drivers to
function properly. Any applications that communicate with the adapter driver directly using ioctl
calls and that use the SCSI ID values from ODM, however, must be updated (see the next bullet)
to avoid using potentially stale SCSI IDs.
v All applications and kernel extensions that communicate with the FC adapter driver, either through ioctl
calls or directly to the FC driver’s entry points, must support the version 1 ioctl and scsi_buf APIs of
the FC adapter driver to work properly with FC dynamic tracking. Noncompliant applications or kernel
extensions might not function properly or might even fail after a dynamic tracking event. If the FC
adapter driver detects an application or kernel extension that is not adhering to the new version 1 ioctl
and scsi_buf API, an error log of type INFO is generated and dynamic tracking might not be enabled for
the device that this application or kernel extension is trying to communicate with.
For ISVs developing kernel extensions or applications that communicate with the AIX Fibre Channel
Driver stack, refer to the “Required FCP, iSCSI, and Virtual SCSI Client Adapter Device Driver ioctl
Commands” on page 297 and “Understanding the scsi_buf Structure” on page 286 for changes
necessary to support dynamic tracking.
v Even with dynamic tracking enabled, users should make SAN changes, such as moving or swapping
cables and establishing ISL links, during maintenance windows. Making SAN changes during full
production runs is discouraged because the interval of time to perform any SAN changes is too short.
Cables that are not reseated correctly, for example, could result in I/O failures. Performing these
operations during periods of little or no traffic minimizes the impact of I/O failures.
The base AIX FC SCSI Disk and FC SCSI Tape and FastT device drivers support dynamic tracking. The
IBM ESS, EMC Symmetrix, and HDS storage devices support dynamic tracking provided that the vendor
provides the ODM filesets with the necessary sn_location and node_name attributes. Contact the storage
vendor if you are not sure if your current level of ODM fileset supports dynamic tracking.
If vendor-specific ODM entries are not being used for the storage device, but the ESS, Symmetrix, or HDS
storage subsystem is configured with the MPIO Other FC SCSI Disk message, dynamic tracking is
supported for the devices in this configuration. This supersedes the need for the sn_location attribute. All
current AIX Path Control Modules (PCM) shipped with the AIX base support dynamic tracking.
The STK tape device using the standard AIX device driver also supports dynamic tracking provided the
STK fileset contains the necessary sn_location and node_name attributes.
Note: SAN changes involving tape devices should be made with no active I/O. Because of the serial
nature of tape devices, a single I/O failure can cause an application to fail, including tape backups.
dyntrk    fc_err_recov    FC Driver Behavior
yes       fast_fail       If the driver receives a Registered State Change Notification (RSCN)
                          from the switch, this could indicate a link loss between a remote
                          storage port and switch. After an initial 15-second delay, the FC
                          drivers query to see if the device is on the fabric. If not, I/Os are
                          flushed back by the adapter. Future retries or new I/Os fail
                          immediately if the device is still not on the fabric. The storage
                          driver (disk, tape, FastT) will likely not delay between retries. If
                          the FC drivers detect the device is on the fabric but the SCSI ID has
                          changed, the FC device drivers reroute traffic to the new SCSI ID.
When dynamic tracking is disabled, there is a marked difference between the delayed_fail and fast_fail
settings of the fc_err_recov attribute. However, with dynamic tracking enabled, the setting of the
fc_err_recov attribute is less significant. This is because there is some overlap in the dynamic tracking
and fast fail error-recovery policies. Therefore, enabling dynamic tracking inherently enables some of the
fast fail logic.
The general error recovery procedure when a device is no longer reachable on the fabric is the same for
both fc_err_recov settings with dynamic tracking enabled. The minor difference is that the storage drivers
can choose to inject delays between I/O retries if fc_err_recov is set to delayed_fail. This increases the
I/O failure time by an additional amount, depending on the delay value and number of retries, before
permanently failing the I/O. With high I/O traffic, however, the difference between delayed_fail and
fast_fail might be more noticeable.
SAN administrators might want to experiment with these settings to find the correct combination of settings
for their environment.
A device driver can register a particular device instance for receiving asynchronous event status by calling
the SCIOLEVENT ioctl operation for the adapter device driver. When an event covered by the
SCIOLEVENT ioctl operation is detected by the adapter device driver, it builds an scsi_event_info
structure and passes a pointer to that structure to the previously registered asynchronous
event-handler routine entry point. The fields in the structure are filled in by the adapter device driver as
follows:
scsi_id
For initiator mode, this is set to the SCSI ID or SCSI ID alias of the attached target device. For
target mode, this is set to the SCSI ID or SCSI ID alias of the attached initiator device.
lun_id
For initiator mode, this is set to the SCSI LUN of the attached target device. For target mode, this
is set to 0.
The information reported in the scsi_event_info.events field does not queue to the device driver, but is
instead reported as one or more flags as they occur. Because the data does not queue, the adapter
device driver writer can use a single scsi_event_info structure and pass it one at a time, by pointer, to
each asynchronous event handler routine for the appropriate device instance. After determining for which
device the events are being reported, the device driver must copy the scsi_event_info.events field into
local space and must not modify the contents of the rest of the scsi_event_info structure.
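The copy step described above can be sketched as follows. This is a minimal user-space illustration, not the system definition: the event_info_sketch and dev_state structures, their field types, and the note_async_event function are hypothetical stand-ins that model only the fields named in this section.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the scsi_event_info structure. Only the fields described
 * in the text are modeled; the field types are assumptions. */
struct event_info_sketch {
    uint64_t scsi_id;   /* SCSI ID or alias of the attached device */
    uint64_t lun_id;    /* LUN in initiator mode, 0 in target mode */
    uint32_t events;    /* event flags, reported as they occur */
};

/* Hypothetical per-device state kept by the device driver. */
struct dev_state {
    uint64_t scsi_id;
    uint64_t lun_id;
    uint32_t pending_events;  /* local copy of the reported flags */
};

/* Finds the device instance the event refers to and merges the reported
 * flags into its local copy. The structure is taken through a const
 * pointer because the handler must not modify it. Returns the matching
 * device, or NULL if no device matches. */
struct dev_state *note_async_event(struct dev_state *devs, size_t ndevs,
                                   const struct event_info_sketch *info)
{
    for (size_t i = 0; i < ndevs; i++) {
        if (devs[i].scsi_id == info->scsi_id &&
            devs[i].lun_id == info->lun_id) {
            devs[i].pending_events |= info->events;
            return &devs[i];
        }
    }
    return NULL;
}
```

Because the flags are merged into driver-owned storage, the adapter device driver is free to reuse its single structure for the next handler call.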
Because the event status is optional, the device driver writer determines which action is necessary to take
upon receiving event status. The writer might decide to save the status and report it back to the calling
application, or the device driver or application level program can take error recovery actions.
The unrecoverable adapter command failure event is not necessarily a fatal condition, but it can indicate
that the adapter is not functioning properly. Possible actions by the application program include:
v Ending the session with the device in the near future.
v Ending the session after multiple (two or more) such events.
v Attempting to continue the session indefinitely.
The SCSI Reset detection event is mainly intended as information only, but can be used by the application
to perform further actions, if necessary.
The maximum buffer usage detected event only applies to a given target-mode device; it will not be
reported for an initiator-mode device. This event indicates to the application that this particular target-mode
device instance has filled its maximum allotted buffer space. The application should perform read system
calls fast enough to prevent this condition. If this event occurs, data is not lost, but it is delayed to prevent
further buffer usage. Data reception will be restored when the application empties enough buffers to
continue reasonable operations. The num_bufs attribute might need to be increased to help minimize this
problem. Also, it is possible that regardless of the number of buffers, the application simply is not
processing received data fast enough. This might require some fine tuning of the application’s data
processing routines.
Because the event handling routine is running on the hardware interrupt level, the device driver must be
careful to limit operations in that routine. Processing should be kept to a minimum. In particular, if any
error recovery actions are performed, it is recommended that the event-handling routine set state or status
flags only and allow a process level routine to perform the actual operations.
The device driver must be careful to disable interrupts at the correct level in places where the device
driver’s lower execution priority routines manipulate variables that are also modified by the event-handling
routine. To allow the device driver to disable at the correct level, the adapter device driver writer must
provide a configuration database attribute that defines the interrupt class, or priority, it runs on. This
attribute must be named intr_priority so that the device driver configuration method knows which attribute
of the parent adapter to query. The device driver configuration method should then pass this interrupt
priority value to the device driver along with other configuration data for the device instance.
The SCSI device driver writer must follow any other general system rules for writing a routine that must
execute in an interrupt environment. For example, the routine must not attempt to sleep or wait on I/O
operations. It can perform wakeups to allow the process level to handle those operations.
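The split between interrupt-level and process-level work can be modeled in portable C as shown below. This is only a sketch under stated assumptions: the C11 atomic operations stand in for disabling interrupts at the intr_priority level, which is what a kernel driver would actually do, and EV_RECOVERY_NEEDED, event_handler_sketch, and process_level_sketch are hypothetical names.

```c
#include <stdatomic.h>

#define EV_RECOVERY_NEEDED 0x1u  /* hypothetical driver-private flag */

static atomic_uint pending_flags;   /* shared between the two levels */
static unsigned recoveries_done;    /* touched only at process level */

/* Interrupt-level side: record that work is needed and return at once.
 * It does not sleep or wait on I/O. A real driver would follow this
 * with a wakeup of the process-level routine. */
void event_handler_sketch(unsigned events)
{
    if (events != 0)
        atomic_fetch_or(&pending_flags, EV_RECOVERY_NEEDED);
}

/* Process-level side: atomically take and clear the flags, then perform
 * the (possibly slow) recovery outside the interrupt environment.
 * Returns the number of recoveries performed so far. */
unsigned process_level_sketch(void)
{
    unsigned flags = atomic_exchange(&pending_flags, 0u);
    if (flags & EV_RECOVERY_NEEDED)
        recoveries_done++;   /* placeholder for the real recovery work */
    return recoveries_done;
}
```

Several events arriving before the process-level routine runs collapse into a single recovery pass, which matches the non-queuing nature of the event flags.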
Because the device driver copies the information from the scsi_event_info.events field on each call to its
asynchronous event-handling routine, there is no resource to free and no information that must be passed
back later to the adapter device driver.
Autosense Data
When a device returns a check condition or command terminated (the scsi_buf.scsi_status will have the
value of SC_CHECK_CONDITION or SC_COMMAND_TERMINATED, respectively), it will also return the request sense
data.
Note: Subsequent commands to the device will clear the request sense data.
If the device driver has specified a valid autosense buffer (scsi_buf.autosense_length > 0 and the
scsi_buf.autosense_buffer_ptr field is not NULL), then the adapter device driver will copy the returned
autosense data into the buffer referenced by scsi_buf.autosense_buffer_ptr. When this occurs, the
adapter device driver will set the SC_AUTOSENSE_DATA_VALID flag in the scsi_buf.adap_set_flags.
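A device driver might test for usable autosense data along the following lines. This is a sketch only: scsi_buf_sketch is a hypothetical stand-in that models just the fields named above, the field types are assumptions, and the numeric values given to the constants are placeholders (the real definitions come from the system header files).

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder values for illustration; the real definitions come from
 * the system header files. */
#define SC_CHECK_CONDITION       0x02
#define SC_COMMAND_TERMINATED    0x22
#define SC_AUTOSENSE_DATA_VALID  0x01

/* Stand-in for the scsi_buf fields used in this check. */
struct scsi_buf_sketch {
    uint8_t  scsi_status;
    uint32_t adap_set_flags;
    uint8_t  autosense_length;
    char    *autosense_buffer_ptr;
};

/* Returns 1 if the device driver supplied a valid autosense buffer. */
int autosense_buffer_ok(const struct scsi_buf_sketch *sb)
{
    return sb->autosense_length > 0 && sb->autosense_buffer_ptr != NULL;
}

/* Returns 1 if sense data is already in the driver's buffer, 0 if it
 * is not available through the autosense mechanism. */
int autosense_available(const struct scsi_buf_sketch *sb)
{
    int sense_status = sb->scsi_status == SC_CHECK_CONDITION ||
                       sb->scsi_status == SC_COMMAND_TERMINATED;

    return sense_status && autosense_buffer_ok(sb) &&
           (sb->adap_set_flags & SC_AUTOSENSE_DATA_VALID) != 0;
}
```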
FCP, iSCSI, and Virtual SCSI Client Initiator-Mode Recovery When Not
Command Tag Queuing
If an error such as a check condition or hardware failure occurs, the transaction active during the error is
returned with the scsi_buf.bufstruct.b_error field set to EIO. Other transactions in the queue might be
returned with the scsi_buf.bufstruct.b_error field set to ENXIO. If the adapter driver decides not to return
other outstanding commands it has queued to it, then the failed transaction will be returned to the device
driver with an indication that the queue for this device is not cleared by setting the
SC_DID_NOT_CLEAR_Q flag in the scsi_buf.adap_q_status field. The device driver should process or
recover the condition, rerunning any mode selects or device reservations to recover from this condition
properly. After this recovery, it should reschedule the transaction that had the error. In many cases, the
device driver only needs to retry the unsuccessful operation.
The adapter device driver should never retry a SCSI command on error after the command has
successfully been given to the adapter. The consequences for retrying a command at this point range from
minimal to catastrophic, depending upon the type of device. Commands for certain devices cannot be
retried immediately after a failure (for example, tapes and other sequential access devices). If such an
error occurs, the failed command returns an appropriate error status with an iodone call to the device
driver for error recovery. Only the device driver that originally issued the command knows if the command
can be retried on the device. The adapter device driver must only retry commands that were never
successfully transferred to the adapter. In this case, if retries are successful, the scsi_buf status should
not reflect an error. However, the adapter device driver should perform error logging on the retried
condition.
The first transaction passed to the adapter device driver during error recovery must include a special flag.
This SC_RESUME flag in the scsi_buf.flags field must be set to inform the adapter device driver that the
device driver has recognized the fatal error and is beginning recovery operations. Any transactions passed
to the adapter device driver, after the fatal error occurs and before the SC_RESUME transaction is issued,
should be flushed; that is, returned with an error type of ENXIO through an iodone call.
Note: If a device driver continues to pass transactions to the adapter device driver after the adapter
device driver has flushed the queue, these transactions are also flushed with an error return of
ENXIO through the iodone service. This gives the device driver a positive indication of all
transactions flushed.
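The flushing rule can be modeled with a small state check. The following is a simplified sketch, not the adapter device driver's implementation: the structures, the halted field, and accept_transaction are hypothetical, and the value of SC_RESUME is a placeholder.

```c
#include <errno.h>

#define SC_RESUME 0x01  /* placeholder value for illustration */

/* Stand-in for the scsi_buf fields used here. */
struct scsi_buf_sketch {
    unsigned flags;    /* flags set by the device driver */
    int      b_error;  /* stands in for bufstruct.b_error */
};

/* Hypothetical per-device state in the adapter device driver. */
struct adap_queue_sketch {
    int halted;  /* nonzero after a fatal error, until SC_RESUME arrives */
};

/* Returns 1 if the transaction is accepted for processing, or 0 if it
 * is flushed back with b_error set to ENXIO (in the real driver this
 * return path is an iodone call). */
int accept_transaction(struct adap_queue_sketch *q,
                       struct scsi_buf_sketch *sb)
{
    if (q->halted && !(sb->flags & SC_RESUME)) {
        sb->b_error = ENXIO;
        return 0;
    }
    q->halted = 0;   /* SC_RESUME was seen, or the queue was never halted */
    sb->b_error = 0;
    return 1;
}
```

In this model every transaction issued between the fatal error and the SC_RESUME transaction comes back with ENXIO, so the device driver can account for each one.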
Initiator-Mode Recovery During Command Tag Queuing
If the device driver is queuing multiple transactions to the device and either a check condition error or a
command terminated error occurs, the adapter driver does not clear all transactions in its queues for the
device. It returns the failed transaction to the device driver with an indication that the queue for this device
is not cleared by setting the SC_DID_NOT_CLEAR_Q flag in the scsi_buf.adap_q_status field. The
adapter driver halts the queue for this device awaiting error recovery notification from the device driver.
The device driver then has three options to recover from this error:
v Send one error recovery command (request sense) to the device.
v Clear the adapter driver’s queue for this device.
v Resume the adapter driver’s queue for this device.
When the adapter driver’s queue is halted, the device driver can get sense data from a device by setting
the SC_RESUME flag in the scsi_buf.flags field and the SC_NO_Q flag in scsi_buf.q_tag_msg field of
the request-sense scsi_buf. This action notifies the adapter driver that this is an error-recovery transaction
and should be sent to the device while the remainder of the queue for the device remains halted. When
the request sense completes, the device driver needs to either clear or resume the adapter driver’s queue
for this device.
The device driver can notify the adapter driver to clear its halted queue by sending a transaction with the
SC_Q_CLR flag in the scsi_buf.flags field. This transaction must not contain a command because it is
cleared from the adapter driver’s queue without being sent to the adapter. However, this transaction must
have the SCSI ID field (scsi_buf.scsi_id) and the LUN field (scsi_buf.lun_id) filled in with the device’s
SCSI ID and logical unit number (LUN), respectively. Upon receiving an SC_Q_CLR transaction, the
adapter driver flushes all transactions for this device and sets their scsi_buf.bufstruct.b_error fields to
ENXIO. The device driver must wait until the scsi_buf with the SC_Q_CLR flag set is returned before it
resumes issuing transactions. The first transaction sent by the device driver after it receives the returned
SC_Q_CLR transaction must have the SC_RESUME flag set in the scsi_buf.flags fields.
If the device driver wants the adapter driver to resume its halted queue, it must send a transaction with the
SC_Q_RESUME flag set in the scsi_buf.flags field. This transaction can contain an actual command, but
it is not required. However, this transaction must have the SCSI ID field (scsi_buf.scsi_id) and the LUN
field (scsi_buf.lun_id) filled in with the device’s SCSI ID and logical unit number (LUN). If this is the first
transaction issued by the device driver after receiving the error (indicating that the adapter driver’s queue
is halted), then the SC_RESUME flag must be set as well as the SC_Q_RESUME flag.
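The three recovery options can be sketched as helper routines that prepare the recovery transaction. This is an illustration under stated assumptions: scsi_buf_sketch, its cdb and cdb_len fields, and the prep_ functions are hypothetical, and the flag values are placeholders rather than the values in the system header files.

```c
#include <stdint.h>
#include <string.h>

/* Placeholder values for illustration only. */
#define SC_RESUME    0x01
#define SC_Q_CLR     0x02
#define SC_Q_RESUME  0x04
#define SC_NO_Q      0x00

/* Stand-in for the scsi_buf fields used during recovery. */
struct scsi_buf_sketch {
    uint32_t flags;
    uint8_t  q_tag_msg;
    uint64_t scsi_id;
    uint64_t lun_id;
    uint8_t  cdb_len;   /* hypothetical: 0 means no command is present */
    uint8_t  cdb[16];
};

static void init_for_device(struct scsi_buf_sketch *sb,
                            uint64_t id, uint64_t lun)
{
    memset(sb, 0, sizeof *sb);
    sb->scsi_id = id;    /* required for every recovery transaction */
    sb->lun_id = lun;
}

/* Option 1: one request sense sent while the queue stays halted. */
void prep_request_sense(struct scsi_buf_sketch *sb, uint64_t id,
                        uint64_t lun, uint8_t alloc_len)
{
    init_for_device(sb, id, lun);
    sb->flags = SC_RESUME;
    sb->q_tag_msg = SC_NO_Q;
    sb->cdb[0] = 0x03;        /* REQUEST SENSE operation code */
    sb->cdb[4] = alloc_len;   /* allocation length */
    sb->cdb_len = 6;
}

/* Option 2: clear the halted queue; the transaction carries no command. */
void prep_queue_clear(struct scsi_buf_sketch *sb, uint64_t id, uint64_t lun)
{
    init_for_device(sb, id, lun);
    sb->flags = SC_Q_CLR;
}

/* Option 3: resume the halted queue; SC_RESUME is added when this is
 * the first transaction issued after the error. */
void prep_queue_resume(struct scsi_buf_sketch *sb, uint64_t id,
                       uint64_t lun, int first_after_error)
{
    init_for_device(sb, id, lun);
    sb->flags = SC_Q_RESUME;
    if (first_after_error)
        sb->flags |= SC_RESUME;
}
```

A driver that uses option 1 would follow the completed request sense with either option 2 or option 3, because the queue remains halted until one of them is issued.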
If autosense data is returned from the device, then the
adapter driver will also fill in the adap_set_flags and autosense_buffer_ptr fields of the scsi_buf
structure. When a transaction completes, the scsi_intr routine causes the scsi_buf entry to be
removed from the device queue and calls the iodone kernel service, passing the just dequeued
scsi_buf structure for the device as the parameter. The scsi_start routine is then called again to
process the next transaction on the device queue. The iodone kernel service calls the device driver
dd_iodone entry point, signaling the device driver that the particular transaction has completed.
5. The device driver dd_iodone routine investigates the I/O completion codes in the scsi_buf status
entries and performs error recovery, if required. If the operation completed correctly, the device driver
dequeues the original buffer structures. It calls the iodone kernel service with the original buffer
pointers to notify the originator of the request.
Internal commands differ from operating system-initiated transactions in several ways. The primary
difference is that the device driver is required to generate a struct buf that is not related to a specific
request. Also, the actual commands are typically more control-oriented than data transfer-related.
The only special requirement for commands with short data-phase transfers (less than or equal to 256
bytes) is that the device driver must have pinned the memory being transferred into or out of system
memory pages. However, due to system hardware considerations, additional precautions must be taken for
data transfers into system memory pages when the transfers are larger than 256 bytes. The problem is
that any system memory area with a DMA data operation in progress causes the entire memory page that
contains it to become inaccessible.
As a result, a device driver that initiates an internal command with more than 256 bytes must have
preallocated and pinned an area whose size is some multiple of the system page size. The driver must not
place in this area any other data areas that it may need to access while I/O is being performed into or out
of that page. Memory pages so allocated must be avoided by the device driver from the moment the
transaction is passed to the adapter device driver until the device driver iodone routine is called for the
transaction (and for any other transactions to those pages).
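The sizing rule for the dedicated area amounts to rounding up to a whole number of pages. The following sketch shows the arithmetic only; internal_cmd_area_size and SHORT_XFER_LIMIT are hypothetical names, and the page size is passed in because its value depends on the system.

```c
#include <stddef.h>

#define SHORT_XFER_LIMIT 256u  /* threshold from the text, in bytes */

/* Returns the size of the area to preallocate and pin for an internal
 * command that transfers xfer_len bytes. Short transfers only need to
 * be pinned, so their length is returned unchanged; larger transfers
 * are rounded up to a whole number of pages so that nothing else
 * shares those pages. page_size must be nonzero. */
size_t internal_cmd_area_size(size_t xfer_len, size_t page_size)
{
    if (xfer_len <= SHORT_XFER_LIMIT)
        return xfer_len;
    return ((xfer_len + page_size - 1) / page_size) * page_size;
}
```

For example, assuming a 4096-byte page, a 257-byte transfer would reserve one full page and a 4097-byte transfer would reserve two.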
The device driver can send only one scsi_buf structure per call to the adapter device driver. Thus, the
scsi_buf.bufstruct.av_forw pointer should be null when given to the adapter device driver, which
indicates that this is the only request. The device driver can queue multiple scsi_buf requests by making
multiple calls to the adapter device driver strategy routine.
To enhance the transport layer performance, the device driver should consolidate multiple queued requests
when possible into a single command. To allow the adapter device driver the ability to handle the scatter
and gather operations required, the scsi_buf.bp should always point to the first buf structure entry for the
spanned transaction. A null-terminated list of additional struct buf entries should be chained from the first
field through the buf.av_forw field to give the adapter device driver enough information to perform the
DMA scatter and gather operations required. This information must include at least the buffer’s starting
address, length, and cross-memory descriptor.
The spanned requests should always be for requests in either the read or write direction but not both,
since the adapter device driver must be given a single command to handle the requests. The spanned
request should always consist of complete I/O requests (including the additional struct buf entries). The
device driver should not attempt to use partial requests to reach the maximum transfer size.
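A device driver could validate a candidate chain before handing it to the adapter device driver, as in the following sketch. The buf_sketch structure is a hypothetical stand-in for struct buf that keeps only three fields, B_READ_SKETCH is an invented direction bit, and spanned_length is an illustrative helper rather than a kernel service.

```c
#include <stddef.h>

#define B_READ_SKETCH 0x1u  /* hypothetical direction bit: set for reads */

/* Stand-in for struct buf; only the fields needed here are modeled. */
struct buf_sketch {
    struct buf_sketch *av_forw;   /* next entry; NULL ends the chain */
    unsigned           b_flags;
    size_t             b_bcount;  /* bytes in this request */
};

/* Walks the null-terminated chain of complete requests. Returns the
 * total byte count, or 0 if the chain is empty, mixes read and write
 * requests, or exceeds max_xfer. */
size_t spanned_length(const struct buf_sketch *first, size_t max_xfer)
{
    size_t total = 0;
    unsigned dir;

    if (first == NULL)
        return 0;
    dir = first->b_flags & B_READ_SKETCH;

    for (const struct buf_sketch *bp = first; bp != NULL; bp = bp->av_forw) {
        if ((bp->b_flags & B_READ_SKETCH) != dir)
            return 0;             /* both directions in one chain */
        total += bp->b_bcount;
        if (total > max_xfer)
            return 0;             /* too large for a single command */
    }
    return total;
}
```

When the check fails, the driver in this model would drop the last complete request from the chain and try again, rather than splitting a request to fill the remaining space.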
The maximum transfer size is actually adapter-dependent. The IOCINFO ioctl operation can be used to
discover the adapter device driver’s maximum allowable transfer size. To ease the design, implementation,
and testing of components that may need to interact with multiple adapter device drivers, a required
minimum size has been established that all adapter device drivers must be capable of supporting. The
value of this minimum/maximum transfer size is defined as the following value in the
/usr/include/sys/scsi.h file:
SC_MAXREQUEST /* maximum transfer request for a single */
/* FCP or iSCSI command (in bytes) */
If a transfer size larger than the supported maximum is attempted, the adapter device driver returns a
value of EINVAL in the scsi_buf.bufstruct.b_error field.
Due to system hardware requirements, the device driver must consolidate only commands that are
memory page-aligned at both their starting and ending addresses. Specifically, this applies to the
consolidation of inner memory buffers. The ending address of the first buffer and the starting address of all
subsequent buffers should be memory page-aligned. However, the starting address of the first memory
buffer and the ending address of the last do not need to be aligned so.
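The alignment rule can be expressed as a check over the buffers in chain order. This is a sketch under stated assumptions: seg_sketch and can_consolidate are hypothetical names, and the page size is a parameter because its value depends on the system.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one memory buffer in the chain. */
struct seg_sketch {
    uintptr_t start;  /* starting address */
    size_t    len;    /* length in bytes */
};

/* Returns 1 if the buffers can be consolidated under the alignment
 * rule: every buffer except the first must start on a page boundary,
 * and every buffer except the last must end on one. page_size must be
 * nonzero. */
int can_consolidate(const struct seg_sketch *segs, size_t n,
                    size_t page_size)
{
    for (size_t i = 0; i < n; i++) {
        uintptr_t end = segs[i].start + segs[i].len;

        if (i > 0 && segs[i].start % page_size != 0)
            return 0;
        if (i + 1 < n && end % page_size != 0)
            return 0;
    }
    return 1;
}
```

A single buffer always passes, because the rule constrains only the boundaries between adjacent buffers.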
The purpose of consolidating transactions is to decrease the number of commands and transport layer
phases required to perform the required operation. The time required to maintain the simple chain of buf
structure entries is significantly less than the overhead of multiple (even two) transport layer transactions.
Fragmented Commands
Single I/O requests larger than the maximum transfer size must be divided into smaller requests by the
device driver. For calls to a device driver’s character I/O (read/write) entry points, the uphysio kernel
service can be used to break up these requests. For a fragmented command such as this, the
scsi_buf.bp field should be null so that the adapter device driver uses only the information in the
scsi_buf structure to prepare for the DMA operation.
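The arithmetic behind fragmenting a large request is shown in the following sketch. It is not the uphysio kernel service and does not call it; fragment_count and fragment_extent are hypothetical helpers that only compute where each fragment begins and how long it is.

```c
#include <stddef.h>

/* Number of fragments needed to move total bytes when no fragment may
 * exceed max_xfer bytes. max_xfer must be nonzero. */
size_t fragment_count(size_t total, size_t max_xfer)
{
    return (total + max_xfer - 1) / max_xfer;
}

/* Computes the offset and length of fragment number idx (counting from
 * 0). Returns 0 on success, or -1 if max_xfer is 0 or idx is past the
 * last fragment. */
int fragment_extent(size_t total, size_t max_xfer, size_t idx,
                    size_t *offset, size_t *len)
{
    size_t remaining;

    if (max_xfer == 0 || idx >= fragment_count(total, max_xfer))
        return -1;
    *offset = idx * max_xfer;
    remaining = total - *offset;
    *len = remaining < max_xfer ? remaining : max_xfer;
    return 0;
}
```

Each fragment computed this way would be described entirely by its own scsi_buf structure, with the scsi_buf.bp field left null as described above.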
Command tag queuing refers to queuing multiple commands to a device. Queuing to the device can
improve performance because the device itself determines the most efficient way to order and process
commands. Devices that support command tag queuing can be divided into two classes: those that clear
their queues on error and those that do not. Devices that do not clear their queues on error resume
processing of queued commands when the error condition is cleared (either by receiving the next
command for NACA=0 error recovery or by receiving a Clear ACA task management command for
NACA=1 error recovery). Devices that do clear their queues flush all commands currently outstanding.
Command tag queuing requires the adapter, the device, the device driver, and the adapter driver to
support this capability. For a device driver to queue multiple commands to a device (that supports
command tag queuing), it must be able to provide at least one of the following values in the
scsi_buf.q_tag_msg:
v SC_SIMPLE_Q
v SC_HEAD_OF_Q
v SC_ORDERED_Q
The disk device driver and adapter driver do support this capability. This implementation provides some
queuing-specific changeable attributes for disks that can queue commands. With this information, the disk
device driver attempts to queue to the disk, first by queuing commands to the adapter driver. The adapter
driver then queues these commands to the adapter, providing that the adapter supports command tag
queuing. If the adapter does not support command tag queuing, then the adapter driver sends only one
command at a time to the adapter and so multiple commands are not queued to the disk.
v The adapter_status field is an output parameter that is valid when its status_validity bit is nonzero.
The scsi_buf.bufstruct.b_error field should be set to EIO any time the adapter_status field is valid.
This field contains generic adapter card status. It is intentionally general in coverage so that it can
report error status from any typical adapter.
If an error is detected while a command is running, and the error prevented the command from
actually being sent to the transport layer by the adapter, then the error should be processed or
recovered, or both, by the adapter device driver.
If it is recovered successfully by the adapter device driver, the error is logged, as appropriate, but is not
reflected in the adapter_status byte. If the error cannot be recovered by the adapter device driver, the
appropriate adapter_status bit is set and the scsi_buf structure is returned to the device driver for
further processing.
If an error is detected after the command was actually sent to the device, then it should be processed
or recovered, or both, by the device driver.
For error logging, the adapter device driver logs transport layer and adapter-related conditions, and the
device driver logs device-related errors. In the following description, a capital letter (A) after the error
name indicates that the adapter device driver handles error logging. A capital letter (H) indicates that the
device driver handles error logging.
Some of the following error conditions indicate a device failure. Others are transport layer or
adapter-related.
SCSI_HOST_IO_BUS_ERR (A)
The system I/O transport layer generated or detected an error during a DMA or Programmed
I/O (PIO) transfer.
SCSI_TRANSPORT_FAULT (H)
The transport protocol or hardware was unsuccessful.
SCSI_CMD_TIMEOUT (H)
The command timed out before completion.
SCSI_NO_DEVICE_RESPONSE (H)
The target device did not respond to selection phase.
SCSI_ADAPTER_HDW_FAILURE (A)
The adapter indicated an onboard hardware failure.
SCSI_ADAPTER_SFW_FAILURE (A)
The adapter indicated microcode failure.
SCSI_FUSE_OR_TERMINAL_PWR (A)
The adapter indicated a blown terminator fuse or bad termination.
SCSI_TRANSPORT_RESET (A)
The adapter indicated the transport layer has been reset.
SCSI_WW_NAME_CHANGE (A)
The adapter indicated the device at this SCSI ID has a new world wide name. For AIX 5.2 with
5200-01 and later, if Dynamic Tracking of FC Devices is enabled, the adapter driver has
detected a change to the scsi_id field for this device and a scsi_buf structure with the
SC_DEV_RESTART flag can be sent to the device. For more information, see the description of
the SC_DEV_RESTART flag.
Note: Commands with the value of SC_NO_Q for the q_tag_msg field (except for request sense
commands) should not be queued to a device whose queue contains a command with another
value for q_tag_msg. If commands with the SC_NO_Q value (except for request sense) are sent to
the device, then the device driver must make sure that no active commands are using different
values for q_tag_msg. Similarly, the device driver must also make sure that a command with a
q_tag_msg value of SC_ORDERED_Q, SC_HEAD_OF_Q, or SC_SIMPLE_Q is not sent to a device that has a
command with the q_tag_msg field of SC_NO_Q.
v The flags field contains bit flags sent from the device driver to the adapter device driver. The following
flags are defined:
SC_CLEAR_ACA
When set, means the SCSI adapter driver should issue a Clear ACA task management request
for this ID/LUN. This flag should be used in conjunction with either the SC_Q_CLR or SC_Q_RESUME
flags to clear or resume the SCSI adapter driver’s queue for this device. If neither of these flags
is used, then this transaction is treated as if the SC_Q_RESUME flag is also set. The transaction
containing the SC_CLEAR_ACA flag setting does not require an actual SCSI command in the
scsi_buf. If this transaction contains a SCSI command then it will be processed depending on
whether SC_Q_CLR or SC_Q_RESUME is set.
This transaction must have the SCSI ID field (scsi_buf.scsi_id) and the LUN field
(scsi_buf.lun_id) filled in with the device’s SCSI ID and logical unit number (LUN). This flag is
valid only during error recovery of a check condition or command terminated at a command tag
queuing device.
SC_DELAY_CMD
When set, means the adapter device driver should delay sending this command (following a
reset or BDR to this device) by at least the number of seconds specified to the adapter device
driver in its configuration information. For devices that do not require this function, this flag
should not be set.
SC_DEV_RESTART
If a scsi_buf request fails with a status of SCSI_WW_NAME_CHANGE, a scsi_buf request
with the SC_DEV_RESTART flag can be sent if the device driver is dynamic tracking capable.
For AIX 5.2 with 5200-01 and later, if Dynamic Tracking of FC Devices is enabled, a scsi_buf
request with SC_DEV_RESTART performs a handshake, indicating that the device driver
acknowledges the device address change and that the FC adapter driver can proceed with
tracking operations. If the SC_DEV_RESTART flag is set, then the SC_Q_CLR flag must also
be set. In addition, no scsi command can be included in this scsi_buf structure. Failure to meet
these two criteria will result in a failure with adapter status of SCSI_ADAPTER_SFW_FAILURE.
After the SC_DEV_RESTART call completes successfully, the device driver performs device
validation procedures, such as those performed during an open (Test Unit Ready, Inquiry, Serial
Number validation, etc.), in order to confirm the identity of the device after the fabric event.
If an SC_DEV_RESTART call fails with any adapter status, the SC_DEV_RESTART call can be
retried as deemed appropriate by the device driver, because a future retry might succeed.
SC_LUN_RESET
When set, means the SCSI adapter driver should issue a Lun Reset task management request
for this ID/LUN. This flag should be used in conjunction with the SC_Q_CLR flag. The
transaction containing this flag setting does allow an actual command to be sent to the adapter
driver. However, this transaction must have the SCSI ID field (scsi_buf.scsi_id) and the
LUN field (scsi_buf.lun_id) filled in with the device’s SCSI ID and logical unit number (LUN). If
the transaction containing this flag setting is the first issued by the device driver after it receives
an error (indicating that the adapter driver’s queue is halted), then the SC_RESUME flag must
be set also.
SC_Q_CLR
When set, means the adapter driver should clear its transaction queue for this ID/LUN. The
transaction containing this flag setting does not require an actual command in the scsi_buf
because it is flushed back to the device driver with the rest of the transactions for this ID/LUN.
SC_AUTOSENSE_DATA_VALID
Autosense data was placed in the autosense buffer referenced by the autosense_buffer_ptr
field.
v The autosense_length field contains the length in bytes of the SCSI device driver’s sense buffer, which
is referenced via the autosense_buffer_ptr field. For devices this field must be non-zero, otherwise the
autosense data will be lost.
v The autosense_buffer_ptr field contains the address of the SCSI device driver’s autosense buffer for
this command. For devices this field must be non-NULL, otherwise the autosense data will be lost.
v The dev_burst_len field contains the burst size of this write operation, in bytes. This should only be set
by the device driver if it has negotiated with the device and the device allows bursts of write data without
transfer readys. For most operations, this should be set to 0.
v The scsi_id field contains the 64-bit SCSI ID for this device. This field must be set for devices.
v The lun_id field contains the 64-bit LUN ID for this device. This field must be set for devices.
v The kernext_handle field contains the pointer returned from the kernext_handle field of the
scsi_sciolst argument for the SCIOLSTART ioctl operation. For AIX 5.2 with 5200-01 and later, if
Dynamic Tracking of FC Devices is enabled, the kernext_handle field must be set for all scsi_buf
calls that are sent to the adapter driver. Failure to do so results in a failure with an adapter status of
SCSI_ADAPTER_SFW_FAILURE.
v The version field contains the version of this scsi_buf structure. Beginning with AIX 5.2, this field
should be set to a value of SCSI_VERSION_1. The version field of the scsi_buf structure should be
consistent with the version of the scsi_sciolst argument used for the SCIOLSTART ioctl operation.
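The field requirements above can be sketched as a pre-submission check. This is a minimal illustration only: struct scsi_buf_sketch is a simplified stand-in and not the real scsi_buf layout from /usr/include/sys/scsi_buf.h, and the SKETCH_ constants, the result codes, and the function name are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the fields discussed above. This is NOT the real
 * scsi_buf layout; all names carrying "sketch" are hypothetical. */
#define SKETCH_SCSI_VERSION_1 1

struct scsi_buf_sketch {
    void     *autosense_buffer_ptr; /* device driver's sense buffer */
    uint32_t  autosense_length;     /* length of that buffer in bytes */
    uint32_t  dev_burst_len;        /* 0 for most operations */
    uint64_t  scsi_id;              /* 64-bit SCSI ID */
    uint64_t  lun_id;               /* 64-bit LUN ID */
    void     *kernext_handle;       /* value returned by SCIOLSTART */
    int       version;              /* structure version */
};

enum sketch_check {
    SKETCH_OK = 0,
    SKETCH_NO_AUTOSENSE,  /* autosense data would be lost */
    SKETCH_NO_HANDLE,     /* kernext_handle missing with dynamic tracking */
    SKETCH_BAD_VERSION    /* version field not set as expected */
};

/* Check the fields a device driver is expected to fill in before it hands
 * the structure to the adapter driver. */
enum sketch_check sketch_check_scsi_buf(const struct scsi_buf_sketch *sb,
                                        int dynamic_tracking)
{
    if (sb->autosense_buffer_ptr == NULL || sb->autosense_length == 0)
        return SKETCH_NO_AUTOSENSE;
    if (dynamic_tracking && sb->kernext_handle == NULL)
        return SKETCH_NO_HANDLE;
    if (sb->version != SKETCH_SCSI_VERSION_1)
        return SKETCH_BAD_VERSION;
    return SKETCH_OK;
}
```

A real device driver would perform equivalent checks on the actual structure and would also verify that scsi_id and lun_id identify a started device.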
example, if this is a nonshared device). If the caller attempts to initiate this system call without the proper
authority, the device driver should return a value of -1, with the errno global variable set to a value of
EPERM.
The SC_DIAGNOSTIC option gives the caller an exclusive open to the selected device. This option
requires appropriate authority to run. If the caller attempts to execute this system call without the proper
authority, the device driver should return a value of -1, with the errno global variable set to a value of
EPERM. The SC_DIAGNOSTIC option may be executed only if the device is not already opened for normal
operation. If this ioctl operation is attempted when the device is already opened, or if an openx call with
the SC_DIAGNOSTIC option is already in progress, a return value of -1 should be passed, with the errno
global variable set to a value of EACCES. Once successfully opened with the SC_DIAGNOSTIC flag, the
device driver is placed in Diagnostic mode for the selected device.
Once successfully opened, the device is placed in Exclusive Access mode. If another caller tries to do any
type of open, a return value of -1 is passed, with the errno global variable set to a value of EACCES.
The remaining options for the ext parameter are reserved for future requirements.
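The SC_DIAGNOSTIC rules described above can be sketched as a small decision routine. The flag value, the state structure, and the function name below are hypothetical, and the order in which authority and open state are checked is an assumption, since the text does not specify it. The sketch returns the errno value directly, whereas the real open routine returns -1 and sets errno.

```c
#include <errno.h>

#define SKETCH_SC_DIAGNOSTIC 0x1 /* hypothetical bit value, not from a header */

struct sketch_dev_state {
    int open_count;      /* normal opens currently active */
    int diagnostic_open; /* nonzero while held in Diagnostic mode */
};

/* Returns 0 when the open may proceed, otherwise the errno value that the
 * device driver would report to the caller. */
int sketch_open_check(struct sketch_dev_state *dev, int ext, int has_authority)
{
    /* Exclusive access: any open is refused while a diagnostic open holds
     * the device. */
    if (dev->diagnostic_open)
        return EACCES;

    if (ext & SKETCH_SC_DIAGNOSTIC) {
        if (!has_authority)
            return EPERM;   /* caller lacks the required authority */
        if (dev->open_count > 0)
            return EACCES;  /* already opened for normal operation */
        dev->diagnostic_open = 1;
        return 0;
    }

    dev->open_count++;
    return 0;
}
```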
The following table shows how the various combinations of ext options should be handled in the device
driver.
Closing the Device
When a device driver is preparing to close a device through the adapter device driver, it must ensure that
all transactions are complete. When the adapter device driver receives a SCIOLSTOP ioctl operation and
there are pending I/O requests, the ioctl operation does not return until all have completed. New requests
received during this time are rejected from the adapter device driver’s ddstrategy routine.
Error Processing
It is the responsibility of the device driver to process check conditions and other returned errors properly.
The adapter device driver only passes commands without otherwise processing them and is not
responsible for device error recovery.
The actual DMA transfer goes to a dummy buffer inside the adapter device driver and then is block-copied
to the destination buffer. Internal device driver operations that typically have small data-transfer phases are
control-type commands, such as Mode select, Mode sense, and Request sense. However, this discussion
applies to any command received by the adapter device driver that has a data-phase size of 256 bytes or
less.
Internal commands with data phases larger than 256 bytes require the device driver to allocate specifically
the required memory on the process level. The memory pages containing this memory cannot be
accessed in any way by the CPU (that is, the device driver) from the time the transaction is passed to the
adapter device driver until the device driver receives the iodone call for the transaction.
Internally, the devsw table has entry points for the ddconfig, ddopen, ddclose, dddump, ddioctl, and
ddstrat routines. The device drivers pass their commands to the adapter device driver by calling the
adapter device driver ddstrat routine. (This routine is unavailable to other operating system programs due
to the lack of a block-device special file.)
Access to the adapter device driver’s ddconfig, ddopen, ddclose, dddump, ddioctl, and ddstrat entry
points by the device drivers is performed through the kernel services provided. These include such
services as fp_opendev, fp_close, fp_ioctl, devdump, and devstrat.
Performing Dumps
An adapter device driver must have a dddump entry point if it is used to access a system dump device. A
device driver must have a dddump entry point if it drives a dump device. Examples of dump devices are
disks and tapes.
Note: Adapter-device-driver writers should be aware that system services providing interrupt and
timer services are unavailable for use in the dump routine. Kernel DMA services are assumed to be
The DUMPQUERY option should return a minimum transfer size of 0 bytes, and a maximum transfer size
equal to the maximum transfer size supported by the adapter device driver.
Calls to the adapter device driver DUMPWRITE option should use the arg parameter as a pointer to the
scsi_buf structure to be processed. Using this interface, a write command can be executed on a
previously started (opened) target device. The uiop parameter is ignored by the adapter device driver
during the DUMPWRITE command. Spanned, or consolidated, commands are not supported using the
DUMPWRITE option. Gathered write commands are also not supported using the DUMPWRITE option.
No queuing of scsi_buf structures is supported during dump processing because the dump routine runs
essentially as a subroutine call from the caller’s dump routine. Control is returned when the entire
scsi_buf structure has been processed.
Note: Also, both adapter-device-driver and device-driver writers should be aware that any error occurring
during the DUMPWRITE option is considered unsuccessful. Therefore, no error recovery is
employed during the DUMPWRITE. Return values from the call to the dddump routine indicate the
specific nature of the failure.
Successful completion of the selected operation is indicated by a 0 return value to the subroutine.
Unsuccessful completion is indicated by a return code set to one of the following values for the errno
global variable. The various scsi_buf status fields, including the b_error field, are not set by the adapter
device driver at completion of the DUMPWRITE command. Error logging is, of necessity, not supported
during the dump.
v An errno value of EINVAL indicates that a request that was not valid was passed to the adapter device
driver, such as an attempt to run a DUMPSTART command before successfully executing a DUMPINIT command.
v An errno value of EIO indicates that the adapter device driver was unable to complete the command
due to a lack of required resources or an I/O error.
v An errno value of ETIMEDOUT indicates that the adapter did not respond with completion status before
the passed command time-out value expired.
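Because no error recovery is attempted during a dump, a caller only needs to classify the value returned by the dddump routine. The following is a minimal sketch of that classification; the enum and function names are hypothetical and are not part of any AIX header.

```c
#include <errno.h>

/* Hypothetical classification of a dddump return code. */
enum sketch_dump_result {
    SKETCH_DUMP_OK,              /* 0: operation completed */
    SKETCH_DUMP_BAD_REQUEST,     /* EINVAL: request was not valid */
    SKETCH_DUMP_IO_OR_RESOURCES, /* EIO: I/O error or lack of resources */
    SKETCH_DUMP_TIMED_OUT,       /* ETIMEDOUT: adapter did not respond */
    SKETCH_DUMP_OTHER            /* any value not listed in the text */
};

enum sketch_dump_result sketch_classify_dump_rc(int rc)
{
    switch (rc) {
    case 0:
        return SKETCH_DUMP_OK;
    case EINVAL:
        return SKETCH_DUMP_BAD_REQUEST;
    case EIO:
        return SKETCH_DUMP_IO_OR_RESOURCES;
    case ETIMEDOUT:
        return SKETCH_DUMP_TIMED_OUT;
    default:
        return SKETCH_DUMP_OTHER;
    }
}
```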
Required FCP, iSCSI, and Virtual SCSI Client Adapter Device Driver
ioctl Commands
Various ioctl operations must be performed for proper operation of the adapter device driver. The ioctl
operations described here are the minimum set of commands the adapter device driver must implement to
support device drivers. Other operations might be required in the adapter device driver to support, for
example, system management facilities and diagnostics. Device driver writers also need to understand
these ioctl operations.
Every adapter device driver must support the IOCINFO ioctl operation. The structure to be returned to the
caller is the devinfo structure, including the union definition for the adapter, which can be found in the
/usr/include/sys/devinfo.h file. The device driver should request the IOCINFO ioctl operation (probably
during its open routine) to get the maximum transfer size of the adapter.
Note: The adapter device driver ioctl operations can only be called from the process level. They cannot
be executed from a call on any more favored priority levels. Attempting to call them from a more
favored priority level can result in a system crash.
commands, usually after signal processing by the device driver. This might be used by a device driver to
end an operation instead of waiting for completion or a time out. The SCIOLRESET operation is provided
for clearing device hard errors and competing initiator reservations during open processing by the device
driver.
For more information on these ioctl operations, see “FCP, iSCSI, and Virtual SCSI Client Adapter ioctl
Operations” on page 259.
Only a kernel process or device driver can invoke these ioctls. If attempted by a user process, the ioctl will
fail, and the errno global variable will be set to EPERM.
The event registration performed by this ioctl operation is allowed once per device session. Only the first
SCIOLEVENT ioctl operation is accepted after the device session is opened. Succeeding SCIOLEVENT
ioctl operations will fail, and the errno global variable will be set to EINVAL. The event registration is
canceled automatically when the device session is closed.
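The once-per-session rule can be pictured as a small state machine. The structure and function names below are hypothetical, and returning EINVAL for a session that is not open is an assumption that the text does not confirm.

```c
#include <errno.h>

/* Hypothetical per-device session state kept by an adapter driver. */
struct sketch_session {
    int is_open;
    int event_registered;
};

void sketch_session_open(struct sketch_session *s)
{
    s->is_open = 1;
    s->event_registered = 0;
}

/* Models the SCIOLEVENT rule: only the first registration after the
 * session is opened is accepted. Returns 0 or an errno value. */
int sketch_register_event(struct sketch_session *s)
{
    if (!s->is_open || s->event_registered)
        return EINVAL; /* EINVAL for a closed session is an assumption */
    s->event_registered = 1;
    return 0;
}

/* Closing the session cancels the registration automatically. */
void sketch_session_close(struct sketch_session *s)
{
    s->is_open = 0;
    s->event_registered = 0;
}
```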
The arg parameter to the SCIOLEVENT ioctl operation should be set to the address of an
scsi_event_struct structure, which is defined in the /usr/include/sys/scsi_buf.h file. The following
parameters are supported:
scsi_id
The caller sets id to the SCSI ID or SCSI ID alias of the attached target device for initiator-mode.
For target-mode, the caller sets the id to the SCSI ID or SCSI ID alias of the attached initiator
device.
lun_id The caller sets the lun field to the LUN of the attached target device for initiator-mode. For
target-mode, the caller sets the lun field to 0.
mode Identifies whether the initiator-mode or target-mode device is being registered. These values are
possible:
SC_IM_MODE
This is an initiator-mode device.
The following values for the errno global variable are supported:
Related Information
Logical File System Kernel Services.
scdisk SCSI Device Driver in AIX 5L Version 5.3 Technical Reference: Kernel and Subsystems Volume 2.
Chapter 14. Integrated Device Electronics (IDE) Subsystem
This overview describes the interface between an Integrated Device Electronics (IDE) device driver and an
IDE adapter device driver. It is directed toward those designing and writing an IDE device driver that
interfaces with an existing IDE adapter device driver. It is also meant for those designing and writing an
IDE adapter device driver that interfaces with existing IDE device drivers.
This section frequently refers to both an IDE device driver and an IDE adapter device driver. These two
distinct device drivers work together in a layered approach to support attachment of a range of IDE
devices. The IDE adapter device driver is the lower device driver of the pair, and the IDE device driver is
the upper device driver.
The IDE adapter device driver manages the IDE bus, but not the IDE devices. It can send and receive IDE
commands, but it cannot interpret the contents of the command. The lower driver also provides recovery
and logging for errors related to the IDE bus and system I/O hardware. Management of the device
specifics is left to the IDE device driver. The interface of the two drivers allows the upper driver to
communicate with different IDE bus adapters without requiring special code paths for each adapter.
The IDE device driver also provides command retries and logging for errors related to the IDE device it
controls.
The operating system provides several kernel services allowing the IDE device driver to communicate with
IDE adapter device driver entry points without having the actual name or address of those entry points.
See “Logical File System Kernel Services” on page 66 for more information.
The IDE adapter driver should never retry an IDE command on error after the command has successfully
been given to the adapter. The consequences for the adapter driver retrying an IDE command at this point
range from minimal to catastrophic, depending upon the type of device. Commands for certain devices
cannot be retried immediately after a failure (for example, tapes and other sequential access devices). If
such an error occurs, the failed command returns an appropriate error status with an iodone call to the
IDE device driver for error recovery. Only the IDE device driver that originally issued the command knows
if the command can be retried on the device. The IDE adapter driver must only retry commands that were
never successfully transferred to the adapter. In this case, if retries are successful, the ataide_buf status
should not reflect an error. However, the IDE adapter driver should perform error logging on the retried
condition.
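The retry rule above reduces to a single decision: the adapter driver may retry only a command that never reached the adapter. A minimal sketch follows; the enum, the function name, and the retries_left budget are hypothetical.

```c
/* Hypothetical outcome of the adapter driver's error handling. */
enum sketch_action {
    SKETCH_RETRY_IN_ADAPTER_DRIVER, /* command never reached the adapter */
    SKETCH_RETURN_TO_DEVICE_DRIVER  /* report status through iodone */
};

/* given_to_adapter: nonzero once the command was successfully handed to
 * the adapter. retries_left: assumed retry budget kept by the adapter
 * driver. */
enum sketch_action sketch_on_error(int given_to_adapter, int retries_left)
{
    if (!given_to_adapter && retries_left > 0)
        return SKETCH_RETRY_IN_ADAPTER_DRIVER;
    /* Only the IDE device driver knows whether the command can be retried
     * on the device, so every other case is returned to it. */
    return SKETCH_RETURN_TO_DEVICE_DRIVER;
}
```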
Internal commands differ from operating system-initiated transactions in several ways. The primary
difference is that the IDE device driver is required to generate a struct buf that is not related to a specific
request. Also, the actual IDE commands are typically more control oriented than data transfer related.
The only special requirement for commands is that the IDE device driver must have pinned the transfer
data buffers. However, due to system hardware considerations, additional precautions must be taken for
data transfers into system memory pages. The problem is that any system memory area with a DMA data
operation in progress causes the entire memory page that contains it to become inaccessible.
As a result, an IDE device driver that initiates an internal command must have preallocated and pinned an
area of some multiple of system page size. The driver must not place in this area any other data that it
might need to access while I/O is being performed into or out of that page. The device driver must not
access the allocated memory pages from the moment the transaction is passed to the adapter driver until
the device driver iodone routine is called for the transaction.
To enhance IDE bus performance, the IDE device driver should consolidate multiple queued requests
when possible into a single IDE command. So that the IDE adapter driver can handle the required scatter
and gather operations, the ataide_buf.bp field should always point to the first buf structure entry for
the spanned transaction. A null-terminated list of additional struct buf entries should be chained from the
first field through the buf.av_forw field to give the IDE adapter driver enough information to perform the
DMA scatter and gather operations required. This information must include at least the buffer’s starting
address, length, and cross-memory descriptor.
The spanned requests should always be for requests in either the read or write direction but not both,
because the IDE adapter driver must be given a single IDE command to handle the requests. The
spanned request should always consist of complete I/O requests (including the additional struct buf
entries). The IDE device driver should not attempt to use partial requests to reach the maximum transfer
size.
The maximum transfer size is actually adapter-dependent. The IOCINFO ioctl operation can be used to
discover the IDE adapter driver’s maximum allowable transfer size. If a transfer size larger than the
supported maximum is attempted, the IDE adapter driver returns a value of EINVAL in the
ataide_buf.bufstruct.b_error field.
Due to system hardware requirements, the IDE device driver must consolidate only commands that are
memory page-aligned at both their starting and ending addresses. Specifically, this applies to the
consolidation of memory buffers. The ending address of the first buffer and the starting address of all
subsequent buffers should be memory page-aligned. However, the starting address of the first memory
buffer and the ending address of the last do not need to be aligned.
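The alignment rule can be sketched as a walk over the av_forw chain. struct sketch_buf is a simplified stand-in for struct buf, the 4096-byte page size is an assumption, and the function name is hypothetical. The sketch reads the rule as requiring every buffer except the last to end on a page boundary and every buffer except the first to start on one.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed system page size */

/* Simplified stand-in for struct buf; only the chain link is named after
 * the real field (av_forw). */
struct sketch_buf {
    uintptr_t          addr;    /* buffer starting address */
    size_t             count;   /* buffer length in bytes */
    struct sketch_buf *av_forw; /* next entry, NULL at the end */
};

/* Returns 1 if the chain satisfies the page-alignment rule for
 * consolidation, 0 otherwise. */
int sketch_can_consolidate(const struct sketch_buf *first)
{
    const struct sketch_buf *bp;

    for (bp = first; bp != NULL; bp = bp->av_forw) {
        int is_first = (bp == first);
        int is_last = (bp->av_forw == NULL);

        /* Every buffer after the first must start on a page boundary. */
        if (!is_first && bp->addr % SKETCH_PAGE_SIZE != 0)
            return 0;
        /* Every buffer before the last must end on a page boundary. */
        if (!is_last && (bp->addr + bp->count) % SKETCH_PAGE_SIZE != 0)
            return 0;
    }
    return 1;
}
```

A real device driver would also confirm that all entries transfer in the same direction and that the total stays within the adapter's maximum transfer size.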
The purpose of consolidating transactions is to decrease the number of IDE commands and bus phases
required to perform the required operation. The time required to maintain the simple chain of buf structure
entries is significantly less than the overhead of multiple (even two) IDE bus transactions.
Fragmented Commands
Single I/O requests larger than the maximum transfer size must be divided into smaller requests by the
IDE device driver. For calls to an IDE device driver’s character I/O (read/write) entry points, the uphysio
kernel service can be used to break up these requests. For a fragmented command such as this, the
ataide_buf.bp field should be NULL so that the IDE adapter driver uses only the information in the
ataide_buf structure to prepare for the DMA operation.
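The arithmetic behind fragmenting a request can be sketched as follows. This only illustrates how a large request divides into pieces no larger than the maximum transfer size; it does not model the uphysio kernel service, and the function names are hypothetical.

```c
#include <stddef.h>

/* Number of fragments needed to move total bytes when no single transfer
 * may exceed max_xfer bytes. Returns 0 if max_xfer is 0. */
size_t sketch_fragment_count(size_t total, size_t max_xfer)
{
    if (max_xfer == 0)
        return 0;
    return total / max_xfer + (total % max_xfer != 0 ? 1 : 0);
}

/* Length in bytes of fragment number index (counting from 0), or 0 if the
 * index is past the last fragment. */
size_t sketch_fragment_len(size_t total, size_t max_xfer, size_t index)
{
    size_t offset;

    if (index >= sketch_fragment_count(total, max_xfer))
        return 0;
    offset = index * max_xfer;
    /* Every fragment is full-sized except possibly the last one. */
    return (total - offset < max_xfer) ? total - offset : max_xfer;
}
```

In practice the max_xfer value would come from the adapter's IOCINFO response.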
ataide_buf Structure
The ataide_buf structure is used for communication between the IDE device driver and the IDE adapter
driver during an initiator I/O request. This structure is passed to and from the strategy routine in the same
way a standard driver uses a struct buf structure.
If an error is detected while an IDE command is being processed, and the error prevented the IDE
command from actually being sent to the IDE bus by the adapter, then the error should be processed or
recovered, or both, by the IDE adapter driver.
If it is recovered successfully by the IDE adapter driver, the error is logged, as appropriate, but is not
reflected in the ata.errval byte. If the error cannot be recovered by the IDE adapter driver, the
appropriate ata.errval bit is set and the ataide_buf structure is returned to the IDE device driver for
further processing.
If an error is detected after the command was actually sent to the IDE device, then the adapter driver will
return the command to the device driver for error processing and possible retries.
For error logging, the IDE adapter driver logs IDE bus- and adapter-related conditions, whereas the IDE
device driver logs IDE device-related errors. In the following description, a capital letter "A" after the error
name indicates that the IDE adapter driver handles error logging. A capital letter "H" indicates that the IDE
device driver handles error logging.
Some of the following error conditions indicate an IDE device failure. Others are IDE bus- or
adapter-related.
ATA_IDE_DMA_ERROR (A)
The system I/O bus generated or detected an error during a DMA transfer.
ATA_ERROR_VALID (H)
The request sent to the device failed.
ATA_CMD_TIMEOUT (A) (H)
The command timed out before completion.
ATA_NO_DEVICE_RESPONSE (A)
The target device did not respond.
Internally, the devsw table has entry points for the ddconfig, ddopen, ddclose, dddump, ddioctl, and
ddstrategy routines. The IDE device drivers pass their IDE commands to the IDE adapter driver by calling
the IDE adapter driver ddstrategy routine. (This routine is unavailable to other operating system programs
due to the lack of a block-device special file.)
Access to the IDE adapter driver’s ddconfig, ddopen, ddclose, dddump, ddioctl, and ddstrategy entry
points by the IDE device drivers is performed through the kernel services provided. These include such
kernel services as fp_opendev, fp_close, fp_ioctl, devdump, and devstrat.
The DUMPQUERY option should return a minimum transfer size of 0 bytes, and a maximum transfer size
equal to the maximum transfer size supported by the IDE adapter driver.
Calls to the IDE adapter driver DUMPWRITE option should use the arg parameter as a pointer to the
ataide_buf structure to be processed. Using this interface, an IDE write command can be executed on a
previously started (opened) target device. The uiop parameter is ignored by the IDE adapter driver during
the DUMPWRITE command. Spanned or consolidated commands are not supported using the
DUMPWRITE option. Gathered write commands are also not supported using the DUMPWRITE option. No
queuing of ataide_buf structures is supported during dump processing because the dump routine runs
essentially as a subroutine call from the caller’s dump routine. Control is returned when the entire
ataide_buf structure has been processed.
Note: No error recovery techniques are used during the DUMPWRITE option because any error occurring
during DUMPWRITE is a real problem as the system is already unstable. Return values from the
call to the dddump routine indicate the specific nature of the failure.
Successful completion of the selected operation is indicated by a 0 return value to the subroutine.
Unsuccessful completion is indicated by a return code set to one of the following values for the errno
global variable. The various ataide_buf status fields, including the b_error field, are not set by the IDE
adapter driver at completion of the DUMPWRITE command. Error logging is, of necessity, not supported
during the dump.
v An errno value of EINVAL indicates that an invalid request (unknown command or bad parameter) was
passed to the IDE adapter driver, such as an attempt to run a DUMPSTART command before successfully
executing a DUMPINIT command.
v An errno value of EIO indicates that the IDE adapter driver was unable to complete the command due
to a lack of required resources or an I/O error.
v An errno value of ETIMEDOUT indicates that the adapter did not respond to a command that was put
in its register before the passed command time-out value expired.
Every IDE adapter driver must support the IOCINFO ioctl operation. The structure to be returned to the
caller is the devinfo structure, including the ide union definition for the IDE adapter found in the
/usr/include/sys/devinfo.h file. The IDE device driver should request the IOCINFO ioctl operation
(probably during its open routine) to get the maximum transfer size of the adapter.
Note: The IDE adapter driver ioctl operations can only be called from the process level. They cannot be
executed from a call on any more favored priority levels. Attempting to call them from a more
favored priority level can result in a system crash.
Except where noted otherwise, the arg parameter for each of the ioctl operations described here must
contain a long integer. In this field, the least significant byte is the IDE device ID value. (The upper three
bytes are reserved and should be set to 0.) This provides the information required to allocate or deallocate
resources and perform IDE bus operations for the ioctl operation requested.
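The arg encoding described above can be sketched with three small helpers. The function names are hypothetical. The validity check treats every bit above the least significant byte as reserved, which matches the "upper three bytes" wording when a long integer is 32 bits wide.

```c
/* Build the arg value for an IDE adapter ioctl: the device ID goes in the
 * least significant byte and all other bits are 0. */
long sketch_ide_arg_from_id(unsigned char device_id)
{
    return (long)device_id;
}

/* Nonzero if the reserved bits of arg are all 0. */
int sketch_ide_arg_is_valid(long arg)
{
    return (arg & ~0xFFL) == 0;
}

/* Extract the device ID from the least significant byte of arg. */
unsigned char sketch_ide_arg_to_id(long arg)
{
    return (unsigned char)(arg & 0xFFL);
}
```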
IDEIOSTOP
This operation deallocates resources local to the IDE adapter driver for this IDE device. This
should be run on the last close of an IDE device. If an IDEIOSTART operation has not been
previously issued, this command is unsuccessful.
For more information, see IDEIOSTOP (Stop) IDE Adapter Device Driver ioctl Operation in AIX 5L
Version 5.3 Technical Reference: Kernel and Subsystems Volume 1.
IDEIORESET
This operation causes the IDE adapter driver to send an ATAPI device reset to the specified IDE
device ID.
The IDE device driver should use this command only when directed to do a forced open. This
occurs when the device needs to be reset to clear an error condition.
Note: In normal system operation, this command should not be issued, as it would reset all
devices connected to the controller. If an IDEIOSTART operation has not been previously
issued, this command is unsuccessful.
IDEIOINQU
This operation allows the caller to issue an IDE device inquiry command to a selected device.
For more information, see IDEIOINQU (Inquiry) IDE Adapter Device Driver ioctl Operation in AIX
5L Version 5.3 Technical Reference: Kernel and Subsystems Volume 1.
IDEIOSTUNIT
This operation allows the caller to issue an IDE Start Unit command to a selected IDE device. For
the IDEIOSTUNIT operation, the arg parameter is the address of an ide_startunit
structure. This structure is defined in the /usr/include/sys/ide.h file.
For more information, see IDEIOSTUNIT (Start Unit) IDE Adapter Device Driver ioctl Operation in
AIX 5L Version 5.3 Technical Reference: Kernel and Subsystems Volume 1.
IDEIOTUR
This operation allows the caller to issue an IDE Test Unit Ready command to a selected IDE
device.
For more information, see IDEIOTUR (Test Unit Ready) IDE Adapter Device Driver ioctl Operation
in AIX 5L Version 5.3 Technical Reference: Kernel and Subsystems Volume 1.
Related Information
Logical File System Kernel Services
Technical References
The ddconfig, ddopen, ddclose, dddump, ddioctl, ddread, ddstrategy, ddwrite entry points in AIX 5L
Version 5.3 Technical Reference: Kernel and Subsystems Volume 2.
The fp_opendev, fp_close, fp_ioctl, devdump, devstrat kernel services in AIX 5L Version 5.3 Technical
Reference: Kernel and Subsystems Volume 2.
IDE Adapter Device Driver, idecdrom IDE Device Driver, idedisk IDE Device Driver, IDEIOIDENT (Identify
Device) IDE Adapter Device Driver ioctl Operation, IDEIOINQU (Inquiry) IDE Adapter Device Driver ioctl
Operation, IDEIOREAD (Read) IDE Adapter Device Driver ioctl Operation, IDEIOSTART (Start IDE)
Adapter Device Driver ioctl Operation, IDEIOSTOP (Stop) Device IDE Adapter Device Driver ioctl
Operation, IDEIOSTUNIT (Start Unit) IDE Adapter Device Driver ioctl Operation, and IDEIOTUR (Test Unit
Ready) IDE Adapter Device Driver ioctl Operation in AIX 5L Version 5.3 Technical Reference: Kernel and
Subsystems Volume 2.
In contrast, with direct access, entering and retrieving information depends only on the location of the data
and not on a reference to data previously accessed. Because of this, access time for information on direct
access storage devices (DASDs) is effectively independent of the location of the data.
Direct access storage devices (DASDs) include both fixed and removable storage devices. Typically, these
devices are hard disks. A fixed storage device is any storage device defined during system configuration to
be an integral part of the system DASD. If a fixed storage device is not available at some time during
normal operation, the operating system detects an error.
A removable storage device is any storage device you define during system configuration to be an optional
part of the system DASD. Removable storage devices can be removed from the system at any time during
normal operation. As long as the device is logically unmounted before you remove it, the operating system
does not detect an error.
The following types of devices are not considered DASD and are not supported by the logical volume
manager (LVM):
v Diskettes
v CD-ROM (compact disk read-only memory)
v DVD-ROM (DVD read-only memory)
v WORM (write-once read-mostly)
By using direct access storage, you can quickly retrieve information from random addresses as a stream
of one or more blocks. Many DASDs perform best when the blocks to be retrieved are close in physical
address to each other.
A DASD consists of a set of flat, circular rotating platters. Each platter has one or two sides on which data
is stored. Platters are read by a set of nonrotating, but positionable, read or read/write heads that move
together as a unit.
The following terms are used when discussing DASD device block operations:
sector An addressable subdivision of a track used to record one block of a program or data. On a DASD,
this is a contiguous, fixed-size block. Every sector of every DASD is exactly 512 bytes.
track A circular path on the surface of a disk on which information is recorded and from which recorded
information is read; a contiguous set of sectors. A track corresponds to the surface area of a single
platter swept out by a single head while the head remains stationary.
A DASD contains at least 17 sectors per track. Otherwise, the number of sectors per track is not
defined architecturally and is device-dependent. A typical DASD track can contain 17, 35, or 75
sectors.
A DASD can contain 1024 tracks. The number of tracks per DASD is not defined architecturally and
is device-dependent.
There must be at least 43 heads on a DASD. Otherwise, the number is not defined architecturally
and is device-dependent. A typical DASD has 8 heads.
cylinder The tracks of a DASD that can be accessed without repositioning the heads. If a DASD has n
number of vertically aligned heads, a cylinder has n number of vertically aligned tracks.
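The terms above combine into a simple capacity calculation: each cylinder holds one track per head, each track holds a number of sectors, and each sector is 512 bytes. The following sketch shows that arithmetic; the function and constant names are hypothetical, and the geometry values passed in are device-dependent.

```c
#include <stdint.h>

#define SKETCH_SECTOR_BYTES 512u /* sector size stated in the text */

/* Capacity in bytes of a DASD with the given geometry. No overflow check
 * is performed; the inputs are assumed to describe a realistic device. */
uint64_t sketch_dasd_bytes(uint32_t cylinders, uint32_t heads,
                           uint32_t sectors_per_track)
{
    /* A cylinder contains one track per head. */
    uint64_t tracks = (uint64_t)cylinders * heads;

    return tracks * sectors_per_track * SKETCH_SECTOR_BYTES;
}
```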
Related Information
Programming in the Kernel Environment Overview
If your system stops with an 888 number flashing in the operator panel display, the system has generated
a dump and saved it to a dump device.
In AIX Version 4, some of the error log and dump commands are delivered in an optionally installable
package called bos.sysmgt.serv_aid. System dump commands included in the bos.sysmgt.serv_aid
include the sysdumpstart command. See the Software Service Aids Package for more information.
When you install the operating system, the dump device is automatically configured for you. By default, the
primary device is /dev/hd6, which is a paging logical volume, and the secondary device is
/dev/sysdumpnull.
Note: If your system has 4 GB or more of memory, then the default dump device is /dev/lg_dumplv, and
is a dedicated dump device.
If a dump occurs to paging space, the system will automatically copy the dump when the system is
rebooted. By default, the dump is copied to the /var/adm/ras directory in the root volume group. See the
sysdumpdev command for details on how to control dump copying.
Starting with AIX 5.1, the dumpcheck facility will notify you if your dump device needs to be larger, or if the
file system containing the copy directory is too small. It will also automatically turn compression on if this
will alleviate these conditions. This notification appears in the system error log. If you need to increase the
size of your dump device, refer to “Increasing the Size of a Dump Device” on page 320 in this section.
For maximum effectiveness, dumpcheck should be run when the system is most heavily loaded. At such
times, the system dump is most likely to be at its maximum size. Also, even with dumpcheck watching the
dump size, it may still happen that the dump won’t fit on the dump device or in the copy directory at the
time it happens. This could occur if there is a peak in system load right at dump time.
Before AIX 5.1, use the dmp_add kernel service. For more information, see dmp_add Kernel Service in
AIX 5L Version 5.3 Technical Reference: Kernel and Subsystems Volume 1.
A user-initiated dump is different from a dump initiated by an unexpected system halt because the user
can designate which dump device to use. When the system halts unexpectedly, a system dump is initiated
automatically to the primary dump device.
You can start a system dump by using one of the methods listed below.
You have access to the sysdumpstart command and can start a dump using one of these methods:
v Using the Command Line
v Using SMIT
v Using the Reset Button
v Using Special Key Sequences
Note: You must have root user authority to start a dump by using the sysdumpstart command.
Using SMIT
Use the following SMIT commands to choose a dump device and start the system dump:
Note: You must have root user authority to start a dump using SMIT. SMIT uses the sysdumpstart
command to start a system dump.
1. Check which dump device is appropriate for your system (the primary or secondary device) by using
the following SMIT fast path command:
smit dump
2. Choose the Show Current Dump Devices option and write the available devices on notepaper.
3. Enter the following SMIT fast path command again:
smit dump
4. Choose either the primary (the first example option) or secondary (the second example option) dump
device to hold your dump information:
Start a Dump to the Primary Dump Device
OR
Start a Dump to the Secondary Dump Device
Base your decision on the list of devices you made in step 2.
5. Refer to “Checking the Status of a System Dump” on page 316 if a value shows in the operator panel
display. If the operator panel display is blank, the dump was not started. Try again using the Reset
button.
Note: To start a dump with the reset button or a key sequence you must have the key switch, or mode
switch, in the Service position, or have set the Always Allow System Dump value to true. To do
this:
a. Use the following SMIT fast path command:
smit dump
b. Set the Always Allow System Dump value to true. This is essential on systems that do not
have a mode switch.
Start a system dump with the Reset button by doing the following (this procedure works for all system
configurations and will work in circumstances where other methods for starting a dump will not work):
1. If your machine has a key mode switch, do one of the following:
v Turn the key mode switch to the Service position.
v Set Always Allow System Dump to true.
Your system writes the dump information to the primary dump device.
Note: The procedure for using the reset button can vary, depending upon your hardware configuration.
Note: You can start a system dump by this method only on the native keyboard.
You can check whether the dump was successful, and if not, what caused the dump to fail. If a 0cx is
displayed, see “Status Codes” below.
Note: If the dump fails and upon reboot you see an error log entry with the label DSI_PROC or ISI_PROC,
and the Detailed Data area shows an EXVAL of 000 0005, this is probably a paging space I/O error.
If the paging space (probably /dev/hd6) is the dump device or on the same hard drive as the dump
device, your dump may have failed due to a problem with that hard drive. You should run
diagnostics against that disk.
Status Codes
Find your status code in the following list, and follow the instructions:
000 The kernel debugger is started. If there is an ASCII terminal attached to one of the native serial ports, enter q
dump at the debugger prompt (>) on that terminal and then wait for flashing 888s to appear in the operator
panel display. After the flashing 888 appears, go to “Checking the Status of a System Dump.”
0c0 The dump completed successfully. Go to “Copying a System Dump” on page 317.
0c1 An I/O error occurred during the dump. Go to “System Dump Facility” on page 313.
0c2 A user-requested dump is not finished. Wait at least 1 minute for the dump to complete and for the operator
panel display value to change. If the operator panel display value changes, find the new value on this list. If
the value does not change, then the dump did not complete due to an unexpected error.
0c4 The dump ran out of space. A partial dump was written to the dump device, but there is not enough space
on the dump device to contain the entire dump. To prevent this problem from occurring again, you must
increase the size of your dump media. Go to “Increase the Size of a Dump Device” on page 319.
0c5 The dump failed due to an internal error.
0c7 A network dump is in progress, and the host is waiting for the server to respond. The value in the operator
panel display should alternate between 0c7 and 0c2 or 0c9. If the value does not change, then the dump did
not complete due to an unexpected error.
0c8 The dump device has been disabled. The current system configuration does not designate a device for the
requested dump. Enter the sysdumpdev command to configure the dump device.
0c9 A dump started by the system did not complete. Wait at least 1 minute for the dump to complete and for the
operator panel display value to change. If the operator panel display value changes, find the new value on
the list. If the value does not change, then the dump did not complete due to an unexpected error.
Note: If you intend to use a tape to send a snap image to IBM for software support, the tape must be
in one of the following formats: 8mm, 2.3 GB capacity; 8mm, 5.0 GB capacity; or 4mm, 4.0 GB
capacity. Using other formats will prevent or delay software support from being able to examine the
contents.
There are two procedures for copying a system dump, depending on whether you’re using a dataless
workstation or a non-dataless machine:
v Copying a System Dump on a Dataless Workstation
v Copying a System Dump on a Non-Dataless Machine
Copy the System Dump from the Server: The dump is copied like any other file. To copy the dump to
tape, use the tar command:
tar -c
To copy the dump back from the external media (such as a tape drive), use the tar command. Enter the
following to copy the dump from /dev/rmt0:
tar -x
Reboot Your Machine: Reboot in Normal mode using the following steps:
1. Switch off the power on your machine.
2. Turn the mode switch to the Normal position.
3. Switch on the power on your machine.
If your system brings up the login prompt, go to “Copy a System Dump after Rebooting in Normal Mode.”
If your system stops with a number in the operator panel display instead of bringing up the login prompt,
reboot your machine from Maintenance mode, then go to “Copy a System Dump after Booting from
Maintenance Mode.”
Copy a System Dump after Rebooting in Normal Mode: After rebooting in Normal mode, copy a
system dump by doing the following:
1. Log in to your system as root user.
2. Copy the system dump to tape using the following snap command:
/usr/sbin/snap -gfkD -o /dev/rmt#
where # (pound sign) is the number of your available tape device (the most common is /dev/rmt0).
To find the correct number, enter the following lsdev command, and look for the tape device listed as
Available:
lsdev -C -c tape -H
Note: If your dump went to a paging space logical volume, it has been copied to a directory in your
root volume group, /var/adm/ras. See Configure a Dump Device and the sysdumpdev
command for more details. These dumps are still copied by the snap command. The
sysdumpdev -L command lists the exact location of the dump.
3. To copy the dump back from the external media (such as a tape drive), use the pax command. Enter
the following to copy the dump from /dev/rmt0:
pax -rf /dev/rmt0
To copy the dump from any other media, enter:
tar -xf tapedevice
Note: Use this procedure only if you cannot boot your machine in Normal mode.
When a system dump occurs, all of the kernel segment that resides in real memory is dumped (the kernel
segment is segment 0). Memory resident user data (such as u-blocks) are also dumped.
The minimum size for the dump space can best be determined using the sysdumpdev -e command. This
gives an estimated dump size taking into account the memory currently in use by the system. If dumps are
being compressed, then the estimate shown is for the compressed size of the dump, not the original size.
In general, compressed dump size estimates will be much higher than the actual size. This occurs
because of the unpredictability of the compression algorithm’s efficiency. You should still ensure your dump
device is large enough to hold the estimated size in order to avoid losing dump data.
If sysdumpdev -e returns the message, Estimated dump size in bytes: 9830400, then the dump device
should be at least 9830400 bytes or 12 MB (if you are using three 4 MB partitions for the disk).
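The rounding in this example can be sketched in C. This is a minimal illustration only:
partitions_needed and device_bytes_needed are hypothetical helper names, and the partition size is
assumed to be supplied by the caller (4 MB in the example above) rather than queried from the system.

```c
#include <stdint.h>

/* Number of whole partitions needed to hold dump_bytes.
 * partition_bytes is the partition size of the dump device's volume
 * group; it is a caller-supplied assumption, not a value read from
 * the system. */
static uint64_t partitions_needed(uint64_t dump_bytes, uint64_t partition_bytes)
{
    /* Integer ceiling division: round any partial partition up. */
    return (dump_bytes + partition_bytes - 1) / partition_bytes;
}

/* Minimum dump device size, in bytes, once rounded up to whole partitions. */
static uint64_t device_bytes_needed(uint64_t dump_bytes, uint64_t partition_bytes)
{
    return partitions_needed(dump_bytes, partition_bytes) * partition_bytes;
}
```

With the estimate of 9830400 bytes and 4 MB (4194304-byte) partitions, partitions_needed gives 3 and
device_bytes_needed gives 12582912 bytes, which is the 12 MB quoted above.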
Note: When a client dumps to a remote dump server, the dumps are stored as files on the server. For
example, the /export/dump/kakrafon/dump file will contain kakrafon’s dump. Therefore, the file
system used for the /export/dump/kakrafon directory must be large enough to hold the client
dumps.
Note: You can also determine the dump devices using SMIT. Select the Show Current Dump
Devices option from the System Dump SMIT menu.
2. Determine your logical volume type by using SMIT. Enter the SMIT fast path smit lvm or smitty lvm.
You will go directly to Logical Volumes. Select the List all Logical Volumes by Volume Group option.
Find your dump volume in the list and note its Type (in the second column). For example, this might be
paging in the case of hd6 or sysdump in the case of hd7.
If you have confirmed that your dump device type is sysdump, refer to the extendlv command for more
information.
Error Logging
The error facility records device-driver entries in the system error log. These error log entries record any
software or hardware failures that need to be available either for informational purposes or for fault
detection and corrective action. The device driver, using the errsave kernel service, adds error records to
the /dev/error special file.
The errdemon daemon picks up the error record and creates an error log entry. When you access the
error log either through SMIT (System Management Interface Tool) or with the errpt command, the error
record is formatted according to the error template in the error template repository and presented in either
a summary or detailed report.
Before initiating the error logging process, determine what services are available to developers, and what
services are available to the customer, service personnel, and defect personnel.
v Determine the Importance of the Error: Use system resources for logging only information that is
important or helpful to the intended audience. Work with the hardware developer, if possible, to identify
detectable errors and the information that should be relayed concerning those errors.
v Determine the Text of the Message: Use regular national language support (NLS) XPG/4 messages
instead of the codepoints. For more information about NLS messages, see Message Facility in AIX 5L
Version 5.3 National Language Support Guide and Reference.
v Determine the Correct Level of Thresholding: Each software or hardware error to be logged can be
limited by thresholding to avoid filling the error log with duplicate information. Side effects of runaway
error logging include overwriting existing error log entries and unduly alarming the end user. The error
log is limited in size. When its size limit is reached, the log wraps. If a particular error is repeated
needlessly, existing information is overwritten, which might cause inaccurate diagnostic analyses. The
end user or service person can perceive a situation as more serious or pervasive than it is if they see
hundreds of identical or nearly identical error entries.
You are responsible for implementing the proper level of thresholding in the device driver code.
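One common way to implement thresholding is to allow only a fixed number of entries for a given error
within a time window and to count the rest as suppressed. The following C sketch shows that idea under
stated assumptions: struct err_threshold and err_should_log are hypothetical names, not part of any AIX
interface, and the current time is passed in by the caller so that the sketch does not depend on a
particular kernel time service.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-error threshold state kept by the driver. */
struct err_threshold {
    time_t   window_start;    /* start of the current window            */
    unsigned window_secs;     /* length of a window, in seconds         */
    unsigned max_per_window;  /* entries allowed in one window          */
    unsigned logged;          /* entries logged in the current window   */
    unsigned suppressed;      /* duplicates dropped in the current window */
};

/* Returns true if this occurrence should be written to the error log,
 * false if it should be suppressed as a duplicate. */
static bool err_should_log(struct err_threshold *t, time_t now)
{
    /* Start a new window once the current one has expired. */
    if (now - t->window_start >= (time_t)t->window_secs) {
        t->window_start = now;
        t->logged = 0;
        t->suppressed = 0;
    }
    if (t->logged < t->max_per_window) {
        t->logged++;
        return true;
    }
    t->suppressed++;
    return false;
}
```

A driver using this pattern would call err_should_log before building the error record and skip the
call to the error logging routine when it returns false. The suppressed count could be reported in the
detail data of the next entry that is logged, so that the reader still sees how often the error occurred.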
The default size of the error log is 1 MB. As shipped, it cleans up any entries older than 30 days. To
ensure that your error log entries are informative, noticed, and remain intact, test your driver thoroughly.
where:
buf Specifies a pointer to a buffer that contains an error record as described in the sys/errids.h header file.
cnt Specifies the number of bytes in the error record contained in the buffer pointed to by the buf parameter.
The following sample code is an example of a device driver error logging routine. This routine takes data
passed to it from some part of the main body of the device driver. This code simply fills in the structure
with the pertinent information, then passes it on using the errsave kernel service.
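void
errsv_ex (int err_id, unsigned int port_num,
          int line, char *file, uint data1, uint data2)
{
    dderr log;                  /* error record passed to errsave */
    char errbuf[255];
    ddex_dds *p_dds;

    p_dds = dds_dir[port_num];
    log.err.error_id = err_id;

    /* Use the adapter name when port_num flags a bad state. */
    if (port_num == BAD_STATE) {
        sprintf(log.err.resource_name, "%s :%d",
                p_dds->dds_vpd.adpt_name, data1);
        data1 = 0;
    }
    else
        sprintf(log.err.resource_name, "%s", p_dds->dds_vpd.devname);

    log.data1 = data1;          /* detail data for the error log report */
    log.data2 = data2;

    errsave(&log, sizeof(dderr));   /* pass the record to the error facility */
}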
The data to be passed to the errsave kernel service is defined in the dderr structure, which is defined in a
local header file, dderr.h. The definition for dderr is:
typedef struct dderr {
struct err_rec0 err;
int data1; /* use data1 and data2 to show detail */
int data2; /* data in the errlog report. Define */
/* these fields in the errlog template */
/* These fields may not be used in all */
/* cases. */
} dderr;
The first field of the dderr structure is an err_rec0 structure, which is defined in the
sys/err_rec.h header file. This structure contains the ID (or label) and a field for the resource name. The
two data fields hold the detail data for the error log report. As an alternative, you could simply list the fields
within the function.
Note: Care must be taken when logging a data structure, because error logging does not support padding
done by the compiler.
Because err_rec0 is 20 (0x14) bytes long, the compiler normally inserts 4 bytes of padding before a
64-bit data field when compiling in 64-bit mode. The structure then looks like the following:
struct {
struct err_rec0 err;
int padding;
long data; /* 64 bits of data, 64-bit aligned */
} myerr;
Thus the Detail_Data item in the template begins formatting at the padding data item rather than at data.
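The padding can be checked with the offsetof macro. The following sketch uses fake_err_rec0, a
hypothetical 20-byte stand-in for err_rec0 (the real definition is in the sys/err_rec.h header file),
so that it can be built outside AIX. The padded_gap function reports how many bytes the compiler
inserted before a 64-bit field, and split_gap shows one possible workaround, which stores the value as
two 32-bit words so that no padding is required. All names here are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct err_rec0: 4 + 16 = 20 bytes. */
struct fake_err_rec0 {
    uint32_t error_id;
    char     resource_name[16];
};

/* A 64-bit field directly after the 20-byte header. Where 64-bit
 * values are 8-byte aligned, the compiler pads before data. */
struct padded_err {
    struct fake_err_rec0 err;
    int64_t              data;
};

/* The same 64 bits carried as two 32-bit words, which need only
 * 4-byte alignment and so follow the header without a gap. */
struct split_err {
    struct fake_err_rec0 err;
    uint32_t             data_hi;
    uint32_t             data_lo;
};

/* Bytes of compiler padding between the header and the detail data. */
static size_t padded_gap(void)
{
    return offsetof(struct padded_err, data) - sizeof(struct fake_err_rec0);
}

static size_t split_gap(void)
{
    return offsetof(struct split_err, data_hi) - sizeof(struct fake_err_rec0);
}
```

If padded_gap returns a nonzero value for a structure, the Detail_Data items in its template would need
to account for the padding, or the structure could be laid out as in split_err so that the detail data
starts immediately after the header.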
After you add the templates using the errupdate command, compile the device driver code along with the
new header file. Simulate the error and verify that it was written to the error log correctly. Some details to
check for include:
v Is the error daemon running? This can be verified by running the ps -ef command and checking for
/usr/lib/errdemon as part of the output.
v Is the error part of the error template repository? Verify this by running the errpt -at command.
v Was the new header file, which was created by the errupdate command and which contains the error
label and unique error identification number, included in the device driver code when it was compiled?
Introduction
The operating system is shipped with permanent trace event points. These events provide general visibility
to system execution. You can extend the visibility into applications by inserting additional events and
providing formatting rules.
The collection of trace data was designed so that system performance and flow would be minimally
altered by activating trace. Because of this, the facility is extremely useful as a performance analysis tool
and as a problem determination tool.
The trace facility does not strongly couple data reduction to instrumentation but provides a stream of
system events. You do not need to decide in advance what statistics are needed, because the statistics
or data reduction are to a large degree separated from the instrumentation.
You can choose to develop the minimum, maximum, and average time for task A from the flow of events.
But it is also possible to extract the average time for task A when called by process B, extract the average
time for task A when conditions XYZ are met, develop a standard deviation for task A, or even decide that
some other task, recognized by a stream of events, is more meaningful to summarize. This flexibility is
invaluable for diagnosing performance or functional problems.
The trace facility generates large volumes of data, which cannot be captured for extended periods of
time without overflowing the storage device. This leaves two practical ways to use the trace facility
natively.
First, the trace facility can be triggered in multiple ways to capture small increments of system activity. It is
practical to capture seconds to minutes of system activity in this way for post-processing. This is sufficient
time to characterize major application transactions or interesting sections of a long task.
Second, the trace facility can be configured to direct the event stream to standard output. This allows a
real-time process to connect to the event stream and provide data reduction in real-time, thereby creating
a long term monitoring capability. A logical extension for specialized instrumentation is to direct the data
stream to an auxiliary device that can either store massive amounts of data or provide dynamic data
reduction.
For AIX 5.3, tracing can be limited to a specified set of processes or threads. This can greatly reduce the
amount of data generated and allow you to target the trace to report on specific tasks of interest.
The trace facility causes predefined events to be written to a trace log. The tracing action is then stopped.
Tracing from a command line is discussed in “Controlling trace” on page 324. Tracing from a software
application is discussed and an example is presented in “Examples of Coding Events and Formatting
Events” on page 339.
After a trace is started and stopped, you must format it before viewing it.
To format the trace events that you have defined, you must provide a stanza that describes how the trace
formatter is to interpret the data that has been collected. This is described in “Syntax for Stanzas in the
trace Format File” on page 327.
The trcrpt command provides a general purpose report facility. The report facility provides little data
reduction, but converts the raw binary event stream to a readable ASCII listing of the event stream. Data
can be visually extracted by a reader, or tools can be developed to further reduce the data.
Usually, when you want to show interaction with other system routines, use the system channel. The
generic channels are provided so that you can control how much data is written to the trace log. Only your
data is written to one of the generic channels.
For more information on trace hooks, see “Macros for Recording trace Events” on page 325.
where args is the list of options that you would enter with the trace command when starting a
system trace (channel 0). To start a generic trace, include the -g option in the args string. On successful
completion, trcstart returns the channel ID. For generic tracing this channel ID can be used to record to
the private generic channel.
For an example of the trcstart routine, see “Examples of Coding Events and Formatting Events” on page
339.
When compiling a program using this subroutine, you must request the link to the librts.a library. Use -l
rts as a compile option.
Controlling trace
Basic controls for the trace facility exist as trace subcommands, standalone commands, and subroutines.
If you configure the trace routine to run asynchronously (the -a option), you can control the trace facility
with the following commands:
This report facility does not attempt to extract summary statistics (such as CPU utilization and disk
utilization) from the event stream. This can be done in several other ways. To create simple summaries,
consider using awk scripts to process the output obtained from the trcrpt command.
An event record should be as short as possible. Many system events use only the hookword and
timestamp. There is another event type you should seldom use because it is less efficient. It is a long
format that allows you to record variable-length data. In this long form, the 16-bit data field of the
hookword is converted to a length field that describes the length of the event record.
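The hookword layout can be illustrated with a few bit operations. The sketch below assumes the 32-bit
layout that the trace format file comments write as MMMTDDDD: a 12-bit hook ID, a 4-bit hook type, and
the 16-bit data field. The helper names are hypothetical and are not part of the trace macros, and the
sketch does not assign any meaning to the hook type values.

```c
#include <stdint.h>

/* Build a 32-bit hookword from its three fields (layout MMMTDDDD). */
static uint32_t hookword_pack(unsigned id, unsigned type, unsigned data)
{
    return ((uint32_t)(id   & 0xFFFu) << 20) |   /* MMM:  12-bit hook ID   */
           ((uint32_t)(type & 0xFu)   << 16) |   /* T:    4-bit hook type  */
            (uint32_t)(data & 0xFFFFu);          /* DDDD: 16-bit hook data */
}

/* Extract the individual fields again. */
static unsigned hookword_id(uint32_t hw)   { return (unsigned)(hw >> 20); }
static unsigned hookword_type(uint32_t hw) { return (unsigned)((hw >> 16) & 0xFu); }
static unsigned hookword_data(uint32_t hw) { return (unsigned)(hw & 0xFFFFu); }
```

For example, hookword_pack(0x010, 0, 0) yields 0x01000000, which places event ID 010 in the top three
hexadecimal digits. In the long form described above, the value carried in the data field would be the
length of the event record rather than user data.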
The macros to record system (channel 0) events with a time stamp are:
v TRCHKL0T (hw)
v TRCHKL1T (hw,D1)
v TRCHKL2T (hw,D1,D2)
v TRCHKL3T (hw,D1,D2,D3)
v TRCHKL4T (hw,D1,D2,D3,D4)
v TRCHKL5T (hw,D1,D2,D3,D4,D5)
Similarly, to record non-time stamped system events (channel 0) on versions of AIX prior to AIX 5L Version
5.3 with the 5300-05 Technology Level, use the following macros:
v TRCHKL0 (hw)
v TRCHKL1 (hw,D1)
v TRCHKL2 (hw,D1,D2)
v TRCHKL3 (hw,D1,D2,D3)
v TRCHKL4 (hw,D1,D2,D3,D4)
v TRCHKL5 (hw,D1,D2,D3,D4,D5)
In AIX 5L Version 5.3 with the 5300-05 Technology Level and above, a time stamp is recorded with each
event regardless of the type of macro used.
There are only two macros to record events to one of the generic channels (channels 1-7). They are:
v TRCGEN (ch,hw,d1,len,buf)
v TRCGENT (ch,hw,d1,len,buf)
To allow you to define events in your environments or during development, a range of event IDs exist for
temporary use. The range of event IDs for temporary use is hex 010 through hex 0FF. No permanent
(shipped) events are assigned in this range. You can freely use this range of IDs in your own environment.
If you do use IDs in this range, do not let the code leave your environment.
Permanent events must have event IDs assigned by the current owner of the trace component. To obtain
a trace event id, send a note with a subject of help to [email protected].
You should conserve event IDs because they are limited. Event IDs can be extended by the data field. The
only reason to have a unique ID is that an ID is the level at which collection and report filtering is available
in the trace facility. An ID can be collected or not collected by the trace collection process and reported or
not reported by the trace report facility. Whole applications can be instrumented using only one event ID.
The only restriction is that visibility then becomes all or nothing: you can choose only whether the
events for that application are visible.
A new event can be formatted by the trace report facility (trcrpt command) if you create a stanza for the
event in the trace format file. The trace format file is an editable ASCII file. The syntax for format stanzas
is shown in Syntax for Stanzas in the trace Format File. All permanently assigned event IDs should have
an appropriate stanza in the default trace format file shipped with the base operating system.
Consult a performance analyst for decisions regarding what events and data to capture as permanent
events for a new component. The following paragraphs provide some guidelines for these decisions.
Events should capture execution flow and data flow between major components or major sections of a
component. For example, there are existing events that capture the interface between the virtual memory
manager and the logical volume manager. If work is being queued, data that identifies the queued item (a
handle) should be recorded with the event. When a queue element is being processed, the "dequeue"
event should provide this identifier as data also, so that the queue element being serviced is identified.
Data or requests that are identified by different handles at different levels of the system should have
events and data that allow them to be uniquely identified at any level. For example, a read request to the
physical file system is identified by a file descriptor and a current offset in the file. To a virtual memory
manager, the same request is identified by a segment ID and a virtual page address. At the disk device
driver level, this request is identified as a pointer to a structure that contains pertinent data for the request.
The file descriptor or segment information is not available at the device driver level. Events must provide
the necessary data to link these identifiers so that, for example, when a disk interrupt occurs for incoming
data, the identifier at that level (which can simply be the buffer address for where the data is to be copied)
can be linked to the original user request for data at some offset into a file.
Use events to give visibility to resource consumption. Whenever resources are claimed, returned, created,
or deleted, an event should record the fact. Examples include claiming or returning buffers to a buffer
pool, or growing or shrinking the number of buffers in the pool.
The following guidelines can help you determine where and when you should have trace hooks in your
code:
v Tracing entry and exit points of every function is not necessary. Provide only key actions and data.
v Show linkage between major code blocks or processes.
v If work is queued, associate a name (handle) with it and output it as data.
v If a queue is being serviced, the trace event should indicate the unique element being serviced.
v If a work request or response is being referenced by different handles as it passes through different
software components, trace the transactions so the action or receipt can be identified.
v Place trace hooks so that requests, responses, errors, and retries can be observed.
v Identify when resources are claimed, returned, created, or destroyed.
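The guidelines about queued work can be sketched as follows. In this illustration the address of the
work item serves as its handle, and the same handle is recorded when the item is queued and when it is
serviced, so that the two events can be matched in a report. The trace_event function is a hypothetical
stand-in that stores records in memory; a real driver would use a trace hook macro instead. The event
codes and structure names are also assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

enum { EV_ENQUEUE = 1, EV_DEQUEUE = 2 };   /* hypothetical event codes */

struct trace_rec { unsigned event; uintptr_t handle; };

static struct trace_rec trace_log[64];     /* in-memory stand-in for the trace log */
static size_t trace_len;

/* Stand-in for a real trace hook: record the event and its handle. */
static void trace_event(unsigned event, uintptr_t handle)
{
    if (trace_len < sizeof trace_log / sizeof trace_log[0]) {
        trace_log[trace_len].event = event;
        trace_log[trace_len].handle = handle;
        trace_len++;
    }
}

struct work_item  { struct work_item *next; int payload; };
struct work_queue { struct work_item *head, *tail; };

/* Queue a work item and trace it, using its address as the handle. */
static void wq_put(struct work_queue *q, struct work_item *w)
{
    w->next = NULL;
    if (q->tail != NULL)
        q->tail->next = w;
    else
        q->head = w;
    q->tail = w;
    trace_event(EV_ENQUEUE, (uintptr_t)w);
}

/* Remove the oldest item and trace which element is being serviced. */
static struct work_item *wq_get(struct work_queue *q)
{
    struct work_item *w = q->head;

    if (w == NULL)
        return NULL;
    q->head = w->next;
    if (q->head == NULL)
        q->tail = NULL;
    trace_event(EV_DEQUEUE, (uintptr_t)w);
    return w;
}
```

Only the two key actions are traced, not the entry and exit of every function. The time an item spent
on the queue can then be derived from the pair of events that carry the same handle.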
A trace format stanza can be as long as required to describe the rules for any particular event. The stanza
can be continued to the next line by terminating the present line with a backslash (\). The fields are:
event_id
Each stanza begins with the three-digit hexadecimal event ID that the stanza describes, followed
by a space.
V.R This field describes the version (V) and release (R) that the event was first assigned. Any integers
work for V and R, and you might want to keep your own tracking mechanism.
L= The text description of an event can begin at various indentation levels. This improves the
readability of the report output. The indentation levels correspond to the level at which the system
is running. The recognized levels are:
APPL Application level
SVC Transitioning system call
KERN Kernel level
INT Interrupt
event_label
The event_label is an ASCII text string that describes the overall use of the event ID. This is used
In most cases, the data length part of the specifier can also be the letter "W", which indicates that the word size of the
trace hook is to be used. For example, XW will format 4 or 8 bytes into hexadecimal, depending upon whether the
trace hook comes from a 32-bit or 64-bit environment.
Am.n This value specifies that m bytes of data are consumed as ASCII text, and that it is displayed
in an output field that is n characters wide. The data pointer is moved m bytes.
S1, S2, S4 Left justified string. The length of the field is defined as 1 byte (S1), 2 bytes (S2), or 4 bytes
(S4) and so on. The data pointer is moved accordingly. SW indicates that the word size for the
trace event is to be used.
Bm.n Binary data of m bytes and n bits. The data pointer is moved accordingly.
Xm Hexadecimal data of m bytes. The data pointer is moved accordingly.
D2, D4 Signed decimal format. Data length of 2 (D2) bytes or 4 (D4) bytes is consumed.
U2, U4 Unsigned decimal format. Data length of 2 or 4 bytes is consumed.
F4, F8 Floating point of 4 or 8 bytes.
Gm.n Positions the data pointer. It specifies that the data pointer is positioned m bytes and n bits
into the data.
Om.n Skip or omit data. It omits m bytes and n bits.
Rm Reverse the data pointer m bytes.
Wm Position DATA_POINTER at word m. The word size is either 4 or 8 bytes, depending upon
whether this is a 32-bit or 64-bit format trace. This bears no relation to the %W format
specifier.
m8 Output the next 8 bytes as time in milliseconds from the beginning of the trace. mW will format
only 8 bytes of data. The DATA_POINTER is advanced by 8 bytes.
u4, u8 Output the next 4 or 8 bytes as time in microseconds. mW will format either 4 or 8 bytes of
data depending on whether the current hook is 32 or 64 bits. The DATA_POINTER is
advanced by 4 or 8 bytes.
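The way these format codes move the data pointer can be modeled with a small cursor over the event
data. The following C sketch is a simplified model, not the trcrpt implementation: it handles whole
bytes only (the bit counts in Gm.n and Om.n are ignored), it assumes that multibyte values are stored
most significant byte first, and all of the names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified model of the data pointer over one event's data. */
struct cursor {
    const uint8_t *buf;   /* event data                      */
    size_t         len;   /* number of bytes in buf          */
    size_t         pos;   /* data pointer, as a byte offset  */
};

/* Xm: consume m bytes (m <= 8) as one value and advance the pointer.
 * Assumes most-significant-byte-first storage. */
static uint64_t cur_X(struct cursor *c, size_t m)
{
    uint64_t v = 0;

    while (m-- > 0 && c->pos < c->len)
        v = (v << 8) | c->buf[c->pos++];
    return v;
}

/* Gm: position the pointer m bytes into the data. */
static void cur_G(struct cursor *c, size_t m)
{
    c->pos = (m <= c->len) ? m : c->len;
}

/* Om: skip m bytes from the current position. */
static void cur_O(struct cursor *c, size_t m)
{
    cur_G(c, c->pos + m);
}

/* Rm: move the pointer back m bytes. */
static void cur_R(struct cursor *c, size_t m)
{
    c->pos = (m <= c->pos) ? c->pos - m : 0;
}
```

In this model, a field list such as X2 O1 X1 R2 X2 reads bytes 0 and 1, skips byte 2, reads byte 3,
backs up to byte 2, and then reads bytes 2 and 3. Codes that consume data advance the pointer, while
G, O, and R only reposition it.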
Some macros are provided that can be used as format fields to quickly access data. For example:
$D1, $D2, $D3, $D4, $D5 These macros access data words 1 through 5 of the event record
without moving the data pointer. The data accessed by a macro is
hexadecimal by default. A macro can be cast to a different data type (X,
D, U, B) by using a % character followed by the new format code. For
example, the following macro causes data word one to be accessed,
but to be considered as 2 bytes and 3 bits of binary data:
$D1%B2.3
$HD This macro accesses the first 16 bits of data contained in the hookword,
in a similar manner as the $D1 through $D5 macros access the various
data words. It is also considered as hexadecimal data, and also can be
cast.
You can define other macros and use other formatting techniques in the trace format file. This is shown in
the following trace format file example.
# I. General Information
#
# The formats shown below apply to the data placed into the
# trcrpt format buffer. These formats in general mirror the binary
# format of the data in the trace stream. The exceptions are
# hooks from a 32-bit application on a 64-bit kernel, and hooks from a
# 64-bit application on a 32-bit kernel. These exceptions are noted
# below as appropriate.
#
# Trace formatting templates should not use the thread id or time
# stamp from the buffer. The thread id should be obtained with the
# $TID macro. The time stamp is a raw timer value used by trcrpt to
# calculate the elapsed and delta times. These values are either
# 4 or 8 bytes depending upon the system the trace was run on, not upon
# the environment from which the hook was generated.
# The system environment, 32 or 64 bit, and the hook’s
# environment, 32 or 64 bit, are obtained from the $TRACEENV and $HOOKENV
# macros discussed below.
#
# To interpret the time stamp, it is necessary to get the values from
# hook 0x00a, subhook 0x25c, used to convert it to nanoseconds.
# The 3 data words of interest are all 8 bytes in length and are in
# the generic buffer, see the template for hook 00A.
# The first data word gives the multiplier, m, and the second word
# is the divisor, d. These values should be set to 1 if the
# third word doesn’t contain a 2. The nanosecond time is then
# calculated with nt = t * m / d where t is the time from the trace.
#
# Also, on a 64-bit system, there will be a header on the trace stream.
# This header serves to identify the stream as coming from a
# 64-bit system. There is no such header on the data stream on a
# 32-bit system. This data stream, on both systems, is produced with
# the "-o -" option of the trace command.
# This header consists only of a 4-byte magic number, 0xEFDF1114.
#
# A. Binary format for the 32-bit trace data
# TRCHKL0 MMMTDDDDiiiiiiii
# TRCHKL0T MMMTDDDDiiiiiiiitttttttt
# TRCHKL1 MMMTDDDD11111111iiiiiiii
# TRCHKL1T MMMTDDDD11111111iiiiiiiitttttttt
# Note that trchkg covers TRCHKL2-TRCHKL5.
# trchkg MMMTDDDD1111111122222222333333334444444455555555iiiiiiii
# trchkgt MMMTDDDD1111111122222222333333334444444455555555 i... t...
# trcgent MMMTLLLL11111111vvvvvvvvvvvvvvvvvvvvvvvvvvxxxxxx i... t...
#
# legend:
# MMM = hook id
# T = hooktype
# D = hookdata
# i = thread id, 4 bytes on a 32-bit system and 8 bytes on a 64-bit
Step 1: Enable the trace: Enable and disable the trace from your software that has the trace hooks
defined. The following code shows the use of trace events to time the running of a program loop.
#include <sys/trcctl.h>
#include <sys/trcmacros.h>
#include <sys/trchkid.h>
main()
{
printf("configuring trace collection \n");
if (trcstart("-ad")){
perror("trcstart");
Step 2: Compile your program: When you compile the sample program, you need to link to the librts.a
library:
cc -o sample sample.c -l rts
Step 3: Run the program: Run the program. In this case, it can be done with the following command:
./sample
Step 4: Add a stanza to the format file: This provides the report generator with the information to
correctly format your file. The report facility does not know how to format the HKWD_USER1 event, unless
you provide rules in the trace format file.
The following is an example of a stanza for the HKWD_USER1 event. The HKWD_USER1 event is event
ID 010 hexadecimal. You can verify this by looking at the sys/trchkid.h header file.
# User event HKWD_USER1 Formatting Rules Stanza
# An example that will format the event usage of the sample program
010 1.0 L=APPL "USER EVENT - HKWD_USER1" O2.0 \n\
"The # of loop iterations =" U4\n\
"The elapsed time of the last loop = "\
endtimer(0x010,0x010) starttimer(0x010,0x010)
Note: When entering the example stanza, do not modify the master format file /etc/trcfmt. Instead, make
a copy and keep it in your own directory. This allows you to always have the original trace format
file available. If you are going to ship your formatting stanzas, the trcupdate command is used to
add your stanzas to the default trace format file. See the trcupdate command in AIX 5L Version 5.3
Commands Reference, Volume 5 for information about how to code the input stanzas.
Step 5: Run the format/filter program: Filter the output report to get only your events. To do this, run
the trcrpt command:
trcrpt -d 010 -t mytrcfmt -O exec=on -o sample.rpt
Usage Hints
The following sections provide some examples and suggestions for use of the trace facility.
The following command traces the entire run of the cp command and stops the trace when the command
finishes:
trace -ax "cp /etc/trcfmt /tmp/junk"
Note: This example is more educational if the source file is not already cached in system memory. The
trcfmt file can be in memory if you have been modifying it or producing trace reports. In that case,
choose as the source file some other file that is 50 to 100 KB and has not been touched.
You can see that event ID 15b is the open event. Now, process the data from the copy example (the data
is probably still in the log file) as follows:
trcrpt -d 15b -O "exec=on"
The report is written to standard output and you can determine the number of opens that occurred. If you
want to see only the opens that were performed by the cp process, run the report command again using:
trcrpt -d 15b -p cp -O "exec=on"
Trace event groups should only be manipulated using either the trcevgrp command, or SMIT. The
trcevgrp command allows groups to be created, modified, removed, and listed.
Reserved event groups may not be changed or removed by the trcevgrp command. These are generally
groups used to perform system support. A reserved event group must be created using the ODM facilities.
Such a group will have three attributes as shown below:
SWservAt:
attribute = "(name)_trcgrp"
default = " "
value = "(list-of-hooks)"
SWservAt:
attribute = "(name)_trcgrpdesc"
default = " "
value = "description"
SWservAt:
attribute = "(name)_trcgrptype"
default = " "
value = "reserved"
The hook IDs must be enclosed in double quotation marks (") and separated by commas.
In the kernel environment (including the kernel, kernel extensions, and device drivers), memory overlay
problems have been especially difficult to debug because tools for finding them have not been available.
Starting with AIX 4.2.1, however, the Memory Overlay Detection System (MODS) helps detect memory
overlay problems in the kernel, kernel extensions, and device drivers.
Note: This feature does not detect problems in application code; it only monitors kernel and kernel
extension code.
bosdebug command
The bosdebug command turns the MODS facility on and off. Only the root user can run the bosdebug
command.
After you have run bosdebug with the options you want, run the bosboot -a command, then shut down
and reboot your system (using the shutdown -r command). If you need to make any changes to your
bosdebug settings, you must run bosboot -a and shutdown -r again.
MODS works by turning on additional checking to help detect the conditions listed above. When any of
these conditions is detected, your system crashes immediately and produces a dump file that points
directly at the offending code. (In previous versions, a system dump might point to unrelated code that
happened to be running later when the invalid situation was finally detected.)
If your system crashes while the MODS is turned on, then MODS has most likely done its job.
The xmalloc subcommand provides details on exactly what memory address (if any) was involved in the
situation, and displays mini-tracebacks for the allocation or free records of this memory.
You can use these commands, as well as standard crash techniques, to determine exactly what went
wrong.
MODS limitations
There are limitations to the Memory Overlay Detection System. Although it significantly improves your
chances, MODS cannot detect all memory overlays. Also, turning MODS on has a variable negative
impact (depending on how frequently xmalloc or xmfree is called) on overall system performance and
causes somewhat more memory to be used in the kernel and the network memory heaps. If your system
is running at full processor utilization, or if you are already near the maximums for kernel memory usage,
turning on the MODS might cause performance degradation and/or system hangs.
Practical experience with the MODS, however, suggests that the great majority of customers can use it
with minimal impact to their systems.
MODS benefits
The following benefits are gained from using the MODS:
• You can more easily test and debug your own kernel extensions and device drivers.
• Difficult problems that previously required multiple attempts to recreate and debug them generally
require many fewer such attempts.
Related Information
Software Product Packaging in AIX 5L Version 5.3 General Programming Concepts: Writing and
Debugging Programs
Commands References
The errinstall command, errlogger command, errmsg command, errupdate command, extendlv
command in AIX 5L Version 5.3 Commands Reference, Volume 2.
The sysdumpdev command, sysdumpstart command, trace command, trcrpt command in AIX 5L
Version 5.3 Commands Reference, Volume 5.
Technical References
errsave kernel service in AIX 5L Version 5.3 Technical Reference: Kernel and Subsystems Volume 1.
The degree of integration with the system administrative commands is limited by the amount of
functionality provided by the module. When all of the functionality is present, the administrative commands
are able to create, delete, modify and view user and group accounts.
The security library and loadable authentication module communicate through the secmethod_table
interface. The secmethod_table structure contains a list of subroutine pointers. Each subroutine pointer
performs a well-defined operation. These subroutines are used by the security library to perform the
operations which would have been performed using the local security database files.
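The idea of a table of subroutine pointers can be sketched as follows. This is a simplified illustration only: the structure demo_method_table, the DEMO_ return codes, and every demo_ function are hypothetical stand-ins and do not reproduce the real secmethod_table layout or its signatures.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical return codes; the real values come from the AIX headers. */
enum { DEMO_AUTH_SUCCESS = 0, DEMO_AUTH_FAILURE = 1 };

/* Simplified stand-in for secmethod_table: a list of subroutine pointers. */
struct demo_method_table {
    int   (*method_authenticate)(const char *user, const char *response);
    char *(*method_getpasswd)(const char *user);
};

/* Toy check only: accepts one fixed user/response pair. */
static int demo_authenticate(const char *user, const char *response)
{
    if (strcmp(user, "guest") == 0 && strcmp(response, "guest-secret") == 0)
        return DEMO_AUTH_SUCCESS;
    return DEMO_AUTH_FAILURE;
}

/* Module side: fill in only the interfaces this module implements. */
void demo_init(struct demo_method_table *t)
{
    t->method_authenticate = demo_authenticate;
    t->method_getpasswd = NULL;   /* not provided by this toy module */
}

/* Library side: call through a pointer only if the module supplied it. */
int demo_library_authenticate(const struct demo_method_table *t,
                              const char *user, const char *response)
{
    if (t->method_authenticate == NULL)
        return DEMO_AUTH_FAILURE;
    return t->method_authenticate(user, response);
}
```

Leaving a pointer NULL is how this sketch marks an interface as not implemented, so the calling side must test each pointer before using it.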
Notes:
1. Any module which provides a method_attrlist() interface must also provide this interface.
2. Attributes which are related to password expiration or restrictions should be reported by the
method_attrlist() interface.
3. If this interface is not provided the method_getpasswd() interface must be provided.
Several of the functions make use of a table parameter to select between user, group and system
identification information. The table parameter has one of the following values:
When a table parameter is used by an authentication interface, "user" is the only valid value.
Authentication Interfaces
Authentication interfaces perform password validation and modification. The authentication interfaces verify
that a user is allowed access to the system. The authentication interfaces also maintain the authentication
information, typically passwords, which are used to authorize user access.
method_authenticate verifies that a named user has the correct authentication information, typically a
password, for a user account.
method_authenticate is called indirectly as a result of calling the authenticate subroutine. The grammar
given in the SYSTEM attribute normally specifies the name of the loadable authentication module, but it is
not required to do so.
method_authenticate returns AUTH_SUCCESS with a reenter value of zero on success. On failure a value
of AUTH_FAILURE, AUTH_UNAVAIL or AUTH_NOTFOUND is returned.
The user parameter points to the requested user. The oldpassword parameter points to the user’s current
password. The newpassword parameter points to the user’s new password. The message parameter
points to a character pointer. It will be set to a message which is output to the user.
method_chpass is called indirectly as a result of calling the chpass subroutine. The security library will
examine the registry attribute for the user and invoke the method_chpass interface for the named
loadable authentication module.
method_chpass returns zero for success or -1 for failure. On failure the message parameter should be
initialized with a user message.
method_getpasswd provides the encrypted password string for a user account. The encrypted password
string consists of two salt characters and 11 encrypted password characters. The crypt subroutine is used
to create this string and encrypt the user-supplied password for comparison.
method_getpasswd is called when method_authenticate would have been called, but is undefined. The
result of this call is compared to the result of a call to the crypt subroutine using the response to the
password prompt. See the description of the method_authenticate interface for a description of the
response parameter.
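The 13-character layout described above (two salt characters followed by 11 encrypted characters) can be checked with a small helper. This is a sketch only: demo_is_classic_crypt_string is a hypothetical name, and the check assumes the traditional crypt alphabet of ".", "/", digits, and letters. It validates the shape of the string and does not perform any password comparison.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if s has the classic 13-character layout, otherwise 0. */
int demo_is_classic_crypt_string(const char *s)
{
    static const char alphabet[] =
        "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    if (s == NULL || strlen(s) != 13)
        return 0;
    /* strspn counts the leading characters that belong to the alphabet. */
    return strspn(s, alphabet) == 13;
}
```

A module might run a check of this kind before returning a stored string from its method_getpasswd implementation, so that a corrupted entry is reported as a failure rather than compared.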
The longname parameter points to a fully-qualified user name for modules which include domain or
registry information in a user name. The shortname parameter points to the shortened name of the user,
without the domain or registry information.
method_normalize determines the shortened user name which corresponds to a fully-qualified user name.
The shortened user name is used for user account queries by the security library. The fully-qualified user
name is only used to perform initial authentication.
If the fully-qualified user name is successfully converted to a shortened user name, a non-zero value is
returned. If an error occurs a zero value is returned.
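A minimal sketch of the conversion follows. It assumes, purely for illustration, that a fully-qualified name has the form user@domain; the actual format depends on the module. The names demo_normalize and DEMO_SHORTNAME_MAX are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_SHORTNAME_MAX 32  /* hypothetical size of the shortname buffer */

/* Returns non-zero on success and zero on error, as described above. */
int demo_normalize(const char *longname, char *shortname)
{
    const char *at;
    size_t len;

    if (longname == NULL || shortname == NULL)
        return 0;

    /* Everything before the first '@' is taken as the short name. */
    at = strchr(longname, '@');
    len = at ? (size_t)(at - longname) : strlen(longname);
    if (len == 0 || len >= DEMO_SHORTNAME_MAX)
        return 0;

    memcpy(shortname, longname, len);
    shortname[len] = '\0';
    return 1;
}
```

A name with no domain part is copied unchanged, and an empty or overlong user part is treated as an error.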
The user parameter points to the requested user. The message parameter points to a character pointer. It
will be set to a message which is output to the user.
method_passwdexpired determines if the authentication information for a user account is expired. This
method distinguishes between conditions which allow the user to change their information and those which
require administrator intervention. A message is returned which provides more information to the user.
method_passwdexpired returns 0 when the password has not expired, 1 when the password is expired
and the user is permitted to change their password and 2 when the password has expired and the user is
not permitted to change their password. A value of -1 is returned when an error has occurred, such as the
user does not exist.
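The four return values can be handled with a simple switch, as in the following sketch. The enumeration names and the demo_login_may_proceed function are hypothetical, and the policy shown (continue only when the password is valid or the user may change it) is an assumption about how a caller might react, not a description of the security library.

```c
/* Names are hypothetical; the numeric values are the ones listed above. */
enum demo_expiry {
    DEMO_PW_ERROR        = -1,  /* error, such as an unknown user */
    DEMO_PW_VALID        = 0,   /* password has not expired */
    DEMO_PW_USER_CHANGE  = 1,   /* expired, user may change it */
    DEMO_PW_ADMIN_NEEDED = 2    /* expired, user may not change it */
};

/*
 * Returns 1 if the session may continue and 0 otherwise.
 * *must_change is set to 1 when the user has to choose a new password first.
 */
int demo_login_may_proceed(int rc, int *must_change)
{
    *must_change = 0;
    switch (rc) {
    case DEMO_PW_VALID:
        return 1;
    case DEMO_PW_USER_CHANGE:
        *must_change = 1;
        return 1;
    case DEMO_PW_ADMIN_NEEDED:
    case DEMO_PW_ERROR:
    default:
        return 0;
    }
}
```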
The user parameter points to the requested user. The newpassword parameter points to the user’s new
password. The oldpassword parameter points to the user’s current password. The message parameter
points to a character pointer. It will be set to a message which is output to the user.
method_passwdrestrictions determines whether the new password meets the system requirements. This method
distinguishes between conditions which allow the user to change their password by selecting a different
password and those which prevent the user from changing their password at the present time. A message
is returned which provides more information to the user.
method_passwdrestrictions returns a value of 0 when newpassword meets all of the requirements, 1 when
the password does not meet one or more requirements and 2 when the password may not be changed. A
value of -1 is returned when an error has occurred, such as the user does not exist.
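The following sketch shows one way a module might produce these return values together with a user message. The minimum length of 8, the rule that the new password must differ from the old one, the helper demo_copy, and the convention that the caller frees the message are all assumptions made for illustration. The sketch never returns 2, because it has no notion of a password that may not be changed.

```c
#include <stdlib.h>
#include <string.h>

#define DEMO_MIN_LENGTH 8  /* hypothetical policy, not an AIX default */

/* Returns a heap copy of text, or NULL if memory is exhausted. */
static char *demo_copy(const char *text)
{
    char *p = malloc(strlen(text) + 1);
    if (p != NULL)
        strcpy(p, text);
    return p;
}

/* 0: acceptable, 1: choose a different password, -1: error. */
int demo_passwdrestrictions(const char *user, const char *newpassword,
                            const char *oldpassword, char **message)
{
    *message = NULL;
    if (user == NULL || newpassword == NULL)
        return -1;

    if (strlen(newpassword) < DEMO_MIN_LENGTH) {
        *message = demo_copy("The new password is too short.");
        return 1;
    }
    if (oldpassword != NULL && strcmp(newpassword, oldpassword) == 0) {
        *message = demo_copy("The new password must differ from the old one.");
        return 1;
    }
    return 0;
}
```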
Identification Interfaces
Identification interfaces perform user and group identity functions. The identification interfaces store and
retrieve user and group identifiers and account information.
The key parameter refers to an entry in the named table. The table parameter refers to one of the three
tables. The attributes parameter refers to an array of pointers to attribute names. The results parameter
refers to an array of value return data structures. Each value return structure contains either the value of
the corresponding attribute or a flag indicating a cause of failure. The size parameter is the number of
array elements.
method_getentry retrieves user, group and system attributes. One or more attributes may be retrieved for
each call. Success or failure is reported for each attribute.
method_getentry is called as a result of calling the getuserattr, getgroupattr and getconfattr subroutines.
method_getentry returns a value of 0 if the key entry was found in the named table. When the entry does
not exist in the table, the global variable errno must be set to ENOENT. If an error in the value of table or
size is detected, the errno variable must be set to EINVAL. Individual attribute values have additional
information about the success or failure for each attribute. On failure a value of -1 is returned.
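The per-attribute reporting can be sketched as follows. The structure demo_attrval, the toy data in demo_lookup, and the use of ENOENT as the per-attribute failure flag are hypothetical; the real value return structure is defined by the AIX headers and is not reproduced here.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical value return structure: a value or a failure flag. */
struct demo_attrval {
    int         flag;   /* 0 on success, otherwise a cause of failure */
    const char *value;  /* valid only when flag is 0 */
};

/* Toy backing store: one user entry with two attributes. */
static const char *demo_lookup(const char *key, const char *attr)
{
    if (strcmp(key, "alice") != 0)
        return NULL;
    if (strcmp(attr, "home") == 0)
        return "/home/alice";
    if (strcmp(attr, "shell") == 0)
        return "/usr/bin/ksh";
    return NULL;
}

int demo_getentry(const char *key, const char *table,
                  const char *attributes[], struct demo_attrval results[],
                  int size)
{
    int i;

    /* A bad table or size is reported through errno as EINVAL. */
    if (table == NULL || strcmp(table, "user") != 0 || size <= 0) {
        errno = EINVAL;
        return -1;
    }
    /* A missing entry is reported through errno as ENOENT. */
    if (key == NULL || strcmp(key, "alice") != 0) {
        errno = ENOENT;
        return -1;
    }
    /* The entry exists: report success or failure for each attribute. */
    for (i = 0; i < size; i++) {
        const char *v = demo_lookup(key, attributes[i]);
        results[i].value = v;
        results[i].flag = (v != NULL) ? 0 : ENOENT;
    }
    return 0;
}
```

In this sketch the call as a whole succeeds even when an individual attribute is unknown, so a caller has to inspect each flag before using the corresponding value.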
The id parameter refers to a group name or GID value, depending upon the value of the type parameter.
The type parameter indicates whether the id parameter is to be interpreted as a (char *) which
references the group name, or (gid_t) for the group.
method_getgracct retrieves basic group account information. The id parameter may be a group name or
identifier, as indicated by the type parameter. The basic group information is the group name and identifier.
The group member list is not returned by this interface.
method_getgracct returns a pointer to the group’s group file entry on success. The group file entry may not
include the list of members. On failure a NULL pointer is returned.
The gid parameter is the group identifier for the requested group.
method_getgrgid retrieves group account information given the group identifier. The group account
information consists of the group name, identifier and complete member list.
method_getgrgid returns a pointer to the group’s group file structure on success. On failure a NULL
pointer is returned.
method_getgrnam retrieves group account information given the group name. The group account
information consists of the group name, identifier and complete member list.
method_getgrnam is called as a result of calling the getgrnam subroutine. This interface may also be
called if method_getentry is not defined.
method_getgrnam returns a pointer to the group’s group file structure on success. On failure a NULL
pointer is returned.
method_getgrset retrieves supplemental group information given a user name. The supplemental group
information consists of a comma separated list of group identifiers. The named user is a member of each
listed group.
method_getgrset returns a pointer to the user’s concurrent group set on success. On failure a NULL
pointer is returned.
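A caller that receives the comma-separated group set may need to convert it to numeric identifiers. The following sketch assumes the identifiers are written in decimal; demo_parse_grset is a hypothetical helper and is not part of the security library.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Parses a string such as "0,7,201" into gids[].
 * Returns the number of identifiers stored, or -1 if the string is
 * malformed or holds more than max identifiers.
 */
int demo_parse_grset(const char *grset, unsigned long gids[], int max)
{
    const char *p = grset;
    int n = 0;

    if (p == NULL)
        return -1;

    while (*p != '\0') {
        char *end;
        unsigned long gid;

        if (*p < '0' || *p > '9')
            return -1;                /* each field must start with a digit */
        gid = strtoul(p, &end, 10);
        if (n == max)
            return -1;                /* result array is full */
        gids[n++] = gid;

        if (*end == ',') {
            p = end + 1;
            if (*p == '\0')
                return -1;            /* trailing comma */
        } else if (*end == '\0') {
            p = end;
        } else {
            return -1;                /* unexpected character */
        }
    }
    return n;
}
```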
The group parameter points to the requested group. The result parameter points to a storage area which
will be filled with the group members. The type parameter indicates whether the result parameter is to be
interpreted as a (char **) which references a user name array, or (uid_t) array. The size parameter is a
pointer to the number of users in the named group. On input it is the size of the result field.
method_getgrusers retrieves group membership information given a group name. The return value may be
an array of user names or identifiers.
method_getgrusers may be called by the security library to obtain the group membership information for a
group.
method_getgrusers returns 0 on success. On failure a value of -1 is returned and the global variable errno
is set. The value ENOENT must be used when the requested group does not exist. The value ENOSPC
must be used when the list of group members does not fit in the provided array. When ENOSPC is
returned the size parameter is modified to give the size of the required result array.
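The ENOSPC convention lets a caller retry with a larger array. The following sketch shows both sides under simplifying assumptions: the type parameter is omitted and only user names are returned, the group "staff" and its members are invented data, and the functions demo_getgrusers and demo_fetch_members are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static const char *demo_members[] = { "alice", "bob", "carol" };
#define DEMO_MEMBER_COUNT 3

/* Module side: on input *size is the capacity of result. */
int demo_getgrusers(const char *group, const char **result, int *size)
{
    int i;

    if (strcmp(group, "staff") != 0) {
        errno = ENOENT;               /* requested group does not exist */
        return -1;
    }
    if (*size < DEMO_MEMBER_COUNT) {
        *size = DEMO_MEMBER_COUNT;    /* tell the caller what is required */
        errno = ENOSPC;
        return -1;
    }
    for (i = 0; i < DEMO_MEMBER_COUNT; i++)
        result[i] = demo_members[i];
    *size = DEMO_MEMBER_COUNT;
    return 0;
}

/* Caller side: grow the array when ENOSPC is reported. Caller frees. */
const char **demo_fetch_members(const char *group, int *count)
{
    int size = 1;
    const char **buf = malloc(size * sizeof *buf);

    if (buf == NULL)
        return NULL;
    while (demo_getgrusers(group, buf, &size) != 0) {
        const char **bigger;
        int saved = errno;

        if (saved != ENOSPC) {
            free(buf);
            errno = saved;            /* keep the module's error code */
            return NULL;
        }
        bigger = realloc(buf, size * sizeof *bigger);
        if (bigger == NULL) {
            free(buf);
            return NULL;
        }
        buf = bigger;
    }
    *count = size;
    return buf;
}
```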
method_getpwnam is called as a result of calling the getpwnam subroutine. This interface may also be
called if method_getentry is not defined.
method_getpwnam returns a pointer to the user’s password structure on success. On failure a NULL
pointer is returned.
method_getpwuid retrieves user account information given the user identifier. The user account information
consists of the user name, identifier, primary group identifier, full name, login directory and login shell.
method_getpwuid returns a pointer to the user’s password structure on success. On failure a NULL pointer
is returned.
The key parameter refers to an entry in the named table. The table parameter refers to one of the three
tables. The attributes parameter refers to an array of pointers to attribute names. The values parameter
refers to an array of value structures which correspond to the attributes. Each value structure contains a
flag indicating if the attribute was output. The size parameter is the number of array elements.
method_putentry stores user, group and system attributes. One or more attributes may be stored for
each call. Success or failure is reported for each attribute. Values will be saved until method_commit is
invoked.
method_putentry is called as a result of calling the putuserattr, putgroupattr and putconfattr subroutines.
method_putentry returns 0 when the attributes have been updated. On failure a value of -1 is returned and
the global variable errno is set to indicate the cause. A value of ENOSYS is used when updating
information is not supported by the module. A value of EPERM is used when the invoker does not have
permission to update the entry. A value of ENOENT is used when the entry does not exist. A value of
EROFS is used when the module was not opened for updates.
The entry parameter points to the structure to be output. The account name is contained in the structure.
method_putgrent stores group account information given a group entry. The group account information
consists of the group name, identifier and complete member list. Values will be saved until method_commit
is invoked.
The group parameter points to the requested group. The users parameter points to a NUL character
separated, double NUL character terminated, list of group members.
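A list in this format is walked by stepping over each name and its terminating NUL character until an empty string is reached. The helper names in this sketch are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Counts the names in a NUL-separated, double-NUL-terminated list. */
int demo_count_members(const char *users)
{
    const char *p = users;
    int n = 0;

    while (*p != '\0') {        /* an empty string marks the end of the list */
        n++;
        p += strlen(p) + 1;     /* skip the name and its terminating NUL */
    }
    return n;
}

/* Returns 1 if name appears in the list, otherwise 0. */
int demo_has_member(const char *users, const char *name)
{
    const char *p;

    for (p = users; *p != '\0'; p += strlen(p) + 1) {
        if (strcmp(p, name) == 0)
            return 1;
    }
    return 0;
}
```

In C source, a literal such as "alice\0bob\0" already has this layout, because the compiler appends the final NUL character.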
method_putgrusers stores group membership information given a group name. Values will be saved until
method_commit is invoked.
method_putgrusers returns 0 when the group has been successfully updated. On failure a value of -1 is
returned and the global variable errno is set to indicate the cause. A value of ENOSYS is used when
updating groups is not supported by the module. A value of EPERM is used when the invoker does not
have permission to update the group. A value of ENOENT is used when the group does not exist. A value
of EROFS is used when the module was not opened for updates.
The entry parameter points to the structure to be output. The account name is contained in the structure.
method_putpwent stores user account information given a user entry. The user account information
consists of the user name, identifier, primary group identifier, full name, login directory and login shell.
Values will be saved until method_commit is invoked.
method_putpwent returns 0 when the user has been successfully updated. On failure a value of -1 is
returned and the global variable errno is set to indicate the cause. A value of ENOSYS is used when
updating users is not supported by the module. A value of EPERM is used when the invoker does not
have permission to update the user. A value of ENOENT is used when the user does not exist. A value of
EROFS is used when the module was not opened for updates.
Support Interfaces
Support interfaces perform functions such as initiating and terminating access to the module, creating and
deleting accounts, and serializing access to information.
method_close indicates that access to the loadable module has ended and all system resources may be
freed. The loadable module must not assume this interface will be invoked as a process may terminate
without calling this interface.
method_close is called when the session count maintained by enduserdb reaches zero.
There are no defined error return values. It is expected that the method_close interface handle common
programming errors, such as being invoked with an invalid token, or repeatedly being invoked with the
same token.
The key parameter refers to an entry in the named table. If it is NULL it refers to all entries in the table.
The table parameter refers to one of the three tables.
method_commit indicates that the specified pending modifications are to be made permanent. An entire
table or a single entry within a table may be specified. method_lock will be called prior to calling
method_commit. method_unlock will be called after method_commit returns.
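The calling order described above can be sketched as follows. The structure demo_ops, the stub functions, and the call log are hypothetical and exist only to make the order visible. Treating a NULL return from the lock routine as a failure is an assumption of this sketch and is not stated by the interface description.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of a method table. */
struct demo_ops {
    void *(*lock)(const char *key, const char *table, int wait);
    int   (*commit)(const char *key, const char *table);
    void  (*unlock)(void *token);
};

/* Lock, commit, then unlock with the token that lock returned. */
int demo_commit_entry(const struct demo_ops *ops,
                      const char *key, const char *table)
{
    void *token = ops->lock(key, table, 0);
    int rc;

    if (token == NULL)          /* assumption: NULL means the lock failed */
        return -1;
    rc = ops->commit(key, table);
    ops->unlock(token);         /* always release, even if commit failed */
    return rc;
}

/* Stub module that records the order in which it is called. */
static char demo_log[8];
static size_t demo_log_len;
static int demo_token;

static void demo_note(char c)
{
    if (demo_log_len < sizeof demo_log - 1) {
        demo_log[demo_log_len++] = c;
        demo_log[demo_log_len] = '\0';
    }
}

static void *stub_lock(const char *key, const char *table, int wait)
{
    (void)key; (void)table; (void)wait;
    demo_note('L');
    return &demo_token;
}

static int stub_commit(const char *key, const char *table)
{
    (void)key; (void)table;
    demo_note('C');
    return 0;
}

static void stub_unlock(void *token)
{
    (void)token;
    demo_note('U');
}

const struct demo_ops demo_stub_ops = { stub_lock, stub_commit, stub_unlock };

/* Returns the recorded call order, for example "LCU". */
const char *demo_call_log(void)
{
    return demo_log;
}
```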
method_commit is called when putgroupattr or putuserattr is invoked with a Type parameter of
SEC_COMMIT. The value of the Group or User parameter will be passed directly to method_commit.
method_commit returns a value of 0 for success. A value of -1 is returned to indicate an error and the
global variable errno is set to indicate the cause. A value of ENOSYS is used when the load module does
not support modification requests for any users. A value of EROFS is used when the module is not
currently opened for updates. A value of EINVAL is used when the table parameter refers to an invalid
table. A value of EIO is used when a potentially temporary input-output error has occurred.
method_delgroup removes a group account and all associated information. A call to method_commit is not
required. The group will be removed immediately.
method_delgroup is called when putgroupattr is invoked with a Type parameter of SEC_DELETE. The
value of the Group and Attribute parameters will be passed directly to method_delgroup.
method_delgroup returns 0 when the group has been successfully removed. On failure a value of -1 is
returned and the global variable errno is set to indicate the cause. A value of ENOSYS is used when
deleting groups is not supported by the module. A value of EPERM is used when the invoker does not
have permission to delete the group. A value of ENOENT is used when the group does not exist. A value
of EROFS is used when the module was not opened for updates. A value of EBUSY is used when the
group has defined members.
method_deluser removes a user account and all associated information. A call to method_commit is not
required. The user will be removed immediately.
method_deluser is called when putuserattr is invoked with a Type parameter of SEC_DELETE. The value
of the User and Attribute parameters will be passed directly to method_deluser.
method_deluser returns 0 when the user has been successfully removed. On failure a value of -1 is
returned and the global variable errno is set to indicate the cause. A value of ENOSYS is used when
deleting users is not supported by the module. A value of EPERM is used when the invoker does not have
permission to delete the user. A value of ENOENT is used when the user does not exist. A value of
EROFS is used when the module was not opened for updates.
The key parameter refers to an entry in the named table. If it is NULL it refers to all entries in the table.
The table parameter refers to one of the three tables. The wait parameter is the number of seconds to wait
for the lock to be acquired. If the wait parameter is zero the call returns without waiting if the entry cannot
be locked immediately.
method_lock informs the loadable modules that access to the underlying mechanisms should be serialized
for a specific table or table entry.
method_lock is called by the security library when serialization is required. The return value will be saved
and used by a later call to method_unlock when serialization is no longer required.
method_newgroup creates a group account. The basic group account information must be provided with
calls to method_putgrent or method_putentry. The group account information will not be made permanent
until method_commit is invoked.
method_newgroup is called when putgroupattr is invoked with a Type parameter of SEC_NEW. The value
of the Group parameter will be passed directly to method_newgroup.
method_newgroup returns 0 when the group has been successfully created. On failure a value of -1 is
returned and the global variable errno is set to indicate the cause. A value of ENOSYS is used when
creating groups is not supported by the module. A value of EPERM is used when the invoker does not have
permission to create the group. A value of EEXIST is used when the group already exists. A value of
EROFS is used when the module was not opened for updates. A value of EINVAL is used when the group
name has an invalid format, length or composition.
method_newuser is called when putuserattr is invoked with a Type parameter of SEC_NEW. The value of
the User parameter will be passed directly to method_newuser.
method_newuser returns 0 when the user has been successfully created. On failure a value of -1 is
returned and the global variable errno is set to indicate the cause. A value of ENOSYS is used when
creating users is not supported by the module. A value of EPERM is used when the invoker does not have
permission to create the user. A value of EEXIST is used when the user already exists. A value of EROFS
is used when the module was not opened for updates. A value of EINVAL is used when the user name has an
invalid format, length or composition.
The name parameter is a pointer to the stanza name in the configuration file. The domain parameter is the
value of the domain= attribute in the configuration file. The mode parameter is either O_RDONLY or
O_RDWR. The options parameter is a pointer to the options= attribute in the configuration file.
method_open prepares a loadable module for use. The domain and options attributes are passed to
method_open.
method_open is called by the security library when the loadable module is first initialized and when
setuserdb is first called after method_close has been called due to an earlier call to enduserdb. The return
value will be saved for a future call to method_close.
method_unlock informs the loadable modules that an earlier need for access serialization has ended.
method_unlock is called by the security library when serialization is no longer required. The return value
from the earlier call to method_lock will be used.
Configuration Files
The security library uses the /usr/lib/security/methods.cfg file to control which modules are used by the
system. A stanza exists for each loadable module which is to be used by the system. Each stanza
contains a number of attributes used to load and initialize the module. The loadable module may use this
information to configure its operation when the method_open() interface is invoked immediately after the
module is loaded.
Interfaces are divided into three categories: identification, authentication and support. Identification
interfaces are used when a compound module is performing an identification operation, such as the
getpwnam() subroutine. Authentication interfaces are used when a compound module is performing an
authentication operation, such as the authenticate() subroutine. Support subroutines are used when
initializing the loadable module, creating or deleting entries, and performing other non-data operations. The
table Method Interface Types describes the purpose of each interface. The table below describes which
support interfaces are called in a compound module and their order of invocation.
Related Information
Identification and Authentication Subroutines
IBM may not offer the products, services, or features discussed in this document in other countries.
Consult your local IBM representative for information on the products and services currently available in
your area. Any reference to an IBM product, program, or service is not intended to state or imply that only
that IBM product, program, or service may be used. Any functionally equivalent product, program, or
service that does not infringe any IBM intellectual property right may be used instead. However, it is the
user’s responsibility to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document.
The furnishing of this document does not give you any license to these patents. You can send license
inquiries, in writing, to:
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer
of express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically
made to the information herein; these changes will be incorporated in new editions of the publication. IBM
may make improvements and/or changes in the product(s) and/or the program(s) described in this
publication at any time without notice.
Licensees of this program who wish to have information about it for the purpose of enabling: (i) the
exchange of information between independently created programs and other programs (including this one)
and (ii) the mutual use of the information which has been exchanged, should contact:
IBM Corporation
Dept. LRAS/Bldg. 003
11400 Burnet Road
Austin, TX 78758-3498
U.S.A.
Such information may be available, subject to appropriate terms and conditions, including in some cases,
payment of a fee.
The licensed program described in this document and all licensed material available for it are provided by
IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any
equivalent agreement between us.
For license inquiries regarding double-byte (DBCS) information, contact the IBM Intellectual Property
Department in your country or send inquiries, in writing, to:
IBM may use or distribute any of the information you supply in any way it believes appropriate without
incurring any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their
published announcements or other publicly available sources. IBM has not tested those products and
cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products.
Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in
any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of
the materials for this IBM product and use of those Web sites is at your own risk.
This information contains examples of data and reports used in daily business operations. To illustrate
them as completely as possible, the examples include the names of individuals, companies, brands, and
products. All of these names are fictitious and any similarity to the names and addresses used by an
actual business enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrates programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing
application programs conforming to the application programming interface for the operating platform for
which the sample programs are written. These examples have not been thoroughly tested under all
conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these
programs. You may copy, modify, and distribute these sample programs in any form without payment to
IBM for the purposes of developing, using, marketing, or distributing application programs conforming to
IBM’s application programming interfaces.
Each copy or any portion of these sample programs or any derivative work, must include a copyright
notice as follows:
(c) (your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. (c)
Copyright IBM Corp. _enter the year or years_. All rights reserved.
Trademarks
The following terms are trademarks of International Business Machines Corporation in the United States,
other countries, or both:
400
AIX
AIX 5L
BladeCenter
eServer
IBM
Micro Channel
PowerPC
RS/6000
UNIX is a registered trademark of The Open Group in the United States and other countries.
Index 365
FCP device driver (continued)
SC_SINGLE 294
SCIOLEVENT 298
FDDI device driver 131
configuration parameters 131
entry points 132
trace and error logging 133
Fiber Distributed Data Interface device driver 131
file descriptor 66
file systems
logical file system 39
virtual file system 40
files
/dev/error 320
/dev/systrctl 324
/etc/trcfmt 324, 340
sys/err_rec.h 321
sys/errids.h 321
sys/trchkid.h 325, 326, 340
sys/trcmacros.h 325
filesystem 39
fine granularity timer services 83
Forum Compliant ATM LAN Emulation device driver 118
function
callback 51
G
g-nodes 41
getattr subroutine
modifying attributes 107
graphic input device 193
H
hardware interrupt kernel services 46
I
I/O kernel services
block I/O 45
buffer cache 46
character I/O 46
DMA management 47
interrupt management 46
memory buffer (mbuf) 47
IDE subsystem
adapter driver
entry points 307
ioctl commands 308, 309
performing dumps 307
consolidated commands 304
device communication
initiator-mode support 301
error processing 307
error recovery
analyzing returned status 302
initiator mode 302
fragmented commands 304
IDE device driver
design requirements 307
entry points 307
internal commands 303
responsibilities relative to adapter device driver 301
IDEIOIDENT 310
IDEIOINQU 309
IDEIOREAD 309
IDEIORESET 309
IDEIOSTART 309
IDEIOSTOP 309
IDEIOSTUNIT 309
IDEIOTUR 309
initiator I/O request execution 303
spanned commands 304
structures
ataide_buf structure 304
typical adapter transaction sequence 302
input device, subsystem 193
input ring mechanism 200
interface
low function terminal subsystem 199
interrupt execution environment 6
interrupt management
defining levels 53
setting priorities 54
interrupt management kernel services 53
interrupts
management services 46
INTSTOLLONG macro 27
IOCINFO
FCP adapters 260
iSCSI adapters 260
Virtual SCSI 260
ioctl commands
SCIOCMD 244
iSCSI 251, 252, 273, 278, 283, 284, 285, 286
autosense data 280
command tag queuing 285
consolidated commands 285
error recovery 280
fragmented commands 285
initiator I/O requests 284
initiator-mode recovery 281, 282
NACA=1 error 281
openx subroutine options 293
returned status 282
SC_CHECK_CONDITION 282
scsi_buf structure 286
spanned commands 285
iSCSI adapters
IOCINFO 260
K
kernel data
accessing in a system call 24
kernel environment 1
base kernel services 2
logical file system (continued)
file routines 40
v-nodes 40
file system role 39
logical volume device driver
bottom half 209
data structures 209
physical device driver interface 211
pseudo-device driver role 208
top half 209
logical volume manager
DASD support 205
logical volume subsystem
bad block processing 211
logical volume device driver 208
physical volumes
comparison with logical volumes 205
reserved sectors 206
LONG32TOLONG64 macro 26
loopback kernel services 77
low function terminal
configuration commands 200
functional description 199
interface 199
components 200
configuration 199
device driver entry points 200
ioctls 200
terminal emulation 199
to display device drivers 200
to system keyboard 200
low function terminal interface
AIXwindows support 200
low function terminal subsystem 199
accented characters supported 202
lsattr command
displaying attribute characteristics of devices 109
lscfg command
displaying device diagnostic information 109
lsconn command
displaying device connections 109
lsdev command
displaying device information 109
lsparent command
displaying information about parent devices 109
M
macros
INTSTOLLONG 27
LONG32TOLONG64 26
memory buffer (mbuf) 47
management kernel services 62
management services
file descriptor 66
mbuf structures
communications device handlers 112
memory buffer (mbuf) kernel services 47
memory buffer (mbuf) macros 47
memory kernel services
memory management 68
memory pinning 69
user memory access 69
message queue kernel services 75
mkdev command
adding devices to the system 109
configuring devices 103
MODS 313, 343
MPQP device handlers
binary synchronous communication
message types 115
receive errors 116
entry points 115
multiprocessor-safe timer services 83
Multiprotocol device handlers 115
N
NACA=1 error 281
NDD_ADAP_CHECK 138
NDD_AUTO_RMV 138
NDD_BUS_ERR 138
NDD_CLEAR_STATS 126, 141, 181
NDD_CMD_FAIL 138
NDD_DEBUG_TRACE 127
NDD_DISABLE_ADAPTER 183
NDD_DISABLE_ADDRESS 126, 141, 181
NDD_DISABLE_MULTICAST 126, 182
NDD_DUMP_ADDR 183
NDD_ENABLE_ADAPTER 183
NDD_ENABLE_ADDRESS 126, 140, 181
NDD_ENABLE_MULTICAST 127, 182
NDD_GET_ALL_STATS 127, 141, 181
NDD_GET_STATS 128, 140, 180
NDD_MIB_ADDR 128, 141, 181
NDD_MIB_GET 128, 140, 180
NDD_MIB_QUERY 128, 140, 180
NDD_PIO_FAIL 137
NDD_PROMISCUOUS_OFF 182
NDD_PROMISCUOUS_ON 182
NDD_SET_LINK_STATUS 183
NDD_SET_MAC_ADDR 183
NDD_TX_ERROR 138
NDD_TX_TIMEOUT 138
network kernel services
address family domain 76
communications device handler interface 77
interface address 76
loopback 77
network interface device driver 76
protocol 77
routing 76
O
object data manager 104
ODM 104
odmadd command
adding devices to predefined database 104
openx subroutine 293
SC_DIAGNOSTIC 293
SCSI subsystem (continued)
adapter device driver (continued)
responsibilities relative to SCSI device driver 219
target-mode ioctl commands 246
asynchronous event handling 220
command tag queuing 228
device communication
initiator-mode support 220
target-mode support 220
error processing 236
error recovery
initiator mode 222
target mode 225
initiator I/O request execution
fragmented commands 227
gathered write commands 227
spanned or consolidated commands 226
initiator-mode adapter transaction sequence 225
SCSI device driver
asynchronous event-handling routine 222
closing a device 236
design requirements 233
entry points 237
internal commands 225
responsibilities relative to adapter device driver 219
using openx subroutine options 233
structures
sc_buf structure 228
tm_buf structure 236, 240
target-mode interface 238, 239, 241
interaction with initiator-mode interface 238
SCSI_ADAPTER_HDW_FAILURE 288
SCSI_ADAPTER_SFW_FAILURE 288
scsi_buf structure 286
fields 286
SCSI_CMD_TIMEOUT 288
SCSI_FUSE_OR_TERMINAL_PWR 288
SCSI_HOST_IO_BUS_ERR 288
SCSI_NO_DEVICE_RESPONSE 288
SCSI_TRANSPORT_BUSY 289
SCSI_TRANSPORT_DEAD 289
SCSI_TRANSPORT_FAULT 288
SCSI_TRANSPORT_RESET 288
SCSI_WW_NAME_CHANGE 288
security kernel services 81
serial optical link device handlers 116
signal management 78
Small Computer Systems Interface subsystem 219
SOL device handlers
changing device attributes 118
configuring physical and logical devices 117
entry points 116, 117
special files interfaces 117
status and exception codes 113
status blocks
communications device handler
CIO_ASYNC_STATUS 115
CIO_HALT_DONE 114
CIO_LOST_STATUS 114
CIO_NULL_BLK 114
CIO_START_DONE 114
CIO_TX_DONE 114
communications device handlers and 113
status codes
communications device handlers and 113
status codes, system dump 316
storage 205
stream-based tty subsystem 199
structures
scsi_buf 286
subroutines
close 193
ioctl 193
open 193
read 193
write 193
subsystem
graphic input device 193
low function terminal 199
streams-based tty 199
system calls
accessing kernel data in 24
asynchronous signals 33
error information 35
exception handling 33, 34
execution 24
in kernel protection domain 23
in user protection domain 23
nesting for kernel-mode use 34
page faulting 34
passing parameters 25
preempting 32
services for all kernel extensions 35
services for kernel processes only 35
setjmpx kernel service 33
signal handling in 32
stacking saved contexts 33
using with kernel extensions 2
wait termination 33
system dump
checking status 316
configuring dump devices 313
copy from server 317
copying from dataless machines 317
copying on a non-dataless machine 318
copying to other media 317
including device driver data 314
locating 317
reboot in normal mode 317
starting 314
system dump facility 313
T
terminal emulation
low function terminal 199
threads
creating 78
time-of-day kernel services 82
Readers’ Comments — We’d Like to Hear from You
AIX 5L Version 5.3
Kernel Extensions and Device Support Programming Concepts
When you send comments to IBM, you grant IBM a nonexclusive right to use or distribute your comments in any
way it believes appropriate without incurring any obligation to you. IBM or any other organizations will only use the
personal information that you supply to contact you about the issues that you state on this form.
IBM Corporation
Information Development
Department 04XA-905-6C006
11501 Burnet Road
Austin, TX 78758-3493
Printed in U.S.A.