Linux Kernel Internals 2.4
Introduction to the Linux 2.4 kernel. The latest copy of this document can always be downloaded from:
http://www.moses.uklinux.net/patches/lki.sgml This guide is now part of the Linux Documentation
Project and can also be downloaded in various formats from: http://www.linuxdoc.org/guides.html or
can be read online (latest version) at: http://www.moses.uklinux.net/patches/lki.html This documentation
is free software; you can redistribute it and/or modify it under the terms of the GNU General Public
License as published by the Free Software Foundation; either version 2 of the License, or (at your
option) any later version. The author works as a senior Linux kernel engineer at VERITAS Software
Ltd and wrote this book to support the short training course/lectures he gave on this subject
internally at VERITAS. Thanks to Juan J. Quintela ([email protected]), Francis
Galiegue ([email protected]), Hakjun Mun ([email protected]), Matt Kraai
([email protected]), Nicholas Dronen ([email protected]), Samuel
S Chessman ([email protected]), Nadeem Hasan ([email protected]) for various
corrections and suggestions. The Linux Page Cache chapter was written by Christoph Hellwig
([email protected]). The IPC Mechanisms chapter was written by Russell Weight
([email protected]) and Mingming Cao ([email protected])
1. Booting
● 1.1 Building the Linux Kernel Image
● 1.2 Booting: Overview
● 1.3 Booting: BIOS POST
● 1.4 Booting: bootsector and setup
● 1.5 Using LILO as a bootloader
● 1.6 High level initialisation
● 1.7 SMP Bootup on x86
● 1.8 Freeing initialisation data and code
● 1.9 Processing kernel command line
5. IPC mechanisms
● 5.1 Semaphores
● 5.2 Message queues
● 5.3 Shared Memory
● 5.4 Linux IPC Primitives
1. Booting
1.1 Building the Linux Kernel Image
This section explains the steps taken during compilation of the Linux kernel and the output produced at
each stage. The build process depends on the architecture so I would like to emphasize that we only
consider building a Linux/x86 kernel.
When the user types 'make zImage' or 'make bzImage' the resulting bootable kernel image is stored as
arch/i386/boot/zImage or arch/i386/boot/bzImage respectively. Here is how the image is
built:
1. C and assembly source files are compiled into ELF relocatable object format (.o) and some of them
are grouped logically into archives (.a) using ar(1).
2. Using ld(1), the above .o and .a are linked into vmlinux which is a statically linked, non-stripped
ELF 32-bit LSB 80386 executable file.
3. System.map is produced by running nm vmlinux and grepping out irrelevant or uninteresting symbols.
4. Enter directory arch/i386/boot.
5. Bootsector asm code bootsect.S is preprocessed either with or without -D__BIG_KERNEL__,
depending on whether the target is bzImage or zImage, into bbootsect.s or bootsect.s
respectively.
6. bbootsect.s is assembled and then converted into 'raw binary' form called bbootsect (or
bootsect.s assembled and raw-converted into bootsect for zImage).
7. Setup code setup.S (setup.S includes video.S) is preprocessed into bsetup.s for bzImage
or setup.s for zImage. In the same way as the bootsector code, the difference is marked by
-D__BIG_KERNEL__ present for bzImage. The result is then converted into 'raw binary' form
called bsetup.
8. Enter directory arch/i386/boot/compressed and convert /usr/src/linux/vmlinux to
$tmppiggy (tmp filename) in raw binary format, removing .note and .comment ELF sections.
9. gzip -9 < $tmppiggy > $tmppiggy.gz
10. Link $tmppiggy.gz into ELF relocatable (ld -r) piggy.o.
11. Compile compression routines head.S and misc.c (still in arch/i386/boot/compressed
directory) into ELF objects head.o and misc.o.
12. Link together head.o, misc.o and piggy.o into bvmlinux (or vmlinux for zImage, don't
mistake this for /usr/src/linux/vmlinux!). Note the difference between -Ttext 0x1000 used
for vmlinux and -Ttext 0x100000 for bvmlinux, i.e. for bzImage compression loader is
high-loaded.
13. Convert bvmlinux to 'raw binary' bvmlinux.out removing .note and .comment ELF
sections.
14. Go back to arch/i386/boot directory and, using the program tools/build, cat together
bbootsect, bsetup and compressed/bvmlinux.out into bzImage (delete extra 'b' above
for zImage). This writes important variables like setup_sects and root_dev at the end of the
bootsector.
The size of the bootsector is always 512 bytes. The size of the setup must be greater than 4 sectors but is
limited above by about 12K - the rule is:
0x4000 bytes >= 512 + setup_sects * 512 + room for stack while running bootsector/setup
We will see later where this limitation comes from.
The upper limit on the bzImage size produced at this step is about 2.5M for booting with LILO and 0xFFFF
paragraphs (0xFFFF0 = 1048560 bytes) for booting raw image, e.g. from floppy disk or CD-ROM
(El-Torito emulation mode).
Note that while tools/build does validate the size of boot sector, kernel image and lower bound of setup
size, it does not check the *upper* bound of said setup size. Therefore it is easy to build a broken kernel by
just adding some large ".space" at the end of setup.S.
We consider here the Linux bootsector in detail. The first few lines initialise the convenience macros to be
used for segment values:
(the numbers on the left are the line numbers of bootsect.S file) The values of DEF_INITSEG,
DEF_SETUPSEG, DEF_SYSSEG and DEF_SYSSIZE are taken from include/asm/boot.h:
/* Don't touch these, unless you really know what you're doing. */
#define DEF_INITSEG 0x9000
#define DEF_SYSSEG 0x1000
#define DEF_SETUPSEG 0x9020
#define DEF_SYSSIZE 0x7F00
63 movsw
64 ljmp $INITSEG, $go
Lines 54-63 move the bootsector code from address 0x7C00 to 0x90000. This is achieved by:
1. set %ds:%si to $BOOTSEG:0 (0x7C0:0 = 0x7C00)
2. set %es:%di to $INITSEG:0 (0x9000:0 = 0x90000)
3. set the number of 16bit words in %cx (256 words = 512 bytes = 1 sector)
4. clear DF (direction) flag in EFLAGS to auto-increment addresses (cld)
5. go ahead and copy 512 bytes (rep movsw)
The reason this code does not use rep movsd is intentional (hint - .code16).
Line 64 jumps to label go: in the newly made copy of the bootsector, i.e. in segment 0x9000. This and the
following three instructions (lines 64-76) prepare the stack at $INITSEG:0x4000-0xC, i.e. %ss =
$INITSEG (0x9000) and %sp = 0x3FF4 (0x4000-0xC). This is where the limit on setup size comes from
that we mentioned earlier (see Building the Linux Kernel Image).
Lines 77-103 patch the disk parameter table for the first disk to allow multi-sector reads:
The floppy disk controller is reset using BIOS service int 0x13 function 0 (reset FDC) and setup sectors are
loaded immediately after the bootsector, i.e. at physical address 0x90200 ($INITSEG:0x200), again using
BIOS service int 0x13, function 2 (read sector(s)). This happens during lines 107-124:
107 load_setup:
108     xorb    %ah, %ah                # reset FDC
109     xorb    %dl, %dl
110     int     $0x13
111     xorw    %dx, %dx                # drive 0, head 0
112     movb    $0x02, %cl              # sector 2, track 0
113     movw    $0x0200, %bx            # address = 512, in INITSEG
114     movb    $0x02, %ah              # service 2, "read sector(s)"
115     movb    setup_sects, %al        # (assume all on head 0, track 0)
116     int     $0x13                   # read it
117     jnc     ok_load_setup           # ok - continue
124 ok_load_setup:
If loading failed for some reason (bad floppy or someone pulled the diskette out during the operation), we
dump error code and retry in an endless loop. The only way to get out of it is to reboot the machine, unless
retry succeeds but usually it doesn't (if something is wrong it will only get worse).
If loading setup_sects sectors of setup code succeeded we jump to label ok_load_setup:.
Then we proceed to load the compressed kernel image at physical address 0x10000. This is done to
preserve the firmware data areas in low memory (0-64K). After the kernel is loaded, we jump to
$SETUPSEG:0 (arch/i386/boot/setup.S). Once the data is no longer needed (e.g. no more calls to
BIOS) it is overwritten by moving the entire (compressed) kernel image from 0x10000 to 0x1000 (physical
addresses, of course). This is done by setup.S which sets things up for protected mode and jumps to
0x1000 which is the head of the compressed kernel, i.e.
arch/i386/boot/compressed/{head.S,misc.c}. This sets up the stack and calls
decompress_kernel() which uncompresses the kernel to address 0x100000 and jumps to it.
Note that old bootloaders (old versions of LILO) could only load the first 4 sectors of setup, which is why
there is code in setup to load the rest of itself if needed. Also, the code in setup has to take care of various
combinations of loader type/version vs zImage/bzImage and is therefore highly complex.
Let us examine the kludge in the bootsector code that allows a big kernel, also known as "bzImage", to be loaded.
The setup sectors are loaded as usual at 0x90200, but the kernel is loaded 64K chunk at a time using a
special helper routine that calls BIOS to move data from low to high memory. This helper routine is
referred to by bootsect_kludge in bootsect.S and is defined as bootsect_helper in
setup.S. The bootsect_kludge label in setup.S contains the value of setup segment and the offset
of bootsect_helper code in it so that bootsector can use the lcall instruction to jump to it
(inter-segment jump). The reason why it is in setup.S is simply because there is no more space left in
bootsect.S (which is strictly not true - there are approximately 4 spare bytes and at least 1 spare byte in
bootsect.S but that is not enough, obviously). This routine uses BIOS service int 0x15 (ax=0x8700) to
move to high memory and resets %es to always point to 0x10000. This ensures that the code in
bootsect.S doesn't run out of low memory when copying data from disk.
Another thing worth noting is Linux's ability to execute an "alternative init program" by passing
"init=" on the boot command line. This is useful for recovering from an accidentally overwritten /sbin/init or for
debugging the initialisation (rc) scripts and /etc/inittab by hand, executing them one at a time.
These evaluate to gcc attribute specificators (also known as "gcc magic") as defined in
include/linux/init.h:
#ifndef MODULE
#define __init __attribute__ ((__section__ (".text.init")))
#define __initdata __attribute__ ((__section__ (".data.init")))
#else
#define __init
#define __initdata
#endif
What this means is that if the code is compiled statically into the kernel (i.e. MODULE is not defined) then
it is placed in the special ELF section .text.init, which is declared in the linker map in
arch/i386/vmlinux.lds. Otherwise (i.e. if it is a module) the macros evaluate to nothing.
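For example (a minimal sketch with made-up names - my_debug and my_subsys_init() are not real kernel symbols),
boot-time-only code and data of a subsystem would be marked like this:

/* hypothetical example */
static int __initdata my_debug = 1;           /* placed in .data.init when static */

static int __init my_subsys_init(void)        /* placed in .text.init when static */
{
        printk("my_subsys: initialising, debug=%d\n", my_debug);
        return 0;
}

When built statically, both end up in the init sections and the memory is given back by free_initmem()
as described below.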
What happens during boot is that the "init" kernel thread (function init/main.c:init()) calls the
arch-specific function free_initmem() which frees all the pages between addresses __init_begin
and __init_end.
On a typical system (my workstation), this results in freeing about 260K of memory.
The functions registered via module_init() are placed in .initcall.init which is also freed in
the static case. The current trend in Linux, when designing a subsystem (not necessarily a module), is to
provide init/exit entry points from the early stages of design so that in the future, the subsystem in question
can be modularised if needed. An example of this is pipefs, see fs/pipe.c. Even if a given subsystem will
never become a module, e.g. bdflush (see fs/buffer.c), it is still nice and tidy to use the
module_init() macro for its initialisation function, provided it does not matter exactly when the
function is called.
There are two more macros which work in a similar manner, called __exit and __exitdata, but they
are more directly connected to the module support and therefore will be explained in a later section.
5. checksetup() goes through the code in ELF section .setup.init and invokes each function,
passing it the word if it matches. Note that using the return value of 0 from the function registered via
__setup(), it is possible to pass the same "variable=value" to more than one function with "value"
invalid to one and valid to another. Jeff Garzik commented: "hackers who do that get spanked :)"
Why? Because this is clearly ld-order specific, i.e. kernel linked in one order will have functionA
invoked before functionB and another will have it in reversed order, with the result depending on the
order.
So, how do we write code that processes boot commandline? We use the __setup() macro defined in
include/linux/init.h:
/*
* Used for kernel command line parameter setup
*/
struct kernel_param {
const char *str;
int (*setup_func)(char *);
};
#ifndef MODULE
#define __setup(str, fn) \
static char __setup_str_##fn[] __initdata = str; \
static struct kernel_param __setup_##fn __initsetup = \
{ __setup_str_##fn, fn }
#else
#define __setup(str,func) /* nothing */
#endif
So, you would typically use it in your code like this (taken from code of real driver, BusLogic HBA
drivers/scsi/BusLogic.c):
static int __init
BusLogic_Setup(char *str)
{
        int ints[3];

        (void)get_options(str, ARRAY_SIZE(ints), ints);

        if (ints[0] != 0) {
                BusLogic_Error("BusLogic: Obsolete Command Line Entry "
                               "Format Ignored\n", NULL);
                return 0;
        }
        if (str == NULL || *str == '\0')
                return 0;
        return BusLogic_ParseDriverOptions(str);
}

__setup("BusLogic=", BusLogic_Setup);
Note that __setup() does nothing for modules, so the code that wishes to process boot commandline and
can be either a module or statically linked must invoke its parsing function manually in the module
initialisation routine. This also means that it is possible to write code that processes parameters when
compiled as a module but not when it is static or vice versa.
/*
* The default maximum number of threads is set to a safe
* value: the thread structures can take up at most half
* of memory.
*/
max_threads = mempages / (THREAD_SIZE/PAGE_SIZE) / 2;
which, on IA32 architecture, basically means num_physpages/4. As an example, on a 512M machine, you can
create 32k threads. This is a considerable improvement over the 4k-epsilon limit for older (2.2 and earlier) kernels.
Moreover, this can be changed at runtime using the KERN_MAX_THREADS sysctl(2), or simply using procfs
interface to kernel tunables:
# cat /proc/sys/kernel/threads-max
32764
# echo 100000 > /proc/sys/kernel/threads-max
# cat /proc/sys/kernel/threads-max
100000
# gdb -q vmlinux /proc/kcore
Core was generated by `BOOT_IMAGE=240ac18 ro root=306 video=matrox:vesa:0x118'.
#0 0x0 in ?? ()
(gdb) p max_threads
$1 = 100000
The set of processes on the Linux system is represented as a collection of struct task_struct structures which
are linked in two ways:
1. as a hashtable, hashed by pid, and
2. as a circular, doubly-linked list using p->next_task and p->prev_task pointers.
The hashtable is called pidhash[] and is defined in include/linux/sched.h:
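In 2.4 the definition looks roughly like this (quoted from memory, so treat it as a sketch rather than
verbatim source):

/* PID hashing - the table and its hash function (approximate) */
#define PIDHASH_SZ (4096 >> 2)
extern struct task_struct *pidhash[PIDHASH_SZ];

#define pid_hashfn(x)   ((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))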
The tasks are hashed by their pid value and the above hashing function is supposed to distribute the elements
uniformly in their domain (0 to PID_MAX-1). The hashtable is used to quickly find a task by a given pid, using the
find_task_by_pid() inline from include/linux/sched.h:

static inline struct task_struct *find_task_by_pid(int pid)
{
        struct task_struct *p, **htable = &pidhash[pid_hashfn(pid)];

        for (p = *htable; p && p->pid != pid; p = p->pidhash_next)
                ;

        return p;
}
The tasks on each hashlist (i.e. hashed to the same value) are linked by p->pidhash_next/pidhash_pprev,
which are used by hash_pid() and unhash_pid() to insert and remove a given process from the hashtable.
This is done under protection of the read-write spinlock called tasklist_lock, taken for WRITE.
The circular doubly-linked list that uses p->next_task/prev_task is maintained so that one could go through
all tasks on the system easily. This is achieved by the for_each_task() macro from
include/linux/sched.h:
#define for_each_task(p) \
for (p = &init_task ; (p = p->next_task) != &init_task ; )
Users of for_each_task() should take tasklist_lock for READ. Note that for_each_task() is using
init_task to mark the beginning (and end) of the list - this is safe because the idle task (pid 0) never exits.
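As a small illustration (not from the original text), counting all tasks in the system with this macro would
look like this:

struct task_struct *p;
int ntasks = 0;

read_lock(&tasklist_lock);      /* readers take tasklist_lock for READ */
for_each_task(p)
        ntasks++;
read_unlock(&tasklist_lock);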
The modifiers of the process hashtable or/and the process table links, notably fork(), exit() and ptrace(),
must take tasklist_lock for WRITE. What is more interesting is that the writers must also disable interrupts on
the local CPU. The reason for this is not trivial: the send_sigio() function walks the task list and thus takes
tasklist_lock for READ, and it is called from kill_fasync() in interrupt context. This is why writers must
disable interrupts while readers don't need to.
Now that we understand how the task_struct structures are linked together, let us examine the members of
task_struct. They loosely correspond to the members of UNIX 'struct proc' and 'struct user' combined together.
The other versions of UNIX separated the task state information into one part which should be kept memory-resident
at all times (called 'proc structure' which includes process state, scheduling information etc.) and another part which is
only needed when the process is running (called 'u area' which includes file descriptor table, disk quota information
etc.). The only reason for such ugly design was that memory was a very scarce resource. Modern operating systems
(well, only Linux at the moment but others, e.g. FreeBSD seem to improve in this direction towards Linux) do not
need such separation and therefore maintain process state in a kernel memory-resident data structure at all times.
The task_struct structure is declared in include/linux/sched.h and is currently 1680 bytes in size.
The state field is declared as:
#define TASK_RUNNING 0
#define TASK_INTERRUPTIBLE 1
#define TASK_UNINTERRUPTIBLE 2
#define TASK_ZOMBIE 4
#define TASK_STOPPED 8
#define TASK_EXCLUSIVE 32
Why is TASK_EXCLUSIVE defined as 32 and not 16? Because 16 was used up by TASK_SWAPPING and I forgot to
shift TASK_EXCLUSIVE up when I removed all references to TASK_SWAPPING (sometime in 2.3.x).
The volatile in p->state declaration means it can be modified asynchronously (from interrupt handler):
1. TASK_RUNNING: means the task is "supposed to be" on the run queue. The reason it may not yet be on the
runqueue is that marking a task as TASK_RUNNING and placing it on the runqueue is not atomic. You need to
hold the runqueue_lock read-write spinlock for read in order to look at the runqueue. If you do so, you will
then see that every task on the runqueue is in TASK_RUNNING state. However, the converse is not true for the
reason explained above. Similarly, drivers can mark themselves (or rather the process context they run in) as
TASK_INTERRUPTIBLE (or TASK_UNINTERRUPTIBLE) and then call schedule(), which will then
remove it from the runqueue (unless there is a pending signal, in which case it is left on the runqueue).
2. TASK_INTERRUPTIBLE: means the task is sleeping but can be woken up by a signal or by expiry of a timer.
3. TASK_UNINTERRUPTIBLE: same as TASK_INTERRUPTIBLE, except it cannot be woken up.
4. TASK_ZOMBIE: task has terminated but has not had its status collected (wait()-ed for) by the parent
(natural or by adoption).
5. TASK_STOPPED: task was stopped, either due to job control signals or due to ptrace(2).
6. TASK_EXCLUSIVE: this is not a separate state but can be OR-ed to either one of TASK_INTERRUPTIBLE
or TASK_UNINTERRUPTIBLE. This means that when this task is sleeping on a wait queue with many other
tasks, it will be woken up alone instead of causing "thundering herd" problem by waking up all the waiters.
Task flags contain information about the process states which are not mutually exclusive:
#define PF_VFORK        0x00001000      /* wake up parent in mm_release */
#define PF_USEDFPU      0x00100000      /* task used FPU this quantum (SMP) */
There are the following kinds of tasks in a Linux system:
● the idle thread(s),
● kernel threads,
● user tasks.
The idle thread is created at compile time for the first CPU; it is then "manually" created for each CPU by means of
arch-specific fork_by_hand() in arch/i386/kernel/smpboot.c, which unrolls the fork(2) system call by
hand (on some archs). Idle tasks share one init_task structure but have a private TSS structure, in the per-CPU array
init_tss. Idle tasks all have pid = 0 and no other task can share pid, i.e. use CLONE_PID flag to clone(2).
Kernel threads are created using kernel_thread() function which invokes the clone(2) system call in kernel
mode. Kernel threads usually have no user address space, i.e. p->mm = NULL, because they explicitly do
exit_mm(), e.g. via daemonize() function. Kernel threads can always access kernel address space directly. They
are allocated pid numbers in the low range. Running at processor's ring 0 (on x86, that is) implies that the kernel
threads enjoy all I/O privileges and cannot be pre-empted by the scheduler.
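A typical kernel thread in 2.4 style might be started like this (a hedged sketch - my_thread(),
start_my_thread() and the "mythread" name are made up):

static int my_thread(void *unused)
{
        daemonize();                          /* drop user resources, p->mm = NULL */
        strcpy(current->comm, "mythread");    /* name shown by ps(1) */

        for (;;) {
                /* do some periodic work here, then sleep for about a second */
                set_current_state(TASK_INTERRUPTIBLE);
                schedule_timeout(HZ);
        }
        return 0;
}

static void start_my_thread(void)
{
        kernel_thread(my_thread, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGHAND);
}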
User tasks are created by means of clone(2) or fork(2) system calls, both of which internally invoke
kernel/fork.c:do_fork().
Let us understand what happens when a user process makes a fork(2) system call. Although fork(2) is
architecture-dependent due to the different ways of passing user stack and registers, the actual underlying function
do_fork() that does the job is portable and is located at kernel/fork.c.
The following steps are done:
1. Local variable retval is set to -ENOMEM, as this is the value which errno should be set to if fork(2) fails to
allocate a new task structure.
2. If CLONE_PID is set in clone_flags then return an error (-EPERM), unless the caller is the idle thread
(during boot only). So, normal user threads cannot pass CLONE_PID to clone(2) and expect it to succeed. For
fork(2), this is irrelevant as clone_flags is set to SIGCHLD - this is only relevant when do_fork() is
invoked from sys_clone() which passes the clone_flags from the value requested from userspace.
3. current->vfork_sem is initialised (it is later cleared in the child). This is used by sys_vfork()
(vfork(2) system call, corresponds to clone_flags = CLONE_VFORK|CLONE_VM|SIGCHLD) to make
the parent sleep until the child does mm_release(), for example as a result of exec()ing another program
or exit(2)-ing.
4. A new task structure is allocated using arch-dependent alloc_task_struct() macro. On x86 it is just a
gfp at GFP_KERNEL priority. This is the first reason why fork(2) system call may sleep. If this allocation fails,
we return -ENOMEM.
5. All the values from current process' task structure are copied into the new one, using structure assignment *p =
*current. Perhaps this should be replaced by a memset? Later on, the fields that should not be inherited by
the child are set to the correct values.
6. Big kernel lock is taken as the rest of the code would otherwise be non-reentrant.
7. If the parent has user resources (a concept of UID, Linux is flexible enough to make it a question rather than a
fact), then verify if the user exceeded RLIMIT_NPROC soft limit - if so, fail with -EAGAIN, if not, increment
the count of processes by given uid p->user->count.
8. If the system-wide number of tasks exceeds the value of the tunable max_threads, fail with -EAGAIN.
9. If the binary being executed belongs to a modularised execution domain, increment the corresponding module's
reference count.
10. If the binary being executed belongs to a modularised binary format, increment the corresponding module's
reference count.
11. The child is marked as 'has not execed' (p->did_exec = 0)
12. The child is marked as 'not-swappable' (p->swappable = 0)
13. The child is put into 'uninterruptible sleep' state, i.e. p->state = TASK_UNINTERRUPTIBLE (TODO:
why is this done? I think it's not needed - get rid of it, Linus confirms it is not needed)
14. The child's p->flags are set according to the value of clone_flags; for plain fork(2), this will be p->flags
= PF_FORKNOEXEC.
15. The child's pid p->pid is set using the fast algorithm in kernel/fork.c:get_pid() (TODO:
lastpid_lock spinlock can be made redundant since get_pid() is always called under big kernel lock
from do_fork(), also remove flags argument of get_pid(), patch sent to Alan on 20/06/2000 - followup
later).
16. The rest of the code in do_fork() initialises the rest of child's task structure. At the very end, the child's task
structure is hashed into the pidhash hashtable and the child is woken up (TODO: wake_up_process(p)
sets p->state = TASK_RUNNING and adds the process to the runq, therefore we probably didn't need to set
p->state to TASK_RUNNING earlier on in do_fork()). The interesting part is setting
p->exit_signal to clone_flags & CSIGNAL, which for fork(2) means just SIGCHLD and setting
p->pdeath_signal to 0. The pdeath_signal is used when a process 'forgets' the original parent (by
dying) and can be set/get by means of PR_GET/SET_PDEATHSIG commands of prctl(2) system call (You
might argue that the way the value of pdeath_signal is returned via userspace pointer argument in prctl(2)
is a bit silly - mea culpa, after Andries Brouwer updated the manpage it was too late to fix ;)
Thus tasks are created. There are several ways for tasks to terminate:
1. by making exit(2) system call;
2. by being delivered a signal with default disposition to die;
3. by being forced to die under certain exceptions;
4. by calling bdflush(2) with func == 1 (this is Linux-specific, for compatibility with old distributions that still
had the 'update' line in /etc/inittab - nowadays the work of update is done by kernel thread kupdate).
Functions implementing system calls under Linux are prefixed with sys_, but they are usually concerned only with
argument checking or arch-specific ways to pass some information and the actual work is done by do_ functions. So it
is with sys_exit() which calls do_exit() to do the work. Although, other parts of the kernel sometimes invoke
sys_exit() while they should really call do_exit().
The function do_exit() is found in kernel/exit.c. The points to note about do_exit():
● Uses global kernel lock (locks but doesn't unlock).
● On architectures that use lazy FPU switching (ia64, mips, mips64) (TODO: remove 'flags' argument of sparc,
sparc64), do whatever the hardware requires to pass the FPU ownership (if owned by current) to "none".
The fields of the task structure most relevant to the scheduler are:
● p->counter: number of clock ticks left to run in this scheduling slice, decremented by a timer. When this
field becomes lower than or equal to zero, it is reset to 0 and p->need_resched is set. This is also
sometimes called the 'dynamic priority' of a process because it can change by itself.
● p->priority: the process' static priority, only changed through well-known system calls like nice(2),
POSIX.1b sched_setparam(2) or 4.4BSD/SVR4 setpriority(2).
● p->rt_priority: realtime priority
● p->policy: the scheduling policy, specifies which scheduling class the task belongs to. Tasks can change
their scheduling class using the sched_setscheduler(2) system call. The valid values are SCHED_OTHER
(traditional UNIX process), SCHED_FIFO (POSIX.1b FIFO realtime process) and SCHED_RR (POSIX
round-robin realtime process). One can also OR SCHED_YIELD to any of these values to signify that the
process decided to yield the CPU, for example by calling sched_yield(2) system call. A FIFO realtime process
will run until either a) it blocks on I/O, b) it explicitly yields the CPU or c) it is preempted by another realtime
process with a higher p->rt_priority value. SCHED_RR is the same as SCHED_FIFO, except that when
its timeslice expires it goes back to the end of the runqueue.
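From userspace, switching scheduling classes looks like this (a hedged example, not from the original text):

#include <sched.h>
#include <stdio.h>

int main(void)
{
        struct sched_param sp;

        sp.sched_priority = 50;
        /* pid 0 means "the calling process"; requires root privileges */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0) {
                perror("sched_setscheduler");
                return 1;
        }
        printf("now running as a SCHED_FIFO realtime process\n");
        return 0;
}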
The scheduler's algorithm is simple, despite the great apparent complexity of the schedule() function. The
function is complex because it implements three scheduling algorithms in one and also because of the subtle
SMP-specifics.
The apparently 'useless' gotos in schedule() are there for a purpose - to generate the best optimised (for i386)
code. Also, note that scheduler (like most of the kernel) was completely rewritten for 2.4, therefore the discussion
below does not apply to 2.2 or earlier kernels.
Let us look at the function in detail:
1. If current->active_mm == NULL then something is wrong. Current process, even a kernel thread
(current->mm == NULL) must have a valid p->active_mm at all times.
2. If there is something to do on the tq_scheduler task queue, process it now. Task queues provide a kernel
mechanism to schedule execution of functions at a later time. We shall look at it in details elsewhere.
3. Initialise local variables prev and this_cpu to current task and current CPU respectively.
4. Check if schedule() was invoked from interrupt handler (due to a bug) and panic if so.
5. Release the global kernel lock.
6. If there is some work to do via softirq mechanism, do it now.
7. Initialise the local pointer struct schedule_data *sched_data to point to the per-CPU (cacheline-aligned to
prevent cacheline ping-pong) scheduling data area, which contains the TSC value of last_schedule and the
pointer to the last scheduled task structure (TODO: sched_data is used on SMP only but why does
init_idle() initialise it on UP as well?).
8. runqueue_lock spinlock is taken. Note that we use spin_lock_irq() because in schedule() we
guarantee that interrupts are enabled. Therefore, when we unlock runqueue_lock, we can just re-enable
them instead of saving/restoring eflags (spin_lock_irqsave/restore variant).
9. task state machine: if the task is in TASK_RUNNING state, it is left alone; if it is in TASK_INTERRUPTIBLE
state and a signal is pending, it is moved into TASK_RUNNING state. In all other cases, it is deleted from the
runqueue.
10. next (best candidate to be scheduled) is set to the idle task of this cpu. However, the goodness of this candidate
is set to a very low value (-1000), in hope that there is someone better than that.
11. If the prev (current) task is in TASK_RUNNING state, then the current goodness is set to its goodness and it is
marked as a better candidate to be scheduled than the idle task.
12. Now the runqueue is examined and a goodness of each process that can be scheduled on this cpu is compared
with current value; the process with highest goodness wins. Now the concept of "can be scheduled on this cpu"
must be clarified: on UP, every process on the runqueue is eligible to be scheduled; on SMP, only process not
already running on another cpu is eligible to be scheduled on this cpu. The goodness is calculated by a function
called goodness() (a simplified sketch of it is given after this list). It treats realtime processes by making their
goodness very high (1000 + p->rt_priority); being greater than 1000 guarantees that no SCHED_OTHER process
can win, so they only contend with other realtime processes that may have a greater p->rt_priority. The
goodness function returns 0 if the process' time slice (p->counter) is over. For non-realtime processes, the
initial value of goodness is set to p->counter - this way, the process is less likely to get the CPU if it already
had it for a while, i.e. interactive processes are favoured over CPU-bound number crunchers. The arch-specific
constant PROC_CHANGE_PENALTY attempts to implement "cpu affinity" (i.e. give an advantage to a process on the
same CPU). It also gives a slight advantage to processes with mm pointing to the current active_mm or to
processes with no (user) address space, i.e. kernel threads.
13. if the current value of goodness is 0 then the entire list of processes (not just the ones on the runqueue!) is
examined and their dynamic priorities are recalculated using a simple algorithm:
recalculate:
        {
                struct task_struct *p;

                spin_unlock_irq(&runqueue_lock);
                read_lock(&tasklist_lock);
                for_each_task(p)
                        p->counter = (p->counter >> 1) + p->priority;
                read_unlock(&tasklist_lock);
                spin_lock_irq(&runqueue_lock);
        }
Note that we drop the runqueue_lock before we recalculate. The reason is that we go through the entire set
of processes; this can take a long time, during which the schedule() could be called on another CPU and
select a process with goodness good enough for that CPU, whilst we on this CPU were forced to recalculate.
Ok, admittedly this is somewhat inconsistent because while we (on this CPU) are selecting a process with the
best goodness, schedule() running on another CPU could be recalculating dynamic priorities.
14. From this point on it is certain that next points to the task to be scheduled, so we initialise next->has_cpu
to 1 and next->processor to this_cpu. The runqueue_lock can now be unlocked.
15. If we are switching back to the same task (next == prev) then we can simply reacquire the global kernel
lock and return, i.e. skip all the hardware-level (registers, stack etc.) and VM-related (switch page directory,
recalculate active_mm etc.) stuff.
16. The macro switch_to() is architecture specific. On i386, it is concerned with a) FPU handling, b) LDT
handling, c) reloading segment registers, d) TSS handling and e) reloading debug registers.
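As promised in step 12 above, here is a simplified sketch of goodness(), condensed from the description given
there (it is not the verbatim kernel source):

static inline int goodness(struct task_struct *p, int this_cpu,
                           struct mm_struct *this_mm)
{
        int weight;

        /* realtime processes always beat SCHED_OTHER ones */
        if (p->policy != SCHED_OTHER)
                return 1000 + p->rt_priority;

        /* time slice used up - goodness is 0 */
        weight = p->counter;
        if (!weight)
                return 0;

#ifdef CONFIG_SMP
        /* cpu affinity: prefer the cpu the task last ran on */
        if (p->processor == this_cpu)
                weight += PROC_CHANGE_PENALTY;
#endif
        /* small bonus for sharing active_mm, or for kernel threads */
        if (p->mm == this_mm || !p->mm)
                weight += 1;

        return weight;
}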
struct list_head {
struct list_head *next, *prev;
};
#define LIST_HEAD_INIT(name) { &(name), &(name) }

#define LIST_HEAD(name) \
        struct list_head name = LIST_HEAD_INIT(name)

#define INIT_LIST_HEAD(ptr) do { \
        (ptr)->next = (ptr); (ptr)->prev = (ptr); \
} while (0)
The first three macros are for initialising an empty list by pointing both next and prev pointers to itself. It is
obvious from C syntactical restrictions which ones should be used where - for example, LIST_HEAD_INIT() can
be used for structure's element initialisation in declaration, the second can be used for static variable initialising
declarations and the third can be used inside a function.
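For example (a minimal sketch, not from the original text):

struct my_things {
        struct list_head list;
        int count;
};

/* 1. LIST_HEAD_INIT(): initialise a list_head member in a declaration */
static struct my_things things = { list: LIST_HEAD_INIT(things.list), count: 0 };

/* 2. LIST_HEAD(): declare and initialise a stand-alone static list head */
static LIST_HEAD(spare_list);

/* 3. INIT_LIST_HEAD(): run-time initialisation inside a function */
void my_setup(struct list_head *head)
{
        INIT_LIST_HEAD(head);
}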
The macro list_entry() gives access to individual list element, for example (from
fs/file_table.c:fs_may_remount_ro()):
struct super_block {
        ...
        struct list_head s_files;
        ...
} *sb = &some_super_block;

struct file {
        ...
        struct list_head f_list;
        ...
} *file;

struct list_head *p;

for (p = sb->s_files.next; p != &sb->s_files; p = p->next) {
        struct file *file = list_entry(p, struct file, f_list);
        /* do something with 'file' */
}
A good example of the use of list_for_each() macro is in the scheduler where we walk the runqueue looking
for the process with highest goodness:
static LIST_HEAD(runqueue_head);
struct list_head *tmp;
struct task_struct *p;
list_for_each(tmp, &runqueue_head) {
p = list_entry(tmp, struct task_struct, run_list);
if (can_schedule(p)) {
int weight = goodness(p, this_cpu, prev->active_mm);
if (weight > c)
c = weight, next = p;
}
}
Here, p->run_list is declared as struct list_head run_list inside task_struct structure and serves
as anchor to the list. Removing an element from the list and adding (to head or tail of the list) is done by
list_del()/list_add()/list_add_tail() macros. The examples below are adding and removing a task
from runqueue:
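The helpers look roughly like this (reproduced from memory, so treat this as a sketch rather than verbatim
kernel source):

static inline void add_to_runqueue(struct task_struct *p)
{
        list_add(&p->run_list, &runqueue_head);
        nr_running++;
}

static inline void del_from_runqueue(struct task_struct *p)
{
        nr_running--;
        list_del(&p->run_list);
        p->run_list.next = NULL;
}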
static DECLARE_WAIT_QUEUE_HEAD(rtc_wait);
So, the interrupt handler obtains the data by reading from some device-specific I/O port (CMOS_READ() macro turns
into a couple outb/inb) and then wakes up whoever is sleeping on the rtc_wait wait queue.
Now, the read(2) system call could be implemented as:
        add_wait_queue(&rtc_wait, &wait);
        current->state = TASK_INTERRUPTIBLE;
        do {
                spin_lock_irq(&rtc_lock);
                data = rtc_irq_data;
                rtc_irq_data = 0;
                spin_unlock_irq(&rtc_lock);

                if (data != 0)
                        break;

                if (signal_pending(current)) {
                        retval = -ERESTARTSYS;
                        goto out;
                }
                schedule();
        } while (1);

        retval = put_user(data, (unsigned long *)buf);
        if (!retval)
                retval = sizeof(unsigned long);
out:
        current->state = TASK_RUNNING;
        remove_wait_queue(&rtc_wait, &wait);
        return retval;
}
The poll(2)/select(2) method for the same device is even simpler:

static unsigned int rtc_poll(struct file *file, poll_table *wait)
{
        unsigned long l;

        poll_wait(file, &rtc_wait, wait);

        spin_lock_irq(&rtc_lock);
        l = rtc_irq_data;
        spin_unlock_irq(&rtc_lock);

        if (l != 0)
                return POLLIN | POLLRDNORM;
        return 0;
}
All the work is done by the device-independent function poll_wait() which does the necessary waitqueue
manipulations; all we need to do is point it to the waitqueue which is woken up by our device-specific interrupt
handler.
struct timer_list {
struct list_head list;
unsigned long expires;
unsigned long data;
void (*function)(unsigned long);
volatile int running;
};
The list field is for linking into the internal list, protected by the timerlist_lock spinlock. The expires field
is the value of jiffies when the function handler should be invoked with data passed as a parameter. The
running field is used on SMP to test if the timer handler is currently running on another CPU.
The functions add_timer() and del_timer() add and remove a given timer to and from the list. When a timer
expires, it is removed automatically. Before a timer is used, it MUST be initialised by means of the init_timer()
function, and before it is added, the function and expires fields must be set.
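A typical (made-up) example of arming a one-shot timer that fires one second from now:

static struct timer_list my_timer;

static void my_timeout(unsigned long data)
{
        printk("my_timer fired, data=%lu\n", data);
}

static void arm_my_timer(void)
{
        init_timer(&my_timer);              /* MUST be done first */
        my_timer.function = my_timeout;
        my_timer.data     = 123;
        my_timer.expires  = jiffies + HZ;   /* one second from now */
        add_timer(&my_timer);
}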
2.9 Tasklets
Not yet, will be in future revision.
2.10 Softirqs
Not yet, will be in future revision.
Native Linux programs use int 0x80 whilst binaries from foreign flavours of UNIX (Solaris, UnixWare 7 etc.) use the
lcall7 mechanism. The name 'lcall7' is historically misleading because it also covers lcall27 (e.g. Solaris/x86), but the
handler function is called lcall7_func.
When the system boots, the function arch/i386/kernel/traps.c:trap_init() is called which sets up the
IDT so that vector 0x80 (of type 15, dpl 3) points to the address of system_call entry from
arch/i386/kernel/entry.S.
When a userspace application makes a system call, the arguments are passed via registers and the application executes
'int 0x80' instruction. This causes a trap into kernel mode and processor jumps to system_call entry point in
entry.S. What this does is:
1. Save registers.
2. Set %ds and %es to KERNEL_DS, so that all data (and extra segment) references are made in kernel address
space.
3. If the value of %eax is greater than or equal to NR_syscalls (currently 256), fail with ENOSYS error.
4. If the task is being ptraced (tsk->ptrace & PF_TRACESYS), do special processing. This is to support
programs like strace (analogue of SVR4 truss(1)) or debuggers.
5. Call sys_call_table+4*(syscall_number from %eax). This table is initialised in the same file
(arch/i386/kernel/entry.S) to point to individual system call handlers which under Linux are
(usually) prefixed with sys_, e.g. sys_open, sys_exit, etc. These C system call handlers will find their
arguments on the stack where SAVE_ALL stored them.
6. Enter 'system call return path'. This is a separate label because it is used not only by int 0x80 but also by lcall7,
lcall27. This is concerned with handling tasklets (including bottom halves), checking if a schedule() is
needed (tsk->need_resched != 0), checking if there are signals pending and if so handling them.
Linux supports up to 6 arguments for system calls. They are passed in %ebx, %ecx, %edx, %esi, %edi (and %ebp used
temporarily, see _syscall6() in asm-i386/unistd.h). The system call number is passed via %eax.
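To make the register convention concrete, here is a hedged userspace illustration (not from the original text)
that invokes getpid (syscall number 20 on Linux/x86) directly via int 0x80:

#include <stdio.h>

int main(void)
{
        long pid;

        __asm__ volatile ("int $0x80"
                          : "=a" (pid)   /* return value comes back in %eax */
                          : "a" (20));   /* syscall number (getpid) in %eax */
        printf("pid = %ld\n", pid);
        return 0;
}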
/*
* Bits in microcode_status. (31 bits of room for future expansion)
*/
#define MICROCODE_IS_OPEN 0     /* set if device is in use */
/*
* We enforce only one user at a time here with open/close.
*/
static int microcode_open(struct inode *inode, struct file *file)
{
        if (!capable(CAP_SYS_RAWIO))
                return -EPERM;

        /* one at a time, please */
        if (test_and_set_bit(MICROCODE_IS_OPEN, &microcode_status))
                return -EBUSY;

        MOD_INC_USE_COUNT;
        return 0;
}
● void set_bit(int nr, volatile void *addr): set bit nr in the bitmap pointed to by addr.
● void clear_bit(int nr, volatile void *addr): clear bit nr in the bitmap pointed to by addr.
● void change_bit(int nr, volatile void *addr): toggle bit nr (if set clear, if clear set) in the bitmap pointed to by
addr.
● int test_and_set_bit(int nr, volatile void *addr): atomically set bit nr and return the old bit value.
● int test_and_clear_bit(int nr, volatile void *addr): atomically clear bit nr and return the old bit value.
● int test_and_change_bit(int nr, volatile void *addr): atomically toggle bit nr and return the old bit value.
These operations use the LOCK_PREFIX macro, which on SMP kernels evaluates to bus lock instruction prefix and to
nothing on UP. This guarantees atomicity of access in SMP environment.
Sometimes bit manipulations are not convenient and instead we need to perform arithmetic operations - add, subtract,
increment, decrement. The typical cases are reference counts (e.g. for inodes). This facility is provided by the
atomic_t data type and the following operations:
● atomic_read(&v): read the value of atomic_t variable v.
● void atomic_add(int i, volatile atomic_t *v): add integer i to the value of atomic variable pointed to by v.
● void atomic_sub(int i, volatile atomic_t *v): subtract integer i from the value of atomic variable pointed to by
v.
● int atomic_sub_and_test(int i, volatile atomic_t *v): subtract integer i from the value of atomic variable
pointed to by v; return 1 if the new value is 0, return 0 otherwise.
● void atomic_inc(volatile atomic_t *v): increment the value by 1.
● int atomic_dec_and_test(volatile atomic_t *v): decrement the value; return 1 if the new value is 0, return 0
otherwise.
● int atomic_inc_and_test(volatile atomic_t *v): increment the value; return 1 if the new value is 0, return 0
otherwise.
● int atomic_add_negative(int i, volatile atomic_t *v): add the value of i to v and return 1 if the result is
negative. Return 0 if the result is greater than or equal to 0. This operation is used for implementing
semaphores.
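A typical (made-up) example is a reference count on some object:

static atomic_t my_refcount = ATOMIC_INIT(1);

static void my_get(void)
{
        atomic_inc(&my_refcount);
}

static void my_put(void)
{
        if (atomic_dec_and_test(&my_refcount))
                printk("last reference dropped - free the object here\n");
}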
save_flags(flags);
cli();
/* critical code */
restore_flags(flags);
While this is ok on UP, it obviously is of no use on SMP because the same code sequence may be executed
simultaneously on another cpu, and while cli() provides protection against races with interrupt context on each
CPU individually, it provides no protection at all against races between contexts running on different CPUs. This is
where spinlocks come in.
There are three types of spinlocks: vanilla (basic), read-write and big-reader spinlocks. Read-write spinlocks should be
used when there is a natural tendency of 'many readers and few writers'. Example of this is access to the list of
registered filesystems (see fs/super.c). The list is guarded by the file_systems_lock read-write spinlock
because one needs exclusive access only when registering/unregistering a filesystem, but any process can read the file
/proc/filesystems or use the sysfs(2) system call to force a read-only scan of the file_systems list. This makes
it sensible to use read-write spinlocks. With read-write spinlocks, one can have multiple readers at a time but only one
writer and there can be no readers while there is a writer. Btw, it would be nice if new readers would not get a lock
while there is a writer trying to get a lock, i.e. if Linux could correctly deal with the issue of potential writer starvation
by multiple readers. This would mean that readers must be blocked while there is a writer attempting to get the lock.
This is not currently the case and it is not obvious whether this should be fixed - the argument to the contrary is -
readers usually take the lock for a very short time so should they really be starved while the writer takes the lock for
potentially longer periods?
Big-reader spinlocks are a form of read-write spinlocks heavily optimised for very light read access, with a penalty for
writes. There is a limited number of big-reader spinlocks - currently only two exist, of which one is used only on
sparc64 (global irq) and the other is used for networking. In all other cases where the access pattern does not fit into
any of these two scenarios, one should use basic spinlocks. You cannot block while holding any kind of spinlock.
Spinlocks come in three flavours: plain, _irq() and _bh().
1. Plain spin_lock()/spin_unlock(): if you know the interrupts are always disabled or if you do not race
with interrupt context (e.g. from within interrupt handler), then you can use this one. It does not touch interrupt
state on the current CPU.
2. spin_lock_irq()/spin_unlock_irq(): if you know that interrupts are always enabled then you can
use this version, which simply disables (on lock) and re-enables (on unlock) interrupts on the current CPU. For
example, rtc_read() uses spin_lock_irq(&rtc_lock) (interrupts are always enabled inside
read()) whilst rtc_interrupt() uses spin_lock(&rtc_lock) (interrupts are always disabled
inside interrupt handler). Note that rtc_read() uses spin_lock_irq() and not the more generic
spin_lock_irqsave() because on entry to any system call interrupts are always enabled.
3. spin_lock_irqsave()/spin_unlock_irqrestore(): the strongest form, to be used when the
interrupt state is not known, but only if interrupts matter at all, i.e. there is no point in using it if our interrupt
handlers don't execute any critical code.
The reason you cannot use plain spin_lock() if you race against interrupt handlers is because if you take it and
then an interrupt comes in on the same CPU, it will busy wait for the lock forever: the lock holder, having been
interrupted, will not continue until the interrupt handler returns.
The most common usage of a spinlock is to access a data structure shared between user process context and interrupt
handlers:
static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

my_ioctl()
{
        spin_lock_irq(&my_lock);
        /* critical section */
        spin_unlock_irq(&my_lock);
}

my_irq_handler()
{
        spin_lock(&my_lock);
        /* critical section */
        spin_unlock(&my_lock);
}
Read-write semaphores differ from basic semaphores in the same way as read-write spinlocks differ from basic
spinlocks: one can have multiple readers at a time but only one writer and there can be no readers while there are
writers - i.e. the writer blocks all readers and new readers block while a writer is waiting.
Also, basic semaphores can be interruptible - just use the operations down/up_interruptible() instead of the
plain down()/up() and check the value returned from down_interruptible(): it will be non zero if the
operation was interrupted.
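A minimal (made-up) example of a mutex-style semaphore protecting a section that may sleep:

static DECLARE_MUTEX(my_sem);

int my_operation(void)
{
        if (down_interruptible(&my_sem))
                return -ERESTARTSYS;    /* interrupted by a signal */

        /* critical section - may block, e.g. kmalloc(..., GFP_KERNEL) */

        up(&my_sem);
        return 0;
}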
Using semaphores for mutual exclusion is ideal in situations where a critical code section may call by reference
unknown functions registered by other subsystems/modules, i.e. the caller cannot know apriori whether the function
blocks or not.
A simple example of semaphore usage is in kernel/sys.c, implementation of gethostname(2)/sethostname(2)
system calls.
asmlinkage long sys_sethostname(char *name, int len)
{
        int errno;

        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;
        if (len < 0 || len > __NEW_UTS_LEN)
                return -EINVAL;
        down_write(&uts_sem);
        errno = -EFAULT;
        if (!copy_from_user(system_utsname.nodename, name, len)) {
                system_utsname.nodename[len] = 0;
                errno = 0;
        }
        up_write(&uts_sem);
        return errno;
}
asmlinkage long sys_gethostname(char *name, int len)
{
        int i, errno;

        if (len < 0)
                return -EINVAL;
        down_read(&uts_sem);
        i = 1 + strlen(system_utsname.nodename);
        if (i > len)
                i = len;
        errno = 0;
        if (copy_to_user(name, system_utsname.nodename, i))
                errno = -EFAULT;
        up_read(&uts_sem);
        return errno;
}
1. caddr_t create_module(const char *name, size_t size): allocates kernel memory to hold the module and
returns the address at which it will reside. Only a process with CAP_SYS_MODULE can invoke this system call, others
will get EPERM returned.
2. long init_module(const char *name, struct module *image): loads the relocated module
image and causes the module's initialisation routine to be invoked. Only a process with CAP_SYS_MODULE can
invoke this system call, others will get EPERM returned.
3. long delete_module(const char *name): attempts to unload the module. If name == NULL,
attempt is made to unload all unused modules.
4. long query_module(const char *name, int which, void *buf, size_t bufsize,
size_t *ret): returns information about a module (or about all modules).
The command interface available to users consists of:
● insmod: insert a single module.
● modprobe: insert a module together with any other modules it depends on.
● rmmod: remove a module (provided its use count is zero).
● modinfo: print some information about a module, e.g. author, description, parameters the module accepts, etc.
Apart from being able to load a module manually using either insmod or modprobe, it is also possible to have the
module inserted automatically by the kernel when a particular functionality is required. The kernel interface for this is
the function called request_module(name) which is exported to modules, so that modules can load other
modules as well. The request_module(name) internally creates a kernel thread which execs the userspace
command modprobe -s -k module_name, using the standard exec_usermodehelper() kernel interface (which
is also exported to modules). The function returns 0 on success, however it is usually not worth checking the return
code from request_module(). Instead, the programming idiom is:
if (check_some_feature() == NULL)
request_module(module);
if (check_some_feature() == NULL)
return -ENODEV;
static struct file_system_type *get_fs_type(const char *name)
{
        struct file_system_type *fs;

        read_lock(&file_systems_lock);
        fs = *(find_filesystem(name));
        if (fs && !try_inc_mod_count(fs->owner))
                fs = NULL;
        read_unlock(&file_systems_lock);
        return fs;
}
We can examine one of the lists from gdb running on a live kernel thus:
Note that we subtracted 8 from the address 0xdfb5a2e8 to obtain the address of the struct inode
(0xdfb5a2e0), according to the definition of the list_entry() macro from include/linux/list.h.
To understand how inode cache works, let us trace a lifetime of an inode of a regular file on ext2 filesystem
as it is opened and closed:
fd = open("file", O_RDONLY);
close(fd);
The open(2) system call is implemented in fs/open.c:sys_open function and the real work is done
by fs/open.c:filp_open() function, which is split into two parts:
1. open_namei(): fills in the nameidata structure containing the dentry and vfsmount structures.
2. dentry_open(): given a dentry and vfsmount, this function allocates a new struct file and
links them together; it also invokes the filesystem specific f_op->open() method which was set
in inode->i_fop when inode was read in open_namei() (which provided inode via
dentry->d_inode).
The open_namei() function interacts with dentry cache via path_walk(), which in turn calls
real_lookup(), which invokes the filesystem specific inode_operations->lookup() method.
The role of this method is to find the entry in the parent directory with the matching name and then do
iget(sb, ino) to get the corresponding inode - which brings us to the inode cache. When the inode is
read in, the dentry is instantiated by means of d_add(dentry, inode). While we are at it, note that
for UNIX-style filesystems which have the concept of on-disk inode number, it is the lookup method's job
to map its endianness to current CPU format, e.g. if the inode number in raw (fs-specific) dir entry is in
little-endian 32 bit format one could do:
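e.g. something like this (a one-line sketch, where de is assumed to point at the raw on-disk directory entry):

unsigned long ino = le32_to_cpu(de->inode);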
So, when we open a file we hit iget(sb, ino) which is really iget4(sb, ino, NULL, NULL),
which does:
1. Attempt to find an inode with matching superblock and inode number in the hashtable under
protection of inode_lock. If inode is found, its reference count (i_count) is incremented; if it
was 0 prior to incrementation and the inode is not dirty, it is removed from whatever type list
(inode->i_list) it is currently on (it has to be inode_unused list, of course) and inserted into
inode_in_use type list; finally, inodes_stat.nr_unused is decremented.
2. If inode is currently locked, we wait until it is unlocked so that iget4() is guaranteed to return an
unlocked inode.
3. If inode was not found in the hashtable then it is the first time we encounter this inode, so we call
get_new_inode(), passing it the pointer to the place in the hashtable where it should be inserted
to.
4. get_new_inode() allocates a new inode from the inode_cachep SLAB cache but this
operation can block (GFP_KERNEL allocation), so it must drop the inode_lock spinlock which
guards the hashtable. Since it has dropped the spinlock, it must retry searching the inode in the
hashtable afterwards; if it is found this time, it returns (after incrementing the reference by __iget)
the one found in the hashtable and destroys the newly allocated one. If it is still not found in the
hashtable, then the new inode we have just allocated is the one to be used; therefore it is initialised to
the required values and the fs-specific sb->s_op->read_inode() method is invoked to
populate the rest of the inode. This brings us from inode cache back to the filesystem code -
remember that we came to the inode cache when filesystem-specific lookup() method invoked
iget(). While the s_op->read_inode() method is reading the inode from disk, the inode is
locked (i_state = I_LOCK); it is unlocked after the read_inode() method returns and all the
waiters for it are woken up.
Now, let's see what happens when we close this file descriptor. The close(2) system call is implemented in the
fs/open.c:sys_close() function, which calls do_close(fd, 1), which rips the descriptor out of the
process' file descriptor table (replacing it with NULL) and invokes the filp_close() function which
does most of the work. The interesting things happen in fput(), which checks if this was the last
reference to the file, and if so calls fs/file_table.c:_fput() which calls __fput() which is
where interaction with dcache (and therefore with inode cache - remember dcache is a Master of inode
cache!) happens. The fs/dcache.c:dput() does dentry_iput() which brings us back to inode
cache via iput(inode) so let us understand fs/inode.c:iput(inode):
1. If parameter passed to us is NULL, we do absolutely nothing and return.
2. if there is a fs-specific sb->s_op->put_inode() method, it is invoked immediately with no
spinlocks held (so it can block).
3. inode_lock spinlock is taken and i_count is decremented. If this was NOT the last reference to
this inode then we simply check if there are too many references to it and so i_count can wrap
around the 32 bits allocated to it and if so we print a warning and return. Note that we call
printk() while holding the inode_lock spinlock - this is fine because printk() can never
block, therefore it may be called in absolutely any context (even from interrupt handlers!).
4. If this was the last active reference then some work needs to be done.
The work performed by iput() on the last inode reference is rather complex so we separate it into a list of
its own:
1. If i_nlink == 0 (e.g. the file was unlinked while we held it open) then the inode is removed from
hashtable and from its type list; if there are any data pages held in page cache for this inode, they are
removed by means of truncate_all_inode_pages(&inode->i_data). Then the
filesystem-specific s_op->delete_inode() method is invoked, which typically deletes the
on-disk copy of the inode. If there is no s_op->delete_inode() method registered by the
filesystem (e.g. ramfs) then we call clear_inode(inode), which invokes
s_op->clear_inode() if registered and if inode corresponds to a block device, this device's
reference count is dropped by bdput(inode->i_bdev).
2. if i_nlink != 0 then we check if there are other inodes in the same hash bucket and if there is
none, then if inode is not dirty we delete it from its type list and add it to inode_unused list,
incrementing inodes_stat.nr_unused. If there are inodes in the same hashbucket then we
delete it from the type list and add to inode_unused list. If this was an anonymous inode (NetApp
.snapshot) then we delete it from the type list and clear/destroy it completely.
#include <linux/module.h>
#include <linux/init.h>
module_init(init_bfs_fs)
module_exit(exit_bfs_fs)
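For context, the part of fs/bfs/inode.c these macros refer to looks roughly like this (quoted from memory;
treat it as a sketch):

static DECLARE_FSTYPE_DEV(bfs_fs_type, "bfs", bfs_read_super);

static int __init init_bfs_fs(void)
{
        return register_filesystem(&bfs_fs_type);
}

static void __exit exit_bfs_fs(void)
{
        unregister_filesystem(&bfs_fs_type);
}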
The module_init()/module_exit() macros ensure that, when BFS is compiled as a module, the
functions init_bfs_fs() and exit_bfs_fs() turn into init_module() and
cleanup_module() respectively; if BFS is statically linked into the kernel, the exit_bfs_fs() code
vanishes as it is unnecessary.
The struct file_system_type is declared in include/linux/fs.h:
struct file_system_type {
const char *name;
int fs_flags;
struct super_block *(*read_super) (struct super_block *,
void *, int);
struct module *owner;
struct vfsmount *kern_mnt; /* For kernel mount, if it's
FS_SINGLE fs */
struct file_system_type * next;
};
FS_NOMOUNT for filesystems that cannot be mounted from userspace by means of mount(2) system
call: they can however be mounted internally using kern_mount() interface, e.g. pipefs.
● read_super: a pointer to the function that reads the super block during mount operation. This
function is required: if it is not provided, mount operation (whether from userspace or inkernel) will
always fail except in FS_SINGLE case where it will Oops in get_sb_single(), trying to
dereference a NULL pointer in fs_type->kern_mnt->mnt_sb with (fs_type->kern_mnt
= NULL).
● owner: pointer to the module that implements this filesystem. If the filesystem is statically linked
into the kernel then this is NULL. You don't need to set this manually as the macro THIS_MODULE
does the right thing automatically.
● kern_mnt: for FS_SINGLE filesystems only. This is set by kern_mount() (TODO:
kern_mount() should refuse to mount filesystems if FS_SINGLE is not set).
● next: linkage into singly-linked list headed by file_systems (see fs/super.c). The list is
protected by the file_systems_lock read-write spinlock and functions
register/unregister_filesystem() modify it by linking and unlinking the entry from the
list.
The job of the read_super() function is to fill in the fields of the superblock, allocate root inode and
initialise any fs-private information associated with this mounted instance of the filesystem. So, typically
the read_super() would do:
1. Read the superblock from the device specified via the sb->s_dev argument, using the buffer cache
bread() function. If it anticipates reading a few more subsequent metadata blocks immediately then
it makes sense to use breada() to schedule reading the extra blocks asynchronously.
2. Verify that superblock contains the valid magic number and overall "looks" sane.
3. Initialise sb->s_op to point to struct super_block_operations structure. This structure
contains filesystem-specific functions implementing operations like "read inode", "delete inode", etc.
4. Allocate root inode and root dentry using d_alloc_root().
5. If the filesystem is not mounted read-only then set sb->s_dirt to 1 and mark the buffer containing
superblock dirty (TODO: why do we do this? I did it in BFS because MINIX did it...)
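Putting these steps together, a minimal read_super() might look roughly like this (everything prefixed
myfs_/MYFS_ is hypothetical and error handling is simplified):

#define MYFS_MAGIC      0x12345678      /* made-up magic number */
#define MYFS_ROOT_INO   2               /* made-up root inode number */

static struct super_operations myfs_sops;       /* read_inode, put_inode, ... */

static struct super_block *myfs_read_super(struct super_block *sb,
                                           void *data, int silent)
{
        struct buffer_head *bh;
        struct inode *root;

        set_blocksize(sb->s_dev, BLOCK_SIZE);
        sb->s_blocksize = BLOCK_SIZE;

        bh = bread(sb->s_dev, 0, BLOCK_SIZE);           /* step 1 */
        if (!bh)
                return NULL;
        if (*(u32 *)bh->b_data != MYFS_MAGIC) {         /* step 2 */
                brelse(bh);
                return NULL;
        }
        sb->s_magic = MYFS_MAGIC;
        sb->s_op = &myfs_sops;                          /* step 3 */

        root = iget(sb, MYFS_ROOT_INO);                 /* step 4 */
        sb->s_root = d_alloc_root(root);
        if (!sb->s_root) {
                iput(root);
                brelse(bh);
                return NULL;
        }
        brelse(bh);
        return sb;
}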
/*
* Open file table structure
*/
struct files_struct {
atomic_t count;
rwlock_t file_lock;
int max_fds;
int max_fdset;
int next_fd;
struct file ** fd; /* current fd array */
fd_set *close_on_exec;
fd_set *open_fds;
fd_set close_on_exec_init;
fd_set open_fds_init;
struct file * fd_array[NR_OPEN_DEFAULT];
};
The file->f_count field is a reference count, incremented by get_file() (usually called by fget()) and
decremented by fput() and by put_filp(). The difference between fput() and put_filp() is
that fput() does more work usually needed for regular files, such as releasing flock locks, releasing
dentry, etc, while put_filp() is only manipulating file table structures, i.e. decrements the count,
removes the file from the anon_list and adds it to the free_list, under protection of files_lock
spinlock.
The tsk->files can be shared between parent and child if the child thread was created using the clone()
system call with CLONE_FILES set in the clone flags argument. This can be seen in
kernel/fork.c:copy_files() (called by do_fork()), which only increments
files->count if CLONE_FILES is set, instead of copying the file descriptor table in the time-honoured
tradition of classical UNIX fork(2).
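Schematically, the relevant logic in kernel/fork.c:copy_files() is (a simplified sketch, with the
slow copy path elided):

static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
{
	struct files_struct *oldf = current->files;

	if (!oldf)
		return 0;		/* kernel threads may have no files at all */

	if (clone_flags & CLONE_FILES) {
		/* share the table: tsk->files already points at oldf because
		 * the new task_struct started life as a copy of current */
		atomic_inc(&oldf->count);
		return 0;
	}

	/* ... otherwise allocate a fresh files_struct, duplicate the fd array
	 * and the open_fds/close_on_exec bitmaps, and point tsk->files at
	 * the copy (elided here) ... */
	return 0;
}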
When a file is opened, the file structure allocated for it is installed into the current->files->fd[fd] slot
and the fd bit is set in the bitmap current->files->open_fds. All this is done under the write
protection of the current->files->file_lock read-write spinlock. When the descriptor is closed, the fd
bit is cleared in current->files->open_fds and current->files->next_fd is set to fd (if fd
is below the current hint) so that the search for an unused descriptor starts there the next time this process
opens a file.
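A condensed sketch of this bookkeeping, modelled loosely on fs/open.c (table expansion and error
handling omitted; the two wrapper functions are invented for illustration, they are not the real kernel
helpers):

/* open path: pick a descriptor, mark it used, install the file */
static int sketch_install_file(struct files_struct *files, struct file *file)
{
	int fd;

	write_lock(&files->file_lock);
	fd = find_next_zero_bit(files->open_fds->fds_bits,
				files->max_fdset, files->next_fd);
	FD_SET(fd, files->open_fds);
	FD_CLR(fd, files->close_on_exec);
	files->next_fd = fd + 1;
	files->fd[fd] = file;		/* this assignment is what fd_install() does */
	write_unlock(&files->file_lock);
	return fd;
}

/* close path: clear the bit and remember the hole as a hint */
static void sketch_release_fd(struct files_struct *files, int fd)
{
	write_lock(&files->file_lock);
	files->fd[fd] = NULL;
	FD_CLR(fd, files->open_fds);
	if (fd < files->next_fd)
		files->next_fd = fd;	/* next open(2) starts searching here */
	write_unlock(&files->file_lock);
}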
struct fown_struct {
	int pid;		/* pid or -pgrp where SIGIO should be sent */
	uid_t uid, euid;	/* uid/euid of process setting the owner */
	int signum;		/* posix.1b rt signal to be delivered on IO */
};
struct file {
	struct list_head	f_list;
	struct dentry		*f_dentry;
	struct vfsmount		*f_vfsmnt;
	struct file_operations	*f_op;
	atomic_t		f_count;
	unsigned int		f_flags;
	mode_t			f_mode;
	loff_t			f_pos;
	unsigned long		f_reada, f_ramax, f_raend, f_ralen, f_rawin;
	struct fown_struct	f_owner;
	unsigned int		f_uid, f_gid;
	int			f_error;
	unsigned long		f_version;
	void			*private_data;
};
7. f_mode: a combination of userspace flags and mode, set by dentry_open(). The point of the
conversion is to store read and write access in separate bits so one could do easy checks like
(f_mode & FMODE_WRITE) and (f_mode & FMODE_READ).
8. f_pos: a current file position for next read or write to the file. Under i386 it is of type long long,
i.e. a 64bit value.
9. f_reada, f_ramax, f_raend, f_ralen, f_rawin: to support readahead - too complex to be discussed
by mortals ;)
10. f_owner: owner of file I/O to receive asynchronous I/O notifications via SIGIO mechanism (see
fs/fcntl.c:kill_fasync()).
11. f_uid, f_gid: set to the user id and group id of the process that opened the file, when the file structure is
created in get_empty_filp(). If the file is a socket, these are used by ipv4 netfilter.
12. f_error: used by NFS client to return write errors. It is set in fs/nfs/file.c and checked in
mm/filemap.c:generic_file_write().
13. f_version - versioning mechanism for invalidating caches, incremented (using global event)
whenever f_pos changes.
14. private_data: private per-file data which can be used by filesystems (e.g. coda stores credentials
here) or by device drivers. Device drivers (in the presence of devfs) could use this field to
differentiate between multiple instances instead of the classical minor number encoded in
file->f_dentry->d_inode->i_rdev.
Now let us look at file_operations structure which contains the methods that can be invoked on files.
Let us recall that it is copied from inode->i_fop where it is set by s_op->read_inode() method.
It is declared in include/linux/fs.h:
struct file_operations {
	struct module *owner;
	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
	int (*readdir) (struct file *, void *, filldir_t);
	unsigned int (*poll) (struct file *, struct poll_table_struct *);
	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
	int (*mmap) (struct file *, struct vm_area_struct *);
	int (*open) (struct inode *, struct file *);
	int (*flush) (struct file *);
	int (*release) (struct inode *, struct file *);
	int (*fsync) (struct file *, struct dentry *, int datasync);
	int (*fasync) (int, struct file *, int);
	int (*lock) (struct file *, int, struct file_lock *);
	ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
	ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
};
1. owner: a pointer to the module that owns the subsystem in question. Only drivers need to set it to
THIS_MODULE, filesystems can happily ignore it because their module counts are controlled at
mount/umount time whilst the drivers need to control it at open/release time.
2. llseek: implements the lseek(2) system call. Usually it is omitted and
fs/read_write.c:default_llseek() is used, which does the right thing (TODO: force all
those who set it to NULL currently to use default_llseek - that way we save an if() in llseek())
3. read: implements read(2) system call. Filesystems can use
mm/filemap.c:generic_file_read() for regular files and
fs/read_write.c:generic_read_dir() (which simply returns -EISDIR) for directories
here.
4. write: implements write(2) system call. Filesystems can use
mm/filemap.c:generic_file_write() for regular files and ignore it for directories here.
5. readdir: used by filesystems. Ignored for regular files and implements readdir(2) and getdents(2)
system calls for directories.
6. poll: implements poll(2) and select(2) system calls.
7. ioctl: implements driver or filesystem-specific ioctls. Note that generic file ioctls like FIBMAP,
FIGETBSZ, FIONREAD are implemented by higher levels, so they never reach the f_op->ioctl()
method.
8. mmap: implements the mmap(2) system call. Filesystems can use generic_file_mmap here for
regular files and ignore it on directories.
9. open: called at open(2) time by dentry_open(). Filesystems rarely use this, e.g. coda tries to
cache the file locally at open time.
10. flush: called at each close(2) of this file, not necessarily the last one (see the release() method
below). The only filesystem that uses this is the NFS client, which flushes all dirty pages. Note that this can
return an error, which will be passed back to the userspace process that made the close(2) system call.
11. release: called at the last close(2) of this file, i.e. when file->f_count reaches 0. Although
defined as returning int, the return value is ignored by VFS (see fs/file_table.c:__fput()).
12. fsync: maps directly to fsync(2)/fdatasync(2) system calls, with the last argument specifying
whether it is fsync or fdatasync. Almost no work is done by VFS around this, except to map file
descriptor to a file structure (file = fget(fd)) and down/up inode->i_sem semaphore.
Ext2 filesystem currently ignores the last argument and does exactly the same for fsync(2) and
fdatasync(2).
13. fasync: this method is called when file->f_flags & FASYNC changes.
14. lock: the filesystem-specific portion of the POSIX fcntl(2) file region locking mechanism. The only
bug here is that, because it is called before the fs-independent portion (posix_lock_file()), if it
succeeds but the standard POSIX lock code then fails, the lock will never be released at the fs-dependent
level.
15. readv: implements readv(2) system call.
16. writev: implements writev(2) system call.
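To illustrate how little a simple filesystem has to supply, a minimal file_operations table for regular
files and directories could (hypothetically) be built almost entirely from the generic helpers mentioned
above, using the GNU C "label:" initialiser style common in the 2.4 tree (myfs_sync_file and
myfs_readdir are invented names):

static struct file_operations myfs_file_operations = {
	read:	generic_file_read,	/* data I/O through the page cache */
	write:	generic_file_write,
	mmap:	generic_file_mmap,
	fsync:	myfs_sync_file,		/* hypothetical fs-specific fsync */
};

static struct file_operations myfs_dir_operations = {
	read:	 generic_read_dir,	/* read(2) on a directory: -EISDIR */
	readdir: myfs_readdir,		/* hypothetical */
	fsync:	 myfs_sync_file,
};

Note that llseek is omitted in both tables, so fs/read_write.c:default_llseek() is used, exactly as
described in item 2 above.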
struct super_block {
	struct list_head	s_list;		/* Keep this first */
	kdev_t			s_dev;
	unsigned long		s_blocksize;
	unsigned char		s_blocksize_bits;
	unsigned char		s_lock;
	unsigned char		s_dirt;
	struct file_system_type	*s_type;
	struct super_operations	*s_op;
	struct dquot_operations	*dq_op;
	unsigned long		s_flags;
	unsigned long		s_magic;
	struct dentry		*s_root;
	wait_queue_head_t	s_wait;
	..... s_dirty, s_files, s_bdev, s_mounts, s_dquot (described below) ...
	union {
		struct minix_sb_info	minix_sb;
		struct ext2_sb_info	ext2_sb;
		..... all filesystems that need sb-private info ...
		void			*generic_sbp;
	} u;
	/*
	 * The next field is for VFS *only*. No filesystems have any business
	 * even looking at it. You have been warned.
	 */
	struct semaphore	s_vfs_rename_sem;	/* Kludge */
};
spell "root" other than "/" and so use more generic d_alloc() function to bind the dentry to a
name, e.g. pipefs mounts itself on "pipe:" as its own root instead of "/".
12. s_wait: waitqueue of processes waiting for superblock to be unlocked.
13. s_dirty: a list of all dirty inodes. Recall that if inode is dirty (inode->i_state & I_DIRTY)
then it is on superblock-specific dirty list linked via inode->i_list.
14. s_files: a list of all open files on this superblock. Useful for deciding whether filesystem can be
remounted read-only, see fs/file_table.c:fs_may_remount_ro() which goes through
sb->s_files list and denies remounting if there are files opened for write (file->f_mode &
FMODE_WRITE) or files with pending unlink (inode->i_nlink == 0).
15. s_bdev: for FS_REQUIRES_DEV, this points to the block_device structure describing the device the
filesystem is mounted on.
16. s_mounts: a list of all vfsmount structures, one for each mounted instance of this superblock.
17. s_dquot: more diskquota stuff.
The superblock operations are described in the super_operations structure declared in
include/linux/fs.h:
struct super_operations {
void (*read_inode) (struct inode *);
void (*write_inode) (struct inode *, int);
void (*put_inode) (struct inode *);
void (*delete_inode) (struct inode *);
void (*put_super) (struct super_block *);
void (*write_super) (struct super_block *);
int (*statfs) (struct super_block *, struct statfs *);
int (*remount_fs) (struct super_block *, int *, char *);
void (*clear_inode) (struct inode *);
void (*umount_begin) (struct super_block *);
};
1. read_inode: reads the inode from the filesystem. It is only called from
fs/inode.c:get_new_inode() from iget4() (and therefore iget()). If a filesystem
wants to use iget() then read_inode() must be implemented - otherwise
get_new_inode() will panic. While inode is being read it is locked (inode->i_state =
I_LOCK). When the function returns, all waiters on inode->i_wait are woken up. The job of the
filesystem's read_inode() method is to locate the disk block which contains the inode to be read
and use buffer cache bread() function to read it in and initialise the various fields of inode
structure, for example the inode->i_op and inode->i_fop so that VFS level knows what
operations can be performed on the inode or corresponding file. Filesystems that don't implement
read_inode() are ramfs and pipefs. For example, ramfs has its own inode-generating function
ramfs_get_inode() with all the inode operations calling it as needed.
2. write_inode: write the inode back to disk. Similar to read_inode() in that it needs to locate the
relevant block on disk and interact with the buffer cache by calling mark_buffer_dirty(bh). This
method is called on dirty inodes (those marked dirty with mark_inode_dirty()) when the inode needs
to be synced, either individually or as part of syncing the entire filesystem.
module_init(init_pipe_fs)
module_exit(exit_pipe_fs)
The filesystem is of type FS_NOMOUNT|FS_SINGLE, which means it cannot be mounted from userspace
and can only have one superblock system-wide. The FS_SINGLE flag also means that it must be mounted
via kern_mount() after it is successfully registered via register_filesystem(), which is exactly
what happens in init_pipe_fs(). The only bug in this function is that if kern_mount() fails (e.g.
because kmalloc() failed in add_vfsmnt()) then the filesystem is left registered but module
initialisation fails. This will cause cat /proc/filesystems to Oops. (I have just sent a patch to Linus
mentioning that although this is not a real bug today, as pipefs can't be compiled as a module, it should be
written with the view that in the future it may become modularised.)
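For illustration, a condensed sketch (modelled on fs/pipe.c, simplified and not verbatim; the
pipefs_read_super name stands for the filesystem's read_super routine) of how such an
FS_SINGLE|FS_NOMOUNT filesystem registers and kernel-mounts itself:

static struct vfsmount *pipe_mnt;	/* the single system-wide mount */

static DECLARE_FSTYPE(pipe_fs_type, "pipefs", pipefs_read_super,
		      FS_NOMOUNT | FS_SINGLE);

static int __init init_pipe_fs(void)
{
	int err = register_filesystem(&pipe_fs_type);

	if (!err) {
		/* FS_SINGLE: must be mounted internally right after registration */
		pipe_mnt = kern_mount(&pipe_fs_type);
		if (IS_ERR(pipe_mnt))
			err = PTR_ERR(pipe_mnt);
		/* note: on failure the filesystem stays registered -
		 * this is the bug discussed in the text above */
	}
	return err;
}

static void __exit exit_pipe_fs(void)
{
	unregister_filesystem(&pipe_fs_type);
	/* the kernel mount is also undone here in the real code */
}

module_init(init_pipe_fs)
module_exit(exit_pipe_fs)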
The result of register_filesystem() is that pipe_fs_type is linked into the file_systems
list so one can read /proc/filesystems and find "pipefs" entry in there with "nodev" flag indicating
that FS_REQUIRES_DEV was not set. The /proc/filesystems file should really be enhanced to
support all the new FS_ flags (and I made a patch to do so), but it cannot be done because it would break all
the user applications that use it. Despite Linux kernel interfaces changing every minute (only for the better),
when it comes to userspace compatibility Linux is a very conservative operating system, which allows many
existing applications to keep working unchanged.
The reason why exec_domains_lock is a read-write lock is that only registration and unregistration requests
modify the list, whilst doing cat /proc/execdomains calls
fs/exec_domain.c:get_exec_domain_list(), which needs only read access to the list.
Registering a new execution domain defines a "lcall7 handler" and a signal number conversion map.
Actually, the ABI patch extends this concept of exec domain to include extra information (like socket options);
there, abi_dispatch() is a wrapper around the table of function pointers, uw7_funcs, that implement this
personality's system calls.
struct address_space {
struct list_head clean_pages;
struct list_head dirty_pages;
struct list_head locked_pages;
unsigned long nrpages;
struct address_space_operations *a_ops;
struct inode *host;
struct vm_area_struct *i_mmap;
struct vm_area_struct *i_mmap_shared;
spinlock_t i_shared_lock;
};
To understand the way address_spaces work, we only need to look at a few of these fields:
clean_pages, dirty_pages and locked_pages are doubly linked lists of all clean, dirty and
locked pages that belong to this address_space; nrpages is the total number of pages in this
address_space. a_ops defines the methods of this object and host is a pointer to the inode this
address_space belongs to - it may also be NULL, e.g. in the case of the swapper address_space
(mm/swap_state.c).
The usage of clean_pages, dirty_pages, locked_pages and nrpages is obvious, so we will
take a closer look at the address_space_operations structure, defined in the same header:
struct address_space_operations {
	int (*writepage)(struct page *);
	int (*readpage)(struct file *, struct page *);
	...
	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
	int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
	...
};
For a basic view at the principle of address_spaces (and the pagecache) we need to take a look at
->writepage and ->readpage, but in practice we need to take a look at ->prepare_write and
->commit_write, too.
You can probably guess what the address_space_operations methods do by virtue of their names alone;
nevertheless, they do require some explanation. Their use in the course of filesystem data I/O, by far the
most common path through the pagecache, provides a good way of understanding them. Unlike most
other UNIX-like operating systems, Linux has generic file operations (a subset of the SYSVish vnode
operations) for data IO through the pagecache. This means that the data will not directly interact with the
filesystem on read/write/mmap, but will be read/written from/to the pagecache whenever possible. The
pagecache has to get data from the actual low-level filesystem in case the user wants to read from a page
not yet in memory, or write data to disk in case memory gets low.
In the read path the generic methods will first try to find a page that matches the wanted inode/index
tuple.
hash = page_hash(inode->i_mapping, index);
Then we test whether the page actually exists.
hash = page_hash(inode->i_mapping, index);
page = __find_page_nolock(inode->i_mapping, index, *hash);
When it does not exist, we allocate a new free page, and add it to the page- cache hash.
page = page_cache_alloc();
__add_to_page_cache(page, mapping, index, hash);
After the page is hashed we use the ->readpage address_space operation to actually fill the page with
data. (file is an open instance of inode).
error = mapping->a_ops->readpage(file, page);
Finally we can copy the data to userspace.
For writing to the filesystem two paths exist: one for writable mappings (mmap) and one for the
write(2) family of syscalls. The mmap case is very simple, so it will be discussed first. When a user
modifies a mapping, the VM subsystem marks the page dirty.
SetPageDirty(page);
The bdflush kernel thread, which tries to free pages either as background activity or because memory
gets low, will try to call ->writepage on the pages that are explicitly marked dirty. The ->writepage
method now has to write the page's contents back to disk and free the page.
The second write path is _much_ more complicated. For each page the user writes to, we are basically
doing the following: (for the full code see mm/filemap.c:generic_file_write()).
page = __grab_cache_page(mapping, index, &cached_page);
mapping->a_ops->prepare_write(file, page, offset, offset+bytes);
copy_from_user(kaddr+offset, buf, bytes);
mapping->a_ops->commit_write(file, page, offset, offset+bytes);
So first we try to find the hashed page or allocate a new one, then we call the ->prepare_write
address_space method, copy the user buffer to kernel memory and finally call the ->commit_write
method. As you have probably seen, ->prepare_write and ->commit_write are fundamentally different
from ->readpage and ->writepage, because they are called not only when physical IO is actually
wanted but every time the user modifies the file. There are two (or more?) ways to handle this. The first
one uses the Linux buffercache to delay the physical IO, by filling a page->buffers pointer with
buffer_heads that will be used in try_to_free_buffers (fs/buffer.c) to request IO once memory
gets low, and is very widely used in the current kernel. The other way just sets the page dirty and
relies on ->writepage to do all the work. Due to the lack of a validity bitmap in struct page, this does
not work for filesystems that have a granularity smaller than PAGE_SIZE.
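To illustrate the first (buffercache-based) approach, a typical set of address_space operations for a
block-device-backed filesystem can be assembled from the helpers in fs/buffer.c; myfs_get_block
is the hypothetical per-filesystem block-mapping routine, everything else is a real 2.4 helper:

/* hypothetical: maps a logical file block to an on-disk block */
static int myfs_get_block(struct inode *inode, long block,
			  struct buffer_head *bh_result, int create);

static int myfs_readpage(struct file *file, struct page *page)
{
	return block_read_full_page(page, myfs_get_block);
}

static int myfs_writepage(struct page *page)
{
	return block_write_full_page(page, myfs_get_block);
}

static int myfs_prepare_write(struct file *file, struct page *page,
			      unsigned from, unsigned to)
{
	return block_prepare_write(page, from, to, myfs_get_block);
}

static struct address_space_operations myfs_aops = {
	readpage:	myfs_readpage,
	writepage:	myfs_writepage,
	prepare_write:	myfs_prepare_write,
	commit_write:	generic_commit_write,	/* marks the buffers dirty */
};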
5. IPC mechanisms
This chapter describes the semaphore, shared memory, and message queue IPC mechanisms as
implemented in the Linux 2.4 kernel. It is organized into four sections. The first three sections cover the
interfaces and support functions for semaphores, message queues, and shared memory respectively. The last
section describes a set of common functions and data structures that are shared by all three mechanisms.
5.1 Semaphores
The functions described in this section implement the user level semaphore mechanisms. Note that this
implementation relies on the use of kernel spinlocks and kernel semaphores. To avoid confusion, the term
"kernel semaphore" will be used in reference to kernel semaphores. All other uses of the word "semaphore"
will be in reference to the user level semaphores.
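As a reminder of what the kernel is implementing here, a minimal user-level use of the interface (ordinary
userspace C, not kernel code; error checking omitted for brevity) looks like this:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
	/* create a set of one semaphore and initialise it to 1 */
	int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
	union semun { int val; } arg;	/* the caller must define semun itself */
	struct sembuf op;

	arg.val = 1;
	semctl(semid, 0, SETVAL, arg);

	/* P operation: decrement, with SEM_UNDO so an exiting task is undone */
	op.sem_num = 0; op.sem_op = -1; op.sem_flg = SEM_UNDO;
	semop(semid, &op, 1);

	/* ... critical section ... */

	/* V operation: increment */
	op.sem_op = 1;
	semop(semid, &op, 1);

	semctl(semid, 0, IPC_RMID);
	return 0;
}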
sys_semctl()
For the IPC_INFO, SEM_INFO, and SEM_STAT commands, semctl_nolock() is called to perform the
necessary functions.
For the GETALL, GETVAL, GETPID, GETNCNT, GETZCNT, IPC_STAT, SETVAL,and SETALL
commands, semctl_main() is called to perform the necessary functions.
For the IPC_RMID and IPC_SET commands, semctl_down() is called to perform the necessary functions.
Throughout both of these operations, the global sem_ids.sem kernel semaphore is held.
sys_semop()
After validating the call parameters, the semaphore operations data is copied from user space to a temporary
buffer. If a small temporary buffer is sufficient, then a stack buffer is used. Otherwise, a larger buffer is
allocated. After copying in the semaphore operations data, the global semaphores spinlock is locked, and
the user-specified semaphore set ID is validated. Access permissions for the semaphore set are also
validated.
All of the user-specified semaphore operations are parsed. During this process, a count is maintained of all
the operations that have the SEM_UNDO flag set. A decrease flag is set if any of the operations subtract
from a semaphore value, and an alter flag is set if any of the semaphore values are modified (i.e.
increased or decreased). The number of each semaphore to be modified is validated.
If SEM_UNDO was asserted for any of the semaphore operations, then the undo list for the current task is
searched for an undo structure associated with this semaphore set. During this search, if the semaphore set
ID of any of the undo structures is found to be -1, then freeundos() is called to free the undo structure and
remove it from the list. If no undo structure is found for this semaphore set then alloc_undo() is called to
allocate and initialize one.
The try_atomic_semop() function is called with the do_undo parameter equal to 0 in order to execute the
sequence of operations. The return value indicates whether the operations passed, failed, or were not
executed because they need to block. Each of these cases is further described below:
When awakened, the task re-locks the global semaphore spinlock, determines why it was awakened, and
how it should respond. The following cases are handled:
● If the semaphore set has been removed, then the system call fails with EIDRM.
● If the status element of the sem_queue structure is set to 1, then the task was awakened in order to
retry the semaphore operations. Another call to try_atomic_semop() is made to execute the sequence
of semaphore operations. If try_atomic_semop() returns 1, then the task must block again as
described above. Otherwise, 0 is returned for success, or an appropriate error code is returned in case
of failure. Before sys_semop() returns, current->semsleeping is cleared, and the sem_queue is
removed from the queue. If any of the specified semaphore operations were altering operations
(increase or decrease), then update_queue() is called to traverse the queue of pending semaphore
operations for the semaphore set and awaken any sleeping tasks that no longer need to block.
● If the status element of the sem_queue structure is NOT set to 1, and the sem_queue element has
not been dequeued, then the task was awakened by an interrupt. In this case, the system call fails with
EINTR. Before returning, current->semsleeping is cleared, and the sem_queue is removed from the
queue. Also, update_queue() is called if any of the operations were altering operations.
● If the status element of the sem_queue structure is NOT set to 1, and the sem_queue element has
been dequeued, then the semaphore operations have already been executed by update_queue(). The
queue status, which could be 0 for success or a negated error code for failure, becomes the return
value of the system call.
struct sem_array
struct sem
struct seminfo
struct seminfo {
int semmap;
int semmni;
int semmns;
int semmnu;
int semmsl;
int semopm;
int semume;
int semusz;
int semvmx;
int semaem;
};
struct semid64_ds
struct semid64_ds {
	struct ipc64_perm sem_perm;	/* permissions .. see ipc.h */
	__kernel_time_t sem_otime;	/* last semop time */
	unsigned long __unused1;
	__kernel_time_t sem_ctime;	/* last change time */
	unsigned long __unused2;
	unsigned long sem_nsems;	/* no. of semaphores in array */
	unsigned long __unused3;
	unsigned long __unused4;
};
struct sem_queue
struct sembuf
struct sem_undo
newary()
newary() relies on the ipc_alloc() function to allocate the memory required for the new semaphore set. It
allocates enough memory for the semaphore set descriptor and for each of the semaphores in the set. The
allocated memory is cleared, and the address of the first element of the semaphore set descriptor is passed to
ipc_addid(). ipc_addid() reserves an array entry for the new semaphore set descriptor and initializes the (
struct kern_ipc_perm) data for the set. The global used_sems variable is updated by the number of
semaphores in the new set and the initialization of the ( struct kern_ipc_perm) data for the new set is
completed. Other initializations performed for this set are listed below:
● The sem_base element for the set is initialized to the address immediately following the ( struct
sem_array) portion of the newly allocated data. This corresponds to the location of the first
semaphore in the set.
● The sem_pending queue is initialized as empty.
All of the operations following the call to ipc_addid() are performed while holding the global semaphores
spinlock. After unlocking the global semaphores spinlock, newary() calls ipc_buildid() (via sem_buildid()).
This function uses the index of the semaphore set descriptor to create a unique ID, that is then returned to
the caller of newary().
freeary()
freeary() is called by semctl_down() to perform the functions listed below. It is called with the global
semaphores spinlock locked and it returns with the spinlock unlocked.
● The ipc_rmid() function is called (via the sem_rmid() wrapper) to delete the ID for the semaphore set
and to retrieve a pointer to the semaphore set.
● The undo list for the semaphore set is invalidated.
● All pending processes are awakened and caused to fail with EIDRM.
● The number of used semaphores is reduced by the number of semaphores in the removed set.
semctl_down()
semctl_down() provides the IPC_RMID and IPC_SET operations of the semctl() system call. The
semaphore set ID and the access permissions are verified prior to either of these operations, and in either
case, the global semaphore spinlock is held throughout the operation.
IPC_RMID
The IPC_RMID operation calls freeary() to remove the semaphore set.
IPC_SET
The IPC_SET operation updates the uid, gid, mode, and ctime elements of the semaphore set.
semctl_nolock()
semctl_nolock() is called by sys_semctl() to perform the IPC_INFO, SEM_INFO and SEM_STAT
functions.
SEM_STAT
SEM_STAT causes a temporary semid64_ds buffer to be initialized. The global semaphore spinlock is then
held while copying the sem_otime, sem_ctime, and sem_nsems values into the buffer. This data is
then copied to user space.
semctl_main()
semctl_main() is called by sys_semctl() to perform many of the supported functions, as described in the
subsections below. Prior to performing any of the following operations, semctl_main() locks the global
semaphore spinlock and validates the semaphore set ID and the permissions. The spinlock is released before
returning.
GETALL
The GETALL operation loads the current semaphore values into a temporary kernel buffer and copies them
out to user space. The small stack buffer is used if the semaphore set is small. Otherwise, the spinlock is
temporarily dropped in order to allocate a larger buffer. The spinlock is held while copying the semaphore
values into the temporary buffer.
SETALL
The SETALL operation copies semaphore values from user space into a temporary buffer, and then into the
semaphore set. The spinlock is dropped while copying the values from user space into the temporary buffer,
and while verifying reasonable values. If the semaphore set is small, then a stack buffer is used, otherwise a
larger buffer is allocated. The spinlock is regained and held while the following operations are performed
on the semaphore set:
● The semaphore values are copied into the semaphore set.
● The semaphore adjustments of the undo queue for the semaphore set are cleared.
● The update_queue() function is called to traverse the queue of pending semops and look for any tasks
that can be completed as a result of the SETALL operation. Any pending tasks that are no longer
blocked are awakened.
IPC_STAT
In the IPC_STAT operation, the sem_otime, sem_ctime, and sem_nsems values are copied into a
stack buffer. The data is then copied to user space after dropping the spinlock.
GETVAL
For GETVAL in the non-error case, the return value for the system call is set to the value of the specified
semaphore.
GETPID
For GETPID in the non-error case, the return value for the system call is set to the pid associated with the
last operation on the semaphore.
GETNCNT
For GETNCNT in the non-error case, the return value for the system call is set to the number of processes
waiting on the value of the semaphore to be less than zero. This number is calculated by the count_semncnt() function.
GETZCNT
For GETZCNT in the non-error case, the return value for the system call is set to the number of processes
waiting on the value of the semaphore to be zero. This number is calculated by the count_semzcnt() function.
SETVAL
After validating the new semaphore value, the following functions are performed:
● The undo queue is searched for any adjustments to this semaphore. Any adjustments that are found
are reset to zero.
● The semaphore value is set to the value provided.
● The update_queue() function is called to traverse the queue of pending semops and look for any tasks
that can be completed as a result of the SETVAL operation. Any pending tasks that are no longer
blocked are awakened.
count_semncnt()
count_semncnt() counts the number of tasks waiting on the value of a semaphore to be less than zero.
count_semzcnt()
count_semzcnt() counts the number of tasks waiting on the value of a semaphore to be zero.
update_queue()
update_queue() traverses the queue of pending semops for a semaphore set and calls try_atomic_semop() to
determine which sequences of semaphore operations would succeed. If the status of the queue element
indicates that blocked tasks have already been awakened, then the queue element is skipped over. For other
elements of the queue, the q->alter flag is passed as the undo parameter to try_atomic_semop(),
indicating that any altering operations should be undone before returning.
If the sequence of operations would block, then update_queue() returns without making any changes.
A sequence of operations can fail if one of the semaphore operations would cause an invalid semaphore
value, or an operation marked IPC_NOWAIT is unable to complete. In such a case, the task that is blocked
on the sequence of semaphore operations is awakened, and the queue status is set with an appropriate error
code. The queue element is also dequeued.
If the sequence of operations is non-altering, then they would have passed a zero value as the undo
parameter to try_atomic_semop(). If these operations succeeded, then they are considered complete and are
removed from the queue. The blocked task is awakened, and the queue element status is set to indicate
success.
If the sequence of operations would alter the semaphore values, but can succeed, then sleeping tasks that no
longer need to be blocked are awakened. The queue status is set to 1 to indicate that the blocked task has
been awakened. The operations have not been performed, so the queue element is not removed from the
queue. The semaphore operations would be executed by the awakened task.
try_atomic_semop()
try_atomic_semop() is called by sys_semop() and update_queue() to determine if a sequence of semaphore
operations will all succeed. It determines this by attempting to perform each of the operations.
If a blocking operation is encountered, then the process is aborted and all operations are reversed.
-EAGAIN is returned if IPC_NOWAIT is set. Otherwise 1 is returned to indicate that the sequence of
semaphore operations is blocked.
If a semaphore value is adjusted beyond system limits, then all operations are reversed, and -ERANGE
is returned.
If all operations in the sequence succeed, and the do_undo parameter is non-zero, then all operations are
reversed, and 0 is returned. If the do_undo parameter is zero, then all operations succeeded and remain in
force, and the sem_otime field of the semaphore set is updated.
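The core idea can be sketched as follows (a simplified illustration only, not the 2.4 source: the real function
also handles wait-for-zero operations, SEM_UNDO adjustments and pid bookkeeping; the function name is
invented):

/* Tentatively apply each operation; on trouble, roll everything back.
 * Returns 0 on success, 1 if the caller must sleep, or a negative errno. */
static int try_semops_sketch(struct sem *base, struct sembuf *sops, int nsops)
{
	int i, result = 0;

	for (i = 0; i < nsops; i++) {
		int newval = base[sops[i].sem_num].semval + sops[i].sem_op;

		if (newval < 0) {		/* would go negative: block */
			result = (sops[i].sem_flg & IPC_NOWAIT) ? -EAGAIN : 1;
			break;
		}
		if (newval > SEMVMX) {		/* beyond the system limit */
			result = -ERANGE;
			break;
		}
		base[sops[i].sem_num].semval = newval;	/* apply tentatively */
	}

	if (result != 0)			/* undo everything applied so far */
		while (--i >= 0)
			base[sops[i].sem_num].semval -= sops[i].sem_op;

	return result;
}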
sem_revalidate()
sem_revalidate() is called when the global semaphores spinlock has been temporarily dropped and needs to
be locked again. It is called by semctl_main() and alloc_undo(). It validates the semaphore ID and
permissions and on success, returns with the global semaphores spinlock locked.
freeundos()
freeundos() traverses the process undo list in search of the desired undo structure. If found, the undo
structure is removed from the list and freed. A pointer to the next undo structure on the process list is
returned.
alloc_undo()
alloc_undo() expects to be called with the global semaphores spinlock locked. In the case of an error, it
returns with it unlocked.
The global semaphores spinlock is unlocked, and kmalloc() is called to allocate sufficient memory for both
the sem_undo structure, and also an array of one adjustment value for each semaphore in the set. On
success, the global spinlock is regained with a call to sem_revalidate().
The new sem_undo structure is then initialized, and the address of this structure is placed at the address
provided by the caller. The new undo structure is then placed at the head of undo list for the current task.
sem_exit()
sem_exit() is called by do_exit(), and is responsible for executing all of the undo adjustments for the exiting
task.
If the current process was blocked on a semaphore, then it is removed from the sem_queue list while
holding the global semaphores spinlock.
The undo list for the current task is then traversed, and the following operations are performed while
holding and releasing the global semaphores spinlock around the processing of each element of the list.
The following operations are performed for each of the undo elements:
● The undo structure and the semaphore set ID are validated.
● The undo list of the corresponding semaphore set is searched to find a reference to the same undo
structure and to remove it from that list.
● The adjustments indicated in the undo structure are applied to the semaphore set.
● update_queue() is called to traverse the queue of pending semops and awaken any sleeping tasks that
no longer need to be blocked as a result of executing the undo operations.
● The undo structure is freed.
When the processing of the list is complete, the current->semundo value is cleared.
5.2 Message queues
sys_msgctl()
The parameters passed to sys_msgctl() are: a message queue ID (msqid), the operation (cmd), and a
pointer to a user space buffer of type msqid_ds (buf). Six operations are provided in this function:
IPC_INFO, MSG_INFO, IPC_STAT, MSG_STAT, IPC_SET and IPC_RMID. The message queue ID and
the operation parameters are validated; then, the operation (cmd) is performed as follows:
IPC_INFO ( or MSG_INFO)
The global message queue information is copied to user space.
IPC_STAT ( or MSG_STAT)
A temporary buffer of type struct msqid64_ds is initialized and the global message queue spinlock is
locked. After verifying the access permissions of the calling process, the message queue information
associated with the message queue ID is loaded into the temporary buffer, the global message queue
spinlock is unlocked, and the contents of the temporary buffer are copied out to user space by
copy_msqid_to_user().
IPC_SET
The user data is copied in via copy_msqid_from_user(). The global message queue semaphore and spinlock
are obtained and released at the end. After the message queue ID and the current process access
permissions are validated, the message queue information is updated with the user provided data. Later,
expunge_all() and ss_wakeup() are called to wake up all processes sleeping on the receiver and sender
waiting queues of the message queue. This is because some receivers may now be excluded by stricter
access permissions and some senders may now be able to send the message due to an increased queue size.
IPC_RMID
The global message queue semaphore is obtained and the global message queue spinlock is locked. After
validating the message queue ID and the current task access permissions, freeque() is called to free the
resources related to the message queue ID. The global message queue semaphore and spinlock are released.
sys_msgsnd()
sys_msgsnd() receives as parameters a message queue ID (msqid), a pointer to a buffer of type struct
msg_msg (msgp), the size of the message to be sent (msgsz), and a flag indicating wait vs. not wait
(msgflg). There are two task waiting queues and one message waiting queue associated with the message
queue ID. If there is a task in the receiver waiting queue that is waiting for this message, then the message is
delivered directly to the receiver, and the receiver is awakened. Otherwise, if there is enough space
available in the message waiting queue, the message is saved in this queue. As a last resort, the sending task
enqueues itself on the sender waiting queue. A more in-depth discussion of the operations performed by
sys_msgsnd() follows:
1. Validates the user buffer address and the message type, then invokes load_msg() to load the contents
of the user message into a temporary object msg of type struct msg_msg. The message type and
message size fields of msg are also initialized.
2. Locks the global message queue spinlock and gets the message queue descriptor associated with the
message queue ID. If no such message queue exists, returns EINVAL.
3. Invokes ipc_checkid() (via msg_checkid())to verify that the message queue ID is valid and calls
ipcperms() to check the calling process' access permissions.
4. Checks the message size and the space left in the message waiting queue to see if there is enough
room to store the message. If not, the following substeps are performed:
1. If IPC_NOWAIT is specified in msgflg the global message queue spinlock is unlocked, the
memory resources for the message are freed, and EAGAIN is returned.
2. Invokes ss_add() to enqueue the current task in the sender waiting queue. It also unlocks the
global message queue spinlock and invokes schedule() to put the current task to sleep.
3. When awakened, obtains the global spinlock again and verifies that the message queue ID is
still valid. If the message queue ID is not valid, EIDRM is returned.
4. Invokes ss_del() to remove the sending task from the sender waiting queue. If there is any
signal pending for the task, sys_msgsnd() unlocks the global spinlock, invokes free_msg() to
free the message buffer, and returns EINTR. Otherwise, the function goes back to check again
whether there is enough space in the message waiting queue.
5. Invokes pipelined_send() to try to send the message to the waiting receiver directly.
6. If there is no receiver waiting for this message, enqueues msg into the message waiting
queue (msq->q_messages). Updates the q_cbytes and the q_qnum fields of the message queue
descriptor, as well as the global variables msg_bytes and msg_hdrs, which indicate the total
number of bytes used for messages and the total number of messages system wide.
7. If the message has been successfully sent or enqueued, updates the q_lspid and the q_stime
fields of the message queue descriptor and releases the global message queue spinlock.
sys_msgrcv()
The sys_msgrcv() function receives as parameters a message queue ID (msqid), a pointer to a buffer of
type msg_msg (msgp), the desired message size(msgsz), the message type (msgtyp), and the flags
(msgflg). It searches the message waiting queue associated with the message queue ID, finds the first
message in the queue which matches the request type, and copies it into the given user buffer. If no such
message is found in the message waiting queue, the requesting task is enqueued into the receiver waiting
queue until the desired message is available. A more in-depth discussion of the operations performed by
sys_msgrcv() follows:
1. First, invokes convert_mode() to derive the search mode from msgtyp. sys_msgrcv() then locks the
global message queue spinlock and obtains the message queue descriptor associated with the message
queue ID. If no such message queue exists, it returns EINVAL.
2. Checks whether the current task has the correct permissions to access the message queue.
3. Starting from the first message in the message waiting queue, invokes testmsg() to check whether the
message type matches the required type. sys_msgrcv() continues searching until a matched message
is found or the whole waiting queue is exhausted. If the search mode is SEARCH_LESSEQUAL,
then the message on the queue with the lowest type less than or equal to msgtyp is selected.
4. If a message is found, sys_msgrcv() performs the following substeps:
1. If the message size is larger than the desired size and msgflg indicates no error allowed,
unlocks the global message queue spinlock and returns E2BIG.
2. Removes the message from the message waiting queue and updates the message queue
statistics.
3. Wakes up all tasks sleeping on the senders waiting queue. The removal of a message from the
queue in the previous step makes it possible for one of the senders to progress. Goes to the last
step.
5. If no message matching the receivers criteria is found in the message waiting queue, then msgflg is
checked. If IPC_NOWAIT is set, then the global message queue spinlock is unlocked and ENOMSG
is returned. Otherwise, the receiver is enqueued on the receiver waiting queue as follows:
1. A msg_receiver data structure msr is allocated and added to the head of the waiting queue.
2. The r_tsk field of msr is set to current task.
3. The r_msgtype and r_mode fields are initialized with the desired message type and mode
respectively.
4. If msgflg indicates MSG_NOERROR, then the r_maxsize field of msr is set to the value
of msgsz; otherwise it is set to INT_MAX.
5. The r_msg field is initialized to indicate that no message has been received yet.
6. After the initialization is complete, the status of the receiving task is set to
TASK_INTERRUPTIBLE, the global message queue spinlock is unlocked, and schedule() is
invoked.
6. After the receiver is awakened, the r_msg field of msr is checked. This field is used to store the
pipelined message or, in the case of an error, to store the error status. If the r_msg field is filled with
the desired message, then go to the last step. Otherwise, the global message queue spinlock is locked
again.
7. After obtaining the spinlock, the r_msg field is re-checked to see if the message was received while
waiting for the spinlock. If the message has been received, the last step occurs.
8. If the r_msg field remains unchanged, then the task was awakened in order to retry. In this case,
msr is dequeued. If there is a signal pending for the task, then the global message queue spinlock is
unlocked and EINTR is returned. Otherwise, the function needs to go back and retry.
9. If the r_msg field shows that an error occurred while sleeping, the global message queue spinlock is
unlocked and the error is returned.
10. After validating that the address of the user buffer msgp is valid, the message type is loaded into the
mtype field of msgp, and store_msg() is invoked to copy the message contents to the mtext field of
msgp. Finally, the memory for the message is freed by the function free_msg().
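Again for orientation, the user-level view of these system calls (plain userspace C, error handling omitted;
struct my_msg is an arbitrary example message layout) is simply:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <stdio.h>

struct my_msg {
	long mtype;		/* must be > 0 */
	char mtext[64];
};

int main(void)
{
	int msqid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
	struct my_msg m = { 1, "hello" };

	msgsnd(msqid, &m, sizeof(m.mtext), 0);		/* serviced by sys_msgsnd() */
	msgrcv(msqid, &m, sizeof(m.mtext), 1, 0);	/* sys_msgrcv(), type 1 */
	printf("%s\n", m.mtext);

	msgctl(msqid, IPC_RMID, NULL);			/* sys_msgctl() */
	return 0;
}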
struct msg_queue
struct msg_msg
struct msg_msgseg
struct msg_sender
struct msg_sender {
struct list_head list;
struct task_struct* tsk;
};
struct msg_receiver
struct msg_receiver {
	struct list_head r_list;
	struct task_struct *r_tsk;
	int r_mode;
	long r_msgtype;
	long r_maxsize;
	struct msg_msg *volatile r_msg;
};
struct msqid64_ds
struct msqid64_ds {
	struct ipc64_perm msg_perm;
	__kernel_time_t msg_stime;	/* last msgsnd time */
	unsigned long __unused1;
	__kernel_time_t msg_rtime;	/* last msgrcv time */
	unsigned long __unused2;
	__kernel_time_t msg_ctime;	/* last change time */
	unsigned long __unused3;
	unsigned long msg_cbytes;	/* current number of bytes on queue */
	unsigned long msg_qnum;		/* number of messages in queue */
	unsigned long msg_qbytes;	/* max number of bytes on queue */
	__kernel_pid_t msg_lspid;	/* pid of last msgsnd */
	__kernel_pid_t msg_lrpid;	/* last receive pid */
	unsigned long __unused4;
	unsigned long __unused5;
};
struct msqid_ds
struct msqid_ds {
	struct ipc_perm msg_perm;
	struct msg *msg_first;		/* first message on queue, unused */
	struct msg *msg_last;		/* last message in queue, unused */
	__kernel_time_t msg_stime;	/* last msgsnd time */
	__kernel_time_t msg_rtime;	/* last msgrcv time */
	__kernel_time_t msg_ctime;	/* last change time */
	unsigned long msg_lcbytes;	/* Reuse junk fields for 32 bit */
	unsigned long msg_lqbytes;	/* ditto */
	unsigned short msg_cbytes;	/* current number of bytes on queue */
	unsigned short msg_qnum;	/* number of messages in queue */
	unsigned short msg_qbytes;	/* max number of bytes on queue */
	__kernel_ipc_pid_t msg_lspid;	/* pid of last msgsnd */
	__kernel_ipc_pid_t msg_lrpid;	/* last receive pid */
};
struct msq_setbuf
struct msq_setbuf {
unsigned long qbytes;
uid_t uid;
gid_t gid;
mode_t mode;
};
freeque()
When a message queue is going to be removed, the freeque() function is called. This function assumes that
the global message queue spinlock is already locked by the calling function. It frees all kernel resources
associated with that message queue. First, it calls ipc_rmid() (via msg_rmid()) to remove the message queue
descriptor from the array of global message queue descriptors. Then it calls expunge_all() to wake up all
receivers and ss_wakeup() to wake up all senders sleeping on this message queue. Later the global message
queue spinlock is released. All messages stored in this message queue are freed and the memory for the
message queue descriptor is freed.
ss_wakeup()
ss_wakeup() wakes up all the tasks waiting in the given message sender waiting queue. If this function is
called by freeque(), then all senders in the queue are dequeued.
ss_add()
ss_add() receives as parameters a message queue descriptor and a message sender data structure. It fills the
tsk field of the message sender data structure with the current process, changes the status of the current
process to TASK_INTERRUPTIBLE, then inserts the message sender data structure at the head of the
sender waiting queue of the given message queue.
ss_del()
If the given message sender data structure (mss) is still in the associated sender waiting queue, then
ss_del() removes mss from the queue.
expunge_all()
expunge_all() receives as parameters a message queue descriptor (msq) and an integer value (res)
indicating the reason for waking up the receivers. For each sleeping receiver associated with msq, the
r_msg field is set to the indicated wakeup reason (res), and the associated receiving task is awakened.
This function is called when a message queue is removed or a message control operation has been
performed.
load_msg()
When a process sends a message, the sys_msgsnd() function first invokes the load_msg() function to load
the message from user space to kernel space. The message is represented in kernel memory as a linked list
of data blocks. Associated with the first data block is a msg_msg structure that describes the overall
message. The datablock associated with the msg_msg structure is limited to a size of DATA_MSG_LEN.
The data block and the structure are allocated in one contiguous memory block that can be as large as one
page in memory. If the full message will not fit into this first data block, then additional data blocks are
allocated and are organized into a linked list. These additional data blocks are limited to a size of
DATA_SEG_LEN, and each includes an associated msg_msgseg structure. The msg_msgseg structure and
the associated data block are allocated in one contiguous memory block that can be as large as one page in
memory. This function returns the address of the new msg_msg structure on success.
store_msg()
The store_msg() function is called by sys_msgrcv() to reassemble a received message into the user space
buffer provided by the caller. The data described by the msg_msg structure and any msg_msgseg structures
are sequentially copied to the user space buffer.
free_msg()
The free_msg() function releases the memory for a message data structure msg_msg, and the message
segments.
convert_mode()
convert_mode() is called by sys_msgrcv(). It receives as parameters the address of the specified message
type (msgtyp) and a flag (msgflg). It returns the search mode to the caller based on the value of
msgtyp and msgflg. If msgtyp is null, then SEARCH_ANY is returned. If msgtyp is less than 0, then
msgtyp is set to it's absolute value and SEARCH_LESSEQUAL is returned. If MSG_EXCEPT is
specified in msgflg, then SEARCH_NOTEQUAL is returned. Otherwise SEARCH_EQUAL is returned.
testmsg()
The testmsg() function checks whether a message meets the criteria specified by the receiver. It returns 1 if
one of the following conditions is true:
● The search mode indicates searching any message (SEARCH_ANY).
● The search mode is SEARCH_LESSEQUAL and the message type is less than or equal to desired
type.
● The search mode is SEARCH_EQUAL and the message type is the same as desired type.
● Search mode is SEARCH_NOTEQUAL and the message type is not equal to the specified type.
pipelined_send()
pipelined_send() allows a process to directly send a message to a waiting receiver rather than deposit the
message in the associated message waiting queue. The testmsg() function is invoked to find the first
receiver which is waiting for the given message. If found, the waiting receiver is removed from the receiver
waiting queue, and the associated receiving task is awakened. The message is stored in the r_msg field of
the receiver, and 1 is returned. In the case where no receiver is waiting for the message, 0 is returned.
In the process of searching for a receiver, potential receivers may be found which have requested a size that
is too small for the given message. Such receivers are removed from the queue, and are awakened with an
error status of E2BIG, which is stored in the r_msg field. The search then continues until either a valid
receiver is found, or the queue is exhausted.
copy_msqid_to_user()
copy_msqid_to_user() copies the contents of a kernel buffer to the user buffer. It receives as parameters a
user buffer, a kernel buffer of type msqid64_ds, and a version flag indicating the new IPC version vs. the
old IPC version. If the version flag equals IPC_64, then copy_to_user() is invoked to copy from the kernel
buffer to the user buffer directly. Otherwise a temporary buffer of type struct msqid_ds is initialized, and
the kernel data is translated to this temporary buffer. Later copy_to_user() is called to copy the contents of
the temporary buffer to the user buffer.
copy_msqid_from_user()
The function copy_msqid_from_user() receives as parameters a kernel message buffer of type struct
msq_setbuf, a user buffer and a version flag indicating the new IPC version vs. the old IPC version. In the
case of the new IPC version, copy_from_user() is called to copy the contents of the user buffer to a
temporary buffer of type msqid64_ds. Then, the qbytes,uid, gid, and mode fields of the kernel buffer
are filled with the values of the corresponding fields from the temporary buffer. In the case of the old IPC
version, a temporary buffer of type struct msqid_ds is used instead.
5.3 Shared Memory
sys_shmctl()
IPC_INFO
A temporary shminfo64 buffer is loaded with system-wide shared memory parameters and is copied out to
user space for access by the calling application.
SHM_INFO
The global shared memory semaphore and the global shared memory spinlock are held while gathering
system-wide statistical information for shared memory. The shm_get_stat() function is called to calculate
both the number of shared memory pages that are resident in memory and the number of shared memory
pages that are swapped out. Other statistics include the total number of shared memory pages and the
number of shared memory segments in use. The counts of swap_attempts and swap_successes are
hard-coded to zero. These statistics are stored in a temporary shm_info buffer and copied out to user space
for the calling application.
SHM_STAT, IPC_STAT
For SHM_STAT and IPC_STAT, a temporary buffer of type struct shmid64_ds is initialized, and the
global shared memory spinlock is locked.
For the SHM_STAT case, the shared memory segment ID parameter is expected to be a straight index (i.e.
0 to n where n is the number of shared memory IDs in the system). After validating the index, ipc_buildid()
is called (via shm_buildid()) to convert the index into a shared memory ID. In the passing case of
SHM_STAT, the shared memory ID will be the return value. Note that this is an undocumented feature, but
is maintained for the ipcs(8) program.
For the IPC_STAT case, the shared memory segment ID parameter is expected to be an ID that was
generated by a call to shmget(). The ID is validated before proceeding. In the passing case of IPC_STAT, 0
will be the return value.
For both SHM_STAT and IPC_STAT, the access permissions of the caller are verified. The desired
statistics are loaded into the temporary buffer and then copied out to the calling application.
SHM_LOCK, SHM_UNLOCK
After validating access permissions, the global shared memory spinlock is locked, and the shared memory
segment ID is validated. For both SHM_LOCK and SHM_UNLOCK, shmem_lock() is called to perform
the function. The parameters for shmem_lock() identify the function to be performed.
IPC_RMID
During IPC_RMID the global shared memory semaphore and the global shared memory spinlock are held
throughout this function. The Shared Memory ID is validated, and then if there are no current attachments,
shm_destroy() is called to destroy the shared memory segment. Otherwise, the SHM_DEST flag is set to
mark it for destruction, and the IPC_PRIVATE flag is set to prevent other processes from being able to
reference the shared memory ID.
IPC_SET
After validating the shared memory segment ID and the user access permissions, the uid, gid, and mode
flags of the shared memory segment are updated with the user data. The shm_ctime field is also updated.
These changes are made while holding the global shared memory semaphore and the global shared memory
spinlock.
sys_shmat()
sys_shmat() takes as parameters a shared memory segment ID, an address at which the shared memory
segment should be attached (shmaddr), and flags which will be described below.
If shmaddr is non-zero, and the SHM_RND flag is specified, then shmaddr is rounded down to a
multiple of SHMLBA. If shmaddr is not a multiple of SHMLBA and SHM_RND is not specified, then
EINVAL is returned.
The access permissions of the caller are validated and the shm_nattch field for the shared memory
segment is incremented. Note that this increment guarantees that the attachment count is non-zero and
prevents the shared memory segment from being destroyed during the process of attaching to the segment.
These operations are performed while holding the global shared memory spinlock.
The do_mmap() function is called to create a virtual memory mapping to the shared memory segment
pages. This is done while holding the mmap_sem semaphore of the current task. The MAP_SHARED flag
is passed to do_mmap(). If an address was provided by the caller, then the MAP_FIXED flag is also passed
to do_mmap(). Otherwise, do_mmap() will select the virtual address at which to map the shared memory
segment.
NOTE: shm_inc() will be invoked within the do_mmap() function call via the shm_file_operations
structure. This function is called to set the PID, to set the current time, and to increment the number of
attachments to this shared memory segment.
After the call to do_mmap(), the global shared memory semaphore and the global shared memory spinlock
are both obtained. The attachment count is then decremented. The net change to the attachment count is
1 for a call to shmat() because of the call to shm_inc(). If, after decrementing the attachment count, the
resulting count is found to be zero, and if the segment is marked for destruction (SHM_DEST), then
shm_destroy() is called to release the shared memory segment resources.
Finally, the virtual address at which the shared memory is mapped is returned to the caller at the user
specified address. If an error code had been returned by do_mmap(), then this failure code is passed on as
the return value for the system call.
sys_shmdt()
The global shared memory semaphore is held while performing sys_shmdt(). The mm_struct of the
current process is searched for the vm_area_struct associated with the shared memory address. When
it is found, do_munmap() is called to undo the virtual address mapping for the shared memory segment.
Note also that do_munmap() performs a call-back to shm_close(), which performs the shared-memory
bookkeeping functions and releases the shared memory segment resources if there are no other attachments.
sys_shmdt() unconditionally returns 0.
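The user-level counterpart of this attach/detach machinery (ordinary userspace C, error handling omitted)
is:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <string.h>

int main(void)
{
	/* create a one-page segment and let the kernel pick the address */
	int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
	char *p = shmat(shmid, NULL, 0);	/* sys_shmat() -> do_mmap() */

	strcpy(p, "shared data");

	shmdt(p);				/* sys_shmdt() -> do_munmap() */
	shmctl(shmid, IPC_RMID, NULL);		/* marked SHM_DEST if still attached */
	return 0;
}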
struct shminfo64 {
unsigned long shmmax;
unsigned long shmmin;
unsigned long shmmni;
unsigned long shmseg;
unsigned long shmall;
unsigned long __unused1;
unsigned long __unused2;
unsigned long __unused3;
unsigned long __unused4;
};
struct shm_info
struct shm_info {
int used_ids;
unsigned long shm_tot; /* total allocated shm */
unsigned long shm_rss; /* total resident shm */
unsigned long shm_swp; /* total swapped shm */
unsigned long swap_attempts;
unsigned long swap_successes;
};
struct shmid_kernel
struct shmid64_ds
struct shmid64_ds {
	struct ipc64_perm shm_perm;	/* operation perms */
	size_t shm_segsz;		/* size of segment (bytes) */
	__kernel_time_t shm_atime;	/* last attach time */
	unsigned long __unused1;
	__kernel_time_t shm_dtime;	/* last detach time */
	unsigned long __unused2;
	__kernel_time_t shm_ctime;	/* last change time */
	unsigned long __unused3;
	__kernel_pid_t shm_cpid;	/* pid of creator */
	__kernel_pid_t shm_lpid;	/* pid of last operator */
	unsigned long shm_nattch;	/* no. of current attaches */
	unsigned long __unused4;
	unsigned long __unused5;
};
struct shmem_inode_info
struct shmem_inode_info {
spinlock_t lock;
unsigned long max_index;
swp_entry_t i_direct[SHMEM_NR_DIRECT]; /* for the first blocks */
/* ... remaining fields ... */
};
shm_get_stat()
shm_get_stat() cycles through all of the shared memory structures, and calculates the total number of
memory pages in use by shared memory and the total number of shared memory pages that are swapped
out. There is a file structure and an inode structure for each shared memory segment. Since the required
data is obtained via the inode, the spinlock for each inode structure that is accessed is locked and unlocked
in sequence.
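The totals computed by shm_get_stat() are reported to user space by shmctl() with the Linux-specific SHM_INFO command; the shmid argument is ignored for this command. A minimal sketch (defining _GNU_SOURCE makes sure glibc exposes SHM_INFO and struct shm_info):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
	struct shm_info info;

	/* SHM_INFO fills info with the system-wide shared memory totals. */
	if (shmctl(0, SHM_INFO, (struct shmid_ds *)&info) < 0) {
		perror("shmctl(SHM_INFO)");
		return 1;
	}
	printf("segments: %d, pages in use: %lu, resident: %lu, swapped: %lu\n",
	       info.used_ids, info.shm_tot, info.shm_rss, info.shm_swp);
	return 0;
}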
shmem_lock()
shmem_lock() receives as parameters a pointer to the shared memory segment descriptor and a flag
indicating lock vs. unlock. The locking state of the shared memory segment is stored in an associated inode.
This state is compared with the desired locking state; shmem_lock() simply returns if they match.
While holding the semaphore of the associated inode, the locking state of the inode is set. The following
operations occur for each page in the shared memory segment:
● find_lock_page() is called to lock the page (setting PG_locked) and to increment the reference count
of the page. Incrementing the reference count ensures that the shared memory segment remains
locked in memory throughout this operation.
● If the desired state is locked, then PG_locked is cleared, but the reference count remains incremented.
● If the desired state is unlocked, then the reference count is decremented twice: once for the current
reference, and once for the existing reference which caused the page to remain locked in memory.
Then PG_locked is cleared.
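This path is normally reached from shmctl() with the SHM_LOCK and SHM_UNLOCK commands, so it can be exercised from user space by a caller with the CAP_IPC_LOCK capability. A minimal sketch (arbitrary size; the lock request simply fails with EPERM for an unprivileged caller):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
	int shmid = shmget(IPC_PRIVATE, 8192, IPC_CREAT | 0600);
	if (shmid < 0) { perror("shmget"); return 1; }

	/* Lock the segment's pages in memory... */
	if (shmctl(shmid, SHM_LOCK, NULL) < 0)
		perror("shmctl(SHM_LOCK)");     /* EPERM without CAP_IPC_LOCK */

	/* ...and unlock them again, dropping the extra page references. */
	if (shmctl(shmid, SHM_UNLOCK, NULL) < 0)
		perror("shmctl(SHM_UNLOCK)");

	shmctl(shmid, IPC_RMID, NULL);
	return 0;
}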
shm_destroy()
During shm_destroy() the total number of shared memory pages is adjusted to account for the removal of
the shared memory segment. ipc_rmid() is called (via shm_rmid()) to remove the Shared Memory ID.
shmem_lock() is called to unlock the shared memory pages, effectively decrementing the reference counts to
zero for each page. fput() is called to decrement the usage counter f_count for the associated file object,
and if necessary, to release the file object resources. kfree() is called to free the shared memory segment
descriptor.
shm_inc()
shm_inc() sets the PID, sets the current time, and increments the number of attachments for the given
shared memory segment. These operations are performed while holding the global shared memory spinlock.
shm_close()
shm_close() updates the shm_lprid and the shm_dtim fields and decrements the attachment count for the
shared memory segment. If there are no remaining attachments to the shared memory segment, then
shm_destroy() is called to release the shared memory segment resources. These operations are all performed
while holding both the global shared memory semaphore and the global shared memory spinlock.
shmem_file_setup()
The function shmem_file_setup() sets up an unlinked file living in the tmpfs file system with the given
name and size. If there are enough system memory resources for this file, it creates a new dentry under the
mount root of tmpfs, and allocates a new file structure and a new inode object of tmpfs type. Then it
associates the new dentry object with the new inode object by calling d_instantiate() and saves the address
of the dentry object in the file structure. The i_size field of the inode object is set to the file size and
the i_nlink field is set to 0 in order to mark the inode unlinked. Also, shmem_file_setup() stores the
address of the shmem_file_operations structure in the f_op field, and initializes the f_mode and
f_vfsmnt fields of the file structure properly. The function shmem_truncate() is called to complete the
initialization of the inode object. On success, shmem_file_setup() returns the new file structure.
5.4 Linux IPC Primitives
ipc_alloc()
If the memory allocation is greater than PAGE_SIZE, then vmalloc() is used to allocate memory.
Otherwise, kmalloc() is called with GFP_KERNEL to allocate the memory.
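Assuming the logic described above, a sketch of ipc_alloc() (kernel context, as in ipc/util.c) is only a few lines:

#include <linux/slab.h>      /* kmalloc() */
#include <linux/vmalloc.h>   /* vmalloc() */

/* Sketch of the allocation choice described above. */
void *ipc_alloc(int size)
{
	void *out;

	if (size > PAGE_SIZE)
		out = vmalloc(size);             /* large requests: virtually contiguous  */
	else
		out = kmalloc(size, GFP_KERNEL); /* small requests: physically contiguous */
	return out;
}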
ipc_addid()
When a new semaphore set, message queue, or shared memory segment is added, ipc_addid() first calls
grow_ary() to ensure that the size of the corresponding descriptor array is sufficiently large for the system
maximum. The array of descriptors is searched for the first unused element. If an unused element is found,
the count of descriptors which are in use is incremented. The kern_ipc_perm structure for the new resource
descriptor is then initialized, and the array index for the new descriptor is returned. When ipc_addid()
succeeds, it returns with the global spinlock for the given IPC type locked.
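The steps above can be summarised in a condensed sketch (kernel context; this is a paraphrase of the described logic rather than the verbatim 2.4 source, and the kern_ipc_perm initialisation shown here is abbreviated):

/* Condensed paraphrase of the steps described above. */
static int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int size)
{
	int id;

	size = grow_ary(ids, size);              /* make the descriptor array big enough */
	for (id = 0; id < size; id++)
		if (ids->entries[id].p == NULL)
			goto found;
	return -1;                               /* no unused element                    */
found:
	ids->in_use++;                           /* one more descriptor in use           */
	if (id > ids->max_id)
		ids->max_id = id;

	new->cuid = new->uid = current->euid;    /* initialise the common ownership...   */
	new->gid = new->cgid = current->egid;    /* ...fields of kern_ipc_perm           */
	new->seq = ids->seq++;                   /* remember the per-type sequence number */

	spin_lock(&ids->ary);                    /* the caller is returned the lock held */
	ids->entries[id].p = new;
	return id;
}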
ipc_rmid()
ipc_rmid() removes the IPC descriptor from the global descriptor array of the IPC type, updates the
count of IDs which are in use, and adjusts the maximum ID in the corresponding descriptor array if
necessary. A pointer to the IPC descriptor associated with the given IPC ID is returned.
ipc_buildid()
ipc_buildid() creates a unique ID to be associated with each descriptor within a given IPC type. This ID is
created at the time a new IPC element is added (e.g. a new shared memory segment or a new semaphore
set). The IPC ID converts easily into the corresponding descriptor array index. Each IPC type maintains a
sequence number which is incremented each time a descriptor is added. An ID is created by multiplying the
sequence number with SEQ_MULTIPLIER and adding the product to the descriptor array index. The
sequence number used in creating a particular IPC ID is then stored in the corresponding descriptor. The
existence of the sequence number makes it possible to detect the use of a stale IPC ID.
ipc_checkid()
ipc_checkid() divides the given IPC ID by SEQ_MULTIPLIER and compares the quotient with the seq
value saved in the corresponding descriptor. If they are equal, then the IPC ID is considered to be valid and 1 is
returned. Otherwise, 0 is returned.
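The arithmetic behind ipc_buildid() and ipc_checkid() is easy to model in a stand-alone program (the helper names below are illustrative; SEQ_MULTIPLIER is IPCMNI, 32768, in the 2.4 headers):

#include <stdio.h>

#define SEQ_MULTIPLIER 32768

/* id = seq * SEQ_MULTIPLIER + index, as described above. */
static int build_id(int index, int seq) { return seq * SEQ_MULTIPLIER + index; }
static int id_to_index(int id)          { return id % SEQ_MULTIPLIER; }
static int id_to_seq(int id)            { return id / SEQ_MULTIPLIER; }

int main(void)
{
	int id = build_id(5, 3);                 /* descriptor slot 5, sequence 3 */
	printf("id=%d -> index=%d seq=%d\n", id, id_to_index(id), id_to_seq(id));

	/* ipc_checkid()-style staleness test: the seq encoded in the ID must
	 * match the seq stored in the descriptor at that index. */
	int stored_seq = 4;                      /* the slot has since been reused */
	printf("id %d is %s\n", id, id_to_seq(id) == stored_seq ? "valid" : "stale");
	return 0;
}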
grow_ary()
grow_ary() handles the possibility that the maximum (tunable) number of IDs for a given IPC type can be
dynamically changed. It enforces the current maximum limit so that it is no greater than the permanent
system limit (IPCMNI) and adjusts it down if necessary. It also ensures that the existing descriptor array is
large enough. If the existing array size is sufficiently large, then the current maximum limit is returned.
Otherwise, a new larger array is allocated, the old array is copied into the new array, and the old array is
freed. The corresponding global spinlock is held when updating the descriptor array for the given IPC type.
ipc_findkey()
ipc_findkey() searches the descriptor array of the specified ipc_ids object for the specified key. Once
found, the index of the corresponding descriptor is returned. If the key is not found,
then -1 is returned.
ipcperms()
ipcperms() checks the user, group, and other permissions for access to the IPC resources. It returns 0 if
permission is granted and -1 otherwise.
ipc_lock()
ipc_lock() takes an IPC ID as one of its parameters. It locks the global spinlock for the given IPC type, and
returns a pointer to the descriptor corresponding to the specified IPC ID.
ipc_unlock()
ipc_unlock() releases the global spinlock for the indicated IPC type.
ipc_lockall()
ipc_lockall() locks the global spinlock for the given IPC mechanism (i.e. shared memory, semaphores, and
messaging).
ipc_unlockall()
ipc_unlockall() unlocks the global spinlock for the given IPC mechanism (i.e. shared memory, semaphores,
and messaging).
ipc_get()
ipc_get() takes a pointer to a particular IPC type (i.e. shared memory, semaphores, or message queues) and
a descriptor ID, and returns a pointer to the corresponding IPC descriptor. Note that although the descriptors
for each IPC type are of different data types, the common kern_ipc_perm structure type is embedded as the
first entity in every case. The ipc_get() function returns this common data type. The expected model is that
ipc_get() is called through a wrapper function (e.g. shm_get()) which casts the data type to the correct
descriptor data type.
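The pattern can be illustrated with a stand-alone program; the struct layouts below are simplified stand-ins (not the kernel's definitions), and generic_get() plays the role of ipc_get(), only to show why the cast in a wrapper such as shm_get() is safe:

#include <stdio.h>

/* Simplified stand-ins for the kernel structures, only to show the idiom. */
struct kern_ipc_perm { int key; unsigned short mode; };

struct shmid_kernel {
	struct kern_ipc_perm shm_perm;   /* the common part must come first */
	unsigned long shm_segsz;
};

/* A generic helper trafficking only in the common header type... */
static struct kern_ipc_perm *generic_get(struct shmid_kernel *slot)
{
	return &slot->shm_perm;
}

int main(void)
{
	struct shmid_kernel shp = { { 1234, 0600 }, 65536 };
	struct kern_ipc_perm *p = generic_get(&shp);

	/* ...and a type-specific wrapper (the shm_get() role) casts back to the
	 * full descriptor, which is safe because shm_perm is at offset zero. */
	struct shmid_kernel *back = (struct shmid_kernel *)p;
	printf("key=%d size=%lu\n", back->shm_perm.key, back->shm_segsz);
	return 0;
}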
ipc_parse_version()
ipc_parse_version() removes the IPC_64 flag from the command if it is present and returns either IPC_64
or IPC_OLD.
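Following that description, the function amounts to a single flag test; a sketch (kernel context, after ipc/util.c):

#include <linux/ipc.h>   /* IPC_64, IPC_OLD */

int ipc_parse_version(int *cmd)
{
	if (*cmd & IPC_64) {
		*cmd ^= IPC_64;          /* strip the flag from the user command */
		return IPC_64;
	}
	return IPC_OLD;
}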
struct kern_ipc_perm
Each of the IPC descriptors has a data object of this type as the first element. This makes it possible to
access any descriptor from any of the generic IPC functions using a pointer of this data type.
struct ipc_ids
The ipc_ids structure describes the common data for semaphores, message queues, and shared memory.
There are three global instances of this data structure (sem_ids, msg_ids and shm_ids) for
semaphores, messages and shared memory respectively. In each instance, the sem semaphore is used to
protect access to the structure. The entries field points to an IPC descriptor array, and the ary spinlock
protects access to this array. The seq field is a global sequence number which will be incremented when a
new IPC resource is created.
struct ipc_ids {
int size;
int in_use;
int max_id;
unsigned short seq;
unsigned short seq_max;
struct semaphore sem;
spinlock_t ary;
struct ipc_id* entries;
};
struct ipc_id
An array of struct ipc_id exists in each instance of the ipc_ids structure. The array is dynamically allocated
and may be replaced with a larger array by grow_ary() as required. The array is sometimes referred to as the
descriptor array, since the kern_ipc_perm data type is used as the common descriptor data type by the IPC
generic functions.
struct ipc_id {
struct kern_ipc_perm* p;
};