Linux Doc en
Introduction 1.1
CodingStyle 1.2
Timer 1.3
Namespaces 1.5
Resource Control 1.5.1
Introduction
This project aims to translate Linux Documentation/ to Chinese.
CodingStyle
Contents
Chapter 1: Indentation
Chapter 2: Breaking long lines and strings
Chapter 3: Placing Braces and Spaces
3.1: Spaces
Chapter 4: Naming
Chapter 5: Typedefs
Chapter 6: Functions
Chapter 7: Centralized exiting of functions
Chapter 8: Commenting
Chapter 9: You've made a mess of it
Chapter 10: Kconfig configuration files
Chapter 11: Data structures
Chapter 12: Macros, Enums and RTL
Chapter 13: Printing kernel messages
Chapter 14: Allocating memory
Chapter 15: The inline disease
Chapter 16: Function return values and names
Chapter 17: Don't re-invent the kernel macros
Chapter 18: Editor modelines and other cruft
Chapter 19: Inline assembly
Appendix I: References
First off, I'd suggest printing out a copy of the GNU coding standards, and NOT read it. Burn them, it's a great symbolic
gesture.
Chapter 1: Indentation
Tabs are 8 characters, and thus indentations are also 8 characters. There are heretic movements that try to make
indentations 4 (or even 2!) characters deep, and that is akin to trying to define the value of PI to be 3.
Rationale: The whole idea behind indentation is to clearly define where a block of control starts and ends. Especially when
you've been looking at your screen for 20 straight hours, you'll find it a lot easier to see how the indentation works if you
have large indentations.
Now, some people will claim that having 8-character indentations makes the code move too far to the right, and makes it
hard to read on a 80-character terminal screen. The answer to that is that if you need more than 3 levels of indentation,
you're screwed anyway, and should fix your program.
In short, 8-char indents make things easier to read, and have the added benefit of warning you when you're nesting your
functions too deep. Heed that warning.
The preferred way to ease multiple indentation levels in a switch statement is to align the "switch" and its subordinate
"case" labels in the same column instead of "double-indenting" the "case" labels. E.g.:
	switch (suffix) {
	case 'G':
	case 'g':
		mem <<= 30;
		break;
	case 'M':
	case 'm':
		mem <<= 20;
		break;
	case 'K':
	case 'k':
		mem <<= 10;
		/* fall through */
	default:
		break;
	}
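As a rough, self-contained sketch of the same layout in a complete function: parse_size() below is a hypothetical userspace helper (not a kernel function) that reads a decimal number with an optional K/M/G suffix. The "case" labels sit in the same column as "switch".

```c
#include <stdlib.h>

/*
 * parse_size() is a hypothetical userspace helper, not a kernel function.
 * It reads a decimal number followed by an optional K/M/G suffix.
 */
unsigned long long parse_size(const char *str)
{
	char *end;
	unsigned long long mem = strtoull(str, &end, 10);

	switch (*end) {
	case 'G':
	case 'g':
		mem <<= 30;
		break;
	case 'M':
	case 'm':
		mem <<= 20;
		break;
	case 'K':
	case 'k':
		mem <<= 10;
		/* fall through */
	default:
		break;
	}

	return mem;
}
```

An unknown or missing suffix leaves the value unscaled, because the default label does nothing.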
Don't put multiple statements on a single line unless you have something to hide:
	if (condition) do_this;
	  do_something_everytime;
Don't put multiple assignments on a single line either. Kernel coding style is super simple. Avoid tricky expressions.
Outside of comments, documentation and except in Kconfig, spaces are never used for indentation, and the above
example is deliberately broken.
Get a decent editor and don't leave whitespace at the end of lines.
Chapter 2: Breaking long lines and strings
The limit on the length of lines is 80 columns and this is a strongly preferred limit.
Statements longer than 80 columns will be broken into sensible chunks, unless exceeding 80 columns significantly
increases readability and does not hide information. Descendants are always substantially shorter than the parent and are
placed substantially to the right. The same applies to function headers with a long argument list. However, never break
user-visible strings such as printk messages, because that breaks the ability to grep for them.
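Chapter 3: Placing Braces and Spaces
The other issue that always comes up in C styling is the placement of braces. The preferred way, as shown to us by the
prophets Kernighan and Ritchie, is to put the opening brace last on the line, and put the closing brace first, thusly:
	if (x is true) {
		we do y
	}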
This applies to all non-function statement blocks (if, switch, for, while, do). E.g.:
	switch (action) {
	case KOBJ_ADD:
		return "add";
	case KOBJ_REMOVE:
		return "remove";
	case KOBJ_CHANGE:
		return "change";
	default:
		return NULL;
	}
However, there is one special case, namely functions: they have the opening brace at the beginning of the next line, thus:
	int function(int x)
	{
		body of function
	}
Heretic people all over the world have claimed that this inconsistency is ... well ... inconsistent, but all right-thinking people
know that (a) K&R are right and (b) K&R are right. Besides, functions are special anyway (you can't nest them in C).
Note that the closing brace is empty on a line of its own, except in the cases where it is followed by a continuation of the
same statement, ie a "while" in a do-statement or an "else" in an if-statement, like this:
	do {
		body of do-loop
	} while (condition);
and
	if (x == y) {
		..
	} else if (x > y) {
		...
	} else {
		....
	}
Rationale: K&R.
Also, note that this brace-placement also minimizes the number of empty (or almost empty) lines, without any loss of
readability. Thus, as the supply of new-lines on your screen is not a renewable resource (think 25-line terminal screens
here), you have more empty lines to put comments on.
Do not unnecessarily use braces where a single statement will do.
	if (condition)
		action();
and
	if (condition)
		do_this();
	else
		do_that();
This does not apply if only one branch of a conditional statement is a single statement; in the latter case use braces in both
branches:
	if (condition) {
		do_this();
		do_that();
	} else {
		otherwise();
	}
3.1: Spaces
Linux kernel style for use of spaces depends (mostly) on function-versus-keyword usage. Use a space after (most)
keywords. The notable exceptions are sizeof, typeof, alignof, and __attribute__, which look somewhat like functions (and are
usually used with parentheses in Linux, although they are not required in the language, as in: "sizeof info" after "struct
fileinfo info;" is declared).
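So use a space after these keywords:
	if, switch, case, for, do, while
but not with sizeof, typeof, alignof, or __attribute__. E.g.,
	s = sizeof(struct file);
Do not add spaces around (inside) parenthesized expressions. This example is bad:
	s = sizeof( struct file );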
When declaring pointer data or a function that returns a pointer type, the preferred use of '*' is adjacent to the data name or
function name and not adjacent to the type name. Examples:
	char *linux_banner;
	unsigned long long memparse(char *ptr, char **retptr);
	char *match_strdup(substring_t *s);
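Use one space around (on each side of) most binary and ternary operators, such as any of these:
	=  +  -  <  >  *  /  %  |  &  ^  <=  >=  ==  !=  ?  :
but no space after unary operators:
	&  *  +  -  ~  !  sizeof  typeof  alignof  __attribute__  defined
no space before the postfix increment & decrement unary operators:
	++  --
no space after the prefix increment & decrement unary operators:
	++  --
and no space around the '.' and "->" structure member operators.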
Do not leave trailing whitespace at the ends of lines. Some editors with "smart" indentation will insert whitespace at the
beginning of new lines as appropriate, so you can start typing the next line of code right away. However, some such editors
do not remove the whitespace if you end up not putting a line of code there, such as if you leave a blank line. As a result,
you end up with lines containing trailing whitespace.
Git will warn you about patches that introduce trailing whitespace, and can optionally strip the trailing whitespace for you;
however, if applying a series of patches, this may make later patches in the series fail by changing their context lines.
Chapter 4: Naming
C is a Spartan language, and so should your naming be. Unlike Modula-2 and Pascal programmers, C programmers do not
use cute names like ThisVariableIsATemporaryCounter. A C programmer would call that variable "tmp", which is much
easier to write, and not the least more difficult to understand.
HOWEVER, while mixed-case names are frowned upon, descriptive names for global variables are a must. To call a global
function "foo" is a shooting offense.
GLOBAL variables (to be used only if you really need them) need to have descriptive names, as do global functions. If you
have a function that counts the number of active users, you should call that "count_active_users()" or similar; you should
not call it "cntusr()".
Encoding the type of a function into the name (so-called Hungarian notation) is brain damaged - the compiler knows the
types anyway and can check those, and it only confuses the programmer. No wonder MicroSoft makes buggy programs.
LOCAL variable names should be short, and to the point. If you have some random integer loop counter, it should probably
be called "i". Calling it "loop_counter" is non-productive, if there is no chance of it being mis-understood. Similarly, "tmp" can
be just about any type of variable that is used to hold a temporary value.
If you are afraid to mix up your local variable names, you have another problem, which is called the
function-growth-hormone-imbalance syndrome. See chapter 6 (Functions).
Chapter 5: Typedefs
Please don't use things like "vps_t".
It's a mistake to use typedef for structures and pointers. When you see a
	vps_t a;
in the source, what does it mean?
In contrast, if it says
	struct virtual_container *a;
you can actually tell what "a" is.
Lots of people think that typedefs "help readability". Not so. They are useful only for:
1. totally opaque objects (where the typedef is actively used to hide what the object is).
   Example: "pte_t" etc. opaque objects that you can only access using the proper accessor functions.
   NOTE! Opaqueness and "accessor functions" are not good in themselves. The reason we have them for things like
   pte_t etc. is that there really is absolutely zero portably accessible information there.
2. Clear integer types, where the abstraction helps avoid confusion whether it is "int" or "long".
   u8/u16/u32 are perfectly fine typedefs, although they fit into category 4 better than here.
   NOTE! Again - there needs to be a reason for this. If something is "unsigned long", then there's no reason to do
	typedef unsigned long myflags_t;
   but if there is a clear reason for why it under certain circumstances might be an "unsigned int" and under other
   configurations might be "unsigned long", then by all means go ahead and use a typedef.
3. when you use sparse to literally create a new type for type-checking.
4. New types which are identical to standard C99 types, in certain exceptional circumstances.
   Although it would only take a short amount of time for the eyes and brain to become accustomed to the standard types
   like 'uint32_t', some people object to their use anyway.
   Therefore, the Linux-specific 'u8/u16/u32/u64' types and their signed equivalents which are identical to standard types
   are permitted -- although they are not mandatory in new code of your own.
   When editing existing code which already uses one or the other set of types, you should conform to the existing
   choices in that code.
5. Types safe for use in userspace.
   In certain structures which are visible to userspace, we cannot require C99 types and cannot use the 'u32' form above.
   Thus, we use __u32 and similar types in all structures which are shared with userspace.
Maybe there are other cases too, but the rule should basically be to NEVER EVER use a typedef unless you can clearly
match one of those rules.
In general, a pointer, or a struct that has elements that can reasonably be directly accessed should never be a typedef.
Chapter 6: Functions
Functions should be short and sweet, and do just one thing. They should fit on one or two screenfuls of text (the ISO/ANSI
screen size is 80x24, as we all know), and do one thing and do that well.
The maximum length of a function is inversely proportional to the complexity and indentation level of that function. So, if
you have a conceptually simple function that is just one long (but simple) case-statement, where you have to do lots of
small things for a lot of different cases, it's OK to have a longer function.
However, if you have a complex function, and you suspect that a less-than-gifted first-year high-school student might not
even understand what the function is all about, you should adhere to the maximum limits all the more closely. Use helper
functions with descriptive names (you can ask the compiler to in-line them if you think it's performance-critical, and it will
probably do a better job of it than you would have done).
Another measure of the function is the number of local variables. They shouldn't exceed 5-10, or you're doing something
wrong. Re-think the function, and split it into smaller pieces. A human brain can generally easily keep track of about 7
different things, anything more and it gets confused. You know you're brilliant, but maybe you'd like to understand what you
did 2 weeks from now.
In source files, separate functions with one blank line. If the function is exported, the EXPORT* macro for it should follow
immediately after the closing function brace line. E.g.:
	int system_is_up(void)
	{
		return system_state == SYSTEM_RUNNING;
	}
	EXPORT_SYMBOL(system_is_up);
In function prototypes, include parameter names with their data types. Although this is not required by the C language, it is
preferred in Linux because it is a simple way to add valuable information for the reader.
Chapter 7: Centralized exiting of functions
Albeit deprecated by some people, the equivalent of the goto statement is used frequently by compilers in form of the
unconditional jump instruction.
The goto statement comes in handy when a function exits from multiple locations and some common work such as cleanup
has to be done.
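	int fun(int a)
	{
		int result = 0;
		/* kmalloc() takes a size and allocation flags */
		char *buffer = kmalloc(SIZE, GFP_KERNEL);

		if (buffer == NULL)
			return -ENOMEM;

		if (condition1) {
			while (loop1) {
				...
			}
			result = 1;
			goto out;
		}
		...
	out:
		/* single exit path: the buffer is always freed here */
		kfree(buffer);
		return result;
	}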
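The same pattern can be tried outside the kernel. The following is only a sketch under stated assumptions: count_char() is a hypothetical helper, and malloc()/free() stand in for kmalloc()/kfree(). Every path after the allocation leaves through the "out" label, so the buffer is freed exactly once.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * count_char() is a hypothetical helper used only for illustration.
 * It copies the input into a scratch buffer, counts occurrences of c,
 * and frees the buffer on the single exit path.
 */
int count_char(const char *s, char c)
{
	int result = 0;
	size_t len = strlen(s);
	char *buffer = malloc(len + 1);
	size_t i;

	if (buffer == NULL)
		return -ENOMEM;

	memcpy(buffer, s, len + 1);

	if (len == 0) {
		result = -EINVAL;
		goto out;
	}

	for (i = 0; i < len; i++) {
		if (buffer[i] == c)
			result++;
	}

out:
	free(buffer);
	return result;
}
```

Adding another early-exit condition later only needs another "goto out"; the cleanup code does not have to be duplicated.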
Chapter 8: Commenting
Comments are good, but there is also a danger of over-commenting. NEVER try to explain HOW your code works in a
comment: it's much better to write the code so that the working is obvious, and it's a waste of time to explain badly written
code.
Generally, you want your comments to tell WHAT your code does, not HOW. Also, try to avoid putting comments inside a
function body: if the function is so complex that you need to separately comment parts of it, you should probably go back to
chapter 6 for a while. You can make small comments to note or warn about something particularly clever (or ugly), but try to
avoid excess. Instead, put the comments at the head of the function, telling people what it does, and possibly WHY it does
it.
When commenting the kernel API functions, please use the kernel-doc format. See the files
Documentation/kernel-doc-nano-HOWTO.txt and scripts/kernel-doc for details.
Linux style for comments is the C89 "/* ... */" style. Don't use C99-style "// ..." comments.
The preferred style for long (multi-line) comments is:
	/*
	 * This is the preferred style for multi-line
	 * comments in the Linux kernel source code.
	 * Please use it consistently.
	 *
	 * Description:  A column of asterisks on the left side,
	 * with beginning and ending almost-blank lines.
	 */
For files in net/ and drivers/net/ the preferred style for long (multi-line) comments is a little different: it is nearly
the same as the generally preferred comment style, but there is no initial almost-blank line.
It's also important to comment data, whether they are basic types or derived types. To this end, use just one data
declaration per line (no commas for multiple data declarations). This leaves you room for a small comment on each item,
explaining its use.
Chapter 9: You've made a mess of it
So, you can either get rid of GNU emacs, or change it to use saner values. To do the latter, you can stick the following in
your .emacs file:
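;; Helper referenced by the "linux-tabs-only" style below
(defun c-lineup-arglist-tabs-only (ignored)
  "Line up argument lists by tabs, not spaces"
  (let* ((anchor (c-langelem-pos c-syntactic-element))
         (column (c-langelem-2nd-pos c-syntactic-element))
         (offset (- (1+ column) anchor))
         (steps (floor offset c-basic-offset)))
    (* (max steps 1)
       c-basic-offset)))

(add-hook 'c-mode-common-hook
          (lambda ()
            ;; Add kernel style
            (c-add-style
             "linux-tabs-only"
             '("linux" (c-offsets-alist
                        (arglist-cont-nonempty
                         c-lineup-gcc-asm-reg
                         c-lineup-arglist-tabs-only))))))

(add-hook 'c-mode-hook
          (lambda ()
            (let ((filename (buffer-file-name)))
              ;; Enable kernel mode for the appropriate files
              (when (and filename
                         (string-match (expand-file-name "~/src/linux-trees")
                                       filename))
                (setq indent-tabs-mode t)
                (c-set-style "linux-tabs-only")))))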
This will make emacs go better with the kernel coding style for C files below ~/src/linux-trees.
But even if you fail in getting emacs to do sane formatting, not everything is lost: use "indent".
Now, again, GNU indent has the same brain-dead settings that GNU emacs has, which is why you need to give it a few
command line options. However, that's not too bad, because even the makers of GNU indent recognize the authority of
K&R (the GNU people aren't evil, they are just severely misguided in this matter), so you just give indent the options
"-kr -i8" (stands for "K&R, 8 character indents"), or use "scripts/Lindent", which indents in the latest style.
"indent" has a lot of options, and especially when it comes to comment re-formatting you may want to take a look at the
man page. But remember: "indent" is not a fix for bad programming.
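Chapter 10: Kconfig configuration files
For all of the Kconfig* configuration files throughout the source tree, the indentation is somewhat different. Lines under a
"config" definition are indented with one tab, while help text is indented an additional two spaces. Example:
config AUDIT
	bool "Auditing support"
	depends on NET
	help
	  Enable auditing infrastructure that can be used with another
	  kernel subsystem, such as SELinux (which requires this for
	  logging of avc messages output).  Does not do system-call
	  auditing without CONFIG_AUDITSYSCALL.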
Seriously dangerous features (such as write support for certain filesystems) should advertise this prominently in their
prompt string:
config ADFS_FS_RW
	bool "ADFS write support (DANGEROUS)"
	depends on ADFS_FS
	...
For full documentation on the configuration files, see the file Documentation/kbuild/kconfig-language.txt.
Chapter 11: Data structures
Reference counting means that you can avoid locking, and allows multiple users to have access to the data structure in
parallel - and not having to worry about the structure suddenly going away from under them just because they slept or did
something else for a while.
Note that locking is not a replacement for reference counting. Locking is used to keep data structures coherent, while
reference counting is a memory management technique. Usually both are needed, and they are not to be confused with
each other.
Many data structures can indeed have two levels of reference counting, when there are users of different "classes". The
subclass count counts the number of subclass users, and decrements the global count just once when the subclass count
goes to zero.
Examples of this kind of "multi-level-reference-counting" can be found in memory management ("struct mm_struct":
mm_users and mm_count), and in filesystem code ("struct super_block": s_count and s_active).
Remember: if another thread can find your data structure, and you don't have a reference count on it, you almost certainly
have a bug.
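To make the idea concrete, here is a minimal userspace sketch of a reference-counted object using C11 atomics. It is not how the kernel implements reference counts; struct session and the session_*() functions are hypothetical names chosen for illustration.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical object; the names here are illustrative, not kernel APIs. */
struct session {
	atomic_int refcount;
	int id;
};

/* Allocate an object that starts with one reference held by the caller. */
struct session *session_create(int id)
{
	struct session *s = malloc(sizeof(*s));

	if (s == NULL)
		return NULL;

	atomic_init(&s->refcount, 1);
	s->id = id;
	return s;
}

/* Take an additional reference before handing the pointer to another user. */
void session_get(struct session *s)
{
	atomic_fetch_add(&s->refcount, 1);
}

/*
 * Drop a reference. Returns 1 if this was the last reference and the
 * object was freed, 0 otherwise.
 */
int session_put(struct session *s)
{
	if (atomic_fetch_sub(&s->refcount, 1) == 1) {
		free(s);
		return 1;
	}
	return 0;
}
```

Each user that can reach the structure calls session_get() first and session_put() when done, so the memory cannot disappear while anyone still holds a reference. Keeping the contents coherent is a separate job for locking.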
Chapter 12: Macros, Enums and RTL
CAPITALIZED macro names are appreciated but macros resembling functions may be named in lower case.
Generally, inline functions are preferable to macros resembling functions.
Macros with multiple statements should be enclosed in a do - while block:
	#define macrofun(a, b, c)			\
		do {					\
			if (a == 5)			\
				do_this(b, c);		\
		} while (0)
Things to avoid when using macros:
1. macros that affect control flow:
	#define FOO(x)					\
		do {					\
			if (blah(x) < 0)		\
				return -EBUGGERED;	\
		} while (0)
   is a very bad idea. It looks like a function call but exits the "calling" function; don't break the internal parsers of those
   who will read the code.
2. macros that depend on having a local variable with a magic name:
	#define FOO(val) bar(index, val)
   might look like a good thing, but it's confusing as hell when one reads the code and it's prone to breakage from
   seemingly innocent changes.
3. macros with arguments that are used as l-values: FOO(x) = y; will bite you if somebody e.g. turns FOO into an inline
   function.
4. forgetting about precedence: macros defining constants using expressions must enclose the expression in
   parentheses. Beware of similar issues with macros using parameters.
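The precedence problem is easy to reproduce. In the sketch below all macro and function names are made up for illustration; the "_BAD" variants omit the parentheses and silently compute something different once they are used inside a larger expression.

```c
/* Illustrative names only; none of these are kernel macros. */
#define CONSTANT	0x4000
#define CONSTEXP_BAD	CONSTANT | 3	/* missing parentheses */
#define CONSTEXP	(CONSTANT | 3)

#define DOUBLE_BAD(x)	x * 2		/* parameter not parenthesized */
#define DOUBLE(x)	((x) * 2)

/* Expands to 0x4000 | 3 * 2, i.e. 0x4000 | 6. */
int scaled_bad(void)
{
	return CONSTEXP_BAD * 2;
}

/* Expands to (0x4000 | 3) * 2. */
int scaled_good(void)
{
	return CONSTEXP * 2;
}

/* Expands to 1 + 2 * 2. */
int double_bad(void)
{
	return DOUBLE_BAD(1 + 2);
}

/* Expands to ((1 + 2) * 2). */
int double_good(void)
{
	return DOUBLE(1 + 2);
}
```

Because "*" binds tighter than "|" and "+", the unparenthesized forms return 0x4006 and 5 where 0x8006 and 6 were intended.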
The cpp manual deals with macros exhaustively. The gcc internals manual also covers RTL which is used frequently with
assembly language in the kernel.
Chapter 13: Printing kernel messages
There are a number of driver model diagnostic macros in <linux/device.h> which you should use to make sure messages are
matched to the right device and driver, and are tagged with the right level: dev_err(), dev_warn(), dev_info(), and so
forth. For messages that aren't associated with a particular device, <linux/printk.h> defines pr_debug() and pr_info().
Coming up with good debugging messages can be quite a challenge; and once you have them, they can be a huge help for
remote troubleshooting. Such messages should be compiled out when the DEBUG symbol is not defined (that is, by default
they are not included). When you use dev_dbg() or pr_debug(), that's automatic. Many subsystems have Kconfig options to
turn on -DDEBUG. A related convention uses VERBOSE_DEBUG to add dev_vdbg() messages to the ones already
enabled by DEBUG.
Chapter 14: Allocating memory
The preferred form for passing a size of a struct is the following:
	p = kmalloc(sizeof(*p), ...);
The alternative form where struct name is spelled out hurts readability and introduces an opportunity for a bug when the
pointer variable type is changed but the corresponding sizeof that is passed to a memory allocator is not.
Casting the return value which is a void pointer is redundant. The conversion from void pointer to any other pointer type is
guaranteed by the C programming language.
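The preferred form for allocating an array is the following:
	p = kmalloc_array(n, sizeof(...), ...);
The preferred form for allocating a zeroed array is the following:
	p = kcalloc(n, sizeof(...), ...);
Both forms check for overflow on the allocation size n * sizeof(...), and return NULL if that occurred.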
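The overflow check can be sketched in userspace as follows. alloc_array() is a hypothetical stand-in built on malloc(), not the kernel implementation; it only illustrates why the multiplication has to be checked before it is performed.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * alloc_array() is a hypothetical userspace stand-in, not the kernel
 * implementation. It refuses requests whose total size would overflow
 * size_t instead of silently allocating a too-small buffer.
 */
void *alloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;

	return malloc(n * size);
}
```

Without the check, n * size could wrap around to a small number, and the caller would then write past the end of a buffer it believes holds n elements.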
Chapter 15: The inline disease
A reasonable rule of thumb is to not put inline at functions that have more than 3 lines of code in them. An exception to this
rule is the case where a parameter is known to be a compile-time constant, and as a result of this constantness you know
the compiler will be able to optimize most of your function away at compile time. For a good example of this latter case, see
the kmalloc() inline function.
Often people argue that adding inline to functions that are static and used only once is always a win since there is no space
tradeoff. While this is technically correct, gcc is capable of inlining these automatically without help, and the maintenance
issue of removing the inline when a second user appears outweighs the potential value of the hint that tells gcc to do
something it would have done anyway.
Chapter 16: Function return values and names
Functions can return values of many different kinds, and one of the most common is a value indicating whether the function
succeeded or failed. Such a value can be represented as an error-code integer (-Exxx = failure, 0 = success) or a
"succeeded" boolean (0 = failure, non-zero = success).
Mixing up these two sorts of representations is a fertile source of difficult-to-find bugs. If the C language included a strong
distinction between integers and booleans then the compiler would find these mistakes for us... but it doesn't. To help
prevent such bugs, always follow this convention:
	If the name of a function is an action or an imperative command,
	the function should return an error-code integer.  If the name
	is a predicate, the function should return a "succeeded" boolean.
For example, "add work" is a command, and the add_work() function returns 0 for success or -EBUSY for failure. In the
same way, "PCI device present" is a predicate, and the pci_dev_present() function returns 1 if it succeeds in finding a
matching device or 0 if it doesn't.
All EXPORTed functions must respect this convention, and so should all public functions. Private (static) functions need not,
but it is recommended that they do.
Functions whose return value is the actual result of a computation, rather than an indication of whether the computation
succeeded, are not subject to this rule. Generally they indicate failure by returning some out-of-range result. Typical
examples would be functions that return pointers; they use NULL or the ERR_PTR mechanism to report failure.
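A small sketch of the two conventions side by side. The function name add_work() comes from the text above, but its body, struct work_queue and work_present() are assumptions made up for this example.

```c
#include <errno.h>
#include <stddef.h>

#define QUEUE_LEN 2

/* Hypothetical fixed-size queue, used only to illustrate the convention. */
struct work_queue {
	int items[QUEUE_LEN];
	size_t count;
};

/* Command: returns 0 on success or a negative error code on failure. */
int add_work(struct work_queue *q, int item)
{
	if (q->count == QUEUE_LEN)
		return -EBUSY;

	q->items[q->count++] = item;
	return 0;
}

/* Predicate: returns 1 if the item is queued, 0 if it is not. */
int work_present(const struct work_queue *q, int item)
{
	size_t i;

	for (i = 0; i < q->count; i++) {
		if (q->items[i] == item)
			return 1;
	}
	return 0;
}
```

A caller writes "if (add_work(...))" to handle failure but "if (work_present(...))" to handle success; the function name tells the reader which reading applies.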
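Chapter 17: Don't re-invent the kernel macros
The header file include/linux/kernel.h contains a number of macros that you should use, rather than explicitly coding some
variant of them yourself. For example, if you need to calculate the length of an array, take advantage of the macro
	#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
Similarly, if you need to calculate the size of some structure member, use
	#define FIELD_SIZEOF(t, f) (sizeof(((t*)0)->f))
There are also min() and max() macros that do strict type checking if you need them. Feel free to peruse that header file to
see what else is already defined that you shouldn't reproduce in your code.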
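For experimenting outside the kernel tree, simplified userspace versions of two such macros might look like the sketch below. The kernel's own ARRAY_SIZE() carries an additional compile-time check that the argument really is an array, which is omitted here; struct packet, primes, prime_count() and header_size() are hypothetical.

```c
#include <stddef.h>

/* Simplified forms for illustration; the kernel versions live in kernel.h. */
#define ARRAY_SIZE(x)		(sizeof(x) / sizeof((x)[0]))
#define FIELD_SIZEOF(t, f)	(sizeof(((t *)0)->f))

/* Hypothetical structure, used only for illustration. */
struct packet {
	unsigned char header[4];
	unsigned short length;
};

static const int primes[] = { 2, 3, 5, 7, 11 };

/* Number of elements in the array, independent of the element type. */
size_t prime_count(void)
{
	return ARRAY_SIZE(primes);
}

/* Size of one member without needing an instance of the structure. */
size_t header_size(void)
{
	return FIELD_SIZEOF(struct packet, header);
}
```

Both macros are evaluated at compile time; the null pointer in FIELD_SIZEOF() is never dereferenced because sizeof does not evaluate its operand.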
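Chapter 18: Editor modelines and other cruft
Some editors can interpret configuration information embedded in source files, indicated with special markers. For example,
emacs interprets lines marked like this:
	-*- mode: c -*-
Or like this:
	/*
	Local Variables:
	compile-command: "gcc -DMAGIC_DEBUG_FLAG foo.c"
	End:
	*/
Vim interprets markers that look like this:
	/* vim:set sw=8 noet */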
Do not include any of these in source files. People have their own personal editor configurations, and your source files
should not override them. This includes markers for indentation and mode configuration. People may use their own custom
mode, or may have some other magic method for making indentation work correctly.
Chapter 19: Inline assembly
Consider writing simple helper functions that wrap common bits of inline assembly, rather than repeatedly writing them with
slight variations. Remember that inline assembly can use C parameters.
Large, non-trivial assembly functions should go in .S files, with corresponding C prototypes defined in C header files. The C
prototypes for assembly functions should use "asmlinkage".
You may need to mark your asm statement as volatile, to prevent GCC from removing it if GCC doesn't notice any side
effects. You don't always need to do so, though, and doing so unnecessarily can limit optimization.
When writing a single inline assembly statement containing multiple instructions, put each instruction on a separate line in a
separate quoted string, and end each string except the last with \n\t to properly indent the next instruction in the assembly
output:
	asm ("magic %reg1, #42\n\t"
	     "more_magic %reg2, %reg3"
	     : /* outputs */ : /* inputs */ : /* clobbers */);
Appendix I: References
The C Programming Language, Second Edition
by Brian W. Kernighan and Dennis M. Ritchie. Prentice Hall, Inc., 1988. ISBN 0-13-110362-8 (paperback),
0-13-110370-9 (hardback).
The Practice of Programming
by Brian W. Kernighan and Rob Pike. Addison-Wesley, Inc., 1999. ISBN 0-201-61586-X.
GNU manuals - where in compliance with K&R and this text - for cpp, gcc, gcc internals and indent
WG14 is the international standardization working group for the programming language C
Timer
00-INDEX
- this file
highres.txt
- High resolution timers and dynamic ticks design notes
hpet.txt
- High Precision Event Timer Driver for Linux
hpet_example.c
- sample hpet timer test program
hrtimers.txt
- subsystem for high-resolution kernel timers
Makefile
- Build and link hpet_example
NO_HZ.txt
- Summary of the different methods for the scheduler clock-interrupts management.
timekeeping.txt
- Clock sources, clock events, sched_clock() and delay timer notes
timers-howto.txt
- how to insert delays in the kernel the right (tm) way.
timer_stats.txt
- timer usage statistics
Timer Usage Statistics
timer_stats should be used by kernel and userspace developers to verify that their code does not make undue use of
timers. This helps to avoid unnecessary wakeups and so reduces power consumption.
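timer_stats collects information about the timer events which are fired in a Linux system over a sample period:
- the pid of the task (process) which initialized the timer
- the name of the process which initialized the timer
- the function where the timer was initialized
- the callback function which is associated with the timer
- the number of events (callbacks)
timer_stats adds an entry to /proc: /proc/timer_stats
This entry is used to control the statistics functionality and to read out the sampled information.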
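To activate a sample period issue:
# echo 1 >/proc/timer_stats
To stop a sample period issue:
# echo 0 >/proc/timer_stats
The statistics can be retrieved by:
# cat /proc/timer_stats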
While sampling is enabled, each readout from /proc/timer_stats will see newly updated statistics. Once sampling is
disabled, the sampled information is kept until a new sample period is started. This allows multiple readouts.
The first column is the number of events, the second column the pid, the third column is the name of the process. The fourth
column shows the function which initialized the timer and in parentheses the callback function which was executed on
expiry.
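A tool that post-processes this output has to split each line into those four columns. The sketch below shows one way to do that in C; the line layout it assumes ("events, pid name init_function (callback)", with an optional "D" after the event count for deferrable timers) is inferred from the column description, and struct timer_stat and parse_timer_stat() are hypothetical names.

```c
#include <stdio.h>

/* Hypothetical record type for one line of /proc/timer_stats output. */
struct timer_stat {
	unsigned long events;
	int deferrable;
	int pid;
	char comm[32];
	char init_fn[64];
	char callback[64];
};

/* Returns 0 on success or -1 if the line does not match the layout. */
int parse_timer_stat(const char *line, struct timer_stat *ts)
{
	/* Try the deferrable form first: the count is followed by 'D'. */
	if (sscanf(line, " %luD, %d %31s %63s (%63[^)])",
		   &ts->events, &ts->pid, ts->comm,
		   ts->init_fn, ts->callback) == 5) {
		ts->deferrable = 1;
		return 0;
	}

	if (sscanf(line, " %lu, %d %31s %63s (%63[^)])",
		   &ts->events, &ts->pid, ts->comm,
		   ts->init_fn, ts->callback) == 5) {
		ts->deferrable = 0;
		return 0;
	}

	return -1;
}
```

The sketch assumes process names contain no spaces; header and summary lines in the output simply fail to parse and can be skipped.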
Thomas, Ingo
Added flag to indicate 'deferrable timer' in /proc/timer_stats. A deferrable timer will appear as follows:
  10D,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
Input Device
Contents
0. Disclaimer
1. Introduction
1.1 Device drivers
1.2 Event handlers
2. Simple Usage
3. Detailed Description
3.1 Device drivers
3.1.1 usbhid
3.1.2 usbmouse
3.1.3 usbkbd
3.1.4 wacom
3.1.5 iforce
3.2 Event handlers
3.2.1 keybdev
3.2.2 mousedev
3.2.3 joydev
3.2.4 evdev
4. Verifying if it works
5. Event interface
Sponsored by SuSE
0. Disclaimer
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
Should you need to contact me, the author, you can do so either by e-mail (mail your message to <[email protected]>)
or by paper mail: Vojtech Pavlik, Simunkova 1594, Prague 8, 182 00 Czech Republic
For your convenience, the GNU General Public License version 2 is included in the package: See the file COPYING.
1. Introduction
This is a collection of drivers that is designed to support all input devices under Linux. While it is currently used only for
USB input devices, future use (say 2.5/2.6) is expected to expand to replace most of the existing input system, which is
why it lives in drivers/input/ instead of drivers/usb/.
The centre of the input drivers is the input module, which must be loaded before any other of the input modules - it serves
as a way of communication between two groups of modules:
1.1 Device drivers
These modules talk to the hardware (for example via USB), and provide events (keystrokes, mouse movements) to the input
module.
1.2 Event handlers
These modules get events from input and pass them where needed via various interfaces - keystrokes to the kernel, mouse
movements via a simulated PS/2 interface to GPM and X and so on.
2. Simple Usage
For the most usual configuration, with one USB mouse and one USB keyboard, you'll have to load the following modules
(or have them built in to the kernel):
input
mousedev
keybdev
usbcore
uhci_hcd or ohci_hcd or ehci_hcd
usbhid
After this, the USB keyboard will work straight away, and the USB mouse will be available as a character device on major
13, minor 63. If the device node does not exist yet, create it by hand:
	cd /dev
	mkdir input
	mknod input/mice c 13 63
After that you have to point GPM (the textmode mouse cut&paste tool) and XFree to this device to use it - GPM should be
called like:
	gpm -t ps2 -m /dev/input/mice
And in X:
	Section "Pointer"
	    Protocol    "ImPS/2"
	    Device      "/dev/input/mice"
	    ZAxisMapping 4 5
	EndSection
When you do all of the above, you can use your USB mouse and keyboard.
3. Detailed Description
3.1 Device drivers
Device drivers are the modules that generate events. The events are however not useful without being handled, so you
also will need to use some of the modules from section 3.2.
3.1.1 usbhid
usbhid is the largest and most complex driver of the whole suite. It handles all HID devices, and because there is a very
wide variety of them, and because the USB HID specification isn't simple, it needs to be this big.
Currently, it handles USB mice, joysticks, gamepads, steering wheels, keyboards, trackballs and digitizers.
However, USB uses HID also for monitor controls, speaker controls, UPSs, LCDs and many other purposes.
The monitor and speaker controls should be easy to add to the hid/input interface, but for the UPSs and LCDs it doesn't
make much sense. For this, the hiddev interface was designed. See Documentation/hid/hiddev.txt for more information
about it.
The usage of the usbhid module is very simple: it takes no parameters and detects everything automatically, and when a HID
device is inserted, it detects it appropriately.
However, because the devices vary wildly, you might happen to have a device that doesn't work well. In that case #define
DEBUG at the beginning of hid-core.c and send me the syslog traces.
3.1.2 usbmouse
For embedded systems, for mice with broken HID descriptors and just any other use when the big usbhid wouldn't be a
good choice, there is the usbmouse driver. It handles USB mice only. It uses a simpler HIDBP protocol. This also means
the mice must support this simpler protocol. Not all do. If you don't have any strong reason to use this module, use usbhid
instead.
3.1.3 usbkbd
Much like usbmouse, this module talks to keyboards with a simplified HIDBP protocol. It's smaller, but doesn't support any
extra special keys. Use usbhid instead if there isn't any special reason to use this.
3.1.4 wacom
This is a driver for Wacom Graphire and Intuos tablets. Not for Wacom PenPartner, that one is handled by the HID driver.
Although the Intuos and Graphire tablets claim that they are HID tablets as well, they are not and thus need this specific
driver.
3.1.5 iforce
A driver for I-Force joysticks and wheels, both over USB and RS232. It includes ForceFeedback support now, even though
Immersion Corp. considers the protocol a trade secret and won't disclose a word about it.
3.2 Event handlers
3.2.1 keybdev
keybdev is currently a rather ugly hack that translates the input events into architecture-specific keyboard raw mode (Xlated
AT Set2 on x86), and passes them into the handle_scancode function of the keyboard.c module. This works well enough on
all architectures that keybdev can generate rawmode on; other architectures can be added to it.
The right way would be to pass the events to keyboard.c directly, best if keyboard.c would itself be an event handler. This is
done in the input patch, available on the webpage mentioned below.
3.2.2 mousedev
mousedev is also a hack to make programs that use mouse input work. It takes events from either mice or digitizers/tablets
and makes a PS/2-style (a la /dev/psaux) mouse device available to the userland. Ideally, the programs could use a more
reasonable interface, for example evdev.
Each 'mouse' device is assigned to a single mouse or digitizer, except the last one - 'mice'. This single character device is
shared by all mice and digitizers, and even if none are connected, the device is present. This is useful for hotplugging USB
mice, so that programs can open the device even when no mice are present.
CONFIG_INPUT_MOUSEDEV_SCREEN_[XY] in the kernel configuration are the size of your screen (in pixels) in XFree86. This is
needed if you want to use your digitizer in X, because its movement is sent to X via a virtual PS/2 mouse and thus needs to
be scaled accordingly. These values won't be used if you use a mouse only.
Mousedev will generate either PS/2, ImPS/2 (Microsoft IntelliMouse) or ExplorerPS/2 (IntelliMouse Explorer) protocols,
depending on what the program reading the data wishes. You can set GPM and X to any of these. You'll need ImPS/2 if you
want to make use of a wheel on a USB mouse and ExplorerPS/2 if you want to use extra (up to 5) buttons.
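As an illustration of what a program reading the plain PS/2 protocol from a mousedev node has to do, here is a minimal userspace sketch. It assumes the conventional 3-byte PS/2 packet layout (button and sign bits in the first byte, then X and Y deltas), which is not spelled out in this document; ps2_decode() and struct ps2_report are hypothetical names used only for this example. ImPS/2 and ExplorerPS/2 packets carry an extra byte and are not handled.

```c
#include <stdint.h>

/* Decoded form of one plain PS/2 packet (hypothetical helper type). */
struct ps2_report {
	int left, right, middle;	/* button states, 0 or 1 */
	int dx, dy;			/* signed relative movement */
};

/*
 * Decode one 3-byte PS/2 packet, assuming the conventional layout:
 *   byte 0: bit0 left, bit1 right, bit2 middle, bit3 always 1,
 *           bit4 X sign, bit5 Y sign
 *   byte 1: X movement (low 8 bits), byte 2: Y movement (low 8 bits)
 * Returns 0 on success, -1 if the always-one bit is missing.
 */
int ps2_decode(const uint8_t pkt[3], struct ps2_report *out)
{
	if (!(pkt[0] & 0x08))
		return -1;

	out->left = (pkt[0] & 0x01) != 0;
	out->right = (pkt[0] & 0x02) != 0;
	out->middle = (pkt[0] & 0x04) != 0;

	/* The sign bits extend the 8-bit deltas to 9-bit signed values. */
	out->dx = (int)pkt[1] - ((pkt[0] & 0x10) ? 256 : 0);
	out->dy = (int)pkt[2] - ((pkt[0] & 0x20) ? 256 : 0);
	return 0;
}
```

A reader would call ps2_decode() on every three bytes obtained from /dev/input/mouseN or /dev/input/mice; a return of -1 suggests the stream is out of sync.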
3.2.3 joydev
Joydev implements the v0.x and v1.x Linux joystick API, much like drivers/char/joystick/joystick.c used to in earlier versions.
See joystick-api.txt in the Documentation subdirectory for details. As soon as any joystick is connected, it can be accessed
in /dev/input as js0, js1, and so on up to js31.
3.2.4 evdev
evdev is the generic input event interface. It passes the events generated in the kernel straight to the program, with
timestamps. The API is still evolving, but should be usable now. It's described in section 5.
This should be the way for GPM and X to get keyboard and mouse events. It allows for multihead in X without any specific
multihead kernel support. The event codes are the same on all architectures and are hardware independent.
The devices appear in /dev/input as event0, event1, and so on up to event31.
4. Verifying if it works
Typing a couple keys on the keyboard should be enough to check that a USB keyboard works and is correctly connected to
the kernel keyboard driver.
Doing a "cat /dev/input/mouse0" (c, 13, 32) will verify that a mouse is also emulated; characters should appear if you move
it.
You can test the joystick emulation with the 'jstest' utility, available in the joystick package (see
Documentation/input/joystick.txt).
You can test the event devices with the 'evtest' utility available in the LinuxConsole project CVS archive (see the URL
below).
5. Event interface
Should you want to add event device support into any application (X, gpm, svgalib ...) I <[email protected]> will be happy to
provide you any help I can. Here goes a description of the current state of things, which is going to be extended, but not
changed incompatibly, as time goes on:
You can use blocking and nonblocking reads, also select() on the /dev/input/eventX devices, and you'll always get a whole
number of input events on a read. Their layout is:
struct input_event {
	struct timeval time;
	unsigned short type;
	unsigned short code;
	unsigned int value;
};
'time' is the timestamp; it returns the time at which the event happened. 'type' is for example EV_REL for relative movement,
EV_KEY for a keypress or release. More types are defined in include/linux/input.h.
'code' is event code, for example REL_X or KEY_BACKSPACE, again a complete list is in include/linux/input.h.
'value' is the value the event carries. Either a relative change for EV_REL, absolute new value for EV_ABS (joysticks ...), or
0 for EV_KEY for release, 1 for keypress and 2 for autorepeat.
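The fields described above can be interpreted in userspace roughly as follows. This is a minimal sketch, not a reference implementation: describe_event() and dump_events() are hypothetical helper names, the struct and the EV_* constants come from the <linux/input.h> header exported to userspace, and the descriptor passed to dump_events() is assumed to be an already opened /dev/input/eventX node.

```c
#include <linux/input.h>
#include <stdio.h>
#include <unistd.h>

/* Format one event as text in buf and return buf. */
const char *describe_event(const struct input_event *ev, char *buf, size_t len)
{
	switch (ev->type) {
	case EV_KEY:
		/* 0 = release, 1 = keypress, 2 = autorepeat */
		snprintf(buf, len, "key %u %s", (unsigned int)ev->code,
			 ev->value == 0 ? "released" :
			 ev->value == 1 ? "pressed" : "autorepeat");
		break;
	case EV_REL:
		snprintf(buf, len, "rel %u by %d", (unsigned int)ev->code,
			 (int)ev->value);
		break;
	case EV_ABS:
		snprintf(buf, len, "abs %u now %d", (unsigned int)ev->code,
			 (int)ev->value);
		break;
	default:
		snprintf(buf, len, "type %u code %u value %d",
			 (unsigned int)ev->type, (unsigned int)ev->code,
			 (int)ev->value);
		break;
	}
	return buf;
}

/* Print events from an open event device until EOF or a read error. */
void dump_events(int fd)
{
	struct input_event ev[16];
	char buf[64];
	ssize_t n;

	/* A read always returns a whole number of input events. */
	while ((n = read(fd, ev, sizeof(ev))) >= (ssize_t)sizeof(ev[0])) {
		size_t i, count = (size_t)n / sizeof(ev[0]);

		for (i = 0; i < count; i++)
			puts(describe_event(&ev[i], buf, sizeof(buf)));
	}
}
```

With a blocking descriptor, dump_events() simply waits in read(); a program that uses select() or nonblocking reads would call describe_event() from its own loop instead.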
Programming input drivers
Contents
1. Creating an input device driver
1.0 The simplest example
1.1 What the example does
1.2 dev->open() and dev->close()
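#include <linux/input.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/interrupt.h>
#include <asm/irq.h>
#include <asm/io.h>

/* BUTTON_IRQ and BUTTON_PORT are placeholders for the device's IRQ line and I/O port. */
static struct input_dev *button_dev;

static irqreturn_t button_interrupt(int irq, void *dummy)
{
	input_report_key(button_dev, BTN_0, inb(BUTTON_PORT) & 1);
	input_sync(button_dev);
	return IRQ_HANDLED;
}

static int __init button_init(void)
{
	int error;

	if (request_irq(BUTTON_IRQ, button_interrupt, 0, "button", NULL)) {
		printk(KERN_ERR "button.c: Can't allocate irq %d\n", BUTTON_IRQ);
		return -EBUSY;
	}
	button_dev = input_allocate_device();
	if (!button_dev) {
		printk(KERN_ERR "button.c: Not enough memory\n");
		error = -ENOMEM;
		goto err_free_irq;
	}
	button_dev->evbit[0] = BIT_MASK(EV_KEY);
	button_dev->keybit[BIT_WORD(BTN_0)] = BIT_MASK(BTN_0);
	error = input_register_device(button_dev);
	if (error) {
		printk(KERN_ERR "button.c: Failed to register device\n");
		goto err_free_dev;
	}
	return 0;

err_free_dev:
	input_free_device(button_dev);
err_free_irq:
	free_irq(BUTTON_IRQ, NULL);	/* dev_id must match the one given to request_irq() */
	return error;
}

static void __exit button_exit(void)
{
	input_unregister_device(button_dev);
	free_irq(BUTTON_IRQ, NULL);
}

module_init(button_init);
module_exit(button_exit);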
In the _init function, which is called either upon module load or when booting the kernel, it grabs the required resources
(it should also check for the presence of the device).
Then it allocates a new input device structure with input_allocate_device() and sets up input bitfields. This way the device
driver tells the other parts of the input system what it is - what events can be generated or accepted by this input device.
Our example device can only generate EV_KEY type events, and from those only BTN_0 event code. Thus we only set
these two bits. We could have used
set_bit(EV_KEY, button_dev->evbit);
set_bit(BTN_0, button_dev->keybit);
as well, but with more than single bits the first approach tends to be shorter.
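To make the bit arithmetic behind the two approaches concrete, here is a userspace sketch. The BIT_MASK(), BIT_WORD() and BITS_TO_LONGS() macros below are local re-creations written for this example (the kernel's own definitions live in kernel headers that userspace cannot include), struct caps is a hypothetical stand-in for the bitfields of the input device structure, and caps_set() only imitates set_bit() without its atomicity.

```c
#include <limits.h>
#include <linux/input.h>

/* Local re-creations of the kernel helpers, for illustration only. */
#define BITS_PER_LONG		(sizeof(unsigned long) * CHAR_BIT)
#define BIT_MASK(nr)		(1UL << ((nr) % BITS_PER_LONG))
#define BIT_WORD(nr)		((nr) / BITS_PER_LONG)
#define BITS_TO_LONGS(n)	(((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Hypothetical stand-in for the capability bitfields of an input device. */
struct caps {
	unsigned long evbit[BITS_TO_LONGS(EV_CNT)];
	unsigned long keybit[BITS_TO_LONGS(KEY_CNT)];
};

/* Non-atomic imitation of set_bit(): set bit nr in a long-array bitmap. */
void caps_set(unsigned long *bitmap, unsigned int nr)
{
	bitmap[BIT_WORD(nr)] |= BIT_MASK(nr);
}

/* Return 1 if bit nr is set in the bitmap, 0 otherwise. */
int caps_test(const unsigned long *bitmap, unsigned int nr)
{
	return (bitmap[BIT_WORD(nr)] & BIT_MASK(nr)) != 0;
}

/* The direct-assignment style used by the example driver. */
void setup_button_caps(struct caps *c)
{
	c->evbit[0] = BIT_MASK(EV_KEY);
	c->keybit[BIT_WORD(BTN_0)] = BIT_MASK(BTN_0);
}
```

With several bits in the same word, the assignment style becomes a single expression such as BIT_MASK(EV_KEY) | BIT_MASK(EV_REL), whereas the set_bit() style needs one call per bit.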
Then the example driver registers the input device structure by calling
input_register_device(button_dev);
This adds the button_dev structure to linked lists of the input driver and calls device handler modules _connect functions
to tell them a new input device has appeared. input_register_device() may sleep and therefore must not be called from
an interrupt or with a spinlock held.
While in use, the only function of the driver that runs is button_interrupt(), which upon every interrupt from the button
checks its state and reports it via the input_report_key() call to the input system. There is no need to check whether the
interrupt routine isn't reporting two same value events (press, press for example) to the input system, because the
input_report_* functions check that themselves.
Then there is the input_sync() call to tell those who receive the events that we've sent a complete report. This doesn't seem
important in the one button case, but is quite important for, for example, mouse movement, where you don't want the X and Y
values to be interpreted separately, because that'd result in a different movement.
return 0;
}
Note that input core keeps track of number of users for the device and makes sure that dev->open() is called only when
the first user connects to the device and that dev->close() is called when the very last user disconnects. Calls to both
callbacks are serialized.
The open() callback should return a 0 in case of success or any nonzero value in case of failure. The close() callback
(which is void) must always succeed.
See linux/input.h for the allowable values of code (from 0 to KEY_MAX). value is interpreted as a truth value, i.e. any
nonzero value means key pressed, zero value means key released. The input code generates events only in case the
value is different from before.
In addition to EV_KEY, there are two more basic event types: EV_REL and EV_ABS. They are used for relative and
absolute values supplied by the device. A relative value may be for example a mouse movement in the X axis. The mouse
reports it as a relative difference from the last position, because it doesn't have any absolute coordinate system to work in.
Absolute events are namely for joysticks and digitizers - devices that do work in an absolute coordinate system.
Having the device report EV_REL events is as simple as with EV_KEY: simply set the corresponding bits and call the
input_report_rel() function.
However EV_ABS requires a little special care. Before calling input_register_device, you have to fill additional fields in the
input_dev struct for each absolute axis your device has. If our button device had also the ABS_X axis:
button_dev->absmin[ABS_X] = 0;
button_dev->absmax[ABS_X] = 255;
button_dev->absfuzz[ABS_X] = 4;
button_dev->absflat[ABS_X] = 8;
This setting would be appropriate for a joystick X axis, with the minimum of 0, maximum of 255 (which the joystick must be
able to reach, no problem if it sometimes reports more, but it must be able to always reach the min and max values), with
noise in the data up to +- 4, and with a center flat position of size 8.
If you don't need absfuzz and absflat, you can set them to zero, which means that the thing is precise and always returns to
exactly the center position (if it has any).
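To give a feel for what a fuzz value is for, the following userspace sketch filters a noisy absolute axis. It is only a model written for this document: the function name defuzz() and the three thresholds (half the fuzz, the fuzz, twice the fuzz) are assumptions chosen to illustrate the idea, not a statement of what the input core does, and absflat is left out entirely.

```c
/*
 * Smooth a new absolute reading against the previously reported one.
 * Small changes are treated as noise, medium changes are damped, and
 * large changes pass through unchanged. A fuzz of zero disables filtering.
 */
int defuzz(int value, int old_val, int fuzz)
{
	if (fuzz) {
		/* Within half the fuzz: keep the old value. */
		if (value > old_val - fuzz / 2 && value < old_val + fuzz / 2)
			return old_val;

		/* Within the fuzz: move a quarter of the way. */
		if (value > old_val - fuzz && value < old_val + fuzz)
			return (old_val * 3 + value) / 4;

		/* Within twice the fuzz: move half of the way. */
		if (value > old_val - fuzz * 2 && value < old_val + fuzz * 2)
			return (old_val + value) / 2;
	}
	return value;
}
```

With the fuzz of 4 from the example above and a previous value of 100, a reading of 101 is reported as 100, while a reading of 110 is passed through as it is.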
The id* fields contain the bus ID (PCI, USB, ...), vendor ID and device ID of the device. The bus IDs are defined in
input.h. The vendor and device ids are defined in pci_ids.h, usb_ids.h and similar include files. These fields should be set
by the input device driver before registering it.
The idtype field can be used for specific information for the input device driver.
The id and name fields can be passed to userland via the evdev interface.
Userspace can query and alter current scancode to keycode mappings using EVIOCGKEYCODE and EVIOCSKEYCODE
ioctls on corresponding evdev interface. When a device has all 3 aforementioned fields filled in, the driver may rely on
kernel's default implementation of setting and querying keycode mappings.
They are very similar to for example key events, but they go in the other direction - from the system to the input device
driver. If your input device driver can handle these events, it has to set the respective bits in evbit, and also the callback
routine:
button_dev->event = button_event;

/* Handles events sent to the device; returns 0 if the event was handled, -1 otherwise. */
int button_event(struct input_dev *dev, unsigned int type, unsigned int code, int value)
{
	if (type == EV_SND && code == SND_BELL) {
		outb(value, BUTTON_BELL);
		return 0;
	}
	return -1;
}
This callback routine can be called from an interrupt or a BH (although that isn't a rule), and thus must not sleep, and must
not take too long to finish.
Multi Touch Protocol
Contents
Introduction
Protocol Usage
Protocol Example A
Protocol Example B
Event Usage
Event Semantics
Event Computation
Finger Tracking
Gestures
Notes
Introduction
In order to utilize the full power of the new multi-touch and multi-user devices, a way to report detailed data from multiple
contacts, i.e., objects in direct contact with the device surface, is needed. This document describes the multi-touch (MT)
protocol which allows kernel drivers to report details for an arbitrary number of contacts.
The protocol is divided into two types, depending on the capabilities of the hardware. For devices handling anonymous
contacts (type A), the protocol describes how to send the raw data for all contacts to the receiver. For devices capable of
tracking identifiable contacts (type B), the protocol describes how to send updates for individual contacts via event slots.
Protocol Usage
Contact details are sent sequentially as separate packets of ABS_MT events. Only the ABS_MT events are recognized as
part of a contact packet. Since these events are ignored by current single-touch (ST) applications, the MT protocol can be
implemented on top of the ST protocol in an existing driver.
Drivers for type A devices separate contact packets by calling input_mt_sync() at the end of each packet. This generates a
SYN_MT_REPORT event, which instructs the receiver to accept the data for the current contact and prepare to receive another.
Drivers for type B devices separate contact packets by calling input_mt_slot() , with a slot as argument, at the beginning
of each packet. This generates an ABS_MT_SLOT event, which instructs the receiver to prepare for updates of the given slot.
All drivers mark the end of a multi-touch transfer by calling the usual input_sync() function. This instructs the receiver to
act upon events accumulated since last EV_SYN / SYN_REPORT and prepare to receive a new set of events/packets.
The main difference between the stateless type A protocol and the stateful type B slot protocol lies in the usage of
identifiable contacts to reduce the amount of data sent to userspace. The slot protocol requires the use of the
ABS_MT_TRACKING_ID , either provided by the hardware or computed from the raw data [5].
For type A devices, the kernel driver should generate an arbitrary enumeration of the full set of anonymous contacts
currently on the surface. The order in which the packets appear in the event stream is not important. Event filtering and
finger tracking are left to user space [3].
For type B devices, the kernel driver should associate a slot with each identified contact, and use that slot to propagate
changes for the contact. Creation, replacement and destruction of contacts is achieved by modifying the
ABS_MT_TRACKING_ID of the associated slot. A non-negative tracking id is interpreted as a contact, and the value -1 denotes
an unused slot. A tracking id not previously present is considered new, and a tracking id no longer present is considered
removed. Since only changes are propagated, the full state of each initiated contact has to reside in the receiving end.
Upon receiving an MT event, one simply updates the appropriate attribute of the current slot.
Some devices identify and/or track more contacts than they can report to the driver. A driver for such a device should
associate one type B slot with each contact that is reported by the hardware. Whenever the identity of the contact
associated with a slot changes, the driver should invalidate that slot by changing its ABS_MT_TRACKING_ID . If the hardware
signals that it is tracking more contacts than it is currently reporting, the driver should use a BTN_TOOL_*TAP event to inform
userspace of the total number of contacts being tracked by the hardware at that moment. The driver should do this by
explicitly sending the corresponding BTN_TOOL_*TAP event and setting use_count to false when calling
input_mt_report_pointer_emulation() . The driver should only advertise as many slots as the hardware can report.
Userspace can detect that a driver can report more total contacts than slots by noting that the largest supported
BTN_TOOL_*TAP event is larger than the total number of type B slots reported in the absinfo for the ABS_MT_SLOT axis.
Protocol Example A
Here is what a minimal event sequence for a two-contact touch would look like for a type A device:
ABS_MT_POSITION_X x[0]
ABS_MT_POSITION_Y y[0]
SYN_MT_REPORT
ABS_MT_POSITION_X x[1]
ABS_MT_POSITION_Y y[1]
SYN_MT_REPORT
SYN_REPORT
The sequence after moving one of the contacts looks exactly the same; the raw data for all present contacts are sent
between every synchronization with SYN_REPORT .
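Here is the sequence after lifting the first contact:
ABS_MT_POSITION_X x[1]
ABS_MT_POSITION_Y y[1]
SYN_MT_REPORT
SYN_REPORT
And here is the sequence after lifting the second contact:
SYN_MT_REPORT
SYN_REPORT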
If the driver reports one of BTN_TOUCH or ABS_PRESSURE in addition to the ABS_MT events, the last SYN_MT_REPORT event
may be omitted. Otherwise, the last SYN_REPORT will be dropped by the input core, resulting in no zero-contact event
reaching userland.
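On the receiving side, a type A stream like the ones above can be collected into per-frame contact lists. The sketch below is one possible userspace approach, not the method of any particular library: struct typea_parser, struct frame, typea_feed() and the limit MAX_CONTACTS are hypothetical names introduced here, and only the two position events are handled.

```c
#include <linux/input.h>

#define MAX_CONTACTS 10		/* arbitrary limit for this sketch */

struct contact { int x, y; };

struct frame {
	struct contact c[MAX_CONTACTS];
	int n;			/* number of contacts in the frame */
};

struct typea_parser {
	struct contact cur;	/* contact packet being assembled */
	int have_data;		/* nonzero once cur received any ABS_MT event */
	struct frame pending;	/* contacts collected since the last SYN_REPORT */
};

/*
 * Feed one event to the parser. Returns 1 and fills *out when a SYN_REPORT
 * completes a frame, 0 otherwise. The parser must be zero-initialized.
 */
int typea_feed(struct typea_parser *p, const struct input_event *ev,
	       struct frame *out)
{
	if (ev->type == EV_ABS && ev->code == ABS_MT_POSITION_X) {
		p->cur.x = ev->value;
		p->have_data = 1;
	} else if (ev->type == EV_ABS && ev->code == ABS_MT_POSITION_Y) {
		p->cur.y = ev->value;
		p->have_data = 1;
	} else if (ev->type == EV_SYN && ev->code == SYN_MT_REPORT) {
		/* End of one contact packet; an empty packet adds nothing. */
		if (p->have_data && p->pending.n < MAX_CONTACTS)
			p->pending.c[p->pending.n++] = p->cur;
		p->have_data = 0;
	} else if (ev->type == EV_SYN && ev->code == SYN_REPORT) {
		/* End of the whole transfer: hand over the frame and reset. */
		*out = p->pending;
		p->pending.n = 0;
		p->have_data = 0;
		return 1;
	}
	return 0;
}
```

Because type A contacts are anonymous, each completed frame replaces the previous one entirely; matching contacts between frames is a separate step.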
Protocol Example B
Here is what a minimal event sequence for a two-contact touch would look like for a type B device:
ABS_MT_SLOT 0
ABS_MT_TRACKING_ID 45
ABS_MT_POSITION_X x[0]
ABS_MT_POSITION_Y y[0]
ABS_MT_SLOT 1
ABS_MT_TRACKING_ID 46
ABS_MT_POSITION_X x[1]
ABS_MT_POSITION_Y y[1]
SYN_REPORT
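Here is the sequence after moving contact 45 in the x direction:
ABS_MT_SLOT 0
ABS_MT_POSITION_X x[0]
SYN_REPORT
Here is the sequence after lifting the contact in slot 0:
ABS_MT_TRACKING_ID -1
SYN_REPORT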
The slot being modified is already 0, so the ABS_MT_SLOT is omitted. The message removes the association of slot 0 with
contact 45, thereby destroying contact 45 and freeing slot 0 to be reused for another contact.
Finally, here is the sequence after lifting the second contact:
ABS_MT_SLOT 1
ABS_MT_TRACKING_ID -1
SYN_REPORT
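Since only changes are sent in the type B protocol, the receiver has to keep the state of every slot. A minimal userspace sketch of such a receiver follows; struct typeb_state, typeb_init(), typeb_feed(), typeb_count() and the limit MAX_SLOTS are hypothetical names for this example, and a real program would take the slot count from the absinfo of the ABS_MT_SLOT axis rather than from a constant.

```c
#include <linux/input.h>

#define MAX_SLOTS 10		/* assumed; normally taken from ABS_MT_SLOT absinfo */

struct slot {
	int tracking_id;	/* -1 means the slot is unused */
	int x, y;
};

struct typeb_state {
	int cur;		/* slot selected by the last ABS_MT_SLOT */
	struct slot s[MAX_SLOTS];
};

/* Start with slot 0 selected and all slots unused. */
void typeb_init(struct typeb_state *st)
{
	int i;

	st->cur = 0;
	for (i = 0; i < MAX_SLOTS; i++) {
		st->s[i].tracking_id = -1;
		st->s[i].x = st->s[i].y = 0;
	}
}

/* Apply one event to the slot state. Returns 1 on SYN_REPORT, else 0. */
int typeb_feed(struct typeb_state *st, const struct input_event *ev)
{
	if (ev->type == EV_SYN && ev->code == SYN_REPORT)
		return 1;
	if (ev->type != EV_ABS)
		return 0;

	switch (ev->code) {
	case ABS_MT_SLOT:
		if (ev->value >= 0 && ev->value < MAX_SLOTS)
			st->cur = ev->value;
		break;
	case ABS_MT_TRACKING_ID:
		st->s[st->cur].tracking_id = ev->value;
		break;
	case ABS_MT_POSITION_X:
		st->s[st->cur].x = ev->value;
		break;
	case ABS_MT_POSITION_Y:
		st->s[st->cur].y = ev->value;
		break;
	}
	return 0;
}

/* Number of slots currently holding a contact. */
int typeb_count(const struct typeb_state *st)
{
	int i, n = 0;

	for (i = 0; i < MAX_SLOTS; i++)
		if (st->s[i].tracking_id >= 0)
			n++;
	return n;
}
```

After each SYN_REPORT the caller can inspect the slot array; a slot whose tracking id changed to a new non-negative value holds a new contact, and one that changed to -1 has been released.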
Event Usage
A set of ABS_MT events with the desired properties is defined. The events are divided into categories, to allow for partial
implementation. The minimum set consists of ABS_MT_POSITION_X and ABS_MT_POSITION_Y , which allows for multiple
contacts to be tracked. If the device supports it, the ABS_MT_TOUCH_MAJOR and ABS_MT_WIDTH_MAJOR may be used to provide
the size of the contact area and approaching tool, respectively.
The TOUCH and WIDTH parameters have a geometrical interpretation; imagine looking through a window at someone
gently holding a finger against the glass. You will see two regions, one inner region consisting of the part of the finger
actually touching the glass, and one outer region formed by the perimeter of the finger. The center of the touching region
(a) is ABS_MT_POSITION_X/Y and the center of the approaching finger (b) is ABS_MT_TOOL_X/Y . The touch diameter is
ABS_MT_TOUCH_MAJOR and the finger diameter is ABS_MT_WIDTH_MAJOR . Now imagine the person pressing the finger harder
against the glass. The touch region will increase, and in general, the ratio ABS_MT_TOUCH_MAJOR / ABS_MT_WIDTH_MAJOR , which
is always smaller than unity, is related to the contact pressure. For pressure-based devices, ABS_MT_PRESSURE may be used
to provide the pressure on the contact area instead. Devices capable of contact hovering can use ABS_MT_DISTANCE to
indicate the distance between the contact and the surface.
          Linux MT                               Win8
         __________                     _______________________
        /          \                   |                       |
       /            \                  |                       |
      /     ____     \                 |                       |
     /     /    \     \                |                       |
     \     \  a  \     \               |       a               |
      \     \____/      \              |                       |
       \                 \             |                       |
        \        b        \            |           b           |
         \                 \           |                       |
          \                 \          |                       |
           \                 \         |                       |
            \                /         |                       |
             \              /          |                       |
              \            /           |                       |
               \__________/            |_______________________|
In addition to the MAJOR parameters, the oval shape of the touch and finger regions can be described by adding the MINOR
parameters, such that MAJOR and MINOR are the major and minor axis of an ellipse. The orientation of the touch ellipse
can be described with the ORIENTATION parameter, and the direction of the finger ellipse is given by the vector (a - b).
For type A devices, further specification of the touch shape is possible via ABS_MT_BLOB_ID .
The ABS_MT_TOOL_TYPE may be used to specify whether the touching tool is a finger or a pen or something else. Finally, the
ABS_MT_TRACKING_ID event may be used to track identified contacts over time [5].
In the type B protocol, ABS_MT_TOOL_TYPE and ABS_MT_TRACKING_ID are implicitly handled by input core; drivers should
instead call input_mt_report_slot_state() .
Event Semantics
ABS_MT_TOUCH_MAJOR
The length of the major axis of the contact. The length should be given in surface units. If the surface has an X times Y
resolution, the largest possible value of ABS_MT_TOUCH_MAJOR is sqrt(X^2 + Y^2) , the diagonal [4].
ABS_MT_TOUCH_MINOR
The length, in surface units, of the minor axis of the contact. If the contact is circular, this event can be omitted [4].
ABS_MT_WIDTH_MAJOR
The length, in surface units, of the major axis of the approaching tool. This should be understood as the size of the tool
itself. The orientation of the contact and the approaching tool are assumed to be the same [4].
ABS_MT_WIDTH_MINOR
The length, in surface units, of the minor axis of the approaching tool. Omit if circular [4].
The above four values can be used to derive additional information about the contact. The ratio ABS_MT_TOUCH_MAJOR /
ABS_MT_WIDTH_MAJOR approximates the notion of pressure. The fingers of the hand and the palm all have different
characteristic widths.
ABS_MT_PRESSURE
The pressure, in arbitrary units, on the contact area. May be used instead of TOUCH and WIDTH for pressure-based
devices or any device with a spatial signal intensity distribution.
ABS_MT_DISTANCE
The distance, in surface units, between the contact and the surface. Zero distance means the contact is touching the
surface. A positive number means the contact is hovering above the surface.
ABS_MT_ORIENTATION
The orientation of the touching ellipse. The value should describe a signed quarter of a revolution clockwise around the
touch center. The signed value range is arbitrary, but zero should be returned for an ellipse aligned with the Y axis of the
surface, a negative value when the ellipse is turned to the left, and a positive value when the ellipse is turned to the right.
When completely aligned with the X axis, the range max should be returned.
Touch ellipses are symmetrical by default. For devices capable of true 360 degree orientation, the reported orientation must
exceed the range max to indicate more than a quarter of a revolution. For an upside-down finger, range max * 2 should be
returned.
Orientation can be omitted if the touch area is circular, or if the information is not available in the kernel driver. Partial
orientation support is possible if the device can distinguish between the two axes, but not (uniquely) any values in between.
In such cases, the range of ABS_MT_ORIENTATION should be [0, 1] [4].
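ABS_MT_POSITION_X
The surface X coordinate of the center of the touching ellipse.
ABS_MT_POSITION_Y
The surface Y coordinate of the center of the touching ellipse.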
ABS_MT_TOOL_X
The surface X coordinate of the center of the approaching tool. Omit if the device cannot distinguish between the intended
touch point and the tool itself.
ABS_MT_TOOL_Y
The surface Y coordinate of the center of the approaching tool. Omit if the device cannot distinguish between the intended
touch point and the tool itself.
The four position values can be used to separate the position of the touch from the position of the tool. If both positions are
present, the major tool axis points towards the touch point [1]. Otherwise, the tool axes are aligned with the touch axes.
ABS_MT_TOOL_TYPE
The type of approaching tool. A lot of kernel drivers cannot distinguish between different tool types, such as a finger or a
pen. In such cases, the event should be omitted. The protocol currently supports MT_TOOL_FINGER and MT_TOOL_PEN [2]. For
type B devices, this event is handled by input core; drivers should instead use input_mt_report_slot_state() .
ABS_MT_BLOB_ID
The BLOB_ID groups several packets together into one arbitrarily shaped contact. The sequence of points forms a polygon
which defines the shape of the contact. This is a low-level anonymous grouping for type A devices, and should not be
confused with the high-level trackingID [5]. Most type A devices do not have blob capability, so drivers can safely omit this
event.
ABS_MT_TRACKING_ID
The TRACKING_ID identifies an initiated contact throughout its life cycle [5]. The value range of the TRACKING_ID should be
large enough to ensure unique identification of a contact maintained over an extended period of time. For type B devices,
this event is handled by input core; drivers should instead use input_mt_report_slot_state() .
Event Computation
The flora of different hardware unavoidably leads to some devices fitting better to the MT protocol than others. To simplify
and unify the mapping, this section gives recipes for how to compute certain events.
For devices reporting contacts as rectangular shapes, signed orientation cannot be obtained. Assuming X and Y are the
lengths of the sides of the touching rectangle, here is a simple formula that retains the most information possible:
ABS_MT_TOUCH_MAJOR := max(X, Y)
ABS_MT_TOUCH_MINOR := min(X, Y)
ABS_MT_ORIENTATION := bool(X > Y)
The range of ABS_MT_ORIENTATION should be set to [0, 1], to indicate that the device can distinguish between a finger along
the Y axis (0) and a finger along the X axis (1).
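The rectangle recipe above translates directly into code. In this small sketch, struct mt_shape and rect_to_shape() are hypothetical names; a driver would report the three resulting values as ABS_MT_TOUCH_MAJOR, ABS_MT_TOUCH_MINOR and ABS_MT_ORIENTATION.

```c
/* Values to report for one rectangular contact (hypothetical helper type). */
struct mt_shape {
	int touch_major;	/* ABS_MT_TOUCH_MAJOR := max(X, Y) */
	int touch_minor;	/* ABS_MT_TOUCH_MINOR := min(X, Y) */
	int orientation;	/* ABS_MT_ORIENTATION := bool(X > Y) */
};

/* Map the side lengths x and y of a touching rectangle to MT values. */
struct mt_shape rect_to_shape(int x, int y)
{
	struct mt_shape s;

	s.touch_major = x > y ? x : y;
	s.touch_minor = x < y ? x : y;
	s.orientation = x > y;	/* 1: finger along the X axis, 0: along Y */
	return s;
}
```

A square contact yields orientation 0, which is harmless because major and minor are then equal.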
For win8 devices with both T and C coordinates, the position mapping is
ABS_MT_POSITION_X := T_X
ABS_MT_POSITION_Y := T_Y
ABS_MT_TOOL_X := C_X
ABS_MT_TOOL_Y := C_Y
Unfortunately, there is not enough information to specify both the touching ellipse and the tool ellipse, so one has to resort
to approximations. One simple scheme, which is compatible with earlier usage, is:
ABS_MT_TOUCH_MAJOR := min(X, Y)
ABS_MT_TOUCH_MINOR := <not used>
ABS_MT_ORIENTATION := <not used>
ABS_MT_WIDTH_MAJOR := min(X, Y) + distance(T, C)
ABS_MT_WIDTH_MINOR := min(X, Y)
Rationale: We have no information about the orientation of the touching ellipse, so approximate it with an inscribed circle
instead. The tool ellipse should align with the vector (T - C), so the diameter must increase with distance(T, C). Finally,
assume that the touch diameter is equal to the tool thickness, and we arrive at the formulas above.
Finger Tracking
The process of finger tracking, i.e., to assign a unique trackingID to each initiated contact on the surface, is a Euclidean
Bipartite Matching problem. At each event synchronization, the set of actual contacts is matched to the set of contacts from
the previous synchronization. A full implementation can be found in [3].
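To illustrate the matching step for a handful of contacts, the sketch below tries every assignment and keeps the cheapest one. This is a brute-force stand-in written for this document and is not the algorithm used by the implementation referenced in [3]; it minimizes the sum of squared distances so that it can stay in integer arithmetic, which is an assumption of this example rather than part of the protocol. All names (struct pt, match_contacts(), MAX_PTS) are hypothetical.

```c
#include <limits.h>

#define MAX_PTS 4	/* brute force is only sensible for a few contacts */

struct pt { int x, y; };

struct match_ctx {
	const struct pt *prev, *cur;
	int nprev, ncur;
	int assign[MAX_PTS], best_assign[MAX_PTS];
	long best;
};

/* Squared Euclidean distance between two points. */
static long dist2(struct pt a, struct pt b)
{
	long dx = a.x - b.x, dy = a.y - b.y;

	return dx * dx + dy * dy;
}

/* Try every way of matching cur[i..] to the previous contacts still free. */
static void search(struct match_ctx *m, int i, unsigned used, int free_prev,
		   long cost)
{
	int j;

	if (cost >= m->best)
		return;			/* cannot beat the best found so far */
	if (i == m->ncur) {
		m->best = cost;
		for (j = 0; j < m->ncur; j++)
			m->best_assign[j] = m->assign[j];
		return;
	}
	for (j = 0; j < m->nprev; j++) {
		if (used & (1u << j))
			continue;
		m->assign[i] = j;
		search(m, i + 1, used | (1u << j), free_prev - 1,
		       cost + dist2(m->cur[i], m->prev[j]));
	}
	/* Leave cur[i] unmatched only if there are too few previous contacts. */
	if (m->ncur - i > free_prev) {
		m->assign[i] = -1;
		search(m, i + 1, used, free_prev, cost);
	}
}

/*
 * For each current contact i, store in assign[i] the index of the matching
 * previous contact, or -1 for a new contact. Returns 0 on success, -1 if a
 * set is larger than MAX_PTS.
 */
int match_contacts(const struct pt *prev, int nprev,
		   const struct pt *cur, int ncur, int *assign)
{
	struct match_ctx m = { prev, cur, nprev, ncur, {0}, {0}, LONG_MAX };
	int i;

	if (nprev < 0 || ncur < 0 || nprev > MAX_PTS || ncur > MAX_PTS)
		return -1;
	search(&m, 0, 0, nprev, 0);
	for (i = 0; i < ncur; i++)
		assign[i] = m.best_assign[i];
	return 0;
}
```

A tracker built on this would copy the trackingID of the matched previous contact to each current contact, allocate a fresh ID wherever assign[i] is -1, and retire the IDs of previous contacts that no current contact was matched to.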
Gestures
In the specific application of creating gesture events, the TOUCH and WIDTH parameters can be used to, e.g.,
approximate finger pressure or distinguish between index finger and thumb. With the addition of the MINOR parameters,
one can also distinguish between a sweeping finger and a pointing finger, and with ORIENTATION, one can detect twisting
of fingers.
Notes
In order to stay compatible with existing applications, the data reported in a finger packet must not be recognized as
single-touch events.
For type A devices, all finger data bypasses input filtering, since subsequent events of the same type refer to different
fingers.
For example usage of the type A protocol, see the bcm5974 driver. For example usage of the type B protocol, see the
hid-egalax driver.
[1] Also, the difference ( TOOL_X - POSITION_X ) can be used to model tilt.
[2] The list can of course be extended.
[3] The mtdev project: http://bitmath.org/code/mtdev/.
[4] See the section on event computation.
[5] See the section on finger tracking.
Namespaces
Resource Control
There are a lot of kinds of objects in the kernel that don't have individual limits or that have limits that are ineffective when a
set of processes is allowed to switch user ids. With user namespaces enabled in a kernel for people who don't trust their
users or their users' programs to play nice, this problem becomes more acute.
Therefore it is recommended that memory control groups be enabled in kernels that enable user namespaces, and it is
further recommended that userspace configure memory control groups to limit how much memory users they don't trust to
play nice can use.
Memory control groups can be configured by installing the libcgroup package present on most distros, editing
/etc/cgrules.conf and /etc/cgconfig.conf, and setting up libpam-cgroup.
Virtual Memory
00-INDEX
- this file.
active_mm.txt
- An explanation from Linus about tsk->active_mm vs tsk->mm.
balance
- various information on memory balancing.
cleancache.txt
- Intro to cleancache and page-granularity victim cache.
frontswap.txt
- Outline frontswap, part of the transcendent memory frontend.
highmem.txt
- Outline of highmem and common issues.
hugetlbpage.txt
- a brief summary of hugetlbpage support in the Linux kernel.
hwpoison.txt
- explains what hwpoison is
ksm.txt
- how to use the Kernel Samepage Merging feature.
numa
- information about NUMA specific code in the Linux vm.
numa_memory_policy.txt
- documentation of concepts and APIs of the 2.6 memory policy support.
overcommit-accounting
- description of the Linux kernel's overcommit handling modes.
page_migration
- description of page migration in NUMA systems.
pagemap.txt
- pagemap, from the userspace perspective
slub.txt
- a short users guide for SLUB.
soft-dirty.txt
- short explanation for soft-dirty PTEs
split_page_table_lock
- Separate per-table lock to improve scalability of the old page_table_lock.
transhuge.txt
- Transparent Hugepage Support, alternative way of using hugepages.
unevictable-lru.txt
- Unevictable LRU infrastructure
zswap.txt
- Intro to compressed cache for swap pages
HIGH MEMORY HANDLING
Contents
WHAT IS HIGH MEMORY?
TEMPORARY VIRTUAL MAPPINGS
USING KMAP_ATOMIC
COST OF TEMPORARY MAPPINGS
i386 PAE
The part of (physical) memory not covered by a permanent mapping is what we refer to as 'highmem'. There are various
architecture dependent constraints on where exactly that border lies.
In the i386 arch, for example, we choose to map the kernel into every process's VM space so that we don't have to pay the
full TLB invalidation costs for kernel entry/exit. This means the available virtual memory space (4GiB on i386) has to be
divided between user and kernel space.
The traditional split for architectures using this approach is 3:1, 3GiB for userspace and the top 1GiB for kernel space:
+--------+ 0xffffffff
| Kernel |
+--------+ 0xc0000000
| |
| User |
| |
+--------+ 0x00000000
This means that the kernel can at most map 1GiB of physical memory at any one time, but because we need virtual
address space for other things - including temporary maps to access the rest of the physical memory - the actual direct
map will typically be less (usually around 896MiB).
Other architectures that have mm context tagged TLBs can have separate kernel and user maps. Some hardware (like
some ARMs), however, have limited virtual space when they use mm context tags.
vmap(). This can be used to make a long duration mapping of multiple physical pages into a contiguous virtual space.
It needs global synchronization to unmap.
kmap(). This permits a short duration mapping of a single page. It needs global synchronization, but is amortized
somewhat. It is also prone to deadlocks when used in a nested fashion, and so it is not recommended for new code.
kmap_atomic(). This permits a very short duration mapping of a single page. Since the mapping is restricted to the
CPU that issued it, it performs well, but the issuing task is therefore required to stay on that CPU until it has finished,
lest some other task displace its mappings.
kmap_atomic() may also be used by interrupt contexts, since it does not sleep and the caller may not sleep until
after kunmap_atomic() is called.
USING KMAP_ATOMIC
When and where to use kmap_atomic() is straightforward. It is used when code wants to access the contents of a page that
might be allocated from high memory (see __GFP_HIGHMEM), for example a page in the pagecache. The API has two
functions, kmap_atomic() and kunmap_atomic(): map the page, access its contents through the returned address, then
unmap it.
Note that the kunmap_atomic() call takes the result of the kmap_atomic() call, not the argument.
If you need to map two pages because you want to copy from one page to another you need to keep the kmap_atomic calls
strictly nested, like:
vaddr1 = kmap_atomic(page1);
vaddr2 = kmap_atomic(page2);
kunmap_atomic(vaddr2);
kunmap_atomic(vaddr1);
If CONFIG_HIGHMEM is not set, then the kernel will try and create a mapping simply with a bit of arithmetic that will
convert the page struct address into a pointer to the page contents rather than juggling mappings about. In such a case, the
unmap operation may be a null operation.
If CONFIG_MMU is not set, then there can be no temporary mappings and no highmem. In such a case, the arithmetic
approach will also be used.
i386 PAE
The i386 arch, under some circumstances, will permit you to stick up to 64GiB of RAM into your 32-bit machine. This has a
number of consequences:
Linux needs a page-frame structure for each page in the system and the pageframes need to live in the permanent
mapping, which means:
you can have 896M/sizeof(struct page) page-frames at most; with struct page being 32 bytes that would end up being
something in the order of 112G worth of pages; the kernel, however, needs to store more than just page-frames in that
memory...
PAE makes your page tables larger - which slows the system down as more data has to be accessed to traverse in
TLB fills and the like. One advantage is that PAE has more PTE bits and can provide advanced features like NX and
PAT.
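The 112G figure quoted above can be checked with a few lines of arithmetic. The sketch assumes a page size of 4KiB, which this text does not state explicitly, and uses the 896MiB direct map and the 32-byte struct page mentioned above; both function names are hypothetical.

```c
/* Number of struct page entries that fit into the permanent mapping. */
unsigned long long max_page_frames(unsigned long long lowmem_bytes,
				   unsigned long long struct_page_bytes)
{
	return lowmem_bytes / struct_page_bytes;
}

/* Amount of memory, in GiB, that the given number of page frames describes. */
unsigned long long frames_to_gib(unsigned long long frames,
				 unsigned long long page_bytes)
{
	return (frames * page_bytes) >> 30;
}
```

With 896MiB of permanent mapping and 32 bytes per struct page this gives 29360128 page frames, which at 4KiB per page describe 112GiB. That is an upper bound only, since the same memory also has to hold everything else the kernel keeps permanently mapped.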
The general recommendation is that you don't use more than 8GiB on a 32-bit machine - although more might work for you
and your workload, you're pretty much on your own - don't expect kernel developers to really care much if things come
apart.