Linux Doc en

The document outlines the coding style guidelines for the Linux kernel, emphasizing readability and maintainability. It covers various aspects such as indentation, naming conventions, function design, and commenting practices. The project also aims to translate Linux documentation into Chinese and provides links to the Git repository and online resources.

Uploaded by

wenboli1999

Table of Contents

Introduction 1.1

CodingStyle 1.2
Timer 1.3

Timer Usage Statistics 1.3.1

Input Device 1.4


Programming input drivers 1.4.1

Multi Touch Protocol 1.4.2

Namespaces 1.5
Resource Control 1.5.1

Virtual Memory 1.6


HIGH MEMORY HANDLING 1.6.1

Introduction
This project aims to translate Linux Documentation/ to Chinese.

Git Repository: https://github.com/tinyclub/linux-doc

Online Gitbook: http://tinylab.gitbooks.io/linux-doc

Current version is from Linux 3.18.y.

See more from zh-cn/README.md.

CodingStyle

Contents
Chapter 1: Indentation
Chapter 2: Breaking long lines and strings
Chapter 3: Placing Braces and Spaces
3.1: Spaces
Chapter 4: Naming
Chapter 5: Typedefs
Chapter 6: Functions
Chapter 7: Centralized exiting of functions
Chapter 8: Commenting
Chapter 9: You've made a mess of it
Chapter 10: Kconfig configuration files
Chapter 11: Data structures
Chapter 12: Macros, Enums and RTL
Chapter 13: Printing kernel messages
Chapter 14: Allocating memory
Chapter 15: The inline disease
Chapter 16: Function return values and names
Chapter 17: Don't re-invent the kernel macros
Chapter 18: Editor modelines and other cruft
Chapter 19: Inline assembly
Appendix I: References

Linux kernel coding style


This is a short document describing the preferred coding style for the linux kernel. Coding style is very personal, and I won't
force my views on anybody, but this is what goes for anything that I have to be able to maintain, and I'd prefer it for most
other things too. Please at least consider the points made here.

First off, I'd suggest printing out a copy of the GNU coding standards, and NOT read it. Burn them, it's a great symbolic
gesture.

Anyway, here goes:

Chapter 1: Indentation
Tabs are 8 characters, and thus indentations are also 8 characters. There are heretic movements that try to make
indentations 4 (or even 2!) characters deep, and that is akin to trying to define the value of PI to be 3.

Rationale: The whole idea behind indentation is to clearly define where a block of control starts and ends. Especially when
you've been looking at your screen for 20 straight hours, you'll find it a lot easier to see how the indentation works if you
have large indentations.

Now, some people will claim that having 8-character indentations makes the code move too far to the right, and makes it
hard to read on a 80-character terminal screen. The answer to that is that if you need more than 3 levels of indentation,
you're screwed anyway, and should fix your program.

In short, 8-char indents make things easier to read, and have the added benefit of warning you when you're nesting your
functions too deep. Heed that warning.

The preferred way to ease multiple indentation levels in a switch statement is to align the "switch" and its subordinate
"case" labels in the same column instead of "double-indenting" the "case" labels. E.g.:

switch (suffix) {
case 'G':
case 'g':
	mem <<= 30;
	break;
case 'M':
case 'm':
	mem <<= 20;
	break;
case 'K':
case 'k':
	mem <<= 10;
	/* fall through */
default:
	break;
}

Don't put multiple statements on a single line unless you have something to hide:

if (condition) do_this;
  do_something_everytime;

Don't put multiple assignments on a single line either. Kernel coding style is super simple. Avoid tricky expressions.

Outside of comments, documentation and except in Kconfig, spaces are never used for indentation, and the above
example is deliberately broken.

Get a decent editor and don't leave whitespace at the end of lines.

Chapter 2: Breaking long lines and strings


Coding style is all about readability and maintainability using commonly available tools.

The limit on the length of lines is 80 columns and this is a strongly preferred limit.

Statements longer than 80 columns will be broken into sensible chunks, unless exceeding 80 columns significantly
increases readability and does not hide information. Descendants are always substantially shorter than the parent and are
placed substantially to the right. The same applies to function headers with a long argument list. However, never break
user-visible strings such as printk messages, because that breaks the ability to grep for them.
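
As a rough userspace sketch of the rule above (the function name and message are hypothetical, and snprintf() stands in for printk()), the long argument list is broken and pushed to the right, while the user-visible string stays on one line even though it runs past 80 columns:

```c
#include <stdio.h>

/*
 * Hypothetical helper, for illustration only: format a warning into buf
 * and return what snprintf() returns.
 */
int format_mismatch_warning(char *buf, size_t len, const char *dev_name,
			    unsigned int expected, unsigned int actual)
{
	/* The format string is kept whole so that grep can still find it. */
	return snprintf(buf, len,
			"%s: firmware version mismatch, expected %u but found %u",
			dev_name, expected, actual);
}
```

Anybody who sees "firmware version mismatch" in a log can grep for exactly that text and land on this line.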

Chapter 3: Placing Braces and Spaces


The other issue that always comes up in C styling is the placement of braces. Unlike the indent size, there are few technical
reasons to choose one placement strategy over the other, but the preferred way, as shown to us by the prophets Kernighan
and Ritchie, is to put the opening brace last on the line, and put the closing brace first, thusly:

if (x is true) {
	we do y
}

This applies to all non-function statement blocks (if, switch, for, while, do). E.g.:

switch (action) {
case KOBJ_ADD:
	return "add";
case KOBJ_REMOVE:
	return "remove";
case KOBJ_CHANGE:
	return "change";
default:
	return NULL;
}

However, there is one special case, namely functions: they have the opening brace at the beginning of the next line, thus:

int function(int x)
{
	body of function
}

Heretic people all over the world have claimed that this inconsistency is ... well ... inconsistent, but all right-thinking people
know that (a) K&R are right and (b) K&R are right. Besides, functions are special anyway (you can't nest them in C).

Note that the closing brace is empty on a line of its own, except in the cases where it is followed by a continuation of the
same statement, i.e. a "while" in a do-statement or an "else" in an if-statement, like this:

do {
	body of do-loop
} while (condition);

and

if (x == y) {
	..
} else if (x > y) {
	...
} else {
	....
}

Rationale: K&R.

Also, note that this brace-placement also minimizes the number of empty (or almost empty) lines, without any loss of
readability. Thus, as the supply of new-lines on your screen is not a renewable resource (think 25-line terminal screens
here), you have more empty lines to put comments on.

Do not unnecessarily use braces where a single statement will do.

if (condition)
	action();

and

if (condition)
	do_this();
else
	do_that();

This does not apply if only one branch of a conditional statement is a single statement; in the latter case use braces in both
branches:

if (condition) {
	do_this();
	do_that();
} else {
	otherwise();
}

3.1: Spaces
Linux kernel style for use of spaces depends (mostly) on function-versus-keyword usage. Use a space after (most)
keywords. The notable exceptions are sizeof, typeof, alignof, and __attribute__, which look somewhat like functions (and are
usually used with parentheses in Linux, although they are not required in the language, as in: "sizeof info" after "struct
fileinfo info;" is declared).

So use a space after these keywords:

if, switch, case, for, do, while

but not with sizeof, typeof, alignof, or __attribute__. E.g.,

s = sizeof(struct file);

Do not add spaces around (inside) parenthesized expressions. This example is bad:

s = sizeof( struct file );

When declaring pointer data or a function that returns a pointer type, the preferred use of '*' is adjacent to the data name or
function name and not adjacent to the type name. Examples:

char *linux_banner;
unsigned long long memparse(char *ptr, char **retptr);
char *match_strdup(substring_t *s);

Use one space around (on each side of) most binary and ternary operators, such as any of these:

= + - < > * / % | & ^ <= >= == != ? :

but no space after unary operators:

& * + - ~ ! sizeof typeof alignof __attribute__ defined

no space before the postfix increment & decrement unary operators:

++ --

no space after the prefix increment & decrement unary operators:

++ --

and no space around the '.' and "->" structure member operators.
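
Taken together, the spacing rules look roughly like this in practice. This is only a sketch; struct sample and sum_negated() are made-up names, not kernel code:

```c
#include <stddef.h>

/* Hypothetical type, for illustration only. */
struct sample {
	int *values;		/* '*' sits next to the name, not the type */
	size_t count;
};

/* Hypothetical helper: sum of the negated values, or 0 for an empty set. */
long sum_negated(const struct sample *s)
{
	long total = 0;
	size_t i;

	/* space after "for", spaces around binary operators, none before "++" */
	for (i = 0; i < s->count; i++)
		total += -s->values[i];	/* no space after unary "-" or around "->" */

	return s->count ? total : 0;	/* spaces around "?" and ":" */
}
```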

Do not leave trailing whitespace at the ends of lines. Some editors with "smart" indentation will insert whitespace at the
beginning of new lines as appropriate, so you can start typing the next line of code right away. However, some such editors
do not remove the whitespace if you end up not putting a line of code there, such as if you leave a blank line. As a result,
you end up with lines containing trailing whitespace.

Git will warn you about patches that introduce trailing whitespace, and can optionally strip the trailing whitespace for you;
however, if applying a series of patches, this may make later patches in the series fail by changing their context lines.

Chapter 4: Naming
C is a Spartan language, and so should your naming be. Unlike Modula-2 and Pascal programmers, C programmers do not
use cute names like ThisVariableIsATemporaryCounter. A C programmer would call that variable "tmp", which is much
easier to write, and not the least more difficult to understand.

HOWEVER, while mixed-case names are frowned upon, descriptive names for global variables are a must. To call a global
function "foo" is a shooting offense.

GLOBAL variables (to be used only if you really need them) need to have descriptive names, as do global functions. If you
have a function that counts the number of active users, you should call that "count_active_users()" or similar, you should
_not_ call it "cntusr()".

Encoding the type of a function into the name (so-called Hungarian notation) is brain damaged - the compiler knows the
types anyway and can check those, and it only confuses the programmer. No wonder MicroSoft makes buggy programs.

LOCAL variable names should be short, and to the point. If you have some random integer loop counter, it should probably
be called "i". Calling it "loop_counter" is non-productive, if there is no chance of it being mis-understood. Similarly, "tmp" can
be just about any type of variable that is used to hold a temporary value.

If you are afraid to mix up your local variable names, you have another problem, which is called the function-growth-
hormone-imbalance syndrome. See chapter 6 (Functions).
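
A small sketch of both halves of the rule, assuming a made-up struct user with nothing but an "active" flag: the global function gets a descriptive name, the loop counter is just "i".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical type, for illustration only. */
struct user {
	bool active;
};

/* Global function: the name says what it does. */
size_t count_active_users(const struct user *users, size_t nr_users)
{
	size_t count = 0;
	size_t i;	/* local loop counter: short and to the point */

	for (i = 0; i < nr_users; i++) {
		if (users[i].active)
			count++;
	}

	return count;
}
```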

Chapter 5: Typedefs
Please don't use things like "vps_t".

It's a mistake to use typedef for structures and pointers. When you see a

vps_t a;

in the source, what does it mean?

In contrast, if it says

struct virtual_container *a;

you can actually tell what "a" is.

Lots of people think that typedefs "help readability". Not so. They are useful only for:

1. totally opaque objects (where the typedef is actively used to hide what the object is).

Example: "pte_t" etc. opaque objects that you can only access using the proper accessor functions.

NOTE! Opaqueness and "accessor functions" are not good in themselves. The reason we have them for things like
pte_t etc. is that there really is absolutely _zero_ portably accessible information there.

2. Clear integer types, where the abstraction helps avoid confusion whether it is "int" or "long".

u8/u16/u32 are perfectly fine typedefs, although they fit into category 4 better than here.

NOTE! Again - there needs to be a reason for this. If something is "unsigned long", then there's no reason to do

typedef unsigned long myflags_t;

but if there is a clear reason for why it under certain circumstances might be an "unsigned int" and under other
configurations might be "unsigned long", then by all means go ahead and use a typedef.

3. when you use sparse to literally create a new type for type-checking.

4. New types which are identical to standard C99 types, in certain exceptional circumstances.

Although it would only take a short amount of time for the eyes and brain to become accustomed to the standard types
like 'uint32_t', some people object to their use anyway.

Therefore, the Linux-specific 'u8/u16/u32/u64' types and their signed equivalents which are identical to standard types
are permitted -- although they are not mandatory in new code of your own.

When editing existing code which already uses one or the other set of types, you should conform to the existing
choices in that code.

5. Types safe for use in userspace.

In certain structures which are visible to userspace, we cannot require C99 types and cannot use the 'u32' form above.
Thus, we use __u32 and similar types in all structures which are shared with userspace.

Maybe there are other cases too, but the rule should basically be to NEVER EVER use a typedef unless you can clearly
match one of those rules.

In general, a pointer, or a struct that has elements that can reasonably be directly accessed should never be a typedef.
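
To make the distinction concrete, here is a sketch with invented names (cookie_t and its accessors are not kernel types). The first type is meant to be opaque, so the typedef earns its keep; the second has members that callers touch directly, so it stays a plain struct:

```c
/*
 * Opaque (hypothetical): the layout is private, so the typedef hides it
 * on purpose and callers go through the accessors below.
 */
typedef struct {
	unsigned long raw;
} cookie_t;

cookie_t cookie_make(unsigned long value)
{
	cookie_t c = { .raw = value };

	return c;
}

unsigned long cookie_value(cookie_t c)
{
	return c.raw;
}

/* Not opaque: members are accessed directly, so keep the struct visible. */
struct virtual_container {
	int id;
	int nr_tasks;
};

int container_is_empty(const struct virtual_container *vc)
{
	return vc->nr_tasks == 0;
}
```

A reader who sees "struct virtual_container *vc" knows at once that it is a pointer to a struct with members; "cookie_t c" deliberately tells them nothing.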

Chapter 6: Functions
Functions should be short and sweet, and do just one thing. They should fit on one or two screenfuls of text (the ISO/ANSI
screen size is 80x24, as we all know), and do one thing and do that well.

The maximum length of a function is inversely proportional to the complexity and indentation level of that function. So, if
you have a conceptually simple function that is just one long (but simple) case-statement, where you have to do lots of
small things for a lot of different cases, it's OK to have a longer function.

However, if you have a complex function, and you suspect that a less-than-gifted first-year high-school student might not
even understand what the function is all about, you should adhere to the maximum limits all the more closely. Use helper
functions with descriptive names (you can ask the compiler to in-line them if you think it's performance-critical, and it will
probably do a better job of it than you would have done).

Another measure of the function is the number of local variables. They shouldn't exceed 5-10, or you're doing something
wrong. Re-think the function, and split it into smaller pieces. A human brain can generally easily keep track of about 7
different things, anything more and it gets confused. You know you're brilliant, but maybe you'd like to understand what you
did 2 weeks from now.

In source files, separate functions with one blank line. If the function is exported, the EXPORT* macro for it should follow
immediately after the closing function brace line. E.g.:

int system_is_up(void)
{
	return system_state == SYSTEM_RUNNING;
}
EXPORT_SYMBOL(system_is_up);

In function prototypes, include parameter names with their data types. Although this is not required by the C language, it is
preferred in Linux because it is a simple way to add valuable information for the reader.

Chapter 7: Centralized exiting of functions

Albeit deprecated by some people, the equivalent of the goto statement is used frequently by compilers in form of the
unconditional jump instruction.

The goto statement comes in handy when a function exits from multiple locations and some common work such as cleanup
has to be done.

The rationale is:

- unconditional statements are easier to understand and follow
- nesting is reduced
- errors by not updating individual exit points when making modifications are prevented
- saves the compiler work to optimize redundant code away ;)

int fun(int a)
{
	int result = 0;
	char *buffer = kmalloc(SIZE, GFP_KERNEL);

	if (buffer == NULL)
		return -ENOMEM;

	if (condition1) {
		while (loop1) {
			...
		}
		result = 1;
		goto out;
	}
	...
out:
	kfree(buffer);
	return result;
}
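
The same pattern can be tried outside the kernel. The following is a userspace sketch only: malloc()/free() stand in for kmalloc()/kfree(), and duplicate_into() is a made-up function that uses a scratch buffer just so that there is something to clean up on every path.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: store a newly allocated string holding src twice
 * in *out. Returns 0 on success or -ENOMEM on allocation failure.
 */
int duplicate_into(char **out, const char *src)
{
	size_t len = strlen(src);
	char *scratch;
	char *result;
	int ret = 0;

	scratch = malloc(len + 1);
	if (!scratch)
		return -ENOMEM;	/* nothing to clean up yet */

	result = malloc(2 * len + 1);
	if (!result) {
		ret = -ENOMEM;
		goto out_free_scratch;
	}

	memcpy(scratch, src, len + 1);
	memcpy(result, scratch, len);
	memcpy(result + len, scratch, len + 1);
	*out = result;

out_free_scratch:
	/* single exit point: success and failure both release scratch here */
	free(scratch);
	return ret;
}
```

If a third allocation is added later, it only needs its own label above out_free_scratch; the existing exit paths stay untouched.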

Chapter 8: Commenting
Comments are good, but there is also a danger of over-commenting. NEVER try to explain HOW your code works in a
comment: it's much better to write the code so that the working is obvious, and it's a waste of time to explain badly written
code.

Generally, you want your comments to tell WHAT your code does, not HOW. Also, try to avoid putting comments inside a
function body: if the function is so complex that you need to separately comment parts of it, you should probably go back to
chapter 6 for a while. You can make small comments to note or warn about something particularly clever (or ugly), but try to
avoid excess. Instead, put the comments at the head of the function, telling people what it does, and possibly WHY it does
it.

When commenting the kernel API functions, please use the kernel-doc format. See the files
Documentation/kernel-doc-nano-HOWTO.txt and scripts/kernel-doc for details.

Linux style for comments is the C89 "/* ... */" style. Don't use C99-style "// ..." comments.

The preferred style for long (multi-line) comments is:

/*
 * This is the preferred style for multi-line
 * comments in the Linux kernel source code.
 * Please use it consistently.
 *
 * Description: A column of asterisks on the left side,
 * with beginning and ending almost-blank lines.
 */

For files in net/ and drivers/net/ the preferred style for long (multi-line) comments is a little different.

/* The preferred comment style for files in net/ and drivers/net
 * looks like this.
 *
 * It is nearly the same as the generally preferred comment style,
 * but there is no initial almost-blank line.
 */

It's also important to comment data, whether they are basic types or derived types. To this end, use just one data
declaration per line (no commas for multiple data declarations). This leaves you room for a small comment on each item,
explaining its use.
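
A short sketch of what that can look like, using invented names (struct retry_state is not a kernel structure). Each member sits on its own line with a small comment, and the function comment at the head says WHAT and WHY rather than HOW:

```c
/* Hypothetical bookkeeping for retrying a flaky link. */
struct retry_state {
	unsigned int attempts;		/* retries performed so far */
	unsigned int max_attempts;	/* give up once attempts reaches this */
};

/*
 * Report whether another retry is allowed. Callers use this to bound
 * the retry loop so that a dead link cannot stall them forever.
 */
int retry_allowed(const struct retry_state *rs)
{
	return rs->attempts < rs->max_attempts;
}
```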

Chapter 9: You've made a mess of it


That's OK, we all do. You've probably been told by your long-time Unix user helper that "GNU emacs" automatically formats
the C sources for you, and you've noticed that yes, it does do that, but the defaults it uses are less than desirable (in fact,
they are worse than random typing - an infinite number of monkeys typing into GNU emacs would never make a good
program).

So, you can either get rid of GNU emacs, or change it to use saner values. To do the latter, you can stick the following in
your .emacs file:

(defun c-lineup-arglist-tabs-only (ignored)
  "Line up argument lists by tabs, not spaces"
  (let* ((anchor (c-langelem-pos c-syntactic-element))
         (column (c-langelem-2nd-pos c-syntactic-element))
         (offset (- (1+ column) anchor))
         (steps (floor offset c-basic-offset)))
    (* (max steps 1)
       c-basic-offset)))

(add-hook 'c-mode-common-hook
          (lambda ()
            ;; Add kernel style
            (c-add-style
             "linux-tabs-only"
             '("linux" (c-offsets-alist
                        (arglist-cont-nonempty
                         c-lineup-gcc-asm-reg
                         c-lineup-arglist-tabs-only))))))

(add-hook 'c-mode-hook
          (lambda ()
            (let ((filename (buffer-file-name)))
              ;; Enable kernel mode for the appropriate files
              (when (and filename
                         (string-match (expand-file-name "~/src/linux-trees")
                                       filename))
                (setq indent-tabs-mode t)
                (c-set-style "linux-tabs-only")))))

This will make emacs go better with the kernel coding style for C files below ~/src/linux-trees.

But even if you fail in getting emacs to do sane formatting, not everything is lost: use "indent".

Now, again, GNU indent has the same brain-dead settings that GNU emacs has, which is why you need to give it a few
command line options. However, that's not too bad, because even the makers of GNU indent recognize the authority of
K&R (the GNU people aren't evil, they are just severely misguided in this matter), so you just give indent the options
"-kr -i8" (stands for "K&R, 8 character indents"), or use "scripts/Lindent", which indents in the latest style.

"indent" has a lot of options, and especially when it comes to comment re-formatting you may want to take a look at the
man page. But remember: "indent" is not a fix for bad programming.

Chapter 10: Kconfig configuration files


For all of the Kconfig* configuration files throughout the source tree, the indentation is somewhat different. Lines under a
"config" definition are indented with one tab, while help text is indented an additional two spaces. Example:

config AUDIT
	bool "Auditing support"
	depends on NET
	help
	  Enable auditing infrastructure that can be used with another
	  kernel subsystem, such as SELinux (which requires this for
	  logging of avc messages output). Does not do system-call
	  auditing without CONFIG_AUDITSYSCALL.

Seriously dangerous features (such as write support for certain filesystems) should advertise this prominently in their
prompt string:

config ADFS_FS_RW
	bool "ADFS write support (DANGEROUS)"
	depends on ADFS_FS
	...

For full documentation on the configuration files, see the file Documentation/kbuild/kconfig-language.txt.

Chapter 11: Data structures


Data structures that have visibility outside the single-threaded environment they are created and destroyed in should
always have reference counts. In the kernel, garbage collection doesn't exist (and outside the kernel garbage collection is
slow and inefficient), which means that you absolutely have to reference count all your uses.

Reference counting means that you can avoid locking, and allows multiple users to have access to the data structure in
parallel - and not having to worry about the structure suddenly going away from under them just because they slept or did
something else for a while.

Note that locking is not a replacement for reference counting. Locking is used to keep data structures coherent, while
reference counting is a memory management technique. Usually both are needed, and they are not to be confused with
each other.

Many data structures can indeed have two levels of reference counting, when there are users of different "classes". The
subclass count counts the number of subclass users, and decrements the global count just once when the subclass count
goes to zero.

Examples of this kind of "multi-level-reference-counting" can be found in memory management ("struct mm_struct":
mm_users and mm_count), and in filesystem code ("struct super_block": s_count and s_active).

Remember: if another thread can find your data structure, and you don't have a reference count on it, you almost certainly
have a bug.
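
The get/put discipline can be sketched in userspace with C11 atomics. This is an illustration under stated assumptions, not the kernel's own helpers: struct session and the session_*() functions are invented, and malloc()/free() stand in for the kernel allocators.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical object shared between threads. */
struct session {
	atomic_int refcount;	/* number of users holding a reference */
	int id;
};

/* The creator starts out holding the first reference. */
struct session *session_create(int id)
{
	struct session *s = malloc(sizeof(*s));

	if (!s)
		return NULL;

	atomic_init(&s->refcount, 1);
	s->id = id;
	return s;
}

/* Take a reference before handing the object to another user. */
struct session *session_get(struct session *s)
{
	atomic_fetch_add(&s->refcount, 1);
	return s;
}

/* Drop a reference; returns 1 if this call freed the object, else 0. */
int session_put(struct session *s)
{
	/* atomic_fetch_sub() returns the value before the decrement */
	if (atomic_fetch_sub(&s->refcount, 1) == 1) {
		free(s);
		return 1;
	}

	return 0;
}
```

Whoever drops the last reference frees the object, so no user has to know who else is still looking at it.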

Chapter 12: Macros, Enums and RTL


Names of macros defining constants and labels in enums are capitalized.

#define CONSTANT 0x12345

Enums are preferred when defining several related constants.

CAPITALIZED macro names are appreciated but macros resembling functions may be named in lower case.

Generally, inline functions are preferable to macros resembling functions.

Macros with multiple statements should be enclosed in a do - while block:

#define macrofun(a, b, c)			\
	do {					\
		if (a == 5)			\
			do_this(b, c);		\
	} while (0)
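
The text does not spell out why the do - while wrapper matters; the usual reason is that it makes the macro behave like a single statement, so it can sit in an unbraced if/else branch with a trailing semicolon. A sketch with made-up names:

```c
/* Hypothetical two-statement macro, wrapped so it acts as one statement. */
#define RESET_PAIR(a, b)		\
	do {				\
		(a) = 0;		\
		(b) = 0;		\
	} while (0)

/* Hypothetical caller: the macro is the whole "if" branch, no braces. */
int reset_if(int cond, int a, int b)
{
	if (cond)
		RESET_PAIR(a, b);
	else
		a = -1;

	return a + b;
}
```

With bare braces instead of do - while, the semicolon after RESET_PAIR(a, b) would end the if-statement and leave the "else" dangling, which does not compile.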

Things to avoid when using macros:

1. macros that affect control flow:

#define FOO(x)					\
	do {					\
		if (blah(x) < 0)		\
			return -EBUGGERED;	\
	} while(0)

is a very bad idea. It looks like a function call but exits the "calling" function; don't break the internal parsers of those
who will read the code.

2. macros that depend on having a local variable with a magic name:

#define FOO(val) bar(index, val)

might look like a good thing, but it's confusing as hell when one reads the code and it's prone to breakage from
seemingly innocent changes.

3. macros with arguments that are used as l-values: FOO(x) = y; will bite you if somebody e.g. turns FOO into an inline
function.

4. forgetting about precedence: macros defining constants using expressions must enclose the expression in
parentheses. Beware of similar issues with macros using parameters.

#define CONSTANT 0x4000
#define CONSTEXP (CONSTANT | 3)
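
The "similar issues with macros using parameters" are easy to demonstrate. In this sketch (both macros are invented for the example), the unparenthesized version silently computes the wrong thing when handed an expression:

```c
/* Deliberately broken: neither the parameter nor the result is parenthesized. */
#define BAD_DOUBLE(x)	x * 2

/* Fixed: parenthesize each parameter use and the whole expression. */
#define DOUBLE(x)	((x) * 2)

int bad_double_of_sum(int a, int b)
{
	return BAD_DOUBLE(a + b);	/* expands to a + b * 2 */
}

int double_of_sum(int a, int b)
{
	return DOUBLE(a + b);		/* expands to ((a + b) * 2) */
}
```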

The cpp manual deals with macros exhaustively. The gcc internals manual also covers RTL which is used frequently with
assembly language in the kernel.

Chapter 13: Printing kernel messages


Kernel developers like to be seen as literate. Do mind the spelling of kernel messages to make a good impression. Do not
use crippled words like "dont"; use "do not" or "don't" instead. Make the messages concise, clear, and unambiguous.

Kernel messages do not have to be terminated with a period.

Printing numbers in parentheses (%d) adds no value and should be avoided.

There are a number of driver model diagnostic macros in <linux/device.h> which you should use to make sure messages are
matched to the right device and driver, and are tagged with the right level: dev_err(), dev_warn(), dev_info(), and so forth.
For messages that aren't associated with a particular device, <linux/printk.h> defines pr_debug() and pr_info().

Coming up with good debugging messages can be quite a challenge; and once you have them, they can be a huge help for
remote troubleshooting. Such messages should be compiled out when the DEBUG symbol is not defined (that is, by default
they are not included). When you use dev_dbg() or pr_debug(), that's automatic. Many subsystems have Kconfig options to
turn on -DDEBUG. A related convention uses VERBOSE_DEBUG to add dev_vdbg() messages to the ones already
enabled by DEBUG.

Chapter 14: Allocating memory


The kernel provides the following general purpose memory allocators: kmalloc(), kzalloc(), kmalloc_array(), kcalloc(),
vmalloc(), and vzalloc(). Please refer to the API documentation for further information about them.

The preferred form for passing a size of a struct is the following:

p = kmalloc(sizeof(*p), ...);

The alternative form where struct name is spelled out hurts readability and introduces an opportunity for a bug when the
pointer variable type is changed but the corresponding sizeof that is passed to a memory allocator is not.

Casting the return value which is a void pointer is redundant. The conversion from void pointer to any other pointer type is
guaranteed by the C programming language.

The preferred form for allocating an array is the following:

p = kmalloc_array(n, sizeof(...), ...);

The preferred form for allocating a zeroed array is the following:

p = kcalloc(n, sizeof(...), ...);

Both forms check for overflow on the allocation size n * sizeof(...), and return NULL if that occurred.
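
To show what that overflow check amounts to, here is a userspace sketch. alloc_array() is a made-up stand-in built on malloc(); it is not how the kernel implements kmalloc_array(), only an illustration of the same idea:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace analogue of an overflow-checked array allocator:
 * returns NULL if n * size would not fit in a size_t.
 */
void *alloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* n * size would wrap around */

	return malloc(n * size);
}
```

A caller would write "p = alloc_array(n, sizeof(*p));", which keeps the sizeof(*p) form recommended above, so the element size follows the pointer type automatically.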

Chapter 15: The inline disease


There appears to be a common misperception that gcc has a magic "make me faster" speedup option called "inline". While
the use of inlines can be appropriate (for example as a means of replacing macros, see Chapter 12), it very often is not.
Abundant use of the inline keyword leads to a much bigger kernel, which in turn slows the system as a whole down, due to
a bigger icache footprint for the CPU and simply because there is less memory available for the pagecache. Just think
about it; a pagecache miss causes a disk seek, which easily takes 5 milliseconds. There are a LOT of cpu cycles that can
go into these 5 milliseconds.

A reasonable rule of thumb is to not put inline at functions that have more than 3 lines of code in them. An exception to this
rule are the cases where a parameter is known to be a compiletime constant, and as a result of this constantness you know
the compiler will be able to optimize most of your function away at compile time. For a good example of this latter case, see
the kmalloc() inline function.

Often people argue that adding inline to functions that are static and used only once is always a win since there is no space
tradeoff. While this is technically correct, gcc is capable of inlining these automatically without help, and the maintenance
issue of removing the inline when a second user appears outweighs the potential value of the hint that tells gcc to do
something it would have done anyway.

Chapter 16: Function return values and names

Functions can return values of many different kinds, and one of the most common is a value indicating whether the function
succeeded or failed. Such a value can be represented as an error-code integer (-Exxx = failure, 0 = success) or a
"succeeded" boolean (0 = failure, non-zero = success).

Mixing up these two sorts of representations is a fertile source of difficult-to-find bugs. If the C language included a strong
distinction between integers and booleans then the compiler would find these mistakes for us... but it doesn't. To help
prevent such bugs, always follow this convention:

	If the name of a function is an action or an imperative command,
	the function should return an error-code integer. If the name
	is a predicate, the function should return a "succeeded" boolean.

For example, "add work" is a command, and the add_work() function returns 0 for success or -EBUSY for failure. In the
same way, "PCI device present" is a predicate, and the pci_dev_present() function returns 1 if it succeeds in finding a
matching device or 0 if it doesn't.

All EXPORTed functions must respect this convention, and so should all public functions. Private (static) functions need not,
but it is recommended that they do.

Functions whose return value is the actual result of a computation, rather than an indication of whether the computation
succeeded, are not subject to this rule. Generally they indicate failure by returning some out-of-range result. Typical
examples would be functions that return pointers; they use NULL or the ERR_PTR mechanism to report failure.
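
A sketch of the convention with invented names (this struct work_queue is a toy for the example, not a kernel structure): the action returns an error code, the predicate returns a boolean.

```c
#include <errno.h>
#include <stdbool.h>

#define QUEUE_SLOTS 2

/* Hypothetical fixed-size queue, for illustration only. */
struct work_queue {
	int items[QUEUE_SLOTS];
	unsigned int count;
};

/* Predicate: the name asks a question, so the answer is a boolean. */
bool queue_is_full(const struct work_queue *q)
{
	return q->count >= QUEUE_SLOTS;
}

/* Action: the name is a command, so it returns 0 or a negative error code. */
int queue_add_work(struct work_queue *q, int item)
{
	if (queue_is_full(q))
		return -EBUSY;

	q->items[q->count++] = item;
	return 0;
}
```

Callers then read naturally in both cases: "if (queue_is_full(q))" tests a condition, while "ret = queue_add_work(q, item); if (ret) ..." checks for an error.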

Chapter 17: Don't re-invent the kernel macros


The header file include/linux/kernel.h contains a number of macros that you should use, rather than explicitly coding some
variant of them yourself. For example, if you need to calculate the length of an array, take advantage of the macro

#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

Similarly, if you need to calculate the size of some structure member, use

#define FIELD_SIZEOF(t, f) (sizeof(((t*)0)->f))

There are also min() and max() macros that do strict type checking if you need them. Feel free to peruse that header file to
see what else is already defined that you shouldn't reproduce in your code.
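
For a quick feel of the two macros quoted above, this userspace sketch copies their definitions as shown (in the kernel you would include the header instead) and applies them to a made-up array and structure:

```c
#include <stddef.h>

/* Definitions copied from the text above; kernel code gets them from the header. */
#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
#define FIELD_SIZEOF(t, f) (sizeof(((t*)0)->f))

/* Hypothetical data, for illustration only. */
struct packet {
	unsigned char header[4];
	unsigned short length;
};

static const int primes[] = { 2, 3, 5, 7, 11 };

size_t nr_primes(void)
{
	return ARRAY_SIZE(primes);	/* element count, not byte count */
}

size_t packet_header_size(void)
{
	/* no struct packet object is needed; sizeof does not evaluate its operand */
	return FIELD_SIZEOF(struct packet, header);
}
```

Note that ARRAY_SIZE() only works on a real array; handing it a pointer quietly yields the wrong number.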

Chapter 18: Editor modelines and other cruft


Some editors can interpret configuration information embedded in source files, indicated with special markers. For
example, emacs interprets lines marked like this:

-*- mode: c -*-

Or like this:

/*
Local Variables:
compile-command: "gcc -DMAGIC_DEBUG_FLAG foo.c"
End:
*/

Vim interprets markers that look like this:

/* vim:set sw=8 noet */

Do not include any of these in source files. People have their own personal editor configurations, and your source files
should not override them. This includes markers for indentation and mode configuration. People may use their own custom
mode, or may have some other magic method for making indentation work correctly.

Chapter 19: Inline assembly


In architecture-specific code, you may need to use inline assembly to interface with CPU or platform functionality. Don't
hesitate to do so when necessary. However, don't use inline assembly gratuitously when C can do the job. You can and
should poke hardware from C when possible.

Consider writing simple helper functions that wrap common bits of inline assembly, rather than repeatedly writing them with
slight variations. Remember that inline assembly can use C parameters.

Large, non-trivial assembly functions should go in .S files, with corresponding C prototypes defined in C header files. The C
prototypes for assembly functions should use "asmlinkage".

You may need to mark your asm statement as volatile, to prevent GCC from removing it if GCC doesn't notice any side
effects. You don't always need to do so, though, and doing so unnecessarily can limit optimization.

When writing a single inline assembly statement containing multiple instructions, put each instruction on a separate line in a
separate quoted string, and end each string except the last with \n\t to properly indent the next instruction in the assembly
output:

asm ("magic %reg1, #42\n\t"
     "more_magic %reg2, %reg3"
     : /* outputs */ : /* inputs */ : /* clobbers */);
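
To illustrate the advice about small helper functions and about volatile, here is a hedged sketch that should build with GCC or Clang on any architecture. It wraps an empty asm statement with a "memory" clobber, which is the usual way to write a compiler-only barrier. The names ex_compiler_barrier(), ex_flag and ex_publish() are hypothetical.

```c
/*
 * Compiler-only barrier. The asm body is empty, so no instruction is
 * emitted; the "memory" clobber tells the compiler not to keep memory
 * values cached in registers across this point. Without volatile the
 * compiler could drop the statement, since it has no outputs.
 */
static inline void ex_compiler_barrier(void)
{
	__asm__ __volatile__("" : /* outputs */ : /* inputs */ : "memory");
}

static int ex_flag;

/* Store a value, then force it to be re-read from memory. */
static int ex_publish(int value)
{
	ex_flag = value;
	ex_compiler_barrier();
	return ex_flag;
}
```

Callers use ex_compiler_barrier() like any other C function, so the asm syntax is written once instead of being repeated with slight variations.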

Appendix I: References
The C Programming Language, Second Edition

by Brian W. Kernighan and Dennis M. Ritchie. Prentice Hall, Inc., 1988. ISBN 0-13-110362-8 (paperback), 0-13-
110370-9 (hardback).

The Practice of Programming

by Brian W. Kernighan and Rob Pike. Addison-Wesley, Inc., 1999. ISBN 0-201-61586-X.

GNU manuals - where in compliance with K&R and this text - for cpp, gcc, gcc internals and indent

WG14 is the international standardization working group for the programming language C

Kernel CodingStyle, by [email protected] at OLS 2002

Timer

00-INDEX
- this file
highres.txt
- High resolution timers and dynamic ticks design notes
hpet.txt
- High Precision Event Timer Driver for Linux
hpet_example.c
- sample hpet timer test program
hrtimers.txt
- subsystem for high-resolution kernel timers
Makefile
- Build and link hpet_example
NO_HZ.txt
- Summary of the different methods for the scheduler clock-interrupts management.
timekeeping.txt
- Clock sources, clock events, sched_clock() and delay timer notes
timers-howto.txt
- how to insert delays in the kernel the right (tm) way.
timer_stats.txt
- timer usage statistics

Timer Usage Statistics

timer_stats - timer usage statistics


timer_stats is a debugging facility to make the timer (ab)usage in a Linux system visible to kernel and userspace
developers. If enabled in the config but not used, it has almost zero runtime overhead and a relatively small data structure
overhead. Even when collection is enabled at runtime, all the locking is per-CPU and lookup is hashed.

timer_stats should be used by kernel and userspace developers to verify that their code does not make undue use of
timers. This helps to avoid unnecessary wakeups, which should be avoided to optimize power consumption.

It can be enabled by CONFIG_TIMER_STATS in the "Kernel hacking" configuration section.

timer_stats collects information about the timer events which are fired in a Linux system over a sample period:

the pid of the task (process) which initialized the timer
the name of the process which initialized the timer
the function where the timer was initialized
the callback function which is associated to the timer
the number of events (callbacks)

timer_stats adds an entry to /proc: /proc/timer_stats

This entry is used to control the statistics functionality and to read out the sampled information.

The timer_stats functionality is inactive on bootup.

To activate a sample period issue:

# echo 1 >/proc/timer_stats

To stop a sample period issue:

# echo 0 >/proc/timer_stats

The statistics can be retrieved by:

# cat /proc/timer_stats

While sampling is enabled, each readout from /proc/timer_stats will see newly updated statistics. Once sampling is
disabled, the sampled information is kept until a new sample period is started. This allows multiple readouts.

Sample output of /proc/timer_stats :

Timerstats sample period: 3.888770 s
  12,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
  15,     1 swapper          hcd_submit_urb (rh_timer_func)
   4,   959 kedac            schedule_timeout (process_timeout)
   1,     0 swapper          page_writeback_init (wb_timer_fn)
  28,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
  22,  2948 IRQ 4            tty_flip_buffer_push (delayed_work_timer_fn)
   3,  3100 bash             schedule_timeout (process_timeout)
   1,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
   1,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
   1,     1 swapper          neigh_table_init_no_netlink (neigh_periodic_timer)
   1,  2292 ip               __netdev_watchdog_up (dev_watchdog)
   1,    23 events/1         do_cache_clean (delayed_work_timer_fn)
90 total events, 30.0 events/sec


The first column is the number of events, the second column is the pid, and the third column is the name of the process. The
fourth column shows the function which initialized the timer and, in parentheses, the callback function which was executed
on expiry.

Thomas, Ingo

Added flag to indicate 'deferrable timer' in /proc/timer_stats. A deferrable timer will appear as follows:

10D, 1 swapper queue_delayed_work_on (delayed_work_timer_fn)
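
A userspace tool that post-processes this output could pick apart the leading columns roughly as sketched here. This is only an illustration under assumptions: struct timer_stat and parse_timer_stat() are hypothetical names, and the sketch deliberately stops after the pid, because the process name may itself contain spaces (see the "IRQ 4" entry in the sample output).

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical record for the leading columns of one output line. */
struct timer_stat {
	unsigned long count;	/* number of events */
	bool deferrable;	/* true if the count carries the 'D' flag */
	int pid;		/* pid of the task that initialized the timer */
};

/*
 * Parse "<count>[D], <pid> ..." from one line.
 * Returns 0 on success and -1 for lines that are not entries, such as
 * the "Timerstats sample period" header or the "total events" footer.
 * On failure the contents of *out are unspecified.
 */
static int parse_timer_stat(const char *line, struct timer_stat *out)
{
	int used = 0;

	if (sscanf(line, " %lu%n", &out->count, &used) != 1)
		return -1;
	line += used;

	out->deferrable = (*line == 'D');
	if (out->deferrable)
		line++;

	/* An entry always has a comma directly after the count (and flag). */
	if (*line != ',')
		return -1;
	line++;

	if (sscanf(line, " %d", &out->pid) != 1)
		return -1;
	return 0;
}
```

Feeding it the deferrable example line yields a count of 10, the deferrable flag set, and pid 1.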

Input Device

Contents
0. Disclaimer
1. Introduction
1.1 Device drivers
1.2 Event handlers
2. Simple Usage
3. Detailed Description
3.1 Device drivers
3.1.1 usbhid
3.1.2 usbmouse
3.1.3 usbkbd
3.1.4 wacom
3.1.5 iforce
3.2 Event handlers
3.2.1 keybdev
3.2.2 mousedev
3.2.3 joydev
3.2.4 evdev
4. Verifying if it works
5. Event interface

Linux Input drivers v1.0


(c) 1999-2001 Vojtech Pavlik [email protected]

Sponsored by SuSE

0. Disclaimer
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

Should you need to contact me, the author, you can do so either by e-mail (mail your message to <[email protected]>) or by
paper mail: Vojtech Pavlik, Simunkova 1594, Prague 8, 182 00 Czech Republic

For your convenience, the GNU General Public License version 2 is included in the package: See the file COPYING.

1. Introduction
This is a collection of drivers that is designed to support all input devices under Linux. While it is currently used only for
USB input devices, future use (say 2.5/2.6) is expected to expand to replace most of the existing input system, which is
why it lives in drivers/input/ instead of drivers/usb/.


The centre of the input drivers is the input module, which must be loaded before any other of the input modules - it serves
as a way of communication between two groups of modules:

1.1 Device drivers


These modules talk to the hardware (for example via USB), and provide events (keystrokes, mouse movements) to the
input module.

1.2 Event handlers


These modules get events from input and pass them where needed via various interfaces - keystrokes to the kernel, mouse
movements via a simulated PS/2 interface to GPM and X and so on.

2. Simple Usage
For the most usual configuration, with one USB mouse and one USB keyboard, you'll have to load the following modules
(or have them built in to the kernel):

input
mousedev
keybdev
usbcore
uhci_hcd or ohci_hcd or ehci_hcd
usbhid

After this, the USB keyboard will work straight away, and the USB mouse will be available as a character device on major
13, minor 63:

crw-r--r-- 1 root root 13, 63 Mar 28 22:45 mice

This device has to be created. The commands to create it by hand are:

cd /dev
mkdir input
mknod input/mice c 13 63

After that you have to point GPM (the textmode mouse cut&paste tool) and XFree to this device to use it - GPM should be
called like:

gpm -t ps2 -m /dev/input/mice

And in X:

Section "Pointer"
    Protocol    "ImPS/2"
    Device      "/dev/input/mice"
    ZAxisMapping 4 5
EndSection

When you do all of the above, you can use your USB mouse and keyboard.

3. Detailed Description
3.1 Device drivers


Device drivers are the modules that generate events. The events are however not useful without being handled, so you
also will need to use some of the modules from section 3.2.

3.1.1 usbhid
usbhid is the largest and most complex driver of the whole suite. It handles all HID devices, and because there is a very
wide variety of them, and because the USB HID specification isn't simple, it needs to be this big.

Currently, it handles USB mice, joysticks, gamepads, steering wheels, keyboards, trackballs and digitizers.

However, USB uses HID also for monitor controls, speaker controls, UPSs, LCDs and many other purposes.

The monitor and speaker controls should be easy to add to the hid/input interface, but for the UPSs and LCDs it doesn't
make much sense. For this, the hiddev interface was designed. See Documentation/hid/hiddev.txt for more information
about it.

The usbhid module is very simple to use: it takes no parameters, detects everything automatically, and when a HID
device is inserted, it detects it appropriately.

However, because the devices vary wildly, you might happen to have a device that doesn't work well. In that case #define
DEBUG at the beginning of hid-core.c and send me the syslog traces.

3.1.2 usbmouse
For embedded systems, for mice with broken HID descriptors and just any other use when the big usbhid wouldn't be a
good choice, there is the usbmouse driver. It handles USB mice only. It uses a simpler HIDBP protocol. This also means
the mice must support this simpler protocol. Not all do. If you don't have any strong reason to use this module, use usbhid
instead.

3.1.3 usbkbd
Much like usbmouse, this module talks to keyboards with a simplified HIDBP protocol. It's smaller, but doesn't support any
extra special keys. Use usbhid instead if there isn't any special reason to use this.

3.1.4 wacom
This is a driver for Wacom Graphire and Intuos tablets. Not for Wacom PenPartner, that one is handled by the HID driver.
Although the Intuos and Graphire tablets claim that they are HID tablets as well, they are not and thus need this specific
driver.

3.1.5 iforce
A driver for I-Force joysticks and wheels, both over USB and RS232. It includes ForceFeedback support now, even though
Immersion Corp. considers the protocol a trade secret and won't disclose a word about it.

3.2 Event handlers


Event handlers distribute the events from the devices to userland and kernel, as needed.

3.2.1 keybdev
keybdev is currently a rather ugly hack that translates the input events into architecture-specific keyboard raw mode (Xlated
AT Set2 on x86), and passes them into the handle_scancode function of the keyboard.c module. This works well enough on
all architectures that keybdev can generate rawmode on, other architectures can be added to it.

The right way would be to pass the events to keyboard.c directly, best if keyboard.c would itself be an event handler. This is
done in the input patch, available on the webpage mentioned below.


3.2.2 mousedev
mousedev is also a hack to make programs that use mouse input work. It takes events from either mice or digitizers/tablets
and makes a PS/2-style (a la /dev/psaux) mouse device available to the userland. Ideally, the programs could use a more
reasonable interface, for example evdev.

Mousedev devices in /dev/input (as shown above) are:

crw-r--r-- 1 root root 13, 32 Mar 28 22:45 mouse0
crw-r--r-- 1 root root 13, 33 Mar 29 00:41 mouse1
crw-r--r-- 1 root root 13, 34 Mar 29 00:41 mouse2
crw-r--r-- 1 root root 13, 35 Apr 1 10:50 mouse3
...
...
crw-r--r-- 1 root root 13, 62 Apr 1 10:50 mouse30
crw-r--r-- 1 root root 13, 63 Apr 1 10:50 mice

Each 'mouse' device is assigned to a single mouse or digitizer, except the last one - 'mice'. This single character device is
shared by all mice and digitizers, and even if none are connected, the device is present. This is useful for hotplugging USB
mice, so that programs can open the device even when no mice are present.

CONFIG_INPUT_MOUSEDEV_SCREEN_[XY] in the kernel configuration are the size of your screen (in pixels) in XFree86. This is
needed if you want to use your digitizer in X, because its movement is sent to X via a virtual PS/2 mouse and thus needs to
be scaled accordingly. These values won't be used if you use a mouse only.

Mousedev will generate either PS/2, ImPS/2 (Microsoft IntelliMouse) or ExplorerPS/2 (IntelliMouse Explorer) protocols,
depending on what the program reading the data wishes. You can set GPM and X to any of these. You'll need ImPS/2 if you
want to make use of a wheel on a USB mouse and ExplorerPS/2 if you want to use extra (up to 5) buttons.

3.2.3 joydev
Joydev implements v0.x and v1.x Linux joystick api, much like drivers/char/joystick/joystick.c used to in earlier versions.
See joystick-api.txt in the Documentation subdirectory for details. As soon as any joystick is connected, it can be accessed
in /dev/input on:

crw-r--r-- 1 root root 13, 0 Apr 1 10:50 js0
crw-r--r-- 1 root root 13, 1 Apr 1 10:50 js1
crw-r--r-- 1 root root 13, 2 Apr 1 10:50 js2
crw-r--r-- 1 root root 13, 3 Apr 1 10:50 js3
...

And so on up to js31.

3.2.4 evdev
evdev is the generic input event interface. It passes the events generated in the kernel straight to the program, with
timestamps. The API is still evolving, but should be usable now. It's described in section 5.

This should be the way for GPM and X to get keyboard and mouse events. It allows for multihead in X without any specific
multihead kernel support. The event codes are the same on all architectures and are hardware independent.

The devices are in /dev/input:

crw-r--r-- 1 root root 13, 64 Apr 1 10:49 event0
crw-r--r-- 1 root root 13, 65 Apr 1 10:50 event1
crw-r--r-- 1 root root 13, 66 Apr 1 10:50 event2
crw-r--r-- 1 root root 13, 67 Apr 1 10:50 event3
...

And so on up to event31.


4. Verifying if it works
Typing a couple keys on the keyboard should be enough to check that a USB keyboard works and is correctly connected to
the kernel keyboard driver.

Doing a "cat /dev/input/mouse0" (c, 13, 32) will verify that a mouse is also emulated; characters should appear if you move
it.

You can test the joystick emulation with the 'jstest' utility, available in the joystick package (see
Documentation/input/joystick.txt).

You can test the event devices with the 'evtest' utility available in the LinuxConsole project CVS archive (see the URL
below).

5. Event interface
Should you want to add event device support into any application (X, gpm, svgalib ...) I <[email protected]> will be happy to
provide you any help I can. Here goes a description of the current state of things, which is going to be extended, but not
changed incompatibly as time goes:

You can use blocking and nonblocking reads, also select() on the /dev/input/eventX devices, and you'll always get a whole
number of input events on a read. Their layout is:

struct input_event {
	struct timeval time;
	unsigned short type;
	unsigned short code;
	int value;	/* signed: relative changes can be negative */
};

'time' is the timestamp: the time at which the event happened. 'type' is, for example, EV_REL for relative movement or
EV_KEY for a keypress or release. More types are defined in include/linux/input.h.

'code' is event code, for example REL_X or KEY_BACKSPACE, again a complete list is in include/linux/input.h.

'value' is the value the event carries. Either a relative change for EV_REL, absolute new value for EV_ABS (joysticks ...), or
0 for EV_KEY for release, 1 for keypress and 2 for autorepeat.
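
The field semantics above can be sketched as a small decoder. This is a hedged illustration rather than real evdev client code: struct ex_event is a trimmed, hypothetical stand-in for struct input_event (the timestamp is left out), the EX_EV_* constants are local stand-ins for EV_KEY, EV_REL and EV_ABS, and ex_key_action() and ex_describe() are invented names. A real program would include linux/input.h and read the events from a /dev/input/eventX device.

```c
#include <stddef.h>
#include <stdio.h>

/* Local stand-ins for EV_KEY, EV_REL and EV_ABS. */
enum ex_ev_type { EX_EV_KEY = 0x01, EX_EV_REL = 0x02, EX_EV_ABS = 0x03 };

/* Trimmed stand-in for struct input_event, without the timestamp. */
struct ex_event {
	unsigned short type;
	unsigned short code;
	int value;
};

/* Map an EV_KEY value to the three meanings listed in the text. */
static const char *ex_key_action(int value)
{
	switch (value) {
	case 0:
		return "release";
	case 1:
		return "press";
	case 2:
		return "autorepeat";
	default:
		return "unknown";
	}
}

/* Write a one-line description of ev into buf; returns snprintf()'s result. */
static int ex_describe(const struct ex_event *ev, char *buf, size_t len)
{
	switch (ev->type) {
	case EX_EV_KEY:
		return snprintf(buf, len, "key %u %s",
				(unsigned int)ev->code, ex_key_action(ev->value));
	case EX_EV_REL:
		/* value is a signed change relative to the last position */
		return snprintf(buf, len, "rel %u %+d",
				(unsigned int)ev->code, ev->value);
	case EX_EV_ABS:
		/* value is the new absolute position */
		return snprintf(buf, len, "abs %u %d",
				(unsigned int)ev->code, ev->value);
	default:
		return snprintf(buf, len, "type %u ignored",
				(unsigned int)ev->type);
	}
}
```

Since a read always returns a whole number of events, a client can loop over the buffer in steps of the structure size and hand each element to a function like ex_describe().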

Programming input drivers

Contents
1. Creating an input device driver
1.0 The simplest example
1.1 What the example does
1.2 dev->open() and dev->close()
1.3 Basic event types
1.4 BITS_TO_LONGS(), BIT_WORD(), BIT_MASK()
1.5 The id* and name fields
1.6 The keycode, keycodemax, keycodesize fields
1.7 dev->getkeycode() and dev->setkeycode()
1.8 Key autorepeat
1.9 Other event types, handling output events

Programming input drivers

1. Creating an input device driver


1.0 The simplest example
Here comes a very simple example of an input device driver. The device has just one button and the button is accessible at
i/o port BUTTON_PORT . When pressed or released a BUTTON_IRQ happens. The driver could look like:


#include <linux/input.h>
#include <linux/module.h>
#include <linux/init.h>

#include <asm/irq.h>
#include <asm/io.h>

static struct input_dev *button_dev;

/* Runs on every button interrupt: read the port and report the state. */
static irqreturn_t button_interrupt(int irq, void *dummy)
{
	input_report_key(button_dev, BTN_0, inb(BUTTON_PORT) & 1);
	input_sync(button_dev);
	return IRQ_HANDLED;
}

static int __init button_init(void)
{
	int error;

	if (request_irq(BUTTON_IRQ, button_interrupt, 0, "button", NULL)) {
		printk(KERN_ERR "button.c: Can't allocate irq %d\n", BUTTON_IRQ);
		return -EBUSY;
	}

	button_dev = input_allocate_device();
	if (!button_dev) {
		printk(KERN_ERR "button.c: Not enough memory\n");
		error = -ENOMEM;
		goto err_free_irq;
	}

	/* Advertise the only event this device can generate: EV_KEY, BTN_0. */
	button_dev->evbit[0] = BIT_MASK(EV_KEY);
	button_dev->keybit[BIT_WORD(BTN_0)] = BIT_MASK(BTN_0);

	error = input_register_device(button_dev);
	if (error) {
		printk(KERN_ERR "button.c: Failed to register device\n");
		goto err_free_dev;
	}

	return 0;

err_free_dev:
	input_free_device(button_dev);
err_free_irq:
	/* The dev_id argument must match the one given to request_irq(). */
	free_irq(BUTTON_IRQ, NULL);
	return error;
}

static void __exit button_exit(void)
{
	input_unregister_device(button_dev);
	free_irq(BUTTON_IRQ, NULL);
}

module_init(button_init);
module_exit(button_exit);

1.1 What the example does


First it has to include the <linux/input.h> file, which interfaces to the input subsystem. This provides all the definitions
needed.

In the _init function, which is called either upon module load or when booting the kernel, it grabs the required resources
(it should also check for the presence of the device).


Then it allocates a new input device structure with input_allocate_device() and sets up input bitfields. This way the device
driver tells the other parts of the input systems what it is - what events can be generated or accepted by this input device.
Our example device can only generate EV_KEY type events, and from those only BTN_0 event code. Thus we only set
these two bits. We could have used

set_bit(EV_KEY, button_dev->evbit);
set_bit(BTN_0, button_dev->keybit);

as well, but with more than single bits the first approach tends to be shorter.

Then the example driver registers the input device structure by calling

input_register_device(button_dev);

This adds the button_dev structure to the linked lists of the input driver and calls the _connect functions of the device
handler modules to tell them a new input device has appeared. input_register_device() may sleep and therefore must not be
called from an interrupt or with a spinlock held.

While in use, the only used function of the driver is

button_interrupt()

which upon every interrupt from the button checks its state and reports it via the

input_report_key()

call to the input system. There is no need to check whether the interrupt routine is reporting the same value twice in a row
(press, press for example), because the input_report_* functions check that themselves.

Then there is the

input_sync()

call to tell those who receive the events that we've sent a complete report. This doesn't seem important in the one-button
case, but it is quite important for, say, mouse movement, where you don't want the X and Y values to be interpreted
separately, because that would result in a different movement.

1.2 dev->open() and dev->close()


In case the driver has to repeatedly poll the device, because it doesn't have an interrupt coming from it and the polling is
too expensive to be done all the time, or if the device uses a valuable resource (e.g. an interrupt), it can use the open and
close callbacks to know when it can stop polling or release the interrupt and when it must resume polling or grab the
interrupt again. To do that, we would add this to our example driver:


static int button_open(struct input_dev *dev)
{
	if (request_irq(BUTTON_IRQ, button_interrupt, 0, "button", NULL)) {
		printk(KERN_ERR "button.c: Can't allocate irq %d\n", BUTTON_IRQ);
		return -EBUSY;
	}

	return 0;
}

static void button_close(struct input_dev *dev)
{
	/* The dev_id argument must match the one given to request_irq(). */
	free_irq(BUTTON_IRQ, NULL);
}

static int __init button_init(void)
{
	...
	button_dev->open = button_open;
	button_dev->close = button_close;
	...
}

Note that input core keeps track of number of users for the device and makes sure that dev->open() is called only when
the first user connects to the device and that dev->close() is called when the very last user disconnects. Calls to both
callbacks are serialized.

The open() callback should return a 0 in case of success or any nonzero value in case of failure. The close() callback
(which is void) must always succeed.

1.3 Basic event types


The most simple event type is EV_KEY, which is used for keys and buttons. It's reported to the input system via:

input_report_key(struct input_dev *dev, int code, int value)

See linux/input.h for the allowable values of code (from 0 to KEY_MAX). value is interpreted as a truth value, i.e. any
nonzero value means key pressed and a zero value means key released. The input code generates events only in case the
value is different from before.

In addition to EV_KEY, there are two more basic event types: EV_REL and EV_ABS. They are used for relative and
absolute values supplied by the device. A relative value may be for example a mouse movement in the X axis. The mouse
reports it as a relative difference from the last position, because it doesn't have any absolute coordinate system to work in.
Absolute events are mainly for joysticks and digitizers - devices that do work in an absolute coordinate system.

Having the device report EV_REL events is as simple as with EV_KEY: set the corresponding bits and call the

input_report_rel(struct input_dev *dev, int code, int value)

function. Events are generated only for nonzero value.
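
The two filtering rules just described (key events only on a change of state, relative events only for nonzero values) can be modelled in a few lines of userspace C. This is a toy model for illustration, not the input core's implementation; struct ex_button, ex_report_key() and ex_report_rel() are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical per-device state: the last reported state of one key. */
struct ex_button {
	bool down;
};

/*
 * Model of the EV_KEY rule: value is a truth value, and an event is
 * passed on only if the state differs from the previous one.
 * Returns true if an event would be generated.
 */
static bool ex_report_key(struct ex_button *btn, int value)
{
	bool pressed = (value != 0);

	if (pressed == btn->down)
		return false;	/* same state as before: filtered out */
	btn->down = pressed;
	return true;
}

/* Model of the EV_REL rule: only nonzero movement generates an event. */
static bool ex_report_rel(int value)
{
	return value != 0;
}
```

This is why the example interrupt handler can report the raw port state on every interrupt without remembering what it reported last time.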

However EV_ABS requires a little special care. Before calling input_register_device, you have to fill additional fields in the
input_dev struct for each absolute axis your device has. If our button device had also the ABS_X axis:

button_dev.absmin[ABS_X] = 0;
button_dev.absmax[ABS_X] = 255;
button_dev.absfuzz[ABS_X] = 4;
button_dev.absflat[ABS_X] = 8;

Or, you can just say:


input_set_abs_params(button_dev, ABS_X, 0, 255, 4, 8);

This setting would be appropriate for a joystick X axis, with the minimum of 0, maximum of 255 (which the joystick must be
able to reach, no problem if it sometimes reports more, but it must be able to always reach the min and max values), with
noise in the data up to +- 4, and with a center flat position of size 8.

If you don't need absfuzz and absflat, you can set them to zero, which means that the thing is precise and always returns to
exactly the center position (if it has any).
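
To make the meaning of the fuzz and flat values concrete, here is a deliberately simplified userspace sketch. It is one possible interpretation, stated as an assumption: changes of at most fuzz are treated as noise, and flat is treated as a half-width around the midpoint of the range. The kernel and its consumers may apply these values in a more gradual or otherwise different way. struct ex_axis, ex_axis_defuzz() and ex_axis_deadzone() are hypothetical names.

```c
#include <stdlib.h>

/* Hypothetical copy of the four per-axis parameters from the text. */
struct ex_axis {
	int min;
	int max;
	int fuzz;
	int flat;
};

/* Keep the previous value when the change is within the noise level. */
static int ex_axis_defuzz(const struct ex_axis *a, int last, int raw)
{
	return abs(raw - last) <= a->fuzz ? last : raw;
}

/* Snap values inside the flat zone to the center of the range. */
static int ex_axis_deadzone(const struct ex_axis *a, int value)
{
	int center = (a->min + a->max) / 2;

	return abs(value - center) <= a->flat ? center : value;
}
```

With the joystick settings above (0, 255, 4, 8), a reading of 103 after 100 is treated as noise, and a reading of 130 falls into the flat zone around the midpoint 127.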

1.4 BITS_TO_LONGS(), BIT_WORD(), BIT_MASK()


These three macros from bitops.h help some bitfield computations:

BITS_TO_LONGS(x) - returns the length of a bitfield array in longs for x bits
BIT_WORD(x) - returns the index in the array in longs for bit x
BIT_MASK(x) - returns the mask that selects bit x within its long
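
The arithmetic behind these three macros can be reproduced in userspace. The sketch below uses EX_-prefixed stand-ins so that it does not clash with the real definitions in bitops.h; EX_KEY_CNT, EX_BTN, ex_keybit, ex_set_bit() and ex_test_bit() are hypothetical names and values chosen for the example.

```c
#include <limits.h>

/* Userspace stand-ins for the kernel macros described in the text. */
#define EX_BITS_PER_LONG	(sizeof(long) * CHAR_BIT)
#define EX_BITS_TO_LONGS(n)	(((n) + EX_BITS_PER_LONG - 1) / EX_BITS_PER_LONG)
#define EX_BIT_WORD(n)		((n) / EX_BITS_PER_LONG)
#define EX_BIT_MASK(n)		(1UL << ((n) % EX_BITS_PER_LONG))

#define EX_KEY_CNT	0x300	/* hypothetical number of bits in the map */
#define EX_BTN		0x100	/* hypothetical event code */

/* A bitfield array sized in longs, like evbit or keybit. */
static unsigned long ex_keybit[EX_BITS_TO_LONGS(EX_KEY_CNT)];

/* Set bit nr: pick the long with BIT_WORD, the bit with BIT_MASK. */
static void ex_set_bit(unsigned int nr, unsigned long *map)
{
	map[EX_BIT_WORD(nr)] |= EX_BIT_MASK(nr);
}

/* Return nonzero if bit nr is set in the map. */
static int ex_test_bit(unsigned int nr, const unsigned long *map)
{
	return (map[EX_BIT_WORD(nr)] & EX_BIT_MASK(nr)) != 0;
}
```

The index and mask always go together: the word index selects one long in the array, and the mask selects one bit inside that long.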

1.5 The id* and name fields


The dev->name should be set before registering the input device by the input device driver. It's a string like 'Generic button
device' containing a user friendly name of the device.

The id* fields contain the bus ID (PCI, USB, ...), vendor ID and device ID of the device. The bus IDs are defined in
input.h. The vendor and device ids are defined in pci_ids.h, usb_ids.h and similar include files. These fields should be set
by the input device driver before registering it.

The idtype field can be used for specific information for the input device driver.

The id and name fields can be passed to userland via the evdev interface.

1.6 The keycode, keycodemax, keycodesize fields


These three fields should be used by input devices that have dense keymaps. The keycode is an array used to map from
scancodes to input system keycodes. The keycodemax should contain the size of the array and keycodesize the size of
each entry in it (in bytes).

Userspace can query and alter current scancode to keycode mappings using EVIOCGKEYCODE and EVIOCSKEYCODE
ioctls on corresponding evdev interface. When a device has all 3 aforementioned fields filled in, the driver may rely on
kernel's default implementation of setting and querying keycode mappings.
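
How the three fields work together for a dense keymap can be sketched in userspace as shown here. This is an illustration of the idea, not the kernel's default implementation; ex_fetch_keycode() and ex_keymap are hypothetical, and the numbers stored in the table are arbitrary example keycodes.

```c
#include <stdint.h>

/*
 * Look up the keycode for a scancode in a dense keymap.
 * keycode points to the array, keycodemax is its number of entries and
 * keycodesize is the size of one entry in bytes.
 * Returns -1 for an out-of-range scancode or an unsupported entry size.
 */
static int ex_fetch_keycode(const void *keycode, unsigned int keycodemax,
			    unsigned int keycodesize, unsigned int scancode)
{
	if (scancode >= keycodemax)
		return -1;

	switch (keycodesize) {
	case 1:
		return ((const uint8_t *)keycode)[scancode];
	case 2:
		return ((const uint16_t *)keycode)[scancode];
	case 4:
		return (int)((const uint32_t *)keycode)[scancode];
	default:
		return -1;
	}
}

/* Hypothetical keymap: the scancode is the index into the table. */
static const uint16_t ex_keymap[] = { 0, 30, 48, 46 };
```

Storing the entry size next to the array is what lets generic code read a table without knowing whether the driver declared it with 8, 16 or 32 bit entries.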

1.7 dev->getkeycode() and dev->setkeycode()


getkeycode() and setkeycode() callbacks allow drivers to override default keycode/keycodesize/keycodemax mapping
mechanism provided by input core and implement sparse keycode maps.

1.8 Key autorepeat


... is simple. It is handled by the input.c module. Hardware autorepeat is not used, because it's not present in many
devices and even where it is present, it is broken sometimes (at keyboards: Toshiba notebooks). To enable autorepeat for
your device, just set EV_REP in dev->evbit . All will be handled by the input system.

1.9 Other event types, handling output events


The other event types up to now are:

EV_LED - used for the keyboard LEDs.


EV_SND - used for keyboard beeps.

They are very similar to for example key events, but they go in the other direction - from the system to the input device
driver. If your input device driver can handle these events, it has to set the respective bits in evbit, and also the callback
routine:

button_dev->event = button_event;

int button_event(struct input_dev *dev, unsigned int type, unsigned int code, int value)
{
	if (type == EV_SND && code == SND_BELL) {
		outb(value, BUTTON_BELL);
		return 0;
	}
	return -1;
}

This callback routine can be called from an interrupt or a BH (although that isn't a rule), and thus must not sleep, and must
not take too long to finish.

Multi Touch Protocol

Contents
Introduction
Protocol Usage
Protocol Example A
Protocol Example B
Event Usage
Event Semantics
Event Computation
Finger Tracking
Gestures
Notes

Multi-touch (MT) Protocol


Copyright (C) 2009-2010 Henrik Rydberg [email protected]

Introduction
In order to utilize the full power of the new multi-touch and multi-user devices, a way to report detailed data from multiple
contacts, i.e., objects in direct contact with the device surface, is needed. This document describes the multi-touch (MT)
protocol which allows kernel drivers to report details for an arbitrary number of contacts.

The protocol is divided into two types, depending on the capabilities of the hardware. For devices handling anonymous
contacts (type A), the protocol describes how to send the raw data for all contacts to the receiver. For devices capable of
tracking identifiable contacts (type B), the protocol describes how to send updates for individual contacts via event slots.

Protocol Usage
Contact details are sent sequentially as separate packets of ABS_MT events. Only the ABS_MT events are recognized as
part of a contact packet. Since these events are ignored by current single-touch (ST) applications, the MT protocol can be
implemented on top of the ST protocol in an existing driver.

Drivers for type A devices separate contact packets by calling input_mt_sync() at the end of each packet. This generates a
SYN_MT_REPORT event, which instructs the receiver to accept the data for the current contact and prepare to receive another.

Drivers for type B devices separate contact packets by calling input_mt_slot() , with a slot as argument, at the beginning
of each packet. This generates an ABS_MT_SLOT event, which instructs the receiver to prepare for updates of the given slot.

All drivers mark the end of a multi-touch transfer by calling the usual input_sync() function. This instructs the receiver to
act upon events accumulated since last EV_SYN / SYN_REPORT and prepare to receive a new set of events/packets.

The main difference between the stateless type A protocol and the stateful type B slot protocol lies in the usage of
identifiable contacts to reduce the amount of data sent to userspace. The slot protocol requires the use of the
ABS_MT_TRACKING_ID , either provided by the hardware or computed from the raw data [5].

For type A devices, the kernel driver should generate an arbitrary enumeration of the full set of anonymous contacts
currently on the surface. The order in which the packets appear in the event stream is not important. Event filtering and
finger tracking are left to user space [3].


For type B devices, the kernel driver should associate a slot with each identified contact, and use that slot to propagate
changes for the contact. Creation, replacement and destruction of contacts is achieved by modifying the
ABS_MT_TRACKING_ID of the associated slot. A non-negative tracking id is interpreted as a contact, and the value -1 denotes
an unused slot. A tracking id not previously present is considered new, and a tracking id no longer present is considered
removed. Since only changes are propagated, the full state of each initiated contact has to reside in the receiving end.
Upon receiving an MT event, one simply updates the appropriate attribute of the current slot.

Some devices identify and/or track more contacts than they can report to the driver. A driver for such a device should
associate one type B slot with each contact that is reported by the hardware. Whenever the identity of the contact
associated with a slot changes, the driver should invalidate that slot by changing its ABS_MT_TRACKING_ID . If the hardware
signals that it is tracking more contacts than it is currently reporting, the driver should use a BTN_TOOL_*TAP event to inform
userspace of the total number of contacts being tracked by the hardware at that moment. The driver should do this by
explicitly sending the corresponding BTN_TOOL_*TAP event and setting use_count to false when calling
input_mt_report_pointer_emulation() . The driver should only advertise as many slots as the hardware can report.
Userspace can detect that a driver can report more total contacts than slots by noting that the largest supported
BTN_TOOL_*TAP event is larger than the total number of type B slots reported in the absinfo for the ABS_MT_SLOT axis.

The minimum value of the ABS_MT_SLOT axis must be 0.

Protocol Example A
Here is what a minimal event sequence for a two-contact touch would look like for a type A device:

ABS_MT_POSITION_X x[0]
ABS_MT_POSITION_Y y[0]
SYN_MT_REPORT
ABS_MT_POSITION_X x[1]
ABS_MT_POSITION_Y y[1]
SYN_MT_REPORT
SYN_REPORT

The sequence after moving one of the contacts looks exactly the same; the raw data for all present contacts are sent
between every synchronization with SYN_REPORT .

Here is the sequence after lifting the first contact:

ABS_MT_POSITION_X x[1]
ABS_MT_POSITION_Y y[1]
SYN_MT_REPORT
SYN_REPORT

And here is the sequence after lifting the second contact:

SYN_MT_REPORT
SYN_REPORT

If the driver reports one of BTN_TOUCH or ABS_PRESSURE in addition to the ABS_MT events, the last SYN_MT_REPORT event
may be omitted. Otherwise, the last SYN_REPORT will be dropped by the input core, resulting in no zero-contact event
reaching userland.
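
On the receiving side, the type A sequences above could be collected into frames roughly as in this userspace sketch. It is written under assumptions: the exa_code values are local stand-ins for the real event codes, and struct exa_frame, exa_reset() and exa_event() are hypothetical names. Each SYN_MT_REPORT closes one contact packet, and SYN_REPORT closes the frame.

```c
#include <stdbool.h>
#include <string.h>

#define EXA_MAX_CONTACTS 10	/* arbitrary limit for the sketch */

/* Local stand-ins for the event codes used in the sequences above. */
enum exa_code {
	EXA_POSITION_X,
	EXA_POSITION_Y,
	EXA_SYN_MT_REPORT,
	EXA_SYN_REPORT
};

struct exa_contact {
	int x;
	int y;
};

struct exa_frame {
	struct exa_contact contact[EXA_MAX_CONTACTS];
	int count;			/* completed contacts in this frame */
	struct exa_contact pending;	/* packet currently being received */
	bool pending_used;		/* true once pending holds any data */
};

/* Start a new, empty frame. */
static void exa_reset(struct exa_frame *f)
{
	memset(f, 0, sizeof(*f));
}

/*
 * Feed one event into the frame. Returns true when SYN_REPORT ends the
 * frame; the caller then reads count/contact[] and calls exa_reset().
 */
static bool exa_event(struct exa_frame *f, enum exa_code code, int value)
{
	switch (code) {
	case EXA_POSITION_X:
		f->pending.x = value;
		f->pending_used = true;
		break;
	case EXA_POSITION_Y:
		f->pending.y = value;
		f->pending_used = true;
		break;
	case EXA_SYN_MT_REPORT:
		/* An empty packet carries no contact and is not stored. */
		if (f->pending_used && f->count < EXA_MAX_CONTACTS)
			f->contact[f->count++] = f->pending;
		f->pending_used = false;
		break;
	case EXA_SYN_REPORT:
		return true;
	}
	return false;
}
```

Because type A is stateless, every frame starts from scratch: a frame whose only packet is empty ends with a count of zero, which is how the receiver learns that the last contact was lifted.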

Protocol Example B
Here is what a minimal event sequence for a two-contact touch would look like for a type B device:


ABS_MT_SLOT 0
ABS_MT_TRACKING_ID 45
ABS_MT_POSITION_X x[0]
ABS_MT_POSITION_Y y[0]
ABS_MT_SLOT 1
ABS_MT_TRACKING_ID 46
ABS_MT_POSITION_X x[1]
ABS_MT_POSITION_Y y[1]
SYN_REPORT

Here is the sequence after moving contact 45 in the x direction:

ABS_MT_SLOT 0
ABS_MT_POSITION_X x[0]
SYN_REPORT

Here is the sequence after lifting the contact in slot 0:

ABS_MT_TRACKING_ID -1
SYN_REPORT

The slot being modified is already 0, so the ABS_MT_SLOT is omitted. The message removes the association of slot 0 with
contact 45, thereby destroying contact 45 and freeing slot 0 to be reused for another contact.

Finally, here is the sequence after lifting the second contact:

ABS_MT_SLOT 1
ABS_MT_TRACKING_ID -1
SYN_REPORT
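On the receiving side, the type B sequences above can be replayed into a small slot table. The sketch below is a userspace model only: the `ev` codes, `MAX_SLOTS`, and the `mt_*` names are hypothetical, and real code would size the table from the ABS_MT_SLOT range and use the constants from <linux/input.h>.

```c
#include <assert.h>

#define MAX_SLOTS 4   /* hypothetical; real code reads the ABS_MT_SLOT range */

/* Illustrative codes standing in for ABS_MT_SLOT, ABS_MT_TRACKING_ID,
 * ABS_MT_POSITION_X, ABS_MT_POSITION_Y and SYN_REPORT. */
enum ev { SLOT, TRACKING_ID, POS_X, POS_Y, REPORT };

struct contact { int tracking_id, x, y; };  /* tracking_id -1 means unused */

struct mt_state {
    int cur;                          /* slot selected by the last SLOT event */
    struct contact slot[MAX_SLOTS];
};

static void mt_init(struct mt_state *s)
{
    s->cur = 0;
    for (int i = 0; i < MAX_SLOTS; i++)
        s->slot[i] = (struct contact){ -1, 0, 0 };
}

/* Apply one event. Only changed attributes arrive, so everything else
 * in the selected slot keeps its previous value. */
static void mt_handle(struct mt_state *s, enum ev code, int value)
{
    switch (code) {
    case SLOT:
        assert(value >= 0 && value < MAX_SLOTS);
        s->cur = value;
        break;
    case TRACKING_ID: s->slot[s->cur].tracking_id = value; break;
    case POS_X:       s->slot[s->cur].x = value; break;
    case POS_Y:       s->slot[s->cur].y = value; break;
    case REPORT:      break;  /* state is consistent; a consumer would act here */
    }
}

/* Number of slots currently holding a contact. */
static int mt_active(const struct mt_state *s)
{
    int n = 0;

    for (int i = 0; i < MAX_SLOTS; i++)
        n += s->slot[i].tracking_id >= 0;
    return n;
}
```

Feeding the four sequences of this example in order leaves two, two, one and finally zero active slots.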

Event Usage
A set of ABS_MT events with the desired properties is defined. The events are divided into categories, to allow for partial
implementation. The minimum set consists of ABS_MT_POSITION_X and ABS_MT_POSITION_Y, which allows for multiple
contacts to be tracked. If the device supports it, the ABS_MT_TOUCH_MAJOR and ABS_MT_WIDTH_MAJOR may be used to provide
the size of the contact area and approaching tool, respectively.

The TOUCH and WIDTH parameters have a geometrical interpretation; imagine looking through a window at someone
gently holding a finger against the glass. You will see two regions, one inner region consisting of the part of the finger
actually touching the glass, and one outer region formed by the perimeter of the finger. The center of the touching region
(a) is ABS_MT_POSITION_X/Y and the center of the approaching finger (b) is ABS_MT_TOOL_X/Y. The touch diameter is
ABS_MT_TOUCH_MAJOR and the finger diameter is ABS_MT_WIDTH_MAJOR. Now imagine the person pressing the finger harder
against the glass. The touch region will increase, and in general, the ratio ABS_MT_TOUCH_MAJOR / ABS_MT_WIDTH_MAJOR, which
is always smaller than unity, is related to the contact pressure. For pressure-based devices, ABS_MT_PRESSURE may be used
to provide the pressure on the contact area instead. Devices capable of contact hovering can use ABS_MT_DISTANCE to
indicate the distance between the contact and the surface.

          Linux MT                               Win8
         __________                     _______________________
        /          \                   |                       |
       /            \                  |                       |
      /     ____     \                 |                       |
     /     /    \     \                |                       |
     \     \  a  \     \               |       a               |
      \     \____/      \              |                       |
       \                 \             |                       |
        \        b        \            |           b           |
         \                 \           |                       |
          \                 \          |                       |
           \                 \         |                       |
            \                /         |                       |
             \              /          |                       |
              \            /           |                       |
               \__________/            |_______________________|

In addition to the MAJOR parameters, the oval shape of the touch and finger regions can be described by adding the MINOR
parameters, such that MAJOR and MINOR are the major and minor axes of an ellipse. The orientation of the touch ellipse
can be described with the ORIENTATION parameter, and the direction of the finger ellipse is given by the vector (a - b).

For type A devices, further specification of the touch shape is possible via ABS_MT_BLOB_ID.

The ABS_MT_TOOL_TYPE may be used to specify whether the touching tool is a finger or a pen or something else. Finally, the
ABS_MT_TRACKING_ID event may be used to track identified contacts over time [5].

In the type B protocol, ABS_MT_TOOL_TYPE and ABS_MT_TRACKING_ID are implicitly handled by input core; drivers should
instead call input_mt_report_slot_state().

Event Semantics
ABS_MT_TOUCH_MAJOR

The length of the major axis of the contact. The length should be given in surface units. If the surface has an X times Y
resolution, the largest possible value of ABS_MT_TOUCH_MAJOR is sqrt(X^2 + Y^2), the diagonal [4].

ABS_MT_TOUCH_MINOR

The length, in surface units, of the minor axis of the contact. If the contact is circular, this event can be omitted [4].

ABS_MT_WIDTH_MAJOR

The length, in surface units, of the major axis of the approaching tool. This should be understood as the size of the tool
itself. The orientation of the contact and the approaching tool are assumed to be the same [4].

ABS_MT_WIDTH_MINOR

The length, in surface units, of the minor axis of the approaching tool. Omit if circular [4].

The above four values can be used to derive additional information about the contact. The ratio ABS_MT_TOUCH_MAJOR /
ABS_MT_WIDTH_MAJOR approximates the notion of pressure. The fingers of the hand and the palm all have different
characteristic widths.
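As a rough illustration of the ratio mentioned above, a consumer could derive a pressure-like number as sketched below. The function name, the `scale` parameter, and the choice to clamp the ratio at unity are assumptions made for this example, not part of the protocol.

```c
#include <assert.h>

/* Approximate contact pressure as TOUCH_MAJOR / WIDTH_MAJOR, scaled to
 * the range 0..scale. Returns 0 when the tool size is unknown
 * (width_major == 0). */
static int mt_pressure_estimate(int touch_major, int width_major, int scale)
{
    assert(touch_major >= 0 && width_major >= 0 && scale > 0);
    if (width_major == 0)
        return 0;
    if (touch_major > width_major)
        touch_major = width_major;   /* keep the ratio at or below unity */
    return touch_major * scale / width_major;
}
```

With a touch diameter of 5 and a tool diameter of 10, a scale of 100 yields 50.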

ABS_MT_PRESSURE

The pressure, in arbitrary units, on the contact area. May be used instead of TOUCH and WIDTH for pressure-based
devices or any device with a spatial signal intensity distribution.

ABS_MT_DISTANCE

The distance, in surface units, between the contact and the surface. Zero distance means the contact is touching the
surface. A positive number means the contact is hovering above the surface.

ABS_MT_ORIENTATION

The orientation of the touching ellipse. The value should describe a signed quarter of a revolution clockwise around the
touch center. The signed value range is arbitrary, but zero should be returned for an ellipse aligned with the Y axis of the
surface, a negative value when the ellipse is turned to the left, and a positive value when the ellipse is turned to the right.
When completely aligned with the X axis, the range max should be returned.

Touch ellipses are symmetrical by default. For devices capable of true 360 degree orientation, the reported orientation must
exceed the range max to indicate more than a quarter of a revolution. For an upside-down finger, range max * 2 should be
returned.

Orientation can be omitted if the touch area is circular, or if the information is not available in the kernel driver. Partial
orientation support is possible if the device can distinguish between the two axes, but not (uniquely) any values in between.
In such cases, the range of ABS_MT_ORIENTATION should be [0, 1] [4].
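The orientation convention can be made concrete with a small conversion helper. This is only a sketch: the function name is hypothetical, the input is assumed to be an angle in degrees measured clockwise from the surface Y axis, and `range_max` is whatever maximum the driver advertises for ABS_MT_ORIENTATION.

```c
#include <assert.h>

/* Map a clockwise angle from the surface Y axis (in degrees) to an
 * ABS_MT_ORIENTATION value, where range_max corresponds to a quarter
 * of a revolution. Negative angles mean the ellipse is turned left. */
static int mt_orientation_value(int degrees_cw, int range_max)
{
    assert(range_max > 0);
    return degrees_cw * range_max / 90;
}
```

With a range max of 64, an ellipse aligned with the X axis maps to 64, one turned 45 degrees to the left maps to -32, and an upside-down finger on a device with full 360 degree support maps to 128.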

ABS_MT_POSITION_X

The surface X coordinate of the center of the touching ellipse.

ABS_MT_POSITION_Y

The surface Y coordinate of the center of the touching ellipse.

ABS_MT_TOOL_X

The surface X coordinate of the center of the approaching tool. Omit if the device cannot distinguish between the intended
touch point and the tool itself.

ABS_MT_TOOL_Y

The surface Y coordinate of the center of the approaching tool. Omit if the device cannot distinguish between the intended
touch point and the tool itself.

The four position values can be used to separate the position of the touch from the position of the tool. If both positions are
present, the major tool axis points towards the touch point [1]. Otherwise, the tool axes are aligned with the touch axes.

ABS_MT_TOOL_TYPE

The type of approaching tool. A lot of kernel drivers cannot distinguish between different tool types, such as a finger or a
pen. In such cases, the event should be omitted. The protocol currently supports MT_TOOL_FINGER and MT_TOOL_PEN [2]. For
type B devices, this event is handled by input core; drivers should instead use input_mt_report_slot_state().

ABS_MT_BLOB_ID

The BLOB_ID groups several packets together into one arbitrarily shaped contact. The sequence of points forms a polygon
which defines the shape of the contact. This is a low-level anonymous grouping for type A devices, and should not be
confused with the high-level trackingID [5]. Most type A devices do not have blob capability, so drivers can safely omit this
event.

ABS_MT_TRACKING_ID

The TRACKING_ID identifies an initiated contact throughout its life cycle [5]. The value range of the TRACKING_ID should be
large enough to ensure unique identification of a contact maintained over an extended period of time. For type B devices,
this event is handled by input core; drivers should instead use input_mt_report_slot_state().

Event Computation
The flora of different hardware unavoidably leads to some devices fitting better to the MT protocol than others. To simplify
and unify the mapping, this section gives recipes for how to compute certain events.

For devices reporting contacts as rectangular shapes, signed orientation cannot be obtained. Assuming X and Y are the
lengths of the sides of the touching rectangle, here is a simple formula that retains the most information possible:

ABS_MT_TOUCH_MAJOR := max(X, Y)
ABS_MT_TOUCH_MINOR := min(X, Y)
ABS_MT_ORIENTATION := bool(X > Y)

The range of ABS_MT_ORIENTATION should be set to [0, 1], to indicate that the device can distinguish between a finger along
the Y axis (0) and a finger along the X axis (1).
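The rectangle recipe translates directly into code. The structure and function names below are hypothetical; the three assignments are the ones given in the formula above.

```c
#include <assert.h>

struct mt_shape { int touch_major, touch_minor, orientation; };

/* Derive the touch ellipse parameters from a touching rectangle whose
 * sides have lengths x and y along the surface X and Y axes. */
static struct mt_shape mt_shape_from_rect(int x, int y)
{
    struct mt_shape s;

    assert(x >= 0 && y >= 0);
    s.touch_major = x > y ? x : y;   /* max(X, Y) */
    s.touch_minor = x < y ? x : y;   /* min(X, Y) */
    s.orientation = x > y;           /* 1: along the X axis, 0: along Y */
    return s;
}
```

A 30 by 10 rectangle gives major 30, minor 10 and orientation 1; a square gives equal axes and orientation 0.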

For win8 devices with both T and C coordinates, the position mapping is

ABS_MT_POSITION_X := T_X
ABS_MT_POSITION_Y := T_Y
ABS_MT_TOOL_X := C_X
ABS_MT_TOOL_Y := C_Y

Unfortunately, there is not enough information to specify both the touching ellipse and the tool ellipse, so one has to resort
to approximations. One simple scheme, which is compatible with earlier usage, is:

ABS_MT_TOUCH_MAJOR := min(X, Y)
ABS_MT_TOUCH_MINOR := <not used>
ABS_MT_ORIENTATION := <not used>
ABS_MT_WIDTH_MAJOR := min(X, Y) + distance(T, C)
ABS_MT_WIDTH_MINOR := min(X, Y)

Rationale: We have no information about the orientation of the touching ellipse, so approximate it with an inscribed circle
instead. The tool ellipse should align with the vector (T - C), so the diameter must increase with distance(T, C). Finally,
assume that the touch diameter is equal to the tool thickness, and we arrive at the formulas above.
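The approximation scheme can be sketched as below. All names are hypothetical, X and Y are taken to be the reported side lengths, and the distance is rounded down to an integer with a simple integer square root so that the example does not depend on the math library.

```c
#include <assert.h>

struct pt { int x, y; };
struct mt_size { int touch_major, width_major, width_minor; };

/* Integer square root (rounded down); a linear search is enough here. */
static unsigned isqrt(unsigned long long v)
{
    unsigned long long r = 0;

    while ((r + 1) * (r + 1) <= v)
        r++;
    return (unsigned)r;
}

/* Apply the approximation: the touch is an inscribed circle of diameter
 * min(X, Y), and the tool ellipse grows with distance(T, C). */
static struct mt_size mt_size_win8(int x, int y, struct pt t, struct pt c)
{
    assert(x >= 0 && y >= 0);

    long long dx = (long long)t.x - c.x;
    long long dy = (long long)t.y - c.y;
    int d = (int)isqrt((unsigned long long)(dx * dx + dy * dy));
    int m = x < y ? x : y;

    return (struct mt_size){ m, m + d, m };
}
```

When T and C coincide, the tool ellipse collapses to a circle of the same diameter as the touch.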

Finger Tracking
The process of finger tracking, i.e., to assign a unique trackingID to each initiated contact on the surface, is a Euclidean
Bipartite Matching problem. At each event synchronization, the set of actual contacts is matched to the set of contacts from
the previous synchronization. A full implementation can be found in [3].
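To make the matching step concrete, here is a brute-force sketch for a handful of contacts. It is not the mtdev algorithm: it simply tries every assignment, uses the squared Euclidean distance as the cost (an assumption made here), and assumes the old and new sets have the same size. All names, including `MAX_CONTACTS`, are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define MAX_CONTACTS 5   /* hypothetical bound; keeps the search tiny */

struct pt { int x, y; };

static long dist2(struct pt a, struct pt b)
{
    long dx = a.x - b.x, dy = a.y - b.y;

    return dx * dx + dy * dy;
}

/* Depth-first search over all assignments of new contacts to old ones,
 * keeping the assignment with the smallest total cost in out[]. */
static void search(const struct pt *old, const struct pt *cur, int n, int i,
                   unsigned used, long cost, int *pick, long *best, int *out)
{
    if (i == n) {
        if (cost < *best) {
            *best = cost;
            for (int k = 0; k < n; k++)
                out[k] = pick[k];
        }
        return;
    }
    for (int j = 0; j < n; j++) {
        if (used & (1u << j))
            continue;               /* cur[j] is already taken */
        pick[i] = j;
        search(old, cur, n, i + 1, used | (1u << j),
               cost + dist2(old[i], cur[j]), pick, best, out);
    }
}

/* out[i] receives the index in cur[] of the contact that continues old[i]. */
static void mt_match(const struct pt *old, const struct pt *cur, int n, int *out)
{
    int pick[MAX_CONTACTS];
    long best = LONG_MAX;

    assert(n >= 0 && n <= MAX_CONTACTS);
    search(old, cur, n, 0, 0, 0, pick, &best, out);
}
```

After the call, the contact cur[out[i]] would inherit the trackingID of old[i]. Handling contacts that appear or disappear between synchronizations needs additional logic that this sketch leaves out.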

Gestures
In the specific application of creating gesture events, the TOUCH and WIDTH parameters can be used to, e.g.,
approximate finger pressure or distinguish between index finger and thumb. With the addition of the MINOR parameters,
one can also distinguish between a sweeping finger and a pointing finger, and with ORIENTATION, one can detect twisting
of fingers.

Notes
In order to stay compatible with existing applications, the data reported in a finger packet must not be recognized as
single-touch events.

For type A devices, all finger data bypasses input filtering, since subsequent events of the same type refer to different
fingers.

For example usage of the type A protocol, see the bcm5974 driver. For example usage of the type B protocol, see the
hid-egalax driver.

[1] Also, the difference ( TOOL_X - POSITION_X ) can be used to model tilt.
[2] The list can of course be extended.
[3] The mtdev project: http://bitmath.org/code/mtdev/.
[4] See the section on event computation.
[5] See the section on finger tracking.

Namespaces

Various information about namespaces.

Resource Control

There are many kinds of objects in the kernel that don't have individual limits, or that have limits that are ineffective when a
set of processes is allowed to switch user ids. With user namespaces enabled in a kernel for people who don't trust their
users or their users' programs to play nice, this problem becomes more acute.

Therefore it is recommended that memory control groups be enabled in kernels that enable user namespaces, and it is
further recommended that userspace configure memory control groups to limit how much memory users they don't trust to
play nice can use.

Memory control groups can be configured by installing the libcgroup package (present on most distros), editing
/etc/cgrules.conf and /etc/cgconfig.conf, and setting up libpam-cgroup.

Virtual Memory

00-INDEX
- this file.
active_mm.txt
- An explanation from Linus about tsk->active_mm vs tsk->mm.
balance
- various information on memory balancing.
cleancache.txt
- Intro to cleancache and page-granularity victim cache.
frontswap.txt
- Outline frontswap, part of the transcendent memory frontend.
highmem.txt
- Outline of highmem and common issues.
hugetlbpage.txt
- a brief summary of hugetlbpage support in the Linux kernel.
hwpoison.txt
- explains what hwpoison is
ksm.txt
- how to use the Kernel Samepage Merging feature.
numa
- information about NUMA specific code in the Linux vm.
numa_memory_policy.txt
- documentation of concepts and APIs of the 2.6 memory policy support.
overcommit-accounting
- description of the Linux kernel's overcommit handling modes.
page_migration
- description of page migration in NUMA systems.
pagemap.txt
- pagemap, from the userspace perspective
slub.txt
- a short users guide for SLUB.
soft-dirty.txt
- short explanation for soft-dirty PTEs
split_page_table_lock
- Separate per-table lock to improve scalability of the old page_table_lock.
transhuge.txt
- Transparent Hugepage Support, alternative way of using hugepages.
unevictable-lru.txt
- Unevictable LRU infrastructure
zswap.txt
- Intro to compressed cache for swap pages

HIGH MEMORY HANDLING

Contents
WHAT IS HIGH MEMORY?
TEMPORARY VIRTUAL MAPPINGS
USING KMAP_ATOMIC
COST OF TEMPORARY MAPPINGS
i386 PAE

By: Peter Zijlstra

WHAT IS HIGH MEMORY?


High memory (highmem) is used when the size of physical memory approaches or exceeds the maximum size of virtual
memory. At that point it becomes impossible for the kernel to keep all of the available physical memory mapped at all times.
This means the kernel needs to start using temporary mappings of the pieces of physical memory that it wants to access.

The part of (physical) memory not covered by a permanent mapping is what we refer to as 'highmem'. There are various
architecture dependent constraints on where exactly that border lies.

In the i386 arch, for example, we choose to map the kernel into every process's VM space so that we don't have to pay the
full TLB invalidation costs for kernel entry/exit. This means the available virtual memory space (4GiB on i386) has to be
divided between user and kernel space.

The traditional split for architectures using this approach is 3:1, 3GiB for userspace and the top 1GiB for kernel space:

+--------+ 0xffffffff
| Kernel |
+--------+ 0xc0000000
| |
| User |
| |
+--------+ 0x00000000

This means that the kernel can at most map 1GiB of physical memory at any one time, but because we need virtual
address space for other things - including temporary maps to access the rest of the physical memory - the actual direct
map will typically be less (usually around ~896MiB).
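The effect of the split can be illustrated with a little arithmetic. The sketch below assumes the roughly 896MiB direct-map figure quoted above for the 3:1 split; the structure and function names are hypothetical, and the real boundary depends on architecture and configuration.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1024ull * 1024ull)
#define DIRECT_MAP_MAX (896 * MiB)  /* typical i386 figure from the text */

struct mem_split { uint64_t lowmem, highmem; };

/* Split physical memory into the permanently mapped part (lowmem) and
 * the part that needs temporary mappings (highmem). */
static struct mem_split split_memory(uint64_t phys_bytes)
{
    struct mem_split s;

    s.lowmem = phys_bytes < DIRECT_MAP_MAX ? phys_bytes : DIRECT_MAP_MAX;
    s.highmem = phys_bytes - s.lowmem;
    assert(s.lowmem + s.highmem == phys_bytes);
    return s;
}
```

Under these assumptions a machine with 512MiB of RAM has no highmem at all, while a machine with 4GiB has 896MiB of lowmem and 3200MiB of highmem.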

Other architectures that have mm context tagged TLBs can have separate kernel and user maps. Some hardware (like
some ARMs), however, has limited virtual space when it uses mm context tags.

TEMPORARY VIRTUAL MAPPINGS


The kernel contains several ways of creating temporary mappings:

vmap(). This can be used to make a long duration mapping of multiple physical pages into a contiguous virtual space.
It needs global synchronization to unmap.

kmap(). This permits a short duration mapping of a single page. It needs global synchronization, but is amortized
somewhat. It is also prone to deadlocks when used in a nested fashion, and so it is not recommended for new code.

kmap_atomic(). This permits a very short duration mapping of a single page. Since the mapping is restricted to the
CPU that issued it, it performs well, but the issuing task is therefore required to stay on that CPU until it has finished,
lest some other task displace its mappings.

kmap_atomic() may also be used by interrupt contexts, since it does not sleep and the caller may not sleep until
after kunmap_atomic() is called.

It may be assumed that k[un]map_atomic() won't fail.

USING KMAP_ATOMIC
When and where to use kmap_atomic() is straightforward. It is used when code wants to access the contents of a page that
might be allocated from high memory (see __GFP_HIGHMEM), for example a page in the pagecache. The API has two
functions, and they can be used in a manner similar to the following:

/* Find the page of interest. */
struct page *page = find_get_page(mapping, offset);

/* Gain access to the contents of that page. */
void *vaddr = kmap_atomic(page);

/* Do something to the contents of that page. */
memset(vaddr, 0, PAGE_SIZE);

/* Unmap that page. */
kunmap_atomic(vaddr);

Note that the kunmap_atomic() call takes the result of the kmap_atomic() call, not the argument.

If you need to map two pages because you want to copy from one page to another, you need to keep the kmap_atomic() calls
strictly nested, like:

vaddr1 = kmap_atomic(page1);
vaddr2 = kmap_atomic(page2);

memcpy(vaddr1, vaddr2, PAGE_SIZE);

kunmap_atomic(vaddr2);
kunmap_atomic(vaddr1);
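The strict nesting rule can be pictured as a small per-CPU stack of mapping slots. The following is a userspace model written for illustration only, not the kernel's implementation: `km_stack`, `km_map()`, `km_unmap()` and `KM_SLOTS` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define KM_SLOTS 4   /* hypothetical number of per-CPU mapping slots */

struct km_stack {
    const void *slot[KM_SLOTS];  /* what each slot currently maps */
    int depth;                   /* number of live mappings */
};

/* Model of kmap_atomic(): take the next free slot and return its index. */
static int km_map(struct km_stack *s, const void *page)
{
    assert(s->depth < KM_SLOTS);
    s->slot[s->depth] = page;
    return s->depth++;
}

/* Model of kunmap_atomic(): only the most recent mapping may be released.
 * Returns false if the caller broke the strict nesting rule. */
static bool km_unmap(struct km_stack *s, int idx)
{
    if (s->depth == 0 || idx != s->depth - 1)
        return false;
    s->slot[--s->depth] = NULL;
    return true;
}
```

In this model, releasing the first mapping while the second is still live is rejected, which mirrors why the unmap calls in the example appear in the reverse order of the map calls.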

COST OF TEMPORARY MAPPINGS


The cost of creating temporary mappings can be quite high. The arch has to manipulate the kernel's page tables, the data
TLB and/or the MMU's registers.

If CONFIG_HIGHMEM is not set, then the kernel will try and create a mapping simply with a bit of arithmetic that will
convert the page struct address into a pointer to the page contents rather than juggling mappings about. In such a case, the
unmap operation may be a null operation.

If CONFIG_MMU is not set, then there can be no temporary mappings and no highmem. In such a case, the arithmetic
approach will also be used.
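The "bit of arithmetic" can be sketched as follows. This is a simplified userspace model under several assumptions: a flat array of page descriptors starting at page frame 0, 4KiB pages, and the 0xc0000000 kernel base of the 3:1 split. The `struct page` here is a stand-in, and `page_to_vaddr()` is a hypothetical name rather than a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT  12              /* 4KiB pages, as on i386 */
#define PAGE_OFFSET 0xc0000000u     /* start of the kernel map in the 3:1 split */

struct page { unsigned long flags; };   /* stand-in for the kernel's struct page */

/* Address of the page contents, computed from the descriptor's position
 * in the mem_map array; no page table changes are involved. */
static uint32_t page_to_vaddr(const struct page *mem_map, const struct page *page)
{
    ptrdiff_t pfn = page - mem_map;

    assert(pfn >= 0);
    return PAGE_OFFSET + ((uint32_t)pfn << PAGE_SHIFT);
}
```

Since the result depends only on the descriptor's index, there is nothing to undo afterwards, which is why the unmap operation can be empty in this configuration.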

i386 PAE
The i386 arch, under some circumstances, will permit you to stick up to 64GiB of RAM into your 32-bit machine. This has a
number of consequences:

Linux needs a page-frame structure for each page in the system and the pageframes need to live in the permanent
mapping, which means:

you can have 896M/sizeof(struct page) page-frames at most; with struct page being 32 bytes that would end up being
something on the order of 112G worth of pages; the kernel, however, needs to store more than just page-frames in that
memory...

PAE makes your page tables larger - which slows the system down as more data has to be accessed to traverse in
TLB fills and the like. One advantage is that PAE has more PTE bits and can provide advanced features like NX and
PAT.
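The page-frame estimate in the first consequence above can be checked with a few lines of arithmetic. The function names are hypothetical, and the 4KiB page size is an assumption (the usual i386 value); the 896M and 32-byte figures come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* How many struct page descriptors fit in the permanent mapping. */
static uint64_t max_page_frames(uint64_t direct_map_bytes, uint64_t page_struct_bytes)
{
    assert(page_struct_bytes > 0);
    return direct_map_bytes / page_struct_bytes;
}

/* How much physical memory that many page frames can describe. */
static uint64_t describable_bytes(uint64_t frames, uint64_t page_size)
{
    return frames * page_size;
}
```

With 896MiB of permanent mapping and 32-byte descriptors this gives 29360128 page frames, which at 4KiB per page describe 112GiB. This is an upper bound only, since the same memory must hold much more than page-frame structures.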

The general recommendation is that you don't use more than 8GiB on a 32-bit machine - although more might work for you
and your workload, you're pretty much on your own - don't expect kernel developers to really care much if things come
apart.
