Kernel Internals
Santosh Sam Koshy, santoshk@cdac.in
Centre for Development of Advanced Computing, Hyderabad
Agenda
IOCTL
Kernel Synchronization Techniques
Wait Queues
Time Delays
Deferred Execution
IOCTL
Most drivers need, in addition to the ability to read and write the device, the ability to perform various types of hardware control through the device driver. These operations are normally supported via the ioctl method. In user space, the ioctl call has the following format:
int ioctl(int fd, unsigned long cmd, ...);

The ioctl driver method has the prototype:

int (*ioctl)(struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg);
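A driver's ioctl method is typically a switch over the command number. A minimal sketch of that shape, using the 2.6-era prototype above (the MYDEV_* commands and the read_status() helper are hypothetical, not from this deck):

```c
/* Kernel-module fragment (2.6-era prototype); not a standalone program. */
static int my_ioctl(struct inode *inode, struct file *filp,
                    unsigned int cmd, unsigned long arg)
{
    switch (cmd) {
    case MYDEV_RESET:                     /* hypothetical command */
        /* ... reset the hardware ... */
        return 0;
    case MYDEV_GET_STATUS:                /* hypothetical command */
        /* copy a status int back to user space */
        return put_user(read_status(), (int __user *)arg);
    default:
        return -ENOTTY;   /* the conventional "no such ioctl" error */
    }
}
```

Returning -ENOTTY for unknown commands is the convention user space expects from an ioctl implementation.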
14/11/2012
Magic Numbers
Magic numbers are the mechanism for identifying the commands belonging to a particular device. They must be unique across the system. The kernel composes each command number from four bit fields:

type: the magic number chosen for the driver (registered in Documentation/ioctl-number.txt); 8 bits wide
number: the ordinal (sequential) command number; also 8 bits wide
direction: the direction of data transfer; 2 bits
size: the size of the user data involved; architecture dependent, generally limited to 13 or 14 bits
Kernel Synchronization
Agenda
Sources of Concurrency in the Kernel
Mechanisms to manage concurrency:

Semaphores
RW Semaphores
Spinlocks
RW Spinlocks
Completions
Atomic Variables
Sequential Locks
RCU
What is Synchronization
In the kernel, many tasks execute pseudo-concurrently, which can lead to data inconsistencies when they access a common resource. Well-defined coordination between tasks accessing shared data is a must, and this coordination is what synchronization provides.
Sources of Concurrency
In a Linux system, numerous processes may be executing in user space, making system calls into the kernel
SMP systems can execute your code concurrently on several CPUs
Kernel code is preemptible
Interrupts are asynchronous events that can cause concurrent execution
The kernel provides deferred-execution mechanisms that run code at unpredictable times
Hot-pluggable devices can disappear while your code is using them
This calls for resource-access management, brought about by mechanisms called locking or mutual exclusion: making sure that only one thread of execution can manipulate a shared resource at a time.
Semaphores
At its core, a semaphore is a single integer value combined with a pair of functions, typically called up and down. To use semaphores, the code must include asm/semaphore.h. The kernel's semaphore implementation is just a structure, struct semaphore:
struct semaphore {
        atomic_t count;
        int sleepers;
        wait_queue_head_t wait;
};
Semaphores
There are two ways of creating a semaphore. The dynamic way uses the function
void sema_init(struct semaphore *sem, int val)
The val argument specifies the initial value of the semaphore. Setting it to 1 creates a binary semaphore, or mutex (a mutual-exclusion semaphore).
Semaphores
Semaphores may also be created in the mutex mode by the following functions
DECLARE_MUTEX(name);
DECLARE_MUTEX_LOCKED(name);
Semaphores
Semaphores may be accessed by calling one of the following functions
void down(struct semaphore *sem);
int down_interruptible(struct semaphore *sem);
int down_trylock(struct semaphore *sem);
Once access to the critical section is completed, the semaphore may be released by the function
void up(struct semaphore *sem);
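The usual driver pattern combines down_interruptible with up around the critical section. A minimal sketch under the 2.6-era API described above (the my_dev structure and its shared_count field are hypothetical):

```c
/* Kernel-module fragment (2.6-era API); not a standalone program. */
#include <asm/semaphore.h>
#include <linux/errno.h>

struct my_dev {
    struct semaphore sem;    /* initialize with sema_init(&dev->sem, 1) */
    int shared_count;
};

static int my_dev_increment(struct my_dev *dev)
{
    /* Sleep until the semaphore is ours; give up if a signal arrives. */
    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;

    dev->shared_count++;     /* critical section */

    up(&dev->sem);           /* release */
    return 0;
}
```

Returning -ERESTARTSYS lets the upper layers of the kernel either restart the system call or return -EINTR to user space.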
Reader/Writer Semaphores
Code using reader/writer semaphores (rwsems) must include linux/rwsem.h. The relevant data type is struct rw_semaphore. An rwsem must be explicitly initialized at run time using:
void init_rwsem(struct rw_semaphore *sem)
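A sketch of typical rwsem use, assuming the 2.6-era API (the config_value being protected is hypothetical):

```c
/* Kernel-module fragment (2.6-era API); not a standalone program. */
#include <linux/rwsem.h>

static struct rw_semaphore my_rwsem;   /* init_rwsem(&my_rwsem) at load time */
static int config_value;               /* hypothetical protected data */

static int read_config(void)
{
    int v;
    down_read(&my_rwsem);    /* any number of readers may enter concurrently */
    v = config_value;
    up_read(&my_rwsem);
    return v;
}

static void write_config(int v)
{
    down_write(&my_rwsem);   /* exclusive: excludes readers and writers; may sleep */
    config_value = v;
    up_write(&my_rwsem);
}
```

Rwsems pay off when reads vastly outnumber writes; otherwise a plain semaphore is simpler and often faster.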
Spinlocks
A spinlock is a mutual-exclusion device that can have only two values: locked and unlocked. It is usually implemented as a single bit in an integer value. Code wishing to take a particular lock tests the relevant bit. Unlike semaphores, spinlocks may be used in code that cannot sleep: if the lock is already held, the code goes into a tight loop, repeatedly checking the lock until it becomes available.
Spinlocks
Spinlocks are intended for use on multiprocessor systems, although a uniprocessor workstation running a preemptive kernel behaves like SMP. If a non-preemptive uniprocessor system ever went into a spinlock, it would spin forever; no other thread would ever be able to obtain the CPU to release the lock. For this reason, on non-preemptive uniprocessor builds the Linux spinlock operations are compiled away to nothing.
Spinlocks
The required include file for spinlock primitives is linux/spinlock.h. A spinlock has the type spinlock_t and has to be initialized before it is used The static initialization for a spinlock is done by
spinlock_t my_lock = SPIN_LOCK_UNLOCKED;
or at runtime as
void spin_lock_init(spinlock_t *lock);
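When the data can also be touched from an interrupt handler, the lock must be taken with local interrupts disabled, or the handler could deadlock spinning on a lock its own CPU holds. A sketch using the 2.6-era API (the packet_count being protected is hypothetical):

```c
/* Kernel-module fragment (2.6-era API); not a standalone program. */
#include <linux/spinlock.h>

static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;
static unsigned long packet_count;      /* hypothetical shared counter */

/* Called from process context that can race with an interrupt handler:
 * take the lock and disable local interrupts in one step. */
static void count_packet(void)
{
    unsigned long flags;

    spin_lock_irqsave(&my_lock, flags);
    packet_count++;                     /* critical section: keep it short */
    spin_unlock_irqrestore(&my_lock, flags);
}
```

spin_lock_irqsave remembers the previous interrupt state in flags, so the code works whether or not interrupts were already disabled on entry.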
Reader/Writer Spinlocks
Using reader/writer spinlocks is similar to using rwsems. They may be initialized statically:

rwlock_t my_rwlock = RW_LOCK_UNLOCKED;

or dynamically:

rwlock_t my_rwlock;
rwlock_init(&my_rwlock);
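The lock and unlock calls come in read and write flavours. A sketch under the 2.6-era API (the table array is hypothetical):

```c
/* Kernel-module fragment (2.6-era API); not a standalone program. */
#include <linux/spinlock.h>

static rwlock_t my_rwlock = RW_LOCK_UNLOCKED;
static int table[16];       /* hypothetical shared data */

static int lookup(int i)
{
    int v;
    read_lock(&my_rwlock);      /* any number of readers may hold this */
    v = table[i];
    read_unlock(&my_rwlock);
    return v;
}

static void update(int i, int v)
{
    write_lock(&my_rwlock);     /* exclusive: excludes readers and writers */
    table[i] = v;
    write_unlock(&my_rwlock);
}
```

As with all spinlocks, neither side may sleep while holding the lock.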
Semaphores vs Spinlocks

Requirement                              Recommended Lock
Low overhead locking                     Spinlock
Short lock hold time                     Spinlock
Long lock hold time                      Semaphore
Need to lock from interrupt context      Spinlock
Need to sleep while holding lock         Semaphore
Completions
A common pattern in kernel programming is initiating some activity outside the current execution flow and then waiting for that activity to complete. Consider the following code snippet:
struct semaphore sem;

init_MUTEX_LOCKED(&sem);
start_external_task(&sem);
down(&sem);
Completions
Completions are a simple, lightweight mechanism with one task: allowing one thread to tell another that the job is done. A completion can be created with:
DECLARE_COMPLETION(my_completion);
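The pattern that replaces the semaphore snippet above is complete() on the signalling side and wait_for_completion() on the waiting side. A sketch under the 2.6-era API (worker_done() would be called from whatever thread or callback finishes the external task; that caller is hypothetical):

```c
/* Kernel-module fragment (2.6-era API); not a standalone program. */
#include <linux/completion.h>

static DECLARE_COMPLETION(my_completion);

/* Signalling side: the external activity has finished. */
static void worker_done(void)
{
    complete(&my_completion);            /* wake one waiter */
}

/* Waiting side: sleeps (uninterruptibly) until complete() is called. */
static void wait_for_worker(void)
{
    wait_for_completion(&my_completion);
}
```

Unlike the semaphore trick, completions were designed for exactly this start-then-wait pattern and avoid the subtle races of a semaphore being freed while up() is still running.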
Atomic Variables
Atomic variables are special data types that are provided by the kernel, to perform simple operations in an atomic manner. The kernel provides an atomic integer type called atomic_t and a set of functions that have to be used to perform operations on the atomic variables. The operations are very fast, because they compile to a simple machine instruction whenever possible
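A sketch of the common operations, using a reference count as the example (the refcount variable is hypothetical; the API is the 2.6-era one):

```c
/* Kernel-module fragment (2.6-era API); not a standalone program. */
#include <asm/atomic.h>

static atomic_t refcount = ATOMIC_INIT(0);

static void get_ref(void)
{
    atomic_inc(&refcount);              /* atomic increment */
}

static int put_ref(void)
{
    /* Decrement and test against zero in a single atomic step;
     * returns nonzero when the count has reached zero. */
    return atomic_dec_and_test(&refcount);
}
```

The combined operations like atomic_dec_and_test matter: a separate decrement followed by a read would reintroduce the race the atomic type exists to prevent.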
seqlocks
Seqlocks are a feature added in the 2.6 kernel, intended to provide fast, lockless access to a shared resource. They work in situations where write access is rare but must be fast. They allow readers free access to the resource, but require those readers to check for collisions with writers and, when a collision occurs, retry their access. Seqlocks cannot be used to protect data structures involving pointers, because a reader may be following a pointer that becomes invalid while a writer is changing the data structure.
seqlocks
Seqlocks are defined in linux/seqlock.h. It may be initialized by
seqlock_t lock1 = SEQLOCK_UNLOCKED;
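The read side is a retry loop around a sequence-number check, while writers take a real lock. A sketch under the 2.6-era API (the soft_clock value is hypothetical):

```c
/* Kernel-module fragment (2.6-era API); not a standalone program. */
#include <linux/seqlock.h>

static seqlock_t time_lock = SEQLOCK_UNLOCKED;
static unsigned long long soft_clock;            /* hypothetical protected data */

static unsigned long long read_clock(void)
{
    unsigned long long v;
    unsigned seq;

    do {
        seq = read_seqbegin(&time_lock);   /* snapshot the sequence count */
        v = soft_clock;
    } while (read_seqretry(&time_lock, seq));  /* retry if a writer intervened */
    return v;
}

static void tick_clock(void)
{
    write_seqlock(&time_lock);             /* writers exclude each other */
    soft_clock++;
    write_sequnlock(&time_lock);
}
```

Readers never block writers, which is why the write path stays fast; the cost is that a reader may have to repeat its (cheap) read.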
[Figure: readers 1 and 2 following pointers into a shared resource]
Agenda
Wait Queues
HZ
Jiffies
Long Delays
Kernel Timers
Tasklets
Work Queues
Wait Queues
Wait queues are the mechanism for putting a process to sleep whenever the kernel driver cannot yet satisfy the process's request. When a process is put to sleep, it is marked as being in a special state and removed from the scheduler's run queue; it will not be scheduled again until some event changes that state. Linux defines two special states that represent waiting: TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE.
A wait queue head can be declared statically with DECLARE_WAIT_QUEUE_HEAD(name) or initialized at run time with init_waitqueue_head(&queue). This only creates an empty wait queue list; tasks are appended to it later, when they go to sleep.
The wake-up is performed either by another process or by an interrupt handler. It establishes the condition being waited for and then calls one of the appropriate functions:

void wake_up(wait_queue_head_t *queue);
void wake_up_interruptible(wait_queue_head_t *queue);
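Putting the two sides together, a sketch of the sleep/wake handshake using the 2.6-era wait_event_interruptible convenience macro (the data_ready condition flag is hypothetical):

```c
/* Kernel-module fragment (2.6-era API); not a standalone program. */
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/errno.h>

static DECLARE_WAIT_QUEUE_HEAD(my_queue);
static int data_ready;          /* hypothetical condition */

/* Sleeping side: block in TASK_INTERRUPTIBLE until data_ready is true. */
static int wait_for_data(void)
{
    if (wait_event_interruptible(my_queue, data_ready))
        return -ERESTARTSYS;    /* woken early by a signal */
    return 0;
}

/* Waking side (another process or an interrupt handler):
 * establish the condition first, then wake. */
static void data_arrived(void)
{
    data_ready = 1;
    wake_up_interruptible(&my_queue);
}
```

The order on the waking side matters: set the condition before calling wake_up, or the sleeper may check the condition, find it false, and go straight back to sleep.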
If a process calls write and there is no space in the buffer, the process must block, and it must sleep on a different wait queue from the one used for reading. When some data has been written to the hardware device and space becomes free in the output buffer, the process is awakened and the write call succeeds, although the data may be only partially written if there isn't room in the buffer for the requested count bytes.
Exclusive Waits
Thundering Herd
When using wait queues, we may encounter a situation in which many processes are waiting for the same event. On wake-up, all of the waiting processes are made runnable, and this herd of processes thunders in together to compete for exclusive access to the shared resource. Only one of them wins the resource; the rest must go back to sleep. If it happens frequently, this thundering of processes competing for the CPU can degrade overall system performance. The problem is known as the Thundering Herd problem, and it is addressed with the exclusive wait mechanism.
Exclusive Waits
In response to the thundering herd problem, kernel developers added an exclusive wait option to the kernel. An exclusive wait differs from a normal sleep in two important ways:

When a wait queue entry has the WQ_FLAG_EXCLUSIVE flag set, it is added to the end of the wait queue; entries without that flag are added to the beginning
When wake_up is called on a wait queue, it stops after waking the first process that has the WQ_FLAG_EXCLUSIVE flag set
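A sketch of the manual-sleep idiom with an exclusive wait, under the 2.6-era API (the resource_free condition is hypothetical, and the locking a real driver would need around it is omitted for brevity):

```c
/* Kernel-module fragment (2.6-era API); not a standalone program. */
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/errno.h>

static DECLARE_WAIT_QUEUE_HEAD(excl_queue);
static int resource_free;     /* set elsewhere when the resource is released */

static int acquire_resource(void)
{
    DEFINE_WAIT(wait);

    while (!resource_free) {
        /* Queues us at the TAIL with WQ_FLAG_EXCLUSIVE set, so a later
         * wake_up() stops after waking one exclusive sleeper. */
        prepare_to_wait_exclusive(&excl_queue, &wait, TASK_INTERRUPTIBLE);
        if (!resource_free)
            schedule();
        finish_wait(&excl_queue, &wait);
        if (signal_pending(current))
            return -ERESTARTSYS;
    }
    resource_free = 0;        /* we own the resource now */
    return 0;
}
```

Because only one exclusive sleeper is woken per wake_up, the rest of the herd stays asleep until the resource is released again.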
HZ
The kernel keeps track of the flow of time by means of timer interrupts, generated by the system's timing hardware at regular intervals. This interval is programmed at boot time by the kernel according to the value of HZ, an architecture-dependent constant. Default values range from 50 to 1200, and HZ is typically 100 or 1000 on x86 machines. Changing HZ takes effect only after recompiling the kernel with the new value.
Jiffies
Every time a timer interrupt occurs, an internal kernel counter is incremented. The counter is initialized to 0 at system boot, so it represents the number of timer ticks since the last boot. The counter is a 64-bit variable called jiffies_64. Driver writers, however, normally access the jiffies variable, an unsigned long that is either the same as jiffies_64 or its least significant bits.
Delaying Execution
Long Delays:

Occasionally a driver needs to delay execution for relatively long periods: more than one clock tick. There are a few ways of implementing this.

Busy Waiting:

j = jiffies;
delay = j + 5 * HZ;    /* a delay of 5 seconds from now */
while (time_before(jiffies, delay))
    ;                  /* spin */
Delaying Execution
This method busy-loops in the while statement, hogging the CPU with no productive outcome.

Yielding the Processor:
while (time_before(jiffies, delay))
    schedule();    /* yield the CPU */
The advantage of this method is that another process may get access to the CPU while we wait. The requested delay is guaranteed to elapse, but the process may not be scheduled again exactly when the delay expires.
Delaying Execution
Short Delays:

The kernel implements functions that provide delays shorter than what the jiffies counter can resolve. These are implemented as busy loops, calibrated per architecture:

void ndelay(unsigned long nsecs);
void udelay(unsigned long usecs);
void mdelay(unsigned long msecs);
Kernel Timers
Kernel timers are used to schedule execution of a function at a later time, based on the clock tick. A kernel timer is a data structure that instructs the kernel to execute a user-defined function, with a user-defined argument, at a user-defined time. The declaration can be found in linux/timer.h and the implementation in kernel/timer.c.
Kernel Timers
The scheduled function may run when the process that registered it is no longer executing; timer functions run asynchronously, as in interrupt context. Kernel timers may be considered software interrupt handlers, and their implementations carry the corresponding constraints: primarily, they must be atomic, with the additional restrictions that come with running in interrupt context. A timer runs on the same CPU that registered it.
The timer is described by struct timer_list, whose relevant members are:

struct timer_list {
        /* ... */
        unsigned long expires;
        void (*function)(unsigned long);
        unsigned long data;
};

The member expires holds the jiffies value at which the timer should fire; function points to the user-defined function, and data is the argument passed to it.
This timer may be added to and deleted from the kernel using the functions:
void add_timer(struct timer_list *timer); void del_timer(struct timer_list *timer);
The timer is a one shot execution and is taken off the list before it is run.
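Putting the pieces together, a sketch of setting up a one-shot timer under the 2.6-era API described above (my_timer_fn and its empty body are illustrative):

```c
/* Kernel-module fragment (2.6-era timer API); not a standalone program. */
#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list my_timer;

static void my_timer_fn(unsigned long data)
{
    /* Runs in interrupt context, so it must be atomic: no sleeping,
     * no user-space access; 'data' is whatever we stored below. */
}

static void start_one_shot(void)
{
    init_timer(&my_timer);
    my_timer.expires  = jiffies + HZ;   /* fire about one second from now */
    my_timer.function = my_timer_fn;
    my_timer.data     = 0;              /* argument passed to my_timer_fn */
    add_timer(&my_timer);               /* one-shot; re-add from the handler
                                         * to make a periodic timer */
}
```

Note that expires is an absolute jiffies value, so the delay is always expressed relative to the current jiffies reading.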
Tasklets
Tasklets are another kernel facility for deferring execution to a later time. They are similar to kernel timers in that they run at interrupt time, they always run on the same CPU that schedules them, and they receive an unsigned long argument. They differ from kernel timers in that they cannot be scheduled for a particular time: they simply run at some later instant chosen by the system.
Tasklets
A tasklet can be disabled and re-enabled later; it won't execute until it has been enabled as many times as it has been disabled
A tasklet can re-register itself
A tasklet can be scheduled to execute at normal priority or high priority
Tasklets may be run immediately if the system is not under heavy load, but never later than the next timer tick
Tasklet APIs
void tasklet_disable(struct tasklet_struct *t);
void tasklet_enable(struct tasklet_struct *t);
void tasklet_schedule(struct tasklet_struct *t);
void tasklet_hi_schedule(struct tasklet_struct *t);
void tasklet_kill(struct tasklet_struct *t);
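A sketch of declaring and scheduling a tasklet under the 2.6-era API (my_tasklet_fn and the caller are illustrative):

```c
/* Kernel-module fragment (2.6-era API); not a standalone program. */
#include <linux/interrupt.h>

static void my_tasklet_fn(unsigned long data)
{
    /* Short, atomic deferred work; runs in softirq context,
     * 'data' is the argument given in the declaration below. */
}

/* Static declaration: name, handler, and the unsigned long argument. */
static DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 0);

/* Typically called from an interrupt handler (the "top half"). */
static void kick_bottom_half(void)
{
    tasklet_schedule(&my_tasklet);   /* runs soon, on this same CPU */
}
```

Scheduling an already-scheduled tasklet is a no-op, so a burst of interrupts results in a single deferred run; use tasklet_kill before unloading the module.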
Work Queues
Work queues allow kernel code to request that a function be called at some future time. They differ from tasklets in several ways:

Work queue functions run in the context of a special kernel process
These functions can sleep
Kernel code can request that execution of a work queue function be delayed for an explicit interval
The key difference between tasklets and work queues is that tasklets execute quickly, soon after being scheduled, and atomically; none of this holds for work queues.
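A sketch using the shared system workqueue, assuming the early-2.6 API contemporary with the rest of this deck (where work functions take a void * argument; later kernels changed this signature):

```c
/* Kernel-module fragment (early-2.6 workqueue API); not a standalone program. */
#include <linux/workqueue.h>

static void my_work_fn(void *data)
{
    /* Runs in the context of a kernel worker thread, so it MAY sleep,
     * allocate with GFP_KERNEL, wait on semaphores, and so on. */
}

static DECLARE_WORK(my_work, my_work_fn, NULL);

static void defer_to_process_context(void)
{
    schedule_work(&my_work);            /* queue on the shared "events" queue */
    /* or, to run roughly one second later:
     *     schedule_delayed_work(&my_work, HZ);
     */
}
```

Drivers that need stronger guarantees can create their own queue with create_workqueue rather than sharing the system-wide one.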