7

Scheduling and Kernel Synchronization

In this chapter:
7.1 Linux Scheduler
7.2 Preemption
7.3 Spinlocks and Semaphores
7.4 System Clock: Of Time and Timers
Summary
Exercises
7.1 Linux Scheduler
The 2.6 Linux kernel introduces a completely new scheduler that's commonly referred to as the O(1) scheduler. The scheduler can perform the scheduling of a task in constant time. Chapter 3 addressed the basic structure of the scheduler and how a newly created process is initialized for it. This section describes how a task is executed on a single CPU system. There are some mentions of code for scheduling across multiple CPU (SMP) systems but, in general, the same scheduling process applies across CPUs. We then describe how the scheduler switches out the currently running process, performing what is called a context switch, and then we touch on the other significant change in the 2.6 kernel: preemption.

From a high level, the scheduler is simply a grouping of functions that operate on given data structures. Nearly all the code implementing the scheduler can be found in kernel/sched.c and include/linux/sched.h. One important point to mention early on is how the scheduler code uses the terms task and process interchangeably. Occasionally, code comments also use thread to refer to a task or process. A task, or process, in the scheduler is a collection of data structures and flow of control. The scheduler code also refers to a task_struct, which is the data structure the Linux kernel uses to keep track of processes.
7.1.1 Choosing the Next Task
After a process has been initialized and placed on a run queue, at some time it should get access to the CPU to execute. The two functions responsible for passing CPU control between processes are schedule() and scheduler_tick(). scheduler_tick() is called periodically by the system timer and marks processes as needing rescheduling. When a timer event occurs, the current process is put on hold and the Linux kernel itself takes control of the CPU. When the timer event finishes, the Linux kernel normally passes control back to the process that was put on hold. However, when the held process has been marked as needing rescheduling, the kernel calls schedule() to choose which process to activate instead of the process that was executing before the kernel took control. The process that was executing before the kernel took control is called the current process. To make things slightly more complicated, in certain situations the kernel can take control from the kernel itself; this is called kernel preemption. In the following sections, we assume that the scheduler decides which of two user space processes gains CPU control.
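This interplay can be summarized in a short sketch. This is illustrative pseudo-flow, not the kernel's actual interrupt-exit code; need_resched() and schedule() are the real kernel helpers discussed in this chapter:

-----------------------------------------------------------------------
#include <linux/sched.h>

/* illustrative sketch: what conceptually happens after each timer tick */
static void after_timer_tick(void)
{
	/* scheduler_tick() has already run and may have marked the
	 * current task with TIF_NEED_RESCHED */
	if (need_resched())
		schedule();	/* choose another process to run */
	/* otherwise, control returns to the process that was on hold */
}
-----------------------------------------------------------------------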
Figure 7.1 illustrates how the CPU is passed among different processes as time progresses. We see that Process A has control of the CPU and is executing. The system timer fires scheduler_tick(), which takes control of the CPU from A and marks A as needing rescheduling. The Linux kernel calls schedule(), which chooses Process B, and control of the CPU is given to B.
FIGURE 7.1 Scheduling Processes. (The figure shows Process A executing until scheduler_tick() fires and schedule() hands the CPU to Process B; B later yields, schedule() picks Process C; and so on over time.)
Process B executes for a while and then voluntarily yields the CPU. This commonly occurs when a process waits on some resource. B calls schedule(), which
chooses Process C to execute next.
---------------------------------------------------------------------kernel/sched.c
...
2220      * Tasks with interactive credits get charged less run_time
2221      * at high sleep_avg to delay them losing their interactive
2222      * status
2223      */
2224     if (HIGH_CREDIT(prev))
2225         run_time /= (CURRENT_BONUS(prev) ? : 1);
-----------------------------------------------------------------------
Lines 2213–2218

We calculate the length of time for which the process on the scheduler has been active. If the process has been active for longer than the average maximum sleep time (NS_MAX_SLEEP_AVG), we set its run_time to the average maximum sleep time. This is what the Linux kernel code calls a timeslice in other sections of the code. A timeslice refers to both the amount of time between scheduler interrupts and the length of time a process has spent using the CPU. If a process exhausts its timeslice, the process expires and is no longer active. The timestamp is an absolute value that determines for how long a process has used the CPU. The scheduler uses timestamps to decrement the timeslice of processes that have been using the CPU.

For example, suppose Process A has a timeslice of 50 clock cycles. It uses the CPU for 5 clock cycles and then yields the CPU to another process. The kernel uses the timestamp to determine that Process A has 45 cycles left on its timeslice.
Lines 2224–2225

Interactive processes are processes that spend much of their time waiting for input. A good example of an interactive process is the keyboard controller: most of the time the controller is waiting for input, but when it has a task to do, the user expects it to occur at a high priority.

Interactive processes, those that have an interactive credit of more than 100 (the default value), get their effective run_time divided by (sleep_avg / max_sleep_avg * MAX_BONUS(10)).
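The divisor seen on line 2225 comes from the CURRENT_BONUS() macro. For reference, its definition in the 2.6 sources reads approximately as follows (constants such as MAX_BONUS vary slightly between versions):

-----------------------------------------------------------------------
/* kernel/sched.c (2.6.x, approximately): scale the task's sleep_avg
 * into the range 0..MAX_BONUS (10); well-rested tasks get the largest
 * divisor and are therefore charged the least run_time */
#define CURRENT_BONUS(p) \
	(NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / MAX_SLEEP_AVG)
-----------------------------------------------------------------------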
---------------------------------------------------------------------kernel/sched.c
2226
2227     spin_lock_irq(&rq->lock);
2228
2229     /*
2230      * if entering off of a kernel preemption go straight
2231      * to picking the next task.
2232      */
2233     switch_count = &prev->nivcsw;
2234     if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
2235         switch_count = &prev->nvcsw;
2236         if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
2237                 unlikely(signal_pending(prev))))
2238             prev->state = TASK_RUNNING;
2239         else
2240             deactivate_task(prev, rq);
2241     }
-----------------------------------------------------------------------
Line 2227

The function obtains the run queue lock because we're going to modify it.

Lines 2233–2241

If we have entered schedule() while the previous process sits in a non-running state, and it was not kernel preempted, we check for a pending signal. If the previous process is in an interruptible sleep with a signal pending, we leave it running; both conditions are rare, so the code is contained in two unlikely() statements.5 Otherwise, we remove the previous process from the run queue and continue on to choose the next process to run.
---------------------------------------------------------------------kernel/sched.c
2243     cpu = smp_processor_id();
2244     if (unlikely(!rq->nr_running)) {
2245         idle_balance(cpu, rq);
2246         if (!rq->nr_running) {
2247             next = rq->idle;
2248             rq->expired_timestamp = 0;
2249             wake_sleeping_dependent(cpu, rq);
2250             goto switch_tasks;
2251         }
2252     }
2253
2254     array = rq->active;
2255     if (unlikely(!array->nr_active)) {
2256         /*
2257          * Switch the active and expired arrays.
2258          */
2259         rq->active = rq->expired;
2260         rq->expired = array;
2261         array = rq->active;
2262         rq->expired_timestamp = 0;
2263         rq->best_expired_prio = MAX_PRIO;
2264     }
-----------------------------------------------------------------------
5. For more information on the unlikely() routine, see Chapter 2, "Exploration Toolkit."
Lines 2243–2252

If the run queue has no processes on it, we set the next process to the idle process and reset the run queue's expired timestamp to 0. On a multiprocessor system, we first check if any processes are running on other CPUs that this CPU can take; in effect, we load balance idle processes across all CPUs in the system. Only if no processes can be moved from the other CPUs do we set the run queue's next process to idle and reset the expired timestamp.
Lines 2255–2264

If the run queue's active array is empty, we switch the active and expired array pointers before choosing a new process to run.
---------------------------------------------------------------------kernel/sched.c
2266     idx = sched_find_first_bit(array->bitmap);
2267     queue = array->queue + idx;
2268     next = list_entry(queue->next, task_t, run_list);
2269
2270     if (dependent_sleeper(cpu, rq, next)) {
2271         next = rq->idle;
2272         goto switch_tasks;
2273     }
2274
2275     if (!rt_task(next) && next->activated > 0) {
2276         unsigned long long delta = now - next->timestamp;
2277
2278         if (next->activated == 1)
2279             delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
2280
2281         array = next->array;
2282         dequeue_task(next, array);
2283         recalc_task_prio(next, next->timestamp + delta);
2284         enqueue_task(next, array);
2285     }
2286     next->activated = 0;
-----------------------------------------------------------------------
Lines 22662268
The
scheduler
finds
sched_find_first_bit()
Salzberg_C07.qxd
8/19/05
2:38 PM
Page 381
381
priority array at the specified location. next is initialized to the first process in
queue.
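This bitmap search is the heart of the O(1) claim: one bit per priority level records whether that level's queue holds any runnable task, so finding the highest-priority task costs a fixed number of word scans regardless of how many tasks exist. A toy sketch of the idea follows; the toy_* names are ours, not the kernel's:

-----------------------------------------------------------------------
#include <linux/list.h>
#include <linux/bitops.h>

#define TOY_NUM_PRIOS 140	/* priorities 0..139, as in the 2.6 scheduler */

struct toy_prio_array {
	unsigned long bitmap[(TOY_NUM_PRIOS + BITS_PER_LONG - 1) / BITS_PER_LONG];
	struct list_head queue[TOY_NUM_PRIOS];	/* one task list per priority */
};

/* like sched_find_first_bit(): the first set bit is the numerically
 * lowest (that is, highest) priority with a runnable task */
static int toy_pick_highest(struct toy_prio_array *array)
{
	return find_first_bit(array->bitmap, TOY_NUM_PRIOS);
}
-----------------------------------------------------------------------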
Lines 2275–2285

If the process's activated attribute is greater than 0, and the next process is not a real-time task, we remove it from queue, recalculate its priority, and enqueue it again.
Line 2286

We set the process's activated attribute to 0, and then run with it.
---------------------------------------------------------------------kernel/sched.c
2287 switch_tasks:
2288     prefetch(next);
2289     clear_tsk_need_resched(prev);
2290     RCU_qsctr(task_cpu(prev))++;
2291
2292     prev->sleep_avg -= run_time;
2293     if ((long)prev->sleep_avg <= 0) {
2294         prev->sleep_avg = 0;
2295         if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev)))
2296             prev->interactive_credit--;
2297     }
2298     prev->timestamp = now;
2299
2300     if (likely(prev != next)) {
2301         next->timestamp = now;
2302         rq->nr_switches++;
2303         rq->curr = next;
2304         ++*switch_count;
2305
2306         prepare_arch_switch(rq, next);
2307         prev = context_switch(rq, prev, next);
2308         barrier();
2309
2310         finish_task_switch(prev);
2311     } else
2312         spin_unlock_irq(&rq->lock);
2313
2314     reacquire_kernel_lock(current);
2315     preempt_enable_no_resched();
2316     if (test_thread_flag(TIF_NEED_RESCHED))
2317         goto need_resched;
2318 }
-----------------------------------------------------------------------
Line 2288

We attempt to get the memory of the new process's task structure into the CPU's L1 cache. (See include/linux/prefetch.h for more information.)
Line 2290

Because we're going through a context switch, we need to inform the current CPU that we're doing so. This allows a multi-CPU system to ensure that data shared across CPUs is accessed exclusively. This mechanism is called read-copy update (RCU). For more information, see http://lse.sourceforge.net/locking/rcupdate.html.
Lines 2300–2304

If we haven't chosen the same process, we set the new process's timestamp, increment the run queue counters, and set the current process to the new process.
Lines 2314–2317

We reacquire the kernel lock, enable preemption, and see if we need to reschedule immediately; if so, we go back to the top of schedule().

It's possible that after we perform the context_switch(), we need to reschedule. Perhaps scheduler_tick() has marked the new process as needing rescheduling or, when we enable preemption, it gets marked. We keep rescheduling processes (and context switching them) until one is found that doesn't need rescheduling. The process that leaves schedule() becomes the new process executing on this CPU.
7.1.2 Context Switch
7.1.2.1 Following the x86 context_switch()
Here, we describe the two jobs of context_switch(): one to switch the virtual memory mapping and one to switch the task/thread structure. The first job, which the function switch_mm() carries out, uses many of the hardware-dependent memory management structures and registers:
---------------------------------------------------------------------/include/asm-i386/mmu_context.h
026 static inline void switch_mm(struct mm_struct *prev,
027                              struct mm_struct *next,
028                              struct task_struct *tsk)
029 {
030     int cpu = smp_processor_id();
031
032     if (likely(prev != next)) {
033         /* stop flush ipis for the previous mm */
034         cpu_clear(cpu, prev->cpu_vm_mask);
035 #ifdef CONFIG_SMP
036         cpu_tlbstate[cpu].state = TLBSTATE_OK;
037         cpu_tlbstate[cpu].active_mm = next;
038 #endif
039         cpu_set(cpu, next->cpu_vm_mask);
040
041         /* Re-load page tables */
042         load_cr3(next->pgd);
043
044         /*
045          * load the LDT, if the LDT is different:
046          */
047         if (unlikely(prev->context.ldt != next->context.ldt))
048             load_LDT_nolock(&next->context, cpu);
049     }
050 #ifdef CONFIG_SMP
051     else {
-----------------------------------------------------------------------
Line 42

The code for switching the memory context utilizes the x86 hardware register cr3, which holds the base address of all paging operations for a given process. The new page global directory is loaded here from next->pgd.
Line 47
Most processes share the same LDT. If another LDT is required by this process,
it is loaded here from the new next->context structure.
FIGURE 7.2 switch_to() Calls. (The figure steps through successive switch_to() calls, showing how the task marked CURRENT advances from Task A to Task B to Task C, and so on, with each call.)
---------------------------------------------------------------------include/asm-i386/system.h
012 #define switch_to(prev,next,last) do {                               \
013     unsigned long esi,edi;                                           \
...
017     asm volatile("pushfl\n\t"                                        \
018                  "pushl %%ebp\n\t"                                   \
019                  "movl %%esp,%0\n\t"    /* save ESP */               \
020                  "movl %5,%%esp\n\t"    /* restore ESP */            \
021                  "movl $1f,%1\n\t"      /* save EIP */               \
022                  "pushl %6\n\t"         /* restore EIP */            \
023                  "jmp __switch_to\n"                                 \
                     "1:\t"                                              \
024                  "popl %%ebp\n\t"                                    \
025                  "popfl"                                             \
026                  :"=m" (prev->thread.esp),"=m" (prev->thread.eip),   \
027                   "=a" (last),"=S" (esi),"=D" (edi)                  \
028                  :"m" (next->thread.esp),"m" (next->thread.eip),     \
029                   "2" (prev), "d" (next));                           \
030 } while (0)
-----------------------------------------------------------------------
Line 12

The do {} while (0) construct allows (among other things) the macro to have the local variables esi and edi. Remember, these are just local variables with familiar names.
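As a quick illustration of why kernel macros lean on this idiom, consider a multi-statement macro used under an if; the SWAP_INTS example here is our own, not kernel code:

-----------------------------------------------------------------------
/* do { } while (0) makes a multi-statement macro act as one statement */
#define SWAP_INTS(a, b) do { int tmp = (a); (a) = (b); (b) = tmp; } while (0)

static int max_minus_min(int x, int y)
{
	if (x > y)
		SWAP_INTS(x, y);	/* the trailing ';' ends the do/while, */
	else				/* so this else still pairs with the if */
		y = x;
	return y - x;
}
-----------------------------------------------------------------------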
Lines 17 and 30

The construct asm volatile () encloses the inline assembly block, and the volatile keyword assures that the compiler will not change (optimize) the routine in any way.
Lines 17–18

Push the flags and ebp registers onto the stack. (Note: We are still using the stack associated with the prev task.)
Line 19

This line saves the current stack pointer esp to the prev task structure.

Line 20

Move the stack pointer from the next task structure to the current processor esp.

NOTE: By definition, we have just made a context switch. We are now on a new kernel stack and, thus, any reference to current is to the new (next) task structure.
Line 21
Save the return address for prev into its task structure. This is where the prev
task resumes when it is restarted.
Line 22

Push the return address (from when we return from __switch_to()) onto the stack. This is the eip from next. The eip was saved into its task structure (on line 21) when it was stopped or preempted the last time.
Line 23

Jump to the __switch_to() function. Because we jumped rather than called, when __switch_to() executes its ret instruction, the address popped off the stack is the eip pushed on line 22, and execution resumes in the next task at the 1: label.
Lines 24–25

Pop the base pointer and flags registers from the new (next task) kernel stack.
Lines 26–29

These are the output and input parameters to the inline assembly routine. See the "Inline Assembly" section in Chapter 2 for more information on the constraints put on these parameters.
Line 29

By way of assembler magic, prev is returned in eax, which is the third positional parameter. In other words, the input parameter prev is passed out of the switch_to() macro as the output parameter last. Because switch_to() is a macro, it was executed inline with the code that called it in context_switch(). It does not return as functions normally do.

For the sake of clarity, remember that switch_to() passes back prev in the eax register; execution then continues in context_switch(), where the next instruction is return prev (line 1074 of kernel/sched.c). This allows context_switch() to pass back a pointer to the last task running.
7.1.2.2 Following the PPC context_switch()

The PPC code for context_switch() has slightly more work to do for the same results. Unlike the cr3 register in the x86 architecture, the PPC uses hash functions to point to context environments. The following code for switch_mm() touches on these functions, but Chapter 4, "Memory Management," offers a deeper discussion.

Here is the routine for switch_mm(), which, in turn, calls the routine set_context():
---------------------------------------------------------------------/include/asm-ppc/mmu_context.h
155 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
                                 struct task_struct *tsk)
156 {
157     tsk->thread.pgdir = next->pgd;
158     get_mmu_context(next);
159     set_context(next->context, next->pgd);
160 }
-----------------------------------------------------------------------
Line 157

The page global directory (segment register) for the new thread is made to point to the next->pgd pointer.

Line 159

This is the call to the assembly routine set_context(). Below is the code and discussion of this routine. Upon execution of the blr instruction on line 1468, the code returns to the switch_mm() routine.
---------------------------------------------------------------------/arch/ppc/kernel/head.S
1437 _GLOBAL(set_context)
1438     mulli   r3,r3,897       /* multiply context by skew factor */
1439     rlwinm  r3,r3,4,8,27    /* VSID = (context & 0xfffff) << 4 */
1440     addis   r3,r3,0x6000    /* Set Ks, Ku bits */
1441     li      r0,NUM_USER_SEGMENTS
1442     mtctr   r0
...
1457 3:  isync
...
1461     mtsrin  r3,r4
1462     addi    r3,r3,0x111     /* next VSID */
1463     rlwinm  r3,r3,0,8,3     /* clear out any overflow from VSID field */
1464     addis   r4,r4,0x1000    /* address of next segment */
1465     bdnz    3b
1466     sync
1467     isync
1468     blr
------------------------------------------------------------------------
Lines 1437–1440

The context number in r3 is multiplied by a skew factor and shifted to form the VSID, and the Ks and Ku protection bits are set (see the comments on each line). Lines 1441–1442 then load the counter register with the number of user segments to iterate over.

Lines 1461–1465

Each pass through this loop writes the current VSID into a segment register with mtsrin, advances to the next VSID, clears any overflow out of the VSID field, and steps to the address of the next segment, repeating until all user segments have been updated.
Line 205

Still running under the context of the old thread, pass the pointers to the thread structures to the _switch() function.

Line 249

_switch() is the assembly routine called to do the work of switching the two thread structures (see the following section).
464     CLR_TOP32(r0)
465     mtspr   SPRG3,r0        /* Update current THREAD phys addr */
466     lwz     r1,KSP(r4)      /* Load new stack pointer */
467     /* save the old current 'last' for return value */
468     mr      r3,r2
469     addi    r2,r4,-THREAD   /* Update current */
...
478     lwz     r0,_CCR(r1)
479     mtcrf   0xFF,r0
480     REST_NVGPRS(r1)
481
482     lwz     r4,_NIP(r1)     /* Return to _switch caller in new task */
483     mtlr    r4
484     addi    r1,r1,INT_FRAME_SIZE
485     blr
-----------------------------------------------------------------------
The environment is saved to the current stack with respect to the current stack pointer, r1.

Line 461

The entire environment is then saved into the current thread_struct pointer passed in by way of r3.

Lines 463–465

The physical address of the new thread structure is formed in r0 and written to the special-purpose register SPRG3, which the kernel uses to track the current thread (note the comment on line 465).
Line 466

KSP is the offset into the task structure (r4) of the new task's kernel stack pointer. The stack pointer r1 is now updated with this value. (This is the point of the PPC context switch.)

Line 468

The current pointer to the previous task is returned from _switch() in r3. This represents the last task.

Line 469

The current pointer (r2) is updated with the pointer to the new task structure (r4).
Lines 478–486

Restore the rest of the environment from the new stack and return to the caller with the previous task structure in r3.

This concludes the explanation of context_switch(). At this point, the processor has swapped the two processes prev and next as called by context_switch() in schedule():
---------------------------------------------------------------------kernel/sched.c
1709     prev = context_switch(rq, prev, next);
-----------------------------------------------------------------------
prev now points to the process that we have just switched away from and next points to the current process.

Now that we've discussed how tasks are scheduled in the Linux kernel, we can examine how tasks are told to be scheduled. Namely, what causes schedule() to be called and one process to yield the CPU to another?
7.1.3 Yielding the CPU
Processes can voluntarily yield the CPU by simply calling schedule(). This is most commonly used in kernel code and device drivers that want to sleep or wait for a signal to occur.7 Other tasks want to continually use the CPU, and the system timer must tell them to yield. The Linux kernel periodically seizes the CPU, in so doing stopping the active process, and then does a number of timer-based tasks. One of these tasks, scheduler_tick(), is how the kernel forces a process to yield. If a process has been running for too long, the kernel does not return control to that process and instead chooses another one. We now examine how scheduler_tick() determines whether the current process must yield the CPU:
---------------------------------------------------------------------kernel/sched.c
1981 void scheduler_tick(int user_ticks, int sys_ticks)
1982 {
1983     int cpu = smp_processor_id();
1984     struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
1985     runqueue_t *rq = this_rq();
1986     task_t *p = current;
1987
1988     rq->timestamp_last_tick = sched_clock();
1989
1990     if (rcu_pending(cpu))
1991         rcu_check_callbacks(cpu, user_ticks);
-----------------------------------------------------------------------
7. Linux convention specifies that you should never call schedule() while holding a spinlock because this introduces the possibility of system deadlock. This is good advice!
Lines 1981–1986

This code block initializes the data structures that the scheduler_tick() function needs. cpu, cpustat, and rq are set to the processor ID, the CPU statistics structure, and the run queue of the current processor. p is a pointer to the current process executing on cpu.

Line 1988

The run queue's last tick is set to the current time in nanoseconds.

Lines 1990–1991

If the read-copy update subsystem has work pending for this CPU, its callbacks are run by way of rcu_check_callbacks().
Lines 1994–2000

cpustat keeps track of kernel statistics, and we update the hardware and software interrupt statistics by the number of system ticks that have occurred.

Lines 2002–2011

More CPU statistics are gathered in this code block. If the current process was niced, we increment the CPU's nice counter; otherwise, the user tick counter is incremented. Finally, we increment the CPU's system tick counter.
---------------------------------------------------------------------kernel/sched.c
2019     if (p->array != rq->active) {
2020         set_tsk_need_resched(p);
2021         goto out;
2022     }
2023     spin_lock(&rq->lock);
-----------------------------------------------------------------------
Lines 2019–2022

Here, we see why we store a pointer to a priority array within the task_struct of the process. The scheduler checks the current process to see if it is no longer active. If the process has expired, the scheduler sets the process's rescheduling flag and jumps to the end of the scheduler_tick() function. At that point (lines 2092–2093), the scheduler attempts to load balance the CPU because there is no active task yet. This case occurs when the scheduler grabbed CPU control before the current process was able to schedule itself or clean up from a successful run.

Line 2023

At this point, we know that the current process was running and not expired or nonexistent. The scheduler now wants to yield CPU control to another process; the first thing it must do is take the run queue lock.
---------------------------------------------------------------------kernel/sched.c
2024     /*
2025      * The task was running during this tick - update the
2026      * time slice counter. Note: we do not update a thread's
2027      * priority until it either goes to sleep or uses up its
2028      * timeslice. This makes it possible for interactive tasks
2029      * to use up their timeslices at their highest priority levels.
2030      */
2031     if (unlikely(rt_task(p))) {
2032         /*
2033          * RR tasks need a special form of timeslice management.
2034          * FIFO tasks have no timeslices.
2035          */
2036         if ((p->policy == SCHED_RR) && !--p->time_slice) {
2037             p->time_slice = task_timeslice(p);
2038             p->first_time_slice = 0;
2039             set_tsk_need_resched(p);
2040
2041             /* put it at the end of the queue: */
2042             dequeue_task(p, rq->active);
2043             enqueue_task(p, rq->active);
2044         }
2045         goto out_unlock;
2046     }
-----------------------------------------------------------------------
Lines 2031–2046

The easiest case for the scheduler occurs when the current process is a real-time task. Real-time tasks always have a higher priority than any other tasks. If the task is a FIFO task and was running, it should continue its operation, so we jump to the end of the function and release the run queue lock. If the current process is a round-robin real-time task, we decrement its timeslice. If the task has no more timeslice, it's time to schedule another round-robin real-time task. The current task has its new timeslice calculated by task_timeslice(). Then the task has its first timeslice reset. The task is then marked as needing rescheduling and, finally, the task is put at the end of the round-robin real-time task list by removing it from the run queue's active array and adding it back in. The scheduler then jumps to the end of the function and releases the run queue lock.
---------------------------------------------------------------------kernel/sched.c
2047     if (!--p->time_slice) {
2048         dequeue_task(p, rq->active);
2049         set_tsk_need_resched(p);
2050         p->prio = effective_prio(p);
2051         p->time_slice = task_timeslice(p);
2052         p->first_time_slice = 0;
2053
2054         if (!rq->expired_timestamp)
2055             rq->expired_timestamp = jiffies;
2056         if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
2057             enqueue_task(p, rq->expired);
2058             if (p->static_prio < rq->best_expired_prio)
2059                 rq->best_expired_prio = p->static_prio;
2060         } else
2061             enqueue_task(p, rq->active);
2062     } else {
-----------------------------------------------------------------------
Lines 2047–2061

At this point, the scheduler knows that the current process is not a real-time process. It decrements the process's timeslice and, in this section, the process's timeslice has been exhausted and reached 0. The scheduler removes the task from the active array and sets the process's rescheduling flag. The priority of the task is recalculated and its timeslice is reset. Both of these operations take into account prior process activity. If the run queue's expired timestamp is 0, which usually occurs when there are no more processes on the run queue's active array, we set it to jiffies.

Jiffies

jiffies is a 32-bit variable counting the number of ticks since the system was booted. This is approximately 497 days before the number wraps around to 0 on a 100HZ system. The macro on line 20 is the suggested method of accessing this value as a u64. There are also macros to help detect wrapping in include/linux/jiffies.h.
----------------------------------------------------------------------include/linux/jiffies.h
017 extern unsigned long volatile jiffies;
020 u64 get_jiffies_64(void);
-----------------------------------------------------------------------
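As a sketch of those wrap-detection helpers in use (the function here is illustrative; time_after() is the real macro from include/linux/jiffies.h):

-----------------------------------------------------------------------
#include <linux/jiffies.h>

static int five_seconds_elapsed(unsigned long start)
{
	unsigned long timeout = start + 5 * HZ;	/* HZ ticks per second */

	/* time_after() compares via signed subtraction, so the test
	 * remains correct even after jiffies wraps around to 0 */
	return time_after(jiffies, timeout);
}
-----------------------------------------------------------------------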
We normally favor interactive tasks by replacing them on the active priority array of the run queue; this is the else clause on line 2060. However, we don't want to starve expired tasks. To determine if expired tasks have been waiting too long for CPU time, we use EXPIRED_STARVING() (see EXPIRED_STARVING on line 1968).
The function returns true if the first expired task has been waiting an unreasonable amount of time, or if the expired array contains a task that has a greater priority than the current process. The threshold for an unreasonable wait is load-dependent: the more tasks that are running, the longer expired tasks may wait before the active and expired arrays are swapped.

If the task is not interactive or expired tasks are starving, the scheduler takes the current process and enqueues it onto the run queue's expired priority array. If the current process's static priority is higher than the expired run queue's highest-priority task, we update the run queue to reflect the fact that the expired array now has a higher priority than before. (Remember that high-priority tasks have low numbers in Linux; thus the (<) in the code.)
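For reference, EXPIRED_STARVING() in the 2.6 sources reads approximately as follows; note how the allowed wait scales with nr_running, and how the second clause triggers when the expired array holds a better static priority than the current task:

-----------------------------------------------------------------------
/* kernel/sched.c (2.6.x, approximately) */
#define EXPIRED_STARVING(rq) \
	(((rq)->expired_timestamp && \
	  jiffies - (rq)->expired_timestamp >= \
		STARVATION_LIMIT * (rq)->nr_running + 1) || \
	 ((rq)->curr->static_prio > (rq)->best_expired_prio))
-----------------------------------------------------------------------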
---------------------------------------------------------------------kernel/sched.c
2062     } else {
2063         /*
2064          * Prevent a too long timeslice allowing a task to monopolize
2065          * the CPU. We do this by splitting up the timeslice into
2066          * smaller pieces.
2067          *
2068          * Note: this does not mean the task's timeslices expire or
2069          * get lost in any way, they just might be preempted by
2070          * another task of equal priority. (one with higher
2071          * priority would have preempted this task already.) We
2072          * requeue this task to the end of the list on this priority
2073          * level, which is in essence a round-robin of tasks with
2074          * equal priority.
2075          *
2076          * This only applies to tasks in the interactive
2077          * delta range with at least TIMESLICE_GRANULARITY to requeue.
2078          */
2079         if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
2080             p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
2081             (p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
2082             (p->array == rq->active)) {
2083
2084             dequeue_task(p, rq->active);
2085             set_tsk_need_resched(p);
2086             p->prio = effective_prio(p);
2087             enqueue_task(p, rq->active);
2088         }
2089     }
2090 out_unlock:
2091     spin_unlock(&rq->lock);
2092 out:
2093     rebalance_tick(cpu, rq, NOT_IDLE);
2094 }
-----------------------------------------------------------------------
Lines 2079–2089

The final case before the scheduler is that the current process was running and still has timeslice left to run. The scheduler needs to ensure that a process with a large timeslice doesn't hog the CPU. If the task is interactive, has at least TIMESLICE_GRANULARITY left in its timeslice, and was on the active array, the scheduler removes it from the active queue. The task then has its reschedule flag set, its priority recalculated, and is placed back on the run queue's active array. This ensures that a process at a certain priority with a large timeslice doesn't starve another process of an equal priority.
Lines 2090–2094

The scheduler has finished rearranging the run queue and unlocks it; if executing on an SMP system, it attempts to load balance.

Combining how processes are marked to be rescheduled, via scheduler_tick(), with how processes are scheduled, via schedule(), illustrates how the scheduler operates in the 2.6 Linux kernel. We now delve into the details of what the scheduler means by priority.
7.1.3.1 Dynamic Priority Calculation

In previous sections, we glossed over the specifics of how a task's dynamic priority is calculated. The priority of a task is based on its prior behavior, as well as its user-specified nice value. The function that determines a task's new dynamic priority is recalc_task_prio():
---------------------------------------------------------------------kernel/sched.c
381 static void recalc_task_prio(task_t *p, unsigned long long now)
382 {
383     unsigned long long __sleep_time = now - p->timestamp;
384     unsigned long sleep_time;
385
386     if (__sleep_time > NS_MAX_SLEEP_AVG)
387         sleep_time = NS_MAX_SLEEP_AVG;
388     else
389         sleep_time = (unsigned long)__sleep_time;
390
391     if (likely(sleep_time > 0)) {
392         /*
393          * User tasks that sleep a long time are categorised as
394          * idle and will get just interactive status to stay active &
395          * prevent them suddenly becoming cpu hogs and starving
396          * other processes.
397          */
...
449         }
450     }
451
452     p->prio = effective_prio(p);
453 }
-----------------------------------------------------------------------
Lines 386–389

Based on the time now, we calculate the length of time the process p has slept for and assign it to sleep_time, with a maximum value of NS_MAX_SLEEP_AVG. (NS_MAX_SLEEP_AVG defaults to 10 milliseconds.)

Lines 391–404

If process p has slept, we first check to see if it has slept enough to be classified as an interactive task. If it has (when sleep_time > INTERACTIVE_SLEEP(p)), we adjust the process's sleep average to a set value and, if p isn't classified as interactive yet, we increment p's interactive_credit.

Lines 405–410

If the task is CPU intensive, and thus classified as non-interactive, we restrict the process to having, at most, one more timeslice worth of a sleep average bonus.

Lines 419–432

Tasks that are not yet classified as interactive (not HIGH_CREDIT) and that awake from uninterruptible sleep are restricted to having a sleep average of INTERACTIVE().

Lines 434–450

We add our newly calculated sleep_time to the process's sleep average, ensuring it doesn't go over NS_MAX_SLEEP_AVG. If the process is not considered interactive but has slept for the maximum time or longer, we increment its interactive credit.
Line 452

Finally, the priority is set using effective_prio(), which takes into account the newly calculated sleep_avg field of p. It does this by scaling the sleep average of 0 .. MAX_SLEEP_AVG into the range of -5 to +5. Thus, a process that has a static priority of 70 can have a dynamic priority between 65 and 75, depending on its prior behavior.
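effective_prio() itself is short; in the 2.6 sources it reads approximately as follows:

-----------------------------------------------------------------------
static int effective_prio(task_t *p)
{
	int bonus, prio;

	if (rt_task(p))
		return p->prio;	/* real-time priorities are left alone */

	/* CURRENT_BONUS() yields 0..MAX_BONUS; centering it on
	 * MAX_BONUS/2 produces the -5..+5 swing described above */
	bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;

	prio = p->static_prio - bonus;
	if (prio < MAX_RT_PRIO)
		prio = MAX_RT_PRIO;	/* never cross into the real-time range */
	if (prio > MAX_PRIO - 1)
		prio = MAX_PRIO - 1;
	return prio;
}
-----------------------------------------------------------------------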
One final thing: A process that is not a real-time process has a range between 101 and 140. Processes that are operating at a very high priority, 105 or less, cannot cross the real-time boundary. Thus, a high-priority, highly interactive process could never have a dynamic priority lower than 101. (Real-time processes cover 0..100 in the default configuration.)
7.1.3.2 Deactivation

We already discussed how a task gets inserted into the scheduler by forking and how tasks move from the active to expired priority arrays within the CPU's run queue. But how does a task ever get removed from a run queue?

A task can be removed from the run queue in two major ways:

- The task is preempted by the kernel, its state is not running, and there is no signal pending for the task (see line 2240 in kernel/sched.c).
- On SMP machines, the task can be removed from a run queue and placed on another run queue (see line 3384 in kernel/sched.c).
The first case normally occurs when schedule() gets called after a process
puts itself to sleep on a wait queue. The task marks itself as non-running
(TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, TASK_STOPPED, and so on) and
the kernel no longer considers it for CPU access by removing it from the run queue.
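A sketch of the usual sequence that triggers this first case appears below; the my_condition flag is purely illustrative:

-----------------------------------------------------------------------
#include <linux/sched.h>

static void wait_until(volatile int *my_condition)
{
	while (!*my_condition) {
		set_current_state(TASK_INTERRUPTIBLE);	/* mark ourselves non-running */
		if (*my_condition)	/* re-check to avoid a missed wakeup */
			break;
		schedule();	/* deactivate_task() pulls us off the run queue */
	}
	set_current_state(TASK_RUNNING);
}
-----------------------------------------------------------------------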
The case in which the process is moved to another run queue is dealt with in the
SMP section of the Linux kernel, which we do not explore here.
We now trace how a process is removed from the run queue via
deactivate_task():
---------------------------------------------------------------------kernel/sched.c
507 static void deactivate_task(struct task_struct *p, runqueue_t *rq)
508 {
509     rq->nr_running--;
510     if (p->state == TASK_UNINTERRUPTIBLE)
511         rq->nr_uninterruptible++;
512     dequeue_task(p, p->array);
513     p->array = NULL;
514 }
-----------------------------------------------------------------------
Lines 509–513

Our run queue statistics are updated, and then we actually remove the process from the run queue. The kernel uses the p->array field to test if a process is running and on a run queue; because it no longer is either, we set it to NULL.

There is still some run queue management to be done; let's examine the specifics of dequeue_task():
---------------------------------------------------------------------kernel/sched.c
303 static void dequeue_task(struct task_struct *p, prio_array_t *array)
304 {
305     array->nr_active--;
306     list_del(&p->run_list);
307     if (list_empty(array->queue + p->prio))
308         __clear_bit(p->prio, array->bitmap);
309 }
-----------------------------------------------------------------------
Line 305

We adjust the number of active tasks on the priority array that process p is on: either the expired or the active array.

Lines 306–308

We remove the process from the list of processes in the priority array at p's priority, and list_del() does the actual unlinking of p's list_head structure from that list. If the resulting list is empty, we clear the bit in the priority array's bitmap to show that there are no longer any processes at priority p->prio.

We have reached the point where the process is removed from the run queue and has thus been completely deactivated. If this process had a state of TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, it could be awoken and placed back on a run queue. If the process had a state of TASK_STOPPED, TASK_ZOMBIE, or TASK_DEAD, it has all of its structures removed and discarded.
7.2 Preemption

7.2.1 Explicit Kernel Preemption

7.2.2 Implicit User Preemption
When the kernel has finished processing a kernel space task and is ready to pass
control to a user space task, it first checks to see which user space task it should pass
control to. This might not be the user space task that passed its control to the kernel. For example, if Task A invokes a system call, after the system call completes, the
kernel could pass control of the system to Task B.
Each task on the system has a rescheduling necessary flag that is set whenever
a task should be rescheduled:
---------------------------------------------------------------------include/linux/sched.h
988 static inline void set_tsk_need_resched(struct task_struct *tsk)
989 {
990     set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
991 }
992
993 static inline void clear_tsk_need_resched(struct task_struct *tsk)
994 {
995     clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
996 }
...
1003 static inline int need_resched(void)
1004 {
1005     return unlikely(test_thread_flag(TIF_NEED_RESCHED));
1006 }
-----------------------------------------------------------------------
Lines 988–996

set_tsk_need_resched() and clear_tsk_need_resched() set and clear the TIF_NEED_RESCHED flag in the thread information of the given task.

Lines 1003–1006

need_resched() tests the current thread's flag to see if TIF_NEED_RESCHED is set.
7.2.3 Implicit Kernel Preemption
Lines 46–50

preempt_enable() calls preempt_enable_no_resched(), which decrements the preempt_count on the current task by one, and then calls preempt_check_resched():
---------------------------------------------------------------------include/linux/preempt.h
40 #define preempt_check_resched() \
41 do { \
42     if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
43         preempt_schedule(); \
44 } while (0)
-----------------------------------------------------------------------
Lines 40–44

preempt_check_resched() sees if the current task has been marked as needing rescheduling; if so, it calls preempt_schedule().
---------------------------------------------------------------------kernel/sched.c
2328 asmlinkage void __sched preempt_schedule(void)
2329 {
2330     struct thread_info *ti = current_thread_info();
2331
2332     /*
2333      * If there is a non-zero preempt_count or interrupts are disabled,
2334      * we do not want to preempt the current task. Just return..
2335      */
2336     if (unlikely(ti->preempt_count || irqs_disabled()))
2337         return;
2338
2339 need_resched:
2340     ti->preempt_count = PREEMPT_ACTIVE;
2341     schedule();
2342     ti->preempt_count = 0;
2343
2344     /* we could miss a preemption opportunity between schedule and now */
2345     barrier();
2346     if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
2347         goto need_resched;
2348 }
-----------------------------------------------------------------------
Lines 2336–2337

If the current task still has a positive preempt_count, likely from nesting preempt_disable() commands, or if the current task has interrupts disabled, we return control of the processor to the current task.

Lines 2340–2347

The current task has no locks because preempt_count is 0, and IRQs are enabled. Thus, we set the current task's preempt_count to PREEMPT_ACTIVE to note that it's undergoing preemption, and we call schedule(), which chooses another task.
If the task emerging from the code block needs rescheduling, the kernel needs to ensure it's safe to yield the processor from the current task. The kernel checks the task's value of preempt_count. If preempt_count is 0, and thus the current task holds no locks, schedule() is called and a new task is chosen for execution. If preempt_count is non-zero, it is unsafe to pass control to another task, and control is returned to the current task until it releases all of its locks. When the current task releases locks, a test is made to see if the current task needs rescheduling.
When the current task releases its final lock and preempt_count goes to 0, scheduling immediately occurs.
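A sketch of how the nesting plays out (the function is illustrative; the counts in the comments assume preempt_count starts at 0):

-----------------------------------------------------------------------
#include <linux/preempt.h>

static void nested_example(void)
{
	preempt_disable();	/* preempt_count: 0 -> 1 */
	preempt_disable();	/* nested:        1 -> 2 */
	/* ... code that must not be preempted ... */
	preempt_enable();	/* 2 -> 1: preempt_schedule() bails out because
				 * preempt_count is still non-zero */
	preempt_enable();	/* 1 -> 0: a pending reschedule can now occur */
}
-----------------------------------------------------------------------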
7.3 Spinlocks and Semaphores
When two or more processes require dedicated access to a shared resource, they might need to enforce the condition that they are the sole process to operate in a given section of code. The basic form of locking in the Linux kernel is the spinlock.

Spinlocks take their name from the fact that they continuously loop, or spin, waiting to acquire a lock. Because spinlocks operate in this manner, it is imperative not to have any section of code inside a spinlock attempt to acquire the same lock twice. This results in deadlock.

Before operating on a spinlock, the spinlock_t structure must be initialized. This is done by calling spin_lock_init():
---------------------------------------------------------------------include/linux/spinlock.h
63 #define spin_lock_init(x) \
64     do { \
65         (x)->magic = SPINLOCK_MAGIC; \
66         (x)->lock = 0; \
67         (x)->babble = 5; \
68         (x)->module = __FILE__; \
69         (x)->owner = NULL; \
70         (x)->oline = 0; \
71     } while (0)
-----------------------------------------------------------------------
This section of code sets the spinlock to unlocked, or 0, on line 66 and initializes the other variables in the structure. The (x)->lock variable is the one we're concerned with here.

After a spinlock is initialized, it can be acquired by calling spin_lock() or spin_lock_irqsave(). The spin_lock_irqsave() function disables interrupts before locking, whereas spin_lock() does not. If you use spin_lock(), the process could be interrupted in the locked section of code.

To release a spinlock after executing the critical section of code, you need to call spin_unlock() or spin_unlock_irqrestore(). spin_unlock_irqrestore() restores the state of the interrupt registers to the state they were in when spin_lock_irqsave() was called.
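Putting the pieces together, a typical usage sketch looks like this (the lock and function names are illustrative, not from the kernel sources):

-----------------------------------------------------------------------
#include <linux/spinlock.h>

static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

static void update_shared_data(void)
{
	unsigned long flags;

	spin_lock_irqsave(&my_lock, flags);	/* lock held, interrupts off */
	/* ... short critical section: no sleeping, and never attempt to
	 *     acquire my_lock a second time from here ... */
	spin_unlock_irqrestore(&my_lock, flags);	/* unlock, restore IRQ state */
}
-----------------------------------------------------------------------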
In the kernel's implementation of these locking routines, notice how preemption is disabled while the lock is held. This ensures that any operation in the critical section is not interrupted. The IRQ flags saved on line 260 are restored on line 324.
The drawback of spinlocks is that they busily loop, waiting for the lock to be freed. They are best used for critical sections of code that are fast to complete. For code sections that take time, it is better to use another Linux kernel locking utility: the semaphore.

Semaphores differ from spinlocks in that the task sleeps, rather than busy waits, when it attempts to obtain a contested resource. One of the main advantages is that a process holding a semaphore is safe to block; semaphores are SMP and interrupt safe:
---------------------------------------------------------------------include/asm-i386/semaphore.h
44 struct semaphore {
45     atomic_t count;
46     int sleepers;
47     wait_queue_head_t wait;
48 #ifdef WAITQUEUE_DEBUG
49     long __magic;
50 #endif
51 };
-----------------------------------------------------------------------
---------------------------------------------------------------------include/asm-ppc/semaphore.h
24 struct semaphore {
25     /*
26      * Note that any negative value of count is equivalent to 0,
27      * but additionally indicates that some process(es) might be
28      * sleeping on 'wait'.
29      */
30     atomic_t count;
31     wait_queue_head_t wait;
32 #ifdef WAITQUEUE_DEBUG
33     long __magic;
34 #endif
35 };
-----------------------------------------------------------------------
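A typical usage sketch for 2.6-era code follows; the names here are illustrative. DECLARE_MUTEX() initializes a semaphore with a count of 1, so it acts as a sleeping mutual-exclusion lock:

-----------------------------------------------------------------------
#include <linux/errno.h>
#include <asm/semaphore.h>

static DECLARE_MUTEX(my_sem);	/* semaphore initialized with count = 1 */

static int do_slow_work(void)
{
	if (down_interruptible(&my_sem))	/* sleeps if contested; returns */
		return -ERESTARTSYS;		/* non-zero if woken by a signal */
	/* ... long-running section; blocking or sleeping here is safe ... */
	up(&my_sem);
	return 0;
}
-----------------------------------------------------------------------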
7.4 System Clock: Of Time and Timers

For scheduling, the kernel uses the system clock to know how long a task has been running. We already covered the system clock in Chapter 5 by using it as an example for the discussion on interrupts. Here, we explore the Real-Time Clock and its uses and implementation; but first, let's recap clocks in general.
7.4.1 Real-Time Clock

The Linux interface to wall clock time is accomplished through the /dev/rtc device driver's ioctl() function. The device for this driver is called a Real-Time Clock (RTC). The RTC9 provides timekeeping functions with a small 114-byte user NVRAM. The input to this device is a 32.768 kHz oscillator and a connection for battery backup. Some discrete models of the RTC have the oscillator and battery built in, while other RTCs are now built in to the peripheral bus controller (for example, the Southbridge) of a processor chipset. The RTC not only reports the time of day, but it is also a programmable timer that is capable of interrupting the system. The frequency of interrupts varies from 2Hz to 8,192Hz. The RTC can also interrupt daily, like an alarm clock. Here, we explore the RTC code:
---------------------------------------------------------------------/include/linux/rtc.h
/*
 * ioctl calls that are permitted to the /dev/rtc interface, if
 * any of the RTC drivers are enabled.
 */
70 #define RTC_AIE_ON     _IO('p', 0x01)   /* Alarm int. enable on    */
71 #define RTC_AIE_OFF    _IO('p', 0x02)   /* ... off                 */
72 #define RTC_UIE_ON     _IO('p', 0x03)   /* Update int. enable on   */
73 #define RTC_UIE_OFF    _IO('p', 0x04)   /* ... off                 */
74 #define RTC_PIE_ON     _IO('p', 0x05)   /* Periodic int. enable on */
75 #define RTC_PIE_OFF    _IO('p', 0x06)   /* ... off                 */
76 #define RTC_WIE_ON     _IO('p', 0x0f)   /* Watchdog int. enable on */
77 #define RTC_WIE_OFF    _IO('p', 0x10)   /* ... off                 */
78 #define RTC_ALM_SET    _IOW('p', 0x07, struct rtc_time)  /* Set alarm time  */
79 #define RTC_ALM_READ   _IOR('p', 0x08, struct rtc_time)  /* Read alarm time */
80 #define RTC_RD_TIME    _IOR('p', 0x09, struct rtc_time)  /* Read RTC time   */
81 #define RTC_SET_TIME   _IOW('p', 0x0a, struct rtc_time)  /* Set RTC time    */
82 #define RTC_IRQP_READ  _IOR('p', 0x0b, unsigned long)    /* Read IRQ rate   */
83 #define RTC_IRQP_SET   _IOW('p', 0x0c, unsigned long)    /* Set IRQ rate    */
84 #define RTC_EPOCH_READ _IOR('p', 0x0d, unsigned long)    /* Read epoch      */
85 #define RTC_EPOCH_SET  _IOW('p', 0x0e, unsigned long)    /* Set epoch       */
86
87 #define RTC_WKALM_SET  _IOW('p', 0x0f, struct rtc_wkalrm) /* Set wakeup alarm */
88 #define RTC_WKALM_RD   _IOR('p', 0x10, struct rtc_wkalrm) /* Get wakeup alarm */
89
90 #define RTC_PLL_GET    _IOR('p', 0x11, struct rtc_pll_info) /* Get PLL correction */
91 #define RTC_PLL_SET    _IOW('p', 0x12, struct rtc_pll_info) /* Set PLL correction */
-----------------------------------------------------------------------
9. Manufactured by several vendors, most notably Motorola, with the mc146818. (This RTC is no longer in production. The Dallas DS12885 or equivalent is used instead.)
A minimal user space test program along these lines drives the driver; the RTC_RD_TIME ioctl call is filled in here, with error handling omitted for brevity:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/rtc.h>

int main(void)
{
	int fd, retval = 0;
	struct rtc_time rtc_tm;

	fd = open("/dev/rtc", O_RDONLY);
	retval = ioctl(fd, RTC_RD_TIME, &rtc_tm);	/* read the time/date from the RTC */
	close(fd);
	return retval;
}
7.4.2 Reading the PPC Real-Time Clock

At kernel compile time, the appropriate code tree (x86, PPC, MIPS, and so on) is inserted. The source branch for PPC is discussed here, in the source code file for the generic RTC driver for non-x86 systems:
---------------------------------------------------------------------/drivers/char/genrtc.c
276 static int gen_rtc_ioctl(struct inode *inode, struct file *file,
277                          unsigned int cmd, unsigned long arg)
278 {
279     struct rtc_time wtime;
280     struct rtc_pll_info pll;
281
282     switch (cmd) {
283
284     case RTC_PLL_GET:
...
290     case RTC_PLL_SET:
...
298     case RTC_UIE_OFF:   /* disable ints from RTC updates. */
...
302     case RTC_UIE_ON:    /* enable ints for RTC updates. */
...
305     case RTC_RD_TIME:   /* Read the time/date from RTC */
306
307         memset(&wtime, 0, sizeof(wtime));
308         get_rtc_time(&wtime);
309
310         return copy_to_user((void *)arg,&wtime,sizeof(wtime)) ? -EFAULT:0;
311
312     case RTC_SET_TIME:  /* Set the RTC */
313         return -EINVAL;
314     }
...
353 static int gen_rtc_open(struct inode *inode, struct file *file)
354 {
355     if (gen_rtc_status & RTC_IS_OPEN)
356         return -EBUSY;
357     gen_rtc_status |= RTC_IS_OPEN;
------------------------------------------------------------------------
This code is the case statement for the ioctl command set. Because we made the ioctl call from the user space test program with the RTC_RD_TIME flag, control is transferred to line 305. The next call is at line 308, get_rtc_time(&wtime), in rtc.h (see the following code). Before leaving this code segment, note line 353: it allows only one user at a time to access the driver, via open(), by setting the status to RTC_IS_OPEN:
---------------------------------------------------------------------include/asm-ppc/rtc.h
045 static inline unsigned int get_rtc_time(struct rtc_time *time)
046 {
047     if (ppc_md.get_rtc_time) {
048         unsigned long nowtime;
049
050         nowtime = (ppc_md.get_rtc_time)();
051
052         to_tm(nowtime, time);
053
054         time->tm_year -= 1900;
055         time->tm_mon -= 1; /* Make sure userland has a 0-based month */
056     }
057     return RTC_24H;
058 }
------------------------------------------------------------------------
The inline function get_rtc_time() calls the function pointed at by the structure variable ppc_md.get_rtc_time on line 50. Early in the kernel initialization, this variable is set in chrp_setup.c:
---------------------------------------------------------------------arch/ppc/platforms/chrp_setup.c
447 chrp_init(unsigned long r3, unsigned long r4, unsigned long r5,
448           unsigned long r6, unsigned long r7)
449 {
...
477     ppc_md.time_init = chrp_time_init;
478     ppc_md.set_rtc_time = chrp_set_rtc_time;
479     ppc_md.get_rtc_time = chrp_get_rtc_time;
480     ppc_md.calibrate_decr = chrp_calibrate_decr;
------------------------------------------------------------------------
7.4.3 Reading the x86 Real-Time Clock

The methodology for reading the RTC on the x86 system is similar to, but somewhat more compact and robust than, the PPC method. Once again, we follow the open driver /dev/rtc, but this time, the build has compiled the file rtc.c for the x86 architecture. The source branch for x86 is discussed here:
---------------------------------------------------------------------drivers/char/rtc.c
...
352 static int rtc_do_ioctl(unsigned int cmd, unsigned long arg, int kernel)
353 {
...
        switch (cmd) {
...
482     case RTC_RD_TIME:   /* Read the time/date from RTC */
483     {
484         rtc_get_rtc_time(&wtime);
485         break;
486     }
...
1208 void rtc_get_rtc_time(struct rtc_time *rtc_tm)
1209 {
...
1238     spin_lock_irq(&rtc_lock);
1239     rtc_tm->tm_sec = CMOS_READ(RTC_SECONDS);
1240     rtc_tm->tm_min = CMOS_READ(RTC_MINUTES);
1241     rtc_tm->tm_hour = CMOS_READ(RTC_HOURS);
1242     rtc_tm->tm_mday = CMOS_READ(RTC_DAY_OF_MONTH);
1243     rtc_tm->tm_mon = CMOS_READ(RTC_MONTH);
1244     rtc_tm->tm_year = CMOS_READ(RTC_YEAR);
1245     ctrl = CMOS_READ(RTC_CONTROL);
...
1249     spin_unlock_irq(&rtc_lock);
1250
1251     if (!(ctrl & RTC_DM_BINARY) || RTC_ALWAYS_BCD)
1252     {
1253         BCD_TO_BIN(rtc_tm->tm_sec);
1254         BCD_TO_BIN(rtc_tm->tm_min);
1255         BCD_TO_BIN(rtc_tm->tm_hour);
1256         BCD_TO_BIN(rtc_tm->tm_mday);
1257         BCD_TO_BIN(rtc_tm->tm_mon);
1258         BCD_TO_BIN(rtc_tm->tm_year);
1259     }
------------------------------------------------------------------------
The test program uses the ioctl() flag RTC_RD_TIME in its call to the driver. The switch statement then fills the time structure from the CMOS memory of the RTC. Here is the x86 implementation of how the RTC hardware is read:
---------------------------------------------------------------------include/asm-i386/mc146818rtc.h
...
018 #define CMOS_READ(addr) ({ \
019     outb_p((addr),RTC_PORT(0)); \
020     inb_p(RTC_PORT(1)); \
021 })
-----------------------------------------------------------------------
Summary

This chapter covered the Linux scheduler, preemption in Linux, and the Linux system clock and timers. More specifically, we covered the following topics:

- We introduced the new Linux 2.6 scheduler and outlined its new features.
- We described how the scheduler chooses the next task from among all tasks it can choose and the algorithms the scheduler uses to do so.
- We discussed the context switch that the scheduler uses to actually swap a process and traced the function into the low-level architecture-specific code.
- We covered how processes in Linux can yield the CPU to other processes by calling schedule() and how the kernel then marks that process as needing to be scheduled.
- We delved into how the Linux kernel calculates dynamic priority based on the previous behavior of an individual process and how a process eventually gets removed from the scheduling queue.
- We then moved on and covered implicit and explicit user- and kernel-level preemption and how each is dealt with in the 2.6 Linux kernel.
- Finally, we explored timers and the system clock and how the system clock is implemented in both x86 and PPC architectures.
Exercises

1. How does Linux notify the scheduler to run periodically?

2. Describe the difference between interactive and non-interactive processes.

3. With respect to the scheduler, what's special about real-time processes?

4. What happens when a process runs out of scheduler ticks?

5. What's the advantage of an O(1) scheduler?

6. What kind of data structure does the scheduler use to manage the priority of the processes running on a system?

7. What happens if you were to call schedule() while holding a spinlock?

8. How does the kernel decide whether a kernel task can be implicitly preempted?