
Unit-I: Linux Process States

Subhrendu Chattopadhyay, IDRBT

Disclaimer: A few of the images are taken from the Internet and textbooks.
Process Control Block (PCB)/ task_struct

1. task_struct aka PCB

/* PCB */
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
	struct thread_info		thread_info;
#endif
	unsigned int			__state;
	/* saved state for "spinlock sleepers" */
	unsigned int			saved_state;
	randomized_struct_fields_start
	void				*stack;
	refcount_t			usage;
	/* Per task flags (PF_*), defined further below: */
	unsigned int			flags;
	unsigned int			ptrace;
	struct sched_entity		se;
	struct sched_rt_entity		rt;
	struct sched_dl_entity		dl;
	struct sched_dl_entity		*dl_server;
	const struct sched_class	*sched_class;
	struct mm_struct		*mm;
	struct mm_struct		*active_mm;
	…
	struct address_space		*faults_disabled_mapping;
	/* Filesystem information: */
	struct fs_struct		*fs;
	/* Open file information: */
	struct files_struct		*files;
	pid_t				pid;
	pid_t				tgid;
	struct pid			*thread_pid;
	…
	/* Children/sibling form the list of natural children: */
	struct list_head		children;
	struct list_head		sibling;
	struct task_struct		*group_leader;
	…
	/* Monotonic time in nsecs: */
	u64				start_time;
	/* Boot based time in nsecs: */
	u64				start_boottime;
	…
};

https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/sched.h#L756
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/sched.h#L96
Process Control Block (PCB)/ task_struct

1. task_struct aka PCB
2. The states are stored in the task_struct structure, which represents each process in the kernel

struct task_struct {
	…
	unsigned int	__state;
	int		exit_state;
	…
};

/* Used in tsk->__state: */
#define TASK_RUNNING			0x00000000
#define TASK_INTERRUPTIBLE		0x00000001
#define TASK_UNINTERRUPTIBLE		0x00000002
#define __TASK_STOPPED			0x00000004
#define __TASK_TRACED			0x00000008
/* Used in tsk->exit_state: */
#define EXIT_DEAD			0x00000010
#define EXIT_ZOMBIE			0x00000020
#define EXIT_TRACE			(EXIT_ZOMBIE | EXIT_DEAD)
/* Used in tsk->__state again: */
#define TASK_PARKED			0x00000040
#define TASK_DEAD			0x00000080
#define TASK_WAKEKILL			0x00000100
#define TASK_WAKING			0x00000200
#define TASK_NOLOAD			0x00000400
#define TASK_NEW			0x00000800
#define TASK_RTLOCK_WAIT		0x00001000
#define TASK_FREEZABLE			0x00002000
#define __TASK_FREEZABLE_UNSAFE		(0x00004000 * IS_ENABLED(CONFIG_LOCKDEP))
#define TASK_FROZEN			0x00008000
#define TASK_STATE_MAX			0x00010000
PCB States
Task State Flags:
These flags are bitmask values, meaning they can be combined to
represent multiple states or conditions for a single process.

Process Control Block (PCB)/ task_struct
1. TASK_PARKED:
○ Utility: Represents a process or thread that is parked and inactive but not terminated. A parked process will not consume CPU resources but
can be "unparked" and made runnable when needed. This is typically used in scenarios like kernel thread management or when a thread is
waiting for a specific event to resume execution.
2. TASK_DEAD:
○ Utility: Marks a process as dead. A process enters this state after it has finished execution and is being cleaned up by the kernel. At this point,
the process is no longer scheduled for execution, and its resources are released. It's the final state before the process is completely removed
from the system.
3. TASK_WAKEKILL:
○ Utility: Indicates that a process can be woken up by a kill signal (SIGKILL). When a process is in this state, receiving a SIGKILL signal will
immediately wake it up, even if it is in an uninterruptible sleep (e.g., TASK_UNINTERRUPTIBLE). This is used to ensure that certain signals
can always wake up a process.
4. TASK_WAKING:
○ Utility: Represents a process that is in the process of waking up but hasn't yet fully transitioned to TASK_RUNNING. This state is a transitional
state used by the scheduler when a process is moving from a sleeping state to a runnable state.
5. TASK_NOLOAD:
○ Utility: Marks a process as not contributing to system load averages. Processes in this state do not affect the system's load calculation, even if
they are running. This is useful for kernel threads or low-priority background processes that should not impact load metrics.
6. TASK_NEW:
○ Utility: Represents a newly created process that has not yet been scheduled to run. This state is used for processes that are in the initial
stages of creation but have not yet been placed on the runqueue for execution.

Process Control Block (PCB)/ task_struct
7. TASK_RTLOCK_WAIT:
○ Utility: Indicates that a process is waiting on a real-time lock (RT lock). RT locks are used in real-time systems where certain processes require strict timing guarantees. A process in this state is blocked until the lock is available.
8. TASK_FREEZABLE:
○ Utility: Marks a process as "freezable," meaning it can be frozen by the kernel, typically during system hibernation or suspend operations. When the system goes into a low-power state, processes marked as freezable will be paused until the system resumes normal operation.
9. __TASK_FREEZABLE_UNSAFE:
○ Utility: This is a conditional flag that is enabled only if lock dependency tracking (CONFIG_LOCKDEP) is enabled in the kernel configuration. It marks processes as freezable under conditions that are potentially unsafe in terms of locking, and helps in debugging lock-related issues during process freezing and resumption.
10. TASK_FROZEN:
○ Utility: Represents a process that has been frozen, typically during a system suspend or hibernate operation. A frozen process is completely paused and cannot execute until the system resumes normal operation. This differs from being parked or asleep because a frozen process is paused at the system level.
11. TASK_STATE_MAX:
○ Utility: Represents the maximum bitmask value for process states, one bit above the highest defined state flag. It is used as a boundary marker: no state flags should exist at or beyond this value, which helps in bounds checking when managing process state transitions.
Process Scheduling Classes
Every process is attached to a scheduling class
1. Five scheduling classes (in order of lower to higher priority)
a. Idle (/kernel/sched/idle.c)
b. Fair (/kernel/sched/fair.c)
c. Real time (/kernel/sched/rt.c)
d. Deadline (/kernel/sched/deadline.c)
e. Stop (/kernel/sched/stop_task.c)

https://github.com/torvalds/linux/blob/master/kernel/sched/idle.c
https://github.com/torvalds/linux/blob/master/kernel/sched/fair.c
https://github.com/torvalds/linux/blob/master/kernel/sched/rt.c
https://github.com/torvalds/linux/blob/master/kernel/sched/deadline.c
https://github.com/torvalds/linux/blob/master/kernel/sched/stop_task.c
Scheduling Policies
Every class has one or more policies associated with it [*]
1. For idle class
a. SCHED_IDLE
b. For some very low priority background processes
2. For fair class
a. SCHED_OTHER/SCHED_NORMAL
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/fair.c#L8393
b. SCHED_BATCH
3. For real time class
a. SCHED_FIFO
b. SCHED_RR
4. For deadline class
a. SCHED_DEADLINE

[*] https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/sched.h#L850
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/uapi/linux/sched.h#L112
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/sched.h#L186
Scheduling Policies (History)
Kernel version 0.01 (Genesis scheduler)
1. A single queue of runnable processes, held in a fixed-size task array (NR_TASKS entries)
2. The scheduler iterates over the entire queue to select a task to run
a. Check if any alarm is raised for a task, if yes, mark for processing
i. Also move the tasks from waiting to running state if alarm raised
b. Find the task with the largest unused timeslice and schedule it
i. If no such process
1. Assign all processes new timeslice values based on priority
2. Higher priority gets larger timeslice
c. Schedule the one with the largest timeslice
3. Very simple, but O(n)
i. Did not scale as systems became more powerful and complex

Scheduling Policies (History)

void schedule(void)
{
	int i, next, c;
	struct task_struct **p;

	/* check alarm, wake up any interruptible tasks
	   that have got a signal */
	for (p = &LAST_TASK; p > &FIRST_TASK; --p)
		if (*p) {
			if ((*p)->alarm && (*p)->alarm < jiffies) {
				(*p)->signal |= (1 << (SIGALRM - 1));
				(*p)->alarm = 0;
			}
			if ((*p)->signal && (*p)->state == TASK_INTERRUPTIBLE)
				(*p)->state = TASK_RUNNING;
		}
Scheduling Policies (History)

	/* this is the scheduler proper: */
	while (1) {
		c = -1;
		next = 0;
		i = NR_TASKS;
		p = &task[NR_TASKS];
		while (--i) {
			if (!*--p)
				continue;
			if ((*p)->state == TASK_RUNNING && (*p)->counter > c)
				c = (*p)->counter, next = i;
		}
		if (c) break;
		for (p = &LAST_TASK; p > &FIRST_TASK; --p)
			if (*p)
				(*p)->counter = ((*p)->counter >> 1) + (*p)->priority;
	}
	switch_to(next);
}
Scheduling Policies (History)
O(N) Scheduler
1. Used from kernel version 2.4 onwards, until 2.6
a. Similar to the Genesis scheduler
2. Main change is in the metric used for selecting the next process – Goodness of a process
a. Goodness of a process is calculated as the number of clock-ticks a task had left plus some
weight based on the task’s priority; returns integer values
b. -1000: Never select this task to run
c. a positive number: the goodness value; the larger, the better
d. +1000: A real time process
3. No preemption of running process
a. So a newly arriving real-time task cannot preempt a running ordinary user process
4. Has the same problem of scalability
a. Needs to loop through all processes
b. Goodness computations were costly
c. Runqueues can still incur significant locking overhead as the number of processes increases
5. Does not scale to multiprocessors
a. A single global runqueue suffers from the ping-pong effect (tasks bounce between CPUs, losing cache affinity)
Scheduling Policies (History)
O(1) Scheduler
1. Introduced in Kernel Version 2.6.0 (2003)
2. Introduced
a. The priority scale (0-139) we discussed and the separation between normal and real time tasks
b. Early preemption: A new runnable task of higher priority can preempt the currently running process of lower priority
c. Dynamic priority for considering interactivity
i. Decided based on recent interactivity (how often the process used the CPU in the past)
3. Separate runqueues for each CPU
Scheduling Policies (History)
O(1) Scheduler
4. Timeslice given for each process
a. For priority < 120, timeslice = (140 – priority)*20 milliseconds
b. otherwise, timeslice = (140 – priority)*5 milliseconds
5. Two sets of queues, Active and Expired
6. Each set has multiple queues, one for each priority
a. So total 140 queues in each set
7. At any point of time, schedule from the active set
8. A process moves to the expired set if it uses up its timeslice
a. Except in some cases, will discuss
9. A new process gets added to the expired set

Scheduling Policies (History)
O(1) Scheduler

• 5 tasks with different priorities (illustrative time slices, not computed from the formula above):
• Task A: Priority 100, Time Slice: 20ms
• Task B: Priority 102, Time Slice: 15ms
• Task C: Priority 101, Time Slice: 25ms
• Task D: Priority 0, Time Slice: 50ms (Real-time)
• Task E: Priority 105, Time Slice: 10ms

Priority Array Setup:
● Task D (Priority 0) in the highest priority queue.
● Tasks A, B, C, and E in regular task queues based on their priorities.
Scheduling Policies (History)
O(1) Scheduler

• Step 1: Task Selection
• Scheduler selects Task D (Priority 0) from the active array.
• The real-time task gets immediate execution because real-time tasks always preempt normal tasks.
• After 50ms:
• Task D finishes its time slice and is moved to the expired array.
• Step 2: Next Task Selection
• Scheduler selects the next highest-priority task in the active array, which is Task A (Priority 100, the lowest remaining priority value).
• Task A runs for its time slice of 20ms.
• After 20ms:
• Task A is moved to the expired array.
• Step 3: Switching to Lower-Priority Tasks
• Scheduler selects Task C (Priority 101) next, followed in turn by Task B (102) and Task E (105).
• Task C runs for its time slice of 25ms.
• After 25ms:
• Task C is moved to the expired array.
Scheduling Policies (History)
O(1) Scheduler

• Step 4: Active and Expired Arrays Swap
• After all tasks in the active array are exhausted, the active and expired arrays are swapped
• Task D (real-time) is once again moved to the active array

The process repeats with tasks rotating through their time slices based on their priorities
Scheduling Policies (History)
O(1) Scheduler
• Why is this called the O(1) scheduler? Task selection takes constant time: find the first set bit in a 140-bit priority bitmap, then dequeue the head of that priority's queue, independent of the number of tasks
• Problems with the O(1) scheduler
• Complex heuristics for interactivity check, did not work well in practice
• Managing 2 x 140 runqueues is complex
• Codebase was complex and difficult to debug
• Replaced by Completely Fair Scheduler (CFS) in 2007 (Kernel version 2.6.23)

Scheduling Policies (History)
So what was wrong with O(1)?
• Timeslice allocations across priorities were disproportionate: huge differences in the allocated timeslices

Priority  Static  Niceness  Quantum
Highest   100     -20       800ms
High      110     -10       600ms
Normal    120       0       100ms
Low       130      10        50ms
Lowest    139      19         5ms

Why is this a problem?
Scheduling Policies (History)
So what was wrong with O(1)?
• Low priority tasks cause frequent context switches, even if there are no other processes
• Suppose that there are two processes with priority 130; they will cause a context switch every 50 milliseconds unnecessarily
• High priority batch tasks can cause interactive tasks to suffer
• Suppose that there are two batch processes with priority 110, interactive jobs will not get a
chance to run for long
• Dynamic priority increase will still take time to catch up
• Fixed timeslice based on priority is not good
• Ignored the current load on the CPU

Niceness vs Priority
Nice Value vs Priority
• How to determine weight of the process ?
• Nice Value:
• Range:
• The nice value ranges from -20 (highest priority) to +19 (lowest priority).
• Default:
• The default nice value for a task is 0, which gives it normal priority.
• Effect on Priority:
• A lower (or more negative) nice value gives a task higher priority. A higher
(positive) nice value gives a task lower priority (i.e. it will be preempted
sooner).

https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/sched/prio.h#L5
User Space and PCB

• nice sys_call
• In task_struct → static_prio

struct task_struct {
	int				on_rq;
	int				prio;
	int				static_prio;
	int				normal_prio;
	unsigned int			rt_priority;
	struct sched_entity		se;
	struct sched_rt_entity		rt;
	struct sched_dl_entity		dl;
	struct sched_dl_entity		*dl_server;
	const struct sched_class	*sched_class;
	…
};

void set_user_nice(struct task_struct *p, long nice)
{
	…
	if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)
		return;
	…
}

static inline int task_nice(const struct task_struct *p)
{
	return PRIO_TO_NICE((p)->static_prio);
}

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/syscalls.c#L65
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/sched.h#L805
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/sched.h#L1827
Nice Value vs Priority
• Priority Calculation:
• Weight of Tasks:
• Each task has a weight based on its nice value. The weight is used to determine
how much CPU time the task gets relative to others.
/*
* Nice levels are multiplicative, with a gentle 10% change for every
* nice level changed. I.e. when a CPU-bound task goes from nice 0 to
* nice 1, it will get ~10% less CPU time than another CPU-bound task
* that remained on nice 0.
*
* The "10% effect" is relative and cumulative: from _any_ nice level,
* if you go up 1 level, it's -10% CPU usage, if you go down 1 level
* it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
* If a task goes up by ~10% and another task goes down by ~10% then
* the relative distance between them is ~25%.)
*/
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/sched/prio.h#L5
Nice Value vs Priority
• Priority Calculation:
• Weight of Tasks:
• Each task has a weight based on its nice value. The weight is used to determine how much CPU time the task gets relative to others.
• Priority to Weight Mapping:
• Formula for Weight:
• The weight is (approximately) calculated using the formula:
• weight ≈ 1024 / (1.25)^nice
• This means that tasks with lower nice values (higher priority) have higher weights and thus get more CPU time.

https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/sched/prio.h#L5
Nice Value vs Priority
• Relationship Overview:
• Higher Priority (More CPU Time):
• Tasks with negative nice values (closer to -20) have a higher weight, meaning they accumulate virtual runtime more slowly and are selected to run more frequently.
• Lower Priority (Less CPU Time):
• Tasks with positive nice values (closer to +19) have a lower weight, meaning they accumulate virtual runtime more quickly and get preempted sooner.
• Example of Nice Values and Weights (values from the kernel's sched_prio_to_weight[] table):

Nice Value  Priority (Relative)  Weight  Explanation
-20         Highest              88761   Very high priority, very slow vruntime growth
-10         High                 9548    High priority, slower vruntime growth
0           Normal               1024    Default priority, normal vruntime growth
10          Low                  110     Low priority, faster vruntime growth
19          Lowest               15      Very low priority, very fast vruntime growth

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/core.c#L9784
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/core.c#L9802
Nice Value vs Priority
• How It Affects CPU Scheduling:
• Lower Nice Values (e.g., -20):
• These tasks are more likely to be chosen by the scheduler because their virtual
runtime increases slowly. Thus, they tend to run more often, ensuring they get
more CPU time.
• Higher Nice Values (e.g., +19):
• These tasks are scheduled less frequently because their virtual runtime increases
quickly. Therefore, they are preempted by higher-priority tasks.

CFS: Completely Fair Scheduler
Completely Fair Scheduling (CFS) → SCHED_NORMAL
• Introduced in Kernel version 2.6.23 (2007)
• Default scheduler for a new task
• Major Idea
• To select the task to run
• Choose a task that has used the CPU less so far
• To decide the timeslice
• Calculate how long a task should run as a function of the total number of currently
runnable processes and their priorities
• So no fixed timeslice, depends on other tasks in the runqueue
• Trying to be fair to everyone

Completely Fair Scheduling (CFS) → SCHED_NORMAL
• Consider two processes, a text editor and a simulation job
• Ideal proportion of CPU: 50%
• Text editor will not use its 50% always
• But will need the CPU immediately when it wants
• Will use it for a short time and then wait again
• Simulation job can use more than 50% when the text editor is not using it
• But must relinquish immediately whenever text editor wants it
• CFS Idea
• Allocate the CPU to a process which has used it less so far
• So the text editor will get scheduled as soon as it wants the CPU

Completely Fair Scheduling (CFS) → SCHED_NORMAL
• But a simple implementation does not take care of priorities
• So weight the runtime with the priority
• Keep track of virtual runtime (not exact physical runtime) of each process
• At every scheduling tick, if a process has run for p milliseconds,
set vruntime += p * (NICE_0_LOAD / load weight of the process)
• This scaling factor grows with the nice value of a process (nicer tasks have smaller load weights; the worked examples below use "weight" for this inverse factor)
• At any point of time, choose the process with the smallest vruntime
• Processes with higher nice values have faster increase in vruntime, therefore are chosen later
(lower priority as it should be) and vice-versa
• When a process sleeps, its vruntime remains unchanged
• The runqueue is arranged as a Red-Black tree keyed by vruntime

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/fair.c#L8673
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/sched.h#L591

Completely Fair Scheduling (CFS) → SCHED_NORMAL

function CFS_Scheduler(){
  while true{
    // Pick the task with the smallest vruntime from the red-black tree
    task = PickLeftmostTask(cfs_rq.tasks_timeline)
    if task != NULL{
      // Run the selected task
      RunTask(task)
      UpdateVruntime(task)
      // If the task is still runnable
      if task.state == RUNNABLE{
        // Re-insert it back into the red-black tree
        InsertIntoTree(cfs_rq.tasks_timeline, task)
      }
    }else{
      // No runnable tasks, enter idle state
      EnterIdleState()
    }
  }
}

function PickLeftmostTask(tree){
  // Return the task with the smallest vruntime (i.e. the leftmost node)
  return tree.leftmost_node
}

function RunTask(task){
  ContextSwitchTo(task)
}

function InsertIntoTree(tree, task){
  InsertOrUpdateRBTree(tree, task)
}

function EnterIdleState(){
  Sleep()
}

https://docs.kernel.org/scheduler/sched-design-CFS.html
Completely Fair Scheduling (CFS) → SCHED_NORMAL

function UpdateVruntime(task){
  // Calculate elapsed time since the task started running
  delta_exec = CurrentTime() - task.start_time
  // Update the task's vruntime based on delta_exec and task priority
  // (per this deck's convention, task.weight is the inverse scaling
  // factor: larger for nicer, lower-priority tasks)
  task.vruntime += delta_exec * task.weight
  // Update the minimum vruntime in the system
  cfs_rq.min_vruntime = min(cfs_rq.min_vruntime, task.vruntime)
}
Completely Fair Scheduling (CFS) → SCHED_NORMAL

Initial Task Setup:
● Task A: Priority 100, vruntime = 0
● Task B: Priority 200, vruntime = 0
● Task C: Priority 150, vruntime = 0

Red-Black Tree Structure:
● Initially, all tasks are placed into the red-black tree with vruntime = 0.

Task Priority Weights (here "weight" is the vruntime multiplier):
● Lower priority value → lower multiplier → vruntime grows more slowly → more CPU time.
● Task A weight: 1.0
● Task B weight: 2 (vruntime grows fastest, runs least)
● Task C weight: 1.5
Completely Fair Scheduling (CFS) → SCHED_NORMAL

Red-Black Tree at Time = 0:
● Left Child (Task A): vruntime = 0
● Root (Task B): vruntime = 0
● Right Child (Task C): vruntime = 0

Current Minimum vruntime: 0 (all tasks are equal).
Completely Fair Scheduling (CFS) → SCHED_NORMAL

• Task A Selected:
• Since all vruntimes are equal, Task A is chosen arbitrarily.
• Task A runs for 10ms:
• Updated vruntime of Task A: vruntime = 10 * 1 = 10
• Red-Black Tree After Execution:
• Task A: vruntime = 10 (reinserted into the tree)
• Task B: vruntime = 0 (remains the same)
• Task C: vruntime = 0 (remains the same)
Completely Fair Scheduling (CFS) → SCHED_NORMAL

• Red-Black Tree at Time = 10ms
• Left Child (Task B): vruntime = 0
• Root (Task C): vruntime = 0
• Right Child (Task A): vruntime = 10
• Current Minimum vruntime: 0 (Task B).
• Task B Selected:
• Task B now has the smallest vruntime and is selected to run.
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
• Task B runs for 10ms:
• With a multiplier of 2, Task B's vruntime increases the fastest.
• Updated vruntime of Task B: vruntime = 10 * 2 = 20
• Red-Black Tree After Execution:
• Task B: vruntime = 20 (reinserted into the tree)
• Task A: vruntime = 10 (remains the same)
• Task C: vruntime = 0 (remains the same)
Completely Fair Scheduling (CFS) → SCHED_NORMAL

• Red-Black Tree at Time = 20ms
• Left Child (Task C): vruntime = 0
• Root (Task A): vruntime = 10
• Right Child (Task B): vruntime = 20
• Current Minimum vruntime: 0 (Task C).
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
• Task C Selected:
• Task C has the smallest vruntime and is selected to run.
• Task C runs for 10ms:
• With a multiplier of 1.5, its vruntime increases faster than Task A's but slower than Task B's.
• Updated vruntime of Task C: vruntime = 10 * 1.5 = 15
• Red-Black Tree After Execution:
• Task C: vruntime = 15 (reinserted into the tree)
• Task A: vruntime = 10 (remains the same)
• Task B: vruntime = 20 (remains the same)
Completely Fair Scheduling (CFS) → SCHED_NORMAL
• Red-Black Tree at Time = 30ms
• Left Child (Task A): vruntime = 10
• Root (Task C): vruntime = 15
• Right Child (Task B): vruntime = 20
• Current Minimum vruntime: 10 (Task A).

Completely Fair Scheduling (CFS) → SCHED_NORMAL
• Implementation Challenges
• The runqueue is maintained as a single Red-Black tree organized with the virtual runtimes
• Leftmost node gives the next process to run (O(log n) )
• So processes move from left to right of the tree as they execute
• Higher priority processes move slower than lower priority process, increasing their chance to be
rescheduled sooner
• When are new processes inserted into the tree?
• When a new process is created
• When a process becomes runnable
• With what initial vruntime?
• The maximum of the minimum vruntimes seen so far (will see later what this means)

Completely Fair Scheduling (CFS) → SCHED_NORMAL
• Types of tasks
• Interactive Tasks
• Uses less CPU time, so vruntime stays low, so stays more on left side of the tree
• Scheduled again earlier
• Batch Tasks
• Uses more CPU time, so vruntime is high, so moves more to the right side of the tree
• Scheduled later
• So CFS favors interactive tasks

Completely Fair Scheduling (CFS) → SCHED_NORMAL
Implementation
• Data Structures
• task_struct
• static_prio : the static priority of the process from the nice value
• prio : the actual priority of the process used by the scheduler
• normal_prio : the priority based on the static priority and the scheduling policy
• rt_priority : real time priority (a number between 0 and 99)
• se, rt, dl : different scheduling entity structures corresponding to the fair, rt, and deadline classes. The applicable structure is used depending on the scheduling class of the process

struct task_struct {
	…
	int				prio;
	int				static_prio;
	int				normal_prio;
	unsigned int			rt_priority;
	const struct sched_class	*sched_class;
	struct sched_entity		se;
	struct sched_rt_entity		rt;
	struct sched_dl_entity		dl;
	…
};

https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/sched.h#L756
Completely Fair Scheduling (CFS) → SCHED_NORMAL
Implementation
• Data Structures
• Suppose the task_struct is pointed to by p
• Compute p->normal_prio from p->static_prio
• Compute p->prio from p->normal_prio (via effective_prio())

struct task_struct {
	…
	int				prio;
	int				static_prio;
	int				normal_prio;
	unsigned int			rt_priority;
	const struct sched_class	*sched_class;
	struct sched_entity		se;
	struct sched_rt_entity		rt;
	struct sched_dl_entity		dl;
	…
};

#define MAX_USER_RT_PRIO	100
#define MAX_RT_PRIO		MAX_USER_RT_PRIO
#define MAX_PRIO		(MAX_RT_PRIO + NICE_WIDTH)
#define DEFAULT_PRIO		(MAX_RT_PRIO + NICE_WIDTH / 2)

https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/sched.h#L538
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/sched/prio.h#L9
Completely Fair Scheduling (CFS) → SCHED_NORMAL
Implementation
• Data Structures
• sched_entity
• Each node of the RB tree is a sched_entity structure
• This is a fair-class-specific structure; there are separate structures (sched_rt_entity etc.) for the other classes

struct sched_entity {
	struct load_weight		load;
	struct rb_node			run_node;
	u64				deadline;
	u64				min_vruntime;
	struct list_head		group_node;
	unsigned int			on_rq;
	u64				exec_start;
	u64				sum_exec_runtime;
	u64				prev_sum_exec_runtime;
	u64				vruntime;
	s64				vlag;
	u64				slice;
	u64				nr_migrations;
#ifdef CONFIG_FAIR_GROUP_SCHED
	int				depth;
	struct sched_entity		*parent;
	/* rq on which this entity is (to be) queued: */
	struct cfs_rq			*cfs_rq;
	/* rq "owned" by this entity/group: */
	struct cfs_rq			*my_q;
	/* cached value of my_q->h_nr_running */
	unsigned long			runnable_weight;
#endif
};

https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/sched.h#L538
Completely Fair Scheduling (CFS) → SCHED_NORMAL
Implementation
• Data Structures
• sched_entity
• load : the load of this process (the weight we used in CFS)
• run_node : the RB tree node for this process
• on_rq : whether the task is on a runqueue
• exec_start : starting time of the process in the last scheduling tick period
• sum_exec_runtime : total runtime of the process till now
• vruntime : virtual runtime
• prev_sum_exec_runtime : total runtime of the process till the beginning of the last scheduling period
• nr_migrations : number of times this process has been migrated between CPUs
• statistics : a structure containing different scheduling stats fields

struct sched_entity {
	struct load_weight	load;
	struct rb_node		run_node;
	u64			deadline;
	u64			min_vruntime;
	struct list_head	group_node;
	unsigned int		on_rq;
	u64			exec_start;
	u64			sum_exec_runtime;
	u64			prev_sum_exec_runtime;
	u64			vruntime;
	s64			vlag;
	u64			slice;
	u64			nr_migrations;
	…
};

https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/sched.h#L538
Completely Fair Scheduling (CFS) → SCHED_NORMAL
Implementation
• Data Structures
• sched_class
• Defines generic functions (function pointers) for operations on the runqueue
struct sched_class {
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);

void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
struct task_struct *(*pick_next_task)(struct rq *rq);
void (*task_fork)(struct task_struct *p);
void (*task_dead)(struct task_struct *p);
void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);

void (*prio_changed) (struct rq *this_rq, struct task_struct *task, int oldprio);
void (*update_curr)(struct rq *rq);

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/sched.h#L2281
46
Completely Fair Scheduling (CFS) → SCHED_NORMAL
Implementation
• Data Structures
• sched_class
• enqueue_task : called when a task becomes runnable
• dequeue_task : called when a task is no longer runnable
• yield_task : called when a task wants to give up the CPU voluntarily (but is still runnable)
• check_preempt_curr : checks if a runnable task should preempt the currently running task or not
• pick_next_task : choose the next task to run
• task_fork, task_dead : called to inform the scheduler that a new task is spawned or dead
• task_tick : called on a timer interrupt
• prio_changed : called when the priority of a process is changed
• update_curr : updates the runtime statistics

struct sched_class {
    void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*yield_task)(struct rq *rq);
    …
    void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
    struct task_struct *(*pick_next_task)(struct rq *rq);
    void (*task_fork)(struct task_struct *p);
    void (*task_dead)(struct task_struct *p);
    void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
    …
    void (*prio_changed)(struct rq *this_rq, struct task_struct *task, int oldprio);
    void (*update_curr)(struct rq *rq);

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/sched.h#L2281
47
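The sched_class table is plain C polymorphism: the core scheduler calls through function pointers without knowing which policy implements them. A toy sketch of that pattern under stated assumptions (the struct, functions, and "task id" return values below are invented for illustration and are not kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* Toy scheduling-class vtable, mimicking struct sched_class. */
struct toy_class {
    const char *name;
    int (*pick_next)(void);          /* returns a fake task id */
};

static int fair_pick(void) { return 1; }
static int rt_pick(void)   { return 2; }

static const struct toy_class fair_class = { "fair", fair_pick };
static const struct toy_class rt_class   = { "rt",   rt_pick };

/* Core code dispatches through the pointer, analogous to
 * p->sched_class->pick_next_task(rq) in the kernel. */
static int pick_from(const struct toy_class *c)
{
    return c->pick_next();
}
```

The design choice this illustrates: adding a new scheduling class means providing a new table of functions, not touching the core dispatch code.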
Completely Fair Scheduling (CFS) → SCHED_NORMAL
Implementation
• Macro
• DEFINE_SCHED_CLASS
• Defines the fair class's sched_class instance; the function pointers are used later to perform operations on the runqueue

DEFINE_SCHED_CLASS(fair) = {
    …
    .enqueue_task   = enqueue_task_fair,
    .dequeue_task   = dequeue_task_fair,
    .yield_task     = yield_task_fair,
    .yield_to_task  = yield_to_task_fair,
    .wakeup_preempt = check_preempt_wakeup_fair,
    .pick_next_task = __pick_next_task_fair,
    .put_prev_task  = put_prev_task_fair,
    .set_next_task  = set_next_task_fair,
    …
}

#define DEFINE_SCHED_CLASS(name) \
    const struct sched_class name##_sched_class \
    __aligned(__alignof__(struct sched_class)) \
    __section("__" #name "_sched_class")

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/sched.h#L2367
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/fair.c#L13188 48
Completely Fair Scheduling (CFS) → SCHED_NORMAL
Implementation
• Data Structures
• rq aka runqueue
• Each CPU has its own runqueue
• The rq is a generic structure; it has pointers to class-specific runqueues
• Per-CPU runqueues:
  ● Reduce runqueue contention
  ● Act as a CPU-core-specific cache
  ● Support task-CPU affinity

struct rq {
    raw_spinlock_t lock;
    unsigned int nr_running;
    …
    struct cfs_rq cfs;
    struct rt_rq rt;
    struct dl_rq dl;
    …
    struct task_struct __rcu *curr;
    struct task_struct *idle;
    int cpu;
    …

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/sched.h#L1011
49
Completely Fair Scheduling (CFS) → SCHED_NORMAL
Implementation
• Data Structures
• rq aka runqueue
• lock : spinlock for locking the runqueue
• nr_running : number of processes on this queue, over all scheduling classes
• cfs, rt, dl : class-specific queues for the fair class, rt class, and deadline class. All can exist at the same time, since at any moment there can be processes belonging to different classes in the system
• curr : pointer to the currently running process
• idle : pointer to the idle process
• cpu : CPU of this runqueue

struct rq {
    raw_spinlock_t lock;
    unsigned int nr_running;
    …
    struct cfs_rq cfs;
    struct rt_rq rt;
    struct dl_rq dl;
    …
    struct task_struct __rcu *curr;
    struct task_struct *idle;
    int cpu;
    …

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/sched.h#L1011
50
Completely Fair Scheduling (CFS) → SCHED_NORMAL
Initialization
• clone3 → kernel_clone() → copy_process() → sched_fork()

/*
 * sys_clone3 - create a new process with specific properties
 * @uargs: argument structure
 * @size: size of @uargs
 *
 * clone3() is the extensible successor to clone()/clone2().
 * It takes a struct as argument that is versioned by its size.
 *
 * Return: On success, a positive PID for the child process.
 *         On error, a negative errno number.
 */
SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size){
    int err;
    struct kernel_clone_args kargs;
    pid_t set_tid[MAX_PID_NS_LEVEL];
#ifdef __ARCH_BROKEN_SYS_CLONE3
#warning clone3() entry point is missing, please fix
    return -ENOSYS;
#endif
    kargs.set_tid = set_tid;
    err = copy_clone_args_from_user(&kargs, uargs, size);
    if (err)
        return err;
    if (!clone3_args_valid(&kargs))
        return -EINVAL;
    return kernel_clone(&kargs);
}

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/fork.c#L3083
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/fork.c#L2759
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/fork.c#L2800
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/core.c#L4545 51
Completely Fair Scheduling (CFS) → SCHED_NORMAL
Initialization
• clone3 → kernel_clone() → copy_process() → sched_fork()

/*
 * Ok, this is the main fork-routine. It copies the process, and if
 * successful kick-starts it and waits for it to finish using the VM
 * if required. args->exit_signal is expected to be checked for
 * sanity by the caller.
 */
pid_t kernel_clone(struct kernel_clone_args *args){
    u64 clone_flags = args->flags;
    struct completion vfork;
    struct pid *pid;
    struct task_struct *p;
    int trace = 0;
    pid_t nr;
    …
    p = copy_process(NULL, trace, NUMA_NO_NODE, args);
    …
}

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/fork.c#L3083
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/fork.c#L2759
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/fork.c#L2800
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/core.c#L4545 52
Completely Fair Scheduling (CFS) → SCHED_NORMAL
Initialization
• clone3 → kernel_clone() → copy_process() → sched_fork()

/*
 * This creates a new process as a copy of the old one, but does not
 * actually start it yet. It copies the registers, and all the
 * appropriate parts of the process environment (as per the clone
 * flags). The actual kick-off is left to the caller.
 */
__latent_entropy struct task_struct *copy_process(struct pid *pid,
        int trace, int node, struct kernel_clone_args *args){
    …
    /* Perform scheduler related setup. Assign this task to a CPU. */
    retval = sched_fork(clone_flags, p);
    …
}

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/fork.c#L3083
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/fork.c#L2759
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/fork.c#L2800
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/core.c#L4545 53
Completely Fair Scheduling (CFS) → SCHED_NORMAL
Initialization
• clone3 → kernel_clone() → copy_process() → sched_fork()

int sched_fork(unsigned long clone_flags, struct task_struct *p){
    __sched_fork(clone_flags, p);
    …
    p->prio = current->normal_prio;
    …
    if (unlikely(p->sched_reset_on_fork)) {
        if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
            p->policy = SCHED_NORMAL;
            p->static_prio = NICE_TO_PRIO(0);
            p->rt_priority = 0;
        } else if (PRIO_TO_NICE(p->static_prio) < 0)
            p->static_prio = NICE_TO_PRIO(0);
        p->prio = p->normal_prio = p->static_prio;
        set_load_weight(p);
        p->sched_reset_on_fork = 0;
    }
    if (dl_prio(p->prio))
        return -EAGAIN;
    else if (rt_prio(p->prio))
        p->sched_class = &rt_sched_class;
    else
        p->sched_class = &fair_sched_class;
    …

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/core.c#L4545
54
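The sched_reset_on_fork branch above demotes a forked real-time or deadline child to SCHED_NORMAL at nice 0, and also resets a negative-nice (boosted) child back to nice 0. A compact user-space model of that decision, assuming the kernel priority scale where NICE_TO_PRIO(0) == 120; the policy constants and the function name are invented for illustration:

```c
#include <assert.h>

enum { POL_NORMAL, POL_FIFO, POL_RR, POL_DEADLINE };

/* Model of the reset-on-fork normalization in sched_fork():
 * RT/deadline children become SCHED_NORMAL at nice 0, a
 * negative-nice child is reset to nice 0, and a positive-nice
 * child keeps its priority. Returns the new static priority. */
static int reset_on_fork(int *policy, int static_prio)
{
    const int nice0 = 120;                 /* NICE_TO_PRIO(0) */
    if (*policy == POL_FIFO || *policy == POL_RR || *policy == POL_DEADLINE) {
        *policy = POL_NORMAL;
        return nice0;
    }
    if (static_prio < nice0)               /* negative nice, i.e. boosted */
        return nice0;
    return static_prio;                    /* positive nice is kept */
}
```

This is why a SCHED_FIFO parent that sets sched_reset_on_fork cannot leak its real-time privilege into its children.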
Completely Fair Scheduling (CFS) → SCHED_NORMAL
Initialization
• clone3 → kernel_clone() → copy_process() → sched_fork() → __sched_fork()

static void __sched_fork(unsigned long clone_flags, struct task_struct *p){


p->on_rq = 0;
p->se.on_rq = 0;
p->se.exec_start = 0;
p->se.sum_exec_runtime = 0;
p->se.prev_sum_exec_runtime = 0;
p->se.nr_migrations = 0;
p->se.vruntime = 0;
p->se.vlag = 0;
p->se.slice = sysctl_sched_base_slice;
INIT_LIST_HEAD(&p->se.group_node);

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/fork.c#L2759
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/fork.c#L2800 55
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/core.c#L4545
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/core.c#L4314
Completely Fair Scheduling (CFS) → SCHED_NORMAL
Update vruntime
• update_curr()
• Called periodically on scheduler tick or on sleep/wakeup

static void update_curr(struct cfs_rq *cfs_rq){
    struct sched_entity *curr = cfs_rq->curr;
    s64 delta_exec;
    if (unlikely(!curr))
        return;
    delta_exec = update_curr_se(rq_of(cfs_rq), curr);
    if (unlikely(delta_exec <= 0))
        return;
    curr->vruntime += calc_delta_fair(delta_exec, curr);
    update_deadline(cfs_rq, curr);
    update_min_vruntime(cfs_rq);

    if (entity_is_task(curr))
        update_curr_task(task_of(curr), delta_exec);
    account_cfs_rq_runtime(cfs_rq, delta_exec);
}

static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se){
    if (unlikely(se->load.weight != NICE_0_LOAD))
        delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
    return delta;
}

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/fair.c#L1153
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/fair.c#L289 56
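calc_delta_fair() scales the real runtime by NICE_0_LOAD/weight, so a heavier (higher-priority) task accumulates vruntime more slowly and therefore gets more CPU before the RB tree considers it "ahead". A simplified model of that scaling: the kernel's __calc_delta() uses precomputed fixed-point inverse weights for speed, whereas this sketch uses plain 64-bit division for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024u   /* weight of a nice-0 task */

/* vruntime delta = delta_exec * NICE_0_LOAD / weight.
 * Simplified stand-in for calc_delta_fair()/__calc_delta(). */
static uint64_t calc_delta_fair_sketch(uint64_t delta_exec, uint32_t weight)
{
    if (weight == NICE_0_LOAD)          /* fast path, as in the kernel */
        return delta_exec;
    return delta_exec * NICE_0_LOAD / weight;
}
```

So after the same 1000 ns of real execution, a task with twice the nice-0 weight advances its vruntime by only 500 ns, while a half-weight task advances by 2000 ns.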
Completely Fair Scheduling (CFS) → SCHED_NORMAL
Overall Flow for Scheduler
• Disable interrupts (local_irq_disable() )
• Lock the runqueue (rq_lock() )
• If current task is not in TASK_RUNNING state
• If it has a signal pending (signal_pending_state() ), change state to TASK_RUNNING
• Else dequeue it
• Choose the next task to run (pick_next_task() ) and context switch if needed (if different from current task)
• Unlock the run queue and enable interrupts

Called periodically on scheduler tick or on sleep/wakeup

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/sched.h#L2294
57
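Inside pick_next_task(), the scheduling classes are consulted in strict priority order: the deadline and real-time classes are tried before the fair class, and the idle task is the fallback. A toy model of that iteration; the struct, per-class counts, and function name below are illustrative, not kernel code:

```c
#include <assert.h>
#include <string.h>

struct toy_rq { int dl_n, rt_n, cfs_n; };  /* runnable counts per class */

/* Return which class supplies the next task, highest class first,
 * falling back to the idle task -- the shape of __pick_next_task(). */
static const char *pick_next_class(const struct toy_rq *rq)
{
    if (rq->dl_n)  return "dl";
    if (rq->rt_n)  return "rt";
    if (rq->cfs_n) return "fair";
    return "idle";
}
```

This ordering is why a single runnable SCHED_FIFO task starves every SCHED_NORMAL task on the same CPU (absent RT throttling).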
Extra: Browsing Source Code

58
Bootlin
• What is Bootlin?
• Bootlin is an open-source consulting company that maintains free kernel cross-reference tools, helping developers
navigate the Linux kernel.
• Why use Bootlin?
• Provides an easy way to search, navigate, and cross-reference kernel sources.

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/sched.h#L2294
59
Overview of Bootlin's Linux Cross-Reference (LXR)
• Bootlin LXR is a web-based tool that allows browsing through kernel source code versions.
• Offers advanced search and cross-referencing functionalities.
• You can explore function definitions, macros, and other key elements in the kernel code.
• Accessing Bootlin Kernel Browser
• URL to Access
• Bootlin’s kernel cross-reference browser is available at: https://elixir.bootlin.com
• Supported Kernel Versions
• View different versions of the kernel, from legacy to the latest stable releases.
• Basic Navigation
• Kernel Source Tree
• The main screen shows the root directory of the kernel source. You can drill down through directories like arch/,
kernel/, drivers/, etc.
• File Structure
• Kernel source is structured in folders based on functionalities like architecture-specific code (arch/), core kernel
(kernel/), etc.

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/sched.h#L2294
60
Searching for Symbols or Functions
• Search Bar
• Located at the top of the page, the search bar allows you to find functions, variables, and macros.
• Search Example:
• Search for sched to see files and lines related to kernel scheduling, or a function like schedule() to see its definition and
references.
• Viewing and Understanding Kernel Code
• Syntax Highlighting
• Code is displayed with syntax highlighting, making it easy to distinguish keywords, functions, and comments.
• Function Call Links
• Functions or symbols within the code are hyperlinked. Clicking on a symbol navigates to its definition.

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/sched.h#L2294
61
Understanding Call Graphs
• Cross-Referencing
• For each function, you can find a list of files where the function is referenced or called. Helps in tracing function calls.
• Call Graph Example:
• Clicking on sched_fork() shows the files where it is called, allowing deeper insight into its usage.
• Navigating Definitions and Macros
• Definitions
• Find definitions for kernel constants, macros, and inline functions. Cross-referencing helps trace their use in the kernel.
• Macro Example:
• Search for CONFIG_SMP to see how it influences code compilation in the kernel.

• Questions???

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/sched.h#L2294
62
Extra: Other nitty-gritties

63
Scheduling: Extra
Timer Management
• The timer periodically interrupts
• Called the scheduler tick
• The timer interrupt handler calls update_process_times() → sched_tick() → the current process's task_struct->sched_class->task_tick() function, which for the fair class is task_tick_fair()
• task_tick_fair() calls entity_tick() → entity_tick() calls check_preempt_tick()

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued){
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &curr->se;

    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        entity_tick(cfs_rq, se, queued);
    }

    if (static_branch_unlikely(&sched_numa_balancing))
        task_tick_numa(rq, curr);
    …
    task_tick_core(rq, curr);
}

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/time/timekeeping.h#L25
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/time/timer.c#L2490
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/core.c#L5466
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/fair.c#L12679 64
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/fair.c#L5509
Scheduling: Extra
Timer Management
• The timer periodically interrupts
• Called scheduler tick
• The timer interrupt handler calls update_process_times()
• update_process_times() calls sched_tick()
• sched_tick() calls the current process's task_struct->sched_class->task_tick() function, which for the fair class is task_tick_fair()
• task_tick_fair() calls entity_tick()
• entity_tick() calls update_curr()

static void entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued){
    // Update run-time statistics of the 'current'.
    update_curr(cfs_rq);
    // Ensure that runnable average is periodically updated.
    update_load_avg(cfs_rq, curr, UPDATE_TG);
    update_cfs_group(curr);
    …
}

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/time/timekeeping.h#L25
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/time/timer.c#L2490
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/core.c#L5466
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/fair.c#L12679 65
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/fair.c#L5509
Scheduling: Extra
Timer Management
• The timer periodically interrupts
• Called the scheduler tick
• The timer interrupt handler calls update_process_times()
• update_process_times() calls sched_tick()
• sched_tick() calls the current process's task_struct->sched_class->task_tick() function, which for the fair class is task_tick_fair()
• task_tick_fair() calls entity_tick()
• entity_tick() calls update_curr()

static void update_curr(struct cfs_rq *cfs_rq){
    struct sched_entity *curr = cfs_rq->curr;
    s64 delta_exec;
    if (unlikely(!curr))
        return;
    delta_exec = update_curr_se(rq_of(cfs_rq), curr);
    if (unlikely(delta_exec <= 0))
        return;
    curr->vruntime += calc_delta_fair(delta_exec, curr);
    update_deadline(cfs_rq, curr);
    update_min_vruntime(cfs_rq);
    if (entity_is_task(curr))
        update_curr_task(task_of(curr), delta_exec);
    account_cfs_rq_runtime(cfs_rq, delta_exec);
}

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/time/timekeeping.h#L25
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/time/timer.c#L2490
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/core.c#L5466
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/fair.c#L12679 66
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/fair.c#L5509
Sleep and Wakeup

67
Scheduling: Extra

Sleep and Wakeup
• Processes wait on different wait queues
• Wait queues
• Linked list of wait queue entries
• Data structures
• wait_queue_entry
• flags: different values; the two of interest to us are WQ_FLAG_EXCLUSIVE and WQ_FLAG_PRIORITY
• private: points to the task that is waiting
• func: the function to be called on wake up
• There is a default_wake_function() also, which through other functions calls activate_task(), which enqueues the task back into the run queue

/* wait_queue_entry::flags */
#define WQ_FLAG_EXCLUSIVE 0x01
#define WQ_FLAG_WOKEN 0x02
#define WQ_FLAG_CUSTOM 0x04
#define WQ_FLAG_DONE 0x08
#define WQ_FLAG_PRIORITY 0x10

struct wait_queue_entry {
    unsigned int flags;
    void *private;
    wait_queue_func_t func;
    struct list_head entry;
};

typedef int (*wait_queue_func_t)(struct wait_queue_entry *wq_entry, unsigned mode, int flags, void *key);
int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int flags, void *key);

https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/wait.h#L28
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/wait.h#L15 68
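The func pointer is what makes wake-up flexible: __wake_up_common() simply calls entry->func(...), and the default callback marks the stored task runnable. A toy version of that callback scheme; the types, field layout, and names below are simplified stand-ins, not the kernel API:

```c
#include <assert.h>

#define WQ_FLAG_EXCLUSIVE 0x01

struct toy_task { int state; };             /* 0 = running, 1 = sleeping */

struct toy_wait_entry {
    unsigned int flags;
    void *private;                          /* the waiting task */
    int (*func)(struct toy_wait_entry *);   /* wake callback */
};

/* Default wake callback: mark the stored task runnable, as
 * default_wake_function() ultimately does via try_to_wake_up(). */
static int toy_default_wake(struct toy_wait_entry *e)
{
    ((struct toy_task *)e->private)->state = 0;
    return 1;                               /* "woke something up" */
}
```

Drivers can install their own callback instead of the default, which is how, for example, eventpoll entries do extra bookkeeping at wake-up time.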
Scheduling: Extra
Sleep and Wakeup
• Processes wait on different wait queues
• Wait queues
• Linked list of wait queue entries
• Data structures
• wait_queue_entry
• Initialize
• init_waitqueue_entry()
• Add tasks to Wait Queue
• add_wait_queue() → __add_wait_queue()

static inline void init_waitqueue_entry(struct wait_queue_entry *wq_entry, struct task_struct *p){
    wq_entry->flags = 0;
    wq_entry->private = p;
    wq_entry->func = default_wake_function;
}

void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry){
    unsigned long flags;
    wq_entry->flags &= ~WQ_FLAG_EXCLUSIVE;
    spin_lock_irqsave(&wq_head->lock, flags);
    __add_wait_queue(wq_head, wq_entry);
    spin_unlock_irqrestore(&wq_head->lock, flags);
}

https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/wait.h#L80
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/wait.c#L17 69
Scheduling: Extra
Sleep and Wakeup
• Processes wait on different wait queues
• Wait queues
• Linked list of wait queue entries
• Data structures
• wait_queue_entry
• Initialize
• init_waitqueue_entry()
• Add tasks to Wait Queue
• add_wait_queue() → __add_wait_queue()
• Waiting for a resource or condition:
  • If WQ_FLAG_PRIORITY, then add at head
  • else, add at end
  • If WQ_FLAG_EXCLUSIVE, add at end

static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry){
    struct list_head *head = &wq_head->head;
    struct wait_queue_entry *wq;

    list_for_each_entry(wq, &wq_head->head, entry) {
        if (!(wq->flags & WQ_FLAG_PRIORITY))
            break;
        head = &wq->entry;
    }
    list_add(&wq_entry->entry, head);
}

void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry){
    …
    __add_wait_queue_entry_tail(wq_head, wq_entry);
    …
}
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/wait.c#L17
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/wait.h#L169 70
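The placement rules can be modeled in isolation: an exclusive waiter is appended at the tail, while any other waiter is inserted just after the existing block of WQ_FLAG_PRIORITY entries at the head. A small array-based sketch of that policy; the fixed-size arrays and function name are illustrative, standing in for the kernel's linked-list operations:

```c
#include <assert.h>
#include <string.h>

#define WQ_FLAG_EXCLUSIVE 0x01
#define WQ_FLAG_PRIORITY  0x10

/* Insert entry `id` with `flags` into q[0..*n), keeping fl[] in step.
 * Exclusive waiters go to the tail (add_wait_queue_exclusive());
 * everyone else is inserted right after the existing PRIORITY block
 * at the head (__add_wait_queue()). Model only, not kernel code. */
static void toy_enqueue(int *q, unsigned *fl, int *n, int id, unsigned flags)
{
    int pos;
    if (flags & WQ_FLAG_EXCLUSIVE) {
        pos = *n;                          /* tail */
    } else {
        pos = 0;                           /* skip the PRIORITY block */
        while (pos < *n && (fl[pos] & WQ_FLAG_PRIORITY))
            pos++;
    }
    memmove(q + pos + 1, q + pos, (size_t)(*n - pos) * sizeof *q);
    memmove(fl + pos + 1, fl + pos, (size_t)(*n - pos) * sizeof *fl);
    q[pos] = id;
    fl[pos] = flags;
    (*n)++;
}
```

The resulting order (priority entries first, exclusive entries last) is exactly what makes the wake-up loop on the next slides work: everyone before the exclusive block is woken, then waking stops after nr_exclusive exclusive entries.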
Scheduling: Extra
#define wake_up(x) __wake_up(x, TASK_NORMAL, 1, NULL)

Sleep and Wakeup
• During wakeup
• wake_up() → __wake_up() → __wake_up_common_lock() → __wake_up_common()

int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
              int nr_exclusive, void *key){
    return __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
}

static int __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int mode,
                                 int nr_exclusive, int wake_flags, void *key){
    …
    remaining = __wake_up_common(wq_head, mode, nr_exclusive, wake_flags, key);
    …
}

https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/wait.h#L219
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/wait.c#L124 71
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/wait.c#L99
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/wait.c#L73
Scheduling: Extra
Sleep and Wakeup
• During wakeup
• wake_up() → __wake_up() → __wake_up_common_lock() → __wake_up_common()
• When the condition or event the wait queue is waiting for occurs, waiters are woken in list order; every non-exclusive waiter that is reached is woken, but at most nr_exclusive exclusive waiters are woken. This satisfies the exclusive condition while avoiding a thundering herd.
• nr_exclusive signifies the number of exclusive waiters to wake for the event (hence the --nr_exclusive in the loop)
• i.e. the loop stops when
• (ret is +ve) AND (WQ_FLAG_EXCLUSIVE) AND (--nr_exclusive has reached 0)

static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode, int nr_exclusive, int wake_flags, void *key){
    …
    curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);
    if (&curr->entry == &wq_head->head)
        return nr_exclusive;
    list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
        unsigned flags = curr->flags;
        int ret;
        ret = curr->func(curr, mode, wake_flags, key);
        if (ret < 0)
            break;
        if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
            break;
    }
    return nr_exclusive;
}
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/wait.c#L73
72
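The --nr_exclusive logic is easiest to see in isolation: each successfully woken waiter with WQ_FLAG_EXCLUSIVE set consumes one unit of the budget, the loop stops when the budget hits zero, and non-exclusive waiters never consume it. A toy model, assuming (as in the ordering above) that exclusive entries sit after the others; the arrays and function name are illustrative:

```c
#include <assert.h>

#define WQ_FLAG_EXCLUSIVE 0x01

/* Walk fl[0..n) in order, "waking" each entry; stop once nr_exclusive
 * exclusive waiters have been woken. Returns how many entries were
 * woken, mirroring the loop in __wake_up_common(). */
static int toy_wake_up_common(const unsigned *fl, int n, int nr_exclusive)
{
    int woken = 0;
    for (int i = 0; i < n; i++) {
        woken++;                                   /* curr->func() succeeded */
        if ((fl[i] & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
            break;
    }
    return woken;
}
```

With the default wake_up() budget of nr_exclusive = 1, a queue of two plain waiters followed by three exclusive waiters wakes the two plain waiters plus exactly one exclusive waiter.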
Scheduling: Extra
static long sock_wait_for_wmem(struct sock *sk, long timeo){
DEFINE_WAIT(wait);
sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
Sleep and Wakeup for (;;) {
if (!timeo)
break;
• Example with sock.c if (signal_pending(current))
break;
set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
if (refcount_read(&sk->sk_wmem_alloc) < READ_ONCE(sk->sk_sndbuf))
break;
if (READ_ONCE(sk->sk_shutdown) & SEND_SHUTDOWN)
break;
if (READ_ONCE(sk->sk_err))
break;
timeo = schedule_timeout(timeo);
}
finish_wait(sk_sleep(sk), &wait);
return timeo;
}

// Generic send/receive buffer handlers


struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
unsigned long data_len, int noblock,int *errcode, int max_page_order){

for (;;) {

timeo = sock_wait_for_wmem(sk, timeo);

}

https://elixir.bootlin.com/linux/v6.11-rc4/source/net/core/sock.c#L2813
https://elixir.bootlin.com/linux/v6.11-rc4/source/net/core/sock.c#L2756 73
Scheduling: Extra

static long sock_wait_for_wmem(struct sock *sk, long timeo){
    DEFINE_WAIT(wait);
    sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
    for (;;) {
        …
        prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
        …
    }
    finish_wait(sk_sleep(sk), &wait);
    return timeo;
}

Sleep and Wakeup
• Example with sock.c
• static long sock_wait_for_wmem(struct sock *sk, long timeo);
• sk: pointer to the sock structure representing the socket
• timeo: the maximum time to wait for send-buffer space to become available
• Check Buffer Availability:
• The function first checks whether the socket's send buffer already has room (sk_wmem_alloc below sk_sndbuf). If enough space is available, it can proceed without blocking.
• Blocking and Waiting:
• If the buffer does not have enough space, the function blocks the process and waits until space becomes available. This involves putting the process into the wait queue associated with the socket.
• Timeout Handling:
• The function also handles timeouts. If space does not become available within the specified timeout period, it returns with the timeout exhausted and the caller reports an error.
• Waking Up:
• Once there is enough space available in the buffer (for example, because data has been read from it by the receiving end), the blocked process is woken up and allowed to continue.

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/wait.c#L228
https://elixir.bootlin.com/linux/v6.11-rc4/source/net/core/sock.c#L2756 75
Scheduling: Extra
Sleep and Wakeup
• Example with sock.c

static long sock_wait_for_wmem(struct sock *sk, long timeo){
    DEFINE_WAIT(wait);
    sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
    for (;;) {
        …
        prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
        …
    }
    finish_wait(sk_sleep(sk), &wait);
    return timeo;
}

void prepare_to_wait(struct wait_queue_head *wq_head,
                     struct wait_queue_entry *wq_entry, int state){
    unsigned long flags;
    wq_entry->flags &= ~WQ_FLAG_EXCLUSIVE;
    spin_lock_irqsave(&wq_head->lock, flags);
    if (list_empty(&wq_entry->entry))
        __add_wait_queue(wq_head, wq_entry);
    set_current_state(state);
    spin_unlock_irqrestore(&wq_head->lock, flags);
}

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/wait.c#L228
https://elixir.bootlin.com/linux/v6.11-rc4/source/net/core/sock.c#L2756 76
Scheduling: Extra
Sleep and Wakeup
• Example with sock.c

static long sock_wait_for_wmem(struct sock *sk, long timeo){
    DEFINE_WAIT(wait);
    sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
    for (;;) {
        …
        prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
        …
    }
    finish_wait(sk_sleep(sk), &wait);
    return timeo;
}

void prepare_to_wait(struct wait_queue_head *wq_head,
                     struct wait_queue_entry *wq_entry, int state){
    unsigned long flags;
    wq_entry->flags &= ~WQ_FLAG_EXCLUSIVE;
    spin_lock_irqsave(&wq_head->lock, flags);
    if (list_empty(&wq_entry->entry))
        __add_wait_queue(wq_head, wq_entry);
    set_current_state(state);
    spin_unlock_irqrestore(&wq_head->lock, flags);
}

void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry){
    unsigned long flags;
    __set_current_state(TASK_RUNNING);
    if (!list_empty_careful(&wq_entry->entry)) {
        spin_lock_irqsave(&wq_head->lock, flags);
        list_del_init(&wq_entry->entry);
        spin_unlock_irqrestore(&wq_head->lock, flags);
    }
}
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/wait.c#L228
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/wait.c#L356 77
Scheduling: Extra
When is the scheduler called?
• On process exit
• Called by the function do_task_dead() (called by do_exit() ) when a process exits
• For scheduling the idle process
• Called by schedule_idle() (from do_idle() ) for scheduling the idle task
• Called from wait and wake up functions
• Too many to list, from too many drivers, file systems, other places
• On process preemption
• From preempt_schedule () and related functions
• Checks the need-to-reschedule flag set earlier after a kernel task finishes

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/core.c#L6537
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/core.c#L6636 78
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/core.c#L6724
Semaphore

79
Synchronization

How process synchronization is achieved?
• down() → __down() → __down_common() → ___down_common()

struct semaphore {
    raw_spinlock_t lock;
    unsigned int count;
    struct list_head wait_list;
};
extern void down(struct semaphore *sem);
extern void up(struct semaphore *sem);

void __sched down(struct semaphore *sem){
    unsigned long flags;
    might_sleep();
    raw_spin_lock_irqsave(&sem->lock, flags);
    if (likely(sem->count > 0))
        sem->count--;
    else
        __down(sem);
    raw_spin_unlock_irqrestore(&sem->lock, flags);
}

static inline int __sched __down_common(struct semaphore *sem, long state, long timeout){
    int ret;
    trace_contention_begin(sem, 0);
    ret = ___down_common(sem, state, timeout);
    trace_contention_end(sem, ret);
    return ret;
}

https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/semaphore.h#L15
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/locking/semaphore.c#L54 80
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/locking/semaphore.c#L252
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/locking/semaphore.c#L209
Synchronization

How process synchronization is achieved?
• down() → __down() → __down_common() → ___down_common() → schedule_timeout()

static inline int __sched ___down_common(struct semaphore *sem, long state, long timeout){
    struct semaphore_waiter waiter;

    list_add_tail(&waiter.list, &sem->wait_list);
    waiter.task = current;
    waiter.up = false;

    for (;;) {
        if (signal_pending_state(state, current))
            goto interrupted;
        if (unlikely(timeout <= 0))
            goto timed_out;
        __set_current_state(state);
        raw_spin_unlock_irq(&sem->lock);
        timeout = schedule_timeout(timeout);
        raw_spin_lock_irq(&sem->lock);
        if (waiter.up)
            return 0;
    }

 timed_out:
    list_del(&waiter.list);
    return -ETIME;

 interrupted:
    list_del(&waiter.list);
    return -EINTR;
}

https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/semaphore.h#L15
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/locking/semaphore.c#L209 81
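The down()/up() pair reduces to simple bookkeeping: take a unit if the count is positive, otherwise join the wait list and sleep; on up(), hand the semaphore directly to the first waiter instead of incrementing the count (the waiter->up = true handoff in __up()). A single-threaded model of that accounting, with no locking or real sleeping; the struct and function names are illustrative:

```c
#include <assert.h>

/* Single-threaded model of the kernel semaphore's bookkeeping. */
struct toy_sem {
    unsigned int count;
    int waiters;           /* stand-in for the wait_list length */
};

/* down(): take a unit if available, otherwise join the wait list.
 * Returns 1 if acquired immediately, 0 if the caller would sleep. */
static int toy_down(struct toy_sem *s)
{
    if (s->count > 0) { s->count--; return 1; }
    s->waiters++;
    return 0;
}

/* up(): if nobody is waiting, return the unit to the count; otherwise
 * wake one waiter and pass it the semaphore directly (count stays 0),
 * as __up() does via waiter->up = true. */
static void toy_up(struct toy_sem *s)
{
    if (s->waiters == 0)
        s->count++;
    else
        s->waiters--;      /* the woken waiter now owns the semaphore */
}
```

The direct handoff is deliberate: incrementing the count while waiters exist would let a newly arriving task "steal" the unit before the woken waiter runs.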
Synchronization
How process synchronization is achieved?
• down() → __down() → __down_common() → ___down_common() → schedule_timeout()

signed long __sched schedule_timeout(signed long timeout){
    switch (timeout){
    case MAX_SCHEDULE_TIMEOUT:
        schedule();
        goto out;
    default:
        if (timeout < 0) {
            printk(KERN_ERR "schedule_timeout: wrong timeout value %lx\n", timeout);
            dump_stack();
            __set_current_state(TASK_RUNNING);
            goto out;
        }
    }
    …
    schedule();
    …
out:
    return timeout < 0 ? 0 : timeout;

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/time/timer.c#L2542
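Only the return convention of schedule_timeout() is modelled below; this is a sketch with no real sleeping, and `sim_schedule_timeout`/`slept` are invented names. A negative left-over is clamped to 0, and MAX_SCHEDULE_TIMEOUT (LONG_MAX in the kernel) means "sleep until explicitly woken".

```c
#include <assert.h>
#include <limits.h>

#define SIM_MAX_SCHEDULE_TIMEOUT LONG_MAX  /* as in the kernel */

/* slept = jiffies consumed while the task was off the run queue. */
static long sim_schedule_timeout(long timeout, long slept)
{
    if (timeout == SIM_MAX_SCHEDULE_TIMEOUT)
        return timeout;               /* schedule(); goto out */
    if (timeout < 0)
        return 0;                     /* "wrong timeout value" error path */
    timeout -= slept;                 /* remaining budget after sleeping */
    return timeout < 0 ? 0 : timeout; /* out: never report a negative value */
}
```

This is why ___down_common() can safely test `timeout <= 0` on the next loop iteration: the callee never hands back a negative number.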
Synchronization

struct semaphore {
    raw_spinlock_t lock;
    unsigned int count;
    struct list_head wait_list;
};

extern void down(struct semaphore *sem);
extern void up(struct semaphore *sem);

How is process synchronization achieved?
• up() → __up() → wake_up_process() → try_to_wake_up()
• Read the inline documentation in
  https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/core.c#L3983

void __sched up(struct semaphore *sem){
    unsigned long flags;

    raw_spin_lock_irqsave(&sem->lock, flags);
    if (likely(list_empty(&sem->wait_list)))
        sem->count++;
    else
        __up(sem);
    raw_spin_unlock_irqrestore(&sem->lock, flags);
}

static noinline void __sched __up(struct semaphore *sem){
    struct semaphore_waiter *waiter =
        list_first_entry(&sem->wait_list, struct semaphore_waiter, list);

    list_del(&waiter->list);
    waiter->up = true;
    wake_up_process(waiter->task);
}

https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/semaphore.h#L15
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/locking/semaphore.c#L183
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/locking/semaphore.c#L272
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/core.c#L3983
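The hand-off in up() can likewise be sketched in userspace C (illustrative names, singly linked list instead of list_head): with no waiters the count is simply bumped; otherwise the semaphore is given directly to the first waiter, which is why __up() sets `waiter->up = true` instead of touching `count`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical analogue of up()/__up(); not a kernel API. */
struct sim_waiter {
    int up;                          /* set when the semaphore is handed over */
    struct sim_waiter *next;
};

struct sim_sem_up {
    unsigned int count;
    struct sim_waiter *wait_list;    /* head = first (oldest) waiter */
};

static void sim_up(struct sim_sem_up *sem)
{
    if (sem->wait_list == NULL) {
        sem->count++;                /* list_empty() fast path */
    } else {
        struct sim_waiter *w = sem->wait_list;
        sem->wait_list = w->next;    /* list_del() */
        w->up = 1;                   /* waiter->up = true */
        /* wake_up_process(waiter->task) would run here */
    }
}
```

Handing the semaphore over directly (rather than incrementing count and letting waiters race) gives FIFO fairness to sleeping tasks.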
Symbol Table: EXPORT_SYMBOL
Symbol Table
• Symbol table data structure

/* Symbol table format returned by kallsyms. */
typedef struct __ksymtab {
unsigned long value; /* Address of symbol */
const char *mod_name; /* Module containing symbol or
* "kernel" */
unsigned long mod_start;
unsigned long mod_end;
const char *sec_name; /* Section containing symbol */
unsigned long sec_start;
unsigned long sec_end;
const char *sym_name; /* Full symbol name, including any version */
unsigned long sym_start;
unsigned long sym_end;
} kdb_symtab_t;

https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/debug/kdb/kdb_private.h#L72
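A kallsyms-style consumer of such records resolves an address by range membership over `sym_start`/`sym_end`. The sketch below uses a trimmed-down struct and made-up addresses; `mini_symtab_t` and `mini_symtab_lookup` are not kernel names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Trimmed-down kdb_symtab_t: just the name and its address range. */
typedef struct {
    const char *sym_name;
    unsigned long sym_start;
    unsigned long sym_end;
} mini_symtab_t;

/* Resolve addr to the symbol whose [sym_start, sym_end) range contains it. */
static const char *mini_symtab_lookup(const mini_symtab_t *tab, int n,
                                      unsigned long addr)
{
    for (int i = 0; i < n; i++)
        if (addr >= tab[i].sym_start && addr < tab[i].sym_end)
            return tab[i].sym_name;
    return NULL;   /* address not covered by any known symbol */
}
```

This range lookup is what lets debuggers like kdb print `function+0xoffset` for an arbitrary kernel address.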
Symbol Table
EXPORT_SYMBOL(__init_waitqueue_head);

#define EXPORT_SYMBOL(sym) _EXPORT_SYMBOL(sym, "")

#define _EXPORT_SYMBOL(sym, license) __EXPORT_SYMBOL(sym, license, "")

#define __EXPORT_SYMBOL(sym, license, ns)			\
	extern typeof(sym) sym;					\
	__ADDRESSABLE(sym)					\
	asm(__stringify(___EXPORT_SYMBOL(sym, license, ns)))

https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/export.h#L68
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/export.h#L63
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/export.h#L55
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/export.h#L27
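The __stringify() used above is the classic two-level stringification pattern from include/linux/stringify.h: the outer layer forces the argument to be macro-expanded before the inner `#` operator turns it into a string literal for asm(). A minimal demonstration (the kernel uses a variadic `x...` form, a GNU extension; `DEMO_SYM` is a made-up macro standing in for a symbol name):

```c
#include <assert.h>
#include <string.h>

/* Two-level stringification, as in include/linux/stringify.h. */
#define __stringify_1(x) #x
#define __stringify(x)   __stringify_1(x)

/* DEMO_SYM is a hypothetical macro; the kernel would pass the expanded
 * ___EXPORT_SYMBOL(...) assembly text through the same two layers. */
#define DEMO_SYM exported_fn
```

With one layer the macro name itself would be stringified; the second layer is what yields the expanded text.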
Symbol Table

#define __EXPORT_SYMBOL(sym, license, ns)			\
	extern typeof(sym) sym;					\
	__ADDRESSABLE(sym)					\
	asm(__stringify(___EXPORT_SYMBOL(sym, license, ns)))

#define ___EXPORT_SYMBOL(sym, license, ns)		\
	.section ".export_symbol","a"		ASM_NL	\
	__export_symbol_##sym:			ASM_NL	\
		.asciz license			ASM_NL	\
		.asciz ns			ASM_NL	\
		__EXPORT_SYMBOL_REF(sym)	ASM_NL	\
	.previous

• .section ".export_symbol","a"
  • This assembly directive tells the assembler to switch to the ".export_symbol" section of the object file.
  • The "a" flag indicates that this section contains allocatable data, meaning it will be part of the final binary and will be loaded into memory.
  • The .export_symbol section is used to store information related to the exported symbols, including metadata like the license and namespace.
• __export_symbol_##sym:
  • This line defines a label for the exported symbol. The ## token concatenates __export_symbol_ with the name of the symbol being exported (sym). For example, if the symbol is my_function, this will create the label __export_symbol_my_function:.
  • This label allows the assembler to uniquely identify this block of data for the exported symbol in the ".export_symbol" section.
• ASM_NL
  • Expands to the assembler's statement separator (typically ';'), playing the role of a newline between the directives in the single string handed to asm().

https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/export.h#L68
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/export.h#L63
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/export.h#L55
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/export.h#L27
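The `##` operator at work in `__export_symbol_##sym` can be demonstrated in plain C: token pasting happens at preprocessing time and builds one unique identifier per exported symbol. Here `my_function` is a made-up example symbol, and the assembler label is modelled as a C variable.

```c
#include <assert.h>

/* ## pastes tokens at preprocessing time, exactly how
 * __export_symbol_##sym builds one unique label per exported symbol.
 * DECLARE_EXPORT_LABEL is a hypothetical demo macro. */
#define DECLARE_EXPORT_LABEL(sym) static int __export_symbol_##sym = 1;

DECLARE_EXPORT_LABEL(my_function) /* -> static int __export_symbol_my_function = 1; */
```

Because the identifier is derived from the symbol name, two different exports can never collide on the same label.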
Symbol Table

#define __EXPORT_SYMBOL_REF(sym)			\
	.balign 8				ASM_NL	\
	.quad sym

#define ___EXPORT_SYMBOL(sym, license, ns)		\
	.section ".export_symbol","a"		ASM_NL	\
	__export_symbol_##sym:			ASM_NL	\
		.asciz license			ASM_NL	\
		.asciz ns			ASM_NL	\
		__EXPORT_SYMBOL_REF(sym)	ASM_NL	\
	.previous

• .asciz license / .asciz ns
  • This places a null-terminated string (using the .asciz directive) into the ".export_symbol" section.
  • The string represents the license / namespace under which the symbol is being exported, e.g. EXPORT_SYMBOL_GPL().
• __EXPORT_SYMBOL_REF(sym)
  • This is another macro, typically responsible for placing the actual reference to the symbol into the ".export_symbol" section. This reference is usually the address of the symbol.
  • __EXPORT_SYMBOL_REF(sym) expands into something that places the symbol's address into the section, so the kernel knows where to find the actual symbol.
• .previous
  • The .previous directive tells the assembler to switch back to the section that was active before the ".export_symbol" section was entered.
  • This is important because it ensures that any subsequent assembly code goes into the correct section, usually the one that was being processed before the export-symbol information was added.

https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/export.h#L27
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/export.h#L18
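The per-symbol record that ___EXPORT_SYMBOL emits (license string, namespace string, 8-byte-aligned symbol address via `.balign 8` / `.quad sym`) can be modelled as a plain C struct array rather than a linker section. This is only a sketch of the layout; the field and variable names below are invented, and the address is kept as a function pointer instead of a raw `.quad`.

```c
#include <assert.h>
#include <string.h>

/* One record per exported symbol, mirroring what ___EXPORT_SYMBOL places
 * into the .export_symbol section. */
struct export_record {
    const char *license;   /* .asciz license ("" = EXPORT_SYMBOL, "GPL" = _GPL) */
    const char *ns;        /* .asciz ns (symbol namespace, often empty) */
    int (*addr)(void);     /* stands in for: .balign 8 ; .quad sym */
};

/* A made-up exported function for the demo table. */
static int exported_fn(void) { return 42; }

static const struct export_record demo_exports[] = {
    { "GPL", "", exported_fn },
};
```

At module-load time the kernel walks the real section just like code would walk this array: check the license, check the namespace, then resolve the reference to the symbol's address.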
