LINUX Kernel: Introduction To The Kernel
Chapter 3
Introduction to the Kernel
黃仁竑
Processes and Tasks
Processes
seen from outside: individual processes exist
independently
Tasks
seen from inside: only one operating system is
running
[Figure: process state diagram with the states Interrupt routine, System call, Return from system call, Waiting, and Ready, and with the Scheduler in System mode]
© 黃仁竑 / 中正資工
Process States
Running
Task is active and running in the non-privileged
user mode.
If an interrupt or system call occurs, it is
switched to the privileged system mode.
Interrupt routine
hardware signals an exception condition
clock generates signal every 10 ms
System call
software interrupt
Process States
Waiting
wait for an external event (e.g., I/O complete)
Return from system call
when system call or interrupt is complete
scheduler switches the process to ready state
Ready
competing for the processor
Important Data Structures
Task structure
task_struct in include/linux/sched.h
Also accessed by assembly code, so the order of the leading fields cannot be altered and no declarations may be added at the front
states
TASK_RUNNING (0): ready or running
TASK_INTERRUPTIBLE (1), TASK_UNINTERRUPTIBLE (2): waiting for certain events. TASK_UNINTERRUPTIBLE means the task cannot be woken up by signals.
TASK_ZOMBIE (3): process has terminated but still has its task structure
TASK_STOPPED(4): process has been halted
TASK_SWAPPING(5): not used.
Task Structure
struct task_struct {
/* these are hardcoded - don't touch */
volatile long state;
volatile indicates that this value can be altered by interrupt routines
long counter;
long priority;
counter holds the time in ticks that the process can still run before a mandatory scheduling action is carried out. counter is also used as the dynamic priority by the scheduler
priority holds the static priority of a process
Task Structure
unsigned long signal;
unsigned long blocked;
signal contains a bit mask of the signals received for the process. It is evaluated in the routine ret_from_sys_call(), which is called after every system call and after slow interrupts.
blocked contains a bit mask of the signals to be blocked
unsigned long flags;
flags contains the combination of the system status
flags
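The interplay of the signal and blocked bit masks can be sketched in user space. This is a minimal model, not kernel code: struct task_masks, post_signal() and deliverable() are hypothetical names, and the mapping of signal number sig to bit (sig - 1) is an assumption based on the action[sig-1] indexing used elsewhere in these slides.

```c
/* Simplified stand-in for the two bit-mask fields of task_struct. */
struct task_masks {
    unsigned long signal;   /* bit (sig - 1) set: signal sig is pending */
    unsigned long blocked;  /* bit (sig - 1) set: signal sig is blocked */
};

/* Mark signal number sig (1..32) as pending for the task. */
static void post_signal(struct task_masks *t, int sig)
{
    t->signal |= 1UL << (sig - 1);
}

/* Pending signals that are not blocked and can be handled now. */
static unsigned long deliverable(const struct task_masks *t)
{
    return t->signal & ~t->blocked;
}
```

A blocked signal stays recorded in signal; it only becomes deliverable once its bit is cleared in blocked.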
Task Structure
Process flags:
#define PF_ALIGNWARN 0x00000001 /* Print alignment warning msgs */
/* Not implemented yet, only for 486*/
#define PF_PTRACED 0x00000010 /* set if ptrace (0) has been called. */
#define PF_TRACESYS 0x00000020 /* tracing system calls */
#define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */
#define PF_SUPERPRIV 0x00000100 /* used super-user privileges */
#define PF_DUMPCORE 0x00000200 /* dumped core */
#define PF_SIGNALED 0x00000400 /* killed by a signal */
#define PF_STARTING 0x00000002 /* being created */
#define PF_EXITING 0x00000004 /* getting shut down */
#define PF_USEDFPU 0x00100000 /* Process used the FPU this
quantum (SMP only) */
#define PF_DTRACE 0x00200000 /* delayed trace (used on m68k) */
Task Structure
int errno;
int debugreg[8];
errno holds the error code of the last failed system call.
debugreg contains the 80x86’s debugging registers.
struct exec_domain *exec_domain;
which UNIX is emulated for each process
struct task_struct *next_task, *prev_task;
all processes are linked through these two pointers
init_task points to the start and end of this list
struct task_struct *next_run, *prev_run;
list of processes that apply for the processor
Task Structure
struct task_struct *p_opptr, *p_pptr, *p_cptr, *p_ysptr, *p_osptr;
pointers to the original parent, the parent, the youngest child, the younger sibling, and the older sibling, respectively
[Figure: process family tree. The parent's p_cptr points to its youngest child; each child's p_pptr points to the parent; p_osptr links a child to its older sibling and p_ysptr links it to its younger sibling.]
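The family pointers can be exercised with a small user-space model. This is a sketch under assumptions: struct task keeps only the relationship fields, and link_child() and count_children() are hypothetical helpers, not kernel functions.

```c
#include <stddef.h>

/* Reduced stand-in for task_struct with only the family pointers. */
struct task {
    int pid;
    struct task *p_pptr;   /* parent */
    struct task *p_cptr;   /* youngest child */
    struct task *p_ysptr;  /* younger sibling */
    struct task *p_osptr;  /* older sibling */
};

/* Make child the newest (youngest) child of parent. */
static void link_child(struct task *parent, struct task *child)
{
    child->p_pptr = parent;
    child->p_ysptr = NULL;
    child->p_osptr = parent->p_cptr;      /* previous youngest is now older */
    if (child->p_osptr)
        child->p_osptr->p_ysptr = child;
    parent->p_cptr = child;
}

/* Count the children by walking from the youngest to the oldest. */
static int count_children(const struct task *parent)
{
    int n = 0;
    for (const struct task *c = parent->p_cptr; c; c = c->p_osptr)
        n++;
    return n;
}
```

Starting at p_cptr and following p_osptr visits every child exactly once, newest first.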
Task Structure
struct mm_struct *mm;
memory management information
struct mm_struct {
int count; pgd_t * pgd;
unsigned long context;
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack, start_mmap;
unsigned long arg_start, arg_end, env_start, env_end;
unsigned long rss, total_vm, locked_vm;
unsigned long def_flags;
struct vm_area_struct * mmap;
struct vm_area_struct * mmap_avl;
struct semaphore mmap_sem;
};
Virtual Memory
Task Structure
unsigned long kernel_stack_page;
stack when a process is running in system mode
unsigned long saved_kernel_stack;
save the old stack pointer when running MS-DOS
emulator (vm86)
int pid, pgrp, session, leader;
process ID, process group ID, session ID, and session leader flag
unsigned short uid,euid,suid,fsuid;
unsigned short gid,egid,sgid,fsgid;
user ID, effective user ID, saved user ID, file system user ID
group ID, effective group ID, saved group ID, file system group ID
Task Structure
uid, euid, suid, gid, egid, sgid
Each process has a real user ID and group ID and
an effective user ID and group ID.
The real ID identifies the person using the system
The effective ID determines their access privileges.
execve() changes the effective user or group ID to the
owner or group of the executed file if the file has the
set-user-ID (suid) or set-group-ID (sgid) modes. The
real UID and GID are not affected. The effective user
ID and effective group ID of the new process image
are saved as the saved set-user-ID and saved set-group-ID
respectively, for use by setuid(3V).
Turn on suid: chmod u+s filename (chmod a+s also turns on sgid)
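The execve() rule for the effective user ID can be written as a one-line decision. This is only an illustration: euid_after_exec() is a hypothetical helper, and it ignores everything besides the set-user-ID bit (for example mount options).

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Effective UID of the new process image after execve():
 * the file owner if the set-user-ID bit is set, otherwise unchanged. */
static uid_t euid_after_exec(uid_t current_euid, uid_t file_owner,
                             mode_t file_mode)
{
    return (file_mode & S_ISUID) ? file_owner : current_euid;
}
```

For a root-owned file with mode 04755, a caller with euid 1000 would end up with euid 0, while its real UID stays 1000.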
Task Structure
uid and gid are inherited from the parent
euid, egid, fsuid, fsgid can be set at run time (e.g., to the owner of the executable file)
int groups[NGROUPS];
A process may be assigned to many groups
struct fs_struct *fs;
file system information
struct fs_struct {
int count; /* for future expansions */
unsigned short umask; /* access mode */
struct inode * root, * pwd; /* root dir and current dir */
};
Task Structure
struct files_struct *files;
open file information (file descriptors)
Task Structure
long utime, stime, cutime, cstime, start_time;
time spent in user mode, time spent in system mode, total time spent by child processes in user mode and in system mode, and the time at which the process was created, respectively.
unsigned long it_real_value, it_prof_value, it_virt_value;
unsigned long it_real_incr, it_prof_incr, it_virt_incr;
struct timer_list real_timer;
timer for the alarm system call (SIGALRM)
it_*_value: time in ticks until the timer is triggered; it_*_incr: value for re-initialization; real_timer: the real-time interval timer.
Task Structure
struct sem_undo *semundo;
semaphores that need to be released when the process terminates
struct sem_queue *semsleeping;
semaphore waiting queue
Task Structure
struct signal_struct *sig;
struct signal_struct {
int count;
struct sigaction action[32];
};
Signal handlers
Task Structure
unsigned long personality;
description of the characteristics of this version of
UNIX (see also exec_domain)
int dumpable:1;
whether a memory dump is to be executed
int did_exec:1;
is the process still running the old program (no execve, …)
struct desc_struct *ldt;
used by WINE, windows emulator
Task Structure
struct linux_binfmt *binfmt;
functions responsible for loading the program
struct thread_struct tss;
holds all the data on the current processor status at the time of the last transition from user mode to system mode; all registers are saved here.
struct thread_struct can be found in asm-i386/processor.h which, among other definitions, includes 8086-related information:
struct vm86_struct * vm86_info;
unsigned long screen_bitmap;
unsigned long v86flags, v86mask, v86mode;
Task Structure
unsigned long policy, rt_priority;
Scheduling policies: classic (SCHED_OTHER), real-time (SCHED_RR, SCHED_FIFO)
rt_priority: real-time priority
#ifdef __SMP__
int processor;
int last_processor;
int lock_depth;
#endif
When running on a multi-processor machine, the kernel needs to know on which processor the task is running, etc.
Process Table
struct task_struct init_task;
points to the start of the doubly linked task list
#define for_each_task(p) \
for (p = &init_task ; (p = p->next_task) != &init_task ; )
macro for visiting all processes
the first task is skipped (init_task)
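The loop shape of for_each_task can be tried out on a user-space copy of the circular list. In this sketch struct task is a reduced stand-in for task_struct, and add_task() and count_tasks() are hypothetical helpers added for illustration; only the macro body follows the slide.

```c
/* Reduced stand-in for task_struct with only the list pointers. */
struct task {
    int pid;
    struct task *next_task, *prev_task;
};

/* An empty circular list: init_task points to itself in both directions. */
static struct task init_task = {0, &init_task, &init_task};

/* Same shape as the kernel macro: start after init_task and stop
 * when the walk comes back to it. */
#define for_each_task(p) \
    for (p = &init_task; (p = p->next_task) != &init_task; )

/* Append t at the end of the list, just before init_task. */
static void add_task(struct task *t)
{
    t->next_task = &init_task;
    t->prev_task = init_task.prev_task;
    init_task.prev_task->next_task = t;
    init_task.prev_task = t;
}

/* Number of tasks on the list, not counting init_task. */
static int count_tasks(void)
{
    struct task *p;
    int n = 0;
    for_each_task(p)
        n++;
    return n;
}
```

Because the assignment p = p->next_task happens in the loop condition, init_task itself never reaches the loop body.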
Files and inodes
Two important structures: file, inode (linux/fs.h)
The file structure (process’s view)
struct file {
mode_t f_mode;
access mode when opened (RO, RW, WO)
loff_t f_pos;
position of the read/write pointer (64-bit)
Files and inodes
unsigned short f_count;
reference count (dup, dup2, fork)
struct file *f_next, *f_prev;
doubly linked list
global variable: struct file *first_file;
struct inode * f_inode;
actual description of the file
struct file_operations * f_op;
refers to a structure of function pointers for file operations, i.e., functions are not called directly.
Since LINUX supports many file systems, a Virtual File System (VFS) is implemented.
Files and inodes
struct inode {
kdev_t i_dev; /* which device the file is on */
unsigned long i_ino; /* position on the device */
umode_t i_mode;
nlink_t i_nlink;
uid_t i_uid; /* owner user id */
gid_t i_gid; /* owner group id */
off_t i_size; /* size in bytes */
time_t i_atime; /* time of last access */
time_t i_mtime; /* time of last modification */
time_t i_ctime; /* time of last modification to inode */
Memory Management
Macros
#define __get_free_page(priority) __get_free_pages((priority),0,0)
#define __get_dma_pages(priority, order) __get_free_pages((priority),(order),1)
extern unsigned long __get_free_pages(int priority, unsigned long gfporder, int dma);
defined in linux/mm.h, page size is 4KB
priority: GFP_BUFFER, GFP_ATOMIC, GFP_KERNEL, GFP_NOBUFFER, GFP_NFS (what to do if not enough pages are free)
order: number of pages to be reserved, as a power of 2 (2^order pages)
dma: whether the pages must be addressable by the DMA component
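Since the order argument is an exponent, a caller has to round a byte count up to the next power-of-two number of pages. The helper below is a hypothetical user-space sketch of that arithmetic; SKETCH_PAGE_SIZE and order_for_bytes() are names introduced here, with the 4KB page size taken from the slide.

```c
/* Page size from the slide (4KB); the name is local to this sketch. */
#define SKETCH_PAGE_SIZE 4096UL

/* Smallest order such that (SKETCH_PAGE_SIZE << order) >= bytes. */
static unsigned long order_for_bytes(unsigned long bytes)
{
    unsigned long order = 0;
    while ((SKETCH_PAGE_SIZE << order) < bytes)
        order++;
    return order;
}
```

For example, a 20000-byte buffer needs order 3, that is, 8 pages or 32KB, because 4 pages (16KB) are not enough.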
Memory Management
Functions
extern inline unsigned long get_free_page(int priority)
{
unsigned long page;
page = __get_free_page(priority);
if (page)
memset((void *) page, 0, PAGE_SIZE);
return page;
}
Will clear the page
Memory Management
Functions
void *kmalloc(size_t size, int priority)
void kfree(void *__ptr)
malloc() and free() in the kernel
Waiting Queues
Structures for waiting queues
struct wait_queue {
struct task_struct * task;
struct wait_queue * next;
};
include/linux/wait.h
wait until condition met
Functions (sched.h)
extern inline void add_wait_queue(struct wait_queue ** p, struct wait_queue * wait)
extern inline void remove_wait_queue(struct wait_queue ** p, struct wait_queue * wait)
Waiting Queues
Functions
void sleep_on(struct wait_queue ** p);
void interruptible_sleep_on(struct wait_queue ** p);
void wake_up(struct wait_queue ** p);
void wake_up_interruptible(struct wait_queue ** p);
kernel/sched.c
sleep_on sets the process state to TASK_UNINTERRUPTIBLE; interruptible_sleep_on sets it to TASK_INTERRUPTIBLE
wake_up sets the process state to TASK_RUNNING
Semaphores
Structure for semaphores
struct semaphore {
int count;
int waiting;
struct wait_queue * wait;
};
asm-i386/semaphore.h
Functions
extern inline void down(struct semaphore * sem)
extern inline void up(struct semaphore * sem)
System Time and Timers
In unit of ticks (10 ms)
The global variable jiffies denotes the time in ticks since the system was booted
Structure for timer (old)
struct timer_struct {
unsigned long expires;
void (*fn)(void);
};
extern struct timer_struct timer_table[32];
extern unsigned long timer_active; /* which entry is valid? */
System Time and Timers
Structure for timer (new)
struct timer_list {
struct timer_list *next;
struct timer_list *prev;
unsigned long expires;
unsigned long data; /* arguments */
void (*function)(unsigned long);
};
extern void add_timer(struct timer_list * timer);
extern int del_timer(struct timer_list * timer);
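How a doubly linked timer_list might be used can be modeled in user space. This is a simplified sketch, not the kernel implementation: it assumes the list is kept sorted by expires, leaves out all interrupt locking, and run_timer_list() takes the current time as a parameter instead of reading jiffies. The record() callback and fired_sum counter exist only for demonstration.

```c
#include <stddef.h>

struct timer_list {
    struct timer_list *next;
    struct timer_list *prev;
    unsigned long expires;            /* absolute expiry time in ticks */
    unsigned long data;               /* argument passed to function */
    void (*function)(unsigned long);
};

/* Head of a circular list kept sorted by ascending expires. */
static struct timer_list timer_head = { &timer_head, &timer_head, ~0UL, 0, NULL };

/* Insert timer before the first entry that expires later. */
static void add_timer(struct timer_list *timer)
{
    struct timer_list *p = timer_head.next;
    while (p != &timer_head && p->expires <= timer->expires)
        p = p->next;
    timer->next = p;
    timer->prev = p->prev;
    p->prev->next = timer;
    p->prev = timer;
}

/* Unlink timer; returns 1 if it was queued, 0 otherwise. */
static int del_timer(struct timer_list *timer)
{
    if (!timer->next)
        return 0;
    timer->prev->next = timer->next;
    timer->next->prev = timer->prev;
    timer->next = timer->prev = NULL;
    return 1;
}

/* Run every timer whose expiry time has been reached. */
static void run_timer_list(unsigned long now)
{
    struct timer_list *t;
    while ((t = timer_head.next) != &timer_head && t->expires <= now) {
        del_timer(t);
        t->function(t->data);
    }
}

/* Demonstration callback: accumulates the data argument. */
static unsigned long fired_sum;
static void record(unsigned long data) { fired_sum += data; }
```

Keeping the list sorted means run_timer_list() only ever has to look at the head of the list to decide whether any timer is due.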
Process Management
Signal
Interrupt
Booting
Timer
Scheduler
Signal
Signals ()
SIGHUP 1 hangup
SIGINT 2 interrupt
SIGQUIT 3 quit
SIGILL 4 illegal instruction
SIGTRAP 5 trace trap
SIGABRT 6 abort (generated by abort(3) routine)
SIGIOT 6 Input/Output Trap (obsolete)
SIGBUS 7 bus error
SIGFPE 8 arithmetic exception
SIGKILL 9 kill (cannot be caught, blocked, or ignored)
SIGUSR1 10 user-defined signal 1
Signal
SIGSEGV 11 segmentation violation
SIGUSR2 12 user-defined signal 2
SIGPIPE 13 write on a pipe or other socket with no one to read it
SIGALRM 14 alarm clock
SIGTERM 15 software termination signal
SIGSTKFLT 16
SIGCHLD 17 child status has changed
SIGCONT 18 continue after stop
SIGSTOP 19 stop (cannot be caught, blocked, or ignored)
SIGTSTP 20 stop signal generated from keyboard
SIGTTIN 21 background read attempted from control terminal
Signal
SIGTTOU 22 background write attempted to control terminal
SIGURG 23 urgent condition present on socket
SIGXCPU 24 cpu time limit exceeded (see getrlimit(2))
SIGXFSZ 25 file size limit exceeded (see getrlimit(2))
SIGVTALRM 26 virtual time alarm (see getitimer(2))
SIGPROF 27 profiling timer alarm (see getitimer(2))
SIGWINCH 28 window changed (see termio(4) and win(4S))
SIGIO 29 I/O is possible on a descriptor (see fcntl(2V))
SIGPOLL 29 SIGIO
SIGPWR 30 Power Failure (for UPS)
SIGUNUSED 31
Signal System Calls
Important system calls
kill(int pid, int sig)
sends the signal sig to a process or a group of processes
If pid is greater than zero, the signal is sent to the process with the PID pid.
If pid is zero, the signal is sent to the process group of the current process.
If pid is -1, the signal is sent to all processes, except the system processes and the current process.
If pid is less than -1, the signal is sent to all processes of the process group -pid.
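The four cases for the pid argument can be summarized as a small classifier. This is an illustrative sketch: enum kill_target and classify_kill_pid() are hypothetical names, not part of the kernel or the C library.

```c
#include <sys/types.h>

/* Who receives the signal, depending on the pid argument of kill(). */
enum kill_target {
    KILL_ONE_PROCESS,    /* pid > 0: the process with that PID */
    KILL_OWN_GROUP,      /* pid == 0: process group of the caller */
    KILL_ALL_PROCESSES,  /* pid == -1: all but system processes and caller */
    KILL_PROCESS_GROUP   /* pid < -1: the process group -pid */
};

static enum kill_target classify_kill_pid(pid_t pid)
{
    if (pid > 0)
        return KILL_ONE_PROCESS;
    if (pid == 0)
        return KILL_OWN_GROUP;
    if (pid == -1)
        return KILL_ALL_PROCESSES;
    return KILL_PROCESS_GROUP;
}
```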
Signal System Calls
Important system calls
kill(int pid, int sig)
The real or effective user ID of the sending process must match the real or saved set-user-ID of the receiving process, unless the effective user ID of the sending process is the super-user.
The single exception is the signal SIGCONT, which only requires that the sending and receiving processes belong to the same session.
Errors:
– EINVAL: invalid sig
– ESRCH: process or process group does not exist
– EPERM: no privileges
Signal System Calls
Important system calls
kill(int pid, int sig)
Implementation
– linux/kernel/exit.c
– sys_kill() -> send_sig(), kill_pg(), kill_proc() -> generate()
– see also force_sig(), kill_sl()
– also called from ret_from_sys_call() -> do_signal()->send_sig()
->handle_signal() (signal.c, 223)
->setup_frame() (160)
->regs->eip = sa->sa_handler (213)
sys_kill
Linux/kernel/exit.c, line 318-339
322-323: If pid is zero, the signal is sent to the process group of the current process.
324-334: If pid is -1, the signal is sent to all processes, except the system processes (PID=0 or 1) and the current process. The “for_each_task” macro is defined in include/linux/sched.h, line 491. If count is zero, return error code ESRCH.
335-336: If pid is less than -1, the signal is sent to all processes of the process group -pid.
338: If pid is greater than zero, the signal is sent to the process with the PID pid.
kill_pg
Linux/kernel/exit.c, line 258-275.
264-265: sig must be in [1..32], pgrp (process group id) must be greater than zero
266-273: for each process, if its process group id is pgrp, send signal sig to it (send_sig). On success, send_sig returns zero.
274: if found=0, no process has been found and error ESRCH is returned; otherwise return zero.
kill_proc
Linux/kernel/exit.c, line 301-312
305-306: sig must be in [1..32].
307-310: if a process with pid is found, send signal sig to it (send_sig)
311: if no process has been found, return error ESRCH
send_sig
Linux/kernel/exit.c, line 73-101
75-76: p cannot be null and sig must be less than or equal to 32
77: priv is the privilege (0 for a normal process, 1 for super user); SIGCONT may be sent to any process that belongs to the same session
78-79: The real or effective user ID of the sending process must match the real or saved set-user-ID of the receiving process, unless the effective user ID of the sending process is the super-user.
80: super user?
81: If none of the above conditions is true, return an error
send_sig
82-83: if sig=0, do nothing
84-88: if sig in the task struct is null (process in zombie state), do nothing
89-95: if sig is SIGKILL or SIGCONT and the process is in state TASK_STOPPED, wake up the process and reset the SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU signals.
96-97: if sig is SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU, reset SIGCONT.
99: actually generate the signal
generate
Linux/kernel/exit.c, line 29-51
31: set up signal mask
32: action of the signal, sa=p->sig->action[sig-1]
39: if the signal is not blocked and the process is not traced
41: and if the handler of the signal is SIG_IGN (to be ignored) and the signal is not from a state change of a child process
42: then return immediately.
44-46: if the handler is SIG_DFL (default action) and the signal is SIGCONT, SIGCHLD, SIGWINCH, or SIGURG, then return immediately. (wake up has already been done for SIGCONT)
generate
48: finally, set the signal
49-50: if the signal-receiving process is interruptible and the signal is not blocked, wake up the process.
force_sig
Linux/kernel/exit.c, line 57-70
force to send a signal to a process (cannot be ignored)
60: if the process is not in zombie state
61-62: set the signal and get the signal action struct
63: really set the signal
64: the signal cannot be blocked, so clear the bit in
p->blocked
65-66: if the handler is SIG_IGN, reset it to SIG_DFL
67-68: wake up the process if it is interruptible
kill_sl
Linux/kernel/exit.c, line 282-299
sends a signal to the session leader
288-289: sig must be in [1..32]. The session id must be greater than zero
290-297: for each process, check whether its session id equals sess and whether it is the session leader; if so, send the signal to the session leader (send_sig)
298: return error if no process is found
Signal System Calls
Important system calls
sigaction(int sig, struct sigaction *act, struct sigaction *oact)
examine and change signal action (replaces signal())
act: new action, oact: old action (returned)
struct sigaction {
__sighandler_t sa_handler; /* SIG_DFL, SIG_IGN, or … */
sigset_t sa_mask; /* signals to be blocked during execution of handler */
unsigned long sa_flags; /* SA_ONSTACK: on sig stack
SA_INTERRUPT: do not restart system call on signal return
SA_RESETHAND: reset handler to SIG_DFL when signal taken
SA_NOCLDSTOP: don’t send SIGCHLD on child stop */
void (*sa_restorer)(void);
}
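From user space, sigaction() is typically used as in the following sketch. It uses the POSIX interface rather than the kernel-internal structure shown above; on_usr1(), got_usr1 and install_usr1_handler() are hypothetical names chosen for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler; sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t got_usr1 = 0;

static void on_usr1(int sig)
{
    (void)sig;
    got_usr1 = 1;
}

/* Install on_usr1 for SIGUSR1; the previous action is returned in *old. */
static int install_usr1_handler(struct sigaction *old)
{
    struct sigaction act;

    memset(&act, 0, sizeof act);
    act.sa_handler = on_usr1;
    sigemptyset(&act.sa_mask);   /* block nothing extra while the handler runs */
    act.sa_flags = 0;
    return sigaction(SIGUSR1, &act, old);
}
```

After install_usr1_handler(&old) succeeds, raise(SIGUSR1) runs on_usr1() and sets got_usr1; passing the saved old action back to sigaction() restores the previous behavior.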
Signal System Calls
Important system calls
sigprocmask(int how, sigset_t *set, sigset_t *oset)
examine and change the calling process’s blocked signals
how
– SIG_BLOCK: add the signals in set to the blocked signals
– SIG_UNBLOCK: remove the signals in set from the blocked signals
– SIG_SETMASK: replace the blocked signals with set
the previous mask is returned in oset
SIGKILL, SIGSTOP cannot be blocked
behavior is undefined for SIGFPE, SIGILL, SIGSEGV if they are blocked when they are generated
Signal System Calls
Important system calls
sigpending(sigset_t *set)
stores the set of signals that are blocked from delivery and pending for the calling process in set.
ssetmask(int mask), sgetmask() set/get the blocked signals of the current process; made obsolete by sigprocmask().
sigsuspend(int restart, unsigned long oldmask, unsigned long newmask)
replaces the process’s signal mask with newmask and then suspends the process until delivery of a signal.
sys_sigaction
Linux/kernel/signal.c, line 150-182
155-156: check signal number [1..32]
157: get the old sigaction (p)
158-170: if action (new setting) is not null, check if it can
be read. If yes, copy the content of action to new_sa
171-176: if oldaction is not null, stores the old sigaction
(p) to oldaction
177-180: replace sigaction with new_sa
sys_sigprocmask
Linux/kernel/signal.c, line 29-60
34-52: if set (new mask) is not null, process set depending on how: SIG_BLOCK adds the signals in set to the blocked signals, SIG_UNBLOCK removes the signals in set from the blocked signals, SIG_SETMASK replaces the blocked signals with set.
53-58: if oset is not null, copy old_set (current->blocked) to oset.
sys_sigpending
Linux/kernel/signal.c, line 80-88
stores signals pending but blocked into set.
84: check if set can be written
85-86: if yes, copy the signals that are both blocked and pending to set.
Interrupt
To allow the hardware to communicate with the operating system
Source files
arch/i386/kernel/irq.c
include/asm-i386/irq.h
Interrupt handlers
slow, fast, bad (irq.c, lines 142-172)
build the interrupt handler first
line 114-136, irq.c
line 200-243, irq.h (macros)
Interrupt
Interrupt number
First set : 0-7
Second set: 8-15
0 for timer
On SMP board (486 and above)
irq13 for Interprocessor interrupts
irq16 for SMP reschedule
On a 386
irq13 for SIGFPE (unreliable)
no irq16
Interrupt
Slow interrupts (include/asm/irq.h, line 205-222)
206: build symbol table IRQ#_interrupt (see irq.c, 142)
208: SAVE_ALL, save all registers
209: ENTER_KERNEL, synchronizes the processors’ access to the kernel on an SMP board
210: ACK_FIRST (or SECOND), acknowledges the interrupt to the interrupt controller
211: increase intr_count (number of nested interrupts)
213-217: call do_IRQ(int irq, struct pt_regs *regs) (see arch/i386/kernel/irq.c, line 343-364)
219: UNBLK_FIRST (or SECOND), informs the interrupt controller that interrupts of this type can be accepted again
Do_IRQ()
Data Structure
struct irqaction {
void (*handler)(int, void *, struct pt_regs *);
unsigned long flags;
unsigned long mask;
const char *name;
void *dev_id;
struct irqaction *next;
};
Interrupt
Slow interrupts (irq.h, line 205-222)
220: decrease intr_count
221: increase syscall_count
222: jump to routine ret_from_sys_call (never returned)
Fast Interrupts (irq.h, line 224-236)
use SAVE_MOST and RESTORE_MOST instead of SAVE_ALL and do not call ret_from_sys_call
229-230: call do_fast_IRQ(int irq) (see irq.c, line 371-393)
Bad interrupts (irq.h, line 237-243)
Simply acks the interrupt (not installed)
Interrupt Flow
IDT[] -> interrupt[] (or fast_interrupt[], bad_interrupt[])
IRQi_interrupt (or fast or bad) -> do_IRQ() -> irqaction[]
irqaction[i]->handler -> jump to ret_from_sys_call
jump to handle_bottom_half (if bh_mask & bh_active)
do_bottom_half -> bh_base[] -> bh_base[i]
Interrupt-Related Code
request_irq() -> setup_x86_irq() (or called from an init function)
setup_x86_irq does two things:
First, it points the IDT[] entry at interrupt[]
Second, it places the corresponding action in irqaction[]. If the irq is shared, irqaction[] holds a list
interrupt[], fast_interrupt[], and bad_interrupt[] are built by the BUILDIRQ macro. All of them are assembly code. In the assembly code, interrupt calls do_IRQ and fast_interrupt calls do_fast_IRQ, while bad_interrupt calls neither. In addition, interrupt jumps to ret_from_sys_call after calling do_IRQ (fast_interrupt does not).
Interrupt-Related Code
do_IRQ executes, one by one, the handler of every action recorded in the corresponding irqaction[] entry
On the jump to ret_from_sys_call, the code checks whether it needs to jump to handle_bottom_half (bh_mask & bh_active). The assembly code of handle_bottom_half calls do_bottom_half, and do_bottom_half invokes the functions stored in bh_base[].
Interrupt-Related Code
bh_base[] is set up with init_bh(), just as an irq is set up with request_irq()
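Bottom-Half Data Structures
bh_mask: a bit set to 1 means a bottom half routine has been installed
bh_active: a bit set to 1 means the interrupt has occurred and its fast part has been handled; the bottom half part is waiting to run
bh_mask_count: counts how many times this bottom half has been disabled; 0 means there is no nested disable (i.e., it is enabled)
bh_base: the bottom half routines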
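The interaction of bh_mask, bh_active and bh_base can be modeled in user space. This is a simplified sketch, not the kernel source: it assumes one bit per bottom half in both masks, omits bh_mask_count and all locking, and the timer_bh() routine here is only a counter for demonstration.

```c
/* Bit i set: bottom half i is waiting to run. */
static unsigned long bh_active;
/* Bit i set: bottom half i is installed and enabled. */
static unsigned long bh_mask;
/* The bottom half routines, indexed by bottom half number. */
static void (*bh_base[32])(void);

/* Install a routine for bottom half nr and enable it. */
static void init_bh(int nr, void (*routine)(void))
{
    bh_base[nr] = routine;
    bh_mask |= 1UL << nr;
}

/* Called from the fast part of an interrupt: request bottom half nr. */
static void mark_bh(int nr)
{
    bh_active |= 1UL << nr;
}

/* Run every bottom half that is both requested and enabled. */
static void do_bottom_half(void)
{
    unsigned long active = bh_active & bh_mask;

    bh_active &= ~active;   /* these requests are now being served */
    for (int i = 0; i < 32; i++)
        if (active & (1UL << i))
            bh_base[i]();
}

/* Demonstration routine standing in for the timer bottom half. */
static int timer_bh_runs;
static void timer_bh(void) { timer_bh_runs++; }
```

A request for a bottom half that has not been installed stays recorded in bh_active but is never run, which mirrors the bh_mask & bh_active test made before jumping to handle_bottom_half.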
Example
Interrupt part:
start_kernel() -> time_init() -> setup_x86_irq(0, &irq0) -> set_intr_gate()
irq0->action=timer_interrupt
Hence IDT[0] -> interrupt[0] -> do_IRQ -> timer_interrupt() -> do_timer()
Example
Bottom half:
start_kernel() -> sched_init() -> init_bh(TIMER_BH, timer_bh)
So after do_timer has been called, control jumps to ret_from_sys_call -> handle_bottom_half -> do_bottom_half -> bh_base[0] -> timer_bh
Device Driver Example
Using the 3Com 3C509 network card as an example
Source is in drivers/net/3c509.c
When the network card is opened (el3_open(), line 347)
request_irq(dev->irq, &el3_interrupt, …) 356
When an interrupt occurs
el3_interrupt(…) 515
mark_bh(NET_BH) 548
init_IRQ()
arch/i386/kernel/irq.c
536: After the system starts, the function void init_IRQ(void) initializes the IRQs.
545~547: outb_p and outb both output one byte to a port.
548~549: This for loop uses set_intr_gate to install the bad_interrupt array; for set_intr_gate, see system.h, lines 235-247. Initially the entries point to bad_interrupt[], which means that no interrupt handler has been installed yet. request_irq() later redirects an entry to interrupt[] or fast_interrupt[], depending on the requested flag.
555~556: request_region() is a function in apricot.c and a macro in resource.c; it is not discussed here.
557~558: setup_x86_irq() is used to set up the Interrupt Descriptor Table (IDT)
setup_x86_irq()
395: setup_x86_irq() begins.
401: p = irq_action + irq; irq_action, defined at line 219, is the start of an array of 16 struct pointers initialized to NULL; adding irq selects which of the IRQs 0~15 is meant.
402~417: This code decides whether the IRQ can be shared. Fast and bad interrupts can never be shared; only slow interrupts may be shared. Sharing is discussed in detail in Chapter 7 and is not covered here.
426~432: If the IRQ cannot be shared, it is installed in fast_interrupt[] when it is a fast interrupt and in interrupt[] otherwise.
Request and Free IRQ
int request_irq()
437~467: When a device asks the system for an IRQ, request_irq() is called. This function installs the device's handler for the IRQ number that the device requested.
void free_irq()
469~495: free_irq() is the opposite of request_irq(): when a device is removed, free_irq() is called to release the IRQ.
Boot
Boot process
BIOS
reads the first sector of the boot disk (floppy, hard disk,
…, according to the BIOS parameter setting)
loads the boot sector (512 bytes), which contains program code for loading the operating system kernel (e.g., Linux Loader, LILO), to 0x7C00 (arch/i386/boot/bootsect.s, 35) in real mode
boot sector ends with 0xAA55
Boot disk
Floppy: the first sector
Hard disk: the first sector is the master boot record (MBR)
Boot Sector and MBR
Boot sector (floppy) layout:
0x000 JMP 0x03E
0x003 Disk parameters
0x03E Program code loading the OS kernel
0x1FE 0xAA55
Extended partition
If more than 4 partitions are needed
The first sector of extended partition is same as MBR
The first partition entry is for the first logical drive
The second partition entry points to the next logical
drive (MBR)
The first sector of each primary or extended
partition contains a boot sector
Extended Partition MBR
MBR for extended partition
Structure of a Partition Entry
Bytes  Field                   Description
1      Boot                    Boot flag: 0 = not active, 0x80 = active
1      HD                      Begin: head number
2      SEC CYL                 Begin: sector and cylinder number of boot sector
1      SYS                     System code: 0x83 Linux, 0x82 swap, 0x05 extended
1      HD                      End: head number
2      SEC CYL                 End: sector and cylinder number of last sector
4      low byte … high byte    Relative sector number of start sector
4      low byte … high byte    Number of sectors in the partition
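The 16-byte partition entry can be decoded from a raw sector buffer as sketched below. The field names and the helper functions are hypothetical; the last field (number of sectors) follows the standard PC partition table layout, since that row is damaged in the slide text. Multi-byte fields are stored low byte first.

```c
#include <stdint.h>

/* Decoded form of one 16-byte partition table entry. */
struct partition_entry {
    uint8_t  boot_flag;         /* 0x00 = not active, 0x80 = active */
    uint8_t  begin_head;
    uint8_t  begin_sec_cyl[2];  /* packed sector and cylinder */
    uint8_t  system_code;       /* 0x83 Linux, 0x82 swap, 0x05 extended */
    uint8_t  end_head;
    uint8_t  end_sec_cyl[2];    /* packed sector and cylinder */
    uint32_t start_sector;      /* relative sector number of start sector */
    uint32_t num_sectors;       /* number of sectors in the partition */
};

/* Read a 32-bit little-endian value (low byte first). */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode one entry from its 16 raw bytes. */
static struct partition_entry parse_partition_entry(const uint8_t raw[16])
{
    struct partition_entry e;

    e.boot_flag        = raw[0];
    e.begin_head       = raw[1];
    e.begin_sec_cyl[0] = raw[2];
    e.begin_sec_cyl[1] = raw[3];
    e.system_code      = raw[4];
    e.end_head         = raw[5];
    e.end_sec_cyl[0]   = raw[6];
    e.end_sec_cyl[1]   = raw[7];
    e.start_sector     = le32(raw + 8);
    e.num_sectors      = le32(raw + 12);
    return e;
}
```

Decoding byte by byte avoids relying on struct packing or on the byte order of the machine that runs the code.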
Active Partition
Booting is carried out from the active
partition which is determined by the boot flag
Operations of MBR
determine active partition
load the boot sector of the active partition
jump into the boot sector at offset 0
Boot Process
Compressed Kernel size
include/linux/config.h, DEF_SYSSIZE = 0x7F00 clicks = 508 KB (1 click = 16 bytes)
zImage is less than this size
The zImage starts with the code from arch/i386/boot/bootsect.s; it is loaded to 0x7C00 first, then moved to 0x90000, where execution continues.
setup.s is then loaded to 0x90200 and the kernel image is loaded to 0x10000 (64KB)
setup.s moves the kernel from 0x10000 to 0x1000 (4KB) to save memory, then enters protected mode and jumps to 0x1000 (line 520-536)
bootsect.s
Line 59-69
Moves code from 0x7C00 (BOOTSEG) to 0x90000 (INITSEG)
64-65: set si, di to zero
rep: repeat 68
68: move word by word until cx=0 (initialized to 256)
66: cld clears the DF flag in EFLAGS to 0, which makes the move go upward (the addresses increase during the data movement)
Boot Process
Uncompress Kernel
The start point is at arch/i386/kernel/head.s
It initializes the system and then calls
start_kernel
So the system then runs from start_kernel()
Booting the System
LILO loads the Linux kernel into memory
starts from “start:” in arch/i386/boot/setup.s
setup.s is responsible for initializing the hardware,
asking the bios for memory/disk/other parameters,
and putting them in memory 0x90000-0x901FF
520-521: switch to protected mode
534-536: jmp 0x1000, KERNEL_CS
jmpi 0x100000, KERNEL_CS for big kernels
Continues from startup_32 in arch/i386/kernel/head.s
Booting the System
More sections of the hardware are initialized (paging table, co-processor, interrupt descriptor table (idt), stack, environment, …)
219: calls start_kernel() in init/main.c
start_kernel(): all areas of the kernel are initialized and process 1 is created
794-852: more initializations
858: creates process 1 (kernel_thread(init, NULL,0))
– process 0 is an idle process; it does nothing and runs when no other process needs the CPU
– process 1 calls init() and starts some daemons
868: process 0 enters an infinite idle loop
Booting the System
Init() in init/main.c, lines 919-1020
927: bdflush is responsible for synchronizing the buffer cache contents with the file system
929: kswapd is the background pageout daemon (swapping)
937: setup initializes the file systems and mounts the root file system
986-991: connects to the console and opens file descriptors 0, 1, 2 (console)
993-997: tries to execute one of the programs /etc/init, /bin/init, /sbin/init.
999-1003: if none of the three programs exists, executes /etc/rc
Booting the System
Init() in init/main.c, lines 919-1020
1005-1018: enters an infinite loop in which a shell is
started for users to login on the console.
Setitimer System Call
int setitimer(int which, struct itimerval *value, struct itimerval *ovalue)
which:
ITIMER_REAL: Decrements in real time. A SIGALRM signal is delivered when this timer expires.
ITIMER_VIRTUAL: Decrements in process virtual time. It runs only when the process is executing (not including system time). A SIGVTALRM signal is delivered when this timer expires.
ITIMER_PROF: Decrements both in process virtual time and when the system is running on behalf of the process. A SIGPROF signal is delivered when this timer expires. It is designed for profiling the execution of interpreted programs.
The itimerval struct has two fields: it_interval and it_value. If it_value is non-zero, it indicates the time to the next timer expiration. If it_interval is non-zero, it specifies a value to be used in reloading it_value when the timer expires. Setting it_value to zero disables a timer. Setting it_interval to zero causes a timer to be disabled after its next expiration.
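Filling in an itimerval and arming ITIMER_REAL from user space might look like the sketch below. ms_to_timeval() and arm_real_timer_ms() are hypothetical helpers; only setitimer(), struct itimerval and ITIMER_REAL come from the system headers. A handler for SIGALRM should be installed before arming a non-zero timer, because the default action of SIGALRM terminates the process.

```c
#define _XOPEN_SOURCE 700
#include <stddef.h>
#include <sys/time.h>

/* Convert a non-negative number of milliseconds into a struct timeval. */
static struct timeval ms_to_timeval(long ms)
{
    struct timeval tv;

    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;
    return tv;
}

/* Arm ITIMER_REAL: first expiry after first_ms, then every interval_ms.
 * first_ms == 0 disables the timer; interval_ms == 0 makes it one-shot. */
static int arm_real_timer_ms(long first_ms, long interval_ms)
{
    struct itimerval value;

    value.it_value = ms_to_timeval(first_ms);
    value.it_interval = ms_to_timeval(interval_ms);
    return setitimer(ITIMER_REAL, &value, NULL);
}
```

Calling arm_real_timer_ms(0, 0) disables the timer, which matches the rule that a zero it_value turns a timer off.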
Related Codes
ITIMER_REAL
Data structure: timer_head
run_timer_list()
it_real_fn() (itimer.c, 98, sched.h, 297)
ITIMER_VIRTUAL
do_it_virt() (sched.c, 943)
ITIMER_PROF
do_it_prof() (sched.c, 956)
sys_setitimer() -> _setitimer() -> add_timer()
itimer.c/115, sched.c/606
Timer Interrupt
Important global variables
jiffies
kernel/sched.c (96): unsigned long volatile jiffies=0;
ticks (10ms) since the system was started up
xtime
kernel/sched.c (47): volatile struct timeval xtime;
actual time
Timer interrupt
updates jiffies and marks the bottom half active
the bottom half is called later, after handling other interrupts
Timer Interrupt
do_timer (kernel/sched.c, 1077-1095)
1079: increase jiffies
1080: increase lost_ticks (ticks since the last call of the bottom half routine)
1081: mark the bottom half active (include/linux/interrupt.h)
1082-1083: increase lost_ticks_system if in kernel mode (ticks spent in kernel mode since the last call of the bottom half routine)
1084-1092: profile
1093-1094: mark timer queue handler active
Timer Interrupt
Bottom half routines of the timer interrupt
timer_bh (kernel/sched.c, lines 1070-1075)
1072: update the times (kernel/sched.c, lines 1054-1068)
– 1058: xchg reads the value of lost_ticks and resets it to zero atomically
– 1063: read lost_ticks_system and reset it
– 1064: calculate system load (lines 725-738)
– 1065: update the real time xtime (740-922, hw)
– 1066: update times of current process (977-1049)
1073, 1074: update system-wide timers (649-683)
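The atomic read-and-reset of lost_ticks can be illustrated in portable C11 with atomic_exchange, which stands in for the kernel's xchg here. This is a sketch under that assumption; update_times_model, total_ticks and total_system are hypothetical names.

```c
#include <stdatomic.h>

static atomic_ulong lost_ticks;         /* incremented by the timer interrupt */
static atomic_ulong lost_ticks_system;  /* ticks spent in kernel mode */
static unsigned long total_ticks;       /* stand-in for the time accounting */
static unsigned long total_system;

/* Model of the bottom half: claim all pending ticks in one atomic step,
 * so a tick arriving during processing is kept for the next run. */
static unsigned long update_times_model(void)
{
    unsigned long ticks = atomic_exchange(&lost_ticks, 0UL);

    if (ticks) {
        unsigned long system = atomic_exchange(&lost_ticks_system, 0UL);
        total_ticks += ticks;
        total_system += system;
    }
    return ticks;
}
```

A plain read followed by a separate store of zero could lose a tick if the interrupt fired between the two operations, which is why the exchange is done in one step.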
Timer Interrupt
update_process_times (977-1049)
981: user time = ticks - system time
983: decrease the remaining time quota of the current process
984-987: if the time quota is used up, a reschedule is needed
988-992: kernel statistics
994: update current process’s times (924-975)
929-930: update the process's user and system times
932-940: check whether the process has exceeded its CPU time limit (setrlimit sets limits on resource usage). If it exceeds the soft limit, SIGXCPU is sent. If it exceeds the hard limit, SIGKILL is sent to kill the process.
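The soft/hard limit check can be sketched as a pure function. This is a simplified model, not the kernel code: it assumes 100 ticks per second, and the enum and function names are hypothetical.

```c
#define HZ 100  /* assumed ticks per second */

enum limit_action { LIMIT_NONE, LIMIT_SIGXCPU, LIMIT_SIGKILL };

/* utime and stime are in ticks; the limits are in seconds.
 * Returns which signal, if any, the process should receive. */
static enum limit_action cpu_limit_check(unsigned long utime,
                                         unsigned long stime,
                                         unsigned long soft_secs,
                                         unsigned long hard_secs)
{
    unsigned long psecs = (utime + stime) / HZ;  /* CPU seconds used so far */

    if (psecs > hard_secs)
        return LIMIT_SIGKILL;   /* hard limit exceeded: kill the process */
    if (psecs > soft_secs)
        return LIMIT_SIGXCPU;   /* soft limit exceeded: warn the process */
    return LIMIT_NONE;
}
```

For instance, a process with 150 user ticks and 60 system ticks has used 2 CPU seconds, so with a soft limit of 1 second and a hard limit of 5 seconds it would receive SIGXCPU.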
Timer Interrupt
update_process_times (977-1049)
994: update current process’s times (924-975)
947-953: update the interval timers. When a timer has expired, SIGVTALRM is sent.
960-966: update the profiling timer
run_timer_list (649-665)
654: check timer list to see which timer has expired
655-662: prepare to call timer handler
run_old_timers (667-683)
check timer table (obsolete)
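The idea behind run_timer_list, checking a list for expired timers and calling their handlers, can be modeled in user space as follows. This is a sketch assuming a singly linked list sorted by expiry time; the struct layout and all names ending in _model, as well as record_fired, are hypothetical.

```c
#include <stddef.h>

struct timer_model {
    unsigned long expires;               /* tick at which the timer fires */
    void (*function)(unsigned long);     /* handler to call on expiry */
    unsigned long data;                  /* argument passed to the handler */
    struct timer_model *next;            /* list is kept sorted by expires */
};

static struct timer_model *timer_head_model;
static unsigned long fired_sum;          /* written by the sample handler */

/* Sample handler used for demonstration only. */
static void record_fired(unsigned long data)
{
    fired_sum += data;
}

/* Insert a timer, keeping the list ordered by expiry time. */
static void add_timer_model(struct timer_model *t)
{
    struct timer_model **p = &timer_head_model;

    while (*p && (*p)->expires <= t->expires)
        p = &(*p)->next;
    t->next = *p;
    *p = t;
}

/* Run every timer whose expiry time has been reached; return how many ran. */
static int run_timer_list_model(unsigned long now)
{
    int n = 0;

    while (timer_head_model && timer_head_model->expires <= now) {
        struct timer_model *t = timer_head_model;

        timer_head_model = t->next;      /* unlink before calling the handler */
        t->next = NULL;
        t->function(t->data);
        n++;
    }
    return n;
}
```

Because the list is sorted, the scan can stop at the first timer that has not yet expired.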
Scheduler
Classes
Real-time (soft)
Preemptive: rt_priority
SCHED_FIFO
– a process runs until it relinquishes control or a process with a higher rt_priority wishes to run
SCHED_RR
– can be interrupted if its time slice has expired and other processes with the same priority wish to run (round robin within the same class)
Classic
SCHED_OTHER
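From user space, the three policies are visible through the POSIX scheduling interface. The sketch below uses the standard calls sched_get_priority_min() and sched_get_priority_max(); the helper names policy_name, is_real_time and priority_range are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <sched.h>

/* Map a policy constant to a printable name. */
static const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    case SCHED_OTHER: return "SCHED_OTHER";
    default:          return "unknown";
    }
}

/* The two real-time classes are SCHED_FIFO and SCHED_RR. */
static int is_real_time(int policy)
{
    return policy == SCHED_FIFO || policy == SCHED_RR;
}

/* Fetch the valid static priority range for a policy; 0 on success. */
static int priority_range(int policy, int *min, int *max)
{
    *min = sched_get_priority_min(policy);
    *max = sched_get_priority_max(policy);
    return (*min < 0 || *max < 0) ? -1 : 0;
}
```

A process can query its own policy with sched_getscheduler(0); switching to a real-time policy normally requires elevated privileges, so that call is not shown here.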
Scheduler
schedule() (kernel/sched.c, lines 283-407)
Called when
a system call blocks (indirectly, sleep_on -> schedule)
after a slow interrupt, ret_from_sys_call is called to check the need_resched flag
the timer interrupt can also set the need_resched flag
Major tasks
call routines that need to be called regularly
determine the process with the highest priority
make that process the current process
Scheduler
schedule() (kernel/sched.c, lines 283-407)
303-304: cannot be called within a nested interrupt
306-310: run the bottom halves of the interrupt routines (not time-critical), e.g., the timer interrupt
312: routines registered to be run in the scheduler (chap. 7)
318-321: if the current process belongs to the SCHED_RR class and its time slice has expired, move it to the end of the run queue
323-325: if the current process is in the TASK_INTERRUPTIBLE state and a signal it is waiting for has arrived, make it runnable again
326-333: if the current process is waiting for a timeout and the timeout has expired, make it runnable again
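The signal and timeout checks for a TASK_INTERRUPTIBLE process can be sketched as below. The state values follow the task state list earlier in this chapter; the struct and the function fixup_state are hypothetical simplifications, not the kernel code.

```c
enum task_state {
    TASK_RUNNING = 0,
    TASK_INTERRUPTIBLE = 1,
    TASK_UNINTERRUPTIBLE = 2
};

struct task_model {
    enum task_state state;
    unsigned long signal;    /* bit mask of received signals */
    unsigned long blocked;   /* bit mask of blocked signals */
    unsigned long timeout;   /* wake-up time in ticks, 0 = no timeout */
};

/* Returns 1 if the task stays runnable, 0 if it must leave the run queue. */
static int fixup_state(struct task_model *t, unsigned long now)
{
    if (t->state == TASK_INTERRUPTIBLE) {
        int has_signal = (t->signal & ~t->blocked) != 0;
        int timed_out = t->timeout != 0 && t->timeout <= now;

        if (timed_out)
            t->timeout = 0;              /* the timeout has been consumed */
        if (has_signal || timed_out)
            t->state = TASK_RUNNING;     /* make it runnable again */
    }
    return t->state == TASK_RUNNING;
}
```

Note that a task in TASK_UNINTERRUPTIBLE is not woken by a pending signal in this model, which matches the description of that state.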
Scheduler
schedule() (kernel/sched.c, lines 283-407)
334-335: the current process must wait for an event, so remove it from the run queue
357-364: look for the process with the highest priority
goodness (lines 235-281) return values
– -1000: don't select this task
– 0: out of time (counters must be recalculated)
– positive: the larger, the better
– +1000: real-time process
255-256: real-time process
265: simply use p->counter as its weight
277-278: a slight advantage for the current process
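The selection logic can be modeled as follows. This is a simplified sketch based on the return values listed above, not the kernel source: adding rt_priority to 1000 for real-time tasks is an assumption about how real-time tasks are ranked among themselves, and the policy constants, struct sched_task and pick_next are hypothetical.

```c
#include <stddef.h>

enum { POLICY_OTHER, POLICY_FIFO, POLICY_RR };  /* hypothetical stand-ins */

struct sched_task {
    int policy;
    int rt_priority;
    long counter;    /* remaining time quota in ticks */
};

/* Weight of task p; prev is the task that is currently running. */
static int goodness_model(const struct sched_task *p,
                          const struct sched_task *prev)
{
    int weight;

    if (p->policy != POLICY_OTHER)
        return 1000 + p->rt_priority;   /* real-time always beats classic */

    weight = (int)p->counter;
    if (weight && p == prev)
        weight += 1;                    /* slight advantage for current task */
    return weight;
}

/* Pick the task with the largest weight; the first one wins ties. */
static const struct sched_task *pick_next(const struct sched_task *tasks,
                                          int n,
                                          const struct sched_task *prev)
{
    const struct sched_task *next = NULL;
    int best = -1000;
    int i;

    for (i = 0; i < n; i++) {
        int w = goodness_model(&tasks[i], prev);
        if (w > best) {
            best = w;
            next = &tasks[i];
        }
    }
    return next;
}
```

The bonus for the current task means that, between two classic tasks with equal counters, the running one keeps the processor and a context switch is avoided.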
Scheduler
schedule() (kernel/sched.c, lines 283-407)
367-370: if the counters of all processes are 0, recalculate them
386-401: make the new process the current process and do the context switch (switch_to())
switch_to() in include/asm-i386/system.h, lines 53-122
104-105: if next is the current task, do nothing
106-109: clear the TS flag if the task being switched to was the last one to use the math co-processor
111-112: switch to the next task
114-120: reload the debug registers if necessary
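The counter recalculation at lines 367-370 of schedule() can be illustrated with a short model. The formula used here, half of the old counter plus the static priority, is an assumption about how the recalculation works and should be checked against the kernel source; struct quota_task and recalc_counters are hypothetical names.

```c
struct quota_task {
    long counter;    /* dynamic priority: remaining ticks */
    long priority;   /* static priority */
};

/* Give every task a fresh time quota. Tasks that still had ticks left
 * (for example because they were waiting) keep half of them as a bonus. */
static void recalc_counters(struct quota_task *tasks, int n)
{
    int i;

    for (i = 0; i < n; i++)
        tasks[i].counter = (tasks[i].counter >> 1) + tasks[i].priority;
}
```

Under this assumption a task with counter 0 and priority 20 gets a new counter of 20, while a sleeping task with counter 10 and priority 20 gets 25.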
System Call Flow
Setting up the IDT table
start_kernel() calls trap_init() (arch/i386/kernel/traps.c, 322)
After setting up the system's traps, trap_init() calls set_system_gate(0x80, &system_call).
At this point, IDT[0x80] is set to system_call.
When trap 0x80 occurs, system_call is called.
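Setting up the individual system calls
Taking fork() as an example, line 272 of include/asm-i386/unistd.h defines static inline _syscall0(int,fork)
_syscall0 is defined at line 174; it expands this into
int fork(void)
{
    long __res;
    /* trap into the kernel; eax carries the call number in and the result out */
    __asm__ volatile ("int $0x80"
        : "=a" (__res)
        : "0" (__NR_fork)); /* which is 2 */
    if (__res >= 0)
        return (int) __res;
    /* a negative result is an error code: store it in errno */
    errno = -__res;
    return -1;
}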
Fork() System Call
So it relies on int $0x80 to raise a trap, passing __NR_fork as the input parameter and receiving the result in the output parameter __res.
When the trap occurs, execution continues at system_call.
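On a current Linux system the same idea, invoking a system call by number, can be tried without inline assembly through the C library's syscall() function. A minimal sketch; raw_getpid is a hypothetical name, and getpid is used instead of fork so that the example has no side effects:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke getpid by its system call number, bypassing the getpid() wrapper. */
static long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

The value returned by raw_getpid() should equal the one returned by getpid(). The exact trap instruction used underneath depends on the architecture and is hidden by the library.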
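Executing system_call
This is at line 281 of arch/i386/kernel/entry.S.
At line 290, the passed-in parameter (the system call number) is used to look up the function in sys_call_table[] (e.g. sys_fork). If the entry is not null, then after the trace flag is checked, the function (e.g. sys_fork) is called at line 304.
When the system call completes, execution reaches line 322, which is ret_from_sys_call, the same place a slow interrupt reaches when it finishes.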
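The table lookup performed by system_call can be modeled in plain C with an array of function pointers. This is a user-space sketch, not the assembly in entry.S: the table contents, the handler functions and the names dispatch and demo_syscall are all hypothetical, and returning -ENOSYS for an empty slot is an assumption about the error convention.

```c
#include <errno.h>
#include <stddef.h>

typedef long (*syscall_fn)(long arg);

/* Hypothetical handlers standing in for functions such as sys_fork. */
static long demo_answer(long arg) { (void)arg; return 42; }
static long demo_double(long arg) { return 2 * arg; }

#define NR_DEMO_SYSCALLS 4

/* Slots that are not listed stay NULL, like unused table entries. */
static const syscall_fn demo_call_table[NR_DEMO_SYSCALLS] = {
    [1] = demo_answer,
    [3] = demo_double,
};

/* Kernel side: look up the handler by number and call it. */
static long dispatch(unsigned long nr, long arg)
{
    if (nr >= NR_DEMO_SYSCALLS || demo_call_table[nr] == NULL)
        return -ENOSYS;
    return demo_call_table[nr](arg);
}

/* User side: the convention used by the _syscall0 expansion, where a
 * negative result is turned into errno and a return value of -1. */
static long demo_syscall(unsigned long nr, long arg)
{
    long res = dispatch(nr, arg);

    if (res >= 0)
        return res;
    errno = (int)-res;
    return -1;
}
```

Here demo_syscall(3, 21) returns 42, while demo_syscall(2, 0) hits an empty slot, returns -1 and sets errno to ENOSYS.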