LINUX Kernel: Introduction To The Kernel


LINUX Kernel

Chapter 3
Introduction to the Kernel

黃仁竑
Processes and Tasks
 Processes
 seen from outside: individual processes exist
independently
 Tasks
 seen from inside: only one operating system is
running

[Figure: seen from outside, Process 1, Process 2, and Process 3; seen from inside, Task 1, Task 2, and Task 3 running in one system kernel with co-routines.]
© 黃仁竑 / 中正資工
Process States
[Figure: process state diagram. States: Running (user mode); Interrupt routine and System call (system mode); Waiting; Ready. Labeled transitions: Interrupt, System call, Return from system call, Scheduler.]

Process States
 Running
 Task is active and running in the non-privileged
user mode.
 If an interrupt or system call occurs, it is
switched to the privileged system mode.
 Interrupt routine
 hardware signals an exception condition
 clock generates signal every 10 ms
 System call
 software interrupt

Process States
 Waiting
 wait for an external event (e.g., I/O complete)
 Return from system call
 when system call or interrupt is complete
 scheduler switches the process to ready state
 Ready
 competing for the processor

Important Data Structures
 Task structure
 task_struct in include/linux/sched.h
 Also accessed by assembly code, so the order of the leading fields cannot be changed and no declarations can be added at the front
 states
 TASK_RUNNING (0): ready or running
 TASK_INTERRUPTIBLE (1), TASK_UNINTERRUPTIBLE (2): waiting for certain events. A task in TASK_UNINTERRUPTIBLE does not accept any signals while it waits.
 TASK_ZOMBIE (3): process terminated but still has its task structure
 TASK_STOPPED (4): process has been halted
 TASK_SWAPPING (5): not used.
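
The state values above can be made concrete with a small user-space sketch. The constants are the ones listed on the slide; the two helper functions are hypothetical illustrations and are not kernel functions.

```c
#include <stddef.h>

/* State values as listed on the slide. */
#define TASK_RUNNING          0
#define TASK_INTERRUPTIBLE    1
#define TASK_UNINTERRUPTIBLE  2
#define TASK_ZOMBIE           3
#define TASK_STOPPED          4
#define TASK_SWAPPING         5

/* Hypothetical helper: map a state value to a readable label. */
static const char *task_state_name(long state)
{
    static const char *const names[] = {
        "running/ready", "interruptible sleep", "uninterruptible sleep",
        "zombie", "stopped", "swapping (unused)"
    };
    if (state < 0 || state > TASK_SWAPPING)
        return "unknown";
    return names[state];
}

/* Hypothetical helper: can an ordinary signal end this wait?
 * Only an interruptible sleeper is woken when a signal arrives. */
static int wakes_on_signal(long state)
{
    return state == TASK_INTERRUPTIBLE;
}
```

Calling task_state_name(TASK_ZOMBIE) yields the label "zombie"; wakes_on_signal() is true only for TASK_INTERRUPTIBLE.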

Task Structure
struct task_struct {
/* these are hardcoded - don't touch */
volatile long state;
 volatile indicates that this value can be altered by interrupt routines
long counter;
long priority;
 counter holds the time, in ticks, that the process can still run before a mandatory scheduling action is carried out. The scheduler uses counter as the dynamic priority.
 priority holds the static priority of a process

Task Structure
unsigned long signal;
unsigned long blocked;
 signal contains a bit mask of the signals received for the process. It is evaluated in the routine ret_from_sys_call(), which is called after every system call and after slow interrupts.
 blocked contains a bit mask for signals to be
blocked
unsigned long flags;
 flags contains the combination of the system status
flags
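
A minimal user-space sketch of how the two bit masks interact, assuming signal number sig is stored in bit sig-1 (as in sa=p->sig->action[sig-1] later in these slides). The struct and function names are hypothetical.

```c
#include <stdint.h>

/* The two fields mirror task_struct's signal and blocked masks. */
struct sig_state {
    uint32_t signal;   /* bit (sig-1) set: signal sig has been received */
    uint32_t blocked;  /* bit (sig-1) set: signal sig is blocked */
};

static uint32_t sig_bit(int sig)
{
    return (uint32_t)1 << (sig - 1);
}

/* Record that a signal has arrived. */
static void post_signal(struct sig_state *s, int sig)
{
    s->signal |= sig_bit(sig);
}

/* Return the lowest-numbered signal that is received and not blocked,
 * or 0 if there is none. */
static int next_deliverable(const struct sig_state *s)
{
    uint32_t ready = s->signal & ~s->blocked;
    int sig;

    for (sig = 1; sig <= 32; sig++)
        if (ready & sig_bit(sig))
            return sig;
    return 0;
}
```

A blocked signal stays recorded in signal and becomes deliverable as soon as its bit is cleared in blocked.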

Task Structure
 Process flags:
#define PF_ALIGNWARN 0x00000001 /* Print alignment warning msgs */
/* Not implemented yet, only for 486*/
#define PF_PTRACED 0x00000010 /* set if ptrace (0) has been called. */
#define PF_TRACESYS 0x00000020 /* tracing system calls */
#define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */
#define PF_SUPERPRIV 0x00000100 /* used super-user privileges */
#define PF_DUMPCORE 0x00000200 /* dumped core */
#define PF_SIGNALED 0x00000400 /* killed by a signal */
#define PF_STARTING 0x00000002 /* being created */
#define PF_EXITING 0x00000004 /* getting shut down */
#define PF_USEDFPU 0x00100000 /* Process used the FPU this quantum (SMP only) */
#define PF_DTRACE 0x00200000 /* delayed trace (used on m68k) */

Task Structure
int errno;
int debugreg[8];
 errno holds the error code of the last failed system call.
 debugreg contains the 80x86’s debugging registers.
struct exec_domain *exec_domain;
 describes which UNIX is emulated for the process
struct task_struct *next_task, *prev_task;
 all processes are linked through these two pointers
 init_task points to the start and end of this list
struct task_struct *next_run, *prev_run;
 list of processes that apply for the processor

Task Structure
struct task_struct *p_opptr, *p_pptr, *p_cptr, *p_ysptr, *p_osptr;
 pointers to the original parent, the (current) parent, the youngest child, the younger sibling, and the older sibling, respectively

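[Figure: process family tree. Each child points to the parent through p_pptr; the parent's p_cptr points to the youngest child; the children from youngest to oldest are linked through p_osptr, and back through p_ysptr.]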

Task Structure
struct mm_struct *mm;
 memory management information
struct mm_struct {
int count; pgd_t * pgd;
unsigned long context;
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack, start_mmap;
unsigned long arg_start, arg_end, env_start, env_end;
unsigned long rss, total_vm, locked_vm;
unsigned long def_flags;
struct vm_area_struct * mmap;
struct vm_area_struct * mmap_avl;
struct semaphore mmap_sem;
};

Virtual Memory

Task Structure
unsigned long kernel_stack_page;
 stack when a process is running in system mode
unsigned long saved_kernel_stack;
 saves the old stack pointer when running the MS-DOS emulator (vm86)
int pid, pgrp, session, leader;
 process id, process group id, session the process belongs to, and session leader
unsigned short uid,euid,suid,fsuid;
unsigned short gid,egid,sgid,fsgid;
 user id, effective user id, saved user id, file system user id
 group id, effective group id, saved group id, file system group id

Task Structure
 uid, euid, suid, gid, egid, sgid
 Each process has a real user ID and group ID and
an effective user ID and group ID.
 The real ID identifies the person using the system.
 The effective ID determines their access privileges.
 execve() changes the effective user or group ID to the owner or group of the executed file if the file has the set-user-ID (suid) or set-group-ID (sgid) mode. The real UID and GID are not affected. The effective user ID and effective group ID of the new process image are saved as the saved set-user-ID and saved set-group-ID respectively, for use by setuid(3V).
 Turn on suid: chmod u+s filename (chmod a+s turns on both suid and sgid)
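
The execve() rule above can be sketched as a pure function over a credential record. This is an illustrative user-space model, not kernel code: struct creds and exec_update_creds() are hypothetical names, while S_ISUID and S_ISGID are the standard mode bits from sys/stat.h.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical credential record; field names follow the slide. */
struct creds {
    uid_t uid, euid, suid;
    gid_t gid, egid, sgid;
};

/* Apply the execve() rules for a file with the given mode bits,
 * owner, and group. */
static void exec_update_creds(struct creds *c, mode_t mode,
                              uid_t owner, gid_t group)
{
    if (mode & S_ISUID)
        c->euid = owner;   /* set-user-ID bit: take the file owner's uid */
    if (mode & S_ISGID)
        c->egid = group;   /* set-group-ID bit: take the file's gid */
    c->suid = c->euid;     /* saved IDs record the new effective IDs */
    c->sgid = c->egid;
    /* The real IDs (c->uid, c->gid) are deliberately left untouched. */
}
```

For a user with uid 1000 executing a root-owned file with the suid bit set, the model leaves uid at 1000 and sets euid and suid to 0.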

Task Structure
 uid, gid are inherited from the parent
 euid, egid, fsuid, fsgid can be set at run time (e.g., to the owner of the executable file)
int groups[NGROUPS];
 A process may be assigned to many groups
struct fs_struct *fs;
 file system information
struct fs_struct {
int count; /* for future expansions */
unsigned short umask; /* access mode */
struct inode * root, * pwd; /* root dir and current dir */
};
Task Structure
struct files_struct *files;
 open file information (file descriptors)

struct files_struct { /* open file table structure */
int count;
fd_set close_on_exec; /* files to be closed when exec is issued */
fd_set open_fds; /* open files (bitmask) */
struct file * fd[NR_OPEN];
};

Task Structure
long utime, stime, cutime, cstime, start_time;
 time spent in user mode, time spent in system mode, total time the child processes spent in user mode and in system mode, and the time at which the process was created, respectively.
unsigned long it_real_value, it_prof_value, it_virt_value;
unsigned long it_real_incr, it_prof_incr, it_virt_incr;
struct timer_list real_timer;
 timer for the alarm system call (SIGALRM)
 time in ticks until the timer is triggered, the value used for re-initialization, and the real-time interval timer, respectively.
Task Structure
struct sem_undo *semundo;
 semaphores that need to be released when the process terminates
struct sem_queue *semsleeping;
 semaphore waiting queue
struct wait_queue *wait_chldexit;
 When a process calls wait4(), it halts on this queue until a child process terminates.
struct rlimit rlim[RLIM_NLIMITS];
 limits on the use of resources (setrlimit(), getrlimit())

Task Structure
struct signal_struct *sig;
struct signal_struct {
int count;
struct sigaction action[32];
};
 Signal handlers
int exit_code, exit_signal;
 return code and the signal that caused the program to abort
char comm[16];
 name of the program executed by the process

Task Structure
unsigned long personality;
 description of the characteristics of this version of UNIX (see also exec_domain)
int dumpable:1;
 whether a memory dump is to be written
int did_exec:1;
 is the process still running the old program (no execve, …)
struct desc_struct *ldt;
 used by WINE, the Windows emulator

Task Structure
struct linux_binfmt *binfmt;
 functions responsible for loading the program
struct thread_struct tss;
 holds all the data on the current processor status at the time of the last transition from user mode to system mode; all registers are saved here.
 struct thread_struct can be found in asm-i386/processor.h; among other definitions, it includes 8086-related information:
struct vm86_struct * vm86_info;
unsigned long screen_bitmap;
unsigned long v86flags, v86mask, v86mode;

Task Structure
unsigned long policy, rt_priority;
 Scheduling policies: classic (SCHED_OTHER), real-time (SCHED_RR, SCHED_FIFO)
 rt_priority: real-time priority
#ifdef __SMP__
int processor;
int last_processor;
int lock_depth;
#endif
 When running on a multi-processor machine, the kernel needs to know on which processor the task is running, etc.

Process Table
struct task_struct init_task;
 the start (and end) of the doubly linked task list
struct task_struct *task[NR_TASKS];
 task table
struct task_struct *current_set[NR_CPUS];
#define current (0+current_set[smp_processor_id()])
 current process (for multi-processor architecture)
#define for_each_task(p) \
for (p = &init_task ; (p = p->next_task) != &init_task ; )
 macro for visiting all processes
 the first task (init_task) is skipped
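
The traversal macro can be tried out in user space with a reduced stand-in for task_struct. The macro has the same shape as the one quoted above; struct task, link_task(), and find_task() are hypothetical names introduced only for this sketch.

```c
#include <stddef.h>

/* Reduced stand-in for task_struct: only the list links and a pid. */
struct task {
    int pid;
    struct task *next_task, *prev_task;
};

/* List head; like init_task it links to itself while the list is empty. */
static struct task init_task = { 0, &init_task, &init_task };

/* Same shape as the kernel macro quoted above. */
#define for_each_task(p) \
    for (p = &init_task; (p = p->next_task) != &init_task; )

/* Hypothetical helper: link t in just before init_task (the list tail). */
static void link_task(struct task *t)
{
    t->next_task = &init_task;
    t->prev_task = init_task.prev_task;
    init_task.prev_task->next_task = t;
    init_task.prev_task = t;
}

/* Hypothetical helper: find a task by pid; init_task itself is skipped
 * because the macro advances before the first test. */
static struct task *find_task(int pid)
{
    struct task *p;

    for_each_task(p)
        if (p->pid == pid)
            return p;
    return NULL;
}
```

Because the loop condition advances p before comparing, looking up pid 0 returns NULL even though init_task carries that pid.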

Files and inodes
 Two important structures: file, inode (linux/fs.h)
 The file structure (process’s view)
struct file {
mode_t f_mode;
 access mode when opened (RO, RW, WO)
loff_t f_pos;
 position of the read/write pointer (64-bit)
unsigned short f_flags;
 additional flags for controlling access rights (fcntl)

Files and inodes
unsigned short f_count;
 reference count (dup, dup2, fork)
struct file *f_next, *f_prev;
 doubly linked list
 global variable: struct file *first_file;
struct inode * f_inode;
 actual description of the file
struct file_operations * f_op;
 refers to a structure of function pointers for the file operations, i.e., the functions are not called directly.
 Since LINUX supports many file systems, a Virtual File System (VFS) is implemented.
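
The indirection through f_op can be illustrated with a much reduced user-space model. Only the idea (a table of function pointers selected per file) comes from the slide; the struct layouts, mem_read(), and vfs_read() are hypothetical and far simpler than the kernel's.

```c
#include <stddef.h>
#include <string.h>

struct file;  /* forward declaration */

/* Reduced stand-in for struct file_operations. */
struct file_operations {
    long (*read)(struct file *f, char *buf, size_t n);
};

struct file {
    long f_pos;                          /* read/write position */
    const struct file_operations *f_op;  /* chosen by the "file system" */
    const char *data;                    /* hypothetical backing store */
};

/* One hypothetical file system's read: copy from an in-memory string. */
static long mem_read(struct file *f, char *buf, size_t n)
{
    size_t len = strlen(f->data);
    size_t pos = (size_t)f->f_pos;

    if (pos >= len)
        return 0;                        /* end of file */
    if (n > len - pos)
        n = len - pos;
    memcpy(buf, f->data + pos, n);
    f->f_pos += (long)n;
    return (long)n;
}

static const struct file_operations mem_fops = { mem_read };

/* Generic layer: never calls mem_read directly, only through f_op. */
static long vfs_read(struct file *f, char *buf, size_t n)
{
    if (f->f_op == NULL || f->f_op->read == NULL)
        return -1;
    return f->f_op->read(f, buf, n);
}
```

Supporting another file system in this model only means providing another file_operations table; vfs_read() stays unchanged.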

Files and inodes
struct inode {
kdev_t i_dev; /* which device the file is on */
unsigned long i_ino; /* position on the device */
umode_t i_mode;
nlink_t i_nlink;
uid_t i_uid; /* owner user id */
gid_t i_gid; /* owner group id */
off_t i_size; /* size in bytes */
time_t i_atime; /* time of last access */
time_t i_mtime; /* time of last modification */
time_t i_ctime; /* time of last modification to inode */

Memory Management
 Macros
#define __get_free_page(priority) __get_free_pages((priority),0,0)
#define __get_dma_pages(priority, order) __get_free_pages((priority),(order),1)
extern unsigned long __get_free_pages(int priority, unsigned long gfporder, int dma);
 defined in linux/mm.h, page size is 4KB
 priority: GFP_BUFFER, GFP_ATOMIC, GFP_KERNEL, GFP_NOBUFFER, GFP_NFS (what to do if not enough pages are free)
 order: the number of pages to be reserved, as a power of 2 (2^order pages)
 dma: the pages must be addressable by the DMA component
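
Since order is a power-of-two exponent, a caller that knows a size in bytes has to convert it first. A minimal sketch of that arithmetic, assuming the 4 KB page size stated above; size_to_order() is a hypothetical helper, not a kernel function.

```c
#include <stddef.h>

#define PAGE_SIZE 4096UL   /* 4 KB, as stated above */

/* Hypothetical helper: smallest order such that 2^order pages can
 * hold `size` bytes. Intended for small sizes; no overflow handling. */
static unsigned long size_to_order(size_t size)
{
    unsigned long order = 0;

    while ((PAGE_SIZE << order) < size)
        order++;
    return order;
}
```

For example, 20000 bytes do not fit in 4 pages (16384 bytes), so the order is 3 (8 pages, 32768 bytes).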

Memory Management
 Functions
extern inline unsigned long get_free_page(int priority)
{
unsigned long page;
page = __get_free_page(priority);
if (page)
memset((void *) page, 0, PAGE_SIZE);
return page;
}
 Will clear the page

Memory Management
 Functions
void *kmalloc(size_t size, int priority)
void kfree(void *__ptr)
 malloc() and free() in the kernel

Waiting Queues
 Structures for waiting queues
struct wait_queue {
struct task_struct * task;
struct wait_queue * next;
};
 include/linux/wait.h
 wait until condition met
 Functions (sched.h)
 extern inline void add_wait_queue(struct wait_queue ** p, struct wait_queue * wait)
 extern inline void remove_wait_queue(struct wait_queue ** p, struct wait_queue * wait)
Waiting Queues
 Functions
void sleep_on(struct wait_queue ** p);
void interruptible_sleep_on(struct wait_queue ** p);
void wake_up(struct wait_queue ** p);
void wake_up_interruptible(struct wait_queue ** p);
 kernel/sched.c
 sleep_on sets the process state to TASK_UNINTERRUPTIBLE, interruptible_sleep_on sets it to TASK_INTERRUPTIBLE
 wake_up sets the process state to TASK_RUNNING

Semaphores
 Structure for semaphores
struct semaphore {
int count;
int waiting;
struct wait_queue * wait;
};
 asm-i386/semaphore.h

 Functions
extern inline void down(struct semaphore * sem)
extern inline void up(struct semaphore * sem)

System Time and Timers
 Measured in units of ticks (10 ms)
 The global variable jiffies denotes the time in ticks since the system was booted
 Structure for timer (old)
struct timer_struct {
unsigned long expires;
void (*fn)(void);
};
extern struct timer_struct timer_table[32];
extern unsigned long timer_active; /* which entry is valid? */
System Time and Timers
 Structure for timer (new)
struct timer_list {
struct timer_list *next;
struct timer_list *prev;
unsigned long expires;
unsigned long data; /* arguments */
void (*function)(unsigned long);
};
extern void add_timer(struct timer_list * timer);
extern int del_timer(struct timer_list * timer);
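
One way to picture the new interface is a circular list sorted by expiry time. The sketch below is a user-space model under that assumption; the struct fields are the ones quoted above, while timer_head as a sentinel, add_timer_sketch(), run_timers_sketch(), and the demo callback are hypothetical and are not the kernel's implementation.

```c
#include <stddef.h>

/* Same fields as the timer_list structure quoted above. */
struct timer_list {
    struct timer_list *next;
    struct timer_list *prev;
    unsigned long expires;              /* absolute time in ticks */
    unsigned long data;                 /* argument for function */
    void (*function)(unsigned long);
};

/* Sentinel head of a circular list kept sorted by expires. */
static struct timer_list timer_head = { &timer_head, &timer_head, ~0UL, 0, NULL };

/* Sketch of add_timer(): insert before the first later-expiring entry. */
static void add_timer_sketch(struct timer_list *t)
{
    struct timer_list *p = timer_head.next;

    while (p != &timer_head && p->expires <= t->expires)
        p = p->next;
    t->next = p;
    t->prev = p->prev;
    p->prev->next = t;
    p->prev = t;
}

/* Unlink and run every timer that is due at `now`; returns the count. */
static int run_timers_sketch(unsigned long now)
{
    int fired = 0;
    struct timer_list *t;

    while ((t = timer_head.next) != &timer_head && t->expires <= now) {
        t->prev->next = t->next;
        t->next->prev = t->prev;
        t->next = t->prev = NULL;
        t->function(t->data);
        fired++;
    }
    return fired;
}

/* Hypothetical demo callback: remembers the data of the last timer run. */
static unsigned long last_fired;
static void note_fired(unsigned long data) { last_fired = data; }
```

Keeping the list sorted means the expiry check only ever has to look at the first entry.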

Process Management
 Signal
 Interrupt
 Booting
 Timer
 Scheduler

Signal
 Signals
SIGHUP 1 hangup
SIGINT 2 interrupt
SIGQUIT 3 quit
SIGILL 4 illegal instruction
SIGTRAP 5 trace trap
SIGABRT 6 abort (generated by abort(3) routine)
SIGIOT 6 Input/Output Trap (obsolete)
SIGBUS 7 bus error
SIGFPE 8 arithmetic exception
SIGKILL 9 kill (cannot be caught, blocked, or ignored)
SIGUSR1 10 user-defined signal 1

Signal
SIGSEGV 11 segmentation violation
SIGUSR2 12 user-defined signal 2
SIGPIPE 13 write on a pipe or other socket with no one to read it
SIGALRM 14 alarm clock
SIGTERM 15 software termination signal
SIGSTKFLT 16 stack fault on coprocessor
SIGCHLD 17 child status has changed
SIGCONT 18 continue after stop
SIGSTOP 19 stop (cannot be caught, blocked, or ignored)
SIGTSTP 20 stop signal generated from keyboard
SIGTTIN 21 background read attempted from control terminal

Signal
SIGTTOU 22 background write attempted to control terminal
SIGURG 23 urgent condition present on socket
SIGXCPU 24 cpu time limit exceeded (see getrlimit(2))
SIGXFSZ 25 file size limit exceeded (see getrlimit(2))
SIGVTALRM 26 virtual time alarm (see getitimer(2))
SIGPROF 27 profiling timer alarm (see getitimer(2))
SIGWINCH 28 window changed (see termio(4) and win(4S))
SIGIO 29 I/O is possible on a descriptor (see fcntl(2V))
SIGPOLL 29 SIGIO
SIGPWR 30 Power Failure (for UPS)
SIGUNUSED 31

Signal System Calls
 Important system calls
 kill(int pid, int sig)
 sends the signal sig to a process or a group of processes
 If pid is greater than zero, the signal is sent to the process with the PID pid.
 If pid is zero, the signal is sent to the process group of the current process.
 If pid is -1, the signal is sent to all processes, except the system processes and the current process.
 If pid is less than -1, the signal is sent to all processes of the process group -pid.
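
The four cases of the pid argument can be written down as a small classifier. This is only an illustration of the rules listed above; the enum and the function name are hypothetical.

```c
/* Hypothetical labels for the four cases described above. */
enum kill_target {
    KILL_ONE_PROCESS,     /* pid > 0:   the process with that PID */
    KILL_OWN_GROUP,       /* pid == 0:  process group of the caller */
    KILL_ALL_PROCESSES,   /* pid == -1: all except system processes and caller */
    KILL_PROCESS_GROUP    /* pid < -1:  the process group -pid */
};

static enum kill_target classify_kill_pid(int pid)
{
    if (pid > 0)
        return KILL_ONE_PROCESS;
    if (pid == 0)
        return KILL_OWN_GROUP;
    if (pid == -1)
        return KILL_ALL_PROCESSES;
    return KILL_PROCESS_GROUP;
}
```

For instance, kill(-42, SIGTERM) falls into the last case and addresses process group 42.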

Signal System Calls
 Important system calls
 kill(int pid, int sig)
 The real or effective user ID of the sending process must match the real or saved set-user-ID of the receiving process, unless the effective user ID of the sending process is the super-user.
 A single exception is the signal SIGCONT, which only requires that the sending and receiving processes belong to the same session.
 Errors:
– EINVAL: invalid sig
– ESRCH: process or process group does not exist
– EPERM: no privileges

Signal System Calls
 Important system calls
 kill(int pid, int sig)
 Implementation
– linux/kernel/exit.c
– sys_kill() -> send_sig(), kill_pg(), kill_proc() -> generate()
– see also force_sig(), kill_sl()
– also called from ret_from_sys_call() -> do_signal()->send_sig()
->handle_signal() (signal.c, 223)
->setup_frame() (160)
->regs->eip = sa->sa_handler (213)

sys_kill
 Linux/kernel/exit.c, line 318-339
 322-323: If pid is zero, the signal is sent to the process gr
oup of the current process.
 324-334: If pid is -1, the signal is sent to all processes, ex
cept the system processes (PID=0 or 1) and current proce
ss. “for_each_task” macro is defined in include/linux/sch
ed.h, line 491. If count is zero, return error code ESRCH.
 335-336:If pid is less than -1, the signal is sent to all proc
ess of the process group -pid.
 338: If pid is greater than zero, the signal is sent to the pr
ocess with the PID pid.

kill_pg
 Linux/kernel/exit.c, line 258-275.
 264-265: sig must be in [1..32], pgrp (process group id)
must be greater than zero
 266-273: for each process, if its process group id is pgrp,
then sends signal sig to it (send_sig). If success, send_sig
will return zero.
 274: if found=0, then no process has been found, return e
rror ESRCH, else return zero.

kill_proc
 Linux/kernel/exit.c, line 301-312
 305-306: sig must be in [1..32].
 307-310: if a process with pid is found, sends signal sig t
o it (send_sig)
 311: if no process has been found, return error ESRCH

send_sig
 Linux/kernel/exit.c, line 73-101
 75-76: p cannot be null and sig must less than or equal to
32
 77: priv is privilege (0 for normal process, 1 for super use
r), SIGCONT can only send to process belongs to the sa
me sessin
 78-79: The real or effective user ID of the sending proces
sing must match the real or saved set-user ID of the recei
ving process, unless the effective user ID of the sending p
rocess is super-user.
 80: super user?
 81: If none of above conditions is true, return error
© 黃仁竑 / 中正資工
send_sig
 82-83: if sig=0, do nothing
 84-88: if sig in the task struct is null (in zombie state), do
nothing
 89-95: if sig is SIGKILL or SIGCONT, and the process i
s in state TASK_STOPPED, wake up the process and res
et SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU signals.
 96-97: if sig is SIGSTOP, SIGTSTP, SIGTTIN, or SIGT
TOU, reset SIGCONT.
 99: actually generate the signal

generate
 Linux/kernel/exit.c, line 29-51
 31: set up signal mask
 32: action of the signal, sa=p->sig->action[sig-1]
 39: if the signal is not blocked and the process is not traced
 41: and if the handler of the signal is SIG_IGN (to be ignore
d) and the signal is not from state change of child process
 42: then return immediately.
 44-46: if the handler if SIG_DFL (default action) and the sign
al is SIGCONT, SIGCHLD, SIGWINCH, SIGURG, then ret
urn immediately. (wake up has been done for SIGCONT)

generate
 48: finally, set the signal
 49-50: if the signal receiving process is interruptable and
the signal is not to be blocked, then wake up the process.

force_sig
 Linux/kernel/exit.c, line 57-70
 force to send a signal to a process (cannot be ignored)
 60: if the process is not in zombie state
 61-62: set the signal and get the signal action struct
 63: really set the signal
 64: the signal cannot be blocked, so clear the bit in
p->blocked
 65-66: if the handler is SIG_IGN, reset it to SIG_DFL
 67-68: wake up the process if it is interruptible

kill_sl
 Linux/kernel/exit.c, line 282-299
 sends a signal to the session leader
 288-289: sig must be in [1..32]. Session must be greater t
han zero
 290-297: for each process, checks to see if session id is e
qual to sess and the process is the session leader, then sen
ds signal to the session leader (send_sig)
 298: return error if no process is found

Signal System Calls
 Important system calls
 sigaction(int sig, struct sigaction *act, struct sigaction *oact)
 examine and change a signal action (replaces signal())
 act: new action, oact: old action (returned)
struct sigaction {
__sighandler_t sa_handler; /* SIG_DFL, SIG_IGN, or … */
sigset_t sa_mask; /* signals to be blocked during execution of the handler */
unsigned long sa_flags; /* SA_ONSTACK: on sig stack
SA_INTERRUPT: do not restart system call on signal return
SA_RESETHAND: reset handler to SIG_DFL when signal taken
SA_NOCLDSTOP: don’t send SIGCHLD on child stop */
void (*sa_restorer)(void);
};
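
From user space, the call is used roughly as follows. This is a minimal sketch against the POSIX sigaction() interface rather than the kernel-internal structure shown above; on_usr1() and sigaction_demo() are hypothetical names, and a single-threaded process is assumed.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t got_usr1 = 0;

static void on_usr1(int sig)
{
    (void)sig;
    got_usr1 = 1;          /* only async-signal-safe work in a handler */
}

/* Install on_usr1 for SIGUSR1, deliver the signal to ourselves, then
 * restore the previous action. Returns 1 if the handler ran, 0 if it
 * did not, -1 on error. */
static int sigaction_demo(void)
{
    struct sigaction act, oact;

    act.sa_handler = on_usr1;
    sigemptyset(&act.sa_mask);   /* block nothing extra while handling */
    act.sa_flags = 0;

    if (sigaction(SIGUSR1, &act, &oact) != 0)   /* oact receives the old action */
        return -1;
    got_usr1 = 0;
    raise(SIGUSR1);              /* handler completes before raise() returns */
    if (sigaction(SIGUSR1, &oact, NULL) != 0)   /* put the old action back */
        return -1;
    return got_usr1 ? 1 : 0;
}
```

Saving the old action in oact and restoring it afterwards keeps the demo from changing the signal disposition of the surrounding program.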

Signal System Calls
 Important system calls
 sigprocmask(int how, sigset_t *set, sigset_t *oset)
 examine and change the calling process’s blocked signals
 how
– SIG_BLOCK: add the signals in set to the blocked signals
– SIG_UNBLOCK: remove the signals in set from the blocked signals
– SIG_SETMASK: replace the blocked signals with set
 SIGKILL, SIGSTOP cannot be blocked
 the behavior is undefined for SIGFPE, SIGILL, SIGSEGV if they are blocked when they are generated

Signal System Calls
 Important system calls
 sigpending(sigset_t *set)
 stores the set of signals that are blocked from delivery and pending for the calling process in set.
 ssetmask(int mask), sgetmask(): set/get the blocked signals of the current process; made obsolete by sigprocmask().
 sigsuspend(int restart, unsigned long oldmask, unsigned long newmask)
 replaces the process’s signal mask with newmask and then suspends the process until delivery of a signal.
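
How blocking and pending fit together can be observed from user space with the POSIX calls. The sketch assumes a single-threaded process in which SIGUSR2 is not already blocked; on_usr2() and block_and_pending_demo() are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t usr2_count = 0;

static void on_usr2(int sig)
{
    (void)sig;
    usr2_count++;
}

/* Returns 1 if SIGUSR2 was held pending while blocked and delivered
 * once it was unblocked, 0 otherwise, -1 on error. */
static int block_and_pending_demo(void)
{
    struct sigaction act, oact;
    sigset_t set, pending;
    int held_back, was_pending;

    act.sa_handler = on_usr2;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    if (sigaction(SIGUSR2, &act, &oact) != 0)
        return -1;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR2);
    if (sigprocmask(SIG_BLOCK, &set, NULL) != 0)    /* add SIGUSR2 to the mask */
        return -1;

    usr2_count = 0;
    raise(SIGUSR2);                       /* blocked: the signal stays pending */
    held_back = (usr2_count == 0);
    if (sigpending(&pending) != 0)
        return -1;
    was_pending = sigismember(&pending, SIGUSR2);

    if (sigprocmask(SIG_UNBLOCK, &set, NULL) != 0)  /* delivery happens here */
        return -1;
    if (sigaction(SIGUSR2, &oact, NULL) != 0)
        return -1;
    return held_back && was_pending == 1 && usr2_count == 1;
}
```

The handler runs only after the SIG_UNBLOCK call, which is why usr2_count is still 0 at the point where sigpending() reports the signal.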
sys_sigaction
 Linux/kernel/signal.c, line 150-182
 155-156: check signal number [1..32]
 157: get the old sigaction (p)
 158-170: if action (new setting) is not null, check if it can
be read. If yes, copy the content of action to new_sa
 171-176: if oldaction is not null, stores the old sigaction
(p) to oldaction
 177-180: replace sigaction with new_sa

sys_sigprocmask
 Linux/kernel/signal.c, line 29-60
 34-52: if set (new mask) is not null, process set depends o
n how; SIG_BLOCK, add blocked signals to oset, SIG_U
NBLOCK: unblock blocked signals from oset, SIG_SET
MASK: reset blocked signals with set.
 53-58: if oset is no null, copy old_set (current->blocked)
to seet.

Sys_sigpending
 Linux/kernel/signal.c, line 80-88
 stores signals pending but blocked into set.
 84: check whether set can be written
 85-86: if yes, copy the signals that are both pending and blocked to set.

Interrupt
 To allow the hardware to communicate with the operating system
 Source files
 arch/i386/kernel/irq.c
 include/asm-i386/irq.h
 Interrupt handlers
 slow, fast, bad (irq.c, lines 142-172)
 build the interrupt handler first
 line 114-136, irq.c
 line 200-243, irq.h (macros)

Interrupt
 Interrupt number
 First set : 0-7
 Second set: 8-15
 0 for timer
 On SMP board (486 and above)
 irq13 for Interprocessor interrupts
 irq16 for SMP reschedule

 On a 386
 irq13 for SIGFPE (unreliable)
 no irq16

Interrupt
 Slow interrupts (include/asm/irq.h, line 205-222)
 206: build symbol table IRQ#_interrupt (see irq.c, 142)
 208: SAVE_ALL, save all registers
 209: ENTER_KERNEL, synchronize the processors’ access to the kernel on an SMP board
 210: ACK_FIRST (or SECOND), acknowledge the interrupt to the interrupt controller
 211: increase intr_count (number of nested interrupts)
 213-217: call do_IRQ(int irq, struct pt_regs *regs) (see arch/i386/kernel/irq.c, line 343-364)
 219: UNBLK_FIRST (or SECOND), inform the interrupt controller that interrupts of this type can again be accepted

Do_IRQ()

struct irqaction * action = *(irq + irq_action);



while (action) {
do_random |= action->flags;
action->handler(irq, action->dev_id, regs);
action = action->next;
}

Data Structure
struct irqaction {
void (*handler)(int, void *, struct pt_regs *);
unsigned long flags;
unsigned long mask;
const char *name;
void *dev_id;
struct irqaction *next;
};
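
The handler chain that do_IRQ() walks can be modeled in user space. The structure fields and the loop are the ones quoted in these slides; the table size, register_action(), dispatch_irq(), and the demo handler are hypothetical, and struct pt_regs is left opaque.

```c
#include <stddef.h>

struct pt_regs;   /* opaque here; the real layout is architecture specific */

/* Same fields as the irqaction structure quoted above. */
struct irqaction {
    void (*handler)(int, void *, struct pt_regs *);
    unsigned long flags;
    unsigned long mask;
    const char *name;
    void *dev_id;
    struct irqaction *next;
};

#define NR_IRQS_SKETCH 16
static struct irqaction *irq_action[NR_IRQS_SKETCH];

/* Sketch of registration: append to the chain, as for a shared irq. */
static int register_action(int irq, struct irqaction *a)
{
    struct irqaction **p;

    if (irq < 0 || irq >= NR_IRQS_SKETCH)
        return -1;
    for (p = &irq_action[irq]; *p != NULL; p = &(*p)->next)
        ;                                /* walk to the end of the chain */
    a->next = NULL;
    *p = a;
    return 0;
}

/* Sketch of the do_IRQ() loop; returns how many handlers were invoked. */
static int dispatch_irq(int irq, struct pt_regs *regs)
{
    int n = 0;
    struct irqaction *action = irq_action[irq];

    while (action) {
        action->handler(irq, action->dev_id, regs);
        action = action->next;
        n++;
    }
    return n;
}

/* Hypothetical demo handler: counts its calls through dev_id. */
static void count_handler(int irq, void *dev_id, struct pt_regs *regs)
{
    (void)irq;
    (void)regs;
    ++*(int *)dev_id;
}
```

With two actions registered on the same irq, one dispatch calls both handlers, each with its own dev_id.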

Interrupt
 Slow interrupts (irq.h, line 205-222)
 220: decrease intr_count
 221: increase syscall_count
 222: jump to routine ret_from_sys_call (never returns)
 Fast interrupts (irq.h, line 224-236)
 use SAVE_MOST and RESTORE_MOST instead of SAVE_ALL and do not call ret_from_sys_call
 229-230: call do_fast_IRQ(int irq) (see irq.c, line 371-393)
 Bad interrupts (irq.h, line 237-243)
 Simply acknowledge the interrupt (no handler is installed)
Interrupt Flow
 IDT[] -> interrupt[] (or fast_interrupt[], bad_interrupt[])
 IRQi_interrupt (or fast or bad) -> do_IRQ()->irqaction[]
 irqaction[i]->handler -> jump to ret_from_sys_call
 jump to handle_bottom_half (if bh_mask & bh_active)
 do_bottom_half -> bh_base[] -> bh_base[i]

Interrupt-Related Code
 request_irq()->setup_x86_irq() (or called from an init function)
 setup_x86_irq does two things:
 first, it sets the IDT[] entry to point to interrupt[]
 second, it puts the corresponding action into irqaction[]. If the irq is shared, irqaction[] holds a list
 interrupt[], fast_interrupt[], bad_interrupt[] are built by the BUILDIRQ macro. All of them are assembly code. In the assembly code, interrupt and fast_interrupt call do_IRQ, while bad_interrupt does not. In addition, interrupt jumps to ret_from_sys_call after calling do_IRQ (fast_interrupt does not).

Interrupt-Related Code
 do_IRQ runs, one by one, the handlers of the actions recorded in the corresponding irqaction[] entry
 On the jump to ret_from_sys_call, a check is made whether a jump to handle_bottom_half is needed (bh_mask & bh_active). The assembly code of handle_bottom_half calls do_bottom_half, and do_bottom_half calls the functions in bh_base[].

Interrupt-Related Code
 bh_base[] is set up with init_bh(), just as an irq is set up with request_irq().

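Bottom-half data structures
 bh_mask: a bit set to 1 means that a bottom half routine has been installed
 bh_active: a bit set to 1 means that an interrupt has occurred, its fast part has been handled, and the bottom half part is waiting to run.
 bh_mask_count: counts how many times this bottom half has been disabled; 0 means there is no nested disable (i.e., it is enabled)
 bh_base: the bottom half routines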
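
The interplay of bh_mask, bh_active, and bh_base can be modeled with plain bit operations. The variable and function names follow the slides, but the bodies are a simplified user-space sketch (bh_mask_count and interrupt locking are left out), and demo_bh() is a hypothetical routine.

```c
#include <stddef.h>

/* User-space model; the variable names follow the slide. */
static unsigned long bh_mask;        /* bit i: routine i is installed */
static unsigned long bh_active;      /* bit i: routine i is waiting to run */
static void (*bh_base[32])(void);    /* the bottom half routines */

/* Install a routine in slot nr. */
static void init_bh(int nr, void (*routine)(void))
{
    bh_base[nr] = routine;
    bh_mask |= 1UL << nr;
}

/* Called from the fast part of an interrupt: request the bottom half. */
static void mark_bh(int nr)
{
    bh_active |= 1UL << nr;
}

/* Run every bottom half that is both installed and active.
 * Returns the number of routines that were run. */
static int do_bottom_half(void)
{
    int nr, ran = 0;
    unsigned long todo = bh_mask & bh_active;

    for (nr = 0; nr < 32; nr++) {
        if (todo & (1UL << nr)) {
            bh_active &= ~(1UL << nr);   /* clear before running */
            bh_base[nr]();
            ran++;
        }
    }
    return ran;
}

/* Hypothetical demo routine standing in for something like timer_bh. */
static int demo_bh_runs;
static void demo_bh(void) { demo_bh_runs++; }
```

A slot that is marked active but has no installed routine is simply skipped, because its bit is missing from bh_mask.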
Example
 interrupt part:
 start_kernel() -> time_init() -> setup_x86_irq(0, &irq0) -> set_intr_gate()
irq0->action=timer_interrupt
 so IDT[0] -> interrupt[0] -> do_IRQ -> timer_interrupt()->do_timer()

Example
 bottom half:
 start_kernel() -> sched_init() -> init_bh(TIMER_BH, timer_bh)
 so after do_timer has been called: jump to ret_from_sys_call -> handle_bottom_half -> do_bottom_half->bh_base[0]->timer_bh

Device Driver Example
Using the 3Com 3C509 network card as an example
 Source is in drivers/net/3c509.c
 When the network card is opened (el3_open(), line 347)
 request_irq(dev->irq, &el3_interrupt, …) 356
 When an interrupt occurs
 el3_interrupt(…) 515
 mark_bh(NET_BH) 548
 So where is NET_BH initialized?
 net_dev_init() (net/core/dev.c)
 init_bh(NET_BH, net_bh); 1471

init_IRQ()
arch/i386/kernel/irq.c
536: after the system starts, the function void init_IRQ(void) initializes the IRQs.
545~547: outb_p and outb both output one byte to a port.
548~549: this for loop uses set_intr_gate to install the bad_interrupt array (for set_intr_gate, see system.h, lines 235-247). Initially every entry points to bad_interrupt[], which means that no interrupt handler has been installed yet. request_irq() later redirects the entry to interrupt[] or fast_interrupt[], depending on the requested flag.
555~556: request_region() is a function in apricot.c and a macro in resource.c; it is not discussed here.
557~558: setup_x86_irq() is used to set up the Interrupt Descriptor Table (IDT)
setup_x86_irq()
395: setup_x86_irq() begins.
401: p = irq_action + irq; irq_action is defined on line 219 as the head of an array of 16 struct pointers, all NULL; adding irq selects the entry for one of the irqs 0~15.
402~417: this code decides whether the IRQ can be shared. Fast and bad interrupts can never be shared; only slow interrupts may be shared. Chapter 7 discusses this in detail, so it is not discussed here.
426~432: if the IRQ is not shared, it is set to fast_interrupt[] if it is a fast interrupt, and to interrupt[] otherwise.

Request and Free IRQ
int request_irq()
437~467: when a device asks the system for an IRQ, request_irq() is called. Based on the IRQ number requested by the device, the function assigns the IRQ's handler to the device.

void free_irq()
469~495: free_irq() is the exact opposite of request_irq(): when a device is removed, free_irq() is called to release the IRQ.
Boot
 Boot process
 BIOS
 reads the first sector of the boot disk (floppy, hard disk, …, according to the BIOS parameter setting)
 Loads the boot sector (512 bytes), which contains the program code for loading the operating system kernel (e.g., Linux Loader, LILO), to 0x7C00 (arch/i386/boot/bootsect.s, 35) in real mode
 the boot sector ends with 0xAA55
 Boot disk
 Floppy: the first sector
 Hard disk: the first sector is the master boot record (MBR)

Boot Sector and MBR
Boot sector (floppy):
Offset  Content
0x000   JMP 0x03E
0x003   Disk parameters
0x03E   Program code loading the OS kernel
0x1FE   0xAA55

MBR and extended partition table:
Offset  Length  Content
0x000   0x1BE   Code for loading the boot sector of the active partition
0x1BE   0x010   Partition 1
0x1CE   0x010   Partition 2
0x1DE   0x010   Partition 3
0x1EE   0x010   Partition 4
0x1FE   0x002   0xAA55
MBR
 MBR
 Four primary partitions
 only 4 partition entries
 Each entry is 16 bytes
 Extended partition
 If more than 4 partitions are needed
 The first sector of an extended partition has the same layout as the MBR
 The first partition entry is for the first logical drive
 The second partition entry points to the next logical drive (MBR)
 The first sector of each primary or extended partition contains a boot sector
Extended Partition MBR
 MBR for extended partition
Code for loading the boot sector of the active partition
Entry 1: Logical partition
Entry 2: Next extended partition
Entry 3: Not used
Entry 4: Not used
0xAA55

Structure of a Partition Entry
Bytes  Field    Meaning
1      Boot     Boot flag: 0 = not active, 0x80 = active
1      HD       Begin: head number
2      SEC CYL  Begin: sector and cylinder number of the boot sector
1      SYS      System code: 0x83 Linux, 0x82 swap, 0x05 extended
1      HD       End: head number
2      SEC CYL  End: sector and cylinder number of the last sector
4      (low byte to high byte)  Relative sector number of the start sector
4      (low byte to high byte)  Number of sectors in the partition
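
The entry layout can be decoded with a few byte reads. The sketch assumes the offsets given in these slides (partition table at 0x1BE, 16 bytes per entry, signature at 0x1FE) and reads multi-byte fields low byte first; struct part_entry, le32(), and parse_part_entry() are hypothetical names.

```c
#include <stdint.h>

/* Decoded view of one 16-byte entry (CHS fields are skipped). */
struct part_entry {
    uint8_t  boot_flag;     /* 0x80 = active, 0 = not active */
    uint8_t  sys_code;      /* e.g. 0x83 Linux, 0x82 swap, 0x05 extended */
    uint32_t start_sector;  /* relative sector number of the start sector */
    uint32_t nr_sectors;    /* number of sectors in the partition */
};

/* Read a 32-bit value stored low byte first. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode entry idx (0..3) of a 512-byte MBR image.
 * Returns 0 on success, -1 on a bad index or a missing 0xAA55 signature. */
static int parse_part_entry(const uint8_t *mbr, int idx, struct part_entry *out)
{
    const uint8_t *e;

    if (idx < 0 || idx > 3)
        return -1;
    if (mbr[0x1FE] != 0x55 || mbr[0x1FF] != 0xAA)   /* word 0xAA55, low byte first */
        return -1;
    e = mbr + 0x1BE + 16 * idx;
    out->boot_flag = e[0];
    out->sys_code = e[4];
    out->start_sector = le32(e + 8);
    out->nr_sectors = le32(e + 12);
    return 0;
}
```

An MBR loader would scan the four entries for boot_flag 0x80 to find the active partition.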

Active Partition
 Booting is carried out from the active
partition which is determined by the boot flag
 Operations of MBR
 determine active partition
 load the boot sector of the active partition
 jump into the boot sector at offset 0

Boot Process
 Compressed kernel size
 include/linux/config.h: DEF_SYSSIZE = 0x7F00 clicks = 508 KB (1 click = 16 bytes)
 a zImage is smaller than this size
 The zImage begins with the code from arch/i386/boot/bootsect.s; it is loaded to 0x7C00 first, then moved to 0x90000, and execution continues there after a jump.
 setup.s is then loaded to 0x90200 and the kernel image is loaded to 0x10000 (64KB)
 setup.s moves the kernel from 0x10000 to 0x1000 (4KB) to save memory, then enters protected mode and jumps to 0x1000 (line 520-536)
bootsect.s
 Line 59-69
 Moves the code from 0x7C00 (BOOTSEG) to 0x90000 (INITSEG)
 64-65: set si, di to zero
 66: cld clears the DF flag in EFLAGS to 0, which makes the move go upwards (the addresses increase during the data movement)
 rep: repeats line 68
 68: moves word by word until cx=0 (cx is initialized to 256)
Boot Process
 Uncompress Kernel
 The start point is at arch/i386/kernel/head.s
 It initializes the system and then calls
start_kernel
 So the system then runs from start_kernel()

Booting the System
 LILO loads the Linux kernel into memory
 execution starts from “start:” in arch/i386/boot/setup.s
 setup.s is responsible for initializing the hardware, asking the BIOS for memory/disk/other parameters, and putting them in memory at 0x90000-0x901FF
 520-521: switch to protected mode
 534-536: jmp 0x1000, KERNEL_CS
 jmpi 0x100000, KERNEL_CS for big kernels
 Continues from startup_32 in arch/i386/kernel/head.s

Booting the System
 More parts of the hardware are initialized (paging table, co-processor, interrupt descriptor table (idt), stack, environment, …)
 219: calls start_kernel() in init/main.c
 start_kernel(): all areas of the kernel are initialized and process 1 is created
 794-852: more initializations
 858: creates process 1 (kernel_thread(init, NULL,0))
– process 0 is the idle process; it does nothing and runs when no other process needs the CPU
– process 1 calls init() and starts some daemons
 868: process 0 enters an infinite idle loop

Booting the System
 Init() in init/main.c, lines 919-1020
 927: bdflush synchronizes the buffer cache contents with the file system
 929: kswapd is the background pageout daemon (swapping)
 937: setup initializes the file systems and mounts the root file system
 986-991: connects to the console and opens file descriptors 0, 1, 2 (console)
 993-997: tries to execute one of the programs /etc/init, /bin/init, /sbin/init
 999-1003: if none of the three programs exists, executes /etc/rc
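The fallback order in lines 993-1003 can be sketched as a small selection loop. This is a simplified model under stated assumptions: the callback type try_exec_fn and the function pick_init_program are hypothetical names, and the callback stands in for an exec attempt (a real exec does not return on success, so the kernel simply tries the next path when one fails). The two sample callbacks exist only to exercise the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: returns 0 if the program at `path` could be started. */
typedef int (*try_exec_fn)(const char *path);

/* Try the three init programs in order; fall back to /etc/rc if none starts. */
static const char *pick_init_program(try_exec_fn try_exec)
{
    static const char *const candidates[] = {
        "/etc/init", "/bin/init", "/sbin/init"
    };
    size_t i;

    assert(try_exec != NULL);
    for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        if (try_exec(candidates[i]) == 0)
            return candidates[i];
    }
    return "/etc/rc";
}

/* Sample callbacks for illustration only. */
static int only_sbin_init_exists(const char *path)
{
    return strcmp(path, "/sbin/init") == 0 ? 0 : -1;
}

static int nothing_exists(const char *path)
{
    (void)path;
    return -1;
}
```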

Booting the System
 Init() in init/main.c, lines 919-1020
 1005-1018: enters an infinite loop in which a shell is
started for users to login on the console.

Setitimer System Call
 int setitimer(int which, struct itimerval *value, struct itimerval *ovalue)
 which:
 ITIMER_REAL: decrements in real time. A SIGALRM signal is delivered when this timer expires.
 ITIMER_VIRTUAL: decrements in process virtual time. It runs only when the process is executing (not including system time). A SIGVTALRM signal is delivered when this timer expires.
 ITIMER_PROF: decrements both in process virtual time and when the system is running on behalf of the process. A SIGPROF signal is delivered when this timer expires. It is designed for profiling the execution of interpreted programs.
 The itimerval struct has two fields: it_interval and it_value. If it_value is non-zero, it indicates the time to the next timer expiration. If it_interval is non-zero, it specifies the value used to reload it_value when the timer expires. Setting it_value to zero disables a timer. Setting it_interval to zero causes a timer to be disabled after its next expiration.
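A short user-space example may make the it_value / it_interval semantics concrete. It is a sketch, assuming a POSIX system: the helper name count_real_timer_ticks and the handler on_sigalrm are made up for illustration. It arms ITIMER_REAL as a periodic timer, waits for a given number of SIGALRM deliveries, and then disables the timer by setting it_value to zero.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t alarm_count = 0;

static void on_sigalrm(int signo)
{
    (void)signo;
    alarm_count++;
}

/* Arm ITIMER_REAL to fire every interval_us microseconds, wait for
 * `wanted` expirations, then disable the timer. Returns the number of
 * SIGALRM signals seen, or -1 on error. */
static int count_real_timer_ticks(int wanted, long interval_us)
{
    struct sigaction sa;
    struct itimerval tv, off;

    assert(wanted > 0 && interval_us > 0 && interval_us < 1000000);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigalrm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    alarm_count = 0;
    tv.it_value.tv_sec = 0;            /* time until the first expiration */
    tv.it_value.tv_usec = interval_us;
    tv.it_interval = tv.it_value;      /* reload value: periodic timer */
    if (setitimer(ITIMER_REAL, &tv, NULL) != 0)
        return -1;

    while (alarm_count < wanted)
        pause();                       /* sleep until a signal arrives */

    memset(&off, 0, sizeof off);       /* it_value == 0 disables the timer */
    if (setitimer(ITIMER_REAL, &off, NULL) != 0)
        return -1;
    return (int)alarm_count;
}
```

Because the timer is periodic, a signal that slips in between the loop test and pause() only delays the loop by one interval. A one-shot timer would need sigsuspend() with SIGALRM blocked instead.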

Related Code
 ITIMER_REAL
 Data structure: timer_head
 run_timer_list()
 it_real_fn() (itimer.c, 98, sched.h, 297)
 ITIMER_VIRTUAL
 do_it_virt() (sched.c, 943)
 ITIMER_PROF
 do_it_prof() (sched.c, 956)
 sys_setitimer() -> _setitimer() -> add_timer()
 itimer.c, 115; sched.c, 606
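What a routine such as do_it_virt() has to do on each tick can be sketched as follows. This is a simplified model, not the kernel source: the struct virt_itimer and the function virt_itimer_tick are hypothetical, the return value stands in for sending SIGVTALRM, and any overshoot past the expiration is ignored when reloading.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a process's virtual interval timer. */
struct virt_itimer {
    unsigned long value;  /* ticks until the next expiration; 0 = disabled */
    unsigned long incr;   /* reload value; 0 = one-shot timer */
};

/* Account `ticks` of user-mode time. Returns 1 if the timer expired
 * (the point at which SIGVTALRM would be sent), otherwise 0. */
static int virt_itimer_tick(struct virt_itimer *t, unsigned long ticks)
{
    assert(t != NULL);
    if (t->value == 0)
        return 0;              /* timer is disabled */
    if (t->value > ticks) {
        t->value -= ticks;     /* not yet expired */
        return 0;
    }
    t->value = t->incr;        /* reload, or disable if incr is 0 */
    return 1;
}
```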
Timer Interrupt
 Important global variables
 jiffies
 kernel/sched.c (96): unsigned long volatile jiffies=0;
 number of ticks (10 ms each) since the system was started

 xtime
 kernel/sched.c (47): volatile struct timeval xtime;
 actual time
 Timer interrupt
 updates jiffies and marks the bottom half active
 the bottom half is called later, after other interrupts have been handled
Timer Interrupt
 do_timer (kernel/sched.c, 1077-1095)
 1079: increments jiffies
 1080: increments lost_ticks (ticks since the bottom half routine was last called)
 1081: marks the bottom half active (include/linux/interrupt.h)
 1082-1083: increments lost_ticks_system if in kernel mode (ticks spent in kernel mode since the bottom half routine was last called)
 1084-1092: profiling
 1093-1094: marks the timer queue handler active

Timer Interrupt
 Bottom half routines of the timer interrupt
 timer_bh (kernel/sched.c, lines 1070-1075)
 1072: updates the times (kernel/sched.c, lines 1054-1068)
– 1058: xchg reads lost_ticks and resets it to zero atomically
– 1063: reads lost_ticks_system and resets it
– 1064: calculates the system load (lines 725-738)
– 1065: updates the real time xtime (740-922, hw)
– 1066: updates the times of the current process (977-1049)
 1073, 1074: update the system-wide timers (649-683)

Timer Interrupt
 update_process_times (977-1049)
 981: user time = ticks - system time
 983: decreases the time quota remaining for the current process
 984-987: if the time quota is used up, a reschedule is needed
 988-992: kernel statistics
 994: updates the current process’s times (924-975)
 929-930: update the process’s user and system times
 932-940: check whether the process has exceeded its CPU time limit (setrlimit sets limits on resource usage). If it exceeds the soft limit, SIGXCPU is sent. If it exceeds the hard limit, SIGKILL is sent to kill the process.
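The accounting steps listed above can be condensed into one small function. This is a simplified model under assumptions: struct model_task is a hypothetical, trimmed-down stand-in for task_struct, the limits are expressed in ticks rather than seconds, and the returned enum value stands in for the signal that would be sent.

```c
#include <assert.h>
#include <stddef.h>

enum tick_action { TICK_NONE = 0, TICK_SIGXCPU, TICK_SIGKILL };

/* Hypothetical, trimmed-down stand-in for task_struct. */
struct model_task {
    long counter;            /* remaining time quota in ticks */
    unsigned long utime;     /* ticks spent in user mode */
    unsigned long stime;     /* ticks spent in system mode */
    unsigned long cpu_soft;  /* soft CPU limit, in ticks here */
    unsigned long cpu_hard;  /* hard CPU limit, in ticks here */
    int need_resched;        /* set when the quota is used up */
};

static enum tick_action account_ticks(struct model_task *p,
                                      unsigned long ticks,
                                      unsigned long system)
{
    unsigned long user, total;

    assert(p != NULL && system <= ticks);
    user = ticks - system;       /* user time = ticks - system time */
    p->counter -= (long)ticks;   /* use up part of the time quota */
    if (p->counter <= 0) {       /* quota used up: request a reschedule */
        p->counter = 0;
        p->need_resched = 1;
    }
    p->utime += user;
    p->stime += system;
    total = p->utime + p->stime;
    if (total > p->cpu_hard)
        return TICK_SIGKILL;     /* hard limit exceeded */
    if (total > p->cpu_soft)
        return TICK_SIGXCPU;     /* soft limit exceeded */
    return TICK_NONE;
}
```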

Timer Interrupt
 update_process_times (977-1049)
 994: update current process’s times (924-975)
 947-953: update interval timers. When timers have
expired, sends SIGVTALRM.
 960-966: update profile

 run_timer_list (649-665)
 654: check timer list to see which timer has expired
 655-662: prepare to call timer handler
 run_old_timers (667-683)
 check timer table (obsolete)
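A list-based timer check of the kind run_timer_list performs can be sketched as below. It assumes the list is kept sorted by expiration time, so only the head needs to be inspected. The struct model_timer, both functions, and the recording handler are hypothetical names for illustration, and wrap-around of the tick counter is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel timer entry. */
struct model_timer {
    struct model_timer *next;
    unsigned long expires;            /* absolute tick of expiration */
    void (*function)(unsigned long);  /* handler called on expiration */
    unsigned long data;               /* argument passed to the handler */
};

/* Insert a timer, keeping the list sorted with the earliest first. */
static void model_add_timer(struct model_timer **head, struct model_timer *t)
{
    assert(head != NULL && t != NULL);
    while (*head != NULL && (*head)->expires <= t->expires)
        head = &(*head)->next;
    t->next = *head;
    *head = t;
}

/* Unlink and run every timer that has expired at tick `now`.
 * Returns the number of handlers called. */
static int model_run_timer_list(struct model_timer **head, unsigned long now)
{
    int fired = 0;

    assert(head != NULL);
    while (*head != NULL && (*head)->expires <= now) {
        struct model_timer *t = *head;
        *head = t->next;       /* detach before calling the handler */
        t->next = NULL;
        t->function(t->data);
        fired++;
    }
    return fired;
}

/* Sample handler for illustration: adds up the data of fired timers. */
static unsigned long model_fired_sum = 0;
static void model_record(unsigned long data) { model_fired_sum += data; }
```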

Scheduler
 Classes
 Real-time (soft)
 Preemptive: rt_priority
 SCHED_FIFO
– a process runs until it relinquishes control or a process with a higher rt_priority wishes to run
 SCHED_RR
– can also be interrupted if its time slice has expired and other processes with the same priority wish to run (round robin within the same class)
 Classic
 SCHED_OTHER
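From user space, the range of static priorities accepted for the two real-time classes can be queried with the POSIX calls sched_get_priority_min() and sched_get_priority_max(). The helper below is a small sketch, with rt_priority_is_valid being a made-up name. It only queries the range and does not change any scheduling policy, which would normally require privileges.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sched.h>

/* Return 1 if prio is a valid static priority for the given real-time
 * policy on this system, otherwise 0. */
static int rt_priority_is_valid(int policy, int prio)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);

    assert(policy == SCHED_FIFO || policy == SCHED_RR);
    if (lo == -1 || hi == -1)
        return 0;                 /* the query itself failed */
    return prio >= lo && prio <= hi;
}
```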

Scheduler
 Schedule() (kernel/sched.c, lines 283-407)
 Called when
 a system call is made (indirectly, sleep_on -> schedule)
 after a slow interrupt, ret_from_sys_call is called to check the need_resched flag
 the timer interrupt also sets the need_resched flag

 Major tasks
 run routines that need to be called regularly
 determine the process with the highest priority
 make that process the current process

Scheduler
 Schedule() (kernel/sched.c, lines 283-407)
 303-304: cannot be called within a nested interrupt
 306-310: the bottom halves of the interrupt routines (time-u
ncritical). E.g., the timer interrupt.
 312: routines registered to be run in scheduler (chap. 7)
 318-321: if current process belongs to the SCHED_RR clas
s and its time slice has expired, move it to the end of run qu
eue.
 323-325: if current process is in TASK_INTERRUPTIBLE
state and the signal it is waiting has arrived, make it runnabl
e again
 326-333: if current process is waiting for timeout and the ti
meout has expired, make it runnable again

Scheduler
 Schedule() (kernel/sched.c, lines 283-407)
 334-335: the current process must wait for an event, remo
ve it from the run queue
 357-364: looks for the process with highest priority..
 goodness(lines 235-281) return values
– -1000: don’t select this task
– 0: out of time (no results)
– +ve: the larger, the better
 1000: real-time process
 255-256: real-time process
 265: simply use p->counter as its weight
 277-278: a slight favor to the current process

Scheduler
 Schedule() (kernel/sched.c, lines 283-407)
 367-370: all process’s counter is 0, re-calculate
 386-401: have a new process become the current process,
do the context switch (switch_to())
 switch_to() in include/asm-i386/system.h, lines 53-12
2
 104-105: if next is the current task, do nothing
 106-109: clears the TS-flag if the task we switched to
has used the math co-processor latest
 111-112: switch to the next task
 114-120: reloads the debug regs if necessary.

System Call Flow
 Setting up the IDT
 start_kernel() calls trap_init() (arch/i386/kernel/traps.c, 322)
 After setting up the system traps, trap_init() calls set_system_gate(0x80, &system_call). This sets IDT[0x80] to system_call, so system_call is called whenever trap 0x80 occurs.

 So a system call is triggered by the int 0x80 instruction.

Defining the Individual System Calls
 Taking fork() as an example, line 272 of include/asm-i386/unistd.h defines static inline _syscall0(int,fork)
 _syscall0 is defined on line 174; it expands this declaration into
int fork(void)
{
        long __res;
        __asm__ volatile ("int $0x80"
                : "=a" (__res)
                : "0" (__NR_fork));     /* __NR_fork is 2 */
        if (__res >= 0)
                return (int) __res;
        errno = -__res;
        return -1;
}

Fork() System Call
 So it relies on int $0x80 to cause a trap, passing in the input parameter __NR_fork
 The output parameter is __res. When the trap occurs, execution continues at system_call.
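On a current Linux system the same idea, passing a system call number to the kernel and getting the result back, can be tried from user space with the C library's syscall() function. This sketch is Linux-specific and uses getpid rather than fork so that running it has no side effects. The helper name raw_getpid is made up for illustration, and the trap instruction the library uses underneath depends on the architecture.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke the getpid system call by number instead of through its
 * dedicated wrapper. Returns the process ID, or -1 on error. */
static long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

The result should match what the ordinary getpid() wrapper returns for the same process.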

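Executing system_call
 It is located at line 281 of arch/i386/kernel/entry.s.
 At line 290, the passed-in parameter (the system call number) is used to look up the function (e.g., sys_fork) in sys_call_table[]. If the entry is not null, then after the trace flag has been checked, the function (e.g., sys_fork) is called at line 304.
 When the system call completes, execution reaches line 322. This is ret_from_sys_call, which is also where a slow interrupt ends up after it finishes.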
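The table lookup performed by system_call, together with the return convention used by the _syscall0 wrapper, can be modeled in plain C. This is a user-space sketch under assumptions: every model_ name, the table size, and the placeholder handler are hypothetical, and returning -ENOSYS for a missing entry is an assumption about the error path.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*model_syscall_fn)(void);

#define MODEL_NR_FORK 2      /* __NR_fork is 2 */
#define MODEL_NR_SYSCALLS 8  /* hypothetical table size */

/* Placeholder handler: pretends to return a child process ID. */
static long model_sys_fork(void) { return 1234; }

/* Dispatch table indexed by system call number; unset entries are NULL. */
static const model_syscall_fn model_sys_call_table[MODEL_NR_SYSCALLS] = {
    [MODEL_NR_FORK] = model_sys_fork,
};

/* Kernel side: look up the handler by number and call it. */
static long model_system_call(unsigned long nr)
{
    model_syscall_fn fn;

    if (nr >= MODEL_NR_SYSCALLS)
        return -ENOSYS;             /* number out of range */
    fn = model_sys_call_table[nr];
    if (fn == NULL)
        return -ENOSYS;             /* no handler registered */
    return fn();
}

/* User side: negative results become -1 plus an errno-style code. */
static int model_errno = 0;
static int model_syscall0(unsigned long nr)
{
    long res = model_system_call(nr);

    if (res >= 0)
        return (int)res;
    model_errno = (int)-res;
    return -1;
}
```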

