Linux Kernel Internals
Outline
Linux Introduction Linux Kernel Architecture Linux Kernel Components
Linux Introduction
History Features Resources
Features
Free Open system Open source GNU GPL (General Public License) POSIX standard High portability High performance Robust Large development toolset Large number of device drivers Large number of application programs
Features (Cont.)
Multi-tasking Multi-user Multi-processing Virtual memory Monolithic kernel Loadable kernel modules Networking Shared libraries Support different file systems Support different executable file formats Support different networking protocols Support different architectures
Resources
Distributions Books Magazines Web sites FTP sites BBS
Linux Kernel Architecture
User View of Linux Operating System Linux Kernel Architecture Kernel Source Code Organization
User View of Linux Operating System
Applications Shell Kernel
Hardware
System Structure
Processes
System calls interface
File systems
ext2fs minix iso9660 xiafs nfs proc msdos
Central kernel
Task management Scheduler Signals Loadable modules Memory management
Buffer Cache Peripheral managers
block character sound card cdrom isdn scsi pci network
Network Manager
ipv4 ethernet ...
Machine interface
Machine
Linux Kernel Architecture
Analysis of Linux Kernel Architecture
Stability Safety Speed Brevity Compatibility Portability Reusability and modifiability Monolithic kernel vs. microkernel Linux combines advantages of both the monolithic kernel and the microkernel
Kernel Source Code Organization
Source code web site: http://www.kernel.org Source code version:
X.Y.Z 2.2.17 2.4.0
Kernel Source Code Organization (Cont.)
Resources for Tracing Linux
Source code browser
cscope Global LXR (Source code navigator)
Books
Understanding the Linux Kernel, D. P. Bovet and M. Cesati, O'Reilly & Associates, 2000.
Linux Core Kernel Commentary, In-Depth Code Annotation, S. Maxwell, Coriolis Open Press, 1999.
The Linux Kernel, Version 0.8-3, D. A. Rusling, 1998.
Linux Kernel Internals, 2nd edition, M. Beck et al., Addison-Wesley, 1998.
Linux Kernel, R. Card et al., John Wiley & Sons, 1998.
How to compile Linux Kernel
1. make config (or make menuconfig)
2. make depend
3. make boot: generate a compressed bootable Linux kernel, arch/i386/boot/zImage
make zdisk: generate the kernel and write it to a floppy disk (dd if=zImage of=/dev/fd0)
make zlilo: generate the kernel and copy it to /vmlinuz
lilo: Linux Loader
Linux Kernel Components
Bootstrap and system initialization Memory management Process management Interprocess communication File system Networking Device control and device drivers
Bootstrap and System Initialization Events From Power-On To Linux Kernel Running
Bootstrap and System Initialization Booting the PC (Events From Power On)
Perform POST procedure Select boot device Load bootstrap program (bootsect.S) from floppy or HD
Bootstrap program
Hardware initialization (setup.S)
Loads the Linux kernel into memory (head.S)
Initializes the Linux kernel
Hands over from the bootstrap sequence to start the first process (init)
Bootstrap and System Initialization (Cont.) Init process
Create various system daemons Initialize kernel data structures Free initial memory unused afterwards
Runs shell
Shell accepts and executes user commands
Low-level Hardware Resource Handling Interrupt handling Trap/Exception handling System call handling
Memory Management
Memory Management Subsystem Provides virtual memory mechanism
Overcome memory limitation Makes the system appear to have more memory than it actually has by sharing it between competing processes as they need it.
It provides:
Large address spaces Protection Memory mapping Fair physical memory allocation Shared virtual memory
Memory Management
x86 Memory Management
Segmentation Paging
Linux Memory Management
Memory Initialization Memory Allocation & Deallocation Memory Map Page Fault Handling Demand Paging and Page Replacement
Segment Translation
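[Figure: segment translation. The logical address consists of a 16-bit selector (bits 15-0) and a 32-bit offset (bits 31-0). The selector picks a segment descriptor from the segment descriptor table; the descriptor's base address is added to the offset to form the linear address (Dir, Page, Offset).]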
Linear Address Translation
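[Figure: linear address translation. The 32-bit linear address is split into Directory (bits 31-22, 10 bits), Table (bits 21-12, 10 bits), and Offset (bits 11-0, 12 bits). CR3 (PDBR) points to the page directory; the directory entry selects a page table; the page-table entry plus the offset gives the physical address in physical memory.]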
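The 10/10/12-bit split of a linear address described on this slide can be sketched in C as follows. This is a minimal illustration, not kernel code; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Illustrative container for the three fields of a 32-bit linear address. */
struct linear_addr_parts {
    uint32_t dir;    /* bits 31-22: index into the page directory */
    uint32_t table;  /* bits 21-12: index into the page table */
    uint32_t offset; /* bits 11-0: byte offset inside the 4 KB page */
};

/* Split a linear address into directory index, table index, and offset. */
static struct linear_addr_parts split_linear_address(uint32_t linear)
{
    struct linear_addr_parts p;
    p.dir    = (linear >> 22) & 0x3FFu; /* 10 bits */
    p.table  = (linear >> 12) & 0x3FFu; /* 10 bits */
    p.offset = linear & 0xFFFu;         /* 12 bits */
    return p;
}
```

For example, the address 0xC0101234 splits into directory index 0x300, table index 0x101, and offset 0x234.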
Segmentation and Paging
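[Figure: segmentation and paging combined. The segment selector of a logical address picks a segment descriptor, whose segment base address plus the offset gives a linear address in the linear address space. The linear address (Dir, Table, Offset) is then translated through the page directory and page table to a page in the physical address space.]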
Abstract model of Virtual to Physical address mapping
[Figure: Process X and Process Y each have virtual page frame numbers VPFN0 to VPFN7 in virtual memory; the Process X and Process Y page tables map these virtual pages onto physical page frames PFN0 to PFN4 in physical memory.]
An Abstract Model of VM (Cont.) Each page table entry contains:
Valid flag Physical page frame number Access control information
X86 page table entry and page directory entry:
Bits 31-12: Page Address
Bit 6: D (dirty)
Bit 5: A (accessed)
Bit 2: U/S (user/supervisor)
Bit 1: R/W (read/write)
Bit 0: P (present)
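As a rough sketch of how these entry fields are used, the following C helpers test the flag bits and extract the page address. The macro and function names are hypothetical. The permission check is deliberately simplified: it treats R/W as binding for both privilege levels, which real x86 hardware does only in some configurations.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bit positions in an x86 page table entry (names are illustrative). */
#define PTE_PRESENT    (1u << 0)   /* P   */
#define PTE_RW         (1u << 1)   /* R/W */
#define PTE_USER       (1u << 2)   /* U/S */
#define PTE_ACCESSED   (1u << 5)   /* A   */
#define PTE_DIRTY      (1u << 6)   /* D   */
#define PTE_FRAME_MASK 0xFFFFF000u /* bits 31-12: page address */

/* Physical address of the page frame the entry points to. */
static uint32_t pte_frame_address(uint32_t pte)
{
    return pte & PTE_FRAME_MASK;
}

/* Simplified access check: the page must be present, user-mode access
 * needs U/S set, and a write needs R/W set. */
static bool pte_allows(uint32_t pte, bool user_mode, bool is_write)
{
    if (!(pte & PTE_PRESENT))
        return false;              /* would raise a page fault */
    if (user_mode && !(pte & PTE_USER))
        return false;
    if (is_write && !(pte & PTE_RW))
        return false;
    return true;
}
```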
Demand Paging
Loading virtual pages into memory as they are accessed Page fault handling
faulting virtual address is invalid
faulting virtual address was valid but the page is not currently in memory
Swapping
If a process needs to bring a virtual page into physical memory and there are no free physical pages available: Linux uses a Least Recently Used page aging technique to choose pages which might be removed from the system. Kernel Swap Daemon (kswapd)
Caches
To improve performance, Linux uses a number of memory management related caches:
Buffer Cache Page Caches Swap Cache Hardware Caches (Translation Look-aside Buffers)
Page Allocation and Deallocation Linux uses the Buddy algorithm to effectively allocate and deallocate blocks of pages. Pages are allocated in blocks which are powers of 2 in size.
If the block of pages found is larger than requested, it must be broken down until there is a block of the right size.
The page deallocation code recombines pages into larger blocks of free pages whenever it can.
Whenever a block of pages is freed, the adjacent or buddy block of the same size is checked to see if it is free.
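The arithmetic behind the buddy scheme is compact enough to sketch. The helpers below are an illustration under the assumption that blocks are identified by the index of their first page and are aligned to their own size; the function names are hypothetical and this is not the kernel's allocator.

```c
#include <stddef.h>

/* Smallest order such that a block of (1 << order) pages covers the request. */
static unsigned order_for_pages(size_t pages)
{
    unsigned order = 0;
    while (((size_t)1 << order) < pages)
        order++;
    return order;
}

/* Index of the buddy of the block starting at page_index.
 * Assumes page_index is a multiple of (1 << order). */
static size_t buddy_of(size_t page_index, unsigned order)
{
    return page_index ^ ((size_t)1 << order);
}

/* Start index of the block of order + 1 formed when a block and its
 * buddy are both free and get recombined. */
static size_t merged_block(size_t page_index, unsigned order)
{
    return page_index & ~((size_t)1 << order);
}
```

A request for 3 pages is rounded up to order 2 (4 pages). When the 4-page block at page 12 is freed, its buddy is the block at page 8; if that is also free, the two merge into an 8-page block starting at page 8.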
Splitting of Memory in a Buddy Heap
Vmlist for virtual memory allocation
vmalloc() & vfree() first-fit algorithm
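[Figure: vmlist is a linked list of allocated areas, each covering addr to addr+size, laid out between VMALLOC_START and VMALLOC_END; the gaps between allocated areas are unallocated space.]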
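A first-fit search over a sorted list of allocated areas might look like the sketch below. The struct and function names are hypothetical, and details of the real vmalloc() (such as guard pages between areas) are not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* One allocated area; the list is assumed to be sorted by addr. */
struct area {
    uintptr_t addr;
    size_t size;
    struct area *next;
};

/* Return the start of the first gap of at least 'size' bytes between
 * 'start' and 'end', or 0 if no gap is large enough. */
static uintptr_t first_fit(const struct area *list, uintptr_t start,
                           uintptr_t end, size_t size)
{
    uintptr_t candidate = start;

    for (const struct area *a = list; a != NULL; a = a->next) {
        if (a->addr >= candidate && a->addr - candidate >= size)
            break;                      /* gap before this area fits */
        candidate = a->addr + a->size;  /* skip past the allocated area */
    }
    if (candidate > end || end - candidate < size)
        return 0;                       /* ran out of address range */
    return candidate;
}
```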
Process Management
What is a Process ?
A program in execution. A process includes the program's instructions and data, the program counter and all CPU registers, and the process stacks containing temporary data. Each individual process runs in its own virtual address space and cannot interact with another process except through secure, kernel-managed mechanisms.
Linux Processes
Each process is represented by a task_struct data structure, containing:
Process State Scheduling Information Identifiers Inter-Process Communication Times and Timers File system Virtual memory Processor Specific Context
Process State
[Figure: process state diagram. States: ready, executing, suspended, stopped, zombie. Transitions are labeled creation, scheduling, input/output, end of input/output, signal, and termination.]
Process Relationship
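[Figure: process relationships. Each task_struct links to its parent (p_pptr), original parent (p_opptr), youngest child (p_cptr), younger sibling (p_ysptr), and older sibling (p_osptr); the figure shows a parent with its oldest child, a middle child, and its youngest child.]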
Managing Tasks
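[Figure: each struct task_struct is reachable through the pidhash table, through the doubly linked list formed by next_task and prev_task, and through the task array, whose free slots are tracked by tarray_freelist.]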
Scheduling
As well as normal processes, Linux supports real-time processes. The scheduler treats real-time processes differently from normal user processes.
Pre-emptive scheduling
Priority-based scheduling algorithm
Time slice: 200 ms
Schedule: select the most deserving process to run
Priority (weight): Normal: counter; Real time: counter + 1000
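The "most deserving process" rule can be sketched as a weight comparison. This follows the weights stated on the slide (counter for normal processes, counter + 1000 for real-time ones); the struct, enum, and function names are hypothetical, and the real scheduler considers more factors.

```c
#include <stddef.h>

enum task_policy { TASK_NORMAL, TASK_REALTIME };

/* Minimal stand-in for the scheduling fields of a task (illustrative). */
struct task {
    int pid;
    enum task_policy policy;
    int counter; /* remaining time slice, in ticks */
};

/* Weight as described on the slide: real-time tasks get a 1000 bonus,
 * so any runnable real-time task outranks every normal task. */
static int weight(const struct task *t)
{
    return t->policy == TASK_REALTIME ? t->counter + 1000 : t->counter;
}

/* Pick the task with the highest weight; NULL if there are no tasks. */
static const struct task *pick_next(const struct task *tasks, size_t n)
{
    const struct task *best = NULL;

    for (size_t i = 0; i < n; i++)
        if (best == NULL || weight(&tasks[i]) > weight(best))
            best = &tasks[i];
    return best;
}
```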
A Process's Files
[Figure: current points to the task_struct, whose files field leads to the table of open files; each open file refers to an entry in the table of i-nodes.]
Virtual Memory
A process's virtual memory contains executable code and data from many sources. Processes can allocate (virtual) memory to use during their processing Demand paging is used where the virtual memory of a process is brought into physical memory only when a process attempts to use it.
Process Address Space
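[Figure: process address space, from high to low addresses: kernel memory (above 0xC0000000), environment, arguments, stack, data (bss), data, code (starting near address 0).]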
A Process's Virtual Memory
[Figure: the mm field of task_struct points to an mm_struct (count, pgd, mmap, mmap_avl, mmap_sem); mmap heads a list of vm_area_struct entries (vm_end, vm_start, vm_flags, vm_inode, vm_ops, vm_next), one describing the process's data area and one its code area.]
Process Creation and Execution
UNIX process management separates the creation of processes and the running of a new program into two distinct operations.
The fork system call creates a new process. A new program is run after a call to execve.
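The two-step pattern can be sketched with the standard POSIX calls fork(), execve(), and waitpid(). The helper name run_program is hypothetical; error handling is kept minimal.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Create a child with fork(), replace its image with execve(), and
 * return the child's exit status (or -1 on failure). */
static int run_program(const char *path, char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;                   /* fork failed */
    if (pid == 0) {
        execve(path, argv, environ); /* only returns on error */
        _exit(127);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_program("/bin/sh", argv) with argv set to { "sh", "-c", "exit 3", NULL } returns 3 on a system where /bin/sh exists.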
Executing Programs
Programs and commands are normally executed by a command interpreter. A command interpreter is a user process like any other process and is called a shell, e.g., sh, bash, and tcsh. Executable object files:
Contain executable code and data together with information to be loaded and executed by OS
Linux Binary Format
ELF, a.out, script
How to execute a program?
Enter the command
Search for the file in the process's search path (PATH)
The shell clones itself, and the clone's binary image is replaced with the executable image
ELF
ELF (Executable and Linkable Format) object file format
designed by Unix System Laboratories
the most commonly used format in Linux
[Figure: ELF file layout: format header, physical header (code), physical header (data), code, data.]
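An ELF file begins with the four identification bytes 0x7f, 'E', 'L', 'F'. A minimal sketch of recognizing the format by those bytes follows; the function name is hypothetical, and a real loader checks far more of the header.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if the file starts with the ELF magic bytes, 0 if it does
 * not, and -1 if the file cannot be opened. */
static int is_elf_file(const char *path)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };
    unsigned char ident[4];
    size_t n;
    FILE *f = fopen(path, "rb");

    if (f == NULL)
        return -1;
    n = fread(ident, 1, sizeof ident, f);
    fclose(f);
    if (n != sizeof ident)
        return 0; /* too short to be an ELF file */
    return memcmp(ident, magic, sizeof magic) == 0;
}
```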
Interprocess Communication Mechanisms (IPC)
Signals Pipes Message Queues Semaphores Shared Memory
Signals
Signals inform processes of the occurrence of asynchronous events. Processes may send each other signals by kill system call, or kernel may send signals to a process. A set of defined signals in the system:
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL 5) SIGTRAP 6) SIGIOT 7) SIGBUS 8) SIGFPE 9) SIGKILL 10) SIGUSR1
11) SIGSEGV 12) SIGUSR2 13) SIGPIPE 14) SIGALRM 15) SIGTERM 17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP
21) SIGTTIN 22) SIGTTOU 23) SIGURG 24) SIGXCPU 25) SIGXFSZ 26) SIGVTALRM 27) SIGPROF 28) SIGWINCH 29) SIGIO 30) SIGPWR
Signals (Cont.)
A process can choose to block signals, handle them itself, or allow the kernel to handle them. The kernel handles signals using default actions.
E.g., SIGFPE(floating point exception) : core dump and exit
Signal related fields in task_struct data structure
signal (32 bits): pending signals
blocked: a mask of blocked signals
sigaction array: address of the handling routine, or a flag to let the kernel handle the signal
Pipes
one-way flow of data The writer and the reader communicate using standard read/write library function
[Figure: Task A and Task B connected by a communication pipe.]
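A minimal sketch of the one-way flow: a child process writes a message into a pipe and the parent reads it. The function name is hypothetical; pipe(), fork(), read(), and write() are the standard POSIX calls.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The child writes msg into the pipe; the parent reads it into buf.
 * Returns the number of bytes read, or -1 on error. */
static ssize_t pipe_roundtrip(const char *msg, char *buf, size_t buflen)
{
    int fd[2]; /* fd[0] is the read end, fd[1] is the write end */
    ssize_t n;
    pid_t pid;

    if (pipe(fd) < 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }
    if (pid == 0) {            /* writer: the child */
        close(fd[0]);
        if (write(fd[1], msg, strlen(msg)) < 0)
            _exit(1);
        close(fd[1]);
        _exit(0);
    }
    close(fd[1]);              /* reader: the parent */
    n = read(fd[0], buf, buflen);
    close(fd[0]);
    waitpid(pid, NULL, 0);
    return n;
}
```

Each side closes the end it does not use, which is what makes the data flow one-way.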
Restriction of Pipes and Signals
Pipe:
Impossible for an arbitrary process to read or write a pipe unless it is a child of the process that created it.
Named pipes (also known as FIFOs):
also a one-way flow of data
allow unrelated processes to access a single FIFO.
Signal
The only information transported is a simple number, which renders signals unsuitable for transferring data.
System V IPC Mechanism
Linux supports 3 types of IPC mechanisms:
Message queues, semaphores and shared memory First appeared in UNIX System V in 1983
They allow unrelated processes to communicate with each other.
Key Management
Processes may access these IPC resources only by passing a unique reference identifier to the kernel via system calls. Senders and receivers must agree on a common key to find the reference identifier for the System V IPC object. Access to these System V IPC objects is checked using access permissions.
Shared Memory and Semaphores
Shared memory
Allow processes to communicate via memory that appears in all of their virtual address space As with all System V IPC objects, access to shared memory areas is controlled via keys and access rights checking. Must rely on other mechanisms (e.g. semaphores) to synchronize access to the memory
Semaphores
A semaphore is a location in memory whose value can be tested and set atomically by more than one process. Can be used to implement critical regions
sys_shmget(): create a segment and return a valid IPC identifier
sys_shmat(): attach the segment to the process for reading and writing
sys_shmdt(): detach the segment
sys_shmctl(): execute commands on shared memory (for example, remove a segment)
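From user space these calls appear as shmget(), shmat(), shmdt(), and shmctl(). The sketch below lets a child write into a segment that the parent then reads. It uses IPC_PRIVATE instead of an agreed key, because the child inherits the attachment across fork(), and it uses waitpid() rather than a semaphore for synchronization; both are simplifications, and the function name is hypothetical.

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SEG_SIZE 4096

/* The child copies msg into shared memory; the parent copies it to out
 * (outlen must be at least 1). Returns 0 on success, -1 on error. */
static int shm_roundtrip(const char *msg, char *out, size_t outlen)
{
    int rc = -1;
    pid_t pid;
    char *mem;
    int id = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600); /* create */

    if (id < 0)
        return -1;
    mem = shmat(id, NULL, 0);                 /* attach for read/write */
    if (mem == (void *)-1) {
        shmctl(id, IPC_RMID, NULL);
        return -1;
    }
    pid = fork();
    if (pid == 0) {                           /* child: writer */
        strncpy(mem, msg, SEG_SIZE - 1);
        mem[SEG_SIZE - 1] = '\0';
        shmdt(mem);
        _exit(0);
    }
    if (pid > 0 && waitpid(pid, NULL, 0) == pid) { /* wait for writer */
        strncpy(out, mem, outlen - 1);
        out[outlen - 1] = '\0';
        rc = 0;
    }
    shmdt(mem);                               /* detach */
    shmctl(id, IPC_RMID, NULL);               /* remove the segment */
    return rc;
}
```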
Semaphores
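[Figure: semaphore data structures. An array of pointers to struct semid_ds, with unused slots marked IPC_UNUSED or IPC_NOID; each semid_ds points to its semaphores (struct sem) and to its queue of pending operations (struct sem_queue).]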
Message Queues
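Allow one or more processes to write messages, which will be read by one or more reading processes
[Figure: message queue data structures. An array of pointers to struct msqid_ds, with unused slots marked IPC_UNUSED or IPC_NOID; each msqid_ds points to its list of messages (struct msg).]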
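A user-space sketch of a message queue using msgget(), msgsnd(), msgrcv(), and msgctl() is shown below. For brevity the same process sends and receives, and IPC_PRIVATE is used instead of an agreed key; the struct and function names are hypothetical.

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

/* A message is a long type field followed by the payload. */
struct text_msg {
    long mtype;
    char mtext[64];
};

/* Send text through a fresh queue and receive it into out (outlen
 * must be at least 1). Returns 0 on success, -1 on error. */
static int msgq_roundtrip(const char *text, char *out, size_t outlen)
{
    int rc = -1;
    struct text_msg m = { .mtype = 1 }; /* rest is zero-initialized */
    struct text_msg r;
    int id = msgget(IPC_PRIVATE, IPC_CREAT | 0600);

    if (id < 0)
        return -1;
    strncpy(m.mtext, text, sizeof m.mtext - 1);
    if (msgsnd(id, &m, sizeof m.mtext, 0) == 0 &&
        msgrcv(id, &r, sizeof r.mtext, 1, 0) >= 0) { /* type 1 only */
        strncpy(out, r.mtext, outlen - 1);
        out[outlen - 1] = '\0';
        rc = 0;
    }
    msgctl(id, IPC_RMID, NULL);         /* remove the queue */
    return rc;
}
```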
File System
Linux File System
Linux supports different file system structures at the same time
Ext2, ISO 9660, ufs, FAT-16,VFAT,
Hierarchical File System Structure
Linux adds each new file system into this single file system tree as it is mounted.
The real file systems are separated from the OS by an interface layer: Virtual File System: VFS VFS allows Linux to support many different file systems, each presenting a common software interface to the VFS.
Hierarchical File System Structure
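[Figure: directory tree. The root contains bin, dev, etc, lib, sbin, and usr; bin contains ls and cp; usr contains bin, include, lib, man, and sbin; usr/bin contains cc.]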
Mounting of Filesystems
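[Figure: mounting operation. The root filesystem (/) contains bin, dev, etc, lib, sbin, and usr. The /usr filesystem contains bin, include, lib, man, and sbin. After mounting /usr, the complete hierarchy shows the /usr filesystem's directories under the usr directory of the root filesystem.]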
The Layers in the File System
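[Figure: layers in the file system. User mode: Process 1, Process 2, ..., Process n. System mode: the Virtual File System on top of the file system implementations (ext2, msdos, minix, proc), then the buffer cache, then the device drivers.]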
Ext2 File System
Devised by Rémy Card as an extensible and powerful file system for Linux.
Allocating space to files:
Data in files is kept in fixed-size data blocks
Indexed allocation (inode)
Directory: a special file that contains pointers to the inodes of its directory entries
Divides the logical partition that it occupies into Block Groups.
Physical Layout of File Systems
Schematic Structure of a UNIX File System
Boot block 0 Superblock 1 Inode blocks 2... Data blocks
Physical Layout of EXT2 File System
[Figure: the partition is divided into Block Group 0, Block Group 1, ..., Block Group n. Each block group contains a super block, group descriptors, a block bitmap, an inode bitmap, an inode table, and data blocks.]
The EXT2 Inode
Mode
Owner Info
Size
Timestamps
Direct Blocks (point to data blocks)
Indirect blocks
Double Indirect
Triple Indirect
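How a logical block number within a file maps to one of these pointer levels can be sketched as below. It assumes the usual EXT2 layout of 12 direct block pointers and 4-byte block pointers; the enum and function names are hypothetical.

```c
#include <stdint.h>

#define EXT2_NDIR_BLOCKS 12 /* direct block pointers in the inode */

enum block_level {
    LEVEL_DIRECT,
    LEVEL_INDIRECT,
    LEVEL_DOUBLE,
    LEVEL_TRIPLE,
    LEVEL_TOO_BIG
};

/* Which pointer level of the inode covers the given logical block? */
static enum block_level ext2_block_level(uint64_t logical_block,
                                         uint32_t block_size)
{
    uint64_t ptrs = block_size / 4; /* pointers per indirect block */

    if (logical_block < EXT2_NDIR_BLOCKS)
        return LEVEL_DIRECT;
    logical_block -= EXT2_NDIR_BLOCKS;
    if (logical_block < ptrs)
        return LEVEL_INDIRECT;
    logical_block -= ptrs;
    if (logical_block < ptrs * ptrs)
        return LEVEL_DOUBLE;
    logical_block -= ptrs * ptrs;
    if (logical_block < ptrs * ptrs * ptrs)
        return LEVEL_TRIPLE;
    return LEVEL_TOO_BIG;
}
```

With 1 KB blocks (256 pointers per block), blocks 0 to 11 are direct, 12 to 267 go through the single indirect block, and block 268 is the first to need the double indirect block.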
Directory Format
[Figure: a directory holds entries name 1 to name 4, each paired with an i-node number (3, 2, 3, 0 in the figure) that indexes the i-node table (entries 0 to 5).]
The Virtual File System (VFS)
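[Figure: tasks enter the kernel through the system call interface and reach the virtual file system, which uses the inode cache and the directory cache and dispatches to the file system implementations (minix, ext2fs, proc); below them are the buffer cache, the device drivers, and the machine.]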
Allocating Blocks to a File
To avoid fragmentation, where a file's blocks are spread all over the file system, the EXT2 file system uses:
Allocating new blocks for a file physically close to its current data blocks, or at least in the same Block Group as its current data blocks where possible
Block preallocation
Speedup Access
VFS Inode Cache Directory Cache
stores the mapping between the full directory names and their inode numbers.
Buffer Cache
All of the Linux file systems use a common buffer cache to cache data buffers from the underlying devices
Replacement policy: LRU
bdflush & update Kernel Daemons
The bdflush kernel daemon
provides a dynamic response to the system having too many dirty buffers (default: 60%).
tries to write a reasonable number of dirty buffers out to their owning disks (default: 500).
The update daemon
periodically flush all older dirty buffers out to disk
The /proc File System
It does not really exist. It presents a user-readable window into the kernel's inner workings.
The /proc file system serves information about the running system. It not only allows access to process data but also allows you to request the kernel status by reading files in the hierarchy. System information
Process-Specific Subdirectories Kernel data IDE devices in /proc/ide Networking info in /proc/net, SCSI info Parallel port info in /proc/parport TTY info in /proc/tty
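Because /proc entries behave like ordinary files, reading kernel status needs nothing beyond standard file I/O. The sketch below reads the first line of a /proc file; the function name is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Read the first line of a file under /proc into buf, without the
 * trailing newline. Returns 0 on success, -1 on error. */
static int read_proc_line(const char *path, char *buf, size_t buflen)
{
    char *ok;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    ok = fgets(buf, (int)buflen, f);
    fclose(f);
    if (ok == NULL)
        return -1;
    buf[strcspn(buf, "\n")] = '\0'; /* strip the newline, if any */
    return 0;
}
```

For example, read_proc_line("/proc/version", line, sizeof line) fills line with the kernel version string on a Linux system.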
Networking
Linux Networking Layers
[Figure: Linux networking layers. User: network applications. Kernel: BSD sockets (socket interface), INET sockets, the protocol layers TCP and UDP over IP (with ARP), and the network devices PPP, SLIP, and Ethernet.]
Server Client Model
Server: socket( ) bind( ) listen( ) accept( ) read( ) write( ) close( )
Client: socket( ) connect( ) write( ) read( ) close( )
Exchange between them: connection establishment, data (request), data (reply), connection break
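The call sequence on both sides can be sketched with a TCP echo over the loopback interface. In this illustration the listening socket is set up first, a forked child plays the server from accept( ) onward, and the parent plays the client; the function name is hypothetical and error handling is minimal.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Send request to a one-shot echo server and read the reply.
 * Returns the number of reply bytes, or -1 on error. */
static ssize_t tcp_echo_once(const char *request, char *reply, size_t replylen)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    ssize_t n = -1;
    pid_t pid;
    int cli, srv = socket(AF_INET, SOCK_STREAM, 0);      /* socket() */

    if (srv < 0)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                 /* let the kernel pick a free port */
    if (bind(srv, (struct sockaddr *)&addr, sizeof addr) < 0 ||  /* bind() */
        listen(srv, 1) < 0 ||                                  /* listen() */
        getsockname(srv, (struct sockaddr *)&addr, &len) < 0 ||
        (pid = fork()) < 0) {
        close(srv);
        return -1;
    }
    if (pid == 0) {                    /* server side */
        char buf[256];
        int conn = accept(srv, NULL, NULL);                    /* accept() */
        ssize_t got = conn < 0 ? -1 : read(conn, buf, sizeof buf);
        if (got > 0 && write(conn, buf, (size_t)got) != got)
            _exit(1);
        if (conn >= 0)
            close(conn);
        _exit(got > 0 ? 0 : 1);
    }
    close(srv);                        /* client side */
    cli = socket(AF_INET, SOCK_STREAM, 0);
    if (cli >= 0 &&
        connect(cli, (struct sockaddr *)&addr, sizeof addr) == 0 &&
        write(cli, request, strlen(request)) == (ssize_t)strlen(request))
        n = read(cli, reply, replylen);
    if (cli >= 0)
        close(cli);
    if (n < 0)
        kill(pid, SIGKILL);            /* do not leave the server waiting */
    waitpid(pid, NULL, 0);
    return n;
}
```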
Linux BSD Socket Data Structure
[Figure: files_struct (count, close_on_exec, open_fs, fd[0], fd[1], ..., fd[255]) points to a file (f_mode, f_pos, f_flags, f_count, f_owner, f_op, f_inode, f_version). f_op points to the BSD socket file operations (lseek, read, write, select, ioctl, close, fasync). f_inode leads to the inode and its socket (type SOCK_STREAM, protocol, data), which refers to the address family socket operations and to the sock (type SOCK_STREAM, protocol, socket).]
Loadable Kernel Module
A kernel module is not an independent executable but an object file that is linked into the kernel at run time. Modules can be dynamically integrated into the kernel. When no longer used, modules may be unloaded. This enables the system to have an extended kernel.
Loading Modules
[Figure: the compiled kernel before loading, and the kernel after loading the modules Minix, NFS, PPP, and Printer.]