0% found this document useful (0 votes)

19 views30 pages

Lecture w07

This document provides an in-depth exploration of the Linux kernel, detailing its structure, boot process, and mechanisms for handling interrupts and exceptions. It explains how the kernel operates differently from standard programs, emphasizing its unique entry points and the absence of standard library functions. Additionally, it outlines the organization of kernel source code and the initialization sequence during system boot, including the creation of user-level processes.

Uploaded by

meira.pics2

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

19 views30 pages

Lecture w07

Uploaded by

meira.pics2

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 30

ECE357:Computer Operating Systems Unit 7/pg 1 ©2024 Jeff Hakner

What is the Kernel?

In this unit, we will begin an exploration of the Linux kernel code itself. The first and
most obvious question is: just what exactly is the "kernel?!"

The Linux kernel is an unusual type of program. Most programs to which we are
accustomed have one point of entry, run for some time, and then terminate. The kernel
has one initial point of entry which is used at boot-time, and then multiple points of
controlled re-entry. The termination of the kernel amounts to handing the CPU back to
the firmware which either reboots or turns off the machine.

The kernel is compiled and linked just like an ordinary C program, in that it consists of
many individual object (.o) files which are compiled from source code (the Linux kernel
is written exclusively in C and assembly). Within the kernel is the equivalent of a text
section, containing the executable code, as well as data section with variable initializers.
Like an ordinary C program, when the kernel starts it sets aside space for uninitialized
bss variables, and can dynamically allocate memory for other data structures. Also, like
a user-level C program, the kernel can dynamically load and unload executable modules.

Unlike the C environment that is documented in the C language standards documents, the
kernel does not have the standard library. There is no printf, no malloc, no fopen, etc.
because these standard library functions would require an operating system to provide the
underlying system calls. There is a subset of "pure" library functions (such as strcpy)
which is linked in. The kernel is also coded to avoid all use of the floating point registers,
so functions such as pow() and sqrt() would never be seen in kernel code, and indeed the
keywords float and double would generally be absent too.

At this point, it might be instructive to look at how the Linux kernel source code is
arranged. The kernel can be compiled for a variety of target architectures. From the top-
level directory, we see the following directories, each of which contains architecture-
independent code. Thus the code below is only in C, not assembly. There is one top-
level directory called arch, under which is one entry for each supported architecture,
e.g. arch/x86, and below each of those is another set of subdirectories with similar
structure to the architecture-independent set. Here is a description of these directories:
• block: A fairly small set of utility routines pertaining to block-level access to hard
disks.
• crypto: Encryption functions.
• drivers: A vast set of routines for interfacing with I/O devices, including arch-neutral
stuff such as the SCSI messaging layer, and support for specific peripherals,
motherboards, etc.
• fs: Routines to support the file system, such as open, read, write system calls. Also
ECE357:Computer Operating Systems Unit 7/pg 2 ©2024 Jeff Hakner

subdirectories for each supported file system type module.

• include: Header (.h) files
• init: Code which is executed when the system first boots.
• ipc: System V and POSIX IPC mechanisms
• kernel: A considerable amount of code including all of the synchronization, task
control, signals, halting and restarting the system, etc.
• lib: A number of utility routines needed by different subsystems, including
synchronization primitives such as spin locks and semaphores.
• mm: The virtual memory subsystem.
• net: Networking protocols
• security: Some extra stuff for "security-hardened" versions of the kernel.
• sound: Sound drivers and support

(there are a few other subdirectories which have to do with the kernel build and install
process or other features but are beyond the scope of this course.)

Boot-up

"Boot" is short for "bootstrap" and derives from the old expression "trying to lift yourself
up by your own bootstraps." In order to load the operating system kernel from a bootable
device or via the network, we need kernel-like services such as disk and network drivers.
But how do we obtain these services before the kernel is running? This is the bootstrap
paradox. The following describes how the linux kernel under X86-32 bit architecture is
booted:
• The linux kernel is built from source and often included as a binary distribution. A
valid kernel must be present to support the particular architecture of the machine that is
being booted, and the particular device drivers that are needed at boot time. The kernel
source code as described above is first compiled (using ordinary tools such as the GCC
compiler toolchain) into an a.out file. However, a.out files mean nothing to the firmware,
so additional preparation is necessary. The image of the text and data sections is pulled
out of the compiled kernel. Typically, to save space on smaller bootable media, the
kernel image is compressed. A bootable kernel consists of a small, uncompressed portion
of bootable code followed by the compressed text/data image. This bootable kernel
(typically called vmlinuz where the "z" stands for compressed) is installed on the
bootable medium (hard disk, CD/DVD ROM, USB stick).
• To boot the system, the firmware ("BIOS") causes the bootable kernel image to be read
in from the boot device (e.g. the hard disk) into a specific place in physical memory. The
firmware then transfers control to the starting address. In most cases, the firmware does
not directly load the kernel image. Instead, the firmware loads a small preliminary
program called the "boot loader" and this program, which can be much more
sophisticated than the firmware, is able to locate and load the actual kernel image. For
example, the GRUB bootloader is commonly used with Linux installations. Also, modern
ECE357:Computer Operating Systems Unit 7/pg 3 ©2024 Jeff Hakner

hardware generally uses a boot method called UEFI (Universal Extensible Firmware
Interface) instead of the very old BIOS method. This doesn’t materially affect the rest of
our discussion.
• BIOS runs in supervisor mode, with address translation turned off (so all addresses used
in code are physical) and with interrupts disabled. Hand-off to the kernel image is made
in this condition. Thus the first part of the kernel’s initialization routines have been
compiled and linked with a specific address in mind which corresponds to the physical
load address. The next task is to de-compress the image which is in one place in physical
memory and write it to another location. For the (32-bit) x86 architecture, this is physical
address 0x00100000. A jump is then made to that physical address.
• At around this time, the kernel turns on virtual address translation. On the x86 32-bit
architecture, the Linux kernel uses virtual addresses below 0xC000000 exclusively for
user-level processes, and reserves the 0xC000000-0xFFFFFFFF range exclusively for
kernel memory. We’ll see how this is exploited to accomplish fast access to kernel data
structures. Note: All parts of the kernel see the same virtual address mappings. To
get the kernel running on virtual, rather than physical addresses, page tables are created
which map 0xC00000000 to physical address 0x00100000, and so on for higher
addresses, until the entire static part of the kernel (text, data, bss) has been mapped. The
static bss pages are explicitly zero-filled at this time. A temporary stack is set up to allow
the kernel initialization code to run.
• The MMU is turned on and thereafter the kernel runs on virtual addresses.
• The physical page frames that are not part of the static part of the kernel become the
page frame pool. The kernel is able to dynamically allocate memory for itself out of this
pool, in addition to satisfying user-level page fault demands.
• The interrupt vector tables are appropriately initialized (see "Interrupt Handling in
Hardware" below).
• The kernel then calls initialization routines for its various subsystems, e.g. virtual
memory, file systems, scheduling. They allocate dynamic kernel memory as needed. An
inventory of hardware devices is taken and the initialization routines for each device are
called.
• Interrupts are now enabled.
• Now the kernel is able to provide its services, but has no user-mode processes to which
they can be provided yet. The device which contains the root filesystem is determined
from information passed during boot loading, and the root filesystem is mounted (using
an internal version of the mount system call)
• The kernel creates process #1 and causes that process to exec/map a specific binary
program (/sbin/init), running as the super-user (uid and gid = 0). This "init"
program must reside on the root filesystem, and is the bridge between kernel initialization
and user-mode initialization. It remains running for the life of the system.
• pid 1 is marked as ready to run and the kernel relinquishes control to the process
scheduler which immediately selects pid 1 to run (since there is nothing else to do!)
Control now exits kernel mode and the init process begins to execute at user level. Of
ECE357:Computer Operating Systems Unit 7/pg 4 ©2024 Jeff Hakner

course control immediately re-enters the kernel as init page-faults in its first page of
executable text.
• First, init runs a series of programs and shell scripts which complete the initialization
of the system. This includes locating and mounting other filesystem volumes as needed.
Then, init spawns other user-mode programs which ultimately provide access to the
machine for the end-user.

Note: on more recent Linux systems, the straightforward, traditional /sbin/init is replaced
by a more unwieldy /bin/systemd. Either way, it is pid #1 which is responsible for user-
level system startup

Aside: In some configurations, user-mode tools are needed to locate and mount the root
filesystem. For example, it may be on a software RAID volume, or it may be a networked
filesystem. To support this, the boot sequence can include an "init ramdisk" image which
is also loaded into memory, and provides a temporary in-memory root filesystem that
contains the needed tools. The "real" root filesystem is then mounted and the init ramdisk
is thrown away

When booting up a multiprocessor machine, the BIOS starts the machine in single-
processor mode. Fairly late in the kernel’s initialization, the other processors are enabled.
They are initially given the idle task to run. Once additional processes other than pid 1
are runnable, the other CPUs will start to pick up their loads.

Re-entering the kernel

We have seen that there is one initial entrypoint to the kernel, through the boot process.
Thereafter, the kernel relinquishes control to user-mode processes. The kernel is entered
again only through the interrupt / exception mechanism.
We can broadly divide re-entry into two categories:
• Synchronous: When the interupt is raised because of the execution of a specific
instruction, that is said to be synchronous. The Linux kernel calls these synchronous
events Exceptions. Other similar terms are "Fault" and "Trap". Exceptions are said to
occur in the context of a particular running process. We’ll see that the kernel associates a
unique kernel-mode stack for each process (or to be precise, for each user-level thread),
and so we can think of the kernel as containing one kernel-mode thread of control for
each user-level thread, which gets activated when an exception is raised. Thus the
handling of the exception in the kernel is in effect a controlled extension of the user-level
thread into the kernel.
• Asynchronous: These events are not correlated with a specific instruction, but come
from hardware devices. The Linux kernel calls these "Interrupts." Asynchronous
interrupts are handled on whatever kernel-mode stack (see below) happens to be active at
ECE357:Computer Operating Systems Unit 7/pg 5 ©2024 Jeff Hakner

the time.
In most operating systems literature, synchronous entries to the kernel are said to be "top
half," and asynchronous are said to be in the "bottom half". Unfortunately the Linux
kernel uses the term "bottom half" in an inconsistent manner, so we will avoid top/bottom
half terminology to reduce confusion.

Exception Types

• Many exceptions are the result of program error. These are also called faults. A fault is
defined as a condition which prevents the current opcode from being executed. The
default behavior of the kernel for most faults is to post a signal to the offending process.
Examples of such faults include: integer divide by 0 and illegal instruction.
• A Page Fault exception occurs when the hardware was unable to perform a memory
access. On the x86-32 architecture, this same fault type is used both for when a valid
PTE can not be found and for a protection violation, the distinction between the two
being passed to the fault handler by hardware in a specific register. Page Faults are quite
normal and not necessarily a sign of program error, as discussed in Unit 5.
• The kernel uses some exceptions to be clever and efficient about things. For example,
the kernel usually does not bother with floating point registers. It turns the floating point
unit off, and then if the process tries to use a floating-point operation, it raises an
exception and the kernel knows that it needs to worry about it for that process. So these
exceptions are like a Page Fault, in that the kernel’s handler must determine whether the
exception is benign or the result of an errant program.
• When faults are resolved successfully by the kernel, control returns to user level and the
previously faulted instruction is re-tried, and this time should succeed.
• A System Call is a particularly important type of exception, and we’ll spend some time
looking at system calls in detail.

Interrupt Types

• Hardware devices will raise interrupts to indicate that they have entered a ready or non-
ready state, that data are available, or that a data transfer operation has completed. Each
hardware device is assigned an Interrupt Request (IRQ) number (some devices may share
IRQs).
• The kernel programs a chip on the motherboard to deliver a periodic "heartbeat"
interrupt, e.g. every millisecond. This Periodic Interval Timer (PIT) is very important
to the scheduler. Other time-based events in the kernel (e.g. network protocol
retransmission timers) are derived from this clock source. (Note: on modern Linux
kernels supporting multi-CPU systems, this is complicated by the fact that different CPUs
might be running at different clock rates. A more sophisticated mechanism is used, but
the end result is the same: each CPU gets a periodic interrupt to drive the scheduling
subsystem)
ECE357:Computer Operating Systems Unit 7/pg 6 ©2024 Jeff Hakner

• On a multi-processor system, one processor may execute a series of instructions which

cause a hardware interrupt to be posted to the other processors. This is known as an
Inter-processor Interrupt (IPI). The Linux kernel uses this in several specific cases: (a)
To shut down the system (b) to request that the other processors re-examine their task
scheduling (covered in next unit) (c) to purge stale TLB entries.
• Hardware will deliver interrupts to notify the kernel of important system events, such as
a hardware failure, battery power failure, etc. If continued system operation is possible,
the kernel will ultimately translate these into user-mode events. Otherwise the kernel will
print out an informative message and halt the system.

Important Shared Processor/Kernel Data Structures

The behavior of the CPU is hard-wired (or programmed in its microcode) and thus
immutable to any software code including the kernel. The kernel is compiled for a
specific architecture and knows about data structures that the processor expects to see in
memory, which control how the processor reacts to certain events. Because the processor
itself must respond to virtual memory lookup failures (page faults) these data structures
are typically defined at physical, rather than virtual memory addresses. We’ve already
seen the Page Table data structure. On the X86-32 bit architecture, the additional data
structures relevant to interrupt/exception handling are:
• The Interrupt Descriptor Table (IDT) is an array of 256 interrupt descriptors. The
physical address of this array is contained in the special idtr register, which can only be
accessed through special privileged instructions LIDT and SIDT. Although the structure
of the IDT is baroque because of legacy X86 segmented memory model issues, it is
basically a table of program counter (%eip) addresses, which are the handler addresses (
these are specified as virtual addresses) for each of the 256 possible interrupt or
exception/fault vectors. Note that the X86 hardware doesn’t distinguish between
interrupts and exceptions (faults) with respect to the IDT. The kernel has pre-initialized
all IDT entries to point to valid kernel functions before allowing user-level code to begin
execution after system boot. The IDT is generally not modified after this.
• The Task State Segment (TSS) is a small data structure which resides inside a larger
data structure known as the Global Descriptor Table (GDT). The Linux kernel does not
utilize all of the functionality that is available with the TSS, but uses it simply to control
the stack pointer address that will be used during interrupt/exception handling.
Furthermore, the TSS only comes into play when the interrupt/exception originates in
user mode. The special tr register controls the location of the TSS and is generally set
once. On multi-processor systems, there is a TSS for each processor.
ECE357:Computer Operating Systems Unit 7/pg 7 ©2024 Jeff Hakner

(Phys addr) IDT (incomplete example) Kernel Text

%idtr C0000000
(Virt
#0
Addr)

system_call:

CPU
Exceptions #6
divide_error:
#14

page_fault:
#31

#32
invalid_op:

timer:
HW IRQ
& user−
defined
#128

#255

Interrupt and Exception handling in hardware

On the 32-bit X86 architecture, the following steps are taken by the processor in response
to an interrupt or exception:
• Each interrupt or exception has an associated vector between 0 and 255. This is used to
index the IDT, from which the handler address is fetched. The handler address is the
virtual address of the first opcode of the associated handler function within the kernel.
This function is generally written in assembly language. The kernel has made sure that
all 256 entries have valid handler addresses. In the X86 architecture, vectors 0-31 are
reserved for processor-generated exceptions, while 32-255 are user-defined and typically
used for I/O devices, with a straightforward mapping between IRQ numbers and IDT
vectors. Note that the system call vector 128 (decimal) falls within this range.
• When the processor is currently running in User mode, it fetches the TSS. Within the
TSS is the new value (virtual address) of the stack pointer. This will be the kernel-mode
stack, as described later. The kernel maintains a separate kernel stack for each task (this
ECE357:Computer Operating Systems Unit 7/pg 8 ©2024 Jeff Hakner

includes each thread in a multi-threaded program), and makes sure the correct stack
address was placed within the TSS before performing a context switch. If however, the
processor was already in Supervisor mode at the time of the exception/interrupt, then the
stack pointer is already within a valid kernel stack, and the TSS is not used by the
processor.
• The processor now transitions to Supervisor mode (if not already in it), and the %esp
register is pointing within the kernel-mode stack for the task.
• Note that the kernel-mode stack is empty if we are transitioning from User Mode, but if
we were already in kernel mode when the interrupt/ exception arrived, the kernel stack
will contain the stack frames of the currently executing kernel code for this processor.
• The processor pushes onto the kernel-mode stack the value of the old stack pointer
register %esp, the flags/status register %eflags, and the program counter register
%eip. The %cs and %ss registers (which have to do with how code and stack memory
are accessed) are also saved. This set of 5 registers is critical because it is what the
hardware relies on to determine where the stack is, and where the next instruction is.
• For certain types of exceptions, the processor pushes an error code onto the stack. This
is used, e.g., to distinguish between a page translation fault and a page protection fault.
• The program counter location of the handler, which was fetched from the IDT, is now
loaded into the eip register, effecting a jump to that entrypoint.
• Recall that the kernel controls the page tables, and establishes them for each process.
The kernel virtual address space, 0xC0000000-0xFFFFFFFF, is always present in these
page tables, but the User/Supervisor flags in the Page Table Entries are set so user
processes can not directly access the kernel memory (this would be really bad....however
user processes running as root can use a pseudo-device /dev/kmem to access kernel
memory as if it were a file). This page table arrangement means that as soon as control
enters the kernel, the shared kernel address space is immediately available. Aside:
Recently, security vulnerabilities were discovered in the way that many X86-based
processors handle what is known as "speculative execution". These vulnerabilities can
allow a user-mode process to potentially learn the contents of protected kernel-mode
address space. A series of patches have been applied to both Linux and BSD-based
kernels as well as Windows to address this. These patches generally do away with the
practice of having the entire kernel’s page tables sitting there (relying on the U/S bit of
the PTE to protect them) and instead keep a very minimal set of kernel page tables
resident, and switch to a new set of page tables upon kernel entry. However, this also
comes with a significant performance impact from the spate of TLB misses. As a result,
many system administrators have actually rolled back these patches. A developing
situation, stay tuned!

Any interrupt or exception handler ultimately terminates by executing the special iret
instruction, which:
• Pops the 5 registers from the stack which had been saved by hardware.
• Resets the privilege level to the previous value (because the %eflags register has been
ECE357:Computer Operating Systems Unit 7/pg 9 ©2024 Jeff Hakner

restored from the stack)

• Resumes execution with the stack pointer now set back to the original stack. Interrupts
occur between instructions, and so execution resumes at the next instruction. For
exception returns, if the instruction caused a fault (e.g. Page Fault) then that instruction is
re-started, otherwise (e.g. System Call) the next instruction is executed.

user−level processes
0x0000 0000

0xBFFF FFFF
0xC000 0000

kernel−only thread

kernel−mode per−thread stacks

shared kernel data structures

0xFFFF FFFF

Kernel control flow paths & pre-emption

Now that we’ve seen the types of kernel entrypoints, lets continue further to talk about
how control flows within the kernel and back out again.
At any given moment, a CPU is either:
• Executing a user-level thread (process)
• Executing a kernel thread. This is identical to a user-level thread except that the virtual
address space is entirely within the kernel.
• Handling an exception
• Handling an interrupt
• Temporarily halted because there is nothing else to do at the moment. We call this the
ECE357:Computer Operating Systems Unit 7/pg 10 ©2024 Jeff Hakner

"Idle" state.
While handling an exception or interrupt, it is possible that another exception or interrupt
will arise. Thus the flow of control in the kernel is multiply re-entrant, and nested. E.g.
while handling a system call (exception), a keyboard interrupt is received. The exception
handler is suspended (by hardware) and the keyboard interrupt handler begins to run.
While servicing that, a disk interrupt is received and handled. When the disk handler
finishes, the keyboard handler resumes and finishes, then the system call is allowed to
continue.
To simplify kernel programming and synchronization issues, the Linux kernel is carefully
coded so that kernel code never produces exceptions, with one caveat: During the
handling of a system call (exception), a Page Fault (exception) may be raised. Thus we
can say that exceptions will never arrive during interrupt handling, and will never arrive
during exception handling other than a Page Fault during a System Call. Furthermore,
the kernel code generally avoids incurring a Page Fault during a system call. An interrupt
may occur at any time, because it is asynchronous.
Pre-emption means a potentially non-voluntary task switch between one thread (task) to
another. We’ll have more information on pre-emption in the next unit. Pre-emption of a
user-mode task by another user-mode task is straightforward and generally happens as
control is returning from the PIT (Timer/Clock) interrupt handler back to userland. A
voluntary context switch may also happen because a system call encounters a blocking
condition and therefore explicitly relinquishes the processor.
With regard to pre-emption of a task which is already executing in kernel code, this
introduces complexity in the kernel’s coding, especially with regard to locking. The
traditional default configuration of Linux kernel is such that pre-emption does not take
place in kernel code -- the pre-emption only occurs as the task is about to return to user
mode from a system call, fault or interrupt.
When kernel pre-emption is enabled that changes the kernel code and therefore this is not
a mode that can be toggled on the fly or even selected at system startup. The kernel must
be compiled with pre-emption on or off. Here is an example illustrating the difference:
• Control has entered the kernel synchronously, e.g a system call or page fault, from task
"A"
• While handling the system call or fault, an interrupt is received
• The handling of the interrupt causes another task "B" with better scheduling priority to
become READY (e.g. disk I/O has completed).
• With pre-emption OFF, the interrupt handler completes and control resumes in the
kernel code, handling the system call in the context of task "A". Once the system call
finishes, upon return to user mode, pre-emption takes place (see below under "Deferred
Return from System Call").
• With pre-emption ON, upon return from the interrupt handler to the kernel system call
handler, the NEED_RESCHED flag (see below under "Deferred Return from System
ECE357:Computer Operating Systems Unit 7/pg 11 ©2024 Jeff Hakner

call") is noticed and a context switch takes place. Execution of task "A" is suspended in
the kernel at the point in the system call handler where the interrupt happened to arrive (it
could be any arbitrary point). Task "B" gets the CPU. At some later time, task "A" gets
the CPU again, and the system call resumes.
The advantage of full kernel pre-emption is latency: a high priority task which wakes up
need not wait for the system call to complete before it gets the CPU. The disadvantage is
the added code complexity and locking, which can also introduce a slight performance
penalty because it burns up CPU cycles. A compromise between these two extremes is
"voluntary pre-emption" where the kernel is coded with specific pre-emption points in
system calls or fault handlers where it might take a "long time" (but not indefinite) to do
something. At these points the NEED_RESCHED flag is explicitly tested and a context
switch is voluntarily made if needed.

Making a System Call

We are now ready to discuss one particular kernel control path: the system call. We’ll
illustrate a fairly simple one: getuid(). A user-level program calls getuid() as an
ordinary C function. This function is provided by the standard C library which is linked
with all C programs. The getuid() function is written partially in C and partially in
assembly language. This provides the "glue" between the user-level domain of the C
program, and the kernel’s system call API.

The 32-bit X86 API

Argument passing between user-level functions in C (X86-32) is via the stack. However,
during a system call, the processor will be switching to a different, kernel stack. The
user-level program obviously can not write to the kernel stack. Conversely, although the
kernel can access the user-level stack memory, it would rather not, since that might create
a Page Fault. The solution is to pass arguments to system calls in registers. The
convention used on the x86-32 architecture is that the 32-bit arguments are passed such
that the first argument is in the ebx register. The second is placed in the ecx register.
The third through sixth are placed in registers edx, esi, edi, ebp respectively. If
the number of arguments to the system call exceeds 6 (rare), then all of the arguments are
placed on the user-level stack and the kernel receives just the address of that argument
block.
Each available system call is assigned a specific number by the kernel. A given system
call number also has a specification for what arguments are expected. The standard C
library must therefore be compiled, at least in part, with an eye towards the exact
architecture on which the program will be run. As the kernel evolves towards higher
version numbers and new system calls are added, backwards-compatability with older
code must be maintained. System call numbers are not re-used. If it is necessary to
ECE357:Computer Operating Systems Unit 7/pg 12 ©2024 Jeff Hakner

change the semantics of a system call, a new call is defined with a new number, and the
kernel provides both versions. In the Linux 2.6.15 X86-32 version kernel, there were 294
defined system call numbers, by version 2.6.23 that number had grown to 325, and by
2.6.32 it had reached 338. There is no end in sight with 377 system calls in 4.9.34!
The system call number is passed to the kernel in the eax register. Now the user-level
getuid function is ready to actually make the system call. There are no arguments to
this particular system call, so the system call # corresponding to getuid (199 decimal) is
put in eax and then a special instruction is used. There are two ways to make a system
call: using the INT $0x80 software interrupt exception instruction, or using the
SYSENTER instruction. We’ll follow the former example. The reader is referred to the
book Understanding the Linux Kernel for more information on the SYSENTER method.

The INT $0x80 instruction causes an exception to be raised with vector code 128. The
hardware then vectors to the kernel entrypoint in the IDT for vector 128, which the kernel
had previously initialized to point to the (symbolic) kernel virtual address
system_call. The code below is a simplified version of
/usr/src/linux/arch/x86/kernel/entry.S:
ECE357:Computer Operating Systems Unit 7/pg 13 ©2024 Jeff Hakner

system_call:
pushl %eax #contains system call #
SAVE_ALL #macro to push important
#registers on the stack
movl $0xFFFFE000,%ebp #mask SP to get to
andl %esp,%ebp #thread_info
testw $_TIF_SYSCALL_TRACE,TI_FLAGS(%ebp) #test thread_info.flags
jnz syscall_trace_entry #for syscall tracing on
cmpl $nr_syscalls, %eax #bounds check, unsigned compare
jae syscall_badsys
call *sys_call_table(%eax,4) #indirect addressing
movl %eax,PT_EAX(%esp) #poke return code into EAX slot
syscall_exit:
cli #temporarily mask interrupts
movl TI_flags(%ebp),%ecx #get flags field of thread_info
#ALLWORK_MASK includes all TIF_XXX thread info flags that indicate more
#work might be needed before returning to user space
testw $_TIF_ALLWORK_MASK,%cx #see if any flags are set
jne syscall_exit_work #if so more work before exit
restore_all:
RESTORE_REGS #pop registers from stack
addl $4,%esp #discard original eax
iret #return from interrupt (restores mask)
syscall_badsys:
movl $-ENOSYS,PT_EAX(%esp) #poke error return code
jmp syscall_exit #simplified
syscall_exit_work: #simplified
testb $_TIF_NEED_RESCHED,%cl #flags already in ecx
jz work_notifysig #if clear, must be signal pending
work_resched:
call schedule #otherwise, task switch
#We have regained the CPU after a possible task switch
cli #avoid missing an interrupt
movl TI_flags(%ebp),%ecx #check flags again
andl $_TIF_WORK_MASK,%ecx
jz restore_all #OK to return to userland
testb $_TIF_NEED_RESCHED,%cl
jnz work_resched #still need rescheduling
work_notifysig:
#This is where any pending signals are noticed, and we
#jump to code which potentially terminates the process or
#creates a signal handler stack frame in user-mode and causes
#control to jump to it upon return to user mode.
#We won’t be looking at this code in detail
ECE357:Computer Operating Systems Unit 7/pg 14 ©2024 Jeff Hakner

The SAVE_ALL macros pushes all of the registers that the kernel is likely to clobber. In
conjunction with the hardware pushes and the instruction pushl %eax just above, the
kernel stack now looks like this:
0x00(%esp) - ebx general-purpose registers,
0x04(%esp) - ecx saved by
0x08(%esp) - edx SAVE_ALL
0x0C(%esp) - esi "
0x10(%esp) - edi "
0x14(%esp) - ebp "
0x18(%esp) - eax " (will be syscall return value)
0x1C(%esp) - ds "
0x20(%esp) - es "
0x24(%esp) - fs "
0x28(%esp) - orig_eax orig system_call number (used for syscall restart)
0x2C(%esp) - eip program counter in user land, saved by hw
0x30(%esp) - cs code segment register, saved by hw
0x34(%esp) - eflags flags register, saved by hw
0x38(%esp) - oldesp user-land stack pointer, saved by hw
0x3C(%esp) - oldss user-land stack segment reg, saved by hw

The next instruction places a mask into register ebp and applies that mask to the stack
pointer esp. Normally, in C programs, the ebp register has a very important function as
the local frame pointer. However, we are still in an assembly language entrypoint, and
thus ebp is available as a scratch register. The purpose of this masking operation deserves
a considerable detour:

Kernel-mode stack and the thread_info structure

Recall that the kernel allocates an individual stack area for each user-level thread of
control. On X86-32, the kernel stack is only two pages (8K). On X86-64 (as of Linux
4.X kernels) the kernel stack is either 4 or 8 pages (16K or 32K). For simplicity, we will
assume an 8K stack. While this seems rather small, bear in mind that every bit of code in
the kernel is carefully controlled. Large amounts of local variable space are discouraged,
recursive programming is never used, and the depth of function call nesting rarely gets
obnoxious. Therefore, the kernel programmers can rely on the fact that this 8K stack will
not overflow.
Now, to compound this trickery, the kernel sticks a small data structure called struct
thread_info at the limit of the stack, i.e. at the lowest memory address. The kernel
stack pointer value stored in the TSS memory area by the kernel for each process/thread
is the highest address of the allocated stack area. On entry to the kernel from user-mode,
the kernel stack is empty, and the kernel stack pointer is thus furthest away from this
thread_info data structure, and as kernel functions are called, the stack pointer gets
closer to it, but should never be in any danger of over-writing it.
This arrangement of memory addresses means that on entry to the kernel, a simple
masking operation of the stack pointer %esp yields the beginning of the
ECE357:Computer Operating Systems Unit 7/pg 15 ©2024 Jeff Hakner

thread__info structure. That address is kept in the %ebp register for a while. Also,
at any point in the kernel, the inline function current_thread_info() performs
that same masking operation. Let’s look into the thread_info struct
(/usr/src/linux/arch/x86/include/asm/thread_info.h):
/* On recent Linux kernels */
struct thread_info {
unsigned long flags; /* low level flags */
};

There isn’t much room on the kernel stack, so thread_info is pretty small. What is
kept in there is only what is needed by the assembly language entry/exit routines. The
kernel needs to keep a lot more information about a process, so it allocates another data
structure called struct task_struct which we have seen in other units.
On older Linux kernels, a pointer to the task_struct was kept in thread_info
along with a lot of other things. All of this has been evicted to other places (e.g. the
current pseudo-variable in the per-CPU private memory area is the place to get the
task_struct pointer) leaving just the flags word in thread_info.

Per-CPU variable area

Many kernel variables that pertain only to a given CPU are contained in a special area of
memory called the per-CPU variable storage area (alternatively sometimes called the
this_cpu area). The most extensively used per-cpu variable is current which is
actually a macro which accesses a particular variable slot in the per-CPU area. The slot
stores a pointer to the task_struct for the task currently running on this CPU.
Although it is really a macro, it appears like current is declared as struct
task_struct *current; and is used that way throughout kernel source code.

Now, back to our system call, already in progress

Having computed the address of the thread_info structure, the next line of assembly
examines a bitwise flags word. There are quite a few TIF_XXX flags defined. Of
interest here is a tracing hook: if the TIF_SYSCALL_TRACE flag is set, then the thread
making the system call is being traced (e.g. through the strace command) and the
kernel diverts to an alternate entry which will record the parameters and the return value
of the system call and pass these back as events to the tracer. We won’t go down that
road, however.
The next two lines are very important, as they illustrate data validation. Recall that the
user-level process is completely untrusted as far as the kernel is concerned. If the system
call number passed by the user is greater than the highest system call number (or
ECE357:Computer Operating Systems Unit 7/pg 16 ©2024 Jeff Hakner

negative...think about two’s complement) then we jump to the syscall_badsys code.

This places an error code into the slot in the stack where the user’s eax register was
saved. Upon exit from the system call handler, this value will be popped into %eax.

System Call Return Value

On the X86-32 architecture, when a function returns an int (or other 32-bit value such as
a pointer), that value is returned in the eax register. All valid return values from kernel
system calls are positive or zero. A negative value is used to indicate an error, and that
value is -error_number.
Therefore, an invalid system call will return the value -ENOSYS via the %eax register.
Now, an unfortunate history lesson crops up. The UNIX API, for reasons that may be
lost to time, specifies that when system calls fail, they should set the global variable
errno, and return -1. So the user-level glue function does this (pseudocode):
int generic_system_call(arguments...)
{
put arguments into registers
put system call # into %eax
INT $0x80
if (%eax<0)
{
errno= -%eax;
return -1;
}
return %eax;
}

Kernel system call hand-off to C

However, we know in this case that a valid system call number was used. We have been
tracing out assembly language code, but most of the kernel is written in C. The
instruction
call *sys_call_table(0,%eax,4)
uses the X86 indexed register offset indirect addressing mode as follows: The %eax
register (containing the system call #) is multipled by 4 (the sizeof a pointer), and that
offset is added to the base address of a table of function pointers sys_call_table.
The result of that addition is used to fetch 4 bytes from memory, and finally that result is
the address which is called as a subroutine. It is as if:
/* Declare and initialize array of function pointers */
/* The system call names and positions are purely an example */
void (*sys_call_table[])()={sys_open,sys_read,sys_close,....};
/* Hand-off to syscall handler, pseudocode */
(*sys_call_table[%eax])(args);

When the kernel is compiled, the sys_call_table is filled in with the name (i.e. the virtual
ECE357:Computer Operating Systems Unit 7/pg 17 ©2024 Jeff Hakner

address) of each C function which implements each system call. It is the convention that
a system call which is known as XXX to the user is implemented by a kernel function
called sys_XXX.

asmlinkage long sys_getuid(void)

{
return current->cred->uid; /*cred is the credentials info */
}

This system call is coded purely in C, and is in fact found in the architecture-neutral
portion of the kernel source code (at /usr/src/linux/kernel/timer.c). The only unusual
thing is the compiler directive asmlinkage, which indicates that this function is being
called directly from assembly language, and thus the the preceding stack frame isn’t a
normal C stack frame. Arguments in C are pushed on the stack. Note that the top of the
stack, after the SAVE_ALL macro, contains the user’s ebx, ecx, edx, edi, esi, ebp
registers, i.e. the 6 allowable system call arguments, in sequence. If you look in the
source code you won’t find the exact code above, in which some of the macros have been
expanded out for better readability.
Note that the uid is a property of the currently running process, and thus is fetched from
the task_struct structure via the current pointer, which contains a pointer to the
struct cred that contains information such as uid, gid, etc.
Let’s take a look at another system call (time) which passes an argument:
asmlinkage long sys_time(time_t __user * tloc)
{
time_t i;
struct timespec tv;

getnstimeofday(&tv); //Kernel keeps time in ns resolution

i = tv.tv_sec; //tv_nsec ignored, therefore rounds DOWN

if (tloc) {
if (put_user(i,tloc))
i = -EFAULT;
}
return i;
}

Recall that time returns the time as an int, but also accepts a pointer to an int. That
argument comes in to the system call as tloc in the code above, and if not NULL, the
kernel takes the value and writes it into the user’s memory using the kernel function
put_user. Note that this might potentially involve paging-in the required user-space
memory. If the user supplied an invalid memory address, put_user will catch that and
return a non-zero value, which will cause the system call to fail with EFAULT.

Returning from a system call

Once the system call specific handler routine sys_XXX returns, the system call return
value is in the %eax register (because that’s where C function return values are placed by
the compiler). Looking back at the system_call assembly language routine, we see
this return value is written to the kernel stack in the location where the eax register will
be popped when returning back to user mode.
Now at label syscall_exit interrupts are temporarily masked (on multi-processor
systems, this applies to the local processor only). The reason for this is to protect the
next few testing and branching instructions as a critical region. Recall that the
thread_info structure address is in %ebp. The next line of assembly code fetches the
bitwise flags into register %ecx and tests to see if any flags are set which would
indicate that, instead of returning directly to user mode, some other action might be
required. Let’s say for a moment that those flags are clear. Then the code at
restore_all uses a macro RESTORE_REGS to pop all of the registers which had
been saved by SAVE_REGS. The system call return value is now in %eax (regardless of
how control reached restore_all) and the extra stacked copy of eax which
contained the system call number is simply discarded.
Now the iret instruction is executed. This causes the hardware to restore the eip,
cs, eflags, esp and ss registers. The result is that execution resumes in the user
process, with the stack pointer back on the user’s stack. All of the registers are exactly as
they were when the user process executed the INT $0x80 instruction, with the
exception of the eax register which now holds the system call return value. The
restoration of the eflags register means the privilege level is returned to user mode, and
the interrupt mask is restored to the normal value (which when running in user mode is to
allow interrupts).

Deferred return from system call, interrupt or fault handler

There are two major reasons why the CPU would not return directly back to the user-
mode program upon completion of a system call, an interrupt handler, or a fault handler:
(1) another task "may" be "better" to run, as determined by the scheduler. (2) a
deliverable signal is pending for our process.

Pre-emptive context switch

The scheduler system may from time to time determine that the current task is not
necessarily the "best" task to be running on our CPU. This could happen from within the
tick interrupt handler (because the current task has used up its timeslice) or when, during
our system call, another task has woken up and it has better priority. The Linux kernel is
not fully pre-emptive, unless the kernel code has been built with support for full kernel-
mode pre-emption. Pre-emptive context switches can always happen at the moment that
control is about to return to user mode. (of course, if kernel code in a system call
ECE357:Computer Operating Systems Unit 7/pg 19 ©2024 Jeff Hakner

encounters a blocking condition, such as reading from an empty pipe, that causes an
immediate context switch). The key is the bitwise flag TIF_NEED_RESCHED which is
part of the thread_info structure on the kernel stack.
Referring to the entry.S code, if the TIF_NEED_RESCHED flag is set, then at
work_resched the kernel scheduler function schedule is called. We’ll see in unit 8
that this function causes a context switch, and the original process appears to be frozen at
the instant of having called schedule from work_resched. At some later time, the
original process is selected to run again, and execution resumes at that frozen point.
Control returns from schedule, and then the next few lines are identical to those we
have already seen: they check to see if anything else has come up, or whether it is OK to
return to user mode. If we were to look at other exception/interrupt code paths, such as a
page fault, or clock tick interrupt, we would see similar code to examine the
TIF_NEED_RESCHED flag and trigger a task switch.

Signal Delivery

It could also be that while the task was in kernel mode, a signal became pending for that
task. Perhaps the system call changed the signal mask and a previously received signal
became un-blocked, or perhaps the system call itself caused a signal to be raised, or
perhaps a signal from another process just happened to come along while we were in the
system call. Signal delivery, as discussed in Unit #4, only happens at the moment that
control was about to return from kernel mode back to user mode.
work_notifysig is called when the threadinfo flags indicate that a non-blocked
signal is pending for the process. The kernel then has some work to do to deliver the
signal to the process. As we’ve seen in Unit #4, that might mean terminating the process
(the kernel calls do_exit on behalf of the process) or invoking the signal handler. In
the latter case, the kernel has to modify the user-mode stack to make it appear that the
handler was called from the point in the user-mode program (%eip value) where control
had entered the kernel, and then change the user-mode registers saved on the kernel stack
to cause control to return to user mode at the %eip location of the signal handler, rather
than the point where it left off!
The orig_eax stack slot is always the original system call number that got us into the
kernel. The eax slot will contain the syscall return value after the system call has run.
orig_eax is used in signal handling for "restarted system calls" but we won’t be looking at
that code.
We’re only tracing out the system call code path in the kernel. However, very similar
code is seen on the path dealing with interrupts. In particular, that means that signals are
noticed upon return to user mode from an interrupt handler. When signals are posted to a
process that is currently running on a different processor, the interprocessor interrupt
(IPI) hardware mechanism is used to get that processor’s attention and force immediate
ECE357:Computer Operating Systems Unit 7/pg 20 ©2024 Jeff Hakner

signal action.

The X86-64 API

Although the examples herein are for the older 32-bit X86 API, also known as i386, we
should examine how things change when using the 64-bit API, known as X86-64.
At user level, the first 6 arguments to a function are passed in registers, not the stack. The
argument slots are registers %rdi,%rsi,%rdx,%rcx,%r8,%r9. The X86-64 API does not
use the INT 0x80 instruction, but a new instruction called SYSCALL, which is somewhat
faster because it avoids stack-write memory accesses. It performs the following steps in
hardware:
• Save the return address (%rip register) in register %rcx (overwriting its value)
• Save the current value of the flags register (%rflags) in %r11 (overwriting it too)
• Make some adjustments to the %cs and %ss registers to allow kernel code to execute
properly.
• Set the processor to privileged mode
• Load the value of a special register (MSR_CSTAR) into %rip. This privileged register
has been pre-loaded by the kernel to point to the system call entrypoint.

Note that in this 64-bit API, the hardware does not switch stacks nor write anything to the
stack when performing a system call.
Because the %rcx register is clobbered by the SYSCALL instruction, the kernel’s system
call convention specifies that the arguments to the system call are passed in registers:
%rdi,%rsi,%rdx,%r10,%r8,%r9. The system call number is passed in %rax. Therefore,
the user-level "glue" code takes the 4th argument in %rcx and moves it into %r10, and
adds the system call number in %rax. The kernel’s system call assembly-language entry
code will put the 4th argument back into %rcx from %r10 prior to dispatching to kernel C
functions via the system call table.
We are however still on the user-mode stack. Upon entry via any kernel entrypoint, the
kernel uses the privileged SWAPGS instruction to interchange the user-mode value of the
%gs register with a "hidden" %gs register that the kernel has pre-configured to point to
the per-CPU area for this CPU. By using an obscure addressing mode of the X86 known
as "segment override" the kernel can then easily access this memory. One of the per-CPU
variables is the kernel-mode stack pointer. This value is now loaded by the kernel into
%rsp and the kernel saves the user-mode registers on the kernel stack. Prior to return to
user mode, the kernel makes sure that the correct kernel stack pointer is stored in the per-
cpu scratchpad area, executes the SWAPGS instruction again to save the hidden gs
register, and restores the user-mode %rip and %rflags values (which were saved on the
kernel stack) into %rcx and %r11 respectively. The SYSRET instruction is then used to
reverse the effects of SYSCALL and return control to user mode. Note that the TSS is not
used at all in this 64-bit API.
ECE357:Computer Operating Systems Unit 7/pg 21 ©2024 Jeff Hakner

If the above description of X86-64 caused pain and/or confusion, do not panic. We will
conduct the rest of our examples in 32 bits.

Summary of 32 and 64 bit X86 Syscall APIs

32 64
Opcode INT $0x80 SYSCALL
syscall# %eax %rax
arg1 %ebx %rdi
arg2 %ecx %rsi
arg3 %edx %rdx
arg4 %esi %r10
arg5 %edi %r8
arg6 %ebp %r9
retval %eax %rax

Appendix - X86 Architecture and Assembly Language

The following material comes from ECE466 -- Compilers and is intended to assist with
understanding the assembly language portion of the kernel.
X86 refers broadly to a family of Intel (and compatible) microprocessors manufactured in
the last 25 years or so. It is also called the X86 architecture by Intel. The first 32-bit X86
processor was the 80386. X86-64 is a 64-bit extension to X86. Intel’s is a CISC
architecture which is a direct linear descendent of the very first microprocessor, the 4004
(a 4-bit product).
There are many who find the X86 architecture to be a dinosaur, and a badly designed one
at that, which should have long ago become extinct. However, IBM’s choice of it for its
first personal computer sealed its fate as the most popular processor architecture.
The X86-64 architecture extends the 32-bit X86 to use 64-bit registers, while retaining
backwards compatibility with 32-bit X86 code.
Below is a summary of the X86/X86-64 architecture The reader is detoured to the official
reference manuals for full details.

Intel vs UNIX assembly syntax

The Intel documentation uses the Intel standard assembly language syntax, but the UNIX
assembler as follows a different convention (which is consistent across different
ECE357:Computer Operating Systems Unit 7/pg 22 ©2024 Jeff Hakner

processors). In the UNIX syntax, an identifier is unambiguously an assembler symbol.

To reference a register, its name is prefixed with a percent sign, e.g. %eax. To use a
symbol as an immediate value, the dollar sign is used as a prefix. Otherwise the symbol
means the contents of that address. Register indirect addressing modes are indicated by
brackets or parentheses. UNIX assembly instructions are opcode src1,src2,dst for
3-address instructions or opcode src,dst for 2-address. (Note that Intel syntax is dst,src).
We will use the UNIX syntax in these notes. Another name for this is the "AT&T"
syntax, after the original authors of UNIX
In the UNIX/AT&T syntax, where a particular opcode can be performed at different
precisions, that opcode receives a letter suffix: b,w,l,q for 8,16,32 and 64-bit operations
respectively.
Assembly language files are denoted with a .s or .S suffix. The latter is used in the Linux
kernel for files which are hand-crafted in assembly, as opposed to .s files which were
generated by the compiler. Assembly language is line-break sensitive. Each line is an
instruction (not including whitespace, assembler directives and comments) and each
instruction must be contained on that line. The line has 2 parts: opcode operands. These
fields are separated by whitespace. The optional label is a symbolic name given to that
instruction or memory location. It is specified as label: and can either appear at the
beginning of the line, or on a line by itself, in which case the label is attached to the next
line. By stylistic convention, the opcode is never at the beginning of the line. There is
least one whitespace character (typically a tab) so that the opcode is not mistaken as a
label. The operands field contains 0, 1 or more operands delimited by commas, and
following the addressing mode syntax explained below.

Assembler Directives / Pseudo Opcodes

Assembler directives are pseudo-opcodes that do not correspond to actual opcodes that
the processor executes, but cause the assembler to modify its operation, or to emit special
code or data. These include .text and .data to switch between the text and data
sections of the a.out file, .byte to emit a single byte, .long to emit a 4-byte value, and
.string to emit a nul-terminated string. This is not an exhaustive list and the reader is
referred to the documentation for as.

X86 Register Model

When referring to X86 registers, their size is implied by a prefix. For example, there is a
32-bit register called EAX. The least significant 16 bits of that register are called AX. It
is possible to refer to the least significant byte as AL and the next most significant byte as
AH. In the 64-bit X86-64 instruction set, the 64-bit version of EAX would be called
RAX. We will consider the 32-bit model first.
The register model of X86 is convoluted and archaic, making efficient register allocation
ECE357:Computer Operating Systems Unit 7/pg 23 ©2024 Jeff Hakner

and instruction selection a challenge. The following general-purpose registers are

typically used for holding temporary values, general integer computation, etc.
• %eax: The "accumulator". Many instructions use %eax as an implied operand.
• %ebx: The "base register" (not to be confused with %ebp).
• %ecx: The "counter register".
• %edx: The "data register".
• %esi: Source register for string operations
• %edi: Destination register for string operations
The following special registers are used for control flow:
• %eip: The "instruction pointer", aka the Program Counter. At the time of instruction
execution, %eip contains the address of the next instruction to be fetched. A branch
instruction modifies %eip and causes the next instruction to be fetched from that new
address.
• %esp: The stack pointer.
• %ebp: Typically used in the C / assembly language convention for the stack frame "base
pointer". Aka the "frame pointer".
• %eflags: The flags register. It contains the condition code flags (carry, parity, BCD
adjust, zero, signed, overflow) as well as a number of flags and control bits which can
only be modified when running in Kernel (Supervisor) mode, such as the interrupt enable
flag and the user/supervisor privilege level.

Segmentation, Special-Purpose and Additional Registers

The X86 addressing scheme is based on an obsolete concept known as "segment/offset"

addressing. In all modern operating systems, program addresses are linear, and the
segmentation is basically ignored. The register model contains the registers %cs, %ds,
%ss which are initialized by the kernel and should not be touched. They are what enable
code, data and stack accesses to work. Additional segment registers %es, %fs and %gs
are general-purpose and, because a linear addressing model is being used, could be
employed as general-purpose scratch registers, subject to some restrictions as to which
registers may appear in which instructions. However, both the %fs and %gs registers are
used by the kernel and the standard library, and should be avoided.
We have seen additional registers such as %cr3 and %tr. These are generally only
accessible when running in kernel mode, and require special opcodes to read or write.
Additional registers are present for various floating point operations. The kernel does not
use these at all, but is required to save and restore them when context switching between
user-level tasks, if they are being used by those tasks.
On X86-64, there are additional general-purpose registers %r8 - %r15.

There are a number of addressing modes which are used to specify where to find or put
the operands of an instruction:
• Register Direct: Specify the register name with a % prefix, e.g. %eax.
• Immediate: The immediate value must be prefixed with the dollar sign, e.g. $1 Symbols
can also be used, e.g. movl $y,%eax moves the address of the variable y (not the
contents) into the EAX register.
• Memory Absolute: The absolute address of the operand is specified without a prefix
qualifier. E.g. movl $1,y moves the immediate value 1 into the memory address
which is associated with the linker symbol y.
• Base-index (Register Indirect with offset): The X86 has a handy mode for accessing
elements of an array. The syntax is disp(%base,%index,scale) . The address of
the operand is computed as addr=base+index*scale+disp. The base and index
may be any of the general-purpose registers (eax, ebx, ecx, edx, ebp, dsi, edi, esp (not
allowed as the index)). The displacement is a 32-bit absolute address. The scale factor
may be 1, 2, 4 or 8. Some of these parameters may be omitted, forming simpler
addressing modes. E.g. in movl $1, (%eax) the eax register contains a pointer to a
memory location, into which the immediate value 1 is moved.

X86 is generally a 2-address architecture, meaning that one of the operands is both a
source and a destination. For example, the instruction subl $4,%esp says subtract the
immediate value 4 from register esp and put the result back in esp. There are many
combinations of src/dst addressing modes including some odd restrictions. Generally
speaking, most opcodes allow register/register, register/immediate, register/memory or
immediate/memory combinations. Memory/memory is generally not allowed.
Function Calling Convention

We will discuss what the Intel documentation calls the CDECL convention for procedure
calling, as that is what is used in the C/UNIX world. Other calling conventions do exist.
In the X86-32 architecture, all arguments to a function are pushed on the stack, and the
return value is returned in the %eax register. If the return value is 64 bits (long long), it is
returned in the register pair %edx:%eax, with the %edx being the most significant 32 bits.
Recall that %esp is the stack pointer, and the stack grows towards low memory. The
PUSH instruction predecrements the stack pointer, then writes the value to (%esp).
Likewise, POP reads from (%esp) and then postincrements %esp. Arguments in C are
pushed to the stack in right-to-left order. Therefore, just before issuing the CALL
instruction, the leftmost argument is on the top of the stack. This convention allows
variadic functions to work properly. The callee does not need to know in advance (at
compile time) the exact number of arguments which will be pushed. It is able to retrieve
the arguments left-to-right by positive offsets from %esp.
The CALL instruction pushes the value of %eip, thus on entry to a function (%esp)
contains the address of the instruction to which control should return (i.e. the instruction
ECE357:Computer Operating Systems Unit 7/pg 25 ©2024 Jeff Hakner

after the CALL). The first thing any function does is set up its local stack frame. Let’s
look at an example:
f1()
{
f2(2);
}

f2(int b)
{
int a;
a++;
b--;
return 1;
}

f1:
pushl %ebp
movl %esp, %ebp
subl $8, %esp !one arg slot + one padding slot
movl $2, (%esp) !put arg onto stack
call f2
leave
ret

f2:
pushl %ebp
movl %esp, %ebp
subl $16, %esp !extra space for alignment
incl -4(%ebp) !access local var a
decl 8(%ebp) !access param b
movl $1,%eax !return value
leave
ret

The %ebp register is the frame pointer, and will be used to access both local variables and
parameters. Its value must be preserved so the first action is to save it on the stack. Then
the stack pointer is decremented to create room for local variables. In our example,
function g has one local variable which takes up 4 bytes. The %ebp contains the value of
the stack pointer after saving the old %ebp. Therefore 4(%ebp) is the return address,
(%ebp) is the saved %ebp, and the first parameter is 8(%ebp). Parameters will be at
positive offsets from %ebp and local variables will be at negative offsets. Generally
speaking, the local variables mentioned first in a function will have the lowest memory
address (i.e. highest negative offset from %ebp), but that behavior is not guaranteed.
When a function call is made, arguments can be pushed on the stack in right-to-left order,
using the pushl instruction. After the CALL instruction, an addl $X,%esp would be needed
to adjust the stack pointer and reverse the effects of the previous pushes. Alternatively,
one could determine during code generation which function call (within the function
being generated) has the highest number of arguments. The number of bytes thus
ECE357:Computer Operating Systems Unit 7/pg 26 ©2024 Jeff Hakner

required for passing arguments can be added to the total local stack frame size, as if these
"argument slots" were hidden local variables. Then the arguments can be passed via movl
OFFSET(%esp), in any order desired, and there is no need to adjust the stack pointer after
the call. This is the approach that gcc takes.
Upon leaving a function, the LEAVE instruction is used, which performs two operations:
%ebp is moved into %esp, thus restoring the stack pointer to its value just after the base
pointer save on entry, then %ebp is popped from the stack. Now everything is restored,
and the RET instruction pops the return address from the stack and resumes execution in
the caller.
If the compiler chose to use any registers which are callee-saves, we would see pushes of
those registers on entry and corresponding pops on exit.
ECE357:Computer Operating Systems Unit 7/pg 27 ©2024 Jeff Hakner

LOW MEMORY

g(x,y,z)
{
int a,b,c;
/*...*/
%esp z(1,2);
arg s (during exec .*...*.
lot 1
of fn. g) }
arg s
lot 2 If additional space were needed
stack padd for more local variables or temp
ing values, they would be allocated in
frame here, with padding necessary to bring
a
for the total stack frame size to a multiple
b of 16.
g()

saved %ebp
%eb (during exec. of fn. g)
p
retur f()
addr n
ess %esp just {
/*...*/
arg1 before calling g(arg1,arg2,arg3);
fn. g /*...*/
arg2 }

arg3

HIGH MEMORY

X86-64 Function Calling

Under the 64 bit architecture, the first 6 integer arguments are passed in registers, rather
than on the stack. Arguments are placed in left-to-right order in registers %rdi, %rsi,
%rdx, %rcx, %r8, %r9. If there are additional arguments, they are put on the stack right-
to-left, i.e. with the right-most argument at the highest memory address, just like X86-32.
ECE357:Computer Operating Systems Unit 7/pg 28 ©2024 Jeff Hakner

If structs are passed as arguments, they are always placed on the stack. The integer return
value is in the %rax register.
This hybrid register/memory argument passing model introduces some complexity with
variadic functions, aka <stdarg.h>. GCC implements stdarg as a compiler built-in.

X86-64 Global Variables

There is an odd limitation in the X86-64 instruction set: the absolute addressing mode is
not supported for 64-bit addresses. To access a memory operand, a register indirect
addressing mode must be used.
extern int i;

f()
{
i=2;
}

f:
pushq %rbp #Prologue, save base pointer
movq %rsp, %rbp #Set new base pointer
subq $32, %rsp #Create stack frame
movl $2, i(%rip) #Program Counter Relative mode
leave
ret

There will be a 32-bit "hole" in the movl opcode which will be a program counter relative
relocation type (similar to the example of the CALL opcode earlier in this unit). At link
time, when the address of symbol i has been resolved, this hole will be filled with the i’s
address, minus the address of the hole itself.
ECE357:Computer Operating Systems Unit 7/pg 29 ©2024 Jeff Hakner

This introduces a limitation that code and data must fall within the same contiguous 2GB
memory region at run time, which the X64-64 spec calls a "medium" memory model. To
use a "large" memory model where code and data may be anyplace within the 64-bit
address space, different opcodes are used:
movabsq $i, %rax #Move 64 bit immediate value to rax
movl $2, (%rax) #Register indirect

Caller/Callee saves

It is the case for any architecture and operating system that there is a function calling
"convention" which specifies how arguments are passed and returned, and how registers
may be used. This convention dictates which of the registers are expected to survive a
function call, and which ones may be used as "scratch" registers, and are therefore
expected to be volatile across function calls. Another way of saying this is there are
caller-saved registers (the scratch registers.. if the caller wants to keep a value in there
through a function call it must explicitly save it) and callee-saved registers (if a function
wants to use one of these registers it must explicitly save it on entry and restore it before
returning).
In the X86-32 architecture under UNIX, the %eax,%ecx,and %edx registers are scratch
registers (caller-saves). You will find that the compiler tends to put short-lived values in
these registers. Of course the %eflags register is also expected to be modified by a
function call. The %ebx,%edi,%esi and %es [CAUTION: this is a 16-bit register]
registers are callee-saved. The compiler may use these for longer-lived values (such as
local variables which are assigned to a register for all or part of the function to improve
speed). However, if one of these registers is used by the compiler, it must emit code to
push it on the stack on entry, and pop it on return.
On X86-64, the caller-save (scratch) registers are %rax,%rcx,%rdx,%rsi,%rdi, and
%r8-%r11, while the callee-save (long-term) registers are %rbx, %r12-%r15. Note that
%rsi and %rdi are caller-save on 64 bit, whereas they were callee-save on 32-bit. This is
because they are used for argument passing on 64 bit.

X86 General-Purpose Register Summary

-32 reg Saved by Notes -64 reg Saved by Notes

%eax Caller fn retval, syscall# %rax Caller fn retval, syscall #
%ebx CallEE syscallarg 1 %rbx CallEE
%ecx Caller syscallarg2 %rcx Caller fn arg #4
%edx Caller longlong ret, syscall arg 3 %rdx Caller arg #3
%edi CallEE syscall arg5 %rdi Caller arg #1
%esi CallEE syscall arg4 %rsi Caller arg #2
%es CallEE 16bit only N/A
%ebp CallEE frame ptr,syscall arg6 %rbp CallEE frame ptr
%r8 Caller arg #5
%r9 Caller arg #6
%r10 Caller syscall arg4
%r11 Caller
%r12 CallEE
%r13 CallEE
%r14 CallEE
%r15 CallEE

Linux Kernel Development
No ratings yet
Linux Kernel Development
420 pages
Linux Kernel
No ratings yet
Linux Kernel
49 pages
CC EBook
No ratings yet
CC EBook
91 pages
Unit 6. Linux Operating System
No ratings yet
Unit 6. Linux Operating System
92 pages
Linux Insides
No ratings yet
Linux Insides
339 pages
Embedded Software 2
No ratings yet
Embedded Software 2
75 pages
4computer System Operation
No ratings yet
4computer System Operation
66 pages
Linux Architecture
No ratings yet
Linux Architecture
39 pages
Linux Insides
No ratings yet
Linux Insides
831 pages
OS - UNIT - 6 Notes
No ratings yet
OS - UNIT - 6 Notes
19 pages
Module II
No ratings yet
Module II
74 pages
Unix Self Notes
No ratings yet
Unix Self Notes
22 pages
Bootlin Kernel Init
No ratings yet
Bootlin Kernel Init
16 pages
Kernel
No ratings yet
Kernel
9 pages
Comp 3000 - Exam Notes Before Midterm 1
No ratings yet
Comp 3000 - Exam Notes Before Midterm 1
42 pages
Linux Insides
No ratings yet
Linux Insides
583 pages
2021-08-07 - LKM C - Vx-Ug (En-Us)
No ratings yet
2021-08-07 - LKM C - Vx-Ug (En-Us)
23 pages
Linux Boot Loader
No ratings yet
Linux Boot Loader
35 pages
Embedded Linux To Start With
100% (1)
Embedded Linux To Start With
70 pages
Linux 2
No ratings yet
Linux 2
22 pages
Os PPT
No ratings yet
Os PPT
30 pages
02 Linux Overview
No ratings yet
02 Linux Overview
81 pages
Linux Insides
No ratings yet
Linux Insides
115 pages
Booting and Shutting Down
No ratings yet
Booting and Shutting Down
97 pages
Kernel Basics
No ratings yet
Kernel Basics
10 pages
Inside The Linux Kernel: Unixforum Chicago - March 8, 2001
No ratings yet
Inside The Linux Kernel: Unixforum Chicago - March 8, 2001
40 pages
Understanding The Linux Kernel (3rd Ed)
No ratings yet
Understanding The Linux Kernel (3rd Ed)
974 pages
Canadian Pharmacist Evaluating Examination (PEBC) Study Guide
70% (10)
Canadian Pharmacist Evaluating Examination (PEBC) Study Guide
20 pages
Chapter 5 Kernel Code (Kernel) : List 5-1 Files in Linux/kernel/ Directory
No ratings yet
Chapter 5 Kernel Code (Kernel) : List 5-1 Files in Linux/kernel/ Directory
91 pages
LINUX Device Drivers: Partha
No ratings yet
LINUX Device Drivers: Partha
20 pages
Describe Initial Process Sequence While The System Boots Up
No ratings yet
Describe Initial Process Sequence While The System Boots Up
30 pages
Linux Booting Steps
100% (2)
Linux Booting Steps
21 pages
422s Exam1
No ratings yet
422s Exam1
2 pages
Osdev Report
No ratings yet
Osdev Report
29 pages
Bkerndev
No ratings yet
Bkerndev
38 pages
K L Deemed To Be University Department of Computer Science and Engineering
No ratings yet
K L Deemed To Be University Department of Computer Science and Engineering
23 pages
Linux Case Study
No ratings yet
Linux Case Study
13 pages
Linux Kernel Internals
100% (1)
Linux Kernel Internals
84 pages
Driver
No ratings yet
Driver
47 pages
Linux Insides
100% (1)
Linux Insides
292 pages
L Kernelhack2 A4
No ratings yet
L Kernelhack2 A4
23 pages
Building Services Compressed Compressed
No ratings yet
Building Services Compressed Compressed
79 pages
Linux Booting Procedure: Syahrizal Shafie
No ratings yet
Linux Booting Procedure: Syahrizal Shafie
39 pages
Linux Kernels Internals
100% (2)
Linux Kernels Internals
379 pages
Linux Kernel Series
No ratings yet
Linux Kernel Series
140 pages
Unreliable Guide To Hacking The Linux Kernel
No ratings yet
Unreliable Guide To Hacking The Linux Kernel
24 pages
Converted File 891129ca PDF
No ratings yet
Converted File 891129ca PDF
6 pages
Module 20: More On LINUX: Linux Kernel Architecture The Big Picture
No ratings yet
Module 20: More On LINUX: Linux Kernel Architecture The Big Picture
39 pages
Linux Kernal 1
No ratings yet
Linux Kernal 1
41 pages
5 Principles of Presentation Design-18
No ratings yet
5 Principles of Presentation Design-18
27 pages
Linux Device Driver Sum Up 2
No ratings yet
Linux Device Driver Sum Up 2
33 pages
6 Stages of Linux Boot Process
100% (1)
6 Stages of Linux Boot Process
9 pages
ECE 412: Microcomputer Laboratory: Lecture 10: Kernel Modules and Device Drivers
No ratings yet
ECE 412: Microcomputer Laboratory: Lecture 10: Kernel Modules and Device Drivers
22 pages
Linux Kernel
No ratings yet
Linux Kernel
9 pages
Introduction To The Linux Kernel
No ratings yet
Introduction To The Linux Kernel
7 pages
Booting Process in Linux and Windows
No ratings yet
Booting Process in Linux and Windows
5 pages
Human Resource
100% (1)
Human Resource
92 pages
Boot Process On Linux
No ratings yet
Boot Process On Linux
17 pages
Manual de Mantenimiento S331D
No ratings yet
Manual de Mantenimiento S331D
32 pages
VetcoGray S-Series SVXT
No ratings yet
VetcoGray S-Series SVXT
2 pages
Dr. Sameh Ahmad Muhamad Abdelghany Lecturer of Clinical Pharmacology Mansura Faculty of Medicine
No ratings yet
Dr. Sameh Ahmad Muhamad Abdelghany Lecturer of Clinical Pharmacology Mansura Faculty of Medicine
52 pages
Mies Van Der Rohe and The Philosophy of Work
No ratings yet
Mies Van Der Rohe and The Philosophy of Work
5 pages
Micro Economics II Monopolistic Competition
No ratings yet
Micro Economics II Monopolistic Competition
11 pages
Power Electronics and DC Lectures
No ratings yet
Power Electronics and DC Lectures
159 pages
Basics of Essay Writing
No ratings yet
Basics of Essay Writing
20 pages
Input Output Devices
No ratings yet
Input Output Devices
44 pages
Eaton DS 265760 NZMN4 AE1000 en - GB 20241113
No ratings yet
Eaton DS 265760 NZMN4 AE1000 en - GB 20241113
4 pages
Shift Left (Left) and KCS: Working Towards Better Services
No ratings yet
Shift Left (Left) and KCS: Working Towards Better Services
3 pages
VB7
No ratings yet
VB7
44 pages
Unsupervised Machine Learning
No ratings yet
Unsupervised Machine Learning
10 pages
Technical Specifications / Tender Text Wöhr Autoparksysteme GMBH Parklift 462-2,0 D / 462-2,6 D
No ratings yet
Technical Specifications / Tender Text Wöhr Autoparksysteme GMBH Parklift 462-2,0 D / 462-2,6 D
1 page
How To Write A Research Proposal
No ratings yet
How To Write A Research Proposal
3 pages
Sambhav Daksh Syed Abhimanyu
No ratings yet
Sambhav Daksh Syed Abhimanyu
10 pages
Patliputra University, Patna: Under-Graduate (UG) Admission Application Form 24G0066273
No ratings yet
Patliputra University, Patna: Under-Graduate (UG) Admission Application Form 24G0066273
2 pages
Only One Mind PDF
No ratings yet
Only One Mind PDF
34 pages
Chapter One: Organizations and Organization Theory
No ratings yet
Chapter One: Organizations and Organization Theory
12 pages
NOTES LIFE PROCESSES (Respiration, Excretion
No ratings yet
NOTES LIFE PROCESSES (Respiration, Excretion
3 pages
Application Guide
No ratings yet
Application Guide
4 pages
What Is Twitter and Why Should You Use It
No ratings yet
What Is Twitter and Why Should You Use It
4 pages
Mapreduce Join Document
No ratings yet
Mapreduce Join Document
4 pages
After You Graduate You Get A Job in A Small
No ratings yet
After You Graduate You Get A Job in A Small
2 pages
Rubric For Preparation of Design/Computational Plate
No ratings yet
Rubric For Preparation of Design/Computational Plate
1 page
Plate Fin Heat Ex Changers
No ratings yet
Plate Fin Heat Ex Changers
16 pages

Lecture w07

Uploaded by

Lecture w07

Uploaded by

ECE357:Computer Operating Systems Unit 7/pg 1 ©2024 Jeff Hakner

What is the Kernel?

subdirectories for each supported file system type module.

Re-entering the kernel

• On a multi-processor system, one processor may execute a series of instructions which

Important Shared Processor/Kernel Data Structures

(Phys addr) IDT (incomplete example) Kernel Text

Interrupt and Exception handling in hardware

restored from the stack)

kernel−mode per−thread stacks

shared kernel data structures

Kernel control flow paths & pre-emption

Making a System Call

The 32-bit X86 API

Kernel-mode stack and the thread_info structure

Per-CPU variable area

Now, back to our system call, already in progress

negative...think about two’s complement) then we jump to the syscall_badsys code.

System Call Return Value

Kernel system call hand-off to C

asmlinkage long sys_getuid(void)

getnstimeofday(&tv); //Kernel keeps time in ns resolution

Returning from a system call

Deferred return from system call, interrupt or fault handler

Pre-emptive context switch

The X86-64 API

Summary of 32 and 64 bit X86 Syscall APIs

Appendix - X86 Architecture and Assembly Language

Intel vs UNIX assembly syntax

processors). In the UNIX syntax, an identifier is unambiguously an assembler symbol.

Assembler Directives / Pseudo Opcodes

X86 Register Model

and instruction selection a challenge. The following general-purpose registers are

Segmentation, Special-Purpose and Additional Registers

The X86 addressing scheme is based on an obsolete concept known as "segment/offset"

X86-64 Function Calling

X86-64 Global Variables

X86 General-Purpose Register Summary

-32 reg Saved by Notes -64 reg Saved by Notes

You might also like