A Journey in Creating An Operating System Kernel The 539kernel Book
Mohammed Q. Hussain
November 2022
acknowledgment
I would like to thank Dr. Hussain Almohri 1 for his kind acceptance to
read this book before its release and for his encouragement, feedback
and discussions that really helped me. Also, I would like to thank
my friends Anas Nayfah, Ahmad Yassen, DJ., Naser Alajmi and my
dearest niece Eylaf for their kind support.
1 https://almohri.io/
CHAPTER 1: LET'S START WITH THE BOOT LOADER
1.1 introduction
The first piece to start with when writing an operating system's kernel
is the boot loader, which is the code responsible for loading the
main kernel from the disk into main memory so the kernel can be
executed. Before getting into the details of the boot loader and
all other parts of the kernel, we need to learn a little bit about the
tools (e.g. compilers and programming languages) that we will use
in our journey of creating a kernel. In this chapter, we start with an
overview of the tools and their basics, and then we start writing a
boot loader.
1 While the program that transforms source code written in a high-level
language such as C into machine code is known as a compiler.
2 Another popular open-source assembler is GNU Assembler (GAS). One of the main
differences between NASM and GAS is that the former uses Intel syntax while the
latter uses AT&T syntax.
1.2 x86 assembly language overview
1.2.1 Registers
Figure 1: How the Registers EAX, EBX, ECX and EDX are Divided in x86
can be referred to in the assembly code. The first 8 bits of the register
are called the low bits, while the second 8 bits are called the high bits.
Let's take one of these registers as an example: AX is a 16-bit
register which is a part of the bigger 32-bit EAX register in 32-bit
architecture. AX 5 is divided into two more parts: AL for the low 8 bits,
as the second letter of the name indicates, and AH for the high 8 bits, as
the second letter of the name indicates. The same division holds true
for the registers BX, CX and DX; figure 1 illustrates that division.
As you can see, each line starts with an instruction which is pro-
vided to us by the x86 architecture. In the first two lines we use an
instruction named mov, and as you can see, this instruction receives
two operands: a destination and a source.
5 Or in other words for 32-bit architecture: The first 16 bits of EAX.
6 Or a procedure for people who work with Algol-like programming languages.
1.2 x86 assembly language overview 10
Now you can tell that the first line copies the value 0Eh to the
register ah, and the second line copies the character S to the register
al. The single quotation mark is used in NASM to represent strings or
characters, and that's why we have used it in the second line. Based
on that, you may have noticed that the value 0Eh is not surrounded by
single quotation marks though it contains a character. In fact, this value
isn't a string; it is a number represented in the hexadecimal numbering
system, and due to that the character h was put at the end of the
value. That is, putting h at the end of 0E tells NASM that this value is
a hexadecimal number. The equivalent of 0E in the decimal
numbering system, which we humans use, is 14; that is, 0E and
14 are exactly the same number, but represented in two different
numbering systems8 .
1.2.3 NASM
7 https://software.intel.com/en-us/articles/intel-sdm
8 Numbering systems will be discussed in more details later.
Binary Format
In any binary format, one major part of the binary file that uses
this format is the machine code that has been produced by compiling
or assembling some source code. The machine code is specific to a
processor architecture; for example, machine code that has been
generated for x64 12 cannot run on x86. Because of that, binary
files are distributed according to the processor architecture which
they can run on. For example, GNU/Linux users see the names of software
packages in a format like nasm_2.14-1_i386.deb; the part
i386 tells the users that the binary machine code of this package is
generated for the i386 architecture, which is another name for x86 by the
way. That means this package cannot be used on a machine that uses
an ARM processor, such as the Raspberry Pi.
Due to that, to distribute a binary file of the same software for
multiple processor architectures, a separate binary file should be
generated for each architecture. To solve this problem, a binary format
named FatELF was presented. In this binary format, the machine code
of the software for multiple processor architectures is gathered in one
binary file, and the suitable machine code will be loaded and run
based on the type of the system's processor. Naturally, the size of
files that use such a format will be bigger than files that use a
binary format oriented toward one processor architecture. Due to
the bigger size, this type of binary format is known as a fat binary.
Getting back to the format argument of NASM: if our goal of using
assembly language were to produce an executable file for Linux, for
example, we would use elf as the value of the format argument. But we are
working with low-level kernel development, so our binary files should
be flat, and the value of format should be bin to generate a flat binary
file which doesn't follow any specification. In flat binary files, the
output is stored as is, with no additional information or organization;
only the machine language output of our code. Using flat binary for a
bootloader makes sense because the code which is
going to load 13 our binary file doesn't understand any binary format
to interpret it and fetch the machine code out of it; instead, the content
of the binary file will be loaded into memory as is.
GNU Make is a build automation tool. Well, don't let this fancy
term make you panic! The concept behind it is quite simple. When we
create a kernel of an operating system 14 we are going to write some
assembly code and C code, and both of them need to be assembled
and compiled (for the C code) to generate machine code as binary
files. Each time a modification is made in the source
code, you need to recompile (or reassemble) the code over and over
again, writing the same commands in the terminal, in order
to generate the latest binary output of your code. Beside the compiling
and recompiling steps, an important step needs to take place in order
to generate the final output; this operation is known as linking. Usually
a programming project contains multiple source files that call each
other; compiling each one of these files is going to generate a separate
object file 15 for each one. In the linking process these different object files
are linked with each other to generate one binary file out of the
multiple object files, and this final binary file represents the program that
we are writing.
These operations which are needed to generate the final binary
file out of the source code are known as the building process, which, as
mentioned earlier, involves executing multiple commands for
compiling, assembling and linking. The building process is a tedious and
error-prone job, and to save our time (and ourselves from boredom,
of course) we don't want to write all these commands over and over
again in order to generate the latest output. We need an alternative, and
here is where GNU Make 16 comes to the rescue. It automates the building
process by gathering all required commands in a text file known as a
Makefile, and once the user runs this file through the command make,
GNU Make is going to run these commands sequentially. Furthermore,
it checks whether a code file has been modified since the last building process
or not; if the file has not been modified, then it will not be
compiled again and the object file generated in the last building
process is used instead, which of course minimizes the time needed to
finish the building process.
1.3.1 Makefile
A makefile is a text file that tells GNU Make the steps needed
to complete the building process of a specific source code. There
is a specific syntax that we should obey when writing a makefile. A
number of rules may be defined; we can say that a makefile has a list
of rules that define how to create the executable file. Each rule has the
following format:
1 target: prerequisites
2 recipe
1 #include "file2.h"
2 int main()
3 {
4 func();
5 }
1 void func();
1 #include <stdio.h>
2 void func()
3 {
4 printf( "Hello World!" );
5 }
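A first makefile for these two C files needs only one rule; a sketch consistent with the discussion that follows (the file names come from the listings above, the output name ex_file is the one used later in this section, and recipe lines must start with a tab character):

```makefile
build: file1.c file2.c file2.h
	gcc -o ex_file file1.c file2.c
```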
The target name of this rule is build, and since it is the first and
only rule in the makefile whose name doesn't start with a dot, it is the
rule that runs by default when the command make is executed.
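The second makefile, the one discussed next, splits the work into object files; based on the variable-based version shown below, it presumably looks like the following sketch (recipe lines must start with a tab character):

```makefile
build: file1.o file2.o
	gcc -o ex_file file1.o file2.o

file1.o: file1.c
	gcc -c file1.c

file2.o: file2.c file2.h
	gcc -c file2.c
```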
In this example, the first rule build depends on two object files,
file1.o and file2.o. Before running the building process for the first
time, these two files will not be available in the source code directory
17 ,
therefore, we have defined a rule for each one of them. The rule
file1.o is going to generate the object file file1.o and it depends on
file1.c; the object file will simply be generated by compiling file1.c.
The same happens with file2.o, but this rule depends on two files
instead of only one.
GNU Make also supports variables, which can simply be defined
as follows: foo = bar, and used in the rules as follows: $(foo).
Let's now redefine the second makefile using variables.
1 c_compiler = gcc
2 build_dependencies = file1.o file2.o
3 file1_dependencies = file1.c
4 file2_dependencies = file2.c file2.h
5 bin_filename = ex_file
6 build: $(build_dependencies)
7 $(c_compiler) -o $(bin_filename) $(build_dependencies)
8 file1.o: $(file1_dependencies)
9 gcc -c $(file1_dependencies)
10 file2.o: $(file2_dependencies)
11 gcc -c $(file2_dependencies)
1.4 the emulators
While developing an operating system kernel, you will surely need
to run that kernel frequently to test your code. That can of course
be done by writing the image of the kernel to a bootable device and
17 Since they are a result of one step of the building process which is the compiling step
that has not been performed yet.
1.4 the emulators 16
rebooting your machine over and over again in order to run your kernel.
Obviously, this way isn't practical and needs a lot of tedious work.
Moreover, when a bug shows up, it will be really hard to debug your
code this way. A better alternative is to use an emulator
to run your kernel every time you need to test it.
An emulator is software that acts like a full computer; by using
it you can run any code that requires running on bare-metal hardware.
Also, with an emulator, everything will be virtual. For example,
you can create a virtual hard disk (that is, not a real one) that can be used
by your kernel; this virtual hard disk will be a normal file in your host
system, so if anything goes wrong in your code you will not lose the
data in your main system. Furthermore, an emulator can provide you
with a debugger, which will make your life a lot easier when you need
to debug your code.
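For example, a virtual hard disk is typically just a fixed-size file of zeros created on the host; the size and file name below are arbitrary choices for illustration:

```shell
# Create a 1MB file filled with zeros to act as a virtual hard disk.
dd if=/dev/zero of=disk.img bs=1024 count=1024

# The "disk" is a normal file on the host system.
ls -l disk.img
```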
There are two options for the emulator: QEMU 18 and Bochs 19 .
Both of them are open source and both provide us with a
way to run a debugger. Personally, I like Bochs' debugger better
since it provides an easy GUI that saves a lot of time. QEMU, on the
other hand, gives the user the ability to use GNU Debugger through
the command line. Running a kernel image is simple in QEMU; the
following command performs that.
1 qemu-system-x86_64 kernel.img
Where kernel.img is the binary file of the kernel and the bootloader.
You will see later in 539kernel's Makefile that the option -s is used
with QEMU; it can be safely removed, but it is used to let the GNU
debugger connect to QEMU in order to start a debugging
session. Of course, you can find a lot more about QEMU in its official
documentation 20 .
To run your kernel by using Bochs, you need to create a configura-
tion text file named bochsrc. Each time you run Bochs it will use this
configuration file, which tells Bochs the specifications of the virtual
machine that will be created. These specifications cover things such as
the virtual processors, their number and their available features, the num-
ber of available virtual disks, their options, the paths of their files and
so on. Also, whether the debugger of Bochs and its GUI are enabled or
not is decided through this configuration file. This configuration file can
be easily created or edited using a command line interface by
running the command bochs with no arguments. After creating the
file, you can use the option -f bochsrc, where bochsrc is the filename
of the configuration file, to run your kernel directly with no question
from Bochs about what to do.
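As a rough sketch only (the directive names below are standard Bochs options, but the values are assumptions for illustration, not 539kernel's actual configuration), a minimal bochsrc might look like the following:

```
# Amount of memory for the virtual machine, in megabytes.
megs: 32

# Attach the kernel image as a flat hard disk image.
ata0-master: type=disk, path="kernel.img", mode=flat

# Boot from the hard disk.
boot: disk
```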
18 https://www.qemu.org/
19 https://bochs.sourceforge.io/
20 https://www.qemu.org/docs/master/
1.5 writing the boot loader
Figure 2: (A) Shows a platter when we see it from the side. (B) Shows a
platter when we see it from top/down.
Figure 3: Shows how the parts of a hard disk are assembled together.
24 This fancy term mechanical moves means that some physical parts of the hard disk
move physically.
25 Not exactly random, can you tell why?
26 I didn’t mention that previously, but yes, the bootloader resides in track 0.
27 C programming language, for instance, uses this way for hexadecimal numbers.
That's it; all BIOS services can be used in this exact way. First we
need to know which interrupt the service belongs to, then we need to
know the number of the service itself. We put the service number in
the register ah, then we call the interrupt by its number using the
int instruction.
The previous code calls the service of printing a character on the
screen, but is it complete yet? Actually no; we didn't specify which
character we would like to print. We need something like
parameters in high-level languages to pass additional information so
BIOS is able to do its job. Well, lucky us! The registers are here to
the rescue.
When a BIOS service needs additional information, that is, param-
eters, it expects to find this information in specific registers. For
example, the service 0Eh of interrupt 10h expects to find the character
that the user wants to print in the register al; so the register al is one
of service 0Eh's parameters. The following code requests that BIOS
print the character S on the screen:
1 mov ah, 0Eh
2 mov al, ’S’
3 int 10h
In NASM, a semicolon starts a comment; we can write whatever we
like after it, and the rest of the source line will be considered part of
the comment.
A label is a way to give an instruction or a group of instructions a
meaningful name; we can then use this name in other places in the
source code to refer to this instruction or group of instructions. We can
use labels, for example, to call this group of instructions or to get the
starting memory address of these instructions. Sometimes, we may
use labels just to make the code more readable.
We can say that a label is something like the name of a function or
variable in C. As we know, a variable name in C is a meaningful name
that represents the memory address of a location in main memory
that contains the value of the variable, and the same holds true for a
function name. Labels in NASM work in the same way; under the hood,
a label represents a memory address. The colon after a label's name is
optional.
1 print_character_S_with_BIOS:
2 mov ah, 0Eh
3 mov al, ’S’
4 int 10h
You can see in the code above, we gave a meaningful name for the
bunch of instructions that prints the character S on the screen. After
defining this label in our source code, we can use it anywhere in the
same source code to refer to this bunch of instructions.
1 call_video_service int 10h
Let’s start this section with a simple question. What happens when
we call a function in C? Consider the following C code.
1 #include <stdio.h>
2
3 int sum( int x, int y ); /* Defined somewhere else. */
4
5 int main()
6 {
7 int result = sum( 5, 3 );
8
9 printf( "%d\n", result );
10 }
Here, the function main called a function named sum. This function
resides in a different region of memory, and by calling it we are telling
the processor to go to this different region of memory and execute
what's inside it. The function sum is going to do its job, and after
that, in some magical way, the processor is going to return to the
original memory region where we called sum from and proceed with the
execution of the code that follows the call of sum, in this case, the
call of the printf function. How does the processor know where to return
after completing the execution of sum?
The function which calls another is named the caller, while the function
which is called is named the callee; in the above C code, the
caller is the function main while the callee is the function sum.
29 For the simplicity of explanation, the details of decoding have been eliminated.
30 The stack as a region of memory (the x86 stack) is not the same as the stack data
structure; the former implements the latter.
31 Push means to store something on a stack. This term is applicable to both the x86 stack
and the stack data structure; as we have said previously, the x86 stack is an
implementation of the stack data structure.
which is 120 in the stack, gets it and puts it in the register IP/EIP for
the next instruction cycle 32 . So, this is the answer to our original
question in the previous section: “How does the processor know where
to return after completing the execution of sum?”.
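The sample discussed next combines the label print_character_S_with_BIOS with call and ret; a sketch consistent with the description (the exact original listing may differ slightly) is:

```nasm
print_two_times:
	; Call the routine twice; each call pushes a return address,
	; and each ret in the callee pops it to resume here.
	call print_character_S_with_BIOS
	call print_character_S_with_BIOS
	ret

print_character_S_with_BIOS:
	mov ah, 0Eh    ; BIOS service: print a character
	mov al, 'S'    ; the character to print
	int 10h
	ret            ; return to the caller via the saved address
```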
You can see here that we have used the code sample print_character_S_with_BIOS
to define something like C functions by using the instructions call
and ret. It should be obvious that the code of print_two_times prints
the character S two times. As we have said previously, a label repre-
sents a memory address, and print_character_S_with_BIOS is a label;
the operand of call is the memory address of the code that we wish to
call. The instructions of print_character_S_with_BIOS will be executed
sequentially until the processor reaches the instruction ret; at this
point, the return address is obtained from the stack and the execution
of the caller is resumed.
call performs an unconditional jump; that means when the processor
reaches a call instruction, it will always call the callee, without any
condition. Later in this chapter we will see instructions that perform
a conditional jump, which only transfer control when some condition is
satisfied; otherwise, the execution continues sequentially
with no change of flow.
Like call, the instruction jmp jumps to the specified memory address,
but unlike call, it doesn’t store the return address in the stack which
32 By the way, this is, partially, the cause of buffer overflow bugs.
33 Actually it pops the value since we are talking about stack here.
means ret cannot be used in the callee to resume the caller's execution.
We use jmp when we want to jump to code that we don't need to
return from; jmp has the same functionality as the goto statement in C.
Consider the following example.
1 print_character_S_with_BIOS:
2 mov ah, 0Eh
3 mov al, ’S’
4 jmp call_video_service
5
6 print_character_A_with_BIOS:
7 mov ah, 0Eh
8 mov al, ’A’
9
10 call_video_service:
11 int 10h
Can you guess the output of this code? It is S, and the code
of the label print_character_A_with_BIOS will never be executed
because of the line jmp call_video_service. If we remove the jmp line
from this code sample, A will be printed on the screen instead of
S. The following example causes an infinite loop.
1 infinite_loop:
2 jmp infinite_loop
34 In 32-bit x86 processors its name is EFLAGS and in 64-bit its name is RFLAGS.
We can see that conditional jump instructions have the same
functionality as the if statement in C. Consider the following example.
1 main:
2 cmp al, 5
3 je the_value_equals_5
4 ; The rest of the code of ‘main‘ label
Like jmp, but unlike call, conditional jump instructions don't push
the return address onto the stack, which means the target code can't use
ret to return and resume the caller's code; that is, the jump will be a
one-way jump. We can also imitate a while loop by using conditional jump
instructions and cmp. The following example prints S five times by
looping over the same bunch of code.
1 mov bx, 5
2
3 loop_start:
4 cmp bx, 0
5 je loop_end
6
7 call print_character_S_with_BIOS
8
9 dec bx
10
11 jmp loop_start
12
13 loop_end:
14 ; The code after loop
You should be familiar with most of the code of this sample.
First we assign the value 5 to the register bx 35 , then we start the label
loop_start, which first compares the value of bx
with 0. When bx equals 0, the code jumps to the label loop_end, which
contains the code after the loop; that is, the loop has ended.
When bx doesn’t equal 0 the label print_character_S_with_BIOS will
35 Can you tell why we used bx instead of ax? [Hint: review the code of
print_character_S_with_BIOS.]
Load String
NASM’s Pseudoinstructions
When you encounter the prefix 39 pseudo before a word, you should
know that it describes something fake, false or not real 40 . NASM
provides us with a number of pseudoinstructions; that is, they are
not real x86 instructions: the processor doesn't understand them and
they can't be used in other assemblers 41 . On the other hand, NASM
understands these instructions and can translate them into something
understandable by the processor. They are useful, and we are going
to use them to make writing the bootloader easier.
1 db ’a’
The above example reserves a byte in memory; this is the decla-
ration step. Then the character a will be stored in this reserved byte of
memory, which is the initialization step.
1 db ’a’, ’b’, ’c’
these values will be stored contiguously, that is, one after another; the
memory location (hence, the memory address) of the value b will be
right after the memory location of the value a, and the same rule applies
to c. Since a, b and c are of the same type, a character, we can write
the previous code as follows and it gives us the same result.
1 db ’abc’
Also, we can declare different types of data in the same source line.
Given the above code, let's say that we would like to store the number
0 after the character c; this can be achieved by simply using a comma.
1 db ’abc’, 0
Now, to make this data accessible from other parts of the code, we
can use a label to represent the starting memory address of this data.
Consider the following example, it defines the label our_variable,
after that, we can use this label to refer to the initialized data.
1 our_variable db ’abc’, 0
Not only can normal x86 instructions be used with times as a second
operand; NASM's pseudoinstructions can be used with times as well.
The following example reserves 100 bytes of memory and fills
them with 0.
1 times 100 db 0
Till now, you have learned enough to understand most of the
bootloader that we are going to implement; however, some details
have not been explained in this chapter and their explanation has been
delayed until later. The first couple of lines of the bootloader are an example
of concepts that have not been explained yet. Our bootloader source code
starts with the following.
1 start:
2 mov ax, 07C0h
3 mov ds, ax
Why we set the value 07C0h in the register ds is a story for another chapter;
just take these two lines on faith, and you will learn their purpose later. Let's continue.
1 mov si, title_string
2 call print_string
3
4 mov si, message_string
5 call print_string
1 call load_kernel_from_disk
2 jmp 0900h:0000
These two lines represent the most important part of any bootloader.
First, a function named load_kernel_from_disk is called; we are going
to define this function in a moment. As you can see from its name, it is
going to load the code of the kernel from the disk into main memory,
and this is the first step that makes the kernel able to take control
of the system. When this function finishes its job and returns, a
jump is performed to the memory address 0900h:0000, but before
discussing the purpose of the second line, let's define the function
load_kernel_from_disk.
1 load_kernel_from_disk:
2 mov ax, 0900h
3 mov es, ax
This couple of lines, also, should be taken on faith. As you can see, we
are setting the value 0900h in the register es. Let's move to the most
important part of this function.
1 mov ah, 02h
2 mov al, 01h
3 mov ch, 0h
4 mov cl, 02h
5 mov dh, 0h
6 mov dl, 80h
7 mov bx, 0h
8 int 13h
9
10 jc kernel_load_error
11
12 ret
1.5 writing the boot loader 33
This block of code loads the kernel from the disk into memory,
and to do that it uses BIOS interrupt 13h, which provides services
related to hard disks. The service number, which is 02h, is
specified in the register ah; this service reads sectors from the hard
disk and loads them into memory. The value of the register al is
the number of sectors that we would like to read; in our case, because
the size of our temporary kernel simple_kernel.asm doesn't exceed
512 bytes, we read only 1 sector. Before discussing the rest of the values
passed to the BIOS service, we need to mention that our kernel will
be stored right after the bootloader on the hard disk, and based on this
fact we can set the correct values for the rest of the registers, which
represent the disk location of the content that we would like to load.
The value of register ch is the number of the track that we would
like to read from; in our case, it is track 0. The value of the register
cl is the number of the sector whose content we would like to read; in our
case, it is the second sector. The value of the register dh is the head
number. The value of dl specifies the type and number of the disk that we
would like to read from: the value 0h in this register means that we
would like to read the sector from a floppy disk, while the value 80h
means hard disk #0 and 81h means hard disk #1. In our case, the
kernel is stored on hard disk #0, so the value of dl should be 80h.
Finally, the value of the register bx is the memory address that the
content will be loaded into; in our case, we are reading one sector,
and its content will be stored at the memory address 0h 44 .
When the content is loaded successfully, the BIOS service 13h:02h
is going to set the carry flag 45 to 0; otherwise, it sets the carry flag
to 1 and stores the error code in the register ax. The instruction jc is a
conditional jump instruction that jumps when CF = 1, that is, when
the value of the carry flag is 1. That means our bootloader is going
to jump to the label kernel_load_error when the kernel isn't loaded
correctly.
If the kernel is loaded correctly, the function load_kernel_from_disk
returns by using the instruction ret, which makes the processor
resume the main code of our bootloader and execute the instruction
that comes after call load_kernel_from_disk. This next instruction is
jmp 0900h:0000, which gives control to the kernel by jumping to
its starting point, that is, the memory location where we loaded our
kernel. This time, the operand of jmp is an explicit memory
address, 0900h:0000, which has two parts. The first part is the one before
the colon; you can see that it is the same value that we loaded into
the register es at the beginning of the load_kernel_from_disk function.
The second part of the memory address is the one after the colon,
which is the offset within that segment, 0000 in our case.
44 Not exactly the memory address 0h; in fact, it will be loaded at offset 0 inside a
segment that starts at 0900h. Don't worry, these details will be examined in the
next chapter.
45 Which is part of the FLAGS register, as we mentioned earlier.
1 print_string:
2 mov ah, 0Eh
3
4 print_char:
5 lodsb
6
7 cmp al, 0
8 je printing_finished
9
10 int 10h
11
12 jmp print_char
13
14 printing_finished:
15 mov al, 10d ; Print new line
16 int 10h
17
18 ; Reading current cursor position
19 mov ah, 03h
20 mov bh, 0
21 int 10h
22
23 ; Move the cursor to the beginning
24 mov ah, 02h
25 mov dl, 0
26 int 10h
27
28 ret
the previous string finished. Finally, the function returns to the caller
by using the instruction ret.
1 title_string db ’The Bootloader of 539kernel.’, 0
2 message_string db ’The kernel is loading...’, 0
3 load_error_string db ’The kernel cannot be loaded’, 0
The code above defines the strings that have been used previously
in the source code. Note the last part of each string: the null
character, which indicates the end of a string 48 .
Now, we have written our bootloader, and the last thing to do is
to put the magic code at the end of it. The magic code, which is a
2-byte value, should reside in the last two bytes of the first sector, that
is, in locations 510 and 511 (location numbering starts from 0);
otherwise, the firmware will not recognize the content of the sector as
a bootloader. To ensure that the magic code is written in the correct
location, we are going to fill the empty space between the
last part of the bootloader code and the magic code with zeros, which can
be achieved by the following line.
1 times 510-($-$$) db 0
2 dw 0AA55h
In NASM, $ represents the address of the current position while $$
represents the address of the beginning of the current section, so the
expression 510-($-$$) evaluates to the number of remaining bytes before
location 510. The second line stores the magic code itself, the 2-byte
value AA55h.
Implementing simple_kernel.asm
48 Exercise: What will be the behavior of the bootloader if we remove the null character
from title_string and message_string and keep it in load_error_string?
10 jmp $
11
12 print_string:
13 mov ah, 0Eh
14
15 print_char:
16 lodsb
17
18 cmp al, 0
19 je done
20
21 int 10h
22
23 jmp print_char
24
25 done:
26 ret
27
28 hello_string db ’Hello World!, From Simple Assembly 539kernel!’, 0
The only lines that you are not familiar with until now are the
first two lines under the label start, which will be explained in detail in
chapter 2. Finally, the Makefile is the following.
1 ASM = nasm
2 BOOTSTRAP_FILE = bootstrap.asm
3 KERNEL_FILE = simple_kernel.asm
4
5 build: $(BOOTSTRAP_FILE) $(KERNEL_FILE)
6 $(ASM) -f bin $(BOOTSTRAP_FILE) -o bootstrap.o
7 $(ASM) -f bin $(KERNEL_FILE) -o kernel.o
8 dd if=bootstrap.o of=kernel.img
9 dd seek=1 conv=sync if=kernel.o of=kernel.img bs=512
10 qemu-system-x86_64 -s kernel.img
11
12 clean:
13 rm -f *.o
CHAPTER 2: AN OVERVIEW OF X86 ARCHITECTURE
2.1 introduction
2.2 x86 operating modes
currently running code belongs to the kernel, then the current privilege
level will be 0, and according to it the processor is going to decide
which operations are allowed. In other words, we can say that the processor
keeps track of the current state of the running system, and
one piece of information in this state is the privilege level (or mode)
in which the system is currently running.
6 And from here came the well-known joke: “There are 10 types of people in this world,
those who understand binary and those who don’t”.
2.3 numbering systems
It is too long and tedious to work with, and for that reason the
hexadecimal numbering system can be useful. Each digit in hexadec-
imal represents four bits 8 ; that is, the number 0h in hexadecimal is
equivalent to 0000b in binary. Just as 8 bits are known as a byte,
4 bits are known as a nibble; that is, a nibble is a half byte and, as we
have just said in other words, one digit of hexadecimal represents a
nibble. So, we can use hexadecimal to represent the same memory
address value in a more elegant way.
1 00 00 00 01h
7 I think it's too brave to state this claim; however, it holds true at least for the well-
known numbering systems.
8 Can you tell why? [Hint: How the maximum hexadecimal number F is represented
in binary?]
2.4 the basic view of memory
The basic view of the main memory is that it is an array of cells; each
cell has the ability to store one byte, and it is reachable by a unique
number called a memory address 9 . The range of memory addresses starts
from 0 and goes up to some limit x; for example, if the system has 1KB of
physical main memory, then the last memory address in the range will be
1023. As we know, 1KB = 1024 bytes, and since the range starts from
0 and not 1, the last memory address in this case is 1023 and
not 1024. This range of memory addresses is known as an address space,
and it can be a physical address space, which is limited by the physical
main memory, or a logical address space. A well-known example of
using a logical address space that we will discuss in a later chapter is
virtual memory, which provides a logical address space of size 4GB in
32-bit architecture even if the actual size of the physical main memory
is less than 4GB. The address space starts from the memory
address 0, which is the index of the first cell (byte) of the memory, and
increases by 1, so the memory address 1 is the index of the second
cell of the memory, 2 is the index of the third cell, and so on.
Viewing the memory as an array of contiguous cells is also known as the
flat memory model.
When we say physical we mean the actual hardware, that is, when
the maximum capacity of the main memory hardware (RAM) is
1MB, then the physical address space of the machine is up to 1MB. On
the other hand, when we say logical, that means it doesn't necessarily
9 An architecture in which each memory address points to 1 byte is known as a byte-
addressable architecture or a byte machine. It is the most common architecture. Of course,
other architectures are possible, such as word-addressable architectures, or word machines.
represent or obey the way the actual hardware works; instead, it
is a hypothetical view of something that doesn't exist in the real world
(the hardware). To make the logical view of anything work, it should
be mapped onto the real physical view, that is, it should somehow be
translated so the physical hardware can understand it. This mapping
is handled by software or sometimes by special parts of the hardware.
Now, for the following discussion, let me remind you that a
memory address is just a numerical value; it is just a number. When I
discuss a memory address as a mere number I call it a memory address
value or the value of a memory address, while the term memory address
keeps its meaning, which is a unique identifier that refers to a specific
location (cell) in the main memory.
The values of memory addresses are used by the processor all the
time to perform its job, and when it executes instructions that
involve the main memory (e.g. reading content from some memory
location or dealing with the program counter), the related memory
address values are stored temporarily in the registers of the processor.
Due to that, the length of a memory address value is bounded by the
size of the processor's registers, so, in 32-bit environments, where the
size of the registers is usually 32 bits, the length of a memory address
value is always 32 bits. Why am I stressing "always" here? Because
even if fewer than 32 bits are enough to represent the memory address
value, it will still be represented in 32 bits. For example, take the
memory address value 1: in binary, the value 1 can be represented
by only 1 bit and no more, but in reality, when it is stored (and
handled) by a 32-bit processor, it will be stored as the following
sequence of bits.
1 00000000 00000000 00000000 00000001
As you can see, the value 1 has been represented in exactly 32 bits;
adding zeros to the left doesn't change the value itself, it is similar
to writing a number as 0000539, which is exactly 539.
It has been mentioned earlier that the size of the register that stores
memory address values in order to deal with memory contents
affects the size of main memory available to the system. Take for
example the instruction pointer register: if its size is, say, 16 bits, then
the maximum available memory for code will be 64KB (64KB = 65,536
bytes), since the last reachable memory address by the processor for
fetching an instruction is 65,535. And if the size of the instruction
pointer register is 32 bits, then the maximum available memory for
code will be 4GB. Why is that?
To answer this question let's work with decimal numbers first. If I
tell you that you have five blanks, what is the largest decimal number
you can represent in these five blanks? The answer is 99999d. In the
same manner, if you have 5 blanks, what is the largest binary number
you can represent in these 5 blanks? It is 11111b, which is equivalent
to 31d. The same holds true for the registers that store the values of
memory addresses: given a register of size 16 bits, there
are 16 blanks, and the largest binary number that can be represented
in those 16 blanks is 11111111 11111111b, or in hexadecimal FFFFh,
which is equivalent to 65535d. That means the last byte a register of
size 16 bits can refer to is byte number 65535d, because it is the
largest value this register can store, which leads to the
maximum size of main memory this register can handle: 65,536 bytes
(addresses 0 through 65,535), which is equivalent to 64KB. The same
reasoning applies to any size other than 16 bits.

2.5 x86 segmentation
For the sake of clarity, let's discuss the details of segmentation under
real mode first. We have said that logical views (of anything) should
be mapped to the physical view either by software or hardware; in this
case, the segmentation view is realized and mapped to the architecture
of the physical main memory by the x86 processor itself, that is, by
the hardware. So, we have a logical view, which is the concept of
segmentation that divides a program into separate segments, and
the actual physical main memory view, which is provided by the real
RAM hardware and sees the data as a big array of bytes. Therefore, we
need some tools to implement (map) the logical view of segmentation
on top of the actual hardware.
For this purpose, special registers named segment registers are pro-
vided in x86. The size of each segment register is 16 bits and they
are: CS, which is used to define the code segment; SS, which is used
to define the stack segment; and DS, ES, FS and GS, which can be used
to define data segments, which means each program can have up to
four data segments. Each segment register stores the value that determines
the starting memory address of a segment (in real mode, the 16-bit value
of a segment register is multiplied by 16 to produce the actual starting
address), and here you can start to observe the mapping
between the logical and physical view. In real mode, the size of each
segment is 64KB and as we have said we can reach any byte inside
a segment by using the offset of the required byte; you can see the
resemblance between a memory address in the basic view of memory
and an offset in the segmentation view of memory 10 .
Let’s take an example to make the matter clear, assume that we
have a code of some program loaded into the memory and its starting
physical memory address is 100d, that is, the first instruction of this
program is stored in this address and the next instructions are stored
right after this memory address one after another. To reach the first
byte of this code we use the offset 0, so, the whole physical address of
the first byte will be 100:0d, as you can see, the part before the colon
is the starting memory address of the code and the part after the colon
is the offset that we would like to reach and read the byte inside it. In
the same way, let’s assume we would like to reach the offset 33, which
means the byte 34 inside the loaded code, then the physical address
that we are trying to reach is actually 100:33d. To make the processor
handle this piece of code as the current code segment then its starting
memory address should be loaded into the register CS, that is, setting
the value 100d to CS, so, we can say in other words that CS contains
the starting memory address of currently executing code segment (for
short: current code segment).
10 The concept and term of offset is not exclusive to segmentation; it is used in other
topics related to the memory.
As we have said, the x86 processor always runs with the mind that
segmentation is in use. So, let's say it is executing the following
assembly instruction: jmp 150d, which jumps to the address 150d. What
really happens here is that the processor considers the value 150d an
offset instead of a full memory address, so what the instruction
requests from the processor here is to jump to the offset 150 which
is inside the current code segment. Therefore, the processor is going
to retrieve the value of the register CS to learn the starting
memory address of the currently active code segment and add the
value 150 to it. Say the value of CS is 100; then the memory address
that the processor is going to jump to is 100:150d.
This is also applicable to the internal work of the processor. Do you
remember the register IP, the instruction pointer? It actually
stores the offset of the next instruction instead of the whole memory
address of the instruction. Any call (or jump) to code inside the same
code segment as the caller is known as a near call (or jump); otherwise
it is a far call (or jump). Again, let's assume the current value of CS is
100d and you want to call a label which is at the memory location
900:1d. In this situation you are calling code that resides in a different
code segment; therefore, the processor is going to take the first part of
the address, which is 900d, load it into CS, then load the offset 1d into IP.
Because this call caused the value of CS to change, it is
a far call.
The same applies exactly to the other two types of segments,
and of course, instructions deal with different segment types based
on their functionality. For example, you have seen that jmp and call
deal with the code segment in CS, and that's because their functionality
is related to code. Another example is the instruction lodsb, which
deals with the data segment DS; the instruction push deals with the
stack segment SS, and so on.
Here, we told the processor that the data segment of our program
(the bootloader) starts at the memory address 07C0h 11 , so, if we refer
11 Yes, all segments can be at the same memory location, that is, there can be one 64KB
region of memory which is considered the currently active code segment, data segment
and stack segment at once. We have already mentioned that when we discussed how to
implement the flat memory model on x86.
to the memory to read or write data, the processor starts with the
memory address 07C0h, which is stored in the data segment register ds,
and then adds the offset that we are referring to. In other words,
any reference to data by the code being executed makes the proces-
sor use the value in the data segment register as the beginning of the
data segment and the offset of the referred data as the rest of the address;
after that, the resulting physical memory address of the referred data is
used to perform the instruction. An example of an instruction that deals
with data in our bootloader is the line mov si, title_string.
Now assume that BIOS has set the value of ds to 0 (it can be
any other value) and jumped to our bootloader. That means the data
segment in the system now starts from the physical memory address
0 and ends at the physical memory address 65535, since the maximum
size of a segment in real mode is 64KB. Now let's take the label
title_string as an example and assume that its offset in the
binary file of our bootloader is 490. When the processor starts to
execute the line mov si, title_string 12 it will, somehow, figure out
that the offset of title_string is 490, and based on the way that x86
handles memory accesses, the processor is going to think that we are
referring to the physical memory address 490 since the value of ds is 0.
But in reality, the correct physical memory address of title_string is
the offset 490 inside the memory address 07C0h, since our bootloader
is loaded at this address and not at the physical memory address 0. So,
to be able to reach the correct addresses of the data that we have
defined in our bootloader, which is loaded with the bootloader
starting from the memory address 07C0h, we need to tell the processor
that our data segment starts from 07C0h and that, with any reference to
data, it should calculate the offset of that data starting from this physical
address. That is exactly what these two lines do; in other words, they
change the current data segment to another one which starts from the
first place of our bootloader.
The second use of segments in the bootloader is when we load
the kernel from the disk by using the BIOS service 13h:02h in
the following code.
1 mov ax, 0900h
2 mov es, ax ; ES = 0900h, the segment to load the kernel into
3
4 mov ah, 02h ; BIOS service 02h: read sectors from drive
5 mov al, 01h ; number of sectors to read
6 mov ch, 0h ; cylinder number
7 mov cl, 02h ; sector number (the kernel is in sector 2; sector 1 is the bootloader)
8 mov dh, 0h ; head number
9 mov dl, 80h ; drive number (80h = first hard disk)
10 mov bx, 0h ; offset within ES to load the content into
11 int 13h ; call BIOS disk services
12 Which loads the memory address (offset) of title_string into the register si.
You can see here that we have used the other segment register ES to define
a new data segment that starts from the memory address 0900h. We
did that because the BIOS service 13h:02h loads the required content
(in our case the kernel) to the memory address ES:BX; for that, we
have defined the new data segment and set the value of bx to 0h.
That means the code of the kernel will be loaded at 0900:0000h,
and because of that, after loading the kernel successfully we have
performed a far jump.
1 jmp 0900h:0000
here and in Intel’s official x86 manual the term field is used when the
size of the value that should be stored in the descriptor is more than
1 bit, for example the segment’s starting memory address is stored in
4 bytes, then the place where this address is stored in the descriptor
is called a field, otherwise when the term flag is used that means the
size of the value is 1 bit.
15 Reminder: In protected mode, the corresponding segment register stores the selector
of the currently active segment.
16 And it should, since segmentation is enabled by default in x86 and cannot be disabled.
17 Remember our discussion of the difference between our logical view of the memory
(e.g. segmentation) and the actual physical hardware
18 We can see here how obvious the mapping between the logical view of the memory
and the real-world memory is.
19 Don’t worry about paging right now. It will be discussed later in this book. All you
need to know now is that paging is another logical view of the memory. Paging is
disabled by default in x86 which makes it an optional feature unlike segmentation.
For example, assume hypothetically that the running code has the
privilege to read data from data segment A, and that in the physical memory
another data segment B is defined right after the limit of A, which
means if we can exceed the limit of A we will be able to access the data
inside B, which is a critical data segment that stores the kernel's internal
data structures, and we don't want any code to read from it or write
to it unless that code is privileged to do so. This can be achieved
by specifying the limit of A correctly: when the unprivileged
code maliciously tries to read from B by generating a logical memory
address with an offset that exceeds the limit of A, the processor
prevents the operation and protects the content of segment B.
The limit, or in other words the size, of a given segment is stored
in the 20-bit segment limit field of that segment's descriptor, and how
the processor interprets the value of the segment limit field depends
on the granularity flag (G flag), which is also stored in the segment's
descriptor. When the value of this flag is 0, the value of the limit
field is interpreted in bytes; let's assume that the limit of a given
segment is 10 and the value of the granularity flag is 0, which means the
size of this segment is 10 bytes. On the other hand, when the value of
the granularity flag is 1, the value of the segment limit field is interpreted
in 4KB units; for example, assume in this case that the value of the limit
field is also 10 but G flag = 1. That means the size of the segment will
be 10 4KB units, that is, 10 * 4KB, which gives us 40KB, or
40960 bytes.
Because the size of the segment limit field is 20 bits, it can hold
2^20 = 1,048,576 distinct values, which
means that if the G flag equals 0 then the maximum size of a segment
is 1,048,576 bytes, which equals 1MB, and if the G flag equals 1 then
the maximum size of a segment is 1,048,576 4KB units,
which equals 4GB.
Getting back to the structure of the descriptor: bytes 2, 3 and 4 of
the descriptor store the least significant bytes of the segment's base address
and byte 7 of the descriptor stores the most significant byte of the
base address, for a total of 32 bits for the base address. Bytes 0 and
1 of the descriptor store the least significant bytes of the segment's limit,
and the low four bits of byte 6 store the most significant bits (bits 16
to 19) of the limit. The granularity flag is stored in the most significant
bit of byte 6 of the descriptor.
Before finishing this subsection, we need to define the meaning
of least significant and most significant byte or bit. Take for example
the following binary sequence, which may represent anything from a
memory address value to a UTF-32 character.
0111 0101 0000 0000 0000 0000 0100 1101
The first bit from the left, whose value is 0, is, based on its position in
the sequence, called the most significant bit or high-order bit, while the
last bit on the right, whose value is 1, is known as the least significant
bit or low-order bit.
The same terms can be used at the byte level, given the same sequence.
0111 0101 0000 0000 0000 0000 0100 1101
The first byte (8 bits) on the left, whose value is 0111 0101, is known
as the most significant byte or high-order byte, while the last byte on the
right, whose value is 0100 1101, is known as the least significant byte or
low-order byte.
Now, imagine that this binary sequence is the base address of a
segment; then the least significant 3 bytes of it will be stored in bytes
2, 3 and 4 of the descriptor, that is, the following binary sequence.
1 0000 0000 0000 0000 0100 1101
While the most significant byte of the binary sequence will be stored
in the 7th byte of the descriptor, that is, the following binary sequence.
1 0111 0101
When the segment is a code segment, the second most significant bit
(tenth bit) is called the conforming flag (also called the C flag) while the
third most significant bit (ninth bit) is called the read-enabled flag (also
called the R flag). Let's start our discussion with the simpler of these two
flags, the read-enabled flag. The value of this flag indicates
how the code inside the segment in question can be used: when the
value of the read-enabled flag is 1 20 , the content of the code
segment can be executed and read from, but when the value of this
flag is 0 21 , the content of the code segment can only be
executed and cannot be read from. The former option can be useful when
the code contains data inside it (e.g. constants) and we would like to
provide the ability to read this data. When read is enabled for the
segment in question, the selector of this segment can also be loaded
into one of the data segment registers 22 .
The conforming flag is related to the privilege levels that we
overviewed previously in this chapter. When a segment
is conforming, in other words, when the value of the conforming flag is 1,
code which runs in a less-privileged level can call this
segment, which runs in a more-privileged level, while keeping the
current privilege level of the environment the same as the caller's
instead of the callee's.
20 Which means do enable read, since 1 is equivalent to true in the context of flags.
21 Which means don’t enable read.
22 Which makes sense, enabling reads from a code segment means it contains data also.
For example, let’s assume for some reason a kernel’s designer de-
cided to provide simple arithmetic operations (such as addition and
subtraction) for user applications from the kernel code, that is, there
is no other way to perform these operations in that system but this
code which is provided by the kernel. As we know, kernel’s code
should run in privilege level 0 which is the most-privileged level,
and let’s assume a user application which runs in privilege level 3, a
less-privileged level, needs to perform an addition operation, in this
case a kernel code, which should be protected by default from being
called by less-privileged code, should be called to perform the task,
this can only realized if the code of addition operation is provided as
a conforming segment, otherwise the processor is going to stop this
action where a less-privileged code calls a more-privileged code.
Also you should note that the code of addition operation is going to
run in privilege level 3 although it is a part of the kernel which runs in
privilege level 0 and that’s because of the original caller which runs in
the privilege level 3. Furthermore, although conforming segment can
be called by a less-privilege code (e.g. user application calls the kernel),
the opposite cannot be done (e.g. the kernel calls a user application’s
code) and the processor is going to stop the operation.
When the segment is a data segment, the second most significant bit
(tenth bit) is called the expansion-direction flag (also called the E flag) while
the third most significant bit (ninth bit) is called the write-enabled flag
(also called the W flag). The latter gives us the ability to make a
data segment read-only when its value is 0, or to make a data
segment both writable and readable by setting the value of the write-
enabled flag to 1.
While the expansion-direction flag and the need for it will be examined
in detail when we discuss the x86 run-time stack in this chapter, what
we need to know right now is that when the value of this flag is 0, the
data segment expands up (in Intel's terms), but when the
value of this flag is 1, the data segment expands down (in
Intel's terms).
A last note about data segments is that all of them are non-conforming,
that is, less-privileged code cannot access a data segment in a more-
privileged level. Furthermore, all data segments can be accessed by
more-privileged code.
The special register GDTR stores the base physical address 26 of the
global descriptor table, that is, the starting point of the GDT table. Also,
the same register stores the limit (or size) of the table.
To load a value into the register GDTR, the x86 instruction lgdt, which
stands for load global descriptor table, should be used. This instruction
takes one operand, which is the whole value that should be loaded
into GDTR; the structure of this value should be similar to the structure
of GDTR itself, which is shown in figure 8. The figure shows that the
total size of GDTR is 48 bits divided into two parts. The first part starts
from bit 0 (the least significant bit) to bit 15; this part contains the
limit of the GDT table that we would like to load. The size of this part
of the GDTR register is 16 bits, which can represent the value 65,535 at
maximum, meaning that the maximum size of the GDT table is 64KB
(65,536 bytes, since the stored limit is the size of the table minus one).
As we know, the size of each descriptor is 8 bytes, which means the GDT
table can hold 8,192 descriptors at most. The second part of GDTR starts
from bit 16 to bit 47 (the most significant bit) and stores the base memory
address of the GDT table that we would like to load.
register that should contain the segment selector 27 , in GDT, of the LDT
table that we would like to use; in other words, the index of the segment
descriptor which describes the LDT table and which resides in GDT as an
entry should be loaded into the LDTR register.
Segment Selector
which is the privilege level of a given segment. The third value which
contributes to the privilege level checks in x86 is the requester privilege level
(RPL), which is stored in the segment selector; necessarily, RPL has four
possible values: 0, 1, 2 and 3.
To understand the role of RPL, let's assume that a process X is running
in user mode, that is, in privilege level 3. This process is a malicious
process that aims to gain access to some important kernel data
structure. At some point in time, process X calls code in the kernel
and passes the segment selector of the more-privileged data segment
to it as a parameter. The kernel code runs in the most privileged level
and can access all privileged data segments by simply loading the
required data segment selector into the corresponding segment register;
in this case the RPL is set to 0 maliciously by process X. Since the
kernel runs in privilege level 0 and RPL is 0, the segment selector
requested by process X will be loaded, and the malicious process X will
be able to gain access to the data segment that holds the sensitive data.
To solve this problem, RPL should be set by the kernel to the requester's
privilege level before loading the required data segment. In our example,
the requester (the caller) is process X and its privilege level is 3,
while the current privilege level is 0 since the kernel is running. But
because the caller has a less-privileged level, the kernel should set
the RPL of the required data segment selector to 3 instead of 0. This
tells the processor that while the currently running code is in privilege
level 0, the code that called it was running in privilege level 3, so
any attempt to reach a segment whose selector's RPL is larger than the
CPL allows should be denied; in other words, the kernel should not reach
privileged segments on behalf of process X. The x86 instruction arpl
can be used by the kernel's code to change the RPL of a segment
selector that has been requested by less-privileged code to
the privilege level of the caller, as in the example of process X.
2.6 x86 run-time stack

A user application starts its life as a file stored on the user's hard disk. At
this stage it does nothing; it is just a bunch of binary numbers that
represent the machine code of this application. When the user decides
to use this application and opens it, the operating system loads the
application into memory, and at this stage the user application
becomes a process. We mentioned before that the term "process" is
used in operating systems literature to describe a running program;
another well-known term is task, which is used by the Linux kernel and
has the same meaning.

Typically, the memory of a process is divided into multiple regions,
and each one of them stores a different kind of the application's data;
one of those regions stores the machine code of the application in the
memory. There are also two other important regions of a process' memory:
the first one is known as the run-time heap (or just heap for short), which
provides an area for dynamically allocated objects (e.g. variables); the
second one is known as the run-time stack (or stack for short). It's also
known as the call stack, but we are going to stick to the term run-time stack
in our discussions. Please note that the short names of the run-time stack
(that is, stack) and the run-time heap (that is, heap) are also the names of
data structures. As we will see shortly, a data structure describes a
way of storing data and how to manipulate this data, while in our
current context these two terms represent memory regions
of a running process, although the stack (as a memory region) uses the stack
data structure to store its data. Due to that, here we use the more
accurate term run-time stack to refer to the memory region and stack to
refer to the data structure.
The run-time stack is used to store the values of local variables and
function parameters. We can say that the run-time stack is a way to
implement function invocation, which describes how function A can
call function B, pass some parameters to it, return back to the same
point of code where function A called function B, and finally get the
returned value from function B. The implementation details of these
steps are known as a calling convention, and the run-time stack is one way
of realizing these steps. There are multiple known calling conventions
for x86; different compilers and operating systems may implement
different calling conventions. We are not going to cover those different
methods, but what we need to know is that, as we said, those different
calling conventions use the run-time stack as a tool to realize function
invocation. The memory region in x86 which is called the run-time stack
uses a data structure called a stack to store the data inside it and to
manipulate that data.
structure are two: push and pop 29 . The first one puts some value on
the top of the stack, which means that the top of the stack always contains
the last value that has been inserted (pushed) into the stack. The latter
operation, pop, removes the value which resides on the top of the stack
and returns it to the user, which means the most recent value that has
been pushed onto the stack will be fetched when we pop the stack.
Let’s assume that we have the string ABCD and we would like to
push each character separately into the stack. First we start with the
operation push A which puts the value A on the top of the stack, then
we execute the operation push B which puts the value B on top of the
value A as we can see in the figure 10, that is, the value B is now on the
top of the stack and not the value A, the same is going to happen if
we push the value C next as you can see in the figure 11 and the same
for the value of D and you can see the final stack of these four push
operations in figure 12.
Now let’s assume that we would like to read the values from this
stack, the only way to read data in stack data structure is to use the
operation pop which, as we have mentioned, removes the value that
resides on the top of the stack and returns it to the user, that is, the
stack data structure in contrary of array data structure 30 doesn’t have
the property of random access to the data, so, if you want to access any
29 That doesn’t mean no more operations can be defined for a given data structure in the
implementation level. It only means that the conceptual perspective for a given data
structure defines those basic operations which reflect the spirit of that data structure.
Remember that when we start to use x86 run-time stack with more operations than
push and pop later, though those other operations are not canonical to the stack data
structure, but they can be available if the use case requires that (and yes they may
violate the spirit of the given data structure! We will see that later).
30 Which is implemented by default in most major programming languages and known
as arrays (in C, for example) or lists (as in Python).
data in the stack, you can only use pop to do that. That means if you
want to read the first value pushed onto the stack, then you need to pop
the stack n times, where n is the number of elements pushed onto the
stack, in other words, the size of the stack.
In our example stack, to be able to read the first pushed value, which
is A, you need to pop the stack four times. The first pop removes the
value D from the top of the stack and returns it to the user, which puts
the value C on the top of the stack as you can see in figure 11, and if we
execute pop once again, the value C will be removed from the top of
the stack and returned to the user, which puts the value B on the
top of the stack as you can see in figure 10. So we need to pop the stack
two more times to get the first pushed value, which is A. This example
makes it obvious why the stack data structure is described as a
first-in-last-out data structure.
The stack data structure is one of the most basic data structures in
computer science and there are many well-known applications that
use a stack to solve a specific problem in an easy manner. To take
an example, let's get back to our example of pushing ABCD onto a stack
character by character and then popping the characters back: the input
is ABCD but the output of the pop operations is DCBA, which is the reverse
of the input string. So, the stack data structure can be used to solve the
problem of reversing a string by just pushing the string onto a stack
character by character and then popping this stack, concatenating
each returned character with the previously returned characters, until
the stack becomes empty. Other problems that can be solved easily
with a stack are the palindrome problem and the parenthesis matching
problem, which is an important one for the part of programming language
compilers and interpreters known as the parser.
As you can see in this brief explanation of the stack data structure,
we haven't mentioned any implementation details, which means that a
specific data structure is an abstract concept that describes a high-level
idea where the low-level details are left to the implementer.
31 Some programming languages, especially those derived from Algol, differentiate
between a function, which should return a value to the caller, and a procedure,
which shouldn't return a value to the caller.
32 We claim that for the purpose of explanation. But actually, the matter of a separate
run-time stack for each process is a design decision that the operating system kernel's
programmer/designer is responsible for.
keep ESP pointing to the top of the stack. Decrementing the value of
ESP means that newly pushed items are stored at lower memory
locations than the previous ones, and that means the run-time stack in
x86 grows downward in memory.
When we need to read the value on the top of the stack and remove
this value from the stack, the x86 instruction pop can be used, which
is going to store the value (which resides on the top of the stack) in the
location specified by its operand; this location can be a register or a
memory address. After that, the pop operation increments the value of ESP,
so the top of the stack now refers to the previous value. Note that the pop
instruction only increments ESP to get rid of the popped value and
doesn't clear it from memory by, for example, writing zeros in its place,
which is better for performance. This is one of the reasons
why, when you refer to some random memory location, for example
through C pointers, you may see some weird value that you probably don't
remember ever storing in memory; once upon a time,
this value may have been pushed onto the run-time stack and its
frame has since been removed. The same practice is also used in modern
filesystems for the sake of performance: when you delete a file, the
filesystem doesn't actually write zeros over the original
content of the file; instead, it just marks its location as free space on
the disk, and maybe some day this location is used to store another
file (or a part of it), and only then is the content of the deleted file
actually cleared from the disk.
Let's get back to the x86 run-time stack. To make clear how
push and pop work, let's take an example. Assume that the current
memory address of the top of the stack (ESP) is 102d and we execute the
instruction push A, where A is a character encoded in UTF-16, which
means its size is 2 bytes (16 bits), and it is represented in hexadecimal
as 0x0410. By executing this push instruction, the processor is going to
subtract 2 from ESP (because we need to push 2 bytes onto the stack),
which gives us the new memory location 100d, then the processor
stores the first byte of UTF-16 A (0x04) in the location 100d and the
second byte (0x10) in the location 101d 33 . The value of ESP will be
changed to 100d, which now represents the top of the stack.
When we need to pop the character A from the top of the stack, both
bytes should be read and ESP should be incremented by 2. In this case,
the new memory location 100d can be considered a starting location
of the data because it doesn't store the whole value of A but only a part
of it; the case where the new memory location is not considered a
starting memory location is when the newly pushed value is stored
as a whole in the new memory location, that is, when the size of this
value is 1 byte.
33 In fact, x86 is a little-endian architecture, which means that 0x10 will be stored in the
location 100d while 0x04 will be stored in the location 101d, but I've kept the example
in the main text as is for the sake of simplicity.
Figure 14: Run-time Stack After Jumping to Function B Code and Creating
B’s Stack Frame
ory address of the top of the stack) to the register EBP, but before that,
we should not lose the previous value of EBP (the starting memory
address of the caller's stack frame); this value will be needed when
the callee B finishes. So, the function B should push the value of EBP
onto the stack, and only after that can it change EBP to the value of ESP,
which creates a new stack frame for function B. At this stage, both EBP
and ESP point to the top of the stack, and the value stored at
the top of the stack is the memory address of the previous EBP, that
is, the starting memory location of A's stack frame. Figure 14 shows
the run-time stack at this stage. The following code shows the initial
instructions that a function should perform in order to create a new
stack frame, as we just described.
1 B:
2 push ebp
3 mov ebp, esp
4
5 ; Rest of B’s Code
Now, the currently running code is function B with its own stack
frame, which contains nothing. Depending on B's code, new items
can be pushed onto the stack, and as we have said before, the local
variables of the function are pushed onto the stack by the function
itself. As you know, x86's protected mode is a 32-bit environment, so
the values that are pushed onto the stack through the instruction push
are of size 4 bytes (32 bits).
Pushing a new item will make the value of ESP change, but EBP
remains the same until the current function finishes its work. This
makes EBP very useful when we need to reach the items that are stored
in the previous function's stack frame (in our case A's), for example the
parameters, or even the items that are in the current function's stack
frame but not on the top of the stack; as you know, in this case
pop cannot be used without losing other values. Instead, EBP can be
used as a reference to the other values. Let's take an example of that,
2.6 x86 run-time stack 69
stack frame. After that, the top of the stack contains the return
memory address which should be loaded into EIP so we can resume
the execution of the caller A; that can be done by using the x86
instruction ret, which pops the stack to get the return address then
loads EIP with this value. Finally, when A gains the control again,
it can deallocate the parameters of B to save some memory by just
popping them. The method that we have described to deallocate the
whole stack frame or the parameters is the standard way, but it's not
the one widely used in practice, for multiple reasons. One of these
reasons is that pop needs a place to store the popped value; this place
can be a register or a memory location, but what we really need is just to
get rid of these values, so storing them in another place is a waste of
effort. In order to explain the other way of deallocating some items
from the stack, consider the following code:
1 sub esp, 4
2 mov dword [esp], 539
This code is equivalent to push 539; it does exactly what push does:
first it subtracts 4 from the top of the stack's memory address to get a
new memory location to store the new value in, then it stores the
value in this location. The reverse operation is performed with pop as
the following, which is equivalent to pop eax.
1 mov eax, [esp]
2 add esp, 4
As you can see, to get rid of the popped value, only the memory
address of the top of the stack has been changed. Since every item on
the stack is of size 4 bytes, adding 4 to ESP makes it point to the item
which is exactly above the current one in the stack. So, if we need to
get rid of the value on the top of the stack without reading it and
storing it somewhere else, we can simply use the instruction
add esp, 4. What if we want to get rid of the value on the top of the
stack and the value before it? The total size of both of them is 8 bytes,
so add esp, 8 will do what we want. This technique is applicable to
both deallocating B's stack frame and deallocating its parameters from A's
stack frame. For the former, there is an even better technique: in order to
deallocate the stack frame of the current function we can simply do
the following: mov esp, ebp, that is, move the memory address stored in
EBP to ESP, which restores the state from when the callee B had just
started. The following is the last part of B, which deallocates its own
stack frame and returns to A.
1 B:
2 ; Previous B’s Code:
3 ; Creating new Stack Frame
4 ; Pushing Local variable
5 ; The Rest of Code
6
2.6 x86 run-time stack 71
When we explained how the x86 instructions push and pop work, we
claimed that the x86 run-time stack grows downward, so, what does
growing downward or upward exactly mean? Simply, when we said
that the x86 run-time stack grows downward we meant that the older
items of the stack are stored at larger memory addresses while the more
recent ones are pushed onto smaller memory addresses. For example,
starting from the memory address 104d, let's assume we have pushed
the value A and after that we pushed the value B, then A's memory location
will be 104d while B's memory location will be 100d, so new values
will always be pushed below the old ones in memory.
Figure 16: A New Item Pushed Into a Stack that Grows Downward
What makes us claim that, for instance, the address 100d is at the
bottom of 104d instead of the other way around is how we visualize
the run-time stack inside the main memory. Let's look at figure
15, which shows a run-time stack that contains three items M, R and A,
all of them 4 bytes in size; on the right side of the figure we can
see the starting memory address of each item. As we can see, in this
visualization of the run-time stack, the smaller memory addresses are
at the bottom and the larger memory addresses are at the top.
From the figure we can see that the value of ESP is 8d 34 . Let's assume
that we would like to run the instruction push C on this run-time stack;
as we have mentioned before, the instruction push of x86 is going to
decrease the value of ESP by a size decided by the architecture (4
bytes in our case) in order to get a new starting memory address for
the new item. So, push is going to subtract 4d (the size of the pushed
item C in bytes) from 8d (the current value of ESP), which gives us the new
starting memory location 4d for the item C. If we visualize the run-time
stack after pushing C it will be the one in figure 16, and we
can see, given the way the push instruction works, that the
stack grew downward by pushing the new item C at the bottom. So,
according to this visualization of the run-time stack, which puts larger
memory addresses on the top and smaller ones on the bottom, we can say
the x86 run-time stack grows downward by default.
This visualization is just one way to view how the run-time stack
grows, which means there may be other visualizations, and the most
obvious one is the reverse of the one that we just described, putting the
smaller addresses on the top and the larger addresses on the bottom
as shown in figure 17; you can note that, in contrast to figure 16, the
smallest address 4d is on top, so, based on this visualization the stack
grows upward! Actually, this latter visualization of the run-time stack
is the one used in Intel's manual, where the term expand-up is
used to describe the direction of stack growth.
To sum it up, the direction in which the run-time stack grows (down
or up) depends on how you visualize the run-time stack, as in
figure 16 or as in figure 17. In our discussion in this book we are going
34 As a reminder, don't forget that all these memory addresses are actually offsets inside
a stack segment and not whole memory addresses.
35 Many other books actually use the first visualization, as I recall, and for that reason I
chose it in this book. To the best of my knowledge, the only reference that
I've seen that depends on the second visualization is Intel's manual.
36 The previous values of EBP and EIP. Also the application programmer may store
memory addresses of local variables in the stack (e.g. by using pointers in C).
37 As you may recall, the size of the segment can be decided by the base address of the
segment and its limit as specified in the segment’s descriptor.
Figure 19: Process X’s Run-time Stack After Resize (Grows Downward)
Before going any further with our discussion, let's look at figure
18, which represents a snapshot of process X's run-time stack when it
became full. We can see from the figure that X's stack segment starts
from the physical memory address 300d (the segment's base address) and
ends at the physical memory address 250d; also, the items of the run-time
stack are referred to based on their offsets inside the stack segment.
We can see that a bunch of values have been pushed onto the stack;
some of those values are shown in the figure and some others are
omitted and replaced by dots, which means that there are more values
in those locations. Normal values are called "some value" in the
figure and the last pushed value on the stack is the value Z. Also, a
value which represents a logical memory address has been pushed
onto the stack; more accurately, this value represents an offset within
the current stack segment, since a full logical memory address actually
consists of both an offset and a segment selector, as we explained
earlier in this chapter when we discussed address translation. But for
the sake of simplicity, we are going to call this stored value a "memory
address" or "memory location" in our current explanation. As we
explained earlier, all memory addresses that processes work with
are logical and not physical. The value 26d is a local variable P of
pointer type (as in C) which points to another local variable R that has
the value W and is stored in the memory location 26d.
Figure 19 shows X's stack after the resize; as you can see, we have got our
new free space of 10 bytes, and because the stack grows downward,
the new free space should be added at the bottom of the stack to be
2.6 x86 run-time stack 75
value of ESP anymore, because as you can see from the two figures
20 and 21, the memory address 50d represents the top of the stack in
both stacks. The same holds true for the stack item which stores the
memory address 20d; we don't need to update it because the value W
is still at the same memory address (offset) and can still be pointed to by
the memory address 20d. So, we can say that choosing the direction of
run-time stack growth to be upward instead of downward easily
solves the problem of stored memory addresses becoming wrong after
resizing the run-time stack 38 , and that's the case when we use segmentation as
a way of viewing the memory.
38 Actually, the well-known stack overflow vulnerability in x86 is also caused by the stack
growing downward and can be avoided easily in stacks that grow upward!
2.7 x86 interrupts 77
rupt, the processor can resume the process which was running before
the interrupt occurred.
One example of the usage of interrupts in this low-level environment
is the system timer. At the hardware level, there could be a system
timer which interrupts the processor every X period of time, and
this type of interrupt is the one that makes multitasking possible on
uniprocessor systems. When a processor is interrupted by the system
timer, it can call the kernel, which can switch the currently running
process for another one; this operation is known as scheduling and its
goal is to distribute the processor's time among the multiple processes
in the system.
Another example of using interrupts is when the user of an op-
erating system presses some keys on the keyboard. These key-press
events should be sent to the kernel, which is going to
delegate them to the device driver 39 of the keyboard to handle them in a
proper way; in this case, with each key press, the keyboard is going to
interrupt the processor and request that these events be handled.
In x86, both hardware and software can interrupt the processor;
the system timer and the keyboard are examples of hardware interrupts, while
a software interrupt can be caused by using the x86 instruction int, which
we used when we wrote our bootloader. The operand of this
instruction is the interrupt number; for example, in our bootloader
we used the line int 10h. In this case, the interrupt
number is 10h (16d) and when the processor is interrupted by this
instruction, it is going to call the handler of interrupt number 10h.
Software interrupts can be used to implement what are known as system
calls, which provide a way for user applications to call specific
kernel code that gives the applications some important services, such
as manipulating the filesystem (e.g. reading or writing files, creating
new files or directories, etc.) or creating a new process and so on, in a
way that resembles the one that we used to call BIOS services.
In addition to interrupts, exceptions can be considered another
type of event which also stops the processor's current job
temporarily, makes it handle the event, and then lets it resume its job after that.
The main difference between exceptions and interrupts in x86 is that
the former occur when an error happens in the environment; for
example, when some code tries to divide a number by zero, an
exception will be generated and some handler should do something
about it. We can perceive the exceptions of x86 as analogous to the exceptions of
some programming languages such as C++ and Java.
39 That's why some kernel designs, especially monolithic kernels, keep the device
drivers as a part of the kernel.
Figure 22: Gate Descriptor Structure for Interrupt and Trap Gates
In the same way as with the GDT, we should tell the processor where the IDT resides
in memory, and that can be done with the instruction lidt,
which stands for load IDT. This instruction works like lgdt: it takes an
operand and loads it into the register IDTR, which will be used later by
the processor to reach the IDT.
The structure of IDTR is the same as GDTR's: its size is 48 bits and it's
divided into two parts. The first part represents the size of the IDT in
bytes, that is, the IDT's limit; this field starts from bit 0 of IDTR and
ends at bit 15. Starting from bit 16 up to bit 47, the base linear address
40 where the IDT resides should be set.
40 As we have mentioned multiple times, in our current case, where paging is
disabled, a linear address is the same as a physical address.
3
CHAPTER 3: THE PROGENITOR OF 539KERNEL
3.1 introduction
Up to this point, we have created a bootloader for 539kernel that loads
a simple assembly kernel from the disk and gives it the control. Fur-
thermore, we have gained enough knowledge of the x86 architecture's
basics to write the progenitor of 539kernel which is, as we have said,
a 32-bit x86 kernel that runs in protected-mode. In x86, to be able
to switch from real-mode to protected-mode, the global descriptor
table (GDT) should be initialized and loaded first. After entering
protected-mode, the processor will be able to run 32-bit code, which
gives us the chance to write the rest of the kernel's code in C and use
some well-known C compiler (we are going to use GNU GCC in this
book) to compile the kernel's code to a 32-bit binary file. When our
code runs in protected-mode, the ability to reach BIOS services will
be lost, which means that printing text on the screen by using a BIOS
service will not be available to us. Although printing to
the screen is not an essential part of a kernel, we need it to check
whether the C code is really running, by printing some text once
the C code gains control of the system. Instead of using BIOS to
print text, we need to use the video memory to achieve this goal in
protected-mode, which introduces us to a graphics standard known as
video graphics array (VGA).
The final output of this chapter will be the progenitor of 539kernel,
which has a bootloader that loads a kernel consisting of two parts.
The first part is called the starter, which is written in assembly and will
be represented by a file called starter.asm; this part initializes and
loads the GDT table, then it is going to change the operating mode
of the processor from real-mode to protected-mode, and finally it
is going to prepare the environment for the C code of the kernel,
which is the second part (we are going to call this part the main kernel
code, or main kernel in short) and will be represented by a file called
main.c; it is going to gain the control from the starter after the latter
finishes its work. In this early stage, the C code will only contain an
implementation of a print function and it is going to print some text
on the screen; in later stages, this part will contain the main code
of 539kernel.
3.2 the basic code of the progenitor 82
As you can see, a linker, ld, is now used to group the object files
which have been generated by the compiler and the assembler. The
linker needs a script which tells it how to organize the content of the
binary file 539kernel.elf that will be generated by the linker. The
name of the file should be linker.ld, as shown in the arguments
of the command. The following is the content of this file 1 .
SECTIONS
{
    .text 0x09000 :
    {
        code = .; _code = .; __code = .;
        *(.text)
    }

    .data :
    {
        data = .; _data = .; __data = .;
        *(.data)
        *(.rodata)
    }

    .bss :
    {
        bss = .; _bss = .; __bss = .;
        *(.bss)
    }

    end = .; _end = .; __end = .;
}
1 The script is based on the one which is provided in "JamesM's kernel development tutorials" (http://www.jamesmolloy.co.uk/tutorial_html/1.-Environment%20setup.html)
The first one, as is obvious from its name, indicates the number
of sectors that we would like our bootloader to load from the disk. The
current value is 15d, which means 7.5KB from the disk will be loaded
into memory; if the kernel's binary size becomes larger than 7.5KB we
can simply modify the value of this label to increase the number of
sectors to load.
The second label indicates the number of the sector that we are go-
ing to load next. As you know, sector 1 of the disk contains the
bootloader (if sector numbering starts from 1), and based on our
arrangement in the Makefile of 539kernel, the code of the kernel will
be there starting from sector 2 of the disk; therefore, the initial
value of the label curr_sector_to_load is 2. The modified version
of load_kernel_from_disk which loads more than one sector is the
following.
load_kernel_from_disk:
    mov ax, [curr_sector_to_load]
    sub ax, 2
    mov bx, 512d
    mul bx
    mov bx, ax

    mov ax, 0900h
    mov es, ax

    mov ah, 02h
    mov al, 1h
    mov ch, 0h
    mov cl, [curr_sector_to_load]
    mov dh, 0h
    mov dl, 80h
    int 13h

    jc kernel_load_error

    sub byte [number_of_sectors_to_load], 1
    add byte [curr_sector_to_load], 1
    cmp byte [number_of_sectors_to_load], 0

    jne load_kernel_from_disk

    ret
The starter is the first part of 539kernel that runs right after the
bootloader, which means that the starter runs in a 16-bit real-mode
environment, exactly the same as the bootloader, and due to that we are
going to write the starter in assembly language instead of C;
that's because most modern C compilers don't support 16-bit
code. Furthermore, when a specific low-level instruction is needed
(e.g. lgdt), there is no way to call this instruction in native C; instead,
assembly language should be used.
The main job of the starter is to prepare the proper environment
for the main kernel to run in. To do that, the starter switches the
current operating mode from real-mode to protected-mode, which,
as we have said earlier, gives us the chance to run 32-bit code. Before
switching to protected-mode, the starter needs to initialize and load
the GDT table and set up the interrupts. Furthermore, to be able to use
the video memory correctly in protected-mode, a proper video mode
should be set; we are going to discuss the matter of video in more
detail later in this chapter. After finishing these tasks, the starter
will be able to switch to protected-mode and give the control to the
main kernel. Let's start with the prologue of the starter's code, which
reflects the steps that we have just described.
1 bits 16
2 extern kernel_main
3
4 start:
5 mov ax, cs
6 mov ds, ax
7
8 call load_gdt
9 call init_video_mode
10 call enter_protected_mode
11 call setup_interrupts
12
13 call 08h:start_kernel
The code of the starter begins from the label start; from now on
I'm going to use the term routine for any callable assembly label 2 .
You should be familiar with most of this code. As you can see,
the routine start begins by setting the proper memory address of the
data segment depending on the value of the code segment register
cs 3 , which is going to be the same as the beginning of the starter's code.
After that, the four steps that we have described are divided into four
routines that we are going to write during this chapter; these routines
are going to be called sequentially. Finally, the starter performs a far
jump to the code of the main kernel. But before examining the details
of those steps, let's stop at the first two lines of this code, which could be
new to you.
1 bits 16
2 extern kernel_main
The first line uses the directive bits, which tells NASM that the code
that follows this line is 16-bit code; remember, we are in a 16-bit
real-mode environment, so our code should be 16-bit code. You
may wonder, why didn't we use this directive in the bootloader's
code? The main reason for that is how NASM works: when you tell NASM
to generate the output in a flat binary format 4 , it is going to consider
the code 16-bit code by default, unless you use the bits directive to
tell NASM otherwise, for example bits 32 for 32-bit code or bits 64
for 64-bit code. But in the case of the starter, NASM is required
to assemble it as ELF32 instead of flat binary; therefore, the 16-bit
code should be marked so that NASM assembles it as 16-bit code and
not 32-bit code, which is the default for ELF32.
The second line uses the directive extern which tells NASM that
there is a symbol 5 which is external and not defined in any place
in the current code (for example, as a label) that you are assembling,
so, whenever the code that you are assembling uses this symbol,
don’t panic, and continue your job, and the address of this symbol
will be figured out later by the linker. In our situation, the symbol
kernel_main is the name of a function that will be defined as a C code
in the main kernel code and it is the starting point of the main kernel.
As I've said earlier, the stuff that is related to interrupts will
be examined in another section of this chapter. To get a working
progenitor, we are going to define the routine setup_interrupts as an
2 The term routine is more general than the terms function or procedure. If you haven't
encountered programming languages that make a distinction between the two terms
(e.g. Pascal), then you can consider the term routine a synonym of the term function
in our discussion.
3 As you know from our previous examination, the value of cs will be changed by the
processor once a far jump is performed.
4 That’s exactly what we have done with bootloader, refer back to chapter 1 and you
can see that we have passed the argument -f bin to NASM.
5 A symbol is a term that means a function name or a variable name.
Entering Protected-Mode
6 In fact, cli disables only maskable interrupts, as mentioned before, but I use the
general term interrupts here for the sake of simplicity.
The label gdt is the GDT table of 539kernel, while the label gdtr is
the content of the special register GDTR that should be loaded by the
starter to make the processor use 539kernel's GDT. The structures of
both the GDT table and the GDTR register have been examined in detail in
chapter 2.
As you can see, the GDT table of 539kernel contains 5 entries 7 .
The first one is known as the null descriptor, which is a requisite in the x86
architecture: in any GDT table, the first entry should be the null entry
that contains zeros. The second and third entries represent the code
segment and data segment of the kernel, while the fourth and
fifth entries represent the code segment and data segment of
user-space applications. The properties of each entry are shown in the
following table and, as you can see, based on the base address, limit
and granularity of each segment, 539kernel employs the flat memory
model.
7 The values of the descriptors here are used from Basekernel project (https://github.
com/dthain/basekernel).
Because the values of GDT entries are set at the bit level, we need
to combine these bits into bytes or a larger unit, as in our
current code. By combining the bits into larger units, the result
becomes unreadable for a human; as you can see, a mere look at the
values of each entry in the above code cannot tell us directly what
the properties of each of these entries are. Due to that, I've written a simple
Python 3 script that generates the proper values as words by
taking the required GDT entries and their properties as JSON input.
The following is the code of the script, if you would like to generate a
different GDT table than the one which is presented here.
1 import json;
2
3 def generateGDTAsWords( gdtAsJSON, nasmFormat = False ):
4     gdt = json.loads( gdtAsJSON );
5     gdtAsWords = '';
6
7     for entry in gdt:
8         if nasmFormat:
9             gdtAsWords += entry[ 'name' ] + ': dw ';
10
11        if entry[ 'type' ] == 'null':
12            gdtAsWords += '0, 0, 0, 0\n';
13        elif entry[ 'type' ] == 'code' or entry[ 'type' ] == 'data':
14            baseAddress = int( entry[ 'base_address' ], 16 );
15            limit = int( entry[ 'limit' ], 16 );
16
17            baseAddressParts = [ baseAddress & 0xffff, ( baseAddress >> 16 ) & 0xff, ( baseAddress >> 24 ) & 0xff ]
18            limitParts = [ limit & 0xffff, ( limit >> 16 ) & 0xf ];
19
20            # ... #
21
22            typeFlag = ( 1 if entry[ 'type' ] == 'code' else 0 ) << 3;
23            accessed = 1 if entry[ 'accessed' ] else 0;
24            typeField = None;
25            dbFlag = None;
26
27            if entry[ 'type' ] == 'code':
28                conforming = ( 1 if entry[ 'conforming' ] else 0 ) << 2;
29                readEnabled = ( 1 if entry[ 'read_enabled' ] else 0 ) << 1;
30
31                typeField = typeFlag | conforming | readEnabled | accessed;
32
33                dbFlag = ( 1 if entry[ 'operation_size' ] == '32bit' else 0 ) << 2
34            else:
35                expands = ( 1 if entry[ 'expands' ] == 'down' else 0 ) << 2;
36                writeEnabled = ( 1 if entry[ 'write_enabled' ] else 0 ) << 1;
37
38                typeField = typeFlag | expands | writeEnabled | accessed;
39
40                dbFlag = ( 1 if entry[ 'upper_bound' ] == '4gb' else 0 ) << 2
41
42            # ... #
43
44            present = ( 1 if entry[ 'present' ] else 0 ) << 3
45            privilegeLevel = entry[ 'privilege_level' ] << 1
46            systemSegment = 1 if not entry[ 'system_segment' ] else 0
47
48            firstPropSet = present | privilegeLevel | systemSegment;
49
50            # ... #
51
52            granularity = ( 1 if entry[ 'granularity' ] == '4kb' else 0 ) << 3
53            longMode = ( 1 if entry[ '64bit' ] else 0 ) << 1
54
55            secondPropSet = granularity | dbFlag | longMode | 0;
56
57            words = [ limitParts[ 0 ], baseAddressParts[ 0 ],
58                      ( ( ( firstPropSet << 4 ) | typeField ) << 8 ) | baseAddressParts[ 1 ],
59                      ( ( ( baseAddressParts[ 2 ] << 4 ) | secondPropSet ) << 4 ) | limitParts[ 1 ] ];
60
Let's get back to our assembly code. The second label, gdtr, has the same structure as x86's GDTR register, since we want to load the content of this label into the register directly as is. As you can see, the first part of gdtr is the size of the GDT table. We know that we have 5 entries in our GDT table, and we already know from previous chapter 2 that each entry in the GDT table has a size of 8 bytes, which means that the total size of our GDT table is 5 * 8 = 40 bytes. The second part of gdtr is the full memory address of the label gdt. As you can see here, we didn't subtract the memory address of start from gdt's memory address, and that's because we need to load the full physical memory address of gdt into the GDTR register and not just its offset inside a given data segment. As we know, when the processor tries to reach the GDT table it doesn't consult any segment register 8; it assumes that the full physical memory address of GDT is stored in the register GDTR. To get the full memory address of a label in NASM, we just need to mention the name of that label.
Let’s now examine the routine enter_protected_mode which does
the real job of switching the operating mode of the processor from
real-mode to protected-mode. Its code is the following.
1 enter_protected_mode:
2 mov eax, cr0
3 or eax, 1
4 mov cr0, eax
5
6 ret
8 Otherwise it is going to be a paradox! To reach the GDT table you would need to reach the GDT table first!
Because we can't perform logic instructions such as or on a control register directly, we copy the value of CR0 to EAX in the first line. Note that we are using EAX here instead of AX because the size of CR0 is 32-bit. We need to keep the values of all other bits in CR0 the same, while the value of bit 0 should be changed to 1. To do that, we use the Boolean instruction or, which works on the bit level. What we do in the second line of the routine enter_protected_mode is a bitwise operation, that is, an operation on the level of bits: the value of eax, which at this point is the same as the value of cr0, will be ORred with the value 1. The binary representation of the value 1 in this instruction is 0000 0000 0000 0000 0000 0000 0000 0001, a 32-bit binary sequence of 31 leading zeros and a one at the end.
Now, what does the Boolean operator OR do? It takes two parameters, and each parameter has one of two possible values, 0 or 1 9. There are only four possible combinations of inputs and outputs in this case: 1 OR 1 = 1, 1 OR 0 = 1, 0 OR 1 = 1 and 0 OR 0 = 0. In other words, if one of the inputs is 1, then the output is 1. Also, we can notice that when one of the inputs is 0, the output will always be the same as the other input 10. By employing these two observations, we can keep the values of bits 1 to 31 of CR0 by ORring them with 0, and we can change the value of bit 0 to 1 by ORring its current value with 1, and that's exactly what we do in the second line of the routine. As I've said, the operation that we have just explained is known as a bitwise operation. Finally, we move the new value to CR0 in the last line, and after executing this line the operating mode of the processor will be protected-mode.
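To see this preservation property concretely, the following small Python sketch mimics the OR trick on a made-up 32-bit value (the value here is hypothetical; it is not the content of a real CR0):

```python
# A hypothetical 32-bit value standing in for CR0; bit 0 (the PE bit) is 0.
cr0 = 0b0110_0000_0000_0000_0000_0000_0001_0000

# ORring with 1 sets bit 0 and leaves bits 1 to 31 exactly as they were.
new_cr0 = cr0 | 1

print(bin(cr0))
print(bin(new_cr0))
```

Shifting both values right by one bit shows that everything above bit 0 is untouched; only the lowest bit differs.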
1 init_video_mode:
2 mov ah, 0h
3 mov al, 03h
4 int 10h
5
6 mov ah, 01h
7 mov cx, 2000h
8 int 10h
9
10 ret
This routine consists of two parts. The first part calls the service 0h of BIOS's interrupt 10h; this service is used to set the video mode, whose number is passed in the register al. As you can see here, we are asking BIOS to set the video mode to 03h, which is a text mode with 16 colors. Another example of a video mode is 13h, which is a graphics mode with 256 colors; that is, when using this video mode we can draw whatever we want on the screen, and it can be used to implement a graphical user interface (GUI). However, for our case now, we are going to set the video mode to 03h since we just need to print some text.
The second part of this routine uses the service 01h of BIOS's interrupt 10h; the purpose of this part is to disable the text cursor. Since the user of 539kernel will not be able to write text as input, as in a command line interface for example, we will not let the cursor be shown. The service 01h is used to set the type of the cursor, and the value 2000h in cx means disable the cursor.
1 bits 32
2 start_kernel:
3     mov eax, 10h
4     mov ds, eax
5     mov ss, eax
6
7     mov eax, 0h
8     mov es, eax
9     mov fs, eax
10    mov gs, eax
11
12    sti
13
14    call kernel_main
As you can see, the directive bits is used here to tell NASM that the following code should be assembled as 32-bit code, since this code will run in protected-mode and not in real-mode. The first and second parts of this routine set the correct segment selectors in the segment registers. In the first part, the segment selector 10h (16d) is set as the data segment and the stack segment, while the rest of the data segment registers get the segment selector 0h, which points to the null descriptor, meaning that they will not be used. Finally, the function kernel_main is called; this function, as we have mentioned earlier, will be the main C function of 539kernel.
The far jump which is required after switching to protected-mode is already performed by the line call 08h:start_kernel in the start routine, and you can see that we have used the segment selector 08h to do that. While it may be obvious why we have selected the value 08h for the far jump and 10h as the segment selector for the data segment, a clarification of the reason for choosing these values won't hurt.
To make sense of these two values you need to refer to the table that summarizes the entries of 539kernel's GDT in this chapter. As you can see from the table, the segment selector 11 of the kernel's code segment is 08, which means that any logical memory address that refers to the kernel's code should refer to the segment selector 08, which is the index and the byte offset of the kernel's code segment descriptor in GDT. In this case, the processor is going to fetch this descriptor from GDT and, based on the segment's starting memory address and the required offset, the linear memory address will be computed as we have explained previously in chapter 2. When we perform a far jump to the kernel code we use the segment selector 08h, which will be loaded by the processor into the register CS. The same happens for the data segment of the kernel; as you can see, its segment selector is 16d (10h), and that's the value that we have loaded into the data segment registers that we are going to use. As you can see from the code, before jumping to the kernel code, the interrupts are enabled by using the instruction sti; as you may recall, we disabled them when we started to load the GDT.
11 We use the relaxed definition of segment selector here that we have defined in the
previous chapter 2.
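The arithmetic behind the values 08h and 10h can be sketched in Python, in the same spirit as the helper scripts used earlier in this chapter. The helper name below is hypothetical; it simply packs the selector fields (descriptor index in bits 3-15, table indicator in bit 2, requested privilege level in bits 0-1):

```python
def segment_selector(index, table_indicator=0, rpl=0):
    # index: entry number in the descriptor table, counting from 0,
    #        where entry 0 is the null descriptor.
    # table_indicator: 0 for GDT, 1 for LDT.
    # rpl: requested privilege level, 0 to 3.
    return (index << 3) | (table_indicator << 2) | rpl

print(hex(segment_selector(1)))  # kernel's code segment -> 0x8
print(hex(segment_selector(2)))  # kernel's data segment -> 0x10
```

Because the low three bits are flags, a selector for GDT entry n at privilege level 0 is simply n * 8, which is why 08h and 10h double as byte offsets into the table.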
Video Graphics Array (VGA) is a graphics standard that was introduced with IBM PS/2 in 1987, and because our modern computers are compatible with the old IBM PC, we can still use this standard. VGA is easy to use for our purpose. At any point in time the screen can be in a specific video mode, and each video mode has its own properties, such as the resolution and the number of available colors.
Basically, we can divide the available video modes into two groups. The first one consists of the modes that only support text; that is, when the screen is in one of these modes, the only output on the screen will be text. We call this group text mode. The second group consists of the modes that can be used to draw pixels on the screen, and we call this group graphics mode. We know that everything on a computer's screen is drawn by using pixels, including text and even the components of a graphical user interface (GUI), which are called widgets by many GUI libraries (GTK as an example). Usually, some basic low-level graphics library is used by a GUI toolkit to draw the shapes of these widgets, and this low-level library provides functions to draw some primitive shapes pixel by pixel; for instance, a function to draw a line may be provided, another function to draw a rectangle, and so on. This basic library can be used by the GUI toolkit to draw more advanced shapes. A simple example is the button widget, which is basically drawn on the screen as a rectangle; the GUI toolkit should maintain some basic properties that are associated with this rectangle to convert it from a soulless shape on the screen into a button that can be clicked, fires an event and has some label upon it.
Whether the screen is in a text or graphics mode, to print some character on the screen or to draw some pixels on it, the entities (pixels or characters) that you would like to show on the screen should be written to video memory, which is just a part of the main memory. Video memory has a known, fixed starting memory address; for example, in text mode, the starting memory address of the video memory is b8000h, as we will see in a moment. Note that this memory address is a physical memory address, neither logical nor linear. Writing an ASCII code to this memory address, and to the memory addresses after it, is going to cause the screen to display the character that this ASCII code represents.
vga text mode When the screen is in the text mode 03h, the character that we would like to print should be represented (encoded) in two bytes that are stored contiguously in video memory. The first byte is the ASCII code of the character, while the second byte contains the information about the background and foreground colors that will be used to print this character.
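As a sketch of this encoding, the following Python snippet computes, for a given row and column, the byte offset inside text-mode video memory and the two bytes of a cell. The helper name and the default attribute value 0x0F (white on black) are illustrative assumptions, not part of any API:

```python
def text_mode_cell(row, col, char, attr=0x0F):
    # The screen is 80 cells wide and each cell takes 2 bytes,
    # so a row occupies 160 bytes of video memory.
    offset = (row * 80 + col) * 2
    return offset, ord(char), attr

print(text_mode_cell(0, 0, 'A'))  # first cell of the screen
print(text_mode_cell(1, 0, 'B'))  # first cell of the second line
```

The character byte goes at the returned offset and the attribute byte at the offset right after it, which matches the even/odd layout described in the footnotes.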
Before getting started in implementing the print functions of 539kernel, let's take a simple example of how to print a character, A for example, on the screen by using the video memory. From the starter's code you know that the function kernel_main is the entry point of the main kernel code.
1 volatile unsigned char *video = (volatile unsigned char *) 0xB8000;
2
3 void kernel_main()
4 {
5     video[ 0 ] = 'A';
6
7     while( 1 );
8 }
12 A monochrome text mode is also available and its video memory starts from b0000h.
13 Thank God!
14 As you know, in the even locations of the video memory the characters are stored, while in the odd locations the color information of those characters is stored.
The code of println is quite simple. The width of the screen in 03h text mode is 80, which means 80 characters can be printed on a given line, and each line on the screen occupies 160 bytes (80 * 2) of video memory. Line 0, which is at the top of the screen, is the first line of the screen; in other words, line numbering starts from 0. To obtain the first position in any line, the line number should be multiplied by 80: the first position in line 0 is 0 * 80 = 0, and the first position in line 1 (which is the second line) is 1 * 80 = 80, which means the positions from 0 to 79 belong to the first line, the positions 80 to 159 belong to the second line, and so on. The function println uses these facts to change the position of the next character that will be printed.
16 Please note that the way of writing this code and any other code in 539kernel, as mentioned in the introduction of this book, focuses on the simplicity and readability of the code instead of efficiency. Therefore, there are certainly better ways of writing this code and any other code in terms of performance or space efficiency.
Let's assume the value of the parameter is 539. It is a fact that 539 % 10 = 9 17, which is the rightmost digit of 539; also, it is a fact that 539 / 10 = 53.9, and if we take the integer part of this result we get 53. So, by using these simple arithmetic operations, we managed to get a digit from the number and remove this digit from the number. This algorithm is going to split the digits in reverse order, and due to that I have used recursion as a simple solution to print the number in the correct order. At its basis, however, printi depends on the first function print to print one digit on the screen, and before that this digit is converted to a character by using the array digitToStr.
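The same splitting idea can be sketched in Python; int_to_string here is a hypothetical stand-in for printi that returns the digits as a string instead of printing them:

```python
def int_to_string(n):
    # n % 10 yields the rightmost digit and n // 10 removes it;
    # recursing first emits the digits in the correct left-to-right order.
    if n < 10:
        return str(n)
    return int_to_string(n // 10) + str(n % 10)

print(int_to_string(539))
```

Tracing the calls for 539: the recursion descends through 53 and 5, then emits 5, 3 and 9 on the way back up, which is exactly why the reverse-order problem disappears.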
that has a color. This small dot, when gathered with many others of different colors, creates all the graphics that we see on computer monitors.
Given that the provided resolution is 320x200 and that the graphical entity is a pixel, we should know that we are going to have 200 lines (the height, or y) on the screen, and each line can have up to 320 (the width, or x) of our graphical entity, the pixel.
The structure of video memory is even simpler in graphics mode: each byte represents a pixel at a specific position on the screen, and a numeric value which represents a color is stored in a specific byte to be shown on the screen. By using this simple mechanism, together with a review of the basics of geometry, you can draw the primitive shapes on the screen (e.g. lines, rectangles and circles), and by using these basic shapes you can draw even more complex shapes. If you are interested in the topic of drawing shapes by using pixels, you can read about the basics of computer graphics and geometry; I recommend a tutorial named "256-Color VGA Programming in C" by David Brackeen as a good starter that combines the basics of both. While we are not going any further with this topic, since it is out of our scope 18, this subsection is closed with the following example, which is intended to give you a feel of how to draw pixels on the screen. It is going to draw blue pixels at all available positions on the screen, which is going to be perceived as a blue background.
1 volatile unsigned char *video = (volatile unsigned char *) 0xA0000;
2
3
4 void kernel_main()
5 {
6     for ( int currPixelPos = 0; currPixelPos < 320 * 200; currPixelPos++ )
7         video[ currPixelPos ] = 9;
8
9     while( 1 );
10 }
18 Sorry for that! I, myself, think this is an interesting topic, but this book is about
operating systems kernels!
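The loop above touches the pixels in row-major order; a short Python sketch of the offset arithmetic (assuming mode 13h's 320x200 resolution, with a hypothetical helper name) may make the layout clearer:

```python
WIDTH, HEIGHT = 320, 200

def pixel_offset(x, y):
    # One byte per pixel, rows stored one after another,
    # so row y starts at byte y * 320.
    return y * WIDTH + x

print(pixel_offset(0, 0))      # first pixel of the screen
print(pixel_offset(0, 1))      # first pixel of the second row
print(pixel_offset(319, 199))  # the very last pixel
```

Writing a color value at offset pixel_offset(x, y) from a0000h would color the pixel at column x of row y.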
1 void kernel_main()
2 {
3     print( "Welcome to 539kernel!" );
4     println();
5     print( "We are now in Protected-mode" );
6     println();
7     printi( 539 );
8     println();
9
10    while( 1 );
11 }
Both the master (IRQ0 to IRQ7, except IRQ2) and the slave PIC (IRQ8 to IRQ15) are connected to external devices. There is a standard which tells us the device type that each IRQ is dedicated to. For example, IRQ0 is the interrupt which is received from a device known as the system timer, a device that sends an interrupt in each unit of time, which makes it extremely useful for a multitasking environment, as we shall see later when we start discussing process management. The following table shows the use of each IRQ 20.
IRQ Description
0 System Timer
1 Keyboard (PS/2 port)
2 Slave PIC
3 Serial Port 2 (COM)
4 Serial Port 1 (COM)
5 Parallel Port 3 or Sound Card
6 Floppy Disk Controller
7 Parallel Port 1
8 Real-time Clock
9 ACPI
10 Available
11 Available
12 Mouse (PS/2 port)
13 Coprocessor
14 Primary ATA
15 Secondary ATA
After receiving an IRQ from a device, the PIC sends this request to the processor. In this stage each IRQ number is mapped (or translated, if you prefer) to an interrupt number for the processor. For example, IRQ0 will be sent to the processor as interrupt number 8, IRQ1 will be mapped to interrupt number 9, and so on until IRQ7, which will be mapped to interrupt number 15d (0Fh), while IRQ8 till IRQ15 will be mapped to the interrupt numbers from 112d (70h) to 119d (77h).
In real-mode, this mapping is fine, but in protected-mode it is going to cause conflicts between software and hardware interrupts; that is, one interrupt number will be used by both software and hardware, which may cause some difficulties later in distinguishing the source of this interrupt: is it from software or hardware? For example, in protected-mode, interrupt number 8, which is used for the system timer interrupt by the PIC, is also used by the processor when a software error known as double fault occurs. The good thing is that the PIC is programmable, which means that we can send commands to the PIC and tell it to change the default mapping (from IRQs to the processor's interrupt numbers) to another mapping of our choice.
20 Source: https://en.wikipedia.org/wiki/Interrupt_request_(PC_architecture)
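Once the remapping described below is in place, with the offsets 32 and 40 that we are going to use, the translation from an IRQ number to a processor interrupt number can be sketched as follows (the function name is hypothetical):

```python
def irq_to_interrupt(irq, master_offset=32, slave_offset=40):
    # IRQ0-7 come from the master PIC, IRQ8-15 from the slave;
    # each PIC adds its own starting offset to its local IRQ number.
    if irq < 8:
        return master_offset + irq
    return slave_offset + (irq - 8)

print(irq_to_interrupt(0))   # system timer -> interrupt 32
print(irq_to_interrupt(1))   # keyboard -> interrupt 33
print(irq_to_interrupt(8))   # real-time clock -> interrupt 40
```

With these offsets, the hardware interrupts occupy 32 to 47, safely above the processor-reserved range 0 to 31.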
master PIC and a1h for slave PIC), the values of these parameters are represented by numbers, as we shall see in a moment.
The first parameter that should be provided to the initialization command is the new starting offset of the IRQs. For example, if the value of this parameter is 32d for the master PIC, that means IRQ0 will be sent to the processor as interrupt number 32d instead of 8d (as in the default mapping), IRQ1 will be sent to the processor as interrupt number 33d, and so on. The second parameter tells the PIC (that we are initializing) on which of its slots the other PIC is connected. The third parameter tells the PIC which mode we would like it to run in; there are multiple modes for PIC devices, but the mode that we care about and need to use is x86 mode. The fourth parameter tells the PIC which IRQs to enable and which to disable. Now, let's see the code of the remap_pic routine, which implements what we have just described by passing the correct parameters to the initialization command of both the master and slave PICs.
1 remap_pic:
2 mov al, 11h
3
4 send_init_cmd_to_pic_master:
5 out 0x20, al
6
7 send_init_cmd_to_pic_slave:
8 out 0xa0, al
9
10 ; ... ;
11
12 make_irq_starts_from_intr_32_in_pic_master:
13 mov al, 32d
14 out 0x21, al
15
16 make_irq_starts_from_intr_40_in_pic_slave:
17 mov al, 40d
18 out 0xa1, al
19
20 ; ... ;
21
22 tell_pic_master_where_pic_slave_is_connected:
23 mov al, 04h
24 out 0x21, al
25
26 tell_pic_slave_where_pic_master_is_connected:
27 mov al, 02h
28 out 0xa1, al
29
30 ; ... ;
Figure 24: Master PIC’s Data Format to Set The Place of Slave PIC
31
32 mov al, 01h
33
34 tell_pic_master_the_arch_is_x86:
35 out 0x21, al
36
37 tell_pic_slave_the_arch_is_x86:
38 out 0xa1, al
39
40 ; ... ;
41
42 mov al, 0h
43
44 make_pic_master_enables_all_irqs:
45 out 0x21, al
46
47 make_pic_slave_enables_all_irqs:
48 out 0xa1, al
49
50 ; ... ;
51
52 ret
Note that the labels here are optional; I've added them for the sake of readability, and you can get rid of them if you want. As you can see, the command and data ports of both the master and slave PICs are used to send the initialization command and its parameters. The instruction out can only take the accumulator register (al, ax or eax, depending on the size of the value) as its second operand, and due to that, the number that represents the command or the data that we would like to send is always set to al first, which is then used as the second operand of out. Also, it should be obvious that the first operand of out is the port number, while the second operand is the value that we would like to send.
You may ask why the value 4 is used under the label tell_pic_master_where_pic_slave_is_connected 22 instead of 2, since we said earlier that the slave PIC is connected to the master PIC through IRQ2. The reason for that is the format of the data that should be sent to the master PIC in order to tell it the place where the slave PIC is attached. This format is shown in figure 24, which shows that the size of the data is 1 byte and each IRQ is represented
22 I just realized that this is a really long name! Sorry, sometimes I become a readability freak!
Figure 25: Slave PIC’s Data Format to Set The Place of Master PIC
by one bit; that is, each bit is used as a flag to indicate which IRQ we would like to use.
In our case, the slave PIC is connected to the master PIC through IRQ2, which is represented by bit 2, which means the value of this bit should be 1 and all other bits should be 0; this gives us the binary sequence 0000 0100, which is 4d. Assume that the slave PIC were connected to the master PIC through IRQ7; then the binary sequence would be 1000 0000, which is 128d. For the slave PIC, the format is shown in figure 25, and as you can see, only bits 0 to 2 can be used while the others should be 0. By using these three bits we can represent the numbers 0 through 7, so the normal way of representing numbers can be used here, and for that the value 2 is passed to the slave PIC to tell it that it is connected to the master PIC through IRQ2, under the label tell_pic_slave_where_pic_master_is_connected.
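These two encodings can be double-checked with a couple of lines of Python; master_cascade_value is a hypothetical helper name, not part of any PIC interface:

```python
def master_cascade_value(irq):
    # The master PIC expects a bitmask: bit n set means
    # the slave PIC is attached through IRQn.
    return 1 << irq

print(master_cascade_value(2))  # IRQ2 -> 0000 0100 -> 4
print(master_cascade_value(7))  # IRQ7 -> 1000 0000 -> 128

# The slave PIC, in contrast, takes the plain IRQ number (0 to 7)
# in its low three bits, so in our case it simply receives the value 2.
```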
Right now, everything is ready to write the code that loads the IDT and the ISRs. The first one is quite simple and similar to the code that loads the GDT table; the following is the code of the load_idt routine.
1 load_idt:
2 lidt [idtr - start]
3 ret
As you can see, nothing is new here. The instruction lidt is used to load the content of the label idtr into the register IDTR, in the same way that we have already used in the previous routine load_gdt. Now, for the sake of organization, I'm going to dedicate a new file to the IDT- and ISR-related code, and this file will be called idt.asm. At the end of starter.asm the following line should be added: %include "idt.asm", exactly as we did with gdt.asm.
We need to define at least 49 ISRs, since the interrupts from 0 to 31 are used by the processor to indicate that some error happened in the system. In fact, interrupts 22 to 31 are reserved and have no use for us, but we need to fill their entries in the IDT table to be able to use the interrupts starting from 32, while the interrupts 32 to 48 are now used by the PIC, after the remapping, for hardware interrupts (IRQs). Hence, we need to fill the entries of all of these interrupts in the IDT to make sure that our kernel runs correctly. For now, we are going to use the same skeleton for all the ISRs that we are going to define. Let's start with isr_0, which is the name of the routine that handles interrupt 0. Starting from here, the code that is presented should be in the file idt.asm unless mentioned otherwise explicitly.
1 isr_0:
2 cli
3 push 0
4 jmp isr_basic
The code here is quite simple. We first make sure that interrupts are disabled by using the instruction cli; while we are handling an interrupt, we don't want another interrupt to occur. It will become more obvious why this is important when we start to implement process management in 539kernel.
After disabling the interrupts, we push to the stack the value 0, which is the number of the current interrupt. This pushed value can be used later, as a parameter, by a C function that we are going to call 23; in this way, we can have just one C function that works as an interrupt handler and receives a parameter that holds the number of the interrupt which should be handled. After pushing the interrupt number, the routine jumps to the label isr_basic, which contains the basic code of all the ISRs that we are going to define.
Now, for all the other ISRs that are related to the processor, that is, from interrupt 1 to 31, we are going to use the exact same code. Only two things should be changed: the name of the routine, which should indicate the interrupt number, for example isr_1 for interrupt 1, isr_2 for 2 and so on, and the pushed value. I'm not going to show all 31 ISRs here since they need a lot of space, but you can always refer to the 539kernel source code if the matter isn't clear to you; the following is an example of ISRs 1, 2 and 3. The label isr_basic will be defined later on.
1 isr_1:
2 cli
3 push 1
4 jmp isr_basic
5
6 isr_2:
7 cli
8 push 2
9 jmp isr_basic
10
11 isr_3:
12 cli
13 push 3
14 jmp isr_basic
The second set of ISRs is the one that handles the IRQs and the
interrupt numbers here, as we mentioned earlier, starts from 32 to 48.
The following is an example of one of them which is isr_32.
23 That’s possible due to the calling convention as we have discussed earlier in the
previous chapter 2.
1 isr_32:
2 cli
3 push 32
4 jmp irq_basic
It's exactly the same code as the ISRs before 32; the only difference is the label that the routine jumps to. In the current case it is irq_basic, which is the basic code for all the interrupts that handle the IRQs; hence, isr_33 till isr_48 have the same code as isr_32 but with a different pushed value. The following is the code of isr_basic.
1 isr_basic:
2 call interrupt_handler
3
4 pop eax
5
6 sti
7 iret
1 irq_basic:
2     call interrupt_handler
3
4     mov al, 0x20
5     out 0x20, al
6
7     cmp byte [esp], 40d
8     jnge irq_basic_end
9
10    mov al, 0x20
11    out 0xa0, al
12
13    irq_basic_end:
14    pop eax
15
16    sti
17    iret
24 The values of the properties here are used from Basekernel project (https://github.
com/dthain/basekernel).
As in the GDT table, I've written a Python script that lets you manipulate the properties of the descriptors by getting human-readable input; the code of the script is the following.
1 import json;
2
3 def generateIDTAsWords( idtAsJSON, nasmFormat = False ):
4     idt = json.loads( idtAsJSON );
5     idtAsWords = '';
6
7     for entry in idt:
8         if nasmFormat:
9             idtAsWords += 'dw ';
10
11        # ... #
12
13        present = ( 1 if entry[ 'present' ] else 0 ) << 7;
14        dpl = entry[ 'dpl' ] << 5;
15        size = ( 1 if entry[ 'gate_descriptor_size' ] == '32-bit' else 0 ) << 3;
16        gateType = ( 0 if entry[ 'interrupt_gate' ] else 1 );
17
18        byteFive = present | dpl | ( 0 << 4 ) | size | ( 1 << 2 ) | ( 1 << 1 ) | gateType;
19
20        wordThree = '0x' + format( byteFive, 'x' ).zfill( 2 ) + '00';
21
22        # ... #
23
24        idtAsWords += entry[ 'isr_routine_name' ] + ', ' + str( entry[ 'isr_segment_selector' ] ) + ', ' + wordThree + ', 0x0000' + '\n';
25
26    return idtAsWords;
After defining the entries of the IDT, we can define the label idtr, which holds the value that we will load into the special register IDTR.
1 idtr:
2 idt_size_in_bytes : dw idtr - idt
3 idt_base_address : dd idt
It should be easy for you now to see why idtr - idt gives us the size of the IDT in bytes. Also, you should know that if the label idtr is not right below the label idt, this will not work. I've used this method instead of hardcoding the size of the table (8 * 49 = 392) in the code to make sure that I don't forget to change the size field when I add a new entry to the IDT; you are free to hardcode the size, as we did in gdtr, if you like. Finally, the C function interrupt_handler can be defined at the end of main.c as follows.
defined in the end of main.c as following.
1 void interrupt_handler( int interrupt_number )
2 {
3 println();
4 print( "Interrupt Received " );
5 printi( interrupt_number );
6 }
available for it and the main function is not necessary for it. Both flags -fno-asynchronous-unwind-tables and -fno-pie are used to eliminate some extra code that is generated by GCC to handle some situations that are related to user-space code. This is a quick review of the functionality of the flags, and you can always refer to the official documentation of GCC for more details.
furthermore, there are many good ideas that have been presented by someone else but need to be realized in real-world systems; a lot of the aforementioned can be found in scientific papers 25.
The kernelist doesn't necessarily innovate new solutions by himself, but he can use modern solutions that have been proposed by other kernelists to design and implement a kernel with modern, innovative and useful ideas, instead of reimplementing the traditional solutions that have been with us for 60 years over and over again.
After reading this book to learn about creating a kernel, if you would like to continue the journey, I encourage you to consider the role of the kernelist. Using what you have learned to solve real-world problems is a good idea, and the world needs this kind of orientation. Although this is a book of a traditionalist more than a kernelist, I've dedicated chapter 7 to those who would like to, at least, take a look at being a kernelist.
25 In fact, I've started a project that generalizes this thought, which I called ResearchCoders. If you are interested in finding and implementing new ideas that solve real-world problems, you may like to check the website of the project (https://researchcoders.dev)
4
CHAPTER 4: PROCESS MANAGEMENT
4.1 introduction
You may recall from the previous chapter 3 the system timer which
emits an interrupt every unit of time, this interrupt can be used to
implement time sharing in order to switch between the processes of
the system, of course the kernel needs an algorithm to choose which
3 You may ask who would use cooperative multitasking and put this much trust in the code of the software! In fact, the versions of Windows before 95 used this style of multitasking, and so did Classic Mac OS. Why? You may ask; I don't know exactly, but what I know for sure is that humanity is in a learning process!
4 In x86, the term task is used instead of process.
5 In chapter 2 we have seen that there are two types of segments in x86: application segments, such as the code, data and stack segments, and system segments, which are the LDT and the TSS.
to switch between stacks. This is needed only when the system runs user-space code, that is, privilege level 3 code.
The structure of the TSS descriptor in the GDT table is the same as the segment descriptor that we have already explained in chapter 2. The only difference is in the type field, which has the static binary value 010B1 in the TSS descriptor, where B in this value is known as the B flag, or busy flag, which should be 1 when the process that this TSS descriptor represents is active and 0 when it is inactive.
to build and run the kernel incrementally after each change on the progenitor, you can refer to that Makefile and add only the needed instructions to build the not-yet-ready version T that you are building. For example, as you will see in a moment, new files screen.c and screen.h will be added in version T as a first increment; to run the kernel after adding them, you need to add the commands that compile this new file and link it with the previous files, and you can find these commands in the last version of the Makefile, as we have said before.
Our first step in this implementation is to set up a valid task-state segment; while 539kernel implements software multitasking, a valid TSS is still needed. As we have said earlier, it will not be needed at our current stage, but we will set it up anyway; its need will show up when the kernel lets user-space software run. After that, basic data structures for the process table and the process control block are implemented. These data structures and their usage will be as simple as possible, since we don't have any means of dynamic memory allocation yet! After that, the scheduler can be implemented, and the system timer's interrupt can be used to enforce preemptive multitasking by calling the scheduler every period of time. The scheduler uses the round-robin algorithm to choose the next process that will use the CPU time, and the context switch is performed after that. Finally, we are going to create a number of processes to make sure that everything works fine.
Before getting started on the plan that has just been described, we need to organize our code a little bit, since it's going to get larger starting from this point. Two new files should be created: screen.c and its header file screen.h. We move the printing functions that we have defined in the progenitor, and their related global variables, to screen.c, and their prototypes should be in screen.h, so we can include the latter in other C files when we need to use the printing functions. The following is the content of screen.h.
1 volatile unsigned char *video;
2
3 int nextTextPos;
4 int currLine;
5
6 void screen_init();
7 void print( char * );
8 void println();
9 void printi( int );
1 void screen_init()
2 {
3     video = (volatile unsigned char *) 0xB8000;
4     nextTextPos = 0;
5     currLine = 0;
6 }
Nothing new here, just some organizing. Now, the prototypes and
implementations of the functions print, println and printi should
be removed from main.c. Furthermore, the global variables video,
nextTextPos and currLine should also be removed from main.c. Then,
the file screen.h should be included in main.c, and at the beginning of
the function kernel_main the function screen_init should be called.
Setting up the TSS is quite simple. We know that the TSS itself is a
region in memory (since it is a segment), so let's allocate this region
of memory. The following should be added at the end of starter.asm,
even after including the files gdt.asm and idt.asm. A label
named tss is defined, and inside this region of memory, whose
address is represented by the label tss, we put a doubleword of 0;
recall that a word is 2 bytes while a doubleword is 4 bytes. So, our
TSS contains nothing but a bunch of zeros.
tss:
	dd 0
As you may recall, each TSS needs an entry in the GDT table. After
defining this entry, the TSS's segment selector can be loaded into the
task register. The processor is then going to think that there is one
process (one TSS entry in the GDT) in the environment and that it is the
current process (the segment selector of this TSS is loaded into the task
register). Now, let's define the TSS entry in our GDT table. In the file
gdt.asm we add the following entry at the end of the label gdt. You should
not forget to modify the size of the GDT under the label gdt_size_in_bytes
under gdtr, since a sixth entry has been added to the table.
tss_descriptor: dw tss + 3, tss, 0x8900, 0x0000

...
	mov ax, 40d
	ltr ax

	ret
As you can see, it's quite simple. The index of the TSS descriptor in the
GDT is 40 = (6 entries * 8 bytes per entry) - 8 (since indexing starts from 0).
So, the value 40 is moved to the register AX, which will be used by the
instruction ltr to load the value 40 into the task register.
for them since this type of allocation doesn't require this information
at compile time.
The processes table is an example of a data structure whose size we
can't know at compile time; this information can only be decided while
the kernel is running. Take your current operating system as an example:
you can run any number of processes (up to some limit, of course) and
all of them will have an entry in the processes table 7, maybe your
system is running just two processes right now, but you can run more
and more without any need to recompile the kernel in order to increase
the size of the processes table.
That's possible due to the use of dynamic memory allocation: when a
new process is created during run-time, a space for its entry is
dynamically allocated from the run-time heap through the memory
allocator. When this process finishes its job (e.g. the user closes
the application), the memory region that is used to store its entry
in the processes table is marked as free space, so it can be used to
store something else in the future, for example, the entry of another
process.
In our current situation, we don't have any means of dynamic
memory allocation in 539kernel; this topic will be covered when
we start discussing memory management. Due to that, our current
implementations of the processes table and process control block are
going to use static memory allocation through global variables. That,
of course, restricts us from creating a new process on-the-fly, that is,
at run-time, but our current goal is to implement basic multitasking
that will be extended later. To start our implementation, we need to
create two new files: process.c and its header file process.h. Any
function or data structure that is related to processes should belong to
these files.
7 We already know that keeping an entry for a process in the processes table is important
for scheduling and other process-related stuff.
4.5 process management in 539kernel 129
Processes Table
processes; feel free to increase it, but don't forget that it will still be a static
size.
process_t *processes[ 15 ];
Now, we are ready to write the function that creates a new process
in 539kernel. Before getting started on implementing the required
functions, we need to define their prototypes and some auxiliary
global variables in process.h.
int processes_count, curr_pid;

void process_init();
void process_create( int *, process_t * );

And the following is their implementation in process.c.

void process_init()
{
	processes_count = 0;
	curr_pid = 0;
}

void process_create( int *base_address, process_t *process )
{
	process->pid = curr_pid++;

	process->context.eax = 0;
	process->context.ecx = 0;
	process->context.edx = 0;
	process->context.ebx = 0;
	process->context.esp = 0;
	process->context.ebp = 0;
	process->context.esi = 0;
	process->context.edi = 0;
	process->context.eip = base_address;

	process->state = READY;
	process->base_address = base_address;

	processes[ process->pid ] = process;

	processes_count++;
}
processor from the kernel to the next process, which is the job of the
function run_next_process.
The function scheduler_init sets the initial values of the global
variables; same as process_init, it will be called when the kernel
starts.
The core function is scheduler, which represents 539kernel's scheduler;
this function will be called whenever the system timer emits its
interrupt. It chooses the next process to run with the help of the
function get_next_process, and performs context switching by copying
the context of the current process from the registers to memory
and copying the context of the next process from memory to the
registers. Finally, it returns, and run_next_process is called in order to
jump to the next process' code. In scheduler.c, the file scheduler.h
should be included. The following is the implementation of scheduler_init.
void scheduler_init()
{
	next_sch_pid = 0;
	curr_sch_pid = 0;
}
It's a simple function that initializes both global variables to the
PID 0, so the first process that will be scheduled by 539kernel is the
process with PID 0.
Next is the definition of get_next_process, which implements the
round-robin algorithm; it returns the PCB of the process that should
run right now and prepares the value of next_sch_pid for the next
context switch according to the round-robin policy.
process_t *get_next_process()
{
	process_t *next_process = processes[ next_sch_pid ];

	curr_sch_pid = next_sch_pid;
	next_sch_pid++;
	next_sch_pid = next_sch_pid % processes_count;

	return next_process;
}
void scheduler( int eip, int edi, int esi, int ebp, int esp, int ebx, int edx, int ecx, int eax )
{
	process_t *curr_process;

	// ... //

	// PART 1

	curr_process = processes[ curr_sch_pid ];
	next_process = get_next_process();

	// ... //

	// PART 2

	if ( curr_process->state == RUNNING )
	{
		curr_process->context.eax = eax;
		curr_process->context.ecx = ecx;
		curr_process->context.edx = edx;
		curr_process->context.ebx = ebx;
		curr_process->context.esp = esp;
		curr_process->context.ebp = ebp;
		curr_process->context.esi = esi;
		curr_process->context.edi = edi;
		curr_process->context.eip = eip;
	}

	curr_process->state = READY;

	// ... //

	// PART 3

	asm( "	mov %0, %%eax; \
		mov %1, %%ecx; \
		mov %2, %%edx; \
		mov %3, %%ebx; \
		mov %4, %%esi; \
		mov %5, %%edi;"
		: : "r" ( next_process->context.eax ), "r" ( next_process->context.ecx ), "r" ( next_process->context.edx ), "r" ( next_process->context.ebx ), "r" ( next_process->context.esi ), "r" ( next_process->context.edi ) );

	next_process->state = RUNNING;
}
I've commented the code to divide it into three parts for the sake of
simplicity in our discussion. The first part is simple: the variable
curr_process is assigned a reference to the current process, which
has been suspended due to the system timer interrupt; this will come in
handy in part 2 of the scheduler's code. We get the reference to the
current process before calling the function get_next_process because,
as you know, this function changes the variable of the current process'
PID (curr_sch_pid) from the suspended one to the next one 9. After
that, the function get_next_process is called to obtain the PCB of the
process that will run this time, that is, the next process.
As you can see, scheduler receives nine parameters, each one named
after one of the processor's registers. We can tell from these parameters
that the function scheduler receives the context of the current process
as it was just before its suspension due to the system timer's interrupt.
For example, assume that process 0 was running; after its quantum
finished, the scheduler was called and decided that process 1 should run
next. In this case, the parameters that have been passed to the scheduler
represent the context of process 0, that is, the value of the parameter
EAX will be the same as the value of the register EAX that process 0 set
at some point before being suspended. How did we get these values and
pass them as parameters to scheduler? This will be discussed later.
In part 2 of scheduler's code, the context of the suspended process,
which curr_process now points to, is copied from the processor into the
process' own PCB by using the passed parameters. Storing the current
process' context in its PCB is simple, as you can see: we just store
the passed values in the fields of the current process structure. These
values will be used later when we decide to run the same process again.
Also, we need to make sure that the current process is really running,
by checking its state, before copying the context from the processor
to the PCB. At the end, the state of the current process is switched
from RUNNING to READY.
Part 3 performs the opposite of part 2: it uses the PCB of the next
process to retrieve the context it had before its last suspension, then
this context is copied to the registers of the processor. Of course,
not all of it is copied to the processor; for example, the
program counter EIP cannot be written to directly, and we will see later
how to deal with it. Also, the registers that are related to the stack,
ESP and EBP, were skipped on purpose. As a last step, the state of the
next process is changed from READY to RUNNING.
"So, how is the scheduler called?" you may ask. The answer to
this question has been mentioned multiple times before: when the
system timer decides that it is time to interrupt the processor,
interrupt 32 is fired, and this is the point at which the scheduler
is called. So, every period of time, the scheduler will be called to
schedule another process and give it CPU time.
In this part, we are going to write a special interrupt handler for
interrupt 32 that calls 539kernel's scheduler. First, we need to add
the following lines at the beginning of starter.asm 10, after extern
interrupt_handler.
extern scheduler
extern run_next_process
As you may have guessed, the purpose of these two lines is to make the
functions scheduler and run_next_process of scheduler.c usable
by the assembly code of starter.asm. Now, we can get started
implementing the code of interrupt 32's handler, which calls the scheduler
with the needed parameters. In the file idt.asm the old code of the
routine isr_32 should be changed to the following.
isr_32:
	; Part 1

	cli ; Step 1

	pusha ; Step 2

	; Step 3
	mov eax, [esp + 32]
	push eax

	call scheduler ; Step 4

	; ... ;

	; Part 2

	; Step 5
	mov al, 0x20
	out 0x20, al

	; Step 6
	add esp, 40d
	push run_next_process

	iret ; Step 7

10 I'm about to regret that I called this part of the kernel the starter! Obviously, it's more than that!
There are two major parts in this code: the first one is the code that
will be executed before calling the scheduler, that is, the part before
the line call scheduler, and the second one is the code that will be
executed after the scheduler returns.
The first step of part one disables interrupts via the instruction
cli. When we are handling an interrupt, it is better not to receive any
other interrupt; if we don't disable interrupts here, then while handling
a system timer interrupt, another system timer interrupt could occur
even before the scheduler is called the first time, and you can imagine
the mess that could result.
Before explaining steps two and three of this routine, we need
to answer a vital question: when this interrupt handler is called, what
will the context of the processor be? The answer is: the context of the
suspended process, that is, the process that was running before the
system timer emitted the interrupt. That means all the values that were
stored by the suspended process in the general purpose registers will
still be there when isr_32 starts executing, and we can be sure that the
processor did not change any of these values between suspending the
process and calling the interrupt's handler. What gives us this
assurance is the fact that we have defined all ISR gate descriptors as
interrupt gates in the IDT table; if we had defined them as task gates,
the context of the suspended process would not be available directly in
the processor's registers. Defining an ISR descriptor as an interrupt gate
makes the processor call this ISR as a normal routine, following
the calling convention. It's important to remember that when we
Figure 27: The Stack After Executing the Instruction pusha
11 If a new stack frame is created once isr_32 starts, then EBP can also be used as a base
address, but with a different offset than 4, of course, as we have explained earlier in
chapter 2. I didn't initialize a new stack frame here, or in the other places, in order to
keep the code shorter.
void processA();
void processB();
void processC();
void processD();

...

	while ( 1 )
		asm( "mov $5393, %eax" );
}
Each process starts by printing its name; then an infinite loop starts
which keeps setting a specific value in the register EAX. To check
whether multitasking is working fine, we can add the following lines
at the beginning of the function scheduler in scheduler.c.
print( " EAX = " );
printi( eax );
Each time the scheduler starts, it prints the value of EAX of the
suspended process. When we run the kernel, each process is going
to start by printing its name, and before a process starts executing, the
value of EAX of the previous process will be shown. Therefore, you will
see texts such as EAX = 5390, EAX = 5391, EAX = 5392
and EAX = 5393 keep showing on the screen, which indicates that a
process (A, for example, in case EAX = 5390 is shown) was running and
has now been suspended to run the next one, and so on.
ASM = nasm
CC = gcc
BOOTSTRAP_FILE = bootstrap.asm
INIT_KERNEL_FILES = starter.asm
KERNEL_FILES = main.c
KERNEL_FLAGS = -Wall -m32 -c -ffreestanding -fno-asynchronous-unwind-tables -fno-pie
KERNEL_OBJECT = -o kernel.elf

build: $(BOOTSTRAP_FILE) $(KERNEL_FILES)
	$(ASM) -f bin $(BOOTSTRAP_FILE) -o bootstrap.o
	$(ASM) -f elf32 $(INIT_KERNEL_FILES) -o starter.o
	$(CC) $(KERNEL_FLAGS) $(KERNEL_FILES) $(KERNEL_OBJECT)
	$(CC) $(KERNEL_FLAGS) screen.c -o screen.elf
	$(CC) $(KERNEL_FLAGS) process.c -o process.elf
	$(CC) $(KERNEL_FLAGS) scheduler.c -o scheduler.elf
	ld -melf_i386 -Tlinker.ld starter.o kernel.elf screen.elf process.elf scheduler.elf -o 539kernel.elf
	objcopy -O binary 539kernel.elf 539kernel.bin
	dd if=bootstrap.o of=kernel.img
	dd seek=1 conv=sync if=539kernel.bin of=kernel.img bs=512 count=8
Nothing new here, just compiling the new C files that we have
added to 539kernel.
5
C H A P T E R 5 : M E M O R Y M A N A G E M E N T
5.1 introduction
5.2 paging in theory 146
following, the first byte represents the page number and the second
1 That is, each page is of size 4KB and each page frame is of size 4KB,
2 In x86, this logical memory address is known as linear memory address as we have
discussed earlier in chapter 2.
3 In 32-bit x86 architecture, the length of a memory address is 4 bytes = 32 bits, that
is, 2^32 bytes = 4GB are addressable. An example of a memory address in this
environment is FFFFFFFFh, which is of length 4 bytes and refers to the last byte of
the memory.
5.3 virtual memory 148
tries to read from the memory address A19Bh in the normal way, the MMU
of the system is going to consider it a logical memory address, so
the page table of process C is used to identify which page frame the
page 00A1h of process C is stored in. As you can see, the process knows
nothing about the outside world and cannot gain this knowledge; it
thinks it is the only process in memory, any memory address
it generates belongs to itself, and it cannot interfere with the translation
process or modify its own page table.
for the first time, only the page that contains the entry code of the
software (e.g. the main function in C) is loaded into memory, and no
other page of that software. When some instruction in the entry code
tries to read data or call a routine that doesn't exist on the loaded
page, the needed page is then loaded into main memory, and the piece
which was not there can be used after this loading. That is, no page of
the process will be loaded into a free page frame unless it's really
needed; otherwise, it stays waiting on the disk. This is known as
demand paging.
By employing demand paging, virtual memory saves a lot of memory
space. Furthermore, virtual memory uses the disk for two things: first,
to store the pages that are not demanded yet; they should be there
so that anytime one of them is needed, it can be loaded from the disk to
main memory. Second, the disk is used to implement an operation
known as swapping.
Even with demand paging, at some point the main memory
will become full. In this situation, when a page needs to be loaded,
the kernel that implements virtual memory should load it, even if the
memory is full! How? The answer is by using the swapping operation:
one of the page frames is chosen to be removed from main
memory; this frame is known in this case as the victim frame. The content
of this frame is written to the disk, that is, it is swapped out, and its
place in main memory is used for the new page that should be
loaded. The swapped-out page is not in main memory anymore,
so when it is needed again, it should be reloaded from the disk to
main memory.
The problem of which victim frame should be chosen is known as the
page replacement problem: when there is no free page frame
and a new page should be loaded, which page frame should we free
in order to be able to load the new page? Of course, there are many page
replacement algorithms out there. One of them is first-in first-out, in
which the page frame that was loaded first among the
current page frames is chosen as the victim frame. Another well-known
algorithm is least recently used (LRU); in this algorithm, every time a
page is accessed, the time of access is stored, and when a victim frame
is needed, the one whose last access is the oldest is chosen.
The page table can be used to store a bunch of information that
is useful for virtual memory. First, a page table entry usually has a flag
known as present; by using this flag, the processor can tell whether the
page that the process tries to access is loaded into memory or not. If
it is loaded, a normal access operation is performed, but when
the present flag indicates that this page is not in memory, what
should be done? For sure, the page should be loaded from the disk into
memory. Usually, the processor itself doesn't perform this loading
operation; instead, it generates an exception known as a page fault and
makes the kernel deal with it. A page fault tells the kernel that one
5.4 paging in x86 150
The page directory in x86 can hold up to 1024 entries. Each entry
points to a page table, and each one of those page tables can hold up
to 1024 entries which represent a process's pages. In other words, we
can say that for each process there is more than one page table; each
of those page tables is loaded in a different part of main
memory, and the page directory of the process helps us locate them.
As we have mentioned before, the page directory is the first level
of x86's page table structure, and each process has its own page directory.
How can the processor find the current page directory, that is, the page
directory of the current process? This is done by using the register
CR3, which stores the base physical memory address of the current
page directory. The first part of a linear address is an index within the
page directory; when this index is scaled by the entry size (4 bytes) and
added to the value in CR3, the result is the memory address of
the entry that represents a page table, which in turn contains an entry
representing the page that contains the required
data.
The size of an entry in the page directory is 4 bytes (32 bits), and its
structure is shown in figure 30. The bits from 12 to 31 contain the
physical memory address of the page table that the entry represents.
Not all the page tables that a page directory points to need to be loaded
into main memory; instead, only the needed page tables are, while the
rest are stored in secondary storage until they are needed, at which
point they are loaded. To be able to implement this mechanism, we need
some place to store a flag that tells us whether the page table in
question is loaded into main memory or not, and that's exactly
the job of bit 0 of a page directory entry. This bit is known as the
present bit; when its value is 1, the page table exists in main
memory, while the value 0 means otherwise. When executing code
tries to read content from a page frame whose page table is not in
memory, the processor generates a page fault that tells the kernel
to load this page table because it is needed right now.
When we discussed segment descriptors, we witnessed
some bits that aim to provide additional protection for a segment.
Paging in x86 also has bits that help provide additional protection.
Bit 1 in a page directory entry decides whether the page table that
the entry points to is read-only (when its value is 0) or writable (when
its value is 1). Bit 2 decides whether access to the page table that this
entry points to is restricted to privileged code, that is, the code that
runs on privilege levels 0, 1 and 2 (when the bit's value is 0), or whether
the page table is also accessible by non-privileged code, that is, the
code that runs on privilege level 3.
Generally in computing, caching is a well-known technique. When
caching is employed in a system, some data is fetched from a source
and stored in a place which is faster to reach compared to the source;
this stored data is known as a cache. The goal of caching is to make
frequently accessed data faster to obtain. Think of your web browser
as an example of caching: when you visit a page 5 on a website, the web
browser fetches the images of that page from the source (the server
of the website) and stores them on your own machine's storage device,
which is definitely much faster to access than a web
server. When you visit the same website later and the web browser
encounters an image to be shown, it checks whether it's cached; if so
(this is known as a cache hit), the image is obtained from your storage
device instead of the web server; if the image is not cached (this is
known as a cache miss), the image is obtained from the web server
to be shown and then cached.
The processor is no exception; it also uses a cache to make things
faster. As you may have noticed, the entries of page directories and page
tables are frequently accessed: in the code of software a lot of memory
accesses happen, and with each memory access both the page directory
and page tables need to be accessed. With this huge number of
accesses to the page table, and given the fact that main memory is
much slower than the processor, some caching is needed, and
that's exactly what is done in x86: a part of the page directory and page
tables is cached in an internal, small and fast memory inside the
processor known as the translation lookaside buffer (TLB). Each time an
entry of a page table or directory is needed, this memory is checked
first; if the needed entry is in it, that is, we got a cache hit, then it
will be used.
In x86 paging, caching is controllable. Say that for some reason you
wish to disable caching for a given entry; that can be done with bit
4 of a page directory entry. When the value of this bit is 1, the
page table that is represented by this entry will not be cached by the
processor, while the value 0 means otherwise.
Unlike web browsers' caches, the cached version of a page table can be
written to. For example, assume that page table x has been cached after
being used for the first time, and there is a page in this page table, call
it y, that isn't loaded into memory. We decide to load the page y, which
means the present bit of the entry that represents this page should be
changed in the page table x. To make things faster, instead of writing
the changes to the page table x in main memory (the source), these
5 Please do not confuse a web page with a process page in this example.
information that a page table entry stores is the base physical memory
address of the page frame; this memory address will be used with
the third part of the linear address (the offset) to get the final physical
memory address.
The entry of a page table is exactly the same as the entry of a page
directory; its size is 4 bytes. There are, though, some small differences:
the first is that bit 7, which was used to decide the page size
in the page directory, is ignored in the entry of a page table. The second
difference is in bit 6, which was ignored in the entry of the page directory;
in page tables this bit is known as the dirty bit.
From our previous discussion of virtual memory, we know that at
some point a victim frame may be chosen. This frame is
removed from main memory to free up some space for another
page that we need to load from the disk. When the victim frame is
removed from main memory, its content should be written to the
disk, since it may have been changed while it was loaded in
memory. Writing the content of the victim frame to the disk and
loading the new page, also from disk, given that the disk is really
slow compared to the processor, is going to cause some performance
penalty.
To make matters a little bit better, we should write the content of
the victim frame only if there is a real change in it compared
to the version which is already stored on the disk. If the victim frame's
version in main memory and the version on the disk are
identical, there is no need to waste valuable resources writing the
same content to the disk; for example, page frames that contain only
code will most probably stay the same all the time, so their versions
on disk and in main memory will be identical. The dirty bit is used to
indicate whether the content of the page frame has been changed and
differs from the disk version (when the value of the bit is 1) or whether
the two versions are identical (when the value is 0).
paging can be implemented, and this can be used as a basis for further
development.
heap will be defined in a new file heap.c and its header file heap.h;
let's start with the latter, which is the following.
unsigned int heap_base;

void heap_init();
int kalloc( int );
#include "heap.h"

void heap_init()
{
	heap_base = 0x100000;
}
As you can see, the function heap_init is quite simple. It assigns the
value 0x100000 to the global variable heap_base; that means the
kernel's run-time heap starts from the memory address 0x100000. In
main.c we need to call this function at the beginning to make sure
that dynamic memory allocation is ready and usable by any other
subsystem, so we first add #include "heap.h" to the include section
of main.c, then we add the call heap_init(); at the beginning of the
kernel_main function. Next is the code of kalloc in heap.c.
int kalloc( int bytes )
{
	unsigned int new_object_address = heap_base;

	heap_base += bytes;

	return new_object_address;
}
The function kalloc receives the number of bytes that the caller needs
to allocate through a parameter called bytes.
In the first step of kalloc, the value of heap_base is copied to a local
variable named new_object_address, which represents the starting
memory address of the newly allocated bytes; this value will be returned
to the caller so the latter can start using the allocated memory region
which begins at this memory address.
The second step of kalloc adds the number of allocated bytes to
heap_base, which means that the next time kalloc is called, it starts with
a new heap_base that contains the memory address right after
the last byte of the region allocated in the previous call. For example,
assume we call kalloc for the first time with 4 as a parameter, that is,
we need to allocate four bytes from the kernel's run-time heap. The base
memory address that will be returned is 0x100000, and since we need to
store four bytes, we are going to store them at the memory addresses
0x100000, 0x100001, 0x100002 and 0x100003, respectively. Just before
returning the base memory address, kalloc adds 4, the number of
requested bytes, to the base of the heap heap_base, which initially
contained the value 0x100000; the result, 0x100004, is stored in
heap_base. The next time kalloc is called, the base memory address of
the allocated region will be 0x100004, which is, obviously, right after
0x100003.
As you can see from the allocator's code, there is no way to implement
a free function. Usually, this function takes the base memory
address of a region in the run-time heap and tells the memory allocator
that the region which starts at this base address is free now and can
be used for other allocations. Freeing memory regions when the code
finishes using them helps ensure that the run-time heap is
not filled too soon; when an application doesn't free up the memory
regions that are not needed anymore, it causes a problem known as a
memory leak.
In our current memory allocator, the function free cannot be implemented
because there is no way to know how many bytes to free up
given only the base address of a memory region. Returning to the previous
example, the region of the run-time heap which starts at the base
address 0x100000 has a size of 4 bytes; if we want to tell the memory
allocator to free this region, it must know the size of the region
being freed. That, of course, means that the memory allocator needs to
maintain a data structure that can be used, at least, when the user needs
to free a region up. One simple way to be able to implement free in our
current memory allocator is to modify kalloc and make it use, for
example, a linked list: whenever kalloc is called to allocate a region,
a new entry is created and inserted into the linked list; this entry
can be stored right after the newly allocated region and contains the
base address of the region and its size. After that, when the user
requests to free up a region by giving its base memory address, the free
function can search this linked list until it finds the entry of that
region and mark in the same entry that this region is now free and can
be used for future allocations; that is, memory which was allocated once
and freed by using the free function can be reused later.
Our current focus is not on implementing a full memory allocator,
so it is up to you as a kernelist to decide how your kernel's memory
allocator works; of course, there are a bunch of already existing
algorithms, as we have mentioned earlier.
To make sure that our memory allocator works fine, we can use it
when a new process control block is created. It can also be used for
the processes table; as you may recall, the processes table from version
T is an array which is allocated statically and its size is 15. Instead,
the memory allocator could be used to implement a linked list that
stores the list of processes. However, for the sake of simplicity, we
will stick here with creating PCBs dynamically as an example of using
kalloc, while keeping the processes table for you to decide whether it
should be a dynamic table or not, and how to design it if you decide
that it should be dynamic.
The first thing we need to do in order to allocate PCBs dynamically
is to change the parameter list of the function process_create in both
process.h and process.c. As you may recall, in version T, the second
parameter of this function was called process and it was the memory
address in which the PCB of the new process would be stored. We had
to do that since dynamic memory allocation wasn't available, so we
were creating a local variable in the caller for each new PCB, then
passing the memory address of that local variable to process_create to
be used for the new PCB. This second parameter is not needed anymore
since the region of the new PCB will be allocated dynamically by
kalloc and its memory address will be returned by the same function.
So, the prototype of the function process_create will be, in process.h
and process.c respectively, as the following.
process_t *process_create( int * );
You can also notice that the function now returns a pointer to the
newly created PCB; in version T it returned nothing. The next change
will be in the code of process_create. The name of the eliminated
parameter of process_create was process, and it was a pointer to the
type process_t. We substitute it with the following line, which should
be at the beginning of process_create.
process_t *process = kalloc( sizeof( process_t ) );
5.5.2 Paging
8 The first page table is the one which is pointed to by the first entry in the page
directory.
to page frame 0 and so on. The memory allocator will be used when
initializing the kernel's page directory and page tables; we could
allocate them statically, as we have done with the GDT for example, but
that would increase the size of the kernel's binary file.
Before getting started with the details, two new files need to be
created: paging.h and paging.c, which will contain the code related to
paging. The content of paging.h is the following.
#define PDE_NUM 3
#define PTE_NUM 1024

extern void load_page_directory();
extern void enable_paging();

unsigned int *page_directory;

void paging_init();
int create_page_entry( int, char, char, char, char, char, char, char, char );
The PDE part in the name of the macro PDE_NUM is short for page
directory entries, so this macro represents the number of entries that
will be defined in the kernel's page directory. Any page directory may
hold up to 1024 entries, but in our case not all of these entries are
needed, so only 3 will be defined instead; that means only three page
tables will be defined for the kernel. How many entries will be defined
in those page tables is decided by the macro PTE_NUM, where PTE is
short for page table entries; its value is 1024, which means there
will be 3 entries in the kernel's page directory and each one of them
points to a page table which has 1024 entries. The total number of
entries will be 3 * 1024 = 3072, and since each of these entries maps a
page frame of size 4KB, 12MB of the physical memory will be mapped in
the page tables that we are going to define. And since our mapping is
one-to-one, the reachable physical memory addresses start at 0 and end
at 12582912; any region beyond this range, based on our setting, will
not be reachable by the kernel, and accessing it is going to cause a
page fault exception. It is your choice to set the value of PDE_NUM to
the maximum (1024), which would make 4GB of memory addressable.
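The arithmetic above can be double-checked with a couple of trivial helper functions; these are purely illustrative and not part of the kernel.

```c
#include <assert.h>

#define PDE_NUM 3
#define PTE_NUM 1024
#define PAGE_SIZE 4096

/* Total number of page table entries defined by the kernel. */
int total_entries( void )
{
    return PDE_NUM * PTE_NUM;
}

/* Size of the physical memory those entries map with a one-to-one
 * mapping, given that each entry maps a 4KB page frame. */
int mapped_bytes( void )
{
    return total_entries() * PAGE_SIZE;
}
```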
Getting back to the details of paging.h, both load_page_directory
and enable_paging are external functions that will be defined in
assembly and used in paging.c. The first function loads the address of
the kernel's page directory into the register CR3; this address can be
found in the global variable page_directory, but of course its value
will only be available after the needed space is allocated by kalloc.
The second function is the one that modifies the register CR0 to enable
paging in x86; it should be called after finishing the initialization
of the kernel's page directory and loading it.
and the kernel's page tables, which implement a one-to-one map based on
the sizes that are defined in the macros PDE_NUM and PTE_NUM. The code
of paging_init is the following.
void paging_init()
{
    // PART 1:

    unsigned int curr_page_frame = 0;

    page_directory = kalloc( 4 * 1024 );

    for ( int currPDE = 0; currPDE < PDE_NUM; currPDE++ )
    {
        unsigned int *pagetable = kalloc( 4 * PTE_NUM );

        for ( int currPTE = 0; currPTE < PTE_NUM; currPTE++, curr_page_frame++ )
            pagetable[ currPTE ] = create_page_entry( curr_page_frame * 4096, 1, 0, 0, 1, 1, 0, 0, 0 );

        page_directory[ currPDE ] = create_page_entry( pagetable, 1, 0, 0, 1, 1, 0, 0, 0 );
    }

    // ... //

    // PART 2

    load_page_directory();
    enable_paging();
}
Given that the size of a page is 4KB, then, page frame number 0
which is the first page frame starts at the physical memory address
0 and ends at physical memory address 4095, in the same way, page
frame 1 starts at the physical memory address 4096 and ends at the
physical memory address 8191 and so on. In general, with one-to-
one mapping, given n is the number of a page frame and the page
size is 4KB, then n * 4096 is the physical memory address that this
page frame starts at. We use this equation in the first parameter that
we pass to create_page_entry when we create the entries that point
to the page frames, that is, page table entries. The local variable
curr_page_frame denotes the current page frame that we are defining
an entry for, and this variable is increased by 1 with each new page
table entry. In this way we can ensure that the page tables that we are
defining implement a one-to-one map.
As you can see from the rest of the parameters, for each entry in the
page table we indicate that the page frame is present, its caching is
enabled and the write-through policy is used. Also, the page frame
belongs to the supervisor privilege level and the page size is 4KB.
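These flags occupy the low bits of each entry, while the upper bits hold the page-aligned base address. The following sketch composes an entry using the standard x86 bit layout; it is illustrative and is not the book's create_page_entry itself.

```c
#include <assert.h>

/* Standard x86 page-entry flag bits: the upper 20 bits of an entry hold
 * the 4KB-aligned base address and the lower 12 bits hold the flags. */
#define PE_PRESENT        ( 1 << 0 )  /* page is present in memory       */
#define PE_WRITABLE       ( 1 << 1 )  /* 0 = read-only                   */
#define PE_USER           ( 1 << 2 )  /* 0 = supervisor privilege level  */
#define PE_WRITE_THROUGH  ( 1 << 3 )  /* write-through caching policy    */
#define PE_CACHE_DISABLED ( 1 << 4 )  /* 0 = caching enabled             */

unsigned int make_entry( unsigned int base_address, unsigned int flags )
{
    /* The base address must be page-aligned, so its low 12 bits are
     * zero and the flags can simply be OR-ed in. */
    return base_address | flags;
}
```

Because the low 12 bits carry only flags, shifting an entry right by 12 recovers the page frame number that the entry points to.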
The code which defines a new entry in the page directory is similar
to the one which defines an entry in a page table; the main difference
is, of course, the base address, which should be the memory address of
the page table that belongs to the current entry of the page directory.
When we allocate a memory region for the current page table that
we are defining, its base memory address is returned by kalloc and
stored in the local variable pagetable, which is used as the first
parameter when we define an entry in the page directory.
There is nothing new in the first line. We are telling NASM that
there is a symbol named page_directory that will be used in the
assembly code but isn't defined in it; instead, it's defined elsewhere
and the linker will resolve its address later. As you know,
page_directory is the global variable that we have defined in paging.h
and holds the memory address of the kernel's page directory; it will be
used in the code of load_page_directory.
The last two lines are new. What we are telling NASM here is that
there will be two labels in the current assembly code named
load_page_directory and enable_paging, and both of them should be
global, that is, reachable from places other than the current assembly
code; in our case, that's the C code of the kernel. The following is
the code of those functions; they reside in starter.asm below the line
bits 32 since they are going to run in a 32-bit environment.
load_page_directory:
    mov eax, [page_directory]
    mov cr3, eax

    ret

enable_paging:
    mov eax, cr0
    or eax, 80000000h
    mov cr0, eax

    ret
There is nothing new here. In the first function we load the content
of page_directory into the register CR3, and in the second function we
use a bitwise operation to set bit 31 of CR0 to 1, which enables
paging. Finally, paging_init should be called by kernel_main right
after heap_init; the full list of calls in the proper order is the
following.
heap_init();
paging_init();
screen_init();
process_init();
scheduler_init();
6.1 introduction
Given that both the processor and the main memory are resources in the
system, we have seen till this point how a kernel of an operating
system works as a resource manager; 539kernel manages these
resources 1 and provides them to the different processes in the system.
Another role of a kernel is to provide a way to communicate with
external devices, such as the keyboard and hard disk. Device drivers
are the way of realizing this role of the kernel. The details of the
external devices and how to communicate with them are low-level
and may change at any time. The goal of a device driver is to
communicate with a given device by using the device's own language 2
on behalf of any component of the system (e.g. a process) that would
like to use the device. Device drivers provide an interface that can
be called by the other system's components in order to tell the device
to do something; we can consider this interface as a library that we
use in normal software development. In this way, the low-level details
of the device are hidden from the other components, and whenever
these details change only the code of the device driver needs to be
changed; the interface can be kept the same so its users are not
affected. Also, hiding the low-level details from the driver's users
keeps the driver simple to use.
The matter of hiding the low-level details behind something
higher-level is very important and can be found basically everywhere in
computing, and kernels are not an exception. Of course, there is
virtually no limit to providing higher-level concepts on top of a
previous lower-level concept; upon something that we consider a
high-level concept we can build something even higher-level. Beside the
previous example of device drivers, one of the obvious examples where
kernels fulfill the role of hiding the low-level details and providing
something higher-level, in other words, providing an abstraction, is
the filesystem, which provides the well-known abstraction: a file.
1 Incompletely of course; to keep 539kernel as simple as possible, only the basic parts
of resource management were presented.
2 The word language here is a metaphor; it doesn't mean a programming language.
6.2 ata device driver 170
In this chapter we are going to cover these two topics, device drivers
and filesystems, by using 539kernel. As you may recall, it turned out
that accessing the hard disk is an important aspect of virtual memory,
so, to be able to implement virtual memory, the kernel itself needs to
access the hard disk, which makes the hard disk driver an important
component of the kernel; therefore, we are going to implement a device
driver that communicates with the hard disk in this chapter. After
gaining the ability to read from the hard disk and write to it, we can
explore the idea of providing abstractions by the kernel through
writing a filesystem that uses the hard disk device driver and provides
the higher-level view of the hard disk that we are all familiar with,
instead of the physical view of the hard disk which has been described
previously in chapter 1. The final result of this chapter is version NE
of 539kernel.
Needless to say, hard disks are very common devices that are used
as secondary storage. There are a lot of manufacturers that produce
hard disks and sell them. Imagine for a moment that each hard disk from
a different manufacturer used its own way of communication between the
software and the hard disk, that is, method X should be used to
communicate with hard disks from manufacturer A while method Y should
be used with hard disks from manufacturer B and so on; given that there
are many manufacturers, this would be a nightmare. Each hard disk would
need its own device driver, which talks a different language from the
other hard disk device drivers.
Fortunately, this is not the case, at least for hard disks; in these
situations, standards come to the rescue. A manufacturer may
design the hard disk hardware in any way, but when it comes to the
part of communication between the hard disk and the outside
world, a standard can be used, so any device driver that works with
this given standard will be able to communicate with the new hard
disk. There are many well-known standards that are related to hard
disks: small computer system interface (SCSI) is one of them, and
another is advanced technology attachment (ATA); another well-known
name for ATA is Integrated Drive Electronics (IDE). The older ATA
standard is now known as Parallel ATA (PATA) while the newer version of
ATA is known as Serial ATA (SATA). Because ATA is more common
in personal computers we are going to focus on it here and write a
device driver for it; SCSI is more common in servers.
As with the PIC, which has been discussed in chapter 3, ATA hard disks
can be communicated with by using port-mapped I/O
through the instructions in and out. But before discussing the ATA
commands that let us issue a read or write request to the hard
disk, let's write two routines in assembly that can be used as
dev_write:
    ; Part 1
    push edx
    push eax

    ; Part 2
    xor edx, edx
    xor eax, eax

    ; Part 3
    mov dx, [esp + 12]
    mov al, [esp + 16]

    ; Part 4
    out dx, al

    ; Part 5
    pop eax
    pop edx

    ret
The core part of this routine is part four, which contains the
instruction out that sends the value of AL to the port number stored
in DX. Because we are using these two registers 3 , we push their
previous values onto the stack, and that's performed in the first part
of the routine. Pushing the previous values of these registers lets us
restore them easily after the routine finishes its work; this
restoration is performed in the fifth part of the routine, right before
returning from it. This is an important step to make sure that when the routine
3 Which are, as you know, parts of the registers EAX and EDX respectively.
returns, the environment of the caller will be the same as it was
before calling the routine.
After storing the previous values of EAX and EDX we can use them
freely, so the first step after that is to clear their previous values
by setting both of them to 0. As you can see, we have used xor with
both of its operands being the same register (hence, value) that we
wish to clear; this is a well-known way in assembly programming to
clear the value of a register 4 . After that, we can move the values
that have been passed to the routine as parameters into the correct
registers to be used with the out instruction; this is performed in the
third part of the routine 5 .
Beside dev_write, we need to define another routine called dev_write_word
which is exactly the same as dev_write but writes a word (2 bytes) instead
of one byte to a port. The following is the code of this routine.
dev_write_word:
    push edx
    push eax

    xor edx, edx
    xor eax, eax

    mov dx, [esp + 12]
    mov ax, [esp + 16]

    out dx, ax

    pop eax
    pop edx

    ret
As you can see, the only difference between dev_write and dev_write_word
is that the first one uses the register al (8 bits) as the second operand
of out while the second one uses ax (16 bits) instead, so a word can be
written to the port.
The following is the code of the routine dev_read, which uses the
instruction in to read data from a given port and return it to the
caller; its prototype can be imagined as char dev_read( int port ).
dev_read:
    push edx

    xor edx, edx
    xor eax, eax

    mov dx, [esp + 8]

    in al, dx

    pop edx

    ret
3
4 To the best of my knowledge, its performance is better than the normal way of using mov.
5 You may notice that I've omitted the prologue that creates a new stack
frame for these routines; this decision has been made to make the matters simpler and shorter. You are
absolutely free to use the calling convention, and most probably using it is a better
practice.
Terms that combine a bus name with a device name are used to
specify exactly which device is being discussed, for example, pri-
mary master means the master hard disk that is connected to the
primary bus while secondary slave means the slave hard disk which
is connected to the secondary bus.
The port numbers that can be used to communicate with the devices
that are attached to the primary bus start at 0x1F0 and end at
0x1F7; each one of these ports has its own functionality. The port
numbers from 0x170 to 0x177 are used to communicate with devices
that are attached to the secondary bus, so there are eight ports for
each ATA bus.
For the sake of simplicity, our device driver is going to assume that
there is only a primary master, and all read and write requests should
be sent to this primary master; therefore, our device driver uses the
port number 0x1F0 as the base port when sending commands.
You may ask, why are we calling this port number a base port? As
you know, all the following port numbers are valid for communicating
with the primary ATA bus: 0x1F0, 0x1F1, 0x1F2, 0x1F3, 0x1F4, 0x1F5,
0x1F6 and 0x1F7, so we can add any number from 0 through 7 to the base
port number of the primary bus, 0x1F0, to get a correct port number;
the same holds true for the secondary ATA bus, whose base port
number is 0x170. So, we can define the base port as a macro (or
even a variable), as we will see in our device driver, then use
this macro by adding a specific value to it from 0 through 7 to get a
specific port; the advantage of doing so is that the base port can
easily be changed to another port without the need of changing
the rest of the code.
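To illustrate, the eight register ports of a bus can be derived from its base port; the helper function port here is hypothetical, while BASE_PORT mirrors the macro our driver defines.

```c
#include <assert.h>

/* Base port of the primary ATA bus; switching the whole driver to the
 * secondary bus would only require changing this value to 0x170. */
#define BASE_PORT 0x1F0

/* Hypothetical helper: derive a register port from the base port by
 * adding an offset from 0 through 7. */
int port( int offset )
{
    return BASE_PORT + offset;
}
```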
Before starting with the implementation of the driver, let's create two
new files: ata.h and ata.c, which will contain the code of the ATA
device driver that provides an interface for the rest of the kernel
to write to and read from the disk. The following is the content of
ata.h; the details of the functions will be discussed in the next
subsections.
#define BASE_PORT 0x1F0
#define SECTOR_SIZE 512

void wait_drive_until_ready();

void *read_disk( int );
void write_disk( int, short * );

void *read_disk_chs( int );
void write_disk_chs( int, short * );
Addressing Mode
As in the main memory, hard disks use addresses to read the data
that are stored in a specific area of the disk; the same is applicable
to the write operation, where the same address can be used to write on
the same specific area. There are two schemes of hard disk addressing:
the older one is known as cylinder-head-sector addressing (CHS) while
the newer one, which is more dominant now, is known as logical block
addressing (LBA).
In chapter 1 we covered the physical structure of hard disks,
and we know from that discussion that the data are stored in small
blocks known as sectors; also, there are tracks, each of which
consists of a number of sectors, and finally, there are heads that
should be positioned on a specific sector to read from it or to write
to it. The CHS scheme uses the same concepts as the physical structure
of the hard disk: the address of a given sector on the hard disk is
composed by combining three numbers together, the cylinder (track) that
this sector resides on, the sector that we would like to access and the
head that is able to access this sector. However, this scheme is
obsolete now and LBA is used instead of it.
In LBA, a logical view of the hard disk is used instead of the physical
view. This logical view states that the hard disk is composed of a
number of logical blocks with a fixed size, say, n bytes. These blocks
are contiguous, in a way similar to the main memory, and to reach any
block you can use its own address; the addresses start from 0, the
block right after the first one has the address 1 and so on. As you can
see, addressing in LBA is more like the addressing of the main memory;
the main difference here is that in current computers each address of
the main memory points to a byte in memory while an address in LBA
points to a block, which can be a sector (512 bytes) or even bigger.
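For reference, the two schemes are related by the standard conversion formula; the following sketch is illustrative and not part of 539kernel's code.

```c
#include <assert.h>

/* Standard CHS-to-LBA conversion: given a disk geometry of `heads`
 * heads per cylinder and `spt` sectors per track, the logical block
 * address of the triple (cylinder, head, sector) is computed as below.
 * Note that sector numbers in CHS start at 1 while LBA addresses start
 * at 0, hence the ( sector - 1 ). */
int chs_to_lba( int cylinder, int head, int sector, int heads, int spt )
{
    return ( cylinder * heads + head ) * spt + ( sector - 1 );
}
```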
Once the read command is issued with the right parameters passed
to the correct ports, we can read the value of base_port + 7 to check
whether the disk has finished the read operation or not by reading the
eighth bit (bit 7) of that value: when the value of this bit is 1 the
drive is busy, and once it becomes 0 the operation has completed.
When the read operation is completed successfully, the data are
brought to base_port, which means we need to read from it and put
the required data in the main memory. The following is the code of
read_disk_chs, which should reside in ata.c. Don't forget to include
ata.h when you create ata.c.
until the device finishes its work. The following is the code of this
function.
void wait_drive_until_ready()
{
    int status = 0;

    do
    {
        // Bit 7 (BSY) of the status register is set while the drive is busy.
        status = dev_read( BASE_PORT + 7 );
    } while ( ( status & 0x80 ) == 128 );
}
that we would like to read from; the other parts of the address are
divided among the following ports: base_port + 3 contains bits 0 to 7
of that address, base_port + 4 contains bits 8 to 15 and base_port + 5
contains bits 16 to 23. Both ports base_port + 2 and base_port + 7
stay the same. The following table summarizes the parameters of the
read command when LBA is used.
Writing to Disk
In both CHS and LBA, the write operation is invoked via ports exactly
in the same way as reading, though there are two differences. First,
the command number to issue a write request is 0x30, which should
    wait_drive_until_ready();
}
6.3 filesystem
to mean the second definition. Also, when the term address is used in
the next discussions, it means a logical block address.
As in programming languages, a filesystem may be divided into two
parts: a design and an implementation. The design of a filesystem
tells us how this filesystem stores the information about the run-time
filesystem, how the metadata are organized, which data structures are
used to fulfill the goal of the filesystem and so on 8 . Of course, the
design can be there, written on paper as a brilliant piece of
theoretical work, but to realize this design an implementation should
be written that uses it 9 . For example, a well-known filesystem is FAT
(short for: file allocation table), an old filesystem that started
in 1977 and is still in use nowadays. Because its design is available,
anyone can write an implementation of it and make her kernel able to
read run-time filesystems that have been created by another operating
system that uses FAT; the Linux kernel is an example of a kernel that
has a FAT implementation. As an example of a filesystem, we are
going to design and implement the 539filesystem in this section.
the last file that has been created in the run-time filesystem, that is,
the tail file 11 .
Each file has its own metadata that contains the file's name and a
“next” field, which stores the metadata address of the next file that
has been created. The length of the filename is 256 bytes and the size
of the “next” field is 4 bytes. When there is no next file, the value 0
is stored in the “next” field of the last file's metadata, that is, the
tail file.
It should be obvious now how we can reach all files in a run-time
filesystem that uses 539filesystem: starting from the base block we get
the metadata address of the head file, and by using the “next” field
of this metadata we can reach the metadata of the next file; the
process continues until we reach the tail file.
The metadata of each file is stored in the block right before the
content of the file, which is stored in one block only, given that
the size of a block is 512 bytes 12 . For example, if the metadata of
file A is stored in the address 103, then the content of this file is
stored in the address 104. By using this design, the basic
functionalities of filesystems can be provided. Figure 31 shows an
overview of the 539filesystem design where four files, x, y, z and w,
are stored in the system.
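A metadata type consistent with these sizes might be defined as follows; this is a sketch based on the description above (a 256-byte filename plus a 4-byte “next” field), and the exact definition used by 539filesystem may differ.

```c
#include <assert.h>

#define FILENAME_LENGTH 256

/* Layout of one metadata block as described above: the filename
 * followed by the metadata address of the next file, where 0 marks
 * the tail file. */
typedef struct
{
    char filename[ FILENAME_LENGTH ];
    unsigned int next_file_address;
} metadata_t;
```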
Figure 32: The State of 539filesystem After Creating the First File
writing the code of create_file. Let’s start with the first part of the
function.
void create_file( char *filename, char *buffer )
{
    int metadata_lba = ( base_block->head == 0 ) ? BASE_BLOCK_ADDRESS + 1 : base_block->tail + 2;
    int file_lba = metadata_lba + 1;

    metadata_t *metadata = kalloc( sizeof( metadata_t ) );

    metadata->next_file_address = 0;

    int currIdx;

    for ( currIdx = 0; *filename != '\0' && currIdx < FILENAME_LENGTH - 1; currIdx++, filename++ )
        metadata->filename[ currIdx ] = *filename;

    metadata->filename[ currIdx ] = '\0';

    write_disk( metadata_lba, metadata );
    write_disk( file_lba, buffer );
When the value of the head in the base block is 0, that means there
are no files at all in the run-time filesystem. When create_file is
called in this situation, the file that the caller is requesting to
create will be the first file in the run-time filesystem, and the
metadata of this first file can simply be stored in the block right
after the base block. In create_file this fact is used to decide the
disk address for the metadata of the new file; this address is stored
in the local variable metadata_lba, whose name is short for “metadata
logical block address”. Figure 32 shows the state of 539filesystem
after creating the first file A in the run-time filesystem.
In case the run-time filesystem is not empty, that is, the value
of head is not 0, the tail field of the base block can be used to
decide the metadata address of the new file. As we know, the tail field
contains the metadata address of the last file that has been added to
the run-time filesystem, and the content of that file is stored in the
disk address tail + 1, which means tail + 2 is a free block that can
be used to store new data 13 , so we choose this address for the new
metadata in this case. After that, the disk address of the new content
is decided by simply adding 1 to the disk address of the new metadata;
the address of the content is stored in the local variable file_lba.
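The address arithmetic just described can be sketched as a pair of tiny helpers; the helper names and the value chosen for BASE_BLOCK_ADDRESS here are hypothetical, for illustration only.

```c
#include <assert.h>

/* Hypothetical value; the real base block address is defined elsewhere
 * in 539filesystem. */
#define BASE_BLOCK_ADDRESS 100

/* The metadata of a new file goes right after the base block when the
 * run-time filesystem is empty (head == 0), otherwise right after the
 * content block of the current tail file (tail + 2). */
int new_metadata_lba( int head, int tail )
{
    return ( head == 0 ) ? BASE_BLOCK_ADDRESS + 1 : tail + 2;
}

/* The content always goes in the block right after the metadata. */
int new_file_lba( int head, int tail )
{
    return new_metadata_lba( head, tail ) + 1;
}
```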
After deciding the disk addresses of the new metadata and file content,
we start creating the metadata of the file in order to store it later
on the disk. As you can see in the code, we allocate space in the
kernel's heap for the new metadata based on the type metadata_t;
after this allocation, we can use the local variable metadata to fill
the fields of the new file's metadata. First, we set the value of the
“next” field to 0 because, as we mentioned earlier, this new file will
be the tail file, which means there is no file after it. Then, we copy
the filename, which is passed through the parameter filename, to the
filename field of the metadata; in case the passed filename's length is
less than the maximum length, the whole filename is copied, otherwise
only the maximum number of characters of the passed filename is copied
and the rest are simply ignored. The final step that is related to the
new file is to write the metadata and the file content to the right
addresses on the disk, and this is done in the last two lines, which
use the ATA device driver. The following is the next and last part of
create_file, which updates the base block depending on the current
state of the run-time filesystem.
if ( base_block->head == 0 )
{
    update_base_block( metadata_lba, metadata_lba );
}
else
{
    metadata_t *tail_metadata = load_metadata( base_block->tail );

    tail_metadata->next_file_address = metadata_lba;

    write_disk( base_block->tail, tail_metadata );
    update_base_block( base_block->head, metadata_lba );
}
} // End of "create_file"
When the run-time filesystem is empty, that is, the value of head in
the base block is 0, the new file that we are creating will be both
the head and the tail file. As you can see, in the block of the if
statement that checks whether head equals 0 or not, the not-yet-defined
function
13 This is ensured since 539filesystem stores the files in order, so there will be no files
after the tail unless it is a deleted file, which can be overwritten and causes no data
loss.
It's quite simple: it receives the values of head and tail that we
would like to set on the base block, then the copy of the base block
which is stored in main memory is updated, and this updated version is
written over the base block address on the disk. The following is the
code of load_metadata, which has been used in the create_file function.
metadata_t *load_metadata( int address )
{
    metadata_t *metadata = read_disk( address );

    return metadata;
}
Simply, it receives a disk address and assumes that the block which
is represented by this address is a metadata block. It loads this
metadata into main memory by loading the content of the address from
the disk through the device driver function read_disk. The following is
Figure 33: Steps Needed to Create a New File in 539filesystem When the
Run-time Filesystem isn't Empty
The first part of list_files handles the case where the run-time
filesystem is empty; in this case it returns -1 to indicate that there
are no files to list. In case the run-time filesystem isn't empty, the
function, in its second part, allocates space in the kernel's heap for
the list of the filenames; as you can see, we have used a function
named get_files_number to decide how many bytes we are going to
allocate for this list. Based on its name, this function returns the
number of files in the run-time filesystem; its code will be presented
in a moment. In the third part, the function is ready to traverse the
list of files' metadata, which are stored on the disk and are reachable
starting from the disk address which is stored in the head field of the
base block.
Initially, the metadata of the head file is loaded into memory and
can be accessed through the local variable curr_file; then the loop
is started. In the body of the loop, the filename of the current file's
metadata is appended to the result variable list; in the first
iteration of this loop the filename will be the one that belongs to the
head file. After appending the filename of the current file to list,
the function checks whether the current file is the tail file or not by
checking the value of the “next” field next_file_address: if it is 0
then the current file is the tail, so the loop breaks and the result is
returned to the caller. In case the current file isn't the tail file,
the metadata of the next file is loaded by using the disk address which
is stored in the “next” field of the current file; the current value of
curr_file is replaced with a memory address that points to the metadata
of the next file, which will be used in the next iteration of the loop.
The same operation continues until the function reaches the tail, which
breaks the loop and returns the list to the caller. The following is
the code of get_files_number, which was used in list_files and, as
mentioned earlier, returns the number of stored files.
int get_files_number()
{
    if ( base_block->head == 0 )
        return 0;

    int files_number = 0;

    // ... //

    metadata_t *curr_file = load_metadata( base_block->head );

    while ( 1 )
    {
        files_number++;

        if ( curr_file->next_file_address == 0 )
            break;

        curr_file = load_metadata( curr_file->next_file_address );
    }

    return files_number;
}
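To see how this traversal behaves in isolation, the following self-contained user-space sketch simulates the on-disk linked list with an in-memory array, where load_metadata simply indexes into it. The structure layouts here are simplified assumptions for illustration and are not 539kernel's exact definitions.

```c
#include <string.h>

#define MAX_BLOCKS 16

typedef struct
{
    char filename[8];
    int next_file_address; // disk address of the next file's metadata, 0 = tail
} metadata_t;

typedef struct
{
    int head; // disk address of the first file's metadata, 0 = empty
    int tail; // disk address of the last file's metadata
} base_block_t;

// A simulated disk: each "disk address" indexes one metadata block.
static metadata_t disk[MAX_BLOCKS];
static base_block_t base_block_storage;
static base_block_t *base_block = &base_block_storage;

// Stand-in for 539kernel's load_metadata: returns the metadata stored
// at the given simulated disk address.
metadata_t *load_metadata( int address )
{
    return &disk[address];
}

// The same traversal as the book's get_files_number: walk the chain
// starting from the head until a "next" field of 0 is reached.
int get_files_number()
{
    if ( base_block->head == 0 )
        return 0;

    int files_number = 0;

    metadata_t *curr_file = load_metadata( base_block->head );

    while ( 1 )
    {
        files_number++;

        if ( curr_file->next_file_address == 0 )
            break;

        curr_file = load_metadata( curr_file->next_file_address );
    }

    return files_number;
}
```

Against the simulated disk, the function counts the files exactly as it does in the kernel: an empty base block yields 0, and a chain of linked metadata entries yields the length of the chain.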
The following is the code of print_fs, which uses list_files and
get_files_number to print the run-time filesystem tree.

void print_fs()
{
    char **files = list_files();

    for ( int currIdx = 0; currIdx < get_files_number(); currIdx++ )
    {
        print( "File: " );
        print( files[ currIdx ] );
        println();
    }

    print( "==" );
    println();
}
Reading a File
The function read_file reads the content of a file whose name is
passed as a parameter, then returns to the caller the address of the
buffer that stores the content of the file. Because the file size in
539filesystem is always 512 bytes, read_disk of the ATA device driver
can be called just one time to load a file.
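Based on that description, read_file can be sketched as follows. Note that this is an illustrative reconstruction rather than the kernel's verbatim listing: the read_disk and get_address_by_filename definitions below are simulation stubs over an in-memory "disk", included only so the sketch is self-contained and runnable in user space.

```c
#include <string.h>

// Simulated 512-byte blocks standing in for the ATA driver's view of
// the disk: block N holds a file's metadata, block N + 1 its content.
static char disk_blocks[8][512];

// Stand-in for the ATA driver's read_disk: returns the block stored
// at the given simulated disk address.
char *read_disk( int address )
{
    return disk_blocks[address];
}

// Stand-in for the real lookup: this simulation holds a single file
// whose metadata lives at disk address 1; 0 means "not found".
int get_address_by_filename( char *filename )
{
    return strcmp( filename, "first_file" ) == 0 ? 1 : 0;
}

// Reconstruction of read_file as described in the text: locate the
// file's metadata, then read the content block right after it.
char *read_file( char *filename )
{
    int address = get_address_by_filename( filename );

    if ( address == 0 )
        return 0;

    // "address" is the disk address of the file's metadata, so the
    // content is loaded from "address + 1".
    return read_disk( address + 1 );
}
```

The key detail is the final call: because a file's content is always stored in the block immediately after its metadata, a single read of address + 1 is enough to load the whole file.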
The task of finding the disk address of the file's metadata is per-
formed by the function get_address_by_filename, which we will
define in a moment. When the metadata of the file is not found,
read_file returns 0; otherwise, the file is read by calling read_disk.
As you can see, the parameter that is passed to this function is address
+ 1, since the value of address is the disk address of the file's metadata
and not of its content. Finally, the address of the buffer is returned to
the caller. The following is the code of get_address_by_filename.
int get_address_by_filename( char *filename )
{
    metadata_t *curr_file = load_metadata( base_block->head );
    int curr_file_address = base_block->head;

    int idx = 0;

    while ( 1 )
    {
        // 539kernel's strcmp returns 1 when the two strings are equal.
        if ( strcmp( curr_file->filename, filename ) == 1 )
            return curr_file_address;

        if ( curr_file->next_file_address == 0 )
            break;

        curr_file_address = curr_file->next_file_address;
        curr_file = load_metadata( curr_file->next_file_address );
    }

    return 0;
}
Deleting a File
The function delete_file removes from the run-time filesystem the
file whose name is passed as a parameter.

void delete_file( char *filename )
{
    // Part 1
    int curr_file_address = get_address_by_filename( filename );

    if ( curr_file_address == 0 )
        return;

    metadata_t *curr_file_metadata = load_metadata( curr_file_address );

    // Part 2
    if ( get_files_number() == 1 )
    {
        update_base_block( 0, 0 );

        return;
    }

    // Part 3
    if ( curr_file_address == base_block->head )
    {
        update_base_block( curr_file_metadata->next_file_address, base_block->tail );
    }
    // Part 4
    else
    {
        int prev_file_address = get_prev_file_address( curr_file_address );

        metadata_t *prev_file = load_metadata( prev_file_address );

        prev_file->next_file_address = curr_file_metadata->next_file_address;

        write_disk( prev_file_address, prev_file );

        if ( curr_file_address == base_block->tail )
            update_base_block( base_block->head, prev_file_address );
    }
}
The first part tries to find the metadata address of the file in question
by using the function get_address_by_filename; in case the file is not
found, the function does nothing and returns. Otherwise, the metadata
of the file is loaded, and the local variable curr_file_metadata is used
to point to that metadata in the main memory.
In the second part, the most basic case of deleting a file is handled:
when there is only one file in the run-time filesystem, nothing needs to
be done but updating the base block to indicate that the disk address
of both the head and the tail is 0, which means, as mentioned earlier,
that the run-time filesystem is empty. The function update_base_block
is used to update the base block. Figure 34 shows this case.
Figure 34: The State of 539filesystem After Removing the Only File

The third part handles the case where the file to be deleted is the
head file. In this case, to remove the reference to this file, we simply
replace the current value of the head in the base block with the
metadata address of the file right next to the head, which can be
found in the "next" field of the current head; thus, the second file
becomes the head file after the delete process finishes. Figure 35
shows this case.

Figure 35: The Steps Needed to Delete the Head File in 539filesystem
The fourth part of the function handles the case where the file to
be deleted is not the head. In this case, the previous file's metadata
needs to be found in order to modify its "next" field by replacing it
with the value of the "next" field of the file that we would like to
delete; in this way, we make sure that the reference to the file to be
deleted is removed from 539filesystem's data structure and that the
previous file is linked with the next file. Figure 36 shows this case.
Also, in this case, the file in question may be the tail; therefore, the
tail in the base block should be replaced with the disk address of the
previous file's metadata. Figure 37 shows this case.
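The unlink operation of this fourth part can be demonstrated in isolation with a small user-space sketch; the structures here are simplified stand-ins for the kernel's metadata, not 539kernel's actual definitions.

```c
typedef struct
{
    int next_file_address; // 0 = this file is the tail
} metadata_t;

// Simulated disk of metadata blocks, indexed by "disk address".
static metadata_t disk[8];
static int head, tail;

// Unlink the file at curr_address, whose previous file is at
// prev_address, exactly as in part 4 of the delete operation: the
// previous file's "next" field takes the deleted file's "next" value,
// and the tail is updated when the deleted file was the tail.
void unlink_file( int prev_address, int curr_address )
{
    disk[prev_address].next_file_address = disk[curr_address].next_file_address;

    if ( curr_address == tail )
        tail = prev_address;
}
```

Deleting a middle file simply bridges its neighbors, while deleting the tail additionally moves the tail pointer back to the previous file; the head is untouched in both cases.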
As you can see in the first code line of this fourth part, a function
named get_prev_file_address is used to get the disk address of the
previous file's metadata to be able to perform the described operation.
By using this address, the metadata is loaded by using load_metadata
in order to modify the "next" field of the previous file, and the updated
metadata is then written back to its place on the disk by using
write_disk.

Figure 36: The Steps Needed to Delete a File which is not the Head

Figure 37: The Steps Needed to Delete a File which is not the Head but is
the Tail in 539filesystem. The File to Delete Here is "W".
6.4 finishing up version ne and testing the filesystem
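To test the filesystem, a routine along the following lines can be used inside kernel_main. What follows is an illustrative sketch rather than the kernel's verbatim testing code: the filenames, the contents, and the create_file signature are assumptions, and the in-memory stand-ins for the kernel's functions are included only so that the sketch is self-contained and can be run in user space.

```c
#include <string.h>

// Stand-ins for the kernel's printing functions; output is captured
// in a buffer instead of being written to the screen.
static char output[1024];
void print( char *str ) { strcat( output, str ); }
void println() { strcat( output, "\n" ); }

// A tiny in-memory stand-in for the run-time filesystem: a fixed
// table of (filename, content) pairs kept in insertion order.
#define MAX_FILES 8
static char names[MAX_FILES][32];
static char contents[MAX_FILES][512];
static int files_count = 0;

void create_file( char *filename, char *buffer )
{
    strcpy( names[files_count], filename );
    strcpy( contents[files_count], buffer );
    files_count++;
}

char *read_file( char *filename )
{
    for ( int i = 0; i < files_count; i++ )
        if ( strcmp( names[i], filename ) == 0 )
            return contents[i];

    return 0;
}

void delete_file( char *filename )
{
    for ( int i = 0; i < files_count; i++ )
        if ( strcmp( names[i], filename ) == 0 )
        {
            // Shift the remaining entries over the deleted one.
            for ( int j = i; j < files_count - 1; j++ )
            {
                strcpy( names[j], names[j + 1] );
                strcpy( contents[j], contents[j + 1] );
            }

            files_count--;
            return;
        }
}

void print_fs()
{
    for ( int i = 0; i < files_count; i++ )
    {
        print( "File: " );
        print( names[i] );
        println();
    }

    print( "==" );
    println();
}

// The testing sequence itself, mirroring the description in the text:
// create three files, print their contents, print the tree, delete
// first_file, then print the tree again.
void test_filesystem()
{
    create_file( "first_file", "contents of the first file" );
    create_file( "second_file", "contents of the second file" );
    create_file( "third_file", "contents of the third file" );

    print( read_file( "first_file" ) ); println();
    print( read_file( "second_file" ) ); println();
    print( read_file( "third_file" ) ); println();

    print_fs();

    delete_file( "first_file" );

    print_fs();
}
```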
This testing code creates three files, prints their contents, prints the
run-time filesystem tree through the function print_fs and finally
deletes the file first_file, then prints the run-time filesystem tree
again to show that the file has been deleted successfully. The function
print_fs was already defined in this chapter. To make everything
work fine, you need to keep the definition of print_fs below
kernel_main and put the prototype void print_fs(); above
kernel_main. Also, to test the filesystem you need to make sure that
the interrupts are disabled; the easiest way to do that is to modify
starter.asm by commenting out the line sti which is before call
kernel_main in the routine start_kernel. After that, you should see
the result of the above testing code after the kernel boots up.
7
C H A P T E R 7 : W H AT ’ S N E X T ?
7.1 introduction
And now, after writing a simple operating system's kernel and learn-
ing the basics of creating kernels, the question is "What's Next?".
Obviously, there is a lot to do after creating 539kernel, and the most
straightforward answers to our question are the basic well-known
ones, such as implementing virtual memory, enabling a user-space
environment, providing a graphical user interface or porting the kernel
to another architecture (e.g. the ARM architecture). This list is a short
list of what you can do next with your kernel.
Previously, I’ve introduced the term kernelist 1 in which I mean
the person who works on designing operating system kernels with
modern innovative solutions to solve real-world problem. You can
continue with your hobby kernel and implement the well-known
concepts of traditional operating systems that we have just mentioned
a little of them, but if you want to create something that can be more
useful and special than a traditional kernel, then I think you should
consider playing the role of a kernelist.
If you take a quick look at current hobby or even production
operating system kernels, through GitHub for example, you will find
that most of them are traditional, that is, they focus on implementing
the traditional ideas that are well-known in the operating systems
world. Some of those kernels go further and try to emulate an earlier
operating system; for example, many of them are Unix-like kernels,
that is, they try to emulate Unix. Other examples are ReactOS2, which
tries to emulate Microsoft Windows, and Haiku3, which tries to emulate
BeOS, a discontinued proprietary operating system. Trying
to emulate another operating system is good and has advantages
of course, but what I'm trying to say is that there are a lot of projects
that focus on this line of operating systems development, that is, the
traditionalists' line, and I think the line of the kernelists needs more
attention in order to produce more innovative operating systems.
1 In chapter 3, where the distinction between a kernelist and a traditionalist was
established.
2 https://reactos.org/
3 https://www.haiku-os.org/
I’ve already said that the kernelist doesn’t need to propose her
own solutions for the problems that she would like to solve. Instead
of using the old well-known solutions, a kernelist searches for other
better solutions for the given problem and designs an operating system
kernel that uses these solutions. Scientific papers (papers for short) are
the best place to find novel and innovative ideas that solve real-world
problem, most probably, these ideas haven’t been implemented or
adopted by others yet4 .
In this chapter, I’ve chosen a bunch of scientific papers that propose
new solutions for real-world problem and I’ll show you a high-level
overview of these solutions and my goal is to encourage interested
people to start looking to the scientific papers and implement their
solutions to be used in the real-world. Also, I would like to show how
the researches on operating systems field (or simply the kernelists!)
innovate clever solutions and get over the challenges, this could help
an interested person in learning how to overcome his own challenges
and propose innovate solutions for the problem that he faces.
Of course, the ideas in the papers that we are going to discuss
(or even in other operating systems papers) may need more than
a simple kernel such as 539kernel to be implemented. For example,
some ideas may need a networking stack to be available in the kernel,
which is not available in 539kernel, so there will be two options in
this case: either you implement the networking stack in your kernel,
or you simply focus on the problem and solution that the paper
presents and use an already existing operating system kernel which
has the required feature, developing the solution upon this chosen
kernel. Of course, there are many open-source options, and one of
them is the HelenOS5 microkernel6.
A small note should be mentioned: this chapter only shows an
overview of each paper, which means that if you are really interested
in the problem and the solution that a given paper presents, then it's
better to read the paper itself. It is easy to get a copy of any paper
mentioned in this chapter; you just need to search for its title in
Google Scholar (https://scholar.google.com/) and a link to a PDF
will show up. However, before getting started in discussing the
chosen papers, I would like in the next subsection to discuss a topic
that I've deferred till this point; this topic is related to the architecture
design of a kernel.
4 Scientific papers can be searched for through a dedicated search engine, for example,
Google Scholar.
5 http://www.helenos.org/
6 The concept of microkernel will be explained in this chapter.
9 Authored by Hojoon Lee, Chihyun Song and Brent Byunghoon Kang. Published in
2018.
10 Besides Intel, AMD also provides processors that use the x86 architecture.
11 Intel’s SGX is deprecated in Intel Core but still available on Intel Xeon.
12 In LOTRx86, when the term portable is used to describe something, it means that this
thing is able to work on any modern x86 processor. The same term has another,
broader meaning; for example, if we use the broader meaning to say "the Linux kernel
is portable", we mean that it works on multiple processor architectures, such as x86,
ARM and a lot more, and not only on Intel's or AMD's x86.
7.2 in-process isolation
13 In the paper, the name PrivUser means two things: the execution mode and the secret
memory area.
14 We have discussed this bit in a page entry in chapter 4.
7.2.2 Endokernel
15 Authored by: Bumjin Im, Fangfei Yang, Chia-Che Tsai, Michael LeMay, Anjo Vahldiek-
Oberwagner and Nathan Dautenhahn. Published in 2021.
7.3 nested kernel
the kernel which checks if the current user has the permissions to read
or modify a specific file; this region can be marked as read-only and
can be protected by the nested kernel all the time from being modified
by any part of the outer kernel. Now, assume that an attacker found
an exploitable security bug in one of the device drivers, and his goal
is to modify that permission-checking code in order to let him read
some critical file; this cannot be done, since the memory region is
protected and read-only. The paper discusses in detail how to ensure
that the outer kernel doesn't violate the protection of the nested
kernel in the x86 architecture.
That’s not the whole story. Making the nested kernel the only
way to modify the protected memory by the outer kernel means that
the nested kernel can be a mediator which will be called before any
modification performed. This will let the kernel’s implementer to
define security policies and enforce them while the system is running.
For example, the authors propose no write policy which doesn’t let
the outer kernel to write on a specific memory region at all (e.g. the
example of checking permissions code). Another proposed policy is
write-once policy which lets the outer kernel to write to a region of
memory just one time, this policy will be useful with the memory
region that contains the IDT table for example, so, the attacker cannot
modify the interrupt service routines after setting them up by the
trusted code of outer kernel. More policies were presented in the
paper. You can see here how the kernelists proposed a new kernel
design other than the popular ones (microkernel and monolithic) in
order to solve a specific real-world problem.
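To make the policy idea concrete, the following toy sketch shows a mediator in the spirit of the nested kernel design. It is not the paper's actual interface: the names, the one-word "memory region" and the return convention are all assumptions made purely for illustration.

```c
typedef enum
{
    POLICY_NO_WRITE,   // the outer kernel may never write this region
    POLICY_WRITE_ONCE, // the outer kernel may write this region one time
    POLICY_WRITABLE    // no restriction
} policy_t;

typedef struct
{
    policy_t policy;
    int written; // for write-once: has a write already happened?
    int value;   // the protected "memory region", one word here
} region_t;

// The mediator: the only path through which the outer kernel may
// modify protected memory. It enforces the region's policy and
// returns 0 on success or -1 when the write is denied.
int nested_kernel_write( region_t *region, int value )
{
    if ( region->policy == POLICY_NO_WRITE )
        return -1;

    if ( region->policy == POLICY_WRITE_ONCE && region->written )
        return -1;

    region->value = value;
    region->written = 1;

    return 0;
}
```

A no-write region (such as the permission-checking code) rejects every write, while a write-once region (such as the one holding the IDT) accepts the first, trusted write during setup and rejects all later ones, even if the attacker calls the mediator directly.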
7.4 multikernel
17 Authored By: Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris,
Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach and Akhilesh
Singhania. Published in 2009.
7.5 dynamic reconfiguration
7.6 unikernel
19 Authored by: Juraj Polakovic, Ali Erdem Özcan and Jean-Bernard Stefani. Published
in 2006.
20 https://fractal.ow2.io/
21 On the website of a unikernel called Unikraft, the following is stated: "On Unikraft,
NGINX is 166% faster than on Linux and 182% faster than on Docker".
22 https://www.includeos.org/
23 Authored by Alfred Bratterud, Alf-Andre Walla, Harek Haugerud, Paal E. Engelstad
and Kyrre Begnum. Published in 2015.
24 https://unikraft.org/
25 Authored by Simon Kuenzer, Vlad-Andrei Bădoiu, Hugo Lefeuvre, Sharan Santhanam,
Alexander Jung, Gaulthier Gain, Cyril Soldani, Costin Lupu, Stefan Teodorescu, Costi
Răducanu, Cristian Banu, Laurent Mathy, Răzvan Deaconescu, Costin Raiciu and
Felipe Huici. Published in 2021.
26 https://osv.io/
27 https://mirage.io/
28 Authored by Conghao Liu and Kyle C. Hale. Published in 2019.
29 Authored by Pierre Olivier, Daniel Chiba, Stefan Lankes, Changwoo Min and Binoy
Ravindran. Published in 2019.
30 https://ssrg-vt.github.io/hermitux/