Linux Initialization Process

The document discusses initialization of the Linux kernel. It begins by explaining the vmlinux binary, which contains the kernel code and data. It then describes how the kernel enables paging by creating an initial page table and switching to virtual memory addressing. This involves setting up an identity mapping between physical and virtual addresses as an intermediate step before enabling paging and jumping to the final page table.


1

Initialization (1)
Taku Shimosawa

For the new Linux kernel book


2

Agenda
• Initialization Phase of the Linux Kernel
• Turning on the paging feature
• Calling *init functions
• And miscellaneous things related to initialization
3

1. vmlinux
This is the Linux kernel
4

vmlinux
• Main kernel binary
• Runs with the final CPU state
• Protected Mode in x86_32 (i386)
• Long Mode in x86_64
• And so on…
• Runs in the virtual memory space
• Above PAGE_OFFSET (default: 0xc0000000) (32-bit)
• Above __START_KERNEL_map (default: 0xffffffff80000000) (64-bit)
• i.e. All the absolute addresses in the binary are virtual ones
• Entry points

Architecture  Name        Location                        Name (secondary)
x86_32        startup_32  arch/x86/kernel/head_32.S       startup_32_smp
x86_64        startup_64  arch/x86/kernel/head_64.S       secondary_startup_64
ARM           stext       arch/arm/kernel/head[_nommu].S  secondary_startup
ARM64         stext       arch/arm64/kernel/head.S        secondary_holding_pen,
                                                          secondary_entry
PPC           _stext      arch/powerpc/kernel/head_32.S*  (__secondary_start)
5

Virtual memory mapping

[Diagram: on i386, physical memory (up to ~896 MB, LOWMEM) is mapped starting at PAGE_OFFSET (0xC0000000), so the kernel lives between 0xC0000000 and 0xFFFFFFFF. On x86_64, physical memory is mapped starting at PAGE_OFFSET (0xFFFF880000000000), and the kernel text/data are additionally mapped into the last 2 GB of the address space, above __START_KERNEL_map (0xFFFFFFFF80000000).]
6

Why a different mapping in 64-bit?

• The kernel code, data, and BSS reside in the last 2 GB of the address space
  => Addressable by 32-bit!
• -mcmodel option in GCC
  • Specifies the assumptions about the size of the code/data sections

-mcmodel (x86)   text                 data
small            within 2GB           within 2GB
kernel           within -2GB          within -2GB
medium           within 2GB           can be > 2GB
large            anywhere in 64-bit   anywhere in 64-bit
7

Column: -mcmodel in gcc

int g_data = 4;

int main(void)
{
	g_data += 7;
	...
}

small /    8b 05 c6 0b 20 00     mov 0x200bc6(%rip),%eax  # 601040 <g_data>
kernel     bf 01 00 00 00        mov $0x1,%edi
           8d 50 07              lea 0x7(%rax),%edx

           *The offset of RIP-relative addressing is 32-bit

large      48 b8 40 10 60 00 00  movabs $0x601040,%rax
           00 00 00
           bf 01 00 00 00        mov $0x1,%edi
           8b 30                 mov (%rax),%esi
           8d 56 07              lea 0x7(%rsi),%edx

#define SZ (1 << 30)

int buf[SZ] = {1};

int main(void)
{
	buf[0] += 3;
}

$ gcc -O3 -o ba -mcmodel=small bigarray.c

small /    /usr/lib/gcc/x86_64-linux-gnu/4.8/crtbegin.o: In function `deregister_tm_clones':
kernel     crtstuff.c:(.text+0x1): relocation truncated to fit: R_X86_64_32 against
           symbol `__TMC_END__' defined in .data section in ba

medium /   48 b8 60 10 a0 00 00  movabs $0xa01060,%rax
large      00 00 00
           8b 08                 mov (%rax),%ecx
           8d 51 03              lea 0x3(%rcx),%edx
8

Column: -mcmodel in gcc (2)

• Code?

void nop(void)
{
	asm volatile(".fill (2 << 30), 1, 0x90");
}

$ gcc -O3 -o ba -mcmodel=small supernop.c

small /
medium /   /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/crt1.o: In function `_start':
kernel     (.text+0x12): relocation truncated to fit: R_X86_64_32S against symbol
           `__libc_csu_fini' defined in .text section in
           /usr/lib/x86_64-linux-gnu/libc_nonshared.a(elf-init.oS)

$ gcc -O3 -o ba -mcmodel=large supernop.c

large      /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/crt1.o: In function `_start':
           (.text+0x12): relocation truncated to fit: R_X86_64_32S against symbol
           `__libc_csu_fini' defined in .text section in
           /usr/lib/x86_64-linux-gnu/libc_nonshared.a(elf-init.oS)

(Even -mcmodel=large fails here, because the pre-built startup files such as crt1.o were themselves compiled with the small model, so their relocations still cannot reach.)
9

Initialization Overview

arch/*/boot/ : Booting Code
(Preparing CPU state, gathering HW information, decompressing vmlinux, etc.)
        | (vmlinux starts here)
arch/*/kernel/head*.S, head*.c : Low-level Initialization
(Switching to the virtual memory world, getting prepared for C programs)
        |
init/main.c (start_kernel) : Initialization; calls into arch/*/kernel, arch/*/mm, …
(Initializing all the kernel features including architecture-dependent parts)
        |
init/main.c (rest_init) : Creating the "init" process (PID=1) and letting it do the rest of the
initialization (setting up multiprocessing, scheduling)
        • init/main.c (kernel_init): performs the final initialization and
          "exec"s the "init" user process
        • kernel/sched/idle.c (cpu_idle_loop): the "swapper" (PID=0) now sleeps
10

2. Towards Virtual
Memory
11

Enabling paging
• The early part is executed with paging off
  • i.e. in the physical address space
• vmlinux is assumed to be executed with paging on.
• The addresses in the binary are not physical addresses.
• The first big job in vmlinux is enabling paging
• Creating a (transitional) page table
• Setting the CPU to use the page table, and to enable
paging
• Jumping to the entry point in C (compiled in the virtual
address space)
12

Identity Map
• At first, the goal page table cannot be used
• Since changing the PC and enabling paging are (at least in x86) separate instructions

[Diagram: paging is enabled while the PC still holds a physical address; fetching the next instruction through the new virtual mapping causes a page fault.]
13

Identity Map
• Therefore, an identity map is created in addition to the (goal) map.

[Diagram: (1) Create an initial page table containing both the identity mapping and the goal mapping. (2) Enable paging, and jump to a virtual address. (3) Zap the low (identity) mapping.]
14

Addresses in the transitional phase


• x86_64
• The decompressing routine enables paging and creates
an identity page table (only for first 4GB)
• Paging is required for CPUs to switch to 64-bit mode
• Located in 6 pages (pgtable) in the decompressing routine
• Symbols in vmlinux are accessed with RIP-relative addressing
• No trick is necessary for using the symbols
leaq _text(%rip), %rbp
subq $_text - __START_KERNEL_map, %rbp
...
leaq early_level4_pgt(%rip), %rbx
...
movq $(early_level4_pgt - __START_KERNEL_map), %rax
addq phys_base(%rip), %rax
movq %rax, %cr3
movq $1f, %rax
jmp *%rax
1: (arch/x86/kernel/head_64.S)
15

Addresses in the transitional phase


• i386
• Symbols in vmlinux are accessed with absolute
addresses
• Before paging is enabled, PAGE_OFFSET is always subtracted
from addresses
#define pa(X) ((X) - __PAGE_OFFSET)

	movl $pa(__bss_start),%edi
	movl $pa(__bss_stop),%ecx
	subl %edi,%ecx
	shrl $2,%ecx
	rep ; stosl
...
movl $pa(initial_page_table), %eax
movl %eax,%cr3 /* set the page table pointer.. */
movl $CR0_STATE,%eax
movl %eax,%cr0 /* ..and set paging (PG) bit */
ljmp $__BOOT_CS,$1f /* Clear prefetch and normalize %eip */
1:
...
lgdt early_gdt_descr
lidt idt_descr
(arch/x86/kernel/head_32.S)
16

3. Initialization
At last, we have come here!
17

Initialization (start_kernel)
• A lot of *_init functions!
• Furthermore, some init functions call other init functions.
• At least 80 functions are called from this function.
• These slides pick up some topics from the initialization functions.
18

2.9. Before Initialization


A few more tricks
19

Special directives
• What are these?
asmlinkage __visible void __init start_kernel(void) {

}
• “I’m curious!”.
20

asmlinkage
• asmlinkage
• Ensures the symbol is not mangled
• (in x86_32) Ensures all the parameters are passed on the stack
#ifdef __cplusplus
#define CPP_ASMLINKAGE extern "C"
#else
#define CPP_ASMLINKAGE
#endif

#ifndef asmlinkage
#define asmlinkage CPP_ASMLINKAGE
#endif
include/linux/linkage.h

#ifdef CONFIG_X86_32
#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))
arch/x86/include/asm/linkage.h
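
As a sketch of how this is used (the function below is a hypothetical example, not from the kernel source): declaring an entry point asmlinkage means hand-written assembly on x86_32 can call it by pushing every argument onto the stack, since none are expected in registers.

#include <linux/linkage.h>

/* Hypothetical example: with regparm(0) on x86_32, both arguments
 * arrive on the stack, at offsets the asm caller controls; on other
 * configurations asmlinkage reduces to (extern "C") linkage. */
asmlinkage long my_entry(long a, long b)
{
	return a + b;
}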
21

__visible
• (Effective in gcc >=4.6)
#if GCC_VERSION >= 40600
/*
* Tell the optimizer that something else uses this function or
variable.
*/
#define __visible __attribute__((externally_visible))
#endif
include/linux/compiler-gcc4.h
commit 9a858dc7cebce01a7bb616bebb85087fa2b40871
author Andi Kleen <[email protected]> Mon Sep 17 21:09:15 2012
committer Linus Torvalds <[email protected]> Mon Sep 17 22:00:38 2012

compiler.h: add __visible

gcc 4.6+ has support for a externally_visible attribute that prevents the
optimizer from optimizing unused symbols away. Add a __visible macro to
use it with that compiler version or later.

This is used (at least) by the "Link Time Optimization" patchset.


22

__init (1)
• Marks code (text) and data as needed only during initialization
#define __init __section(.init.text) __cold notrace
#define __initdata __section(.init.data)
#define __initconst __constsection(.init.rodata)
#define __exitdata __section(.exit.data)
#define __exit_call __used __section(.exitcall.exit)
(include/linux/init.h)
#ifndef __cold
#define __cold __attribute__((__cold__))
#endif
(include/linux/compiler-gcc4.h)
#ifndef __section
# define __section(S) __attribute__ ((__section__(#S)))
#endif
...
#define notrace __attribute__((no_instrument_function))
(include/linux/compiler.h)
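
A minimal sketch of these markers in use (hypothetical module code, assuming the usual module boilerplate): the function and variable below land in .init.text and .init.data, and their memory is freed once initialization finishes.

#include <linux/init.h>
#include <linux/module.h>
#include <linux/printk.h>

static int __initdata boot_count = 3;      /* placed in .init.data */

static int __init example_init(void)       /* placed in .init.text */
{
	pr_info("example: boot_count = %d\n", boot_count);
	return 0;
}
module_init(example_init);

/* Touching boot_count after initialization would be a bug:
 * its memory has been freed (and possibly poisoned). */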
23

__init (2)
• The init* sections are gathered into a contiguous memory area

. = ALIGN(PAGE_SIZE);
.init.begin : AT(ADDR(.init.begin) - LOAD_OFFSET) {
	__init_begin = .; /* paired with __init_end */
}
...
INIT_TEXT_SECTION(PAGE_SIZE)
#ifdef CONFIG_X86_64
:init
#endif
INIT_DATA_SECTION(16)
...
. = ALIGN(PAGE_SIZE);
...
.init.end : AT(ADDR(.init.end) - LOAD_OFFSET) {
	__init_end = .;
}

[Resulting layout: __init_begin → init.text, init.data, … → __init_end]
arch/x86/kernel/vmlinux.lds.S
24

__init (3)
• And, they are discarded (free’d) after initialization
• Called from kernel_init
void free_initmem(void)
{
free_init_pages("unused kernel",
(unsigned long)(&__init_begin),
(unsigned long)(&__init_end));
}
arch/x86/mm/init.c

void free_initmem(void)
{
...
poison_init_mem(__init_begin, __init_end - __init_begin);
if (!machine_is_integrator() && !machine_is_cintegrator())
free_initmem_default(-1);
}
arch/arm/mm/init.c
25

head32.c, head64.c
• Before calling start_kernel, i386_start_kernel or
x86_64_start_kernel is called in x86
• Located in arch/x86/kernel/head{32,64}.c
• No underscore between head and 32!
• x86 (32-bit)
• Reserve BIOS memory (in conventional memory)
• x86 (64-bit)
• Erase the identity map
• Clear BSS, copy boot information from the low memory
• And reserve BIOS memory
26

Reserve? But how?


• This is very early in boot; no sophisticated memory
management is working yet.
• memblock (Logical memory blocks) is working!
#define BIOS_LOWMEM_KILOBYTES 0x413
lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
lowmem <<= 10;
...
memblock_reserve(lowmem, 0x100000 - lowmem);
arch/x86/kernel/head.c
• memblock simply manages memory blocks
• In some architectures, the information is handed over to another
mechanism and discarded after initialization
#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
#define __init_memblock __meminit
#define __initdata_memblock __meminitdata
#else
...
#endif
include/linux/memblock.h

(CONFIG_ARCH_DISCARD_MEMBLOCK is set in S+Core, IA64, S390, SH, MIPS and x86.
Without memory hotplug, __meminit is __init.)
27

memblock
• Data Structure (include/linux/memblock.h)

[Diagram: the global variable "memblock" (struct memblock) holds two memblock_types, "memory" and "reserved"; each memblock_type points to an array of memblock_regions, and each memblock_region records base, size, flags[, nid].]

• Initially the arrays are allocated statically

static struct memblock_region
memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
static struct memblock_region
memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;

*INIT_MEMBLOCK_REGIONS = 128
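
For reference, the corresponding definitions look roughly like this in kernels of this era (abridged from include/linux/memblock.h; the exact fields vary slightly between versions):

struct memblock_region {
	phys_addr_t base;
	phys_addr_t size;
	unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
	int nid;			/* NUMA node id */
#endif
};

struct memblock_type {
	unsigned long cnt;		/* number of regions */
	unsigned long max;		/* size of the allocated array */
	phys_addr_t total_size;		/* size of all regions */
	struct memblock_region *regions;
};

struct memblock {
	phys_addr_t current_limit;
	struct memblock_type memory;	/* available memory */
	struct memblock_type reserved;	/* reserved memory */
};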
28

Reserving in memblock
• Reserving adds the region to the region array in the
“reserved” type
static int __init_memblock memblock_reserve_region(phys_addr_t base,
phys_addr_t size,
int nid,
unsigned long flags)
{
struct memblock_type *_rgn = &memblock.reserved;

...
return memblock_add_region(_rgn, base, size, nid, flags);
}

int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)


{
return memblock_reserve_region(base, size, MAX_NUMNODES, 0);
}

• The corresponding function for adding an available memory region is
memblock_add
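
A short sketch of how early platform code typically uses this pair of calls (memblock_add and memblock_reserve are the real API; the addresses and sizes here are invented for illustration):

#include <linux/init.h>
#include <linux/memblock.h>
#include <linux/sizes.h>

static void __init example_mem_init(void)
{
	/* Tell memblock about 1 GB of usable RAM at physical address 0. */
	memblock_add(0x00000000, SZ_1G);

	/* Carve out 8 MB (e.g. firmware, or the kernel image itself);
	 * memblock allocations will never hand out this range. */
	memblock_reserve(0x01000000, SZ_8M);
}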
29
When is the available memory added?
• x86
• memblock_x86_fill
• called by setup_arch (8/80)
void __init memblock_x86_fill(void)
{
	...
	memblock_allow_resize();	/* BTW, what's this? (next slide) */

	for (i = 0; i < e820.nr_map; i++) {
		...
		memblock_add(ei->addr, ei->size);
	}
	memblock_trim_memory(PAGE_SIZE);
	...
}

• ARM
• arm_memblock_init
• Also called by setup_arch (8/80)
30

Resizing, or reallocation.
• Memblock uses slab for resizing, if it is available
  • The # of e820 entries may be more than 128
  • However, slab only becomes available at kmem_cache_init, called by
    mm_init (25/80), so it is not available at this time
• Memblock therefore tries to allocate by itself, by finding an area
  that is in "memory" && !"reserved"
static int __init_memblock memblock_double_array(struct memblock_type *type,
phys_addr_t new_area_start,
phys_addr_t new_area_size)
{

addr = memblock_find_in_range(new_area_start + new_area_size,
memblock.current_limit,
new_alloc_size, PAGE_SIZE);
31

memblock: Debug options


• “memblock=debug”
static int __init early_memblock(char *p)
{
if (p && strstr(p, "debug"))
memblock_debug = 1;
return 0;
}
early_param("memblock", early_memblock);

static int __init_memblock memblock_reserve_region(...)


{
...
	memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
		     (unsigned long long)base,
		     (unsigned long long)base + size - 1,
		     flags, (void *)_RET_IP_);
32

3. Initialization
Okay, okay.
33

start_kernel
• What’s the first initialization function called?
smp_setup_processor_id() ((at least 2.6.18) ~ 3.2)
lockdep_init () (3.3 ~)
commit 73839c5b2eacc15cb0aa79c69b285fc659fa8851
Author: Ming Lei <[email protected]>
Date: Thu Nov 17 13:34:31 2011 +0800

init/main.c: Execute lockdep_init() as early as possible


This patch fixes a lockdep warning on ARM platforms:

[ 0.000000] WARNING: lockdep init error! Arch code didn't call lockdep_init() early
enough?
[ 0.000000] Call stack leading to lockdep invocation was:
[ 0.000000] [<c00164bc>] save_stack_trace_tsk+0x0/0x90
[ 0.000000] [<ffffffff>] 0xffffffff

The warning is caused by printk inside smp_setup_processor_id().


34

init (1/80) : lockdep_init


• Initializes lockdep (lock validator)
  • "Runtime locking correctness validator"
  • Detects
    • Lock inversion
    • Circular lock dependencies
  • Config: CONFIG_LOCKDEP, selected by PROVE_LOCKING, DEBUG_LOCK_ALLOC or LOCK_STAT
• When enabled, lockdep is called whenever any spinlock or mutex is acquired
  • Thus, the initialization for lockdep must come first
• Initialization is simple (just initializing the list_heads of the hash tables)
void lockdep_init(void)
{...
for (i = 0; i < CLASSHASH_SIZE; i++)
INIT_LIST_HEAD(classhash_table + i);

for (i = 0; i < CHAINHASH_SIZE; i++)


INIT_LIST_HEAD(chainhash_table + i);
...}
kernel/locking/lockdep.c
35

init (2/80) : smp_setup_processor_id


• Only effective in some architectures
  • ARM, s390, SPARC

u32 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = MPIDR_INVALID };

void __init smp_setup_processor_id(void)
{
	int i;
	/* Hardware CPU (core) ID */
	u32 mpidr = is_smp() ? read_cpuid_mpidr() & MPIDR_HWID_BITMASK : 0;
	u32 cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);

	/* Exchange the logical ID for the boot CPU
	   and the logical ID for CPU 0 */
	cpu_logical_map(0) = cpu;
	for (i = 1; i < nr_cpu_ids; ++i)
		cpu_logical_map(i) = i == cpu ? 0 : i;

	set_my_cpu_offset(0);

	pr_info("Booting Linux on physical CPU 0x%x\n", mpidr);
}
arch/arm/kernel/setup.c

(e.g. if the boot CPU is hardware CPU 2, cpu_logical_map becomes: 2 1 0 3)
36

init (3/80) : debug_objects_early_init


• Initializes debugobjects (Config: CONFIG_DEBUG_OBJECTS)
  • Lifetime debugging facility for objects
  • Seems to be used by timer, hrtimer, workqueue, percpu_counter and rcu
• Again, this function initializes locks and list heads

void __init debug_objects_early_init(void)
{
	int i;

	for (i = 0; i < ODEBUG_HASH_SIZE; i++)
		raw_spin_lock_init(&obj_hash[i].lock);

	for (i = 0; i < ODEBUG_POOL_SIZE; i++)
		hlist_add_head(&obj_static_pool[i].node, &obj_pool);
}
lib/debugobjects.c
37

init (4/80): boot_init_stack_canary


• Sets up the stack protector
  • include/asm/stackprotector.h
• Decides the canary value based on a random value and the TSC
static __always_inline void boot_init_stack_canary(void)
{
u64 canary;
u64 tsc;

#ifdef CONFIG_X86_64
BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40);
#endif
get_random_bytes(&canary, sizeof(canary));
tsc = __native_read_tsc();
canary += tsc + (tsc << 32UL);

current->stack_canary = canary;
#ifdef CONFIG_X86_64
this_cpu_write(irq_stack_union.stack_canary, canary);
#else
this_cpu_write(stack_canary.canary, canary);
#endif
}
38

init (5/80): cgroup_init_early


• Initializes cgroups
• For subsystems that have early_init set, initializes the subsystem
  (see the sketch below)
  • cpu, cpuacct, cpuset
  • The rest of the subsystems are initialized in cgroup_init (71/80)
• Initializes the structures and the names for the subsystems
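
For illustration, a subsystem opts in to early initialization roughly like this (an abridged sketch based on kernel ~3.x sources; the surrounding callback fields are elided and vary by version):

/* kernel/sched/core.c (sketch) */
struct cgroup_subsys cpu_cgroup_subsys = {
	.name		= "cpu",
	/* ... css allocation/attach callbacks elided ... */
	.early_init	= 1,	/* picked up by cgroup_init_early() */
};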
39

init (6/80): boot_cpu_init


• Initializes various cpumasks for the boot CPU
  • online : available to the scheduler    (!HOTPLUG_CPU => same as active)
  • active : available to migration
  • present : cpu is populated             (!HOTPLUG_CPU => same as possible)
  • possible : cpu is populatable
• set_cpu_online adds the cpu to active
• set_cpu_present does not add the cpu to possible
static void __init boot_cpu_init(void)
{
int cpu = smp_processor_id();
/* Mark the boot cpu "present", "online" etc for SMP and UP
case */
set_cpu_online(cpu, true);
set_cpu_active(cpu, true);
set_cpu_present(cpu, true);
set_cpu_possible(cpu, true);
}
init/main.c
40

cpumask
• A bit map
typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
include/linux/cpumask.h

#define DECLARE_BITMAP(name,bits) \
unsigned long name[BITS_TO_LONGS(bits)]
include/linux/types.h

#define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
include/linux/bitops.h

[Diagram: "bits" is an array of longs (4 or 8 bytes each), large enough to hold NR_CPUS bits.]
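
A small sketch of the accessor API built on top of this bitmap (these are the real kernel functions; the CPU numbers here are arbitrary):

#include <linux/cpumask.h>
#include <linux/printk.h>

static void cpumask_demo(void)
{
	struct cpumask mask;
	int cpu;

	cpumask_clear(&mask);           /* zero all NR_CPUS bits      */
	cpumask_set_cpu(2, &mask);      /* set_bit(2, mask.bits)      */

	if (cpumask_test_cpu(2, &mask))
		pr_info("CPU 2 is in the mask\n");

	for_each_cpu(cpu, &mask)        /* iterate over the set bits  */
		pr_info("set: %d\n", cpu);
}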
41

Set bit! (x86)


#define IS_IMMEDIATE(nr) (__builtin_constant_p(nr))
...
static __always_inline void
set_bit(long nr, volatile unsigned long *addr)
{
if (IS_IMMEDIATE(nr)) {
asm volatile(LOCK_PREFIX "orb %1,%0"
: CONST_MASK_ADDR(nr, addr)
: "iq" ((u8)CONST_MASK(nr))
: "memory");
} else {
asm volatile(LOCK_PREFIX "bts %1,%0"
: BITOP_ADDR(addr) : "Ir" (nr) : "memory");
}
}
arch/x86/include/asm/bitops.h
• The register bit-offset operand for bts is
  • -2^31 ~ 2^31-1, or -2^63 ~ 2^63-1
42

Set bit! (ARM)


#if __LINUX_ARM_ARCH__ >= 6
	.macro	bitop, name, instr
ENTRY(	\name		)
UNWIND(	.fnstart	)
	ands	ip, r1, #3
	strneb	r1, [ip]		@ assert word-aligned
	mov	r2, #1
	and	r3, r0, #31		@ Get bit offset
	mov	r0, r0, lsr #5
	add	r1, r1, r0, lsl #2	@ Get word offset
	...
	mov	r3, r2, lsl r3
1:	ldrex	r2, [r1]
	\instr	r2, r2, r3
	strex	r0, r2, [r1]
	cmp	r0, #0
	bne	1b
	bx	lr
UNWIND(	.fnend		)
ENDPROC(\name		)
	.endm

(instantiated as: bitop _set_bit, orr)
43

smp_processor_id
• Returns the core ID (in the kernel)
• In ARM (and in the old days in x86)
  • Located in "current" (struct thread_info)
  • Found at the base (lowest address) of the current stack
• In x86
  • Located in the per-cpu area
#define raw_smp_processor_id() (this_cpu_read(cpu_number))
arch/x86/include/asm/smp.h

#define raw_smp_processor_id() (current_thread_info()->cpu)


arch/arm/include/asm/smp.h
static inline struct thread_info *current_thread_info(void)
{
register unsigned long sp asm ("sp");
return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}
arch/arm/include/asm/thread_info.h
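
Because the result is only stable while the task cannot migrate, the usual pattern around smp_processor_id() pins preemption (get_cpu/put_cpu are the real API; the function below is an illustrative example):

#include <linux/smp.h>
#include <linux/printk.h>

static void report_cpu(void)
{
	int cpu = get_cpu();	/* disables preemption, returns smp_processor_id() */

	pr_info("running on CPU %d\n", cpu);
	put_cpu();		/* re-enables preemption */
}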
44

Next
• Topics and the rest of initialization
• Setup parameters (early_param() etc.)
• Initcalls
• Multiprocessor supports
• Per-cpus
• SMP boot (secondary boot)
• SMP alternatives
• And other alternatives
• And Others?
• Modules?
