The document discusses memory allocation in Linux, focusing on the differences between malloc, vmalloc, and kmalloc. It explains the implementation of malloc using the brk() system call and highlights the characteristics of contiguous and non-contiguous memory allocation. Additionally, it provides insights into the process address space and heap configuration during program execution in the Linux kernel.

malloc & vmalloc in Linux

Adrian Huang | Dec, 2022


* Based on kernel 5.11 (x86_64) – QEMU
* 2-socket CPUs (4 cores/socket)
* 16GB memory
* Kernel parameter: nokaslr norandmaps
* KASAN: disabled
* Userspace: ASLR is disabled
* Legacy BIOS
Agenda
• Memory Allocation in Linux

• malloc -> brk() implementation in the Linux kernel

o Will *NOT* focus on the glibc malloc implementation: see the link "malloc internal"

• vmalloc: Non-contiguous memory allocation

• [Note] kmalloc has been discussed in slide #88 of "Slab Allocator in Linux Kernel"
Memory Allocation in Linux

Software stack, from user space down to hardware:

  User application
    → glibc: malloc/free
    → brk/mmap                                  (User Space)
  ─────────────────────────────────────────────────────────
    kmalloc/kfree, vmalloc                      (Kernel Space)
    → Slab Allocator
    → Buddy System: alloc_page(s), __get_free_page(s)
    → Hardware

[glibc] malloc
• Balance between brk() and mmap()
• Use brk() if request size < DEFAULT_MMAP_THRESHOLD_MIN (128 KB)
  o The heap can be trimmed only if memory is freed at the top end.
  o sbrk() is implemented as a library function that uses the brk() system call.
  o When the heap is used up, allocate a memory chunk > 128 KB via brk().
    ▪ Saves the overhead of frequent brk() system calls.
• Use mmap() if request size >= DEFAULT_MMAP_THRESHOLD_MIN (128 KB)
  o The allocated memory blocks can be independently released back to the system.
  o Deallocated space is not placed on the free list for reuse by later allocations.
  o Memory may be wasted because mmap allocations must be page-aligned, and the kernel must perform the expensive task of zeroing out the allocated memory.
  o Note: glibc uses a dynamic mmap threshold.
  o Detail: `man mallopt`

kmalloc & vmalloc
• kmalloc: Contiguous memory allocation
• vmalloc: Non-contiguous memory allocation
  o Scenario: memory allocation size > PAGE_SIZE (4KB)
  o Allocates virtually contiguous memory
    ▪ The physical memory might NOT be contiguous
kmalloc & slab (Recap)

struct kmem_cache *kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1]

Three cache arrays, one per kmalloc type:
• KMALLOC_NORMAL
• KMALLOC_RECLAIM (__GFP_RECLAIMABLE)
• KMALLOC_DMA (__GFP_DMA)

Each array is indexed by size class:
  index 0:  NULL
  index 1:  kmalloc-96
  index 2:  kmalloc-192
  index 3:  kmalloc-8
  index 4:  kmalloc-16
  ...
  index 13: kmalloc-8192

Each kmem_cache holds a per-CPU slab pointer (__percpu *cpu_slab) and per-node data (*node[MAX_NUMNODES]).

Check create_kmalloc_caches() & kmalloc_info. Reference (slideshare): Slab Allocator in Linux Kernel
malloc() -> brk() implementation in Linux Kernel
• Quick view: Process Address Space – Heap
• sys_brk – call path
• [From scratch] Launching a program: load_elf_binary() in the Linux kernel
  o VMA change observation
  o Heap (brk, or "program break") configuration
• [Program launch] strace observation: heap – brk()
• strace observation: allocating space via malloc()
  o If the heap space is used up, what allocation size does malloc() request via brk()?
• glibc: malloc implementation for the memory request size
Quick view: Process Address Space - Heap

Process virtual address layout (high to low):

  0x7FFF_FFFF_FFFF
  Stack (default size: 8MB)     STACK_TOP_MAX = 0x7FFF_FFFF_F000 / mm->stack
  128MB gap + stack guard gap
  mmap region                   mm->mmap_base = 0x7FFF_F7FF_F000
  HEAP                          mm->brk (top) .. mm->start_brk (bottom)
  BSS                           mm->end_data
  Data                          mm->start_data
  Text                          mm->start_code = 0x40_0000
  0
Quick view: Process Address Space - Heap (cont.)

Same layout as above, with a question: why are these boundary values different? (Answered in the following slides.)
sys_brk – Call path

sys_brk:
  if brk < mm->start_brk: return mm->brk
  newbrk = PAGE_ALIGN(brk)
  oldbrk = PAGE_ALIGN(mm->brk)
  shrink brk if brk <= mm->brk → __do_munmap
  do_brk_flags:
    can expand the existing anonymous mapping → vma_merge
    cannot expand the existing anonymous mapping → vm_area_alloc
  mm->brk = brk
  if (mm->def_flags & VM_LOCKED) != 0 → mm_populate:
    __mm_populate → populate_vma_page_range → __get_user_pages:
      page already populated → follow_page_mask
      page NOT populated yet  → faultin_page → handle_mm_fault
  return newbrk

[By default] heap (brk) space is demand-paged.

[From scratch] Launch a program: load_elf_binary() in Linux kernel

# ./free_and_sbrk 1 1

After load_elf_binary() in the kernel, the process has the following VMAs:

  vma        vm_start           vm_end
  R          0x400000           0x401000
  R, E       0x401000           0x496000
  R          0x496000           0x4be000
  R, W       0x4be000           0x4c4000
  (GAP)
  vvar       0x7ffff7ffa000     0x7ffff7ffe000
  vdso       0x7ffff7ffe000     0x7ffff7fff000
  (GAP)
  stack      0x7fffff85d000     0x7ffffffff000

Question after launching the program: why does the layout end up like this? (See the heap-configuration slides below.)
[From scratch] Launch a program: load_elf_binary() – Heap Configuration

# ./free_and_sbrk 1 1

Call path:
  load_elf_binary → set_brk → vm_brk_flags → do_brk_flags:
    can expand the existing anonymous mapping → vma_merge
    cannot expand the existing anonymous mapping → vm_area_alloc
  mm->{start_brk, brk} = end

Before set_brk(), the VMA list is the one shown above (no heap VMA yet).
[From scratch] Launch a program: load_elf_binary() – Heap Configuration (cont.)

After set_brk(), a heap VMA appears:

  vma        vm_start           vm_end
  R          0x400000           0x401000
  R, E       0x401000           0x496000
  R          0x496000           0x4be000
  R, W       0x4be000           0x4c4000
  heap       0x4c4000           0x4c5000
  (GAP)
  vvar       0x7ffff7ffa000     0x7ffff7ffe000
  vdso       0x7ffff7ffe000     0x7ffff7fff000
  (GAP)
  stack      0x7fffff85d000     0x7ffffffff000
After set_brk(): mm->brk = mm->start_brk = 0x4c5000, while the heap VMA spans 0x4c4000 – 0x4c5000.
Question: why do mm->brk and mm->start_brk end up at 0x4c5000 rather than at the end of the R,W data VMA (0x4c4000)?
The arguments annotated on this slide: set_brk() is called from load_elf_binary() with elf_bss as the start address and elf_brk as the end address.
[From scratch] Launch a program: load_elf_binary() – Heap Configuration (cont.)

range(elf_bss, elf_brk) is the bss space: set_brk() maps it as the anonymous heap VMA (0x4c4000 – 0x4c5000) and then sets mm->brk = mm->start_brk = end = 0x4c5000, so the program break starts just past the bss.

[Program Launch] strace observation: heap – brk()

At program launch, strace shows a brk() call that goes through the sys_brk path above:

  mm->start_brk = 0x4c5000
  mm->brk       = 0x4c61c0

The heap VMA now spans 0x4c4000 – 0x4c7000 (vm_end is PAGE_ALIGN(0x4c61c0)).

Demand paging: a physical page is allocated only when a page fault occurs.

[Program Launch] strace observation: heap – brk() (cont.)

A later brk() call extends the break further through the same path:

  mm->start_brk = 0x4c5000
  mm->brk       = 0x4e8000

The heap VMA now spans 0x4c4000 – 0x4e8000. Again, physical pages are allocated on demand via page faults.


Recap

mm->start_brk (0x4c5000) and mm->brk (0x4e8000) are still not equal: the heap is in use, and the heap VMA spans 0x4c4000 – 0x4e8000.
[Program Launch] strace observation: mprotect()

Call path: mprotect → split_vma (with vma_merge on the other side).

mprotect() changes part of the R,W VMA (0x4be000 – 0x4c4000) to read-only, so the VMA is split at 0x4c1000:

  vma        vm_start           vm_end
  R          0x400000           0x401000
  R, E       0x401000           0x496000
  R          0x496000           0x4be000
  R          0x4be000           0x4c1000   ← was R/W; made read-only by mprotect()
  R, W       0x4c1000           0x4c4000
  heap       0x4c4000           0x4e8000
  (GAP)
  vvar       0x7ffff7ffa000     0x7ffff7ffe000
  vdso       0x7ffff7ffe000     0x7ffff7fff000
  (GAP)
  stack      0x7fffff85d000     0x7ffffffff000

(mm->start_brk = 0x4c5000, mm->brk = 0x4e8000; the split VMA boundaries match the strace output.)
strace observation: allocate space via malloc()

[Init stage] glibc grows the heap by 0x4e8000 – 0x4c7000 = 0x21000 (132 KB: 33 pages) up front, so small malloc() calls need no further system calls.

When the current program break is used up, glibc (malloc.c) allocates another 132 KB: heap space allocation from malloc() requests a memory chunk > 128 KB via brk() (see the [glibc] malloc rules above).
Memory Allocation in Linux – brk() detail

[glibc] malloc: check sysmalloc() for the implementation (the brk()/mmap() rules are listed earlier).

  User application → malloc()
  glibc: is the already-allocated heap space enough?
    Y: return an available address from the allocated heap space
    N: if size < 128 KB, allocate a "memory chunk > 128 KB" by calling brk()
  ── user space / kernel space boundary (brk or mmap) ──
  Kernel: VMA configuration & program break adjustment; the page fault handler
  later allocates physical pages through the slab allocator and buddy system
  (alloc_page(s), __get_free_page(s)) down to the hardware.
glibc: malloc implementation for memory request size

malloc.c: MORECORE() → __sbrk() → __brk()

Heap space allocation from malloc(): allocate a memory chunk > 128 KB via brk(). In this trace, the heap is expanded by 0x21000 (33 pages): 0x555555559000 -> 0x55555557a000.

Detail reference: [glibc] malloc internals — concepts: chunks, arenas, heaps, and thread-local cache (tcache).


vmalloc: Non-contiguous memory
allocation
• 64-bit Virtual Address in x86_64
• Call path
• vmap_area & guard page
• Example: vmalloc size = 8MB
o Kernel data structure
o qemu + gdb observation
• vmalloc users/scenario
64-bit Virtual Address in x86_64     Reference: Documentation/x86/x86_64/mm.rst

With 4-level paging, the 64-bit virtual address space splits into user space (0 – 0x0000_7FFF_FFFF_FFFF, 128TB), a non-canonical hole, and kernel space (0xFFFF_8000_0000_0000 – 0xFFFF_FFFF_FFFF_FFFF, 128TB).

Kernel virtual address layout (low to high):

  Guard hole (8TB)                           0xFFFF_8000_0000_0000
  LDT remap for PTI (0.5TB)
  Page frame direct mapping (64TB)           page_offset_base
  Unused hole (0.5TB)
  vmalloc/ioremap (32TB)                     VMALLOC_START = 0xFFFF_C900_0000_0000
                                             VMALLOC_END   = 0xFFFF_E8FF_FFFF_FFFF
  Unused hole (1TB)
  Virtual memory map (1TB, stores            vmemmap_base
  page frame descriptors)
  Empty space …
  Kernel text mapping from physical addr 0   __START_KERNEL_map = 0xFFFF_FFFF_8000_0000
  Kernel code [.text, .data…]                __START_KERNEL = 0xFFFF_FFFF_8100_0000
  (1GB or 512MB)
  Modules                                    MODULES_VADDR
  Fix-mapped address space (1GB or 1.5GB)    FIXADDR_START .. FIXADDR_TOP = 0xFFFF_FFFF_FF7F_F000
  (expanded to 4MB: commit 05ab1d8a4b36)
  Unused hole (2MB)

Default configuration:
  page_offset_base = 0xFFFF_8880_0000_0000
  vmalloc_base     = 0xFFFF_C900_0000_0000
  vmemmap_base     = 0xFFFF_EA00_0000_0000
  * Can be dynamically configured by KASLR (Kernel Address Space Layout Randomization – "arch/x86/mm/kaslr.c")
vmalloc – call path

vmalloc → __vmalloc_node → __vmalloc_node_range:
  __get_vm_area_node:
    kzalloc_node: allocate a vm_struct from kmalloc (slub allocator)
    alloc_vmap_area:
      1. Allocate a vmap_area struct from kmem_cache (slub allocator)
      2. Get a virtual address from the vmalloc RB-tree (vmap_area RB-tree);
         range: VMALLOC_START – VMALLOC_END
    setup_vmalloc_vm
  __vmalloc_area_node:
    memory allocation for storing pointers of page descriptors: area->pages[]
    for (i = 0; i < area->nr_pages; i++)
      page = alloc_page(gfp_mask)
      area->pages[i] = page
  map_kernel_range: page table population

The page table is populated immediately upon the request: no page fault.


vmalloc: vmap_area & guard page

  vmap_area   vm_start              vm_end
  #1          0xffffc90000000000    0xffffc90000005000
  #2          0xffffc90000005000    0xffffc90000007000
  (GAP)
  #3          0xffffc90000008000    0xffffc9000000b000
  ...         (unallocated area up to VMALLOC_END)

Range: VMALLOC_START – VMALLOC_END (vmalloc virtual addresses). Each vmalloc area ends with a 4 KB guard page; e.g. area #1 holds a 0x4000-byte allocation plus its guard page.

1. Guard page (never backed physically): detects out-of-boundary access.
2. VMAP_STACK kernel config: leverages guard pages (via vmalloc) to implement
   virtually-mapped kernel stacks → detects stack overflow.
Example: vmalloc size = 8MB: alloc_vmap_area()

vmalloc-test.ko calls vmalloc() for 8MB → __vmalloc_node_range → __get_vm_area_node → alloc_vmap_area:
  1. find_vmap_lowest_match(): get a VA range from the free_vmap_area_root RB-tree
     (initialized by vmalloc_init())
  2. insert_vmap_area(): insert the new vmap_area into vmap_area_root (RB-tree)
     and vmap_area_list

Resulting vmap_area:
  va_start = 0xffffc90001a4d000
  va_end   = 0xffffc9000224e000
  rb_node, list, union { subtree_max_size, vm }


Example: vmalloc size = 8MB: setup_vmalloc_vm()

setup_vmalloc_vm() fills in the vm_struct (allocated earlier by kzalloc_node) and links it to the vmap_area:

  vm_struct:
    next
    addr     = 0xffffc90001a4d000
    size     = 0x801000 (w/ guard page)
    flags    = 0x22
    **pages  = NULL        (not allocated yet)
    nr_pages = 0
    phys_addr, caller

  vmap_area:
    va_start = 0xffffc90001a4d000
    va_end   = 0xffffc9000224e000
    union { subtree_max_size, vm → the vm_struct above }
Example: vmalloc size = 8MB: __vmalloc_area_node()

__vmalloc_area_node() allocates the backing pages from the buddy system:

  allocate memory for storing the page-descriptor pointers: area->pages[]
  for (i = 0; i < area->nr_pages; i++)
    page = alloc_page(gfp_mask)
    area->pages[i] = page
  map_kernel_range(): page table population

  vm_struct after this step:
    addr     = 0xffffc90001a4d000
    size     = 0x801000 (w/ guard page)
    flags    = 0x22
    **pages  = 0xffffc900019b9000   → [page descriptor, page descriptor, ...]
    nr_pages = 0x800 (2048)

  Memory allocation for the page-descriptor pointer array:
  • size: 8MB/4KB * 8 = 16384 bytes
  • Allocated from vmalloc (if > 4KB) or kmalloc (if <= 4KB)
Example: vmalloc size = 8MB

  vm_struct: addr = 0xffffc90001a4d000, size = 0x801000 (w/ guard page),
  flags = 0x22, **pages = 0xffffc900019b9000, nr_pages = 0x800 (2048)

The buddy system may return some runs of contiguous pages (e.g. 6 contiguous pages here, N contiguous pages there), but as a whole:

vmalloc: virtually contiguous address; physically non-contiguous address.

For instance, offset 0x5000 into the area sits at vmalloc virtual address 0xffff_c900_01a4_d000 + 0x5000 = 0xffff_c900_01a5_2000, and the next page is 0xffff_c900_01a5_3000 — adjacent virtually, but not physically.
Example: vmalloc size = 8MB: Page Table Configuration

[Linear Address] 0xffff_c900_01a5_2000, 0xffff_c900_01a5_3000

x86_64 4-level linear address split:
  bits 63–48: sign-extend
  bits 47–39: Page Map Level-4 (PML4) offset
  bits 38–30: Page Directory Pointer offset
  bits 29–21: Page Directory offset
  bits 20–12: Page Table offset
  bits 11–0:  physical page offset

Page walk (CR3 → init_top_pgt = swapper_pg_dir):
  PML4E #402 (vmalloc: 32TB) → PDPTE #0 → PDE #13 → PTE #82 / PTE #83 → page frames

Other PML4 entries for orientation: #511 kernel code & fixmap, #508 cpu_entry_area (0.5TB), #468 vmemmap (page descriptors), #465 level3_kernel_pgt, #273 direct mapping region.

Before the mapping is populated, PTE #82 = 0 and PTE #83 = 0. After map_kernel_range(), each PTE points to its own page frame, and the two page frames sit at physically non-contiguous addresses — verify this below.
Example: vmalloc size = 8MB: Page Table Configuration – PTE verification

PTE #82 verification
• Virtual address of the page descriptor: 0xffffea00040907c0
• PFN (Page Frame Number): (0xffffea00040907c0 - 0xffffea0000000000) / 64 = 0x10241F
• Page physical address: PFN << 12 = 0x10241F << 12 = 0x10241F000

PTE #83 verification
• Virtual address of the page descriptor: 0xffffea0004090000
• PFN (Page Frame Number): (0xffffea0004090000 - 0xffffea0000000000) / 64 = 0x102400
• Page physical address: PFN << 12 = 0x102400 << 12 = 0x102400000

The two PTEs map adjacent vmalloc virtual pages to non-adjacent physical pages.
vmalloc users/scenario
• Array size > PAGE_SIZE (4KB)
  o arr[0], arr[1] … arr[n] → needs (virtually) contiguous memory for array indexing
  o Example: the 8MB vmalloc above — the page-descriptor pointer array (vm_struct->pages) requires contiguous memory for array indexing
    ▪ size: 8MB/4KB * 8 = 16384 bytes, allocated from vmalloc (> 4KB)
vmalloc users/scenario
• Virtually-mapped kernel stack (VMAP_STACK=y)
  o Uses a virtually-mapped stack with a guard page, so a kernel stack overflow can be detected immediately (stacks are allocated on the clone() system call path).
• Dynamically loaded kernel modules
  o Module allocations > PAGE_SIZE (4KB) are served from virtually contiguous vmalloc-style mappings.
Reference
• Robert Love, Linux Kernel Development (3rd Edition)
• Wolfgang Mauerer, Professional Linux Kernel Architecture
backup
Some Notes
• Kernel implementation for /proc/pid/maps
o show_map_vma()
