malloc & vmalloc in Linux
• [Note] kmalloc has been discussed here: Slide #88 of "Slab Allocator in Linux Kernel"
Memory Allocation in Linux
[glibc] malloc
• Balance between brk() and mmap()
• Use brk() if the request size < DEFAULT_MMAP_THRESHOLD_MIN (128 KB)
  o The heap can be trimmed only if memory is freed at the top end.
  o sbrk() is implemented as a library function that uses the brk() system call.
  o When the heap is used up, allocate a memory chunk > 128 KB via brk().
    ▪ Saves the overhead of frequent brk() system calls.
• Use mmap() if the request size >= DEFAULT_MMAP_THRESHOLD_MIN (128 KB)
  o The allocated memory blocks can be independently released back to the system.
  o Deallocated space is not placed on the free list for reuse by later allocations.
  o Memory may be wasted because mmap allocations must be page-aligned, and the kernel must perform the expensive task of zeroing out the allocated memory.
  o Note: glibc uses a dynamic mmap threshold.
  o Detail: `man mallopt` (a small demo follows this list)
[Figure] Allocation stack: user space — glibc malloc/free over brk()/mmap(); kernel space — kmalloc/kfree, vmalloc, Slab Allocator, Buddy System, alloc_page(s)/__get_free_page(s).
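To watch this policy from user space, a small test program (a minimal sketch, not part of the original slides; the file name malloc_demo.c is made up) can be run under strace: with the default glibc policy described above, the small request should be served from the brk()-grown heap, while the large one should go through mmap() and be handed back with munmap() on free().

    /* malloc_demo.c -- brk() vs. mmap() demo (sketch).
     * Build: gcc -o malloc_demo malloc_demo.c
     * Run:   strace -e trace=brk,mmap,munmap ./malloc_demo */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        void *small = malloc(4 * 1024);      /* 4 KB  < 128 KB threshold      */
        void *large = malloc(256 * 1024);    /* 256 KB >= 128 KB threshold    */

        printf("small = %p\n", small);       /* typically inside the heap VMA */
        printf("large = %p\n", large);       /* typically an mmap()ed region  */

        free(large);                         /* expect munmap(): given back   */
        free(small);                         /* kept on glibc's free lists    */
        return 0;
    }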
Quick view: Process Address Space – Heap

[Figure] Process virtual address layout, top down (top of user space = 0x7FFF_FFFF_FFFF):
  Stack (default size: 8 MB), STACK_TOP_MAX = 0x7FFF_FFFF_F000, mm->stack
  128 MB gap
  HEAP: from mm->start_brk up to mm->brk
  BSS
  Data: from mm->start_data up to mm->end_data
  Text: starts at mm->start_code = 0x40_0000
  0
[Figure repeated] The same process-address-space layout, annotated with the question: why are they different?
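One way to poke at these boundaries from user space (a minimal sketch, not from the slides; the file name segments_demo.c is made up) is to print the linker-provided symbols etext, edata and end (see `man 3 end`) together with the current program break from sbrk(0), and compare them with the heap VMA in /proc/self/maps:

    /* segments_demo.c -- print segment boundaries and the program break.
     * etext/edata/end come from the linker (man 3 end); sbrk(0) returns the
     * current program break, i.e. roughly mm->brk as seen from user space. */
    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <unistd.h>

    extern char etext, edata, end;   /* end of text, of initialized data, of BSS */

    int main(void)
    {
        printf("etext   = %p\n", (void *)&etext);
        printf("edata   = %p\n", (void *)&edata);
        printf("end     = %p\n", (void *)&end);
        printf("sbrk(0) = %p\n", sbrk(0));    /* current program break */
        return 0;
    }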
sys_brk – Call path

sys_brk:
  • if brk < mm->start_brk → return mm->brk
  • __do_munmap (shrink the heap)
  • do_brk_flags (grow the heap)
      o can expand the existing anonymous mapping → vma_merge
      o cannot expand the existing anonymous mapping → vm_area_alloc
  • mm->brk = brk
  • if (mm->def_flags & VM_LOCKED) != 0 → mm_populate
  • return newbrk

mm_populate:
  • __mm_populate → follow_page_mask → the page is NOT populated yet → faultin_page → handle_mm_fault
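From user space this path is exercised whenever the program break is moved with brk()/sbrk(). A minimal sketch (not the free_and_sbrk program used on the following slides; the file name sbrk_demo.c is made up):

    /* sbrk_demo.c -- grow and shrink the program break.
     * Run under `strace -e trace=brk ./sbrk_demo` to see the raw brk() calls. */
    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        printf("break before    : %p\n", sbrk(0));   /* current mm->brk     */

        sbrk(64 * 1024);                             /* grow: do_brk_flags  */
        printf("break after +64K: %p\n", sbrk(0));

        sbrk(-64 * 1024);                            /* shrink: __do_munmap */
        printf("break after -64K: %p\n", sbrk(0));
        return 0;
    }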
# ./free_and_sbrk 1 1
Kernel: load_elf_binary() has set up the initial VMAs:

  GAP
  vma: R        vm_start = 0x400000         vm_end = 0x401000
  vma: R, E     vm_start = 0x401000         vm_end = 0x496000
  vma: R        vm_start = 0x496000         vm_end = 0x4be000
  vma: R, W     vm_start = 0x4be000         vm_end = 0x4c4000
  GAP
  vma (vvar)    vm_start = 0x7ffff7ffa000   vm_end = 0x7ffff7ffe000
  vma (vdso)    vm_start = 0x7ffff7ffe000   vm_end = 0x7ffff7fff000
  GAP
  vma (stack)   vm_start = 0x7fffff85d000   vm_end = 0x7ffffffff000
After launching a program: Question — why? (Same VMA layout as above.)
[From scratch] Launch a program: load_elf_binary() – Heap Configuration

# ./free_and_sbrk 1 1

Call path: load_elf_binary → set_brk → vm_brk_flags → do_brk_flags
  • can expand the existing anonymous mapping → vma_merge
  • cannot expand the existing anonymous mapping → vm_area_alloc

(VMA layout at this point: as above — no heap VMA yet.)
[From scratch] Launch a program: load_elf_binary() – Heap Configuration (cont.)

load_elf_binary passes elf_bss and elf_brk to set_brk:
  set_brk → vm_brk_flags → do_brk_flags
    • can expand the existing anonymous mapping → vma_merge
    • cannot expand the existing anonymous mapping → vm_area_alloc

Result: a heap VMA is created and the program break is initialized:
  mm->brk = mm->start_brk = 0x4c5000
  vma (heap)   vm_start = 0x4c4000   vm_end = 0x4c5000   (other VMAs unchanged)
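For reference, the set_brk() step named above is only a few lines in fs/binfmt_elf.c; the following is a condensed sketch (approximate, based on v5.x-era kernels — exact arguments and flags vary by version), not a verbatim copy:

    /* Condensed sketch of set_brk() from fs/binfmt_elf.c (approximate).
     * It page-aligns the end of BSS (elf_bss) and the requested brk (elf_brk),
     * maps that range with vm_brk_flags(), and initializes the program break. */
    static int set_brk(unsigned long start, unsigned long end, int prot)
    {
        start = ELF_PAGEALIGN(start);                 /* elf_bss, rounded up */
        end   = ELF_PAGEALIGN(end);                   /* elf_brk, rounded up */
        if (end > start) {
            int error = vm_brk_flags(start, end - start,
                                     prot & PROT_EXEC ? VM_EXEC : 0);
            if (error)
                return error;
        }
        current->mm->start_brk = current->mm->brk = end;   /* e.g. 0x4c5000 */
        return 0;
    }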
sys_brk in action: a brk() call grows the heap
  • newbrk = PAGE_ALIGN(brk); oldbrk = PAGE_ALIGN(mm->brk)
  • shrink brk if brk <= mm->brk → __do_munmap
  • otherwise grow → do_brk_flags
      o can expand the existing anonymous mapping → vma_merge
      o cannot expand the existing anonymous mapping → vm_area_alloc
  • mm->brk = brk
  • if (mm->def_flags & VM_LOCKED) != 0 → mm_populate

Result: mm->start_brk = 0x4c5000, mm->brk = 0x4c61c0
  vma (heap)   vm_start = 0x4c4000   vm_end = 0x4c7000   (other VMAs unchanged)
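The gap between mm->brk = 0x4c61c0 and the VMA end 0x4c7000 is just page alignment; a user-space restatement of the kernel's PAGE_ALIGN (assuming 4 KB pages) shows the arithmetic:

    /* page_align_demo.c -- why the heap VMA ends at 0x4c7000 while
     * mm->brk is 0x4c61c0: vm_end is mm->brk rounded up to a page boundary. */
    #include <stdio.h>

    #define PAGE_SIZE     4096UL
    #define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

    int main(void)
    {
        printf("PAGE_ALIGN(0x4c61c0) = %#lx\n", PAGE_ALIGN(0x4c61c0UL));  /* 0x4c7000 */
        printf("PAGE_ALIGN(0x4c5000) = %#lx\n", PAGE_ALIGN(0x4c5000UL));  /* 0x4c5000 */
        return 0;
    }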
Before: mm->start_brk = 0x4c5000, mm->brk = 0x4c61c0
  vma (heap)   vm_start = 0x4c4000   vm_end = 0x4c7000

sys_brk is called again:
  • newbrk = PAGE_ALIGN(brk); oldbrk = PAGE_ALIGN(mm->brk)
  • shrink brk if brk <= mm->brk → __do_munmap
  • otherwise grow → do_brk_flags
      o can expand the existing anonymous mapping → vma_merge
      o cannot expand the existing anonymous mapping → vm_area_alloc
  • mm->brk = brk
  • if (mm->def_flags & VM_LOCKED) != 0 → mm_populate

Result: mm->start_brk = 0x4c5000, mm->brk = 0x4e8000
  vma (heap)   vm_start = 0x4c4000   vm_end = 0x4e8000   (other VMAs unchanged)
[Program Launch] strace observation: mprotect()

Before mprotect: mm->start_brk = 0x4c5000, mm->brk = 0x4e8000
  vma: R, W    vm_start = 0x4be000   vm_end = 0x4c4000   (R/W permission)
  vma (heap)   vm_start = 0x4c4000   vm_end = 0x4e8000   (other VMAs as above)

Call path: mprotect → split_vma / vma_merge

After mprotect, part of the R/W region becomes read-only, so the VMA is split (and the result matches the observed layout):
  vma: R       vm_start = 0x4be000   vm_end = 0x4c1000
  vma: R, W    vm_start = 0x4c1000   vm_end = 0x4c4000

mm->start_brk = 0x4c5000, mm->brk = 0x4e8000 (unchanged)
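The same kind of split can be reproduced from user space (a minimal sketch; the file name mprotect_demo.c is made up): calling mprotect() on the middle of an anonymous R/W mapping forces split_vma(), visible as extra lines in /proc/self/maps.

    /* mprotect_demo.c -- make the middle page of a 3-page R/W mapping
     * read-only and show the resulting VMA split in /proc/self/maps.
     * (Assumes 4 KB pages.) */
    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t page = 4096;
        char *p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        /* One rw-p VMA before; rw-p / r--p / rw-p afterwards. */
        mprotect(p + page, page, PROT_READ);          /* triggers split_vma() */

        char cmd[64];
        snprintf(cmd, sizeof(cmd), "cat /proc/%d/maps", getpid());
        system(cmd);                                   /* inspect the VMAs    */
        return 0;
    }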
strace observation: allocate space via malloc() #1
  [Init stage] 0x4e8000 – 0x4c7000 = 0x21000 (132 KB: 33 pages)
  (brk()/mmap() policy: see the [glibc] malloc notes above.)
strace observation: allocate space via malloc() #2 (glibc malloc.c)
Heap space allocation from malloc(): allocate a memory chunk > 128 KB via brk()

Memory Allocation in Linux – brk() detail
[glibc] malloc: check sysmalloc() for the implementation (brk()/mmap() policy as summarized above).

Flow (a toy sketch follows this list):
  • The user application calls malloc().
  • glibc malloc implementation: is the already-allocated heap space enough?
      o Yes: return an available address from the allocated heap space.
      o No: if size < 128 KB, allocate a "memory chunk > 128 KB" by calling brk().
  • brk() or mmap() crosses into kernel space:
      o VMA configuration & program break adjustment, page fault handler
      o kmalloc/kfree, vmalloc, Slab Allocator, Buddy System, alloc_page(s), __get_free_page(s)
  • Hardware
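The decision above can be restated as a toy allocator (purely illustrative; toy_malloc, heap_top and heap_end are made-up names and this is NOT glibc's sysmalloc() — alignment, chunk headers, free lists and arenas are all omitted):

    /* toy_malloc.c -- toy restatement of the brk()/mmap() decision above. */
    #define _DEFAULT_SOURCE
    #include <stddef.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define MMAP_THRESHOLD (128 * 1024)   /* DEFAULT_MMAP_THRESHOLD_MIN        */
    #define BRK_CHUNK      (132 * 1024)   /* grow the heap > 128 KB at a time  */

    static char *heap_top;                /* first unused byte in the heap     */
    static char *heap_end;                /* current program break             */

    void *toy_malloc(size_t size)
    {
        if (size >= MMAP_THRESHOLD)       /* large request: straight to mmap() */
            return mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (!heap_top)                    /* lazy init: find the current break */
            heap_top = heap_end = sbrk(0);

        if (heap_top + size > heap_end) { /* allocated heap space not enough   */
            if (sbrk(BRK_CHUNK) == (void *)-1)
                return NULL;
            heap_end += BRK_CHUNK;        /* heap grown via brk()              */
        }
        void *p = heap_top;               /* hand out already-obtained space   */
        heap_top += size;
        return p;
    }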
glibc: malloc implementation for the memory request size (see malloc.c)
  • MORECORE() → __sbrk() → __brk()

Heap space allocation from malloc(): allocate a memory chunk > 128 KB via brk()

Detail Reference
  • [glibc] malloc internals
      o Concepts: chunks, arenas, heaps, and thread-local cache (tcache)
Kernel Space — Kernel Virtual Address layout (x86-64), top down:

  Modules                                    MODULES_VADDR
  Kernel code [.text, .data, …]              __START_KERNEL = 0xFFFF_FFFF_8100_0000
  Kernel text mapping (1 GB or 512 MB)
    from physical address 0                  __START_KERNEL_map = 0xFFFF_FFFF_8000_0000
  Empty space
  Virtual memory map – 1 TB
    (stores page frame descriptors)          vmemmap_base
  Unused hole (1 TB)
                                             VMALLOC_END = 0xFFFF_E8FF_FFFF_FFFF
  vmalloc/ioremap (32 TB)                    vmalloc_base = VMALLOC_START = 0xFFFF_C900_0000_0000
  Unused hole (0.5 TB)
  Page frame direct mapping (64 TB)          page_offset_base
  LDT remap for PTI (0.5 TB)
  Guard hole (8 TB)                          0xFFFF_8000_0000_0000
  User Space (128 TB)                        0x0000_7FFF_FFFF_FFFF … 0

Default configuration (can be dynamically changed by KASLR — Kernel Address Space Layout Randomization, "arch/x86/mm/kaslr.c"):
  page_offset_base = 0xFFFF_8880_0000_0000
  vmalloc_base     = 0xFFFF_C900_0000_0000
  vmemmap_base     = 0xFFFF_EA00_0000_0000
vmalloc – call path
  • Get a virtual address from the vmalloc RB-tree (the vmap_area RB-tree, rooted at vmap_area_root and augmented with subtree_max_size):
      o find_vmap_lowest_match(): get a VA from the RB-tree
      o insert_vmap_area()
  • map_kernel_range: page table population

Resulting data structures (example values):
  vmap_area:
    va_start = 0xffffc90001a4d000
    va_end   = 0xffffc9000224e000
    rb_node (in vmap_area_root), list (in vmap_area_list)
    union { subtree_max_size; vm → vm_struct }
  vm_struct:
    next
    addr     = 0xffffc90001a4d000
    size     = 0x801000 (with guard page)
    flags    = 0x22
    **pages  = NULL
    nr_pages = 0
    phys_addr
    caller
  The vmap_areas are linked on the list_head vmap_area_list.
Example: vmalloc size = 8 MB — __vmalloc_area_node() (vmalloc-test.ko)
  • Memory is allocated for storing the pointers to the page descriptors: area->pages[]
      o size: 8 MB / 4 KB * 8 bytes = 16384 bytes
      o allocated from vmalloc (if > 4 KB) or kmalloc (if <= 4 KB)
  • vm_struct is updated:
      o **pages  = 0xffffc900019b9000
      o nr_pages = 0x800 (2048)
      o flags    = 0x22, phys_addr, caller as before
  • Each pages[] entry points to the page descriptor of one allocated physical page.
Example: vmalloc size = 8 MB
  vm_struct:
    next
    addr     = 0xffffc90001a4d000
    size     = 0x801000 (with guard page)
    flags    = 0x22
    **pages  = 0xffffc900019b9000
    nr_pages = 0x800 (2048)
    phys_addr
    caller

Example: vmalloc size = 8 MB — Page Table Configuration
  The walk starts from CR3 → init_top_pgt = swapper_pg_dir.
  [Linear Address] 0xffff_c900_01a5_2000, 0xffff_c900_01a5_3000:
    bits 63–48: sign-extend
    bits 47–39: Page Map Level-4 offset
    bits 38–30: Page Directory Pointer offset
    bits 29–21: Page Directory offset
    bits 20–12: Page Table offset
    bits 11–0 : physical page offset
  (Figure annotations: "6 contiguous pages", "N contiguous pages", page descriptors referenced via area->pages[].)
  Memory allocation for the page-descriptor pointer array:
    • size: 8 MB / 4 KB * 8 bytes = 16384 bytes
    • allocated from vmalloc (> 4 KB)
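A minimal module in the spirit of the vmalloc-test.ko example above (a sketch with made-up names, not the original module; assumes a standard out-of-tree module build) just allocates 8 MB with vmalloc() and prints the address, which should land above VMALLOC_START = 0xFFFF_C900_0000_0000:

    /* vmalloc_demo.c -- minimal vmalloc test module (sketch). */
    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/vmalloc.h>

    static void *buf;

    static int __init vmalloc_demo_init(void)
    {
        buf = vmalloc(8 * 1024 * 1024);   /* 2048 pages, virtually contiguous,
                                           * not necessarily physically contiguous */
        if (!buf)
            return -ENOMEM;
        pr_info("vmalloc_demo: buf = %px\n", buf);  /* expect a vmalloc-area address */
        return 0;
    }

    static void __exit vmalloc_demo_exit(void)
    {
        vfree(buf);                        /* frees the pages and the vmap_area */
    }

    module_init(vmalloc_demo_init);
    module_exit(vmalloc_demo_exit);
    MODULE_LICENSE("GPL");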
vmalloc users/scenarios
• Virtually-mapped stack (VMAP_STACK=y)
  o Using a virtually-mapped stack with a guard page means a kernel stack overflow can be detected immediately.