4. Memory Management
Subhrendu Chattopadhyay, IDRBT
Disclaimer: A few of the images are taken from the Internet and textbooks
Extra: Linux Boot Process
Boot steps
● Firmware-based Basic Input/Output System (BIOS) & Power-On Self-Test (POST)
● BIOS reads the Master Boot Record (MBR)
● MBR contains GRand Unified Bootloader (GRUB) stage-1
● GRUB stage-1 → GRUB stage-1.5 → GRUB stage-2
● GRUB stage-2: Selects and loads initrd.img and vmlinuz
● The kernel starts init
MBR Location and Structure:
● The MBR is always located in a specific and well-defined location:
○ The first sector of the bootable device: cylinder 0, head 0, sector 1 in CHS terms. This sector is also known as LBA (Logical Block Address) 0, and it is 512 bytes in size.
○ The BIOS reads these first 512 bytes of the bootable device into memory (usually at address 0x7C00). This 512-byte sector is the Master Boot Record.
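The BIOS only treats the loaded sector as a valid MBR if its last two bytes (offsets 510 and 511) hold the boot signature 0x55 0xAA. As a minimal sketch (the function name here is ours, purely for illustration), the check looks like this:

```c
#include <assert.h>

/* An MBR sector is 512 bytes; bytes 510 and 511 must hold the
 * boot signature 0x55 0xAA for the BIOS to treat it as bootable. */
#define SECTOR_SIZE 512

int has_mbr_signature(const unsigned char sector[SECTOR_SIZE]) {
    return sector[510] == 0x55 && sector[511] == 0xAA;
}
```

Firmware that does not find this signature typically reports "no bootable device" instead of jumping to 0x7C00.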
MBR Location and Structure:
● The structure of the MBR can be inspected with a hexdump of the first sector:
0000000 63eb 0090 0000 0000 0000 0000 0000 0000
0000010 0000 0000 0000 0000 0000 0000 0000 0000
*
● The last two bytes of the sector (offsets 510–511) hold the boot signature.
MBR: Partition Table Entry

Byte Offset | Field Name     | Len | Description                                                            | Values
0           | Boot Indicator | 1   | 0x00: Non-bootable partition (not 0x80, which would indicate bootable) | 00
1-3         | Starting CHS   | 3   | The Cylinder-Head-Sector (CHS) address of the partition's start        | 00 02 00
4           | Partition Type | 1   | 0xee: This indicates a GPT Protective MBR partition                    | ee
5-7         | Ending CHS     | 3   | The CHS address of the partition's end                                 | ff ff ff
8-11        | Starting LBA   | 4   | LBA (Logical Block Addressing) where the partition starts (sector 1)   | 01 00 00 00

Disassembling the saved MBR: ndisasm -b 16 <Saved file>
00000000 EB63   jmp short 0x65
00000002 90     nop
*
00000064 FF     db 0xff
00000065 FA     cli
00000066 90     nop
00000067 90     nop
00000068 F6C280 test dl,0x80
0000006B 7405   jz 0x72
0000006D F6C270 test dl,0x70
*
000001B0 CD10   int 0x10
000001B2 AC     lodsb
000001B3 3C00   cmp al,0x0
000001B5 75F4   jnz 0x1ab
000001B7 C3     ret
000001B8 31ED   xor bp,bp
000001BA 7734   ja 0x1f0
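The 16-byte entry layout above maps naturally onto a packed C struct. Below is a sketch (the struct and helper names are ours; the final `num_sectors` field at bytes 12–15 completes the standard 16-byte entry even though the table above stops at the Starting LBA). The decoder avoids relying on host byte order:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 16-byte MBR partition table entry (4 of these live at offset 446). */
struct mbr_part_entry {
    uint8_t  boot_indicator;  /* 0x80 = bootable, 0x00 = non-bootable */
    uint8_t  start_chs[3];    /* CHS address of the partition's start */
    uint8_t  part_type;       /* 0xee = GPT Protective MBR partition  */
    uint8_t  end_chs[3];      /* CHS address of the partition's end   */
    uint32_t start_lba;       /* LBA where the partition starts       */
    uint32_t num_sectors;     /* partition length in sectors          */
} __attribute__((packed));

/* Decode a 32-bit little-endian value, independent of host byte order. */
static uint32_t le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

With the example bytes from the table, `01 00 00 00` decodes to Starting LBA 1, matching the GPT Protective MBR convention of starting right after the MBR itself.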
GRUB 1.5
● GRUB stage-1.5 is stored starting at sector 1, as referenced by the partition table
○ It can identify the boot partition type (e.g. EFI)
$ sudo dd if=/dev/sda bs=512 skip=1 count=1 of=grub1.5.bin
GRUB 2
● GRUB stage-2 resides on the EFI System partition (here /dev/sda2):
$ sudo fdisk -l /dev/sda
…
Device       Start        End    Sectors   Size Type
/dev/sda1     2048    9764863    9762816   4.7G Linux swap
/dev/sda2  9764864   10815487    1050624   513M EFI System
/dev/sda3 10815488 1953523711 1942708224 926.4G Linux filesystem
GRUB 2
● Loading Configuration Files
○ GRUB Configuration: GRUB Stage 2 reads the grub.cfg file, usually located in the /boot/grub/ directory. This file contains the
boot menu entries and configurations defined by the user.
○ Parsing Configuration: It interprets the configuration file to construct the boot menu displayed to the user.
● Displaying the Boot Menu
○ User Interaction: GRUB Stage 2 presents the user with a boot menu, allowing them to select which operating system or kernel
to boot.
○ Timeout Handling: If no selection is made within a specified timeout period, it automatically boots the default option.
● Kernel Loading
○ Loading the Kernel: Based on the user's selection, Stage 2 locates and loads the selected kernel (usually located in /boot/).
○ Passing Parameters: It also passes any specified kernel parameters (options) to the kernel during the boot process.
● Loading Initrd (Initial RAM Disk)
○ Loading Initrd: If the selected kernel requires an initial RAM disk (initrd), GRUB Stage 2 loads it into memory. This is necessary
for kernels that require additional drivers or filesystems to be available before the root filesystem can be mounted.
● Setting Up the Environment
○ Device Detection: GRUB Stage 2 can probe devices and detect available filesystems, allowing it to locate the kernel and initrd.
○ File System Access: It provides filesystem access to read the kernel and initrd images, enabling it to handle various filesystem
types.
● Transferring Control to the Kernel
○ Switching to Protected Mode: GRUB Stage 2 switches the CPU to protected mode (or long mode for 64-bit systems) to
prepare for loading the kernel.
○ Jumping to Kernel Entry Point: Finally, it transfers control to the kernel by jumping to the kernel's entry point address,
allowing the kernel to take over the boot process.
● Providing a Command-Line Interface
○ Interactive Shell: GRUB Stage 2 offers an interactive command-line interface for users to manually enter commands for
troubleshooting or custom booting options.
initrd.img (Initial RAM Disk)
● Functionalities
○ Early Boot Environment: initrd.img provides a temporary root file system in RAM. This
environment allows the kernel to load essential modules and drivers needed for
accessing the real root file system.
○ Driver Loading: It often contains drivers for hardware that is necessary to boot the
system, such as storage drivers (e.g., SATA, SCSI) that may not be compiled directly into
the kernel. This is especially important for systems using RAID or LVM configurations.
○ Modular Kernel Support: The initrd allows for a modular kernel configuration, where
not all drivers are statically linked to the kernel. It loads these modules dynamically at
boot time.
○ Mounting the Real Root File System: After loading the required drivers and mounting
the root file system, the kernel transfers control from the initrd to the actual root file
system.
vmlinuz
● vmlinuz is the compressed image of the Linux kernel
○ When the system boots, the bootloader (e.g., GRUB) loads both the vmlinuz kernel
image and the initrd.img into memory
○ The kernel starts executing and uses the initrd to access necessary modules and files
○ The kernel uses the temporary file system provided by initrd to load drivers and, once
everything is set up, mounts the real root file system
○ Finally, the kernel executes the init process from the mounted root file system,
continuing the boot process into the user space
● An initramfs shell typically appears if there are errors during the boot process, such as:
○ The root filesystem is not found.
○ The kernel cannot mount the initramfs or initrd image.
○ The necessary drivers for the hardware are missing or not loaded.
MMU Overview
Challenges
• Performance vs. Complexity Trade-offs
• Latency
• Hardware support (e.g. TLB)
• Cache and coherence
• Fragmentation:
• Internal vs External
• Memory size restriction
• Virtual memory
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm
Linux Source codes
• mm/:
• The core memory management directory, containing implementation for paging, swapping,
memory allocation, and the virtual memory subsystem.
• arch/:
• Architecture-specific memory management functions are implemented here.
• For example, for x86 architecture:
https://elixir.bootlin.com/linux/v6.11-rc4/source/arch/x86/mm
Arch specific MMU functionalities
1. Page Table Structure
● x86 Architecture:
○ 4-Level Page Table: Supports a hierarchical page table structure with four levels (Page Global Directory, Page Upper Directory, Page Middle
Directory, Page Table Entry) for 64-bit addressing.
○ Extended Page Tables (EPT): For virtualization, allowing the hypervisor to manage guest physical addresses.
● ARM Architecture:
○ Multiple Page Table Levels: ARM supports different page table structures, including 2-level and 3-level page tables, with various entry
formats for different memory types (normal, device).
● RISC-V Architecture:
○ Variable Page Sizes: Supports different page sizes (4KB, 2MB, etc.) in its page table structure, allowing flexible memory allocation.
Arch specific MMU functionalities
2. Translation Lookaside Buffer (TLB)
● x86 Architecture:
○ Split TLBs: Separate TLBs for data and instruction caching to optimize performance.
○ Large Page Support: Supports larger TLB entries for large pages to reduce the number of translations required.
● ARM Architecture:
○ Context-Sensitive TLB: ARM processors can use context identifiers (CID) to provide multiple address spaces in TLB entries, improving
performance in multi-tasking environments.
● RISC-V Architecture:
○ Configurable TLB: TLB size and associativity can be customized based on implementation, allowing for tailored performance characteristics.
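For the x86 4-level case above, each level's index is a 9-bit slice of the 48-bit virtual address, with a 12-bit page offset at the bottom. The slicing can be sketched directly (helper names are ours, mirroring the kernel's pgd/pud/pmd/pte naming):

```c
#include <assert.h>
#include <stdint.h>

/* 48-bit VA layout with 4 KB pages:
 * [47:39] PGD | [38:30] PUD | [29:21] PMD | [20:12] PTE | [11:0] offset */
static unsigned idx(uint64_t va, unsigned shift) {
    return (unsigned)((va >> shift) & 0x1ff);   /* each index is 9 bits */
}
unsigned pgd_index(uint64_t va) { return idx(va, 39); }
unsigned pud_index(uint64_t va) { return idx(va, 30); }
unsigned pmd_index(uint64_t va) { return idx(va, 21); }
unsigned pte_index(uint64_t va) { return idx(va, 12); }
uint64_t page_offset(uint64_t va) { return va & 0xfff; }
```

Each 9-bit index selects one of 512 entries at its level, which is why each table fits exactly in one 4 KB page (512 entries × 8 bytes).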
Arch specific MMU functionalities
3. Memory Protection
● x86 Architecture:
○ Execute Disable Bit (XD): A security feature that prevents execution of code in certain regions, helping to protect against buffer overflow
attacks.
● ARM Architecture:
○ Memory Attribute Indirection (MAIR): Allows different memory attributes (like cacheable, shareable) to be assigned to different memory regions, enhancing security and performance.
● RISC-V Architecture:
○ Physical Memory Protection (PMP): Provides mechanisms to enforce access permissions for different regions of physical memory.
Arch specific MMU functionalities
4. Virtualization Support
● x86 Architecture:
○ Intel VT-x: Supports hardware-assisted virtualization, providing additional memory management features for guest operating systems.
○ Nested Page Tables (NPT, the AMD counterpart of Intel's EPT): Allows guest operating systems to manage their own page tables with minimal overhead.
● ARM Architecture:
○ ARM Virtualization Extensions: Includes support for virtualization with a similar approach to Intel VT-x, allowing hypervisors to manage guest
memory.
● RISC-V Architecture:
○ Virtualization Support: RISC-V provides extensions for virtualization, enabling efficient memory management for virtual machines.
Arch specific MMU functionalities
5. Support for Address Space Randomization
● x86 Architecture:
○ ASLR (Address Space Layout Randomization): The MMU supports ASLR to randomize memory addresses, making it harder
for attackers to predict memory layout.
● ARM Architecture:
○ ASLR Support: Similar support for ASLR, providing randomization in process address spaces for security.
● RISC-V Architecture:
○ Flexible Page Table Entries: Supports randomization by allowing dynamic updates to page table entries.
Arch specific MMU functionalities
6. Cache Coherence
● x86 Architecture:
○ MESI Protocol: Utilizes the MESI (Modified, Exclusive, Shared, Invalid) cache coherence protocol to maintain consistency between caches in multiprocessor systems.
● ARM Architecture:
○ Coherent Memory Regions: ARM allows specific memory regions to be designated as coherent, enabling cache coherence in
multiprocessor systems.
● RISC-V Architecture:
○ Cache Coherence Protocols: RISC-V defines various cache coherence protocols, allowing designers to choose implementations
based on specific system needs.
Arch specific MMU functionalities
7. User-Space and Kernel-Space Separation
● x86 Architecture:
○ Segmented Memory: Early x86 architectures used segmentation to separate user space from kernel space,
although modern x86 systems primarily use paging.
● ARM Architecture:
○ Secure and Non-Secure States: ARM processors have separate execution states (Secure and Non-Secure)
for enhanced security and separation of user and kernel modes.
● RISC-V Architecture:
○ User Mode and Supervisor Mode: RISC-V explicitly defines user mode and supervisor mode, allowing for
clear separation between user applications and kernel operations.
List of Important Files
Start from these files
• mm/mmap.c:
• Deals with memory mapping (e.g., mmap() system call). Understanding how virtual memory areas (VMAs)
are managed starts here.
• mm/page_alloc.c:
• Implements the buddy system for managing free pages in memory. Functions like alloc_pages() and
free_pages() handle page allocation.
• mm/slab.c:
• Contains the slab allocator code. The slab allocator is used to efficiently allocate small pieces of memory for
kernel objects.
• mm/vmalloc.c:
• Manages virtual memory allocation using the vmalloc() function, allocating contiguous virtual addresses
backed by physical memory.
• mm/swap.c:
• Handles swapping pages to disk when physical memory is exhausted.
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm
Keywords
Search Important Functions
• kmalloc() and kfree(): Fundamental functions for kernel memory allocation and deallocation.
• alloc_pages(): Allocates pages of memory using the buddy system.
• vmalloc(): Allocates large chunks of memory in virtual address space.
• do_mmap(): Responsible for memory mapping in user space.
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/mm_types.h#L72
MMU: Memory Management Unit
What is NUMA Architecture?
Definition:
● NUMA (Non-Uniform Memory Access) is a memory design for multiprocessor systems in which memory access time depends on the memory's location relative to the accessing processor.
Key Features:
● Multiple Memory Nodes: Each processor has its own local memory, but can also access memory attached to other processors
(remote memory).
● Non-Uniform Access Time: Accessing local memory is faster than accessing remote memory.
Why NUMA?
● Improves scalability by reducing memory access bottlenecks, especially in large systems with multiple processors.
Example System: Used in large-scale servers and high-performance computing (HPC) systems.
How NUMA Works and Its Benefits
How it Works:
● Processors are grouped into nodes, each with its own local memory.
● Local Memory Access: A processor can access its own memory quickly.
● Remote Memory Access: Accessing memory from another processor’s node takes longer due to communication overhead.
Benefits:
● Unlike SMP (Symmetric Multiprocessing), where memory access time is uniform for all processors but memory contention grows in large systems, NUMA reduces contention by giving each node its own local memory.
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/mempolicy.c#L845
In Linux
In Memory Management policy
static long do_set_mempolicy(unsigned short mode, unsigned short flags, nodemask_t *nodes)
In Scheduler
struct numa_group {
    refcount_t refcount;
    spinlock_t lock; /* nr_tasks, tasks */
    …
};
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sched/fair.c#L1370
MMU in xv6 (easy)
Memory Allocation in XV6
• Free list:
• xv6 keeps track of available physical memory in a free list managed by the
kalloc.c file. The kinit() function initializes the free list, and kalloc() and
kfree() functions are used to allocate and free pages of memory.
• Pages:
• Memory is allocated in page-sized chunks (4 KB) and the system uses a page
table to translate virtual addresses to physical addresses.
https://subhrendu1987.github.io/xv6_doxygen/html/d4/def/kalloc_8c_source.html#l00083
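The xv6 scheme above can be sketched in user space: a region is carved into 4 KB pages, each page is pushed onto a singly linked free list (the link lives inside the free page itself), and allocation pops a page off the head. Names mirror xv6's kalloc.c, but this is a toy, not the kernel code:

```c
#include <stddef.h>

#define PGSIZE 4096

/* Each free page stores the list link in its own first bytes, as xv6 does. */
struct run { struct run *next; };

static struct run *freelist;

/* kfree: put a page back on the head of the free list. */
void kfree_page(void *pa) {
    struct run *r = (struct run *)pa;
    r->next = freelist;
    freelist = r;
}

/* kalloc: pop one page off the free list; NULL when memory is exhausted. */
void *kalloc_page(void) {
    struct run *r = freelist;
    if (r)
        freelist = r->next;
    return r;
}

/* kinit-like: free every whole page in [start, end); returns the page count. */
int kinit_region(char *start, char *end) {
    int n = 0;
    for (char *p = start; p + PGSIZE <= end; p += PGSIZE, n++)
        kfree_page(p);
    return n;
}
```

Storing the link inside the free page costs no extra memory: a page that is free has no other use for its bytes, which is exactly the trick xv6's struct run relies on.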
Memory Allocation in XV6

// Allocate one 4096-byte page of physical memory.
// Returns a pointer that the kernel can use.
// Returns 0 if the memory cannot be allocated.
char* kalloc(void){
    struct run *r;
    …
}
https://subhrendu1987.github.io/xv6_doxygen/html/de/de9/vm_8c_source.html#l00057
Virtual Memory in XV6
● mappages: Creates page table entries that map virtual addresses to physical addresses
● setupkvm: Sets up kernel page tables

// Set up kernel part of a page table.
pde_t* setupkvm(void){
    pde_t *pgdir;
    struct kmap *k;

    if((pgdir = (pde_t*)kalloc()) == 0)
        return 0;
    memset(pgdir, 0, PGSIZE);
    if (P2V(PHYSTOP) > (void*)DEVSPACE)
        panic("PHYSTOP too high");
    for(k = kmap; k < &kmap[NELEM(kmap)]; k++)
        if(mappages(pgdir, k->virt, k->phys_end - k->phys_start,
                    (uint)k->phys_start, k->perm) < 0) {
            freevm(pgdir);
            return 0;
        }
    return pgdir;
}
https://subhrendu1987.github.io/xv6_doxygen/html/de/de9/vm_8c_source.html#l00117
Virtual Memory in XV6
● switchuvm: Switches the TSS and hardware page table to the virtual memory of process p

// Switch TSS and h/w page table to correspond to process p.
void switchuvm(struct proc *p){
    if(p == 0)
        panic("switchuvm: no process");
    if(p->kstack == 0)
        panic("switchuvm: no kstack");
    if(p->pgdir == 0)
        panic("switchuvm: no pgdir");

    pushcli();
    mycpu()->gdt[SEG_TSS] = SEG16(STS_T32A, &mycpu()->ts,
                                  sizeof(mycpu()->ts)-1, 0);
    mycpu()->gdt[SEG_TSS].s = 0;
    …
}
https://subhrendu1987.github.io/xv6_doxygen/html/de/de9/vm_8c_source.html#l00157
MMU in Linux
Page frame in Linux: struct page
1. flags: Atomic flags for various page states, which can be updated asynchronously.
2. union for various page types: Contains different structures and unions depending on the page's
purpose:
a. Page cache and anonymous pages:
i. lru: A list head used for pageout lists (active/inactive).
ii. mapping: Pointer to the page's address space.
iii. index: Page offset within the mapping or share count (for DAX files).
iv. private: Opaque data used for different purposes (e.g., buffer heads, swap, or buddy
system).
b. Page pool used by the network stack:
i. pp_magic: A magic value for identifying pages allocated by the page pool.
ii. pp: Pointer to a page_pool.
iii. dma_addr: DMA address.
iv. pp_ref_count: Reference count for the page pool.
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/mm_types.h#L72
Page frame in Linux: struct page
1. flags:
2. union for various page types:
a. Page cache and anonymous pages:
b. Page pool used by the network stack:
c. Tail pages of compound pages:
i. compound_head: Pointer to the head of the compound page.
d. ZONE_DEVICE pages:
i. pgmap: Points to a device page map.
ii. zone_device_data: Stores additional data for device pages.
iii. rcu_head: Used for freeing pages with RCU (Read-Copy-Update) mechanisms.
3. Small union (4 bytes):
a. page_type: Used for typed folios to identify page usage.
b. _mapcount: Tracks the number of references to the page from page tables, starting
from -1.
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/mm_types.h#L72
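The overlaid unions in struct page exist so that one small descriptor can serve many page roles without growing: the page-cache view, the page-pool view, and the compound-tail view all share the same bytes. An illustrative (deliberately not kernel-accurate) C analogue shows the storage sharing:

```c
#include <stddef.h>

/* Toy page descriptor: the union lets one descriptor carry either the
 * page-cache view or the page-pool view, but never both at once. */
struct toy_page {
    unsigned long flags;              /* state bits, like page->flags */
    union {
        struct {                      /* page cache / anonymous pages */
            void *mapping;
            unsigned long index;
        } cache;
        struct {                      /* page pool (network stack)    */
            unsigned long pp_magic;
            unsigned long dma_addr;
        } pool;
        void *compound_head;          /* tail page of a compound page */
    } u;
    int mapcount;                     /* like _mapcount, starts at -1 */
};
```

Because a struct page exists for every physical page frame, keeping it small matters enormously: every byte added to the descriptor is multiplied by millions of page frames, which is why the kernel packs mutually exclusive roles into one union.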
System Calls for Memory allocation
SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len, unsigned long, prot,
                unsigned long, flags, unsigned long, fd, unsigned long, off){
    if (off & ~PAGE_MASK)
        return -EINVAL;
    return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
}
• Say A new process is being cloned. How does the memory management happen in that case?
• SYSCALL_DEFINE5(clone, ..) → kernel_clone() → copy_process() →
dup_task_struct() , copy_mm()
static int copy_mm(unsigned long clone_flags, struct task_struct *tsk){
    …
    if (clone_flags & CLONE_VM) {
        mmget(oldmm);
        mm = oldmm;
    } else {
        mm = dup_mm(tsk, current->mm);
        …
    }
    …
}
https://elixir.bootlin.com/linux/v6.11-rc4/source/arch/x86/kernel/sys_x86_64.c#L79
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/fork.c#L2927
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/fork.c#L2759
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/fork.c#L2131
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/fork.c#L2387
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/fork.c#L1694
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/fork.c#L1721
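The CLONE_VM branch above decides whether the child shares the parent's mm or gets a duplicate. A plain fork() (no CLONE_VM) takes the dup_mm path, and a small POSIX demo can observe the effect: a write in the child does not appear in the parent, because each has its own (copy-on-write) address space. The demo function name is ours:

```c
#include <sys/wait.h>
#include <unistd.h>

/* fork() duplicates the address space (copy-on-write), so the child's
 * write to 'value' lands in its own copy. Returns 0 on success. */
int demo_fork_copy(void) {
    int value = 1;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {          /* child: modify its private copy and exit */
        value = 99;
        _exit(value == 99 ? 0 : 1);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (value != 1)          /* parent's copy must be untouched */
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

With clone(CLONE_VM) (as used for threads), the same experiment would show the opposite result: both tasks point at the same mm_struct, so the write would be visible to the parent.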
Memory allocation
● The _noprof allocation variants indicate that profiling or debugging is not enabled for the allocation
/* Allocates a new mm structure and duplicates the provided @oldmm structure content
   into it. Return: the duplicated mm or NULL on failure. */
static struct mm_struct *dup_mm(struct task_struct *tsk, struct mm_struct *oldmm){
    struct mm_struct *mm;
    int err;

    mm = allocate_mm();
    if (!mm)
        goto fail_nomem;
    memcpy(mm, oldmm, sizeof(*mm));
    …
    mm->hiwater_rss = get_mm_rss(mm);
    mm->hiwater_vm = mm->total_vm;
    …
}
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/slab.h#L542
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/slub.c#L4042
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/slub.c#L4011
SLAB vs SLUB allocator
Slab Allocator:
• The slab allocator is designed for performance optimization and memory fragmentation reduction. It
organizes memory into slabs, which consist of one or more pages of memory, divided into fixed-size objects.
• It keeps detailed metadata for each slab to track free, used, and partially used slabs. This adds a level of
complexity, but also allows for fine-tuned memory management, such as cache coloring (to avoid CPU
cache contention) and object reuse for frequently allocated objects.
• “to have caches of commonly used objects kept in an initialised state available for use by the kernel.”
Slub Allocator:
• The slub allocator is a simplified and more modern version of the slab allocator. Its primary goal is to reduce
the complexity and overhead of slab while maintaining good performance. Slub eliminates some of the
additional tracking structures (like per-CPU caches and slab queues).
• Instead of managing objects in detailed per-slab structures, slub uses a simpler model where objects are
directly managed in pages. This reduction in complexity improves scalability, especially on large NUMA
(Non-Uniform Memory Access) systems, by avoiding bottlenecks in memory management.
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/slab_common.c
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/slub.c
https://www.kernel.org/doc/html/v5.0/vm/slub.html
https://www.kernel.org/doc/html/v6.11-rc4/mm/index.html
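The idea shared by both allocators — carve a page into equal-sized objects and keep a free list threaded through the free objects themselves — can be sketched in a few lines of user-space C. This is a toy with our own names; the real allocators add per-CPU structures, NUMA awareness, and per-slab metadata:

```c
#include <stdlib.h>

#define SLAB_BYTES 4096   /* one backing "page" per cache in this toy */

struct toy_cache {
    size_t obj_size;      /* fixed object size, >= sizeof(void *)         */
    void  *free_head;     /* intrusive free list threaded through objects */
    char  *slab;          /* backing memory                               */
};

/* Carve the slab into objects and thread them onto the free list.
 * Returns the number of objects, or -1 on allocation failure. */
int cache_init(struct toy_cache *c, size_t obj_size) {
    if (obj_size < sizeof(void *))
        obj_size = sizeof(void *);
    c->obj_size = obj_size;
    c->free_head = NULL;
    c->slab = malloc(SLAB_BYTES);
    if (!c->slab)
        return -1;
    size_t n = SLAB_BYTES / obj_size;
    for (size_t i = 0; i < n; i++) {
        void *obj = c->slab + i * obj_size;
        *(void **)obj = c->free_head;  /* link stored inside the free object */
        c->free_head = obj;
    }
    return (int)n;
}

void *cache_alloc(struct toy_cache *c) {
    void *obj = c->free_head;
    if (obj)
        c->free_head = *(void **)obj;
    return obj;
}

void cache_free(struct toy_cache *c, void *obj) {
    *(void **)obj = c->free_head;      /* LIFO: hot objects are reused first */
    c->free_head = obj;
}
```

The LIFO free list is the "initialised state kept available for reuse" point from the slab quote above: a just-freed object is the first one handed out again, while its contents are likely still cache-hot.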
Memory allocation
• slab_alloc_node() also decides page replacement strategies
void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru, gfp_t gfpflags){
    void *ret = slab_alloc_node(s, lru, gfpflags, NUMA_NO_NODE, _RET_IP_, s->object_size);

    trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);
    return ret;
}

static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list_lru *lru,
                                               gfp_t gfpflags, int node, unsigned long addr,
                                               size_t orig_size){
    …
    object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
    …
}
struct list_lru_node {
    /* protects all lists on the node, including per cgroup */
    spinlock_t lock;
    /* global list, used for the root cgroup in cgroup aware lrus */
    struct list_lru_one lru;
    long nr_items;
} ____cacheline_aligned_in_smp;

struct list_lru {
    struct list_lru_node *node;
#ifdef CONFIG_MEMCG
    struct list_head list;
    int shrinker_id;
    bool memcg_aware;
    struct xarray xa;
#endif
};
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/slub.c#L4053
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/slub.c#L4011
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/list_lru.h#L51
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/slub.c#L3820
Memory allocation
LRU management Interfaces
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/list_lru.c#L149
Memory allocation
● LRU management Interfaces (with ChatGPT generated Annotations)

/**
 * @brief Removes an object from an LRU list with NUMA and optional memcg awareness.
 *
 * @param lru The list_lru structure that holds the LRU list.
 * @param item A pointer to the list_head representing the object to be removed.
 * @return true if the item was successfully removed, false otherwise.
 */
bool list_lru_del_obj(struct list_lru *lru, struct list_head *item) {
    bool ret;
    // Convert the virtual address of 'item' to a page, then get the NUMA node ID.
    // 'virt_to_page()' gets the page struct for the item.
    // 'page_to_nid()' gets the NUMA node where that page resides.
    int nid = page_to_nid(virt_to_page(item));

    // Check if the LRU list is memcg-aware (i.e., does it support memory control groups).
    if (list_lru_memcg_aware(lru)) {
        // We are in a read-mostly context. Lock the section using RCU (Read-Copy-Update).
        rcu_read_lock();

        // Call the LRU deletion function, passing the memcg associated with the item.
        // 'mem_cgroup_from_slab_obj()' retrieves the memory control group for this object.
        ret = list_lru_del(lru, item, nid, mem_cgroup_from_slab_obj(item));

        // Unlock the RCU section after performing the removal operation.
        rcu_read_unlock();
    } else {
        // If memcg-awareness is not enabled, remove the item from the LRU list
        // without considering memcg. Pass 'NULL' for the memcg parameter in this case.
        ret = list_lru_del(lru, item, nid, NULL);
    }

    // Return true if the item was successfully removed, false otherwise.
    return ret;
}
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/list_lru.c#L149
Memory allocation
• LRU management Interfaces
/* The caller must ensure the memcg lifetime. */
bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid, struct mem_cgroup *memcg){
    struct list_lru_node *nlru = &lru->node[nid];
    struct list_lru_one *l;

    spin_lock(&nlru->lock);
    if (list_empty(item)) {
        l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
        list_add_tail(item, &l->list);
        /* Set shrinker bit if the first element was added */
        if (!l->nr_items++)
            set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
        nlru->nr_items++;
        spin_unlock(&nlru->lock);
        return true;
    }
    spin_unlock(&nlru->lock);
    return false;
}
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/list_lru.c#L88
Memory allocation
● LRU management Interfaces (with ChatGPT generated Annotations)

/**
 * @brief Adds an item to an LRU list, handling NUMA and memcg-aware systems.
 *
 * @param lru Pointer to the list_lru structure representing the LRU list.
 * @param item The list_head representing the object to be added.
 * @param nid The NUMA node ID where the item resides.
 * @param memcg Pointer to the memory control group (memcg) structure; can be NULL
 *        if memcg-aware handling is not needed.
 * @return true if the item was successfully added, false if it was already in the list.
 * @note The caller must ensure the lifetime of the memcg (memory control group)
 *       passed to the function.
 */
bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid, struct mem_cgroup *memcg) {
    // Get the LRU node for the specified NUMA node ID (nid)
    struct list_lru_node *nlru = &lru->node[nid];
    // Will hold the specific LRU list ("one") associated with a particular memcg or the general list
    struct list_lru_one *l;

    // Lock the LRU node to ensure thread-safe operations (as LRU lists are shared resources)
    spin_lock(&nlru->lock);

    // Check if the item is not already in the list (i.e., it is currently "empty" in LRU terms)
    if (list_empty(item)) {
        // Get the specific LRU list for this memcg within the given node, or a default if memcg is not used.
        // `list_lru_from_memcg_idx()` returns the LRU list associated with the memory control group (memcg),
        // indexed by the kernel memory cgroup ID (memcg_kmem_id()).
        l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));

        // Add the item to the end of the LRU list (as the most recently used element).
        list_add_tail(item, &l->list);

        // If this is the first item being added to this LRU list, we need to set the shrinker bit.
        // This signals that the list now contains items that can be shrunk (reclaimed).
        if (!l->nr_items++) {
            // `set_shrinker_bit()` sets the shrinker bit, which marks that objects can be shrunk.
            // The shrinker bit is associated with the memcg, NUMA node, and LRU shrinker ID.
            set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
        }

        // Increment the total number of items in the LRU node (for the given NUMA node).
        nlru->nr_items++;

        // Unlock the spinlock as the critical section is done.
        spin_unlock(&nlru->lock);

        // Return true indicating the item was successfully added.
        return true;
    }

    // If the item was already in the list, simply unlock the spinlock and return false.
    spin_unlock(&nlru->lock);
    return false;
}
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/list_lru.c#L149
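The add/del pattern above relies on a property of the kernel's circular doubly linked lists: a detached node points to itself, so list_empty() doubles as an "am I on a list?" test. That pattern can be reproduced in miniature (our own names; no locking or memcg handling):

```c
struct lnode { struct lnode *prev, *next; };

/* A detached node points to itself, so "empty" means "not on any list". */
void lnode_init(struct lnode *n) { n->prev = n->next = n; }
int  lnode_empty(const struct lnode *n) { return n->next == n; }

/* Add at the tail (most-recently-used end); refuse double-insertion,
 * mirroring list_lru_add's list_empty() check. Returns 1 on success. */
int lru_add(struct lnode *head, struct lnode *item) {
    if (!lnode_empty(item))
        return 0;
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
    return 1;
}

/* Unlink and re-detach the node. Returns 1 if it was on a list. */
int lru_del(struct lnode *item) {
    if (lnode_empty(item))
        return 0;
    item->prev->next = item->next;
    item->next->prev = item->prev;
    lnode_init(item);
    return 1;
}
```

Re-initializing the node on deletion is what makes the double-insert and double-delete checks work without any extra bookkeeping.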
Memory allocation
• Shrinker Bit
• Shrinkers are used by the kernel to reclaim memory from caches when the system is under memory
pressure. Shrinkers are triggered when the kernel's memory management subsystem decides that it
needs to free up memory.
• Each shrinker is registered to handle specific types of caches (e.g., inode caches, dentry caches, LRU
caches). When invoked, a shrinker will attempt to free a number of objects from the cache.
• The Shrinker Bit:
• The shrinker bit is a flag used to indicate that a cache or LRU list has reclaimable objects (i.e., objects
that can be freed by the shrinker when memory is low).
• When the shrinker bit is set for a given memory control group (memcg) and NUMA node, it signals
the memory management system that this particular LRU list has items eligible for reclamation. This
helps the shrinker to target caches or lists that actually contain reclaimable objects, improving the
efficiency of memory reclamation.
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/page_alloc.c#L4659
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/page_alloc.c#L714
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/page_alloc.c#L739
Memory Management Techniques
• Buddy System: (Useful for Physical Memory allocation)
• How the Buddy System Works
• The buddy system divides memory into blocks (pages), where each block's size is a power of
two (e.g., 2, 4, 8, 16, 32... pages). The memory allocator splits and merges blocks dynamically
to satisfy allocation requests of different sizes.
• __alloc_pages_noprof(): Fast and Slow path
struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
                                  nodemask_t *nodemask){
    struct page *page;
    …
    /* First allocation attempt */
    page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
    …
    page = __alloc_pages_slowpath(alloc_gfp, order, &ac);
    …
    return page;
}
Memory Management Techniques
• Buddy System: (Useful for Physical Memory allocation)
• buddy_merge_likely(): Read from ref.
• __free_one_page(): Read from ref.
Memory Management Techniques
• Buddy System: Determine the Required Size (Order):
• The kernel first calculates the order of the required allocation. The order represents the
size of the block to be allocated in terms of powers of two.
• For example, if the kernel needs 2^3 = 8 pages, the order is 3.
• The order defines the smallest block that can accommodate the requested allocation.
• Search for Free Blocks in the Corresponding Order:
• The kernel maintains a set of free lists where each free list corresponds to a specific
order. Each free list holds blocks of memory of size 2^order pages.
• The kernel first checks the free list for the required order to see if there is a free block
of that size.
• If a block is available in that order, it is selected for allocation.
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/page_alloc.c#L3297
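The order computation above can be sketched directly: find the smallest power of two that covers the requested number of pages. The helper name is ours; the kernel's equivalent is get_order():

```c
/* Smallest order k such that 2^k pages >= npages. */
unsigned buddy_order(unsigned long npages) {
    unsigned order = 0;
    unsigned long block = 1;
    while (block < npages) {
        block <<= 1;     /* double the candidate block size */
        order++;
    }
    return order;
}
```

Note the rounding-up effect: a request for 5 pages is served from an order-3 (8-page) block, so up to 3 pages are wasted — the internal fragmentation cost the buddy system accepts in exchange for cheap splitting and merging.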
Memory Management Techniques
• Buddy Block Splitting (If No Suitable Block Exists):
• If no suitable block is available in the desired order, the kernel checks higher-order
blocks (i.e., blocks of larger sizes).
• When a higher-order block is found, the kernel splits that block into two smaller buddy
blocks of half the size, recursively until a block of the required size is created.
• For example, if the required block is of order 3 but only a block of order 4 is available,
the kernel splits the order 4 block into two order 3 blocks. One of these will be used for
allocation, and the other remains in the free list.
• Update the Free List:
• After identifying a suitable buddy block, the free list for that order is updated:
• The allocated block is removed from the free list.
• If a block was split, the remaining half is added back to the free list for its respective
order.
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/page_alloc.c#L3297
Virtual Memory
48
Swap
• Swap Space: Linux uses virtual memory, where the system can run more applications than available
physical memory. Swap allows the kernel to offload parts of memory that are not currently in use to disk,
freeing up RAM for active processes.
• Swap Decisions and Thresholds: The kernel makes decisions about swapping based on several thresholds
and metrics:
• Swappiness: Controls how aggressive the kernel is about swapping. A lower value means the kernel
prefers keeping pages in RAM, while a higher value means it will swap more readily. (e.g. $ cat
/proc/sys/vm/swappiness)
Swap
• How Swappiness Works
• The kernel decides which memory pages to reclaim when free memory falls below a certain
threshold. There are two types of pages that can be reclaimed:
• File-backed pages: Pages that belong to files on disk (e.g., executables or mapped files). These
pages can be easily dropped from memory because they can be reloaded from disk when
needed.
• Anonymous pages: Pages that don’t belong to any file (e.g., memory allocated by
applications). These need to be written to swap space on disk if they are evicted from RAM.
• The swappiness value influences how the kernel balances reclaiming these two types of pages:
• A low swappiness value means the kernel will prefer to keep anonymous pages (i.e., process
memory) in RAM and reclaim file-backed pages first, avoiding swap as much as possible.
• A high swappiness value makes the kernel more aggressive in moving anonymous pages to
swap, freeing up RAM for other purposes.
Swap
• Effect of Swappiness
• Swappiness = 0
• Effect: The kernel avoids swapping out anonymous pages for as long as possible. It will use
swap space only when absolutely necessary (i.e., when there’s a severe shortage of free
RAM).
• Use Case: Suitable for systems where swap is slow (e.g., on hard disk drives) or when
minimizing latency is critical, such as on desktops or real-time systems.
Swap
• Effect of Swappiness
• Swappiness = 100
• Effect: The kernel will swap out anonymous pages very aggressively. It treats swap space as
just another form of memory, using it as much as possible.
• Use Case: Suitable for systems with very little physical memory or systems where swap is fast
(e.g., on SSDs) and can be used without much performance penalty.
Swap
• Effect of Swappiness
• Swappiness = 60 (Default)
• Effect: A balanced approach. The kernel will swap out anonymous pages when memory
pressure builds up, but will try to avoid excessive swapping.
• Use Case: This is the default value on most systems, offering a good balance between using
swap and avoiding it.
Swap
• View/Change Swappiness
• $ cat /proc/sys/vm/swappiness
• $ echo 60 > /proc/sys/vm/swappiness
• Did you notice something?
/*
 * From 0 .. MAX_SWAPPINESS. Higher means more swappy.
 */
int vm_swappiness = 60;
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/vmscan.c#L202
Swap
• View/Change Swappiness
• $ cat /proc/sys/vm/swappiness
• $ echo 60 > /proc/sys/vm/swappiness
• Did you notice something?
• How is the kernel variable changed?
Swap
• View/Change Swappiness
• $ cat /proc/sys/vm/swappiness
• $ echo 60 > /proc/sys/vm/swappiness
• Did you notice something?
• How is the kernel variable changed?
• sysctl parameter

/* A sysctl table is an array of struct ctl_table: */
struct ctl_table {
    const char *procname;        /* Text ID for /proc/sys, or zero */
    void *data;
    int maxlen;
    umode_t mode;
    proc_handler *proc_handler;  /* Callback for text formatting */
    struct ctl_table_poll *poll;
    void *extra1;
    void *extra2;
} __randomize_layout;

{
    .procname     = "swappiness",
    .data         = &vm_swappiness,
    .maxlen       = sizeof(vm_swappiness),
    .mode         = 0644,
    .proc_handler = proc_dointvec_minmax,
    .extra1       = SYSCTL_ZERO,
    .extra2       = SYSCTL_TWO_HUNDRED,
},
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/sysctl.h#L134
https://elixir.bootlin.com/linux/v6.11-rc4/source/kernel/sysctl.c#L2075
Sysctl
• Sysctl Components
• “Sysctl is a means of configuring certain aspects of the kernel at run-time, and the /proc/sys/
directory is there so that you don't even need special tools to do it!”
• /proc/sys/ Virtual Filesystem:
• Sysctl interacts with the /proc/sys/ directory, which is part of the procfs (process filesystem).
This directory exposes various kernel parameters as virtual files.
• Each parameter corresponds to a file within /proc/sys/. For example:
• /proc/sys/vm/swappiness controls the swappiness value for memory
management.
• /proc/sys/net/ipv4/ip_forward controls IP forwarding for the networking
subsystem.
• These files can be read from and written to just like regular files, allowing users to view and
change kernel settings.
• The sysctl Command:
• The sysctl command-line utility is the primary user-space tool for reading and modifying kernel
parameters through this interface.
https://www.kernel.org/doc/Documentation/sysctl/README
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/sysctl.h#L127
Swap
• Swap Space: Linux uses virtual memory, where the system can run more applications than available
physical memory. Swap allows the kernel to offload parts of memory that are not currently in use to disk,
freeing up RAM for active processes.
• Swap Decisions and Thresholds: The kernel makes decisions about swapping based on several thresholds
and metrics:
• Swappiness: Controls how aggressive the kernel is about swapping. A lower value means the kernel
prefers keeping pages in RAM, while a higher value means it will swap more readily. (e.g. $ cat
/proc/sys/vm/swappiness)
• Memory Pressure: The kernel evaluates the system’s memory usage and determines when to start
swapping based on memory pressure.
• Page Reclaiming: If free RAM is low, the kernel begins reclaiming memory by moving pages to swap
(or dropping caches if possible).
• Inactive and Active Lists: Pages in memory are divided into active and inactive lists. Pages that are
not used frequently move to the inactive list and are the first candidates for being swapped out.
Swap
• Relevant Functions:
• add_to_swap_cache(): Adds a page to the swap cache, allocating space in the swap area.
• get_swap_pages(): Allocates swap slots for pages being swapped out.
• count_swpout_vm_event(): Counts swap-out VM events.
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/swap_state.c#L90
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/swapfile.c#L1071
https://elixir.bootlin.com/linux/v6.11-rc4/source/mm/page_io.c#L211
Virtual Address: Layout
• In Linux, each process operates with a virtual address space. The virtual address space is divided into kernel
space and user space:
• User space typically occupies the lower 3 GB of the address space on 32-bit systems and the
lower half of the canonical address range on 64-bit systems.
• Kernel space is mapped into every process's address space but is inaccessible to user processes.
Virtual Address: Paging Mechanism
• The MMU uses page tables to map a virtual address to a physical address. Page tables contain entries called
page table entries (PTEs), which hold information such as:
• The physical frame number (PFN), representing the actual physical memory.
• Status bits like validity, read/write permissions, and more.
Virtual Address: Address Translation
• Page Table Walk: When a virtual address is accessed, the MMU performs a "page table walk." In the x86-64
architecture, for example, virtual addresses are divided into sections to index multiple levels of page tables:
• Page Global Directory (PGD)
• Page Upper Directory (PUD)
• Page Middle Directory (PMD)
• Page Table Entry (PTE)
• Each level narrows down the lookup for the final physical frame.
• Translation: Once the PTE is found, it contains the physical frame number (PFN). The offset in the virtual
address is then added to this PFN to calculate the final physical address.
Virtual Address: Kernel Functions
• __pa() : This function converts a kernel virtual address to a physical address.
• __va(): This function converts a physical address back to a virtual address.
• pgd_offset(), pud_offset(), pmd_offset(): Offset finding
https://elixir.bootlin.com/linux/v6.11-rc4/source/arch/x86/include/asm/page.h#L41
https://elixir.bootlin.com/linux/v6.11-rc4/source/arch/x86/include/asm/page.h#L58
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/pgtable.h#L123
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/pgtable.h#L138
https://elixir.bootlin.com/linux/v6.11-rc4/source/include/linux/pgtable.h#L131
Virtual Address: TLB
• The Translation Lookaside Buffer (TLB) is a crucial part of the memory management system in modern
processors, including the Linux kernel. It acts as a cache for virtual-to-physical address mappings, speeding
up the translation process. Without the TLB, every memory access would require a full page table walk,
which can be slow. Instead, the TLB stores recently used page table entries (PTEs) to accelerate memory
accesses.
• In a multiprocessing system, multiple CPUs (or cores) run separate processes. Each CPU has its own
memory management unit (MMU) and its own TLB to cache page table entries (PTEs) for the processes it
handles.
• Small Size: The TLB is much smaller than the main memory (RAM), typically storing only a few dozen
to a few thousand entries.
• Fast Access: Since it is a hardware cache, the TLB provides near-instantaneous lookups for address
translations, often within one or two CPU cycles.
• Levels of TLB: In modern processors, there may be multiple levels of TLBs (L1 TLB, L2 TLB), similar to
how CPUs have multiple levels of regular cache (L1, L2, L3) to store memory data.
https://elixir.bootlin.com/linux/v6.11-rc4/source/arch/x86/mm/tlb.c#L1047
Virtual Address: TLB
• When a process accesses a virtual address, the Memory Management Unit (MMU) checks the TLB first:
• If the virtual address is found in the TLB (a TLB hit), the corresponding physical address is returned
immediately.
• If the address is not found in the TLB (a TLB miss), the MMU must perform a page table walk to find
the correct mapping and then store the result in the TLB for future accesses.
• The Linux kernel doesn’t directly control the TLB but interacts with it through software-managed TLB
maintenance operations. These operations are necessary because, after certain actions, the TLB needs to
be updated to stay consistent with the page tables. This includes cases like:
• Context switches (changing which process is running).
• Page table changes (e.g., loading new memory pages).
• Memory mapping updates (e.g., swapping out pages, unmapping a file).
https://elixir.bootlin.com/linux/v6.11-rc4/source/arch/x86/mm/tlb.c#L1047
Virtual Address: TLB Concepts
• TLB Flushing: When the kernel changes the memory mappings for a process (e.g., during a context switch
or when handling page faults), the TLB must be flushed to avoid stale entries. Flushing clears old
translations from the TLB, forcing the processor to perform a new page table walk for the next memory
access.
• flush_tlb_all()
• Lazy TLB: In Linux, a technique called lazy TLB switching is used to avoid unnecessary TLB flushes during
context switches. When switching from one process to another, the kernel doesn’t immediately flush the
TLB; instead, it delays this action until the new process actually uses different memory mappings.
• TLB Shootdown: In multiprocessor systems, when one processor modifies a page table entry that another
processor’s TLB might still have cached, it’s necessary to invalidate the TLB entry across all processors. This
process is known as TLB shootdown. The kernel sends inter-processor interrupts (IPIs) to other CPUs to
ensure their TLBs are flushed.
https://elixir.bootlin.com/linux/v6.11-rc4/source/arch/x86/mm/tlb.c#L1047
Inter-processor Interrupts
• IPIs are typically set up during the system initialization phase, configuring the interrupt controllers and
enabling inter-processor communication. The setup could be architecture-specific, often found in
architecture-specific sources.
• smp_call_function(): This function calls a specified function on all CPUs. It can also be used to send
IPIs to execute specific tasks on remote CPUs.

void smp_call_function(void (*func)(void *), void *info, int wait)
{
    // Code to send IPI and call the specified function on all CPUs
}

SMP operations:

struct smp_ops {
    void (*smp_prepare_boot_cpu)(void);
    void (*smp_prepare_cpus)(unsigned max_cpus);
    void (*smp_cpus_done)(unsigned max_cpus);
    void (*stop_other_cpus)(int wait);
    void (*crash_stop_other_cpus)(void);
    void (*smp_send_reschedule)(int cpu);
    void (*cleanup_dead_cpu)(unsigned cpu);
    void (*poll_sync_state)(void);
    int (*kick_ap_alive)(unsigned cpu, struct task_struct *tidle);
    int (*cpu_disable)(void);
    void (*cpu_die)(unsigned int cpu);
    void (*play_dead)(void);
    void (*stop_this_cpu)(void);
    void (*send_call_func_ipi)(const struct cpumask *mask);
    void (*send_call_func_single_ipi)(int cpu);
};
…
https://elixir.bootlin.com/linux/v6.11-rc4/source/arch/x86/mm/tlb.c#L1047