Understanding Virtual Memory in Red Hat Enterprise Linux 3

Norm Murray and Neil Horman

December 13, 2005

Contents

1 Introduction
2 Definitions
  2.1 What Comprises a VM
  2.2 MMU
  2.3 Zoned Buddy Allocator
  2.4 Slab Allocator
  2.5 Kernel Threads
  2.6 Components that use the VM
3 The Life of A Page
4 Tuning the VM
  4.1 bdflush
  4.2 dcache priority
  4.3 hugetlb pool
  4.4 inactive clean percent
  4.5 kswapd
  4.6 max map count
  4.7 max-readahead
  4.8 min-readahead
  4.9 overcommit memory
  4.10 overcommit ratio
  4.11 pagecache
  4.12 page-cluster
  4.13 kscand work percent
  4.14 oom-kill
  4.15 numa memory allocator
5 Example Scenarios
  5.1 File (IMAP, Web, etc.) Server
  5.2 General Compute Server With Many Active Users
  5.3 Non-interactive (Batch) Computing Server

List of Figures

1 High level overview of VM subsystem
2 Illustration of a virtual to physical memory translation
3 Diagram of the VM page state machine
1 Introduction
One of the most important aspects of an operating system is the Virtual Mem-
ory Management system. Virtual Memory (VM) allows an operating system to
perform many of its advanced functions, such as process isolation, file caching,
and swapping. As such, it is imperative that an administrator understand the
functions and tunable parameters of an operating system’s virtual memory man-
ager so that optimal performance for a given workload may be achieved. This
article is intended to provide a system administrator a general overview of how
a VM works, specifically the VM implemented in Red Hat Enterprise Linux 3
(RHEL3). After reading this document, the reader should have a rudimentary
understanding of the data the RHEL3 VM controls and the algorithms it uses.
Further, the reader should have a fairly good understanding of general Linux
VM tuning techniques. It is important to note that Linux as an operating system
has a proud legacy of overhaul. Items which no longer serve useful purposes,
or which have better implementations as technology advances, are phased out.
This implies that the tuning parameters described in this article may be out of
date if you are using a newer or older kernel. Fear not, however! With a well
grounded understanding of the general mechanics of a VM, it is fairly easy to
convert one's knowledge of VM tuning to another VM. The same general principles
apply, and documentation for a given kernel (including its specific tunable
parameters) can be found in the corresponding kernel source tree under the file
Documentation/sysctl/vm.txt.
2 Definitions
To properly understand how a Virtual Memory Manager does its job, it helps
to understand what components comprise a VM. While the low level details
of a VM are overwhelming for most, a high level view is nonetheless helpful
in understanding how a VM works, and how it can be optimized for various
workloads. A high level overview of the components that make up a Virtual
Memory manager is presented in Figure 1 below:
Figure 1: High level overview of the VM subsystem. User processes (bash, httpd,
mozilla) reach the kernel through the standard C library (glibc) and syscalls;
kernel subsystems (VFS, network, kswapd) and the slab allocator sit atop the
zoned buddy allocator, which manages physical memory via the MMU.
2.2 MMU
The Memory Management Unit (MMU) is the hardware base that makes a Vir-
tual Memory system possible. The MMU allows software to reference physical
memory by aliased addresses, quite often more than one. It accomplishes this
through the use of pages and page tables. The MMU uses a section of memory to
translate virtual addresses into physical addresses via a series of table lookups.
Various processor architectures perform this function in slightly different ways,
but in general Figure 2 illustrates how a translation is performed from a virtual
address to a physical address:
Figure 2: Illustration of a virtual to physical memory translation. The virtual
address is divided into a page directory index, a page middle index, a page
table index, and a page offset; successive table pointers lead to the physical
page.
Each table lookup provides a pointer to the base of the next table, as well
as a set of extra bits which provide auxiliary data regarding that page or set
of pages. This information typically includes the current page status, access
privileges, and size. A separate portion of the virtual address being accessed
provides an index into each table in the lookup process. The final table provides
a pointer to the start of the physical page corresponding to the virtual address
in RAM, while the last field in the virtual address selects the actual word in
the page being accessed. Any one of the table lookups during this translation
may direct the lookup operation to terminate and drive the operating system
to perform another action. Some of these actions are somewhat observable at a
system level, and have common names or references:
• Segmentation Violation - A user space process requests a virtual ad-
dress, and during the translation the kernel is interrupted and informed
that the requested translation has resulted in a page which it has not
allocated, or which the process does not have permission to access. The
kernel responds by signaling the process that it has attempted to access
an invalid memory region, after which it is terminated.
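The table walk described above can be sketched in a few lines of Python. The bit widths and the nested-dictionary representation of the page tables are illustrative assumptions, not those of any particular architecture:

```python
# Sketch of the multi-level page table walk described above.
# Bit widths are illustrative, not those of any real architecture.
DIR_BITS, MID_BITS, TABLE_BITS, OFFSET_BITS = 10, 9, 9, 12

def split_virtual_address(vaddr):
    """Split a virtual address into directory, middle, and table indices
    plus the offset of the word within the page."""
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    table = (vaddr >> OFFSET_BITS) & ((1 << TABLE_BITS) - 1)
    mid = (vaddr >> (OFFSET_BITS + TABLE_BITS)) & ((1 << MID_BITS) - 1)
    directory = (vaddr >> (OFFSET_BITS + TABLE_BITS + MID_BITS)) & ((1 << DIR_BITS) - 1)
    return directory, mid, table, offset

def translate(vaddr, page_tables):
    """Walk nested dicts standing in for the directory/middle/page tables.
    A missing entry terminates the walk, as a real MMU would fault."""
    d, m, t, off = split_virtual_address(vaddr)
    try:
        phys_page = page_tables[d][m][t]
    except KeyError:
        raise MemoryError("no translation for 0x%x" % vaddr)
    return (phys_page << OFFSET_BITS) | off
```

Each lookup either yields a pointer to the next table or faults, mirroring how a failed translation hands control to the kernel (for example, as a segmentation violation).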
are responsible for finer grained control over allocation size. For more infor-
mation regarding the finer details of a buddy allocator, refer to [1]. Note that
the Buddy allocator also manages memory zones, which define pools of memory
which have different purposes. Currently there are three memory pools which
the buddy allocator manages accesses for:
• DMA - This zone consists of the first 16 MB of RAM, from which legacy
devices allocate to perform direct memory operations
• NORMAL - This zone encompasses memory addresses from 16 MB to
1 GB and is used by the kernel for internal data structures, as well as
other system and user space allocations.
• HIGHMEM - This zone includes all memory above 1 GB and is used ex-
clusively for system allocations (file system buffers, user space allocations,
etc).
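The zone boundaries above apply to 32-bit x86 systems and differ on other architectures, but as a rough illustration they can be expressed as:

```python
# Sketch of the x86 zone boundaries described above; purely illustrative.
DMA_LIMIT = 16 * 1024 * 1024        # first 16 MB of RAM
NORMAL_LIMIT = 1024 * 1024 * 1024   # 16 MB up to 1 GB

def zone_for(phys_addr):
    """Return the name of the buddy allocator zone holding phys_addr."""
    if phys_addr < DMA_LIMIT:
        return "DMA"
    if phys_addr < NORMAL_LIMIT:
        return "NORMAL"
    return "HIGHMEM"
```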
is set, it locks the page, and initiates a disk I/O operation to move the page
to swap space on the hard drive. When the I/O operation is complete, kswapd
modifies the page table entry to indicate that the page has been migrated to
disk, unlocks the page and places it back on the free list, available for further
allocations.
3 The Life of A Page
All of the memory managed by the VM is labeled by a state. These states let
the VM know what to do with a given page under various circumstances.
Depending on the current needs of the system, the VM may transfer pages from
one state to the next, according to the state machine diagrammed in Figure 3
below. Using these states, the VM can determine what is being done with a
page by the system at a given time and what actions it (the VM) may take on
the page. The states that have particular meanings are as follows:
• FREE - All pages available for allocation begin in this state. This indi-
cates to the VM that the page is not being used for any purpose and is
available for allocation.
• ACTIVE - Pages which have been allocated from the Buddy Allocator
enter this state. It indicates to the VM that the page has been allocated
and is actively in use by the kernel or a user process.
• INACTIVE DIRTY - This state indicates that the page has fallen into
disuse by the entity which allocated it, and as such is a candidate for
removal from main memory. The kscand task periodically sweeps through
all the pages in memory, taking note of the amount of time the page has
been in memory since it was last accessed. If kscand finds that a page
has been accessed since it last visited the page, it increments the page's
age counter; otherwise, it decrements that counter. If kscand happens on
a page whose age counter is at zero, then the page is moved to the
inactive dirty state. Pages in the inactive dirty state are kept in a list of
pages to be laundered.
• INACTIVE LAUNDERED - This is an interim state in which those
pages which have been selected for removal from main memory enter while
their contents are being moved to disk. Only pages which were in the inac-
tive dirty state enter this state. When the disk I/O operation is complete
the page is moved to the inactive clean state, where it may be deallocated,
or overwritten for another purpose. If, during the disk operation, the page
is accessed, the page is moved back into the active state.
• INACTIVE CLEAN - Pages in this state have been laundered. This
means that the contents of the page are in sync with their backing data
on disk. As such, they may be deallocated by the VM or overwritten for
other purposes.
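The transitions named in the state descriptions above can be collected into a small state machine. This is a simplification for illustration; the event names below are not kernel identifiers:

```python
# Simplified sketch of the RHEL3 page state machine described above.
# Event names are illustrative, not kernel identifiers.
TRANSITIONS = {
    ("FREE", "allocate"): "ACTIVE",
    ("ACTIVE", "age_to_zero"): "INACTIVE_DIRTY",          # kscand demotes idle pages
    ("INACTIVE_DIRTY", "launder"): "INACTIVE_LAUNDERED",  # writeback I/O starts
    ("INACTIVE_LAUNDERED", "io_complete"): "INACTIVE_CLEAN",
    ("INACTIVE_LAUNDERED", "accessed"): "ACTIVE",         # touched mid-I/O
    ("INACTIVE_CLEAN", "deallocate"): "FREE",
}

def next_state(state, event):
    """Return the page's next state; an unlisted event leaves it in place."""
    return TRANSITIONS.get((state, event), state)
```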
Figure 3: Diagram of the VM page state machine.
4 Tuning the VM
Now that the picture of the VM mechanism is sufficiently illustrated, how is it
adjusted to fit certain workloads? There are two methods for changing tunable
parameters in the Linux VM. The first is the sysctl interface. The sysctl interface
is a programming oriented interface, which allows software programs to directly
modify various tunable parameters. The sysctl interface is exported to system
administrators via the sysctl utility, which allows an administrator to specify a
specific value for any of the tunable VM parameters on the command line, as
in the following example:

# sysctl -w vm.max-readahead=64
The sysctl utility also supports the use of a configuration file (/etc/sysctl.conf),
in which all the desirable changes to a VM can be recorded for a system and
restored after a restart of the operating system, making this access method
suitable for long term changes to a system's VM. The file is straightforward in
its layout, using simple key-value pairs, with comments for clarity, as in the
following example:
# Adjust the min and max read-ahead for files
vm.max-readahead=64
vm.min-readahead=32

# Turn on memory over-commit
vm.overcommit_memory=2

# Bump up the percentage of memory in use to activate bdflush
vm.bdflush="40 500 0 0 500 3000 60 20 0"
The second method of modifying VM tunable parameters is via the proc file
system. This method exports every group of VM tunables as a file, accessible via
all the common Linux utilities used to modify file contents. The VM tunables
are available in the directory /proc/sys/vm, and are most commonly read and
modified using the Linux cat and echo utilities, as in the following example:
# cat /proc/sys/vm/kswapd
512 32 8
# echo 511 31 7 > /proc/sys/vm/kswapd
# cat /proc/sys/vm/kswapd
511 31 7
4.1 bdflush
The bdflush file contains nine parameters, of which six are tunable. These
parameters affect the rate at which pages in the buffer cache are written to
disk and freed. By adjusting the various values in this file, a system can be
tuned to achieve better performance in environments where large amounts of
file I/O are performed. The following parameters are defined in the order they
appear in the file:
nfract - The percentage of dirty pages in the buffer cache required to
activate the bdflush task

ndirty - The maximum number of dirty pages in the buffer cache to write
to disk in each bdflush execution

reserved1 - Reserved for future use

reserved2 - Reserved for future use

interval - The number of jiffies (10 ms periods) to delay between bdflush
iterations

age_buffer - The time for a normal buffer to age before it is considered
for flushing back to disk

nfract_sync - The percentage of dirty pages in the buffer cache required
to cause the tasks which are dirtying pages of memory to start writing
those pages to disk themselves, slowing the dirtying process

nfract_stop_bdflush - The percentage of dirty pages in the buffer cache
required to allow bdflush to return to an idle state

reserved3 - Reserved for future use
Generally, systems that require more free memory for application allocation
will want to set these values higher (except for age_buffer, which would be
moved lower), so that file data is sent to disk more frequently, and in greater
volume, thus freeing up pages of RAM for application use. This of course comes
at the expense of CPU cycles, as the system processor spends more time moving
data to disk, and less time running applications. Conversely, systems which are
required to perform large amounts of I/O would want to do the opposite to
these values, allowing more RAM to be used to cache disk files, so that file
access is faster.
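The interplay between nfract, nfract_sync, and nfract_stop_bdflush can be sketched as follows. This is an illustrative model of the behavior described above, not kernel code; the default percentages mirror the sysctl.conf example shown earlier (40, 60, 20):

```python
# Illustrative model of how the bdflush thresholds interact.
# Defaults mirror the example string "40 500 0 0 500 3000 60 20 0".
def bdflush_action(dirty_percent, nfract=40, nfract_sync=60, nfract_stop=20):
    """Return what the VM does at a given dirty-page percentage."""
    if dirty_percent >= nfract_sync:
        return "writers flush synchronously"   # dirtying tasks are throttled
    if dirty_percent >= nfract:
        return "wake bdflush"                  # background writeback runs
    if dirty_percent <= nfract_stop:
        return "bdflush idles"
    return "no change"
```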
time on an otherwise heavily loaded system. If you experience intolerable delays
in communicating with your system when it is busy performing other work,
increasing this parameter may help you.
4.5 kswapd
While this set of parameters previously defined how frequently and in what
volume a system moved non-buffer-cache pages to disk, in Red Hat Enterprise
Linux 3 these controls are unused.
space. These areas are created during the life of the process when the program
attempts to memory map a file, link to a shared memory segment, or simply
allocates heap space. Tuning this value limits the number of these VMAs that
a process can own. Limiting the number of VMAs a process can own can lead
to problematic application behavior, as the system will return out of memory
errors when a process reaches its VMA limit, but it can free up lowmem for
other kernel uses. If your system is running low on memory in the ZONE_NORMAL
zone, then lowering this value will help free up memory for kernel use.
4.7 max-readahead
This tunable affects how early the Linux VFS will fetch the next block of a
file from disk. File readahead values are determined on a per file basis in
the VFS and are adjusted based on the behavior of the application accessing
the file. Anytime the current position being read in a file plus the current
readahead value results in the file pointer pointing to the next block in the
file, that block will be fetched from disk. By raising this value, the Linux
kernel will allow the readahead value to grow larger, resulting in more blocks
being prefetched for files which are predictably accessed in a uniform, linear
fashion. This can result in performance improvements, but can also result in
excess (and often unnecessary) memory usage. Lowering this value has the
opposite effect. By forcing readaheads to be less aggressive, memory may be
conserved at a potential performance impact.
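The way a file's readahead window moves between the min and max bounds can be sketched roughly as below. The doubling-on-sequential-access rule and the default bounds are illustrative assumptions, a simplification of the real VFS logic:

```python
# Sketch of a per-file readahead window bounded by min/max-readahead.
# The growth rule and default bounds (3 and 31 pages) are assumptions
# for illustration, not the exact VFS algorithm.
def next_readahead(window, access_was_sequential, min_ra=3, max_ra=31):
    """Return the new readahead window (in pages) after one access."""
    if access_was_sequential:
        window = window * 2          # grow toward max-readahead
    else:
        window = min_ra              # random access: collapse to the floor
    return max(min_ra, min(window, max_ra))
```

Raising max_ra lets the window grow larger for linear readers; raising min_ra keeps the window unconditionally higher, as the min-readahead section below describes.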
4.8 min-readahead
Where max-readahead places a ceiling on the readahead value, min-readahead
places a floor on it. Raising this number forces a file's readahead value to be
unconditionally higher, which can bring about performance improvements,
provided that all file access in the system is predictably linear from the start
to the end of a file. This of course results in higher memory usage from the
pagecache. Conversely, lowering this value allows the kernel to conserve
pagecache memory, at a potential performance cost.
defined by the overcommit ratio value (defined below). Enabling this feature can
be somewhat helpful in environments which allocate large amounts of memory
expecting worst case scenarios, but do not use it all.
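With strict overcommit enabled, the kernel caps the total address space it will hand out. The commit limit is commonly computed as swap plus a fraction of RAM set by overcommit ratio; treat the sketch below as an approximation of the kernel's exact accounting rather than a definitive formula:

```python
# Sketch of the strict-overcommit limit (overcommit memory = 2).
# The swap + RAM * ratio formula is the commonly documented one;
# the RHEL3 kernel's exact accounting may differ in detail.
def commit_limit(ram_bytes, swap_bytes, overcommit_ratio=50):
    """Total virtual memory the kernel will allow to be committed."""
    return swap_bytes + ram_bytes * overcommit_ratio // 100

def allocation_allowed(requested, committed, ram_bytes, swap_bytes, ratio=50):
    """Would a new allocation fit under the commit limit?"""
    return committed + requested <= commit_limit(ram_bytes, swap_bytes, ratio)
```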
4.11 pagecache
The pagecache file adjusts the amount of RAM which can be used by the page-
cache. The pagecache holds various pieces of data, such as open files from disk,
memory mapped files and pages of executable programs. Modifying the values
in this file dictates how much memory is used for this purpose:
min - The minimum amount of memory to reserve for pagecache use

borrow - kswapd balances the reclaiming of pagecache pages and process
memory to reach this percentage of pagecache pages

max - If more memory than this percentage is taken by the pagecache,
kswapd will only evict pages from the pagecache. Once the amount of
memory in pagecache is below this threshold, then kswapd will again
allow itself to move process pages to swap.
Adjusting these values upward allows more programs and cached files to stay in
memory longer, thereby allowing applications to execute more quickly. On
memory starved systems, however, this may lead to application delays, as
processes must wait for memory to become available. Moving these values
downward swaps processes and other disk-backed data out more quickly,
allowing other processes to obtain memory more easily and increasing execution
speed. For most workloads the automatic tuning works fine. However, if your
workload suffers from excessive swapping and a large cache, you may want to
reduce the values until the swapping problem goes away.
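The three thresholds interact roughly as follows. This is an illustrative model of the behavior described above, not kernel code, and the percentages passed in are placeholders rather than documented defaults:

```python
# Illustrative model of kswapd's pagecache reclaim policy.
# The threshold values used in any call are placeholders, not defaults.
def reclaim_target(pagecache_percent, minimum, borrow, maximum):
    """What kswapd reclaims at a given pagecache share of memory."""
    if pagecache_percent > maximum:
        return "pagecache only"              # never touch process pages
    if pagecache_percent > borrow:
        return "mostly pagecache"            # rebalance toward 'borrow'
    if pagecache_percent > minimum:
        return "pagecache and process pages"
    return "process pages only"              # protect the pagecache floor
```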
4.12 page-cluster
The kernel will attempt to read multiple pages from disk on a page fault, in
order to avoid excessive seeks on the hard drive. This parameter defines the
number of pages the kernel will try to read from disk during each page
fault. The value is interpreted as 2^page-cluster pages for each page fault. A
page fault is encountered every time a virtual memory address is accessed for
which there is not yet a corresponding physical page assigned, or for which the
corresponding physical page has been swapped to disk. If the memory address
has been requested in a valid way (i.e. the application contains the address in
its virtual memory map) then the kernel will associate a page of ram with the
address, or retrieve the page from disk and place it back in ram, and restart
the application from where it left off. By increasing the page-cluster value,
pages subsequent to the requested page will also be retrieved, meaning that if
the workload of a particular system accesses data in ram in a linear fashion,
increasing this parameter can provide significant performance gains (much like
the file read-ahead parameters described earlier). Of course, if your workload
accesses data discretely in many separate areas of memory, then this can just
as easily cause performance degradation.
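The 2^page-cluster relationship is simple to state in code (the 4 KB page size assumed below is the common x86 value; it is architecture dependent):

```python
PAGE_SIZE = 4096  # common 4 KB page; architecture dependent

def pages_read_per_fault(page_cluster):
    """Pages the kernel tries to read around each fault: 2^page-cluster."""
    return 2 ** page_cluster

def bytes_read_per_fault(page_cluster, page_size=PAGE_SIZE):
    """Bytes fetched per fault at a given page size."""
    return pages_read_per_fault(page_cluster) * page_size
```

So a page-cluster of 3 pulls in eight pages, or 32 KB with 4 KB pages, on every fault.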
must pause for the entire time it takes to walk the active list and this parameter
will not significantly change that total time.
4.14 oom-kill
In kernel versions 2.4.21-37 and later, a new parameter called oom-kill will be
found. This parameter controls the maximum number of processes that have
been killed due to an out of memory situation but have not yet terminated.
The default is 1: one process at a time can be in the state of terminating after
being OOM killed. In RARE situations it may be appropriate to set this value
to 0. A value of 0 will disable the OOM killer entirely. At first glance this
might seem good to users who are experiencing OOM kills, but remember that
the OOM killer only kills a process when the system is very close to becoming
completely and totally unusable. If the machine actually did exhaust all pages
of memory, it would be unable to complete ANY new work; the system would
just hang and be absolutely unrecoverable. An example of when the OOM killer
might reasonably be disabled is a compute job or scientific application where it
is known that almost every bit of memory will be used, but all processes
running can be absolutely trusted never to use it all. An example of when
tuning this parameter up might be appropriate is when you often run into OOM
situations and are also known to have many processes spending lots of time in
uninterruptible wait states (such as I/O that never returns); since those
processes will not be able to handle a kill signal, allowing more kills in
flight gives you a better chance of successfully killing a process and saving
the system from a true out of memory state. This parameter should rarely be
tuned.
5 Example Scenarios
Now that we have covered the details of kernel tuning, let’s look at some ex-
ample workloads and the various tuning parameters that may improve system
performance.
# Crank up overcommit; processes can sleep as they are not interactive
vm.overcommit_memory=2
vm.overcommit_ratio=75
References
[1] Bovet, Daniel & Cesati, Marco. Understanding the Linux Kernel.
O'Reilly & Associates, Sebastopol, CA, 2001.
[2] Matthews, Bob & Murray, Norm. Virtual Memory Behavior in Red Hat
Linux A.S. 2.1. Red Hat whitepaper, Raleigh, NC, 2001.
[3] Van Riel, Rik. Towards an O(1) VM. 2003.
http://surriel.com/lectures/ols2003/.