
KHB: Transparent support for large pages

June 19, 2006

This article was contributed by Valerie Aurora

Introduction: The Kernel Hacker's Bookshelf

A lot of great operating systems research goes on, but relatively little of it makes the leap into production operating systems, or from one operating system to another. The ideas that do trickle down into implementation are often delayed by years. Usually an idea gets ignored because it looked good in the research lab but turned out not to be practical in a production environment. But every so often, a practical idea goes unnoticed for years simply because none of the actual coders has the time to sit down and parse fifteen pages of dry academic prose. You're too busy writing code; can't someone make it easier to figure out which books and papers are worth reading?

Welcome to The Kernel Hacker's Bookshelf. The goal of this series is to bring good research and good kernel hackers together through reviews focusing on the practical aspects of research, written in plain (possibly even entertaining) language. We hope you enjoy reading these articles - and writing code inspired by them!

Transparent operating system support for large pages

While Moore's Law tramped inexorably on during the last few decades, increasing memory size and disk space along with transistor density, it left some elements of computer architecture in the dust. One of these stragglers is TLB coverage. The TLB (or Translation Look-aside Buffer) caches translations between virtual and physical memory addresses; usually every memory access requires a translation. Performance is best when all the needed translations can fit in the TLB and translations "hit" in the TLB instead of missing. The amount of memory translated by the entries in the TLB is called the TLB coverage. TLB coverage has been dropping as a fraction of total memory (and, more importantly, as a fraction of the total size of Netscape - er, Mozilla - er, Firefox), and TLB misses are often a serious drag on system performance.
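To put some purely illustrative numbers on it: with a fixed number of TLB entries, coverage scales directly with the page size. The entry count and page sizes below are examples, not any particular processor.

    /* Rough illustration of TLB coverage: entries times page size.
     * The numbers are hypothetical, not taken from a real CPU. */
    #include <stdio.h>

    int main(void)
    {
        long entries = 128;                 /* data-TLB entries (illustrative) */
        long base_page = 4 * 1024;          /* 4KB base pages */
        long large_page = 2 * 1024 * 1024;  /* 2MB large pages */

        printf("coverage with 4KB pages: %ld KB\n",
               entries * base_page / 1024);
        printf("coverage with 2MB pages: %ld MB\n",
               entries * large_page / (1024 * 1024));
        return 0;
    }

With these example numbers, 128 entries cover 512KB of 4KB pages but 256MB of 2MB pages - the same hardware, three orders of magnitude more coverage.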

Since translations are done on a per-page basis, one solution is to increase the size of the system pages. We could increase the base page size - the smallest page available - but that would typically waste a lot of memory, cause more page-outs, trigger unexpected application bugs (ask me about the one with the JVM default stack size and 64KB pages some time), and make the system slower overall. Instead, many processors now offer multiple page sizes, beginning with a base page size of 4KB or 8KB and ranging up to a large page size of 2MB or occasionally a truly monstrous page size of 256MB or larger. Large pages increase TLB coverage and can reduce TLB misses significantly, often improving the performance of applications with large working sets by 10-15%. On the other hand, large pages can reduce performance by increasing the cost of paging memory in and out and adding the overhead of tracking several different page sizes. Implementing automatic, transparent OS-level support for large pages while simultaneously improving overall performance is not easy. It's also what Linux users are clamoring for - and some of them are switching to operating systems that already have automatic large page support (cough, cough, Solaris).

A Solution: The Rice Paper

Practical, Transparent Operating System Support for Superpages by Juan Navarro, et al., describes a sophisticated and elegant implementation of transparent large pages. The authors implemented their system on FreeBSD on the Alpha processor, using 4 page sizes: 8KB, 64KB, 512KB, and 4MB. The paper was published in 2002, otherwise they might have picked a less ill-omened architecture than Alpha; fortunately the design is reasonably generic. Overall, this paper is one of the best I've ever read.

The basic design is reservation-based; that is, enough pages to make a large page are reserved in advance and later promoted to a large page when justified. Memory fragmentation is reined in via careful grouping of page types and a smarter page replacement policy. Almost all applications tested saw at least some speed-up, and absolute worst case performance degradation varied from 2-9%. Most amazing of all, the implementation only required about 3500 lines of code - about half of an ext2. How exactly did they accomplish all this? Buckle up for some nitty-gritty details.

First, a run of contiguous pages suitable for a large page is reserved whenever an application page fault occurs (outside an existing reservation, of course). The size of the reservation is picked based on the size and type of the memory object, with slight variations depending on whether the object is fixed in size (e.g., text) or might grow (e.g., stack). For example, an application with 700KB of text would have a 512KB page reserved the first time a page in the text was faulted into memory. Once a large-page-sized region has been fully populated (all of its base pages referenced at least once), it is promoted into a large page. In our example, once a contiguous 64KB region anywhere in the program text has been faulted in, it will be promoted to a 64KB page. Promotion of a partially populated page is possible, but the trade-off is that it may increase the application's total memory usage, unintentionally creating a memory hog.
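A minimal sketch of that size selection - my own simplification, with a made-up function name and the Alpha page sizes from the paper - might look like this:

    /* Pick the largest supported page size that fits inside the memory
     * object.  The real policy also considers alignment and object type;
     * this is only the core idea. */
    #include <stdio.h>
    #include <stddef.h>

    static const size_t page_sizes[] = {
        8 * 1024, 64 * 1024, 512 * 1024, 4 * 1024 * 1024
    };

    static size_t pick_reservation(size_t object_size)
    {
        size_t best = page_sizes[0];
        for (size_t i = 0; i < sizeof(page_sizes) / sizeof(page_sizes[0]); i++)
            if (page_sizes[i] <= object_size)
                best = page_sizes[i];
        return best;
    }

    int main(void)
    {
        /* The 700KB text segment from the example gets a 512KB reservation. */
        printf("%zu KB\n", pick_reservation(700 * 1024) / 1024);
        return 0;
    }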

In the rough and tumble world of scarce memory, promotion is not a one-way street. Demotion of large pages into smaller pages is also useful. An application may start out using all pages in a large page but then stop referencing most of the pages. The only way to tell is to demote the large page and check the referenced bits on the smaller pages a little while later. A page is demoted when it is first written, when one of its base pages is evicted, and periodically when the system is under memory pressure.

When an application wants more memory and no free space is available, unused parts of a reservation are preempted. "Use it or lose it" is the name of the game here. The reservation which loses is the one whose most recent allocation occurred least recently - LRU order, basically - since most applications touch most of their working set soon after starting up, and so it's unlikely the original owner of the reservation will need the space. Unused reservations live on different lists depending on the size of the allocation that can be made by preempting the reservation. A population map, implemented as a radix tree, keeps track of which pages are allocated inside each large page-sized extent for easy look up. This radix tree is a key data structure; it makes allocation, reservation, and promotion decisions fast and simple.
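The paper's population map is a radix tree spanning all the page sizes; as a much simpler stand-in, a per-extent bitmap already shows the idea - recording a fault and checking whether an extent is promotable become cheap bit operations. This sketch is my own simplification, not the paper's code.

    /* Simplified population map for one large-page-sized extent:
     * one bit per base page.  The real implementation is a radix tree
     * covering every page size; this is a minimal stand-in. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define BASE_PAGES_PER_EXTENT 64   /* e.g. a 512KB extent of 8KB base pages */

    struct popmap {
        uint64_t populated;   /* bit i set => base page i has been faulted in */
    };

    static void popmap_fault(struct popmap *pm, unsigned idx)
    {
        pm->populated |= (uint64_t)1 << idx;
    }

    /* Promote only once every base page has been referenced at least once. */
    static bool popmap_promotable(const struct popmap *pm)
    {
        return pm->populated == UINT64_MAX;
    }

    int main(void)
    {
        struct popmap pm = { 0 };

        for (unsigned i = 0; i < BASE_PAGES_PER_EXTENT; i++)
            popmap_fault(&pm, i);
        printf("promotable: %s\n", popmap_promotable(&pm) ? "yes" : "no");
        return 0;
    }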

The final key elements are the page replacement policy and the way pages of various types are grouped together. There are several different kinds of pages in the system. Some pages can't be moved or freed (pinned), some pages are in use but can be moved (active), and some pages are not currently used by anyone but may be used in the future (cached and/or inactive). If these pages are mixed together indiscriminately, pinned and active pages end up scattered everywhere, without any contiguous runs of free (or free-able) pages that can be converted into hotly pursued large pages. Fragmentation needs to be both prevented and repaired - without hurting performance by moving around pages too much.

Pinned pages are the most difficult problem, since once allocated they cannot be moved and may never be freed. The system tries to allocate these pages in clusters, so they break up as few potential large pages as possible. Similarly, cache pages are allocated in clusters with free pages, since cached pages can be easily freed to allow the creation of a large page. Reservations can include cache pages, and cached pages contained inside a reservation continue to be active until the application actually needs to kick that page out.

The page replacement daemon was changed to run not only when free memory runs low, but also when contiguity runs low. An "innocent until proven guilty" algorithm works here - we assume we don't need more contiguity until a large page reservation fails for lack of contiguity. When woken for this reason, the daemon runs just long enough to recover enough contiguous space to satisfy the allocations that failed. The page aging algorithm was changed slightly from the FreeBSD default; cached pages for a file are marked inactive on the last close, trading off the chance of the file being reopened against the opportunity for more contiguity.

Evaluating the System

The authors tested their system against a truly startling variety of applications, everything from gzip to web server trace replays to fast Fourier transforms, as well as a section exploring worst case situations. Personally, I'm not sure I've ever seen a better evaluation in a research paper; it's quite a treat to read.

In the best case, with low fragmentation, 33 out of 35 applications showed some improvement (one was unchanged, and the other was about 2% slower). Several had significant improvements. For example, rotating an image using ImageMagick was about 20% faster; linking the FreeBSD kernel was about 30% faster; bzip2 was 14% faster. In the fragmented case, performance was not as good, but usually picked up again after a few runs as the page replacement daemon moved things around. In the worst-case department, the performance was degraded by about 9% for an application that only touched one byte per large page before freeing it, and by about 2% for a test case in which large page promotion was turned off. It makes for a pretty convincing case that large pages are an overall win for many systems.

Implications for Linux

What does this paper tell us? It is possible to implement transparent large page support in such a way that most applications get at least some benefit, and some applications get a lot of benefit. The algorithms used are relatively simple to understand and implement, and hold up well in worst case behavior. Finally, transparent large pages can be implemented elegantly and cleanly - only 3500 lines of code! Best of all, this paper includes a plethora of implementation details and smart algorithms, just begging to be reused. All of the above earns this paper a hallowed place on the Kernel Hacker's Bookshelf.

Over the past few years, several Linux developers have been working on various forms of transparent large page support. Some of that recent work, spearheaded by Mel Gorman, has been reviewed earlier in LWN.

Current work on large pages in Linux is summarized on the linux-mm wiki.

I look forward to more work in this fascinating and fertile area of operating systems implementation.

[Do you have a favorite textbook or systems paper? Of course you do. Send your suggestions to:

val dot henson at gmail dot com

Valerie Henson is a Linux kernel developer working for Intel. Her interests include file systems, networking, women in computing, and walking up and down large mountains. She is always looking for good systems programmers, so send her some email and introduce yourself.]



KHB: Transparent support for large pages

Posted Jun 22, 2006 2:58 UTC (Thu) by Thalience (subscriber, #4217) [Link] (3 responses)

I did enjoy reading it, and look forward to the next installment!

KHB: Transparent support for large pages

Posted Jun 23, 2006 21:37 UTC (Fri) by smooth1x (subscriber, #25322) [Link] (2 responses)

I work as a DBA for a MAJOR company.

We are allocating multi-GB shared memory segments for our databases (we have recently broken the 4GB barrier for a single shared memory segment!). Large pages for large (>1GB?) shared memory allocations are all we need.

And we want these to be what Solaris calls ISM (intimate shared memory), i.e. shared page tables.

Oh, and pinned into physical memory (non-pageable and NOT looked at by the paging/swapping code).

Dave.

KHB: Transparent support for large pages

Posted Jun 24, 2006 1:16 UTC (Sat) by dododge (guest, #2870) [Link] (1 responses)

(we have recently broken the 4GB barrier for a single shared memory segment!).

Just FYI there shouldn't be much trouble with mappings that size. On a system with 96GB of RAM, I regularly do single 80GB shared mappings and I've managed to push it as high as 90GB keeping it all in-core. This system is actually a small configuration for the hardware and it wouldn't surprise me if people with bigger machines are doing mappings in the hundreds of gigabytes.

One limit you can run into is that the POSIX shm_open (and SVR4 shmget?) is typically implemented by using a file in /dev/shm, and the tmpfs mounted there is usually sized to half your RAM. If you want to go larger, you can do things like mount a larger tmpfs, or mmap some other file or block device (for example a striped LVM volume), or use hugetlbfs instead of tmpfs.

Another thing about /dev/shm is that it won't stop you creating and mapping a sparse file bigger than it can actually hold. I don't know if shm_open checks for this. I found out about it the hard way -- I mapped a new 50GB file in a 48GB tmpfs and had the application bus error when /dev/shm ran out of pages a few hours later.
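A defensive pattern along those lines - a sketch of my own, not the poster's code; the segment name and size are made up, and older glibc needs -lrt for shm_open - is to check the space behind /dev/shm before creating the segment and to touch every page at startup, so a shortfall bites immediately rather than hours into a run:

    /* Sketch: check /dev/shm free space before creating a large POSIX
     * shared memory segment, then pre-touch it.  This is only a startup
     * check; other users of the tmpfs can still consume space later. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/statvfs.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 1UL << 30;          /* 1GB segment, just as an example */
        struct statvfs vfs;

        if (statvfs("/dev/shm", &vfs) == 0) {
            unsigned long long avail =
                (unsigned long long)vfs.f_bavail * vfs.f_frsize;
            if (avail < len) {
                fprintf(stderr, "only %llu bytes free in /dev/shm\n", avail);
                return 1;
            }
        }

        int fd = shm_open("/bigseg", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        if (ftruncate(fd, (off_t)len) != 0) { perror("ftruncate"); return 1; }

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        memset(p, 0, len);               /* touch it all now, not hours from now */

        /* ... use the segment ... */
        munmap(p, len);
        close(fd);
        shm_unlink("/bigseg");
        return 0;
    }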

The biggest issue we have is simply getting the data in and out of RAM, especially if the shared memory is directly backed by disk. Imagine hitting control-C in an application and having to wait 20-30 minutes for the shell prompt to return, as the OS flushes a zillion pages back to the drive(s).

Large pages for large (>1GB?) shared memory allocations are all we need.

Oh, and pinned into physical memory (non-pageable and NOT looked at by the paging/swapping code).

I think hugetlbfs will do this for you today, if you want it immediately.

You can also use mlock to keep things resident, but be aware that the last time I tried using it (admittedly it was a 2.4 kernel), it instantly dirtied all of the pages in the mapping. So when the mapping (backed by a disk file) was then unlocked, it insisted on flushing the entire thing even if it hadn't been modified, and the flushing was done single-threaded in the kernel. For a large mapping, this can take a long time.
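For reference, basic mlock() usage looks like the sketch below (illustrative only, not the poster's code; whether the dirtying behaviour described above still applies depends on your kernel):

    /* Minimal mlock() sketch: lock a mapping into RAM so the pager
     * leaves it alone.  Needs root or a large enough RLIMIT_MEMLOCK. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 64UL << 20;         /* 64MB, purely as an example */

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        if (mlock(p, len) != 0) { perror("mlock"); return 1; }

        /* ... work with the resident region ... */

        munlock(p, len);
        munmap(p, len);
        return 0;
    }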

KHB: Transparent support for large pages

Posted Jun 24, 2006 11:49 UTC (Sat) by nix (subscriber, #2304) [Link]

shmget() and SysVIPC are implemented differently and are not constrained by the size of /dev/shm. (This need for a unique shared-with-nothing implementation is *another* reason to hate sysvipc!)

KHB: Transparent support for large pages

Posted Jun 22, 2006 3:08 UTC (Thu) by ianw (guest, #20143) [Link] (1 responses)

There is a lot of current work on superpages.

The current Linux approach to large pages is HugeTLB. This is a static approach, and not transparent. People are working on this with things like libhugetlbfs [1], and I have heard rumors of dynamic, per process hugetlb.

The other approach, as is mentioned, is one that is transparent. Naohiko Shimizu came up with an approach [2] which was implemented on Alpha, SPARC and i386. This showed good results, but the patch never ended up going anywhere.

Gelato@UNSW is actively working on large page support for Itanium Linux [3], using a Shimizu inspired approach. Itanium has excellent support for multiple page sizes, and with suitable modifications can use a hardware walker to re-fill the TLB with superpages with very little OS intervention. The project is currently in the hands of a master's student, but even with a hacked together proof of concept we can see great potential [4].

Clearly, as identified, fragmentation is an issue with larger pages. We are keeping an eye on the above mentioned projects, and others such as Chris Yeoh's work on fragmentation avoidance [5].

For Itanium, we believe we could get a working superpage implementation with very few overall lines of code difference, as mentioned. There is some doubt about how generic this could be; the Rice paper was implemented as a FreeBSD module using hooks into the VM layer; not an easy proposition with Linux.

Dynamic, transparent superpages are really not suited to the multi-level tree design as used by the Linux VM. There are a range of more suitable page table designs that incorporate support for large, sparse address spaces and superpages. To this end, Gelato@UNSW are working on a page table abstraction interface [6]. One of the most promising approaches is a guarded page table [7], which we are actively developing behind our interface. Our long term goal is to marry a guarded page table with dynamic superpages.

If others are working in this area, please contact us at [email protected].

[1] http://lwn.net/Articles/171451/
[2] http://shimizu-lab.dt.u-tokai.ac.jp/lsp.html
[3] http://www.gelato.unsw.edu.au/IA64wiki/SuperPages
[4] http://www.gelato.unsw.edu.au/IA64wiki/Ia64SuperPages/Ben...
[5] http://ozlabs.org/~cyeoh/presentations/memory_management_...
[6] http://www.gelato.unsw.edu.au/IA64wiki/PageTableInterface
[7] http://www.gelato.unsw.edu.au/IA64wiki/GuardedPageTable

KHB: Transparent support for large pages

Posted Jun 22, 2006 16:06 UTC (Thu) by dododge (guest, #2870) [Link]

The current Linux approach to large pages is HugeTLB. This is a static approach, and not transparent.

Yeah, "not transparent" is an understatement. For those who've never dealt with hugetlb, it goes something like this:

You have to explicitly allocate the pages from the kernel. You can dynamically allocate and free the pages, but since they have to be physically contiguous the number of hugepages you can get at any particular time is dependent on things like memory fragmentation from prior applications. So the best time to reserve them is at kernel startup (you can do this on the kernel command line), but even then the available number of hugepages can vary. For example I had a dataset that required a large number of hugepages, such that even at startup there were only one or two extra to spare -- then we added a few more CPUs to the machine and the next time it booted it could no longer construct enough hugepages to hold the data. When we updated the kernel a short while later the number changed again, thankfully back in our favor.

While allocated to hugepages, that memory can only be used for hugepages. So if you grab them early in order to be sure you can get enough for a later job, and end up devoting most of your RAM to hugepages, that memory is not available for normal use even if the pages aren't holding anything yet.

Access to hugepages is only available through the "hugetlbfs" filesystem, which basically acts like a ramdisk where files you store in it will be backed by hugepages. But hugetlbfs has a nasty little property in that it doesn't support normal I/O such as read(), write(), and ftruncate() on its files. All I/O has to be done through mmap(). This isn't so bad once the data is there, but copying files in and out of hugetlbfs is a big pain because the usual tools like "cp" and "dd" don't work.

That said, hugetlbfs is useful. You can store large amounts of data in hugetlbfs and it will stay memory-resident until reboot, giving you pretty much instant startup and shutdown times when you open/mmap/close the dataset. Alternatives such as tmpfs and mlock() are problematic when the amount of data gets into 10's of gigabytes or is nearing the total system RAM.

If someone is working on a way to get the benefits of hugetlbfs without the downsides of preallocation and limited I/O, that would be great.
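Putting the pieces together, the hugetlbfs dance described above looks roughly like the sketch below (assuming a 2.6-era kernel; the mount point, pool size, and file name are examples, and details vary between kernel versions):

    /* Rough hugetlbfs usage sketch.  Administrative steps (as root,
     * syntax may vary by kernel version):
     *
     *     echo 64 > /proc/sys/vm/nr_hugepages
     *     mount -t hugetlbfs none /mnt/huge
     *
     * Ordinary read()/write() on the file will not work; all access
     * goes through mmap(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 64UL << 20;   /* must be a multiple of the hugepage size */

        int fd = open("/mnt/huge/segment", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        memset(p, 0, len);         /* pages come out of the hugepage pool */

        munmap(p, len);
        close(fd);
        return 0;
    }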

KHB: Transparent support for large pages

Posted Jun 22, 2006 4:20 UTC (Thu) by nstraz (guest, #1592) [Link] (1 responses)

This was a truly great article. I look forward to more installments in this series.

KHB: Transparent support for large pages

Posted Jun 22, 2006 9:27 UTC (Thu) by nix (subscriber, #2304) [Link]

Indeed, excellent stuff; clear and sufficiently interesting that if my UltraSPARC was still running I'd start hacking on it now. (Alas, it's dead, and all my boxes are i386: it's a bit hard to use this approach with page size choices of only 4Kb versus 4Mb...)

More! More! :)

KHB: Transparent support for large pages

Posted Jun 22, 2006 15:29 UTC (Thu) by mattmelton (guest, #34842) [Link] (2 responses)

Great article. It picks up where my computer systems course left me.

This is an implementation of superpaging that I've come across before. It's not a huge innovation, but seeing it produce real results (the gzip test in particular) is very promising.

I think the fact that it is written in less than 4k lines is quite important. I'd like to see how directly this can be implemented in Linux's VM layer. I worry a little that a different implementation could lead to unavoidable arch-related bloat :/

(I should point out here that I have no knowledge of Alphas or how adaptable the VM critical paths are.)

I think there was a little too much "chat" in the article, however - but I'm just used to reading JC's, Greg's and RML's articles, so that's a totally moot point (and thankfully no-one's seen my thesis!)

Matt

KHB: Transparent support for large pages

Posted Jun 22, 2006 23:52 UTC (Thu) by Tobu (subscriber, #24111) [Link] (1 responses)

A little bit of "chat" is OK. Lacking that, I don't really single out the other LWN writers (English is not my native language, though). Also, irrelevant things (like "Personally, I'm not sure I've ever seen a better evaluation in a research paper; it's quite a treat to read.") may actually be good, because they show that the writer cares, and because they give the reader's train of thought a break, making the text easier to chew.

KHB: Transparent support for large pages

Posted Jun 28, 2006 23:09 UTC (Wed) by fergal (guest, #602) [Link]

I agree. If I wanted formal and dry, I'd read the original paper (I haven't read the original so I'm just guessing that it's formal and dry to some degree).

KHB: Transparent support for large pages

Posted Jun 23, 2006 1:33 UTC (Fri) by smoogen (subscriber, #97) [Link]

Good work Val.. but then again.. one would expect nothing less. I look forward to reading the rest of the series as I try to get back into Tech-mode.

KHB: Transparent support for large pages

Posted Jun 23, 2006 13:35 UTC (Fri) by ortalo (guest, #4654) [Link] (4 responses)

;-)

What about the bug with the JVM default stack size and 64KB pages then?
Any hints, resolution, implications?

By the way, very nice article. Thank you very much!

JVM default stack size and 64KB base pages

Posted Jun 23, 2006 20:24 UTC (Fri) by vaurora (guest, #38407) [Link] (3 responses)

I heard this story third-hand, so any corrections or additional details are welcome.

An unnamed large systems company decided to implement 64KB base pages in its OS. Everything looked dandy and great, except that for some reason nothing would work under Java. A closer look revealed that the default stack size of each thread in the JVM was set to 2*PAGESIZE... and the maximum stack size was 128KB. So threads would be created with 128KB stacks, immediately trigger a stack overflow, and die. This clearly is a bug and an abuse of PAGESIZE. However, given said company's fanatical devotion to strict binary compatibility above all else, it was decided that 64KB base page size could not be shipped because it would not work with that version of the JVM. This was an unpopular decision; it's possible it's been reversed since then.

The moral of the story is either "Strict binary compatibility is bad" or "Use transparent large pages and keep your base page size the same" or possibly "Computers are hard, let's go shopping!" (as Operating Systems Barbie is wont to say).

Other versions of this story probably exist for other OS's trying out larger base page sizes, which undoubtedly found the same bug; I don't know how they decided to handle the problem. But I'd love to hear the story!

-VAL

JVM default stack size and 64KB base pages

Posted Jun 23, 2006 21:37 UTC (Fri) by opalmirror (subscriber, #23465) [Link]

I designed and implemented a mechanism for a proprietary always-memory-resident RTOS to automagically split or join pages into naturally aligned clusters in the page table, making optimal use of the available TLBs and page sizes. I implemented it on PPC4xx and helped others apply it to other software page replacement CPU architectures from PPC, ARM/Xscale, MIPS, and SH. In that OS, PAGESIZE stayed the minimum size, but the OS looked at mapping requests and figured out how to minimize the use of TLBs. The use case for a memory-resident system and a demand-paged one makes for very different design decisions, though; for Linux one makes really different tradeoffs.

JVM default stack size and 64KB base pages

Posted Jun 24, 2006 11:52 UTC (Sat) by nix (subscriber, #2304) [Link]

I'm missing something. If the crash was due to the JVM unconditionally creating stacks of 2*PAGESIZE, then wouldn't the default (smaller) page size make the problem worse? (I doubt a JVM would work well with 8Kb or 16Kb thread stacks!)

Or were they doing something like

if (PAGESIZE > 16384)
... size stack to 2*PAGESIZE ...
else
... size it to something sane ...

If so, well, *ick*.

JVM default stack size and 64KB base pages

Posted Aug 4, 2006 0:23 UTC (Fri) by tpepper (guest, #31353) [Link]

I don't know the full details exactly (heard it third-hand, like Val), but as she says, there was apparently an assumption inside the JRE about what the page size was: they were allocating a certain number of pages thinking their stack would fit exactly in those pages, and their math to traverse memory was then wrong when the pages were larger (i.e. 64k instead of 4k). I thought it was a 12k stack using three 4k pages. Presumably they then got three 64k pages, with their assumed 12k stack falling entirely within the first if accessed straight from the start. Accesses to parts of the stack via the pointers to the second and third pages wouldn't actually then find their stack's data/frames... It's easy to imagine people doing unusual optimisations based on assumptions about page size.

At any rate they figured out how to work around it with an LD_PRELOAD option for their existing code pending an update.

KHB

Posted Jun 26, 2006 14:39 UTC (Mon) by tlw (guest, #31237) [Link]

If the KHB is to become a (semi-?)regular feature of LWN, can it get its own page in a vein similar to the kernel summit coverage, porting drivers to 2.6, or the 2.6 API special features pages?

Great series topic, Val, an excellent idea.
Thank you.

Page size and disk access

Posted Jun 30, 2006 9:52 UTC (Fri) by forthy (guest, #1525) [Link]

TLB hit rate is one performance issue with page sizes, but there's another: disk access.

If you are lucky, you can do about 100 accesses to a hard disk per second. On the other hand, you can read 60MB or more per second if you read sequentially. So the 4k page size is way smaller than what's a reasonable access pattern for a hard disk today - 512k would be closer to the sweet spot at the moment (where seek time and transfer time are similar). But since the transfer rate increases over time, while seek time doesn't change much, sweet spots move as well.

So support for large pages should also look at how to interface with disks. If I have a page miss on a 700k executable which is currently populating its text segment, it makes more sense to just allocate a 512k page right now and fetch this chunk in one go; later, when memory is tight and 2/3 of this page is never used, split it up and release the memory.
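The arithmetic behind that sweet spot, using the ballpark figures above (just an illustration):

    /* Back-of-the-envelope: the transfer size at which sequential
     * transfer time roughly equals seek time, using the rough disk
     * figures from the comment above. */
    #include <stdio.h>

    int main(void)
    {
        double seeks_per_sec = 100.0;                /* ~100 random accesses per second */
        double seek_time     = 1.0 / seeks_per_sec;  /* so ~10ms per access */
        double transfer_rate = 60e6;                 /* ~60MB/s sequential, in bytes/s */

        /* With these numbers the break-even size comes out near 512KB. */
        printf("break-even transfer size: ~%.0f KB\n",
               seek_time * transfer_rate / 1024.0);
        return 0;
    }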


Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds