
Fragmentation avoidance

Mel Gorman's fragmentation avoidance patches were covered here last February. This patch set divides all memory allocations into three categories: "user reclaimable," "kernel reclaimable," and "kernel non-reclaimable." The idea is to support multi-page contiguous allocations by grouping reclaimable allocations together. If no contiguous memory range is available, one can be created by forcing out reclaimable pages. Since non-reclaimable pages have been segregated into their own area, the chances of such a page blocking the creation of a contiguous set of free pages are relatively small.
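As a rough illustration of the grouping idea (a conceptual sketch only - the flag and helper names below are hypothetical, not the identifiers used in the patch set), the allocator can classify each request by a flag in its GFP mask and satisfy it from the matching region:

    /* Conceptual sketch; flag and helper names are hypothetical. */
    #define __GFP_USER_RCLM  0x10000u  /* marks user-reclaimable allocations   */
    #define __GFP_KERN_RCLM  0x20000u  /* marks kernel-reclaimable allocations */

    enum alloc_type {
            ALLOC_USER_RCLM,        /* e.g. page cache, anonymous memory */
            ALLOC_KERN_RCLM,        /* e.g. inode and dentry caches      */
            ALLOC_KERN_NORCLM,      /* everything else: cannot be moved  */
    };

    /*
     * Pages of each type are grouped into separate regions of physical
     * memory, so reclaiming everything in one region yields a large,
     * contiguous block of free pages.
     */
    static enum alloc_type classify_alloc(unsigned int gfp_mask)
    {
            if (gfp_mask & __GFP_USER_RCLM)
                    return ALLOC_USER_RCLM;
            if (gfp_mask & __GFP_KERN_RCLM)
                    return ALLOC_KERN_RCLM;
            return ALLOC_KERN_NORCLM;
    }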

Mel recently posted version 19 of the fragmentation avoidance patch and requested that it be included in the -mm kernel. That request started a lengthy discussion on whether this patch set is a good idea or not. There is, it seems, a fair amount of uncertainty over whether this code belongs in the kernel. There are a few reasons for wanting fragmentation avoidance, and the arguments differ for each of them.

The first of these reasons is to increase the probability that high-order (multi-page) allocations in the kernel will succeed. Nobody denies that Mel's patch achieves that goal, but there are developers who claim that a better approach is to simply eliminate any such allocations. In fact, most multi-page allocations were eliminated some time ago. A few remain, however, including the two-page kernel stacks still used by default on most systems. When a kernel stack allocation fails, a new process cannot be created. The kernel may eventually move to single-page stacks in all situations, but a few higher-order allocations will remain. It is not always possible to break required memory into single-page chunks.
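As a concrete example, the two-page kernel stack is an order-1 allocation: a request for 2^1 physically contiguous pages through the long-standing __get_free_pages() interface. A minimal sketch:

    #include <linux/gfp.h>          /* GFP_KERNEL, __get_free_pages() */
    #include <linux/errno.h>

    /*
     * An order-1 request asks for two physically contiguous pages -
     * the same shape as the default kernel stack.  If no contiguous
     * pair of free pages exists, the allocation (and, in the stack
     * case, fork()) fails.
     */
    static int example_order1_alloc(void)
    {
            unsigned long pages = __get_free_pages(GFP_KERNEL, 1);

            if (!pages)
                    return -ENOMEM; /* fragmentation can cause this */

            /* ... use the two-page region ... */

            free_pages(pages, 1);   /* order must match the allocation */
            return 0;
    }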

The next reason, strongly related to the first, is huge pages. The huge page mechanism is used to improve performance for certain applications on large systems; there are few users currently, but that could change if huge pages were easier to work with. Huge pages cannot be allocated for applications in the absence of a large - and suitably aligned - region of contiguous memory. In practice, they are very difficult to create on systems which have been running for any period of time. Failure to allocate a huge page is relatively benign; the application simply has to get by with regular pages and take the performance hit. But, given that you have a huge page mechanism, making it work more reliably would be worthwhile.
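For reference, here is a minimal user-space sketch of obtaining a huge page on a kernel of this era, assuming hugetlbfs is mounted at /mnt/huge (the mount point is arbitrary) and the huge page pool has been populated through /proc/sys/vm/nr_hugepages. It is that pool-growing step which needs contiguous, aligned memory and fails on fragmented systems; the mmap() itself fails only when the reserved pool is empty.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define HUGE_SIZE (2 * 1024 * 1024)  /* one 2MB huge page (i386 PAE / x86-64) */

    int main(void)
    {
            int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0600);
            void *p;

            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            /* Fails with ENOMEM if the reserved huge page pool is empty. */
            p = mmap(NULL, HUGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            *(char *)p = 1;              /* touch the huge page */
            munmap(p, HUGE_SIZE);
            close(fd);
            unlink("/mnt/huge/example");
            return 0;
    }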

The fragmentation avoidance patches can help with both high-order allocations and huge pages. There is some debate over whether it is the right solution to the problem, however. The often-discussed alternative would be to create one or more new memory zones set aside for reclaimable memory. This approach would make use of the zone system already built into the kernel, thus avoiding the creation of a new layer. A zone-based system might also avoid the perceived (though somewhat unproven) performance impact of the fragmentation avoidance patches. Given that this impact is said to be felt in that most crucial of workloads - kernel compiles - this argument tends to resonate with the kernel developers.

The zone-based approach is not without problems, however. Memory zones, currently, are static; as a result, somebody would have to decide how to divide memory between the reclaimable and non-reclaimable zones. This adjustment looks like it would be hard to get right in any sort of reliable way. In the past, the zone system has also been the source of a number of performance problems, mostly related to balancing of allocations between the zones. Increasing the complexity of the zone system and adding more zones could well bring those problems back.

There is another motivation for fragmentation avoidance which brings a different set of constraints: support for hot-pluggable memory. This feature is useful on high-availability systems, but it is also heavily used in association with virtualization. A host running a number of virtualized Linux instances can, by way of the hotplug mechanism, shift its memory resources between those instances in response to the demands of each.

Before memory can be removed from a running system, its contents must be moved elsewhere - at least, if one wants to still have a running system afterward. The fragmentation avoidance patches can help by putting only reclaimable allocations in the parts of memory which might be removed. As long as all the pages in a region can be reclaimed, that region is removable.
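Conceptually, an offlining attempt might look like the sketch below; the helpers other than the standard page_count() test are hypothetical, and the real hotplug code is far more involved:

    #include <linux/mm.h>
    #include <linux/errno.h>

    /* Illustrative sketch of vacating a region before removal. */
    static int try_offline_region(struct page *start, unsigned long nr_pages)
    {
            unsigned long i;

            for (i = 0; i < nr_pages; i++) {
                    struct page *page = start + i;

                    if (!page_count(page))
                            continue;       /* already free */

                    /*
                     * A single non-reclaimable page anywhere in the
                     * region makes the whole region unremovable.
                     */
                    if (!page_is_reclaimable(page))  /* hypothetical test */
                            return -EBUSY;

                    if (migrate_or_reclaim(page))    /* hypothetical helper */
                            return -EBUSY;
            }
            return 0;   /* region is now empty and can be removed */
    }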

A very different argument has surfaced here: Ingo Molnar is insisting that any mechanism claiming to support hot-pluggable memory be able to provide a 100% success rate. The current code need not live up to that metric, but there needs to be a clear path toward that goal. Otherwise, the kernel developers risk advertising a feature which they may not ever be able to support in a reliable way. The backers of fragmentation avoidance would like to merge the patches, solving 90% of the problem, and leave the other 90% for later. Ingo, instead, fears that second 90%, and wants to know how it will get done.

Why can't the current patches offer 100% reliability if they only put reclaimable memory in hot-pluggable regions? There are ways to lock down pages which were once reclaimable; these include DMA operations and pages explicitly locked by user space. There is also the issue of what happens when the kernel runs out of non-reclaimable memory. Rather than fail a non-reclaimable allocation attempt, the kernel will allocate a page from the reclaimable region. This fallback is necessary to avoid inflicting reliability problems on the rest of the kernel. But the presence of a non-reclaimable page in a reclaimable region will prevent the system from vacating that region.
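The fallback might look roughly like this simplified sketch, reusing the hypothetical allocation types from the earlier example; take_pages_from_region() is likewise invented for illustration:

    /*
     * A non-reclaimable request that cannot be satisfied from its own
     * region spills into a reclaimable region rather than failing
     * outright - and thereby makes that region unremovable.
     */
    static struct page *alloc_from_type(enum alloc_type type, unsigned int order)
    {
            struct page *page;

            page = take_pages_from_region(type, order);     /* hypothetical */
            if (page)
                    return page;

            if (type == ALLOC_KERN_NORCLM) {
                    /*
                     * Better to pollute a reclaimable region than to
                     * fail an allocation the caller cannot recover from.
                     */
                    page = take_pages_from_region(ALLOC_USER_RCLM, order);
            }
            return page;
    }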

This problem can be solved by getting rid of non-reclaimable allocations altogether. And that can be done by changing how the kernel's address space works. Currently, the kernel runs in a single, contiguous virtual address space which is mapped directly onto physical memory - often using large page table entries. (The vmalloc() region is a special exception, but it is not an issue here.) If the kernel were, instead, to use normal-sized pages like the rest of the system, its memory would no longer need to be physically contiguous. Then, if a kernel page gets in the way, it can simply be moved to a more convenient location.
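That direct mapping is visible in the i386 headers of the era, where the kernel's virtual-to-physical translation is simple arithmetic rather than a page table walk (simplified here):

    /*
     * The kernel's linear mapping makes translation pure arithmetic,
     * with no page-table lookup - which is also why kernel pages
     * cannot simply be relocated in physical memory.
     */
    #define PAGE_OFFSET  0xC0000000UL   /* start of the kernel mapping on i386 */

    #define __pa(x)  ((unsigned long)(x) - PAGE_OFFSET)           /* virt -> phys */
    #define __va(x)  ((void *)((unsigned long)(x) + PAGE_OFFSET)) /* phys -> virt */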

Beyond the fact that this approach fundamentally changes the kernel's memory model, there are a couple of little issues with it. There would be a performance hit caused by increased translation lookaside buffer (TLB) pressure, and an increase in the amount of memory needed to store the kernel's page tables. Certain kernel operations - DMA in particular - cannot tolerate physical addresses which might change at arbitrary times. So there would have to be a new API with which drivers could request physically-nailed regions - and be told by the kernel to give them up. In other words, breaking up the kernel's address space opens a substantial barrel of worms. It is not the sort of change which would be accepted in the absence of a fairly strong motivation, and it is not clear that hot-pluggable memory is a sufficiently compelling cause.
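No such API exists; purely as a thought experiment, it might resemble the declarations below, where every identifier is hypothetical:

    #include <stddef.h>

    /*
     * Entirely hypothetical API sketch - nothing like this exists in
     * the kernel.  A driver would pin a physically stable region for
     * DMA, registering a callback by which the kernel could demand
     * the region back before relocating or removing the underlying
     * memory.
     */
    struct pinned_region;

    struct pinned_region *pin_physical_region(size_t len,
                    int (*surrender)(struct pinned_region *, void *data),
                    void *data);

    void unpin_physical_region(struct pinned_region *region);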

So no conclusions have been reached on the inclusion of the fragmentation avoidance patches. In the short term, Andrew Morton's controversy avoidance mechanisms are likely to keep the patch set out of the -mm tree. But there are legitimate reasons for wanting this capability in the kernel, and the issue is unlikely to go away. Unless somebody comes up with a better solution, it could be hard to keep Mel's patches out forever.




180% solution in the works?

Posted Nov 3, 2005 6:50 UTC (Thu) by xoddam (guest, #2322)

> ... solving 90% of the problem, and leave the other 90% for later.
> Ingo, instead, fears that second 90% ...

That's a *big* problem!

180% solution in the works?

Posted Nov 3, 2005 7:47 UTC (Thu) by mingo (guest, #31122)

Yeah. That's the tricky thing about kernel features: the second 90% is usually much harder than the first 90% of a solution. Sometimes there's even a third 90% that has to be taken care of!

Fragmentation avoidance

Posted Nov 4, 2005 9:06 UTC (Fri) by opalmirror (subscriber, #23465)

that can be done by changing how the kernel's address space works. Currently, the kernel runs in a single, contiguous virtual address space which is mapped directly onto physical memory - often using a single, large page table entry. (The vmalloc() region is a special exception, but it is not an issue here). If the kernel were, instead, to use normal-sized pages like the rest of the system, its memory would no longer need to be physically contiguous. Then, if a kernel page gets in the way, it can simply be moved to a more convenient location.

Horrible screeching noises are emitted from the general locale of the MIPS kernel.

Question

Posted Nov 4, 2005 23:47 UTC (Fri) by Nir (guest, #27440)

If I boot Linux using only part of the memory (say, booting with mem=400M out of 512M), would I be able to address memory above 400MB through the CPU? Anyone?

Question

Posted Nov 18, 2005 21:20 UTC (Fri) by tyhik (guest, #14747)

Heh, not sure you'll still read this...
Yes, you can access that "upper" memory. I have used it on one of my embedded systems: I keep the framebuffer in the upper MB of memory and, through the mem= option, tell the kernel that it doesn't have that upper MB. This keeps the kernel from messing up the framebuffer during boot, after the boot loader has already nicely set up a boot logo. Yes, I had to tweak the framebuffer driver to support accessing that memory.


Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds