Controlling Linux Memory Fragmentation and Higher Order Allocation Failure - Analysis, Observations and Results
CONTENT
Introduction
Measuring the memory fragmentation level
Memory Fragmentation Analysis
Observations
Experimentation Results
Summary
Some References
INTRODUCTION
What is Memory Fragmentation?
When a Linux device runs for a long time without a reboot and keeps allocating and freeing pages, physical memory becomes fragmented. The supply of large contiguous free blocks drops to zero, and free memory is left only as many small, non-contiguous blocks. As a result, a kernel page allocation can fail even though plenty of memory is free in smaller units. This problem is called external memory fragmentation, hereafter referred to simply as memory fragmentation.
Effect of Memory Fragmentation:
Memory fragmentation can cause a system to lose its ability to launch new processes.
Memory fragmentation is more of an issue on embedded devices and Linux-based mobile devices.
DRAM + Flash, swapless systems
Memory fragmentation becomes more critical under heavy multimedia and graphics activity, which requires contiguous higher-order pages.
Measuring the memory fragmentation level

% Fragmentation (for order j) = ( ( TotalFreePages - SUM(i = j to N) [ (2^i) * Ki ] ) * 100 ) / TotalFreePages

TotalFreePages = Total number of free pages in each Node
N              = MAX_ORDER - 1, the highest order of allocation
j              = the desired order requested
i              = page order, 0 to N
Ki             = Number of free blocks of order i

(The above formula is derived from Mel Gorman's paper: Measuring the Impact of the Linux Memory Manager)
Example:
Let's say a NORMAL zone looks like this at some point in time:

Order                 2^0   2^1   2^2   2^3   2^4   2^5   2^6   2^7   2^8   2^9  2^10
Node 0, zone Normal     3    20   104    13    13     1     1     1     0     0     0
Now let's apply our formula to measure the fragmentation level for order 2^5.
Here, TotalFreePages = (3x1 + 20x2 + 104x4 + 13x8 + 13x16 + 1x32 + 1x64 + 1x128) = 995
Therefore:
% Fragmentation = ((995 - [ (2^5)*1 + (2^6)*1 + (2^7)*1 + (2^8)*0 + (2^9)*0 + (2^10)*0 ]) * 100) / 995
% Fragmentation = ((995 - [ 32 + 64 + 128 ]) * 100) / 995 = ((995 - 224) * 100) / 995
% Fragmentation = (771 * 100) / 995 = 77.49 %, or about 77 % (rounded)
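The calculation above can be sketched in a few lines of Python. This is only an illustration of the formula, not the code used in the experiments: the function name fragmentation_percent is made up for this example, and returning 0 for a zone with no free pages is an assumption.

```python
def fragmentation_percent(free_blocks, order):
    """Percentage of free pages that sit in blocks smaller than 2**order.

    free_blocks[i] is the number of free blocks of order i (2**i pages
    each), as listed per zone in /proc/buddyinfo.
    """
    total_free_pages = sum(count << i for i, count in enumerate(free_blocks))
    if total_free_pages == 0:
        return 0.0  # assumption: report 0 for a zone with no free pages
    usable_pages = sum(
        count << i for i, count in enumerate(free_blocks) if i >= order
    )
    return (total_free_pages - usable_pages) * 100.0 / total_free_pages


# Free-block counts of the Normal zone in the example, orders 0 to 10.
normal_zone = [3, 20, 104, 13, 13, 1, 1, 1, 0, 0, 0]
print(round(fragmentation_percent(normal_zone, 5)))
```

For the example zone and order 5 the result rounds to 77 %. For order 8 and above it is 100 %, because no free block of that size exists.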
AFTER RUNNING VARIOUS APPLICATIONS (Browser, WiFi Video Share, Camera, Voice Recorder, eBooks, a few games) for an hour and then killing them all
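Buddy state before allocation:

Order              2^0   2^1   2^2   2^3   2^4   2^5   2^6   2^7   2^8   2^9  2^10
Node 0, zone DMA     3     1     1     1     0     0     0     1     0     0     0
Node 1, zone DMA   197   129     0     0     0     0     0     0     0     0     0
Node 2, zone DMA    30    12    14     4     5     8     6     4     2     1    27

Enter the page order(in power of 2) : 16     (2^4 order block, 16 x 4K = 64K bytes)
Enter the number of such block : 5

Buddy state after allocation:

Order              ...   2^6   2^7   2^8   2^9  2^10
Node 0, zone DMA   ...     0     1     0     0     0
Node 1, zone DMA   ...     0     0     0     0     0
Node 2, zone DMA   ...     6     4     2     1    27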
Explanation: Five blocks of order 2^4 were requested. The request could only be satisfied by Node 2, so Node 2 was selected for the allocation. Even in Node 2, only 3 of the 5 (2^4) blocks could be allocated directly. The other 2 were allocated by splitting a 2^5 order block.
5 x 2^4 = [3 x 2^4 + (1 x 2^5)] = [3 x 2^4 + (1 x 2 x 2^4)] = [5 x 2^4]
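The splitting step in this explanation can be sketched as a toy model of the buddy free lists. It is a simplified illustration, not kernel code: allocate_block is a hypothetical helper, and the starting counts (three free 2^4 blocks and one free 2^5 block) are chosen to mirror the arithmetic above rather than taken from the Node 2 row.

```python
def allocate_block(free_blocks, order):
    """Take one block of the given order, splitting a larger block if needed.

    free_blocks[i] holds the number of free blocks of order i and is
    updated in place. Returns True on success, False if no block of the
    requested order or above is free.
    """
    for source in range(order, len(free_blocks)):
        if free_blocks[source] > 0:
            free_blocks[source] -= 1
            # Splitting an order-`source` block down to `order` leaves one
            # free buddy behind at every order from `order` to source - 1.
            for lower in range(order, source):
                free_blocks[lower] += 1
            return True
    return False


# Illustrative free lists for orders 0 to 5: three 2^4 blocks, one 2^5 block.
free_blocks = [0, 0, 0, 0, 3, 1]
results = [allocate_block(free_blocks, 4) for _ in range(5)]
print(results, free_blocks)
```

All five requests succeed: three are served from the 2^4 list, the fourth splits the 2^5 block, and the fifth takes the buddy left over by that split. Both lists end up empty, so a sixth request would fail.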
Buddy state before allocation:

Order              2^0   2^1   2^2   2^3   2^4   2^5   2^6   2^7   2^8   2^9  2^10
Node 0, zone DMA     1     0     0     0     1     0     2     0     0     0     0
Node 1, zone DMA   247   181    15     1     1     2     0     9    10     8    11
Node 2, zone DMA    19    19    11     3     6     3     4     2     2     2    29

Enter the page order(in power of 2) : 1024     (2^10 order block, 1024 x 4K = 4096K bytes; this is the highest order)
Enter the number of such block : 50

ERROR : ioctl - PINCHAR_ALLOC - Failed, after block num = 48 !!!

Explanation: As you can see, the request for 50 blocks of 1024 pages failed after 47 successful allocations, even though plenty of free pages were still available in the lower orders.

Buddy State After Allocation Failed and Other Allocations Freed
Order              2^0   2^1   2^2   2^3   2^4   2^5   2^6   2^7   2^8   2^9  2^10
Node 0, zone DMA     1     0    18     9     0     0     0     0     0     0     0
Node 1, zone DMA    88    77    36    25    25    43    23    19    18    11    20
Node 2, zone DMA    18    14    10     4     5     2     3     3     2     2    29

Explanation: The interesting thing to note here is that after the requested allocation failed, the kernel tried to arrange that many blocks at the desired order so that the next similar request could succeed.
Observations
__alloc_pages_nodemask: This is the heart of all page allocation in the kernel.
We measure the fragmentation level for each higher order here.
We track higher-order allocations during high fragmentation.
Anything above PAGE_ALLOC_COSTLY_ORDER (== 3) is considered a higher-order allocation and becomes critical.
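The measurements described here were taken inside the kernel. A rough user-space approximation of the same idea is to read the per-zone counts in the /proc/buddyinfo format and report the fragmentation level for every order above PAGE_ALLOC_COSTLY_ORDER. The sketch below assumes that format (one "Node N, zone NAME" line followed by one count per order); the function names are made up for this example.

```python
PAGE_ALLOC_COSTLY_ORDER = 3  # value quoted in the text above


def parse_buddyinfo(text):
    """Map (node, zone name) to the list of free-block counts per order."""
    zones = {}
    for line in text.splitlines():
        fields = line.replace(",", " ").split()
        # Expected shape: Node <n> zone <name> <order-0 count> <order-1 count> ...
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        zones[(int(fields[1]), fields[3])] = [int(f) for f in fields[4:]]
    return zones


def higher_order_fragmentation(counts):
    """Fragmentation level (%) for each order above PAGE_ALLOC_COSTLY_ORDER."""
    total = sum(c << i for i, c in enumerate(counts))
    report = {}
    for order in range(PAGE_ALLOC_COSTLY_ORDER + 1, len(counts)):
        usable = sum(c << i for i, c in enumerate(counts) if i >= order)
        report[order] = (total - usable) * 100.0 / total if total else 0.0
    return report


sample = "Node 0, zone   Normal   3  20 104  13  13   1   1   1   0   0   0"
zones = parse_buddyinfo(sample)
for key, counts in zones.items():
    print(key, {o: round(p) for o, p in higher_order_fragmentation(counts).items()})
```

On a live system the same functions could be fed the contents of /proc/buddyinfo instead of the sample string.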
Direct reclaim makes some progress but still could not return any pages during the first run.
Similarly, direct compaction (from kernel 2.6.35 onwards) is helpful but still not effective for very high-order allocations.
Maybe the other way round (first direct_reclaim, then direct_compact) could be more helpful.
A small delay (for GFP_KERNEL) is required after direct_{reclaim,compact} and before the retry, maybe due to the lazy buddy allocator.
Experimentation Results
We performed some experiments with higher-order allocation. We found that whenever any application is run, Xorg performs order-4 or order-8 allocations. The browser always requires order-4 allocations.
/opt/pintu # ps ax | grep browser
7159 ? Ssl 0:03 /opt/apps/com.samsung.browser/bin/browser

[ 3830.215613] [HIGHERORDER_DEBUG] : __alloc_pages_nodemask is called by process <PID = 1168, NAME = Xorg> !!!
[ 3830.227243] [HIGHERORDER_DEBUG] : ZONE : Normal, NODE : 0, ORDER = 8, Fragmentation Level = 29%
[ 3830.235645] [HIGHERORDER_DEBUG] : __alloc_pages_nodemask is called by process <PID = 1168, NAME = Xorg> !!!
[ 3830.244575] [HIGHERORDER_DEBUG] : ZONE : Normal, NODE : 0, ORDER = 4, Fragmentation Level = 13%
(Around 10 times)
[ 3831.355884] [HIGHERORDER_DEBUG] : __alloc_pages_nodemask is called by process <PID = 7159, NAME = browser> !!!
[ 3831.364649] [HIGHERORDER_DEBUG] : ZONE : Normal, NODE : 0, ORDER = 4, Fragmentation Level = 13%
[ 3831.373484] [HIGHERORDER_DEBUG] : __alloc_pages_nodemask is called by process <PID = 7159, NAME = browser> !!!
[ 3831.383134] [HIGHERORDER_DEBUG] : ZONE : Normal, NODE : 0, ORDER = 4, Fragmentation Level = 13%
(Around 26 times)
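Traces like the one above can be summarized by counting how often each process requests each order. The sketch below is one way to do that, assuming the HIGHERORDER_DEBUG messages appear one per line and that every "called by process" line is followed by its "ZONE" line, as in the trace. The regular expressions are written against the message text shown here and the function name count_requests is made up for this example.

```python
import re
from collections import Counter

CALLER = re.compile(r"called by process <PID = (\d+), NAME = ([^>]+)>")
ZONE = re.compile(
    r"ZONE : (\w+), NODE : (\d+), ORDER = (\d+), Fragmentation Level = (\d+)%"
)


def count_requests(log_text):
    """Count (process name, order) pairs in HIGHERORDER_DEBUG output."""
    counts = Counter()
    current = None  # process name from the most recent caller line
    for line in log_text.splitlines():
        caller = CALLER.search(line)
        if caller:
            current = caller.group(2).strip()
            continue
        zone = ZONE.search(line)
        if zone and current is not None:
            counts[(current, int(zone.group(3)))] += 1
            current = None
    return counts


sample_log = """\
[ 3830.215613] [HIGHERORDER_DEBUG] : __alloc_pages_nodemask is called by process <PID = 1168, NAME = Xorg> !!!
[ 3830.227243] [HIGHERORDER_DEBUG] : ZONE : Normal, NODE : 0, ORDER = 8, Fragmentation Level = 29%
[ 3830.235645] [HIGHERORDER_DEBUG] : __alloc_pages_nodemask is called by process <PID = 1168, NAME = Xorg> !!!
[ 3830.244575] [HIGHERORDER_DEBUG] : ZONE : Normal, NODE : 0, ORDER = 4, Fragmentation Level = 13%
[ 3831.355884] [HIGHERORDER_DEBUG] : __alloc_pages_nodemask is called by process <PID = 7159, NAME = browser> !!!
[ 3831.364649] [HIGHERORDER_DEBUG] : ZONE : Normal, NODE : 0, ORDER = 4, Fragmentation Level = 13%
[ 3831.373484] [HIGHERORDER_DEBUG] : __alloc_pages_nodemask is called by process <PID = 7159, NAME = browser> !!!
[ 3831.383134] [HIGHERORDER_DEBUG] : ZONE : Normal, NODE : 0, ORDER = 4, Fragmentation Level = 13%
"""
request_counts = count_requests(sample_log)
for (name, order), n in sorted(request_counts.items()):
    print(f"{name}: order {order} x {n}")
```

The sample contains only the lines printed in the trace, so the repeated requests noted as "Around 10 times" and "Around 26 times" are not reflected in its counts.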
Node 0, zone DMA   104     1     1     0     0     0     1     0     0     0
Node 1, zone DMA    19    31    18     9     1     1     0     0     0     0
Node 2, zone DMA    50     9     5     5     3     2     1     0     0     0

Enter the page order(in power of 2) : 1024     (2^10 order block, 1024 x 4K = 4096K bytes; this is the highest order)
Enter the number of such block : 10
[24768.550017] [HIGHERORDER_DEBUG] : __alloc_pages_nodemask is called by process <PID = 2289, NAME = app_pinchar.bin> !!!
[24768.559578] [HIGHERORDER_DEBUG] : ZONE : DMA, NODE : 0, ORDER = 10, Fragmentation Level = 100%
[24768.568020] [HIGHERORDER_DEBUG] : __alloc_pages_nodemask is called by process <PID = 2289, NAME = app_pinchar.bin> !!!
[24768.578686] [HIGHERORDER_DEBUG] : ZONE : DMA, NODE : 1, ORDER = 10, Fragmentation Level = 100%
[24768.587251] [HIGHERORDER_DEBUG] : __alloc_pages_nodemask is called by process <PID = 2289, NAME = app_pinchar.bin> !!!
[24768.597919] [HIGHERORDER_DEBUG] : ZONE : DMA, NODE : 2, ORDER = 10, Fragmentation Level = 100%
[24768.606486] [HIGHERORDER_DEBUG] : __alloc_pages_nodemask : Allocation going via - slowpath !!!
[24768.615141] app_pinchar.bin: page allocation failure. order:10, mode:0x4020
Explanation: As you can see here, due to 100% fragmentation the page allocation request failed, even after direct reclaim (slow path). But after a delay and a retry, all subsequent allocations succeeded. This delay indicates that something needs to be done after direct reclaim, maybe waiting until the lazy buddy allocator has arranged the free pages in the subsequent free areas.
[17949.859104] app_pinchar.bin: page allocation failure. order:10, mode:0x40d0
------------------------------------- Wait for 2 seconds and retry allocation ---------------------
[17951.879156] [HIGHERORDER_DEBUG] : Trying - Final time !!!!!!!!!!!
[17951.893248] <PINCHAR> : PINCHAR_ALLOCATE - Success(index = 0)
[17960.189583] [HIGHERORDER_DEBUG] : __alloc_pages_nodemask is called by process <PID = 27713, NAME = app_pinchar.bin> !!!
[17960.201128] [HIGHERORDER_DEBUG] : ZONE : Normal, NODE : 0, ORDER = 10, Fragmentation Level = 98%
[17960.210269] [HIGHERORDER_DEBUG] : __alloc_pages_nodemask : Allocation going via - slowpath !!!
[17960.335044] [HIGHERORDER_DEBUG] : did_some_progress = 887
[17960.339918] [HIGHERORDER_DEBUG] : Got some pages after direct reclaim .....
[17960.368939] <PINCHAR> : PINCHAR_ALLOCATE - Success(index = 4) !
[17964.518845] [HIGHERORDER_DEBUG] : __alloc_pages_nodemask is called by process <PID = 27713, NAME = app_pinchar.bin> !!!
[17964.530629] [HIGHERORDER_DEBUG] : ZONE : Normal, NODE : 0, ORDER = 10, Fragmentation Level = 83%
[17964.547138] <PINCHAR> : PINCHAR_ALLOCATE - Success(index = 8) !
[17965.552976] [HIGHERORDER_DEBUG] : __alloc_pages_nodemask is called by process <PID = 27713, NAME = app_pinchar.bin> !!!
[17965.564319] [HIGHERORDER_DEBUG] : ZONE : Normal, NODE : 0, ORDER = 10, Fragmentation Level = 84%
[17965.580823] <PINCHAR> : PINCHAR_ALLOCATE - Success(index = 9) !
[17966.586440] [HIGHERORDER_DEBUG] : __alloc_pages_nodemask is called by process <PID = 27713, NAME = app_pinchar.bin> !!!
[17966.597175] [HIGHERORDER_DEBUG] : ZONE : Normal, NODE : 0, ORDER = 10, Fragmentation Level = 85%
[17966.613424] <PINCHAR> : PINCHAR_ALLOCATE - Success(index = 10) !
Allocation failed on the very first attempt, even after direct reclaim. But after introducing a delay and retrying, all further allocations succeeded. Maybe kswapd takes some time to clean up dirty pages and the buddy allocator to add them back to the free area.
Here you can see lots of movable pages after lots of direct reclaim. Thus direct compaction might be helpful after direct reclaim and not before.
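The delay-and-retry behaviour seen in the trace can be expressed as a small generic helper. This is only a sketch of the pattern, not the code used in the experiment: try_alloc is a hypothetical callable standing in for the real allocation request, and the 2 second default mirrors the wait shown in the trace. As noted for GFP_KERNEL, the pattern only applies where sleeping is allowed.

```python
import time


def allocate_with_retry(try_alloc, retries=1, delay_s=2.0, sleep=time.sleep):
    """Call try_alloc(); on failure wait delay_s seconds and try again.

    try_alloc is a hypothetical callable that returns None on failure and
    any other value (for example a buffer handle) on success.
    Returns (handle, attempts); handle is None if every attempt failed.
    """
    attempts = 0
    while True:
        attempts += 1
        handle = try_alloc()
        if handle is not None or attempts > retries:
            return handle, attempts
        sleep(delay_s)


# Fake allocator that fails once and then succeeds, to exercise the helper.
outcomes = iter([None, "block-0"])
handle, attempts = allocate_with_retry(
    lambda: next(outcomes), delay_s=0.01
)
print(handle, attempts)
```

The sleep function is a parameter so that the helper can be exercised without a real 2 second wait.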
EXPERIMENTATION DATA

Page Order  Block Used  Available Blocks  No of Blocks Requested  Fragmentation Level  No of Blocks Allocated  Pass Rate
10          1024        -                 20                      100%                 20                      100%
9           512         11                20                      94%                  20                      100%
8           256         4                 20                      90%                  20                      100%
8           256         0                 50                      100%                 50                      100%
9           512         1                 30                      97%                  30                      100%
10          1024        28                40                      10%                  40                      100%
10          1024        0                 50                      100%                 46                      92%
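The pass rate in the data above is the share of requested blocks that were actually allocated. A minimal sketch of that arithmetic follows; the function name pass_rate is made up for this example, and the row pairs are the requested and allocated counts read from the data above.

```python
def pass_rate(requested, allocated):
    """Percentage of requested blocks that were actually allocated."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    return allocated * 100.0 / requested


# (blocks requested, blocks allocated) for each experiment row.
rows = [(20, 20), (20, 20), (20, 20), (50, 50), (30, 30), (40, 40), (50, 46)]
rates = [round(pass_rate(requested, allocated)) for requested, allocated in rows]
print(rates)
```

Only the last run, with 46 of 50 blocks allocated, falls below 100 %.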
SUMMARY
Measuring the fragmentation level and tracking higher-order allocations is important, at least for a low-memory notifier. It was observed that allocation takes the slow path whenever the fragmentation level is above 90%. The delay introduced here is for experimental purposes only.
The delay could be needed because dirty pages have to be written to disk before they are marked free. Maybe the real fix is to wait until the lazy buddy allocator rearranges the free pages. This is valid only for GFP_KERNEL, where sleeping is allowed.
For fragmentation > 90%, introduce a temporary kernel thread to do direct reclaim/compaction in the background.
By the time you come back for the next request, pages will be ready for you. Not enough data to share yet; further experiments are in progress.
But this requires COMPACTION and HUGETLB to be enabled. Maybe we can use this from kernel 2.6.35 onwards. It is difficult to backport compaction to older kernel versions. It is mostly helpful for large systems and may not be useful for small embedded products.
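The threshold idea can be prototyped from user space before writing any kernel thread. The sketch below is a stand-in for the proposal, not an implementation of it: it triggers compaction through /proc/sys/vm/compact_memory, which exists only on kernels built with compaction support and needs root to write. The function names and the injectable compact parameter are assumptions made for this example.

```python
COMPACT_MEMORY = "/proc/sys/vm/compact_memory"  # needs CONFIG_COMPACTION and root


def request_compaction(path=COMPACT_MEMORY):
    """Ask the kernel to compact all zones by writing 1 to the sysctl file."""
    with open(path, "w") as f:
        f.write("1")


def maybe_compact(fragmentation_level, threshold=90.0, compact=request_compaction):
    """Trigger compaction when the fragmentation level exceeds the threshold.

    fragmentation_level is a percentage for the order of interest.
    compact is injectable so the decision logic can be tried without root.
    Returns True if compaction was requested.
    """
    if fragmentation_level > threshold:
        compact()
        return True
    return False


# Dry run with a recording stub instead of the real sysctl write.
triggered = []
print(maybe_compact(98.0, compact=lambda: triggered.append("compact")))
print(maybe_compact(13.0, compact=lambda: triggered.append("compact")))
```

A background worker could call maybe_compact periodically with a freshly measured fragmentation level, so that compaction has already run by the time the next higher-order request arrives.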
Can we introduce something like a system-wide fragmentation level? A reboot is not a good choice for end users, even on a small system. Maybe introduce something like a "reset physical memory state" operation.
Bring memory back to its original state without a reboot. Not enough data. Maybe develop a system utility to shrink physical memory using shrink_all_memory, which is used during snapshot image creation in hibernation.
Reserving memory at boot time can reduce fragmentation to some extent.
This is good only if you have a large amount of RAM. Tracking the higher-order fragmentation level can help decide which memory to reserve in future. Maybe something like dynamic reservation based on past performance could be better.
Contiguous Memory Allocator (CMA) and Virtual Contiguous Memory (VCM) can help fight fragmentation.
CMA: similar to reserving memory at boot time, but it transparently allows the reserved memory to be reused and later migrates pages to recreate a contiguous chunk. It can be used for frame buffers and other memory-hungry multimedia devices.
But CMA requires pages to be movable and may now use compaction, which again is not guaranteed to work because most kernel pages are not movable.
Maybe we have to limit the reuse of the CMA region, or share the CMA region only for very high orders.
The problem is easily reproducible on a DRAM + Flash swapless embedded system without HighMem. Further investigation of combinations of these approaches is in progress to derive a solution that does not require a reboot.
Some References
1. Wikipedia, Buddy Memory Allocation. http://en.wikipedia.org/wiki/Buddy_memory_allocation
2. Jonathan Corbet (2010), Memory Compaction. http://lwn.net/Articles/368869/
3. Lifting The Earth (2011), Linux Page Allocation Failure. http://www.linuxsmiths.com/blog/
4. Mark S. Johnstone and Paul R. Wilson (1997), The Memory Fragmentation Problem: Solved?
5. Mel Gorman and Patrick Healy (2005), Measuring the Impact of the Linux Memory Manager. http://thomas.enix.org/pub/rmll2005/rmll2005-gorman.pdf
6. Corbet (2004), Kswapd and higher-order allocations. http://lwn.net/Articles/101230/