A Study On Dynamic Memory Allocation Mechanisms For Small Block Sizes in Real-Time Embedded Systems
University of Oulu
Department of Information Processing Science
Master's Thesis
Valtteri Heikkilä
17.12.2012
Abstract
Embedded real-time and battery-powered systems are increasing in numbers, and their
software complexity is growing. This creates a demand for more efficient dynamic
memory allocation in real-time embedded systems. Small improvements in dynamic memory allocation can greatly reduce a system's overall memory usage, fragmentation and energy consumption. Most of today's general-purpose allocators are unsuitable for real-time embedded systems since they are not designed for real-time constraints.
This thesis additionally introduces the Bitframe allocator, a new bitmapped fits allocator. The introduced allocator demonstrates that bitmapped fits can be used effectively for dynamic memory allocation. We are however unsure whether bitmapped fits can offer better efficiency than other mechanisms.
Our results confirm that TLSF is one of the best allocators for real-time systems in terms of performance and fragmentation. Our results also confirm that reaps have low fragmentation and a very low WCET when small blocks are allocated. Finally, our results show that the simple segregated storage and region mechanisms should not be used in real-time systems due to their high worst-case fragmentation.
Keywords
thesis, information processing science, algorithms, performance, memory management,
dynamic storage allocation, dynamic memory allocation, fragmentation, real-time
systems, embedded systems
Foreword
I was working on a Nintendo DS game project during 2009. As usual, the deadline was
looming close and we still had a bunch of critical issues to fix. One showstopper issue
was a random crash which happened after some minutes of gameplay. Other major
issues were long loading times and occasional frame skipping.
Debugging revealed that all of the issues were caused by the DMA. Our Lua scripting back-end allocated a whopping number of tiny blocks, peaking at roughly 1500 allocations per frame¹. The DMA just wasn't up to this task. A large number of tiny blocks were scattered around the heap and prevented large blocks from being allocated. This classic case of fragmentation was the source of the random crashing.
I was given the task of fixing the issues in the allocator, and I chose to create a custom allocator for efficiently allocating small memory blocks for Lua. This custom memory allocator was the first version of the Bitframe allocator presented in this thesis. The allocator worked better than expected and solved all the issues. The performance of our Lua scripting back-end improved considerably and the random crashing disappeared.
Special thanks go to my lovely Lion for her endless support. This thesis would not be here without you, Xiaojie.
Valtteri Heikkilä
¹ This is a large number of operations on the Nintendo DS since games typically show 60 frames per second and the main processor runs at 32 MHz.
Abbreviations
Contents
Abstract....................................................................................................................2
Foreword..................................................................................................................3
Abbreviations...........................................................................................................4
Contents....................................................................................................................5
1. Introduction..........................................................................................7
1.1 Research topic...............................................................................................8
1.2 Limitations and assumptions.........................................................................9
1.3 Thesis structure............................................................................................10
2. Background........................................................................................11
2.1 Static and stack-dynamic memory allocation..............................................11
2.2 Dynamic memory allocation.......................................................................11
2.3 Allocator strategy, policy and mechanism..................................................12
2.4 Fragmentation and wasted memory.............................................................13
2.4.1 Quantifying fragmentation...............................................................13
2.5 Special requirements from real-time embedded systems............................15
2.6 Related work................................................................................................16
3. Allocation Mechanisms......................................................................18
3.1 Low-level mechanisms................................................................................18
3.1.1 Free lists and link fields...................................................................18
3.1.2 Block headers...................................................................................19
3.1.3 Coalescing and splitting...................................................................19
3.1.4 Deferred coalescing.........................................................................19
3.1.5 Lookup tables...................................................................................20
3.1.6 Bitmaps............................................................................................20
3.1.7 Pointer bumping...............................................................................20
3.1.8 Special treatment of small blocks....................................................21
3.2 Basic allocator mechanisms........................................................................21
3.2.1 Sequential fits...................................................................................21
3.2.2 Segregated free lists.........................................................................22
3.2.3 Buddy systems.................................................................................23
3.2.4 Indexed fits.......................................................................................24
3.2.5 Bitmapped fits..................................................................................25
3.2.6 Analysis on real-time use of basic mechanisms..............................25
3.3 Other allocation mechanisms......................................................................26
3.3.1 BIBOP..............................................................................................27
3.3.2 Regions............................................................................................27
3.3.3 Reaps................................................................................................28
4. Small Block Allocation Mechanisms in General-Purpose Allocators...............29
4.1 Motivation...................................................................................................29
4.2 Allocator descriptions and analysis.............................................................30
4.2.1 Dlmalloc...........................................................................................30
4.2.2 Half fit..............................................................................................31
4.2.3 Hoard................................................................................................31
4.2.4 Jemalloc...........................................................................................31
4.2.5 Kingsley allocator............................................................................33
4.2.6 TLSF................................................................................................33
4.3 Summary.....................................................................................................35
5. Bitframe Allocator Description..........................................................37
5.1 Lookup tables..............................................................................................37
5.2 Bitframe data structure................................................................................38
5.3 Bitframe page..............................................................................................39
5.4 Bitframe size classes....................................................................................40
5.5 Allocate operation.......................................................................................40
5.6 Free operation..............................................................................................41
5.7 Analysis.......................................................................................................42
5.8 Conclusion...................................................................................................42
6. Simulation and Evaluation.................................................................43
6.1 Memory traces.............................................................................................44
6.2 Allocator implementations..........................................................................46
6.3 Worst-case analysis of the allocator implementations................................47
6.4 Timing measurement method......................................................................48
6.5 Fragmentation measurement method..........................................................49
6.6 Timing results..............................................................................................50
6.7 Timing results analysis and evaluation........................................................53
6.8 Fragmentation results..................................................................................54
6.9 Fragmentation results analysis and evaluation............................................59
6.10 Evaluation..................................................................................................61
7. Conclusion..........................................................................................63
References..............................................................................................................64
1. Introduction
Dynamic storage allocation (DSA) has been a fundamental part of most computer systems since the 1960s, and it has been a part of operating systems research ever since. As with the topics of searching and sorting, there exists a large amount of research on DSA, and the topic is widely considered to be either solved or unsolvable. (Wilson, Johnstone, Neely & Boles, 1995a, pp. 1, 4; Masmano, Ripoll, Crespo & Real, 2004, p. 1)
A dynamic memory allocator (DMA), often referred to simply as an allocator, is a DSA implementation for memory allocation. The goal of a DMA is to provide memory dynamically to the application at run time. It has been measured that up to 60% of application running time
can be spent in DMA, and a central aim in research is to find efficient algorithms which
both balance and minimize the time and storage costs of the DMA. (Masmano et al.,
2004, p. 79; Hasan & Chang, 2005, pp. 35, 40; Berger, Zorn & McKinley, 2002, p. 4;
Risco-Martin, Colmenar, Atienza & Hidalgo, 2011, p. 755)
Many classic allocator designs were conceived in the 1960s, including sequential fits, buddy systems, simple segregated storage and segregated free lists. Modern computing is however very different: embedded systems are increasing in numbers, their software complexity is growing, and many of the new embedded systems are battery-powered. This development creates a demand for more efficient DMAs, especially to reduce the increasing energy consumption. (Wilson et al., 1995a, pp. 47, 70; Risco-Martin et al., 2011, p. 756; Zorn, 2010, pp. 47-49) Even a small improvement in system
DMA could have a large impact on energy consumption. When less processing and
memory is used, clock speeds, memory sizes and overall chip area could be reduced.
Because of the vast number of active computing systems today, improvements in basic
DMA algorithms could have major indirect effects on a global scale. Small improvements could reduce the burden on the environment from electricity production, cooling and manufacturing. For these reasons, we believe that DMAs are an important
topic for current and future computing systems research. (Zorn, 2010, p. 49; Grunwald,
Zorn & Henderson, 1993, p. 179; Wilson et al., 1995a, p. 4)
Many modern embedded systems also need to operate under real-time constraints. There exists however considerably less research on DMA for real-time systems. Most general-purpose DMAs are unsuitable for real-time systems because they may have an unpredictable or long worst-case execution time (WCET) (Masmano et al., 2004, p. 79), and hence more research is needed to find more suitable DMAs.
Research additionally shows that a large majority of allocations are of small sizes (Zorn & Grunwald, 1992, p. 4; Grunwald, Zorn & Henderson, 1993, p. 184; Wilson et al., 1995a, pp. 28, 36; Berger, Zorn & McKinley, 2002, p. 8; Hasan & Chang, 2005, pp. 45-46; Lee, Chang & Hasan, 2000, p. 391). There is however little research focusing specifically on small block size allocation. This is unfortunate, since small blocks are allocated extensively by dynamic and object-oriented programming languages (Detlefs, Dosser & Zorn, 1994, p. 530; Chang, Hasan & Lee, 2000, p. 7; Risco-Martin et al., 2011, p. 755; Lee, Chang & Hasan, 2000, pp. 387, 391). These languages are becoming
more and more prevalent, and their performance largely depends on the performance of
DMA.
Research has also shown that the implementation of the allocator mechanism is the main source of wasted memory in DMA (Wilson et al., 1995a; Johnstone & Wilson, 1998, pp. 26, 32, 35-36; Masmano et al., 2008a, p. 156). For this reason, we have focused primarily on allocator mechanisms, which are the building blocks behind all DMA. With this work, we want to provide new knowledge on the topics of real-time embedded systems DMA and small block allocation, and to contribute to the research on DMA mechanisms. We have noticed that these specific topics have less coverage in DMA research. Our research problem is:

“Which DMA mechanisms are suitable for small block² allocation in real-time embedded systems, and what are the tradeoffs between the mechanisms?”
The research problem can be divided into the following four research questions:
The first two questions focus on small block allocation mechanisms and their suitability for real-time embedded systems. We will answer these questions with a literature survey of DMA mechanisms and by analyzing existing allocator implementations and their source code. Questions 3 and 4 address allocator performance and efficiency, and these questions are answered by quantitative means. We will perform simulations and measurements on implementations of allocator mechanisms, and present an analysis and evaluation of their efficiency. The measurements are done with both real and synthetic allocation traces.
We will additionally introduce a new bitmapped fits allocator designed for small block
allocation in real-time embedded systems. We will evaluate and compare the allocator
along with the other mechanisms in the simulation and evaluation part.
Most of this thesis is centered on DMA mechanisms. We are not specifically interested in DMA policy or strategy, or in general-purpose allocator implementations. We will contribute to the knowledge of various DMA mechanisms, their efficiency and characteristics, and their suitability for real-time systems and small block allocation.
² In this study, we define chunks as small areas of memory. Blocks are chunks which are associated with or referenced by either the DMA or the application.
1.2 Limitations and assumptions

This thesis is limited to allocators that are written in C and behave similarly to the C standard functions malloc and free³. The other C standard functions, calloc and realloc, are not studied, because they can be implemented with the malloc and free functions, and additionally because they are very rarely used (Chang, Hasan & Lee, 2000, p. 11). The C++ new and delete essentially use malloc and free (Hasan & Chang, 2005, p. 36).

The allocation operation takes the size of the allocation (in bytes) as its only parameter, and returns a single memory address to the allocated block. A zero return value means that the allocation failed. The free operation takes the memory address of a block as its only parameter, and does not return a value. Additionally, block addresses are aligned to 8 bytes. This is a common block alignment for the C language on 32-bit systems (Evans, 2006, p. 5). This generally means that the possible size classes in the allocator are 8 bytes apart (8, 16, 24, ...).
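To make the assumed interface concrete, the following minimal sketch shows the operation signatures used throughout the thesis. The names alloc_block and free_block are illustrative stand-ins, not part of any specific allocator; the semantics mirror the C standard malloc and free as described above.

```c
/* A minimal sketch of the allocator interface assumed in this thesis.
   The names are illustrative; semantics mirror malloc and free. */
#include <stddef.h>

/* Returns the address of an 8-byte-aligned block of at least 'size'
   bytes, or 0 (NULL) if the allocation failed. */
void *alloc_block(size_t size);

/* Releases a block previously returned by alloc_block; no return value. */
void free_block(void *block);
```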
We assume that a memory management unit (MMU), paging and virtual memory are unavailable in the system, because real-time embedded systems often do not have an MMU (Masmano et al., 2004, pp. 80, 82; Puaut, 2002, p. 42). All pointers in this thesis are thus in a linear (or physical) address space. We also assume a 32-bit system and address space.
We also require that the allocators can perform without the aid of an operating system (OS) (for example, the functions sbrk and mmap in Unix). The allocators are initialized to use a specified address space belonging to the allocator, and the allocator has full privileges to this address space. We also assume that the heap size cannot grow, and expect the heap size to be small (Masmano et al., 2004, p. 82).
This thesis assumes that the allocators do not know how the allocated memory blocks are used, what information is stored in the blocks, or the lifetime (the duration between allocation and deallocation) of the blocks. None of the allocators examine the block contents. Neither do the allocators relocate blocks to compact memory.
We do not discuss the topic of custom dynamic memory allocators. However, in the light of current research, we believe that custom allocators may be more efficient than general-purpose allocators.
³ Some use “deallocate” as a synonym for “free”, but we will use “free”.
Topics such as garbage collection, reference counting, and automatic and implicit memory management are not directly discussed in this thesis. The topics are however closely related. Dynamic memory allocation algorithms are fundamental to programming environments with automatic or implicit memory management, and we believe the algorithms play a key role in the runtime performance of these environments. We also believe that the popularity of programming languages such as JavaScript and Python shows that the topic of dynamic memory allocation is increasingly important.
2. Background
This chapter introduces key memory management concepts, such as dynamic memory allocation, fragmentation, and the allocator strategy-policy-mechanism model. We also discuss the special constraints of real-time embedded systems which affect DMA design and implementation. We will refer to the introduced concepts throughout the thesis.
For the most part of this chapter, we rely on the survey by Wilson, Johnstone, Neely and Boles (1995a). This thorough survey contains an extensive review of past dynamic memory allocation literature. It also presents models and categorizations which are used in this research. We are not aware of later works in the field of this magnitude.
2.1 Static and stack-dynamic memory allocation

Static and stack-dynamic memory allocation impose rigid constraints on how programs may use memory. Many algorithms cannot be implemented with these methods alone. For example, an algorithm that needs to free memory in a different order than it was allocated is feasible with neither of the methods.
2.2 Dynamic memory allocation

The memory of a DMA is maintained in a memory area called the heap⁴, which defines a storage where memory chunks of various sizes can be allocated. Most modern OSes map the heap memory as shown in figure 1. The heap usually grows upward while the stack grows downward, and the brk pointer records where the current heap ends. This mapping generally requires virtual memory support; when this support is not available, the heap might not be able to grow.
Efficient DMA design requires careful balancing of time and space costs. The main
design goal in DMA is however to minimize space costs, and this is often more difficult
than designing for low execution time. More research is generally needed to find new
DMA algorithms with lower storage costs. (Wilson et al., 1995a, p. 5; Hasan & Chang,
2005, p. 40; Chang, Hasan & Lee, 2000, p. 8; Puaut, 2002, p. 49; Grunwald, Zorn &
Henderson, 1993, p. 181)
2.3 Allocator strategy, policy and mechanism

Wilson and others describe allocators in terms of three levels, namely strategy, policy and mechanism, in decreasing order of abstraction. (Wilson et al., 1995a, pp. 6-7) We refer to this model frequently throughout the thesis.
Allocation strategy takes into account the regularities in the program behavior. It
determines a range of acceptable placement policies which define where to allocate
requested blocks. The strategy attempts to minimize fragmentation by selecting suitable
policies depending on the heap state. (Wilson et al., 1995a, pp. 6-7) For example, a best
fit policy always selects the block that most closely matches the allocation size.
Mechanism is a set of algorithms and data structures that implement the policy. The
mechanism is chosen to implement the policy efficiently in terms of time and space
complexity or overheads. (Wilson et al., 1995a, p. 7) As an example, a best fit policy
can be implemented by searching a linked list of available blocks to locate the closest
matching block.
2.4 Fragmentation and wasted memory

Internal fragmentation occurs when a block is allocated to hold an object, but the block
is larger than the allocation, and the remainder is wasted. Internal fragmentation is
defined as wasted memory inside an allocated block. (Wilson et al., 1995a, pp. 8-9;
Peterson & Norman, 1977, p. 424; Masmano et al., 2008a, p. 156; Ogasawara, 1995, p.
22)
External fragmentation occurs when free blocks of memory are available for allocation
but are too small (or otherwise unable) to hold future allocations. This situation is
caused by “holes” in the heap coming from isolated free blocks. (Wilson et al., 1995a,
pp. 8-9; Masmano et al., 2008a, p. 156; Ogasawara, 1995, p. 22)
It has been shown that fragmentation is not a serious issue for a large majority of
programs and that wasted memory is mainly caused by implementation overhead from
allocation mechanisms – not by fragmentation or allocation policy. There are good
allocation policies which have been shown to be efficient, and there are good
mechanisms to implement them. (Wilson et al., 1995a; Johnstone & Wilson, 1998, pp.
26, 32, 35-36; Masmano et al., 2008a, p. 156)
2.4.1 Quantifying fragmentation

Fragmentation depends on the behavior of the application and of the DMA: on the past and present allocated block sizes, their distribution, their quantity, and the order in which allocations are made.
There are many methods for measuring fragmentation in DMA, and there is no single
correct way to measure fragmentation. (Wilson et al., 1995a, pp. 14-15; Puaut, 2002, p.
49; Peterson & Norman, 1977, pp. 424, 426; Masmano et al., 2008a, p. 156; Johnstone
& Wilson, 1998, p. 32) Researchers also use different methods to quantify
fragmentation in the experiments, and this makes the comparison of results difficult.
Due to the complexity of the fragmentation problem domain, research cannot generally rely solely on analytical methods to quantify fragmentation (Peterson & Norman, 1977, p. 426; Puaut, 2002, p. 49). The methodology relies instead on simulation and measurements. DMA experimentation often involves the construction of a working DMA implementation, or a model of it, and simulation with traces of allocation and free requests. The traces come from two sources: real programs and synthesis. Traces from real programs correspond better to real-world use than synthetic traces. Traces can be synthesized with different methods, of which probabilistic methods are the most common; probabilistic methods are also combined with more complex payload models. Synthetic traces reveal information on the worst-case behavior of a DMA, whereas traces from real programs show DMA behavior under real-world use. Synthetic traces are preferred when evaluating real-time DMA because the worst-case behavior needs to be understood. (Masmano et al., 2004, p. 86; Masmano et al., 2008a, p. 162)
Johnstone and Wilson (1998, p. 32) summarize four methods for calculating fragmentation. Two of the methods are essentially the same as the ones presented in (Peterson & Norman, 1977, p. 424) and (Detlefs, Dosser & Zorn, 1994, p. 535). Our research will use the following method from (Masmano et al., 2008a, pp. 157, 163; Masmano et al., 2006, p. 72):
F = (H − M) / M        (1)
Here F is the fragmentation, H is the maximum memory used by the DMA, and M is the
maximum allocated live memory used by the trace during simulation. The
fragmentation calculation is illustrated by figure 2. In the figure, point 1 corresponds to
the location of maximum memory used by the DMA (H), and point 2 corresponds to the
location of maximum allocated live memory (M).
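As a worked example with hypothetical numbers: if the DMA uses at most H = 120 KB of memory during a simulation, and the maximum live memory allocated by the trace is M = 100 KB, then F = (120 − 100) / 100 = 0.2, that is, 20% fragmentation.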
2.5 Special requirements from real-time embedded systems

Most DMAs are unsuitable for real-time systems since they are designed for low
average execution time and not for low WCET. Real-time systems need to ensure fast
response time, and for this it is necessary to determine the WCET of all running code in
the system. The WCET needs to be low enough to meet the requirements in the system
response time. DMA algorithms with O(1) time complexity are generally considered to
be most suitable for real-time systems. (Masmano et al., 2008a, p. 175; Masmano et al.,
2004, p. 79; Nilsen & Gao, 1995, p. 151; Ogasawara, 1995, p. 22)
Many real-time systems are not allowed to exhibit unreliable behavior under any
circumstance, and they may execute for weeks, months and even years. This makes
fragmentation a relevant issue for real-time DMA. A badly designed DMA will
accumulate fragmentation over time, which may lead to unpredictable system behavior
and response times. Systems developers often fear that DMA is too unreliable for real-
time systems, and generally try to avoid it whenever possible. (Masmano et al., 2008a,
p. 152; Masmano et al., 2004, pp. 79-81; Puaut, 2002, pp. 41, 46; Ogasawara, 1995, p.
21; Nilsen & Gao, 1995, p. 151)
The special nature of real-time systems imposes strict requirements on DMA. A good summary of these requirements is presented in (Masmano et al., 2008a, pp. 152, 175-176; Masmano et al., 2006, p. 69; Masmano et al., 2004, p. 80). They define the requirements for real-time DMA as the following:
• Bounded execution time. The WCET of DMA operations must be bounded and known. This requirement is mandatory.
• Low fragmentation. The memory wasted to fragmentation must be bounded and low. This requirement is mandatory.
• Fast completion time. The WCET of DMA operations must be short. This requirement is not mandatory.
The requirements of bounded WCET and low fragmentation are thus considered mandatory, while the requirement for low WCET is only preferred. A low WCET is naturally more desirable, and it defines the usability of the DMA in real systems.
2.6 Related work

Nilsen and Gao (1995) performed measurements on several general-purpose C and C++
allocators to determine their suitability for real-time use. Some of the measured
allocators are well established allocators, such as dlmalloc and the SunOS allocator.
Relying on their measurements, they conclude that allocators utilizing traditional
methods are unusable in real-time systems. (Nilsen & Gao, 1995, pp. 143, 151)
Ogasawara (1995) introduces Half fit, an allocator with O(1) time complexity. The
allocator is shown to have bounded WCET, and also lower fragmentation than binary
buddies under synthetic trace experiments. The study emphasizes the suitability of O(1)
time algorithms for real-time system DMA. (Ogasawara, 1995, pp. 21, 24) The Half fit
allocator is described in section 4.2.2.
Puaut (2002) presents measurements and analysis of the timing behavior of different real-time DMAs, using both real and synthetic payloads. Worst-case behavior is obtained
using synthetic payloads. The study shows that WCET obtained analytically is a
pessimistic and context-independent estimate, while WCET obtained empirically is
context-sensitive and less pessimistic. The study also notes that allocators with a low
average execution time may not have a low WCET. (Puaut, 2002, pp. 45, 47-49)
Masmano, Ripoll, Crespo and Real (2004) introduce TLSF, a real-time system DMA with O(1) time complexity, bounded WCET and low worst-case fragmentation. They also perform a brief evaluation of various DMAs and TLSF with various synthetic worst-case workloads. They conclude that TLSF has excellent WCET and fragmentation
characteristics. (Masmano et al., 2004, pp. 79, 86-88) The TLSF allocator is further
discussed in section 4.2.6.
Masmano, Ripoll and Crespo (2006) continued the experimentation of their previous study from 2004. The new study compares TLSF with other allocators: first fit, best fit, binary buddies, dlmalloc, and Half fit. They design a custom model to synthesize workloads for the allocator experiments. Based on these experiments, the authors conclude that first fit, best fit, dlmalloc, and, surprisingly, also binary buddies are not suitable for real-time systems. (Masmano et al., 2006, pp. 68-69, 70-71, 75)
A later study by Masmano, Ripoll, Balbastre and Crespo (2008a) repeats the experiments from the 2006 study, and includes the following DMAs: first fit, best fit, AVL tree, binary buddies, dlmalloc, Half fit and TLSF. This time both real and synthetic workloads were used to cover real and worst-case scenarios. First fit, best fit and dlmalloc are not recommended for real-time use due to high WCET. TLSF and Half fit are evaluated as the best for real-time use. The AVL tree is shown to have a higher WCET than binary buddies. Half fit and binary buddies are shown to have high but acceptable worst-case fragmentation. (Masmano et al., 2008a, pp. 161-162, 164, 168-169, 173)
3. Allocation Mechanisms
This chapter discusses categories of allocation mechanisms in three main sections. The first and second sections discuss low-level and basic allocation mechanisms, while the third section discusses other allocation mechanisms. This chapter follows the categorization in (Wilson et al., 1995a). We make some additions, mainly in the low-level and other mechanisms sections, to present and analyze some mechanisms relevant to real-time DMA.
3.1 Low-level mechanisms

3.1.1 Free lists and link fields

Figure 3. Illustration of a sequential free list. Sequences of white rectangles represent free blocks. Arrows represent links between free blocks. (Hasan & Chang, 2005, p. 37)
In order to form a free list, free blocks need to contain link fields (Wilson et al., 1995a, p. 28). Both doubly and singly linked lists are used in free lists depending on the requirements, and the link fields are essentially linked list nodes. Allocated blocks often store a minimal set of link fields to reduce the overhead per block (Wilson et al., 1995a, p. 28; Knuth, 1973, p. 436).
3.1.2 Block headers

Header fields store information relevant to the memory blocks, for example the block size and the allocated/free status. Boundary tags are used to mark the starts and ends of blocks to track write overflows and free list corruption; modern implementations of boundary tags often omit the end tag. Block alignment constrains the address and size of the memory blocks. The alignment is commonly one or more machine words, and it may be required by the system or hardware. The alignment can also be used to save bits from block header address fields⁵. (Wilson et al., 1995a, pp. 27-28) Block alignment also introduces padding bytes in block headers.
3.1.3 Coalescing and splitting

Figure 4. Illustration of coalescing. Blocks with lengths 8 and 4 can be coalesced whereas the block with length 2 cannot. (Hasan & Chang, 2005, p. 37)
Block splitting is the reverse of coalescing. It involves dividing a free block into two smaller blocks (Wilson et al., 1995a, p. 9). Splitting is usually performed when no suitably small block is found to satisfy an allocation operation. In this situation, a large free block is selected and split in two parts: one part is used to satisfy the allocation, and the other is put on a free list for later use.
⁵ For example, a block alignment of 4 bytes frees 2 bits from link fields, since addresses are always aligned by 4. The freed bits can be used as flags by the allocator, for example to store the allocated/free state of a block.
3.1.4 Deferred coalescing

In deferred coalescing, freed blocks are returned to free lists without immediate coalescing, so that they can be reused directly. The rationale is that future allocations are often exactly the same size as previously freed ones, and thus repeated coalescing and splitting wastes time in the allocator. Deferred coalescing improves allocator performance, but increases fragmentation. (Hasan & Chang, 2005, pp. 37, 40, 47; Wilson et al., 1995a, pp. 15, 18, 22; Johnstone & Wilson, 1998, p. 35; Masmano et al., 2004, p. 82)
3.1.6 Bitmaps
A bitmap (or a bit table) is a vector of bits where each bit maps to a single data element.
The mapping is usually linear so that a bit index directly corresponds to an index of an
element in an array of elements. Many allocators use bitmaps to mark blocks or free
lists since bitmaps are size-efficient and fast to manipulate.
Allocators often scan for a one (or zero) bit in a bitmap, starting from a specified index. The scan is mostly performed with the bit-scan instructions found in the majority of processors. The instructions are efficient for accelerating bitmap scanning, and mostly execute in constant time. (Ogasawara, 1995, p. 23; Masmano et al., 2006, p. 69) Such instructions are, for example, CLZ (Count Leading Zeros) (ARM, 2010, Chapter 4, p. 54), BSF (Bit Scan Forward) and BSR (Bit Scan Reverse) (Intel, 2011, Chapter 3, pp. 92-97).
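As an illustration, the following minimal sketch finds the first non-empty free list at or above a given size class. The bitmap layout, the 32-class limit and the names are assumptions of the example, and the GCC/Clang intrinsic __builtin_ctz stands in for such a bit-scan instruction.

```c
/* Sketch: constant-time lookup of a non-empty free list using a
   bitmap and a bit-scan intrinsic. Assumes at most 32 size classes. */
#include <stdint.h>

static uint32_t freelist_bitmap; /* bit i set => free list i is non-empty */

/* Returns the index of the first non-empty list >= class_index,
   or -1 if none exists. class_index must be in 0..31. */
static int find_nonempty_list(unsigned class_index)
{
    uint32_t masked = freelist_bitmap & (~0u << class_index);
    if (masked == 0)
        return -1;
    return (int)__builtin_ctz(masked); /* index of the lowest set bit */
}
```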
Bitmaps are also sometimes searched for sequences of zeros (or ones) of a desired length. A basic search implementation is a bit-by-bit scan, which has time complexity O(N), where N is the size of the bitmap. Improved search algorithms however exist, which use lookup tables and bit manipulation techniques. For example, one 256-way lookup table can store the bit sequence lengths within an 8-bit sequence, and another can store the free run lengths across byte boundaries. (Wilson et al., 1995a, p. 42)
3.1.7 Pointer bumping

Pointer bumping allocates blocks by simply advancing a pointer through a range of memory. It is a common technique used at least by the region and reap mechanisms (Berger, Zorn & McKinley, 2002, p. 7).
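A minimal sketch of pointer bumping is shown below; the region_t layout and the names are illustrative, not taken from any particular allocator.

```c
/* Sketch of pointer bumping over a fixed memory range. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *next; /* current bump pointer */
    uint8_t *end;  /* first byte past the range */
} region_t;

static void *bump_alloc(region_t *r, size_t size)
{
    size = (size + 7u) & ~(size_t)7u;      /* keep 8-byte alignment */
    if ((size_t)(r->end - r->next) < size)
        return NULL;                       /* range exhausted */
    void *block = r->next;
    r->next += size;                       /* the "bump" */
    return block;
}
```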
3.1.8 Special treatment of small blocks

Special treatment of small blocks may reduce time and space costs in the average case, but not necessarily in the worst case. Additionally, fragmentation and WCET analysis may become more complicated when more than one allocation mechanism is used in an allocator. This reduces the usefulness of the method in real-time systems, and we believe this mechanism should not be used as a part of real-time DMA.
3.2 Basic allocator mechanisms

3.2.1 Sequential fits

Best fit
The sequential best fit allocator searches the full sequential free list, and returns the smallest available free block large enough to satisfy the allocation. The search is exhaustive, but may stop when a perfect fit is found. The sequential best fit allocator naturally implements a best fit policy: it always finds the best block to store the allocation. This policy is considered to be the best policy to minimize fragmentation. It minimizes the wasted space after a block split, and if a split is not made, it minimizes the wasted space inside the block. A best fit policy can be implemented more efficiently at least with indexed or segregated fits. (Johnstone & Wilson, 1998, pp. 27, 33; Masmano et al., 2004, p. 83; Robson, 1977, pp. 243-244; Hasan & Chang, 2005, p. 40; Wilson et al., 1995a, p. 30; Knuth, 1973, p. 437)
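The search can be sketched as follows, assuming a singly linked free list with a size field in each free node (an illustrative layout, not a specific allocator's).

```c
/* Sketch of sequential best fit: scan the whole free list and keep
   the smallest block that is large enough; stop early on a perfect fit. */
#include <stddef.h>

typedef struct free_block {
    size_t size;
    struct free_block *next;
} free_block;

static free_block *best_fit(free_block *head, size_t size)
{
    free_block *best = NULL;
    for (free_block *b = head; b != NULL; b = b->next) {
        if (b->size == size)
            return b;                                    /* perfect fit */
        if (b->size > size && (!best || b->size < best->size))
            best = b;                                    /* best so far */
    }
    return best; /* NULL when no block is large enough */
}
```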
First fit
First fit searches the sequential free list from the beginning, and uses the first block large enough to satisfy the allocation request. The block can be split if it is larger than necessary, and the remainder is put on the free list. The motivation for first fit is to reduce the average execution time of the allocator in comparison to best fit. A variant of first fit is address-ordered first fit, in which the free blocks are kept in the free list in address order. The insertion requires a search both when a block is allocated and when it is freed. Address-ordered first fit has low fragmentation similar to best fit, and it can be implemented efficiently using a Cartesian tree. (Wilson et al., 1995a, pp. 30-31; Hasan & Chang, 2005, p. 36; Johnstone & Wilson, 1998, pp. 33, 37; Wilson et al., 1995b, p. 34)
Next fit
Next fit is a variation of first fit: a pointer records the position in the free list where the last search was satisfied, and the next search continues from that position. The rationale is to decrease the average search time. It has however been shown that next fit actually increases the average search time compared to first fit. Next fit also suffers from worse fragmentation and locality than first fit and best fit. (Wilson et al., 1995a, p. 31; 1995b, p. 27; Bays, 1977, pp. 191-192)
Good fit
Good fit is an “almost best fit” policy (Masmano et al., 2004, p. 83), and it is not strictly a sequential fits mechanism. The good fit policy is common in segregated free list allocators, where the best fit search is often omitted and a block with an estimated best fit is used instead. The good fit policy has been shown to produce low fragmentation similar to best fit (Wilson et al., 1995a, p. 9; Masmano et al., 2008a, p. 156).
3.2.2 Segregated free lists

Segregated fits
The segregated fits mechanism uses multiple free lists, each holding blocks of a size class or size range. It performs coalescing and splitting, and may use deferred coalescing. Upon allocation, a segregated fits allocator chooses a suitable free list matching the requested size, and then usually searches the list sequentially for a suitable block. If there are no suitable blocks in the list, the next list with a larger size class is used, and so on, until a free block is found. The mechanism usually has a good fit or best fit policy.
(Wilson et al., 1995a, p. 37) Wilson and others (1995a, p. 37) define three subcategories
for segregated fits allocators:
Exact lists category allocators use a free list for each possible block size, of which there can be many. Accelerating data structures, such as binary trees, may be necessary to reduce the cost of finding a suitable free list for allocation. (Wilson et al., 1995a, p. 37) In practice, programming systems⁶ and hardware force block sizes to be multiples of some number of bytes, and thus the sizes are not truly “exact”.
Strict size classes with rounding category allocators maintain a number of segregated free lists where each list holds blocks of only one size class, and allocation sizes are rounded up to the next matching size class (Wilson et al., 1995a, p. 37). Because every free list contains blocks of one size class only, a sequential search is not needed, and allocation executes in constant time.
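With the 8-byte spacing assumed in this thesis, the rounding reduces to simple integer arithmetic, as the following sketch shows; the function name is illustrative.

```c
/* Sketch: map a request size to a size-class index when classes are
   8 bytes apart (8, 16, 24, ...). Class i holds blocks of 8*i bytes. */
static unsigned size_to_class(unsigned size)
{
    return (size + 7u) >> 3; /* round up to the next multiple of 8 */
}
```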
Size classes with range lists category allocators allow the free lists to contain blocks in a specified, larger size range. The allocator performs a sequential search on the matching free list in order to satisfy an allocation request. This allocator category was first introduced in a paper by Purdom, Stigler and Cheam (1971). (Wilson et al., 1995a, pp. 36-38)
3.2.3 Buddy systems

There are four well-known buddy system variants: binary buddies, Fibonacci buddies, weighted buddies, and double buddies. We will introduce these next. Other buddy systems exist, such as the tertiary buddies introduced by Yadav and Sharma (2010), but we do not discuss them further, since they do not seem to offer relevant benefits (Yadav & Sharma, 2010, p. 66).

⁶ For the C language, sizeof(double) is commonly used as the alignment for blocks returned by malloc(). A common alignment on 32-bit systems is 8 bytes (Feng & Berger, 2005, p. 70), and 64-bit systems generally use 16 bytes.
Binary buddies

Binary buddies is a well-known buddy system algorithm presented by Knowlton (1965). An often cited description of the algorithm is found in (Knuth, 1973, pp. 442-445). Binary buddies split blocks only in half and constrain the sizes to powers of two; hence a buddy pair can be located by complementing a single bit in a buddy's address. The buddies use block headers with size information, the free/allocated state, links to the previous and next blocks in a doubly linked list, and possibly boundary tags. (Wilson et al., 1995a, p. 40; Knowlton, 1965, pp. 623-625; Knuth, 1973, pp. 442-445; Purdom, Stigler & Cheam, 1971, p. 187; Peterson & Norman, 1977, p. 421; Ogasawara, 1995, p. 22)
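The buddy address calculation can be sketched as follows; taking offsets relative to the start of the buddy-managed area is an assumption of the example.

```c
/* Sketch of binary buddy location: for a block of size 2^k placed at
   'offset' bytes from the start of the managed area, the buddy's
   offset differs in exactly bit k. */
#include <stdint.h>

static uint32_t buddy_offset(uint32_t offset, unsigned k)
{
    return offset ^ ((uint32_t)1u << k); /* complement a single bit */
}
```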
Fibonacci buddies

Fibonacci buddies use the Fibonacci series as the source of buddy sizes. The algorithm was first introduced by Hirschberg (1973), and possibly originated from an exercise in (Knuth, 1973). The Fibonacci buddies mechanism splits blocks into two unequal sizes following the Fibonacci series L(i) = L(i-1) + L(i-2). The block sizes in Fibonacci buddies are more closely spaced than in binary buddies, which reduces internal fragmentation compared to binary buddies. Cranston and Thomas (1975) introduced a method for rapid buddy address calculation that is comparable to or slightly slower than the binary buddy address calculation. (Yadav & Sharma, 2010, pp. 63, 66; Wilson et al., 1995a, pp. 39, 49; Peterson & Norman, 1977, p. 421)
Weighted buddies
Weighted buddies were first introduced by Shen and Peterson (1974). The system uses a
custom size class series different from binary and Fibonacci buddy systems. The size
classes include powers of two, but in between them, there exist sizes that are three times
a power of two. For example, 2, 3, 4, 6, 8, 12… This means some sizes can be split in
two ways. The address calculation is however quite straightforward and fast. (Wilson et
al., 1995a, p. 40; Peterson & Norman, 1977, p. 421)
Double buddies

Double buddies were first introduced in (Wise, 1978). Double buddies offer a closer spacing of block sizes, which is accomplished by using two binary buddy systems with staggered sizes. For example, one binary buddy system could have the size classes 2, 4, 8, 16, ... and the other the sizes 3, 6, 12, 24, ... However, as with binary buddies, blocks in double buddies can only be split in half. (Wilson et al., 1995a, p. 40)
3.2.4 Indexed fits

Indexed fits use an indexing data structure to locate suitable free blocks. A classic example is Stephenson's (1983) allocator, which indexes free blocks with a Cartesian tree ordered by both address and size. Cartesian trees do not however necessarily maintain a good balance, and the search executes in worst-case O(N). (Wilson et al., 1995a, pp. 40-41; Stephenson, 1983, pp. 30-31)
3.2.5 Bitmapped fits

To allocate a block, a bitmapped fits allocator scans the bitmap to find a suitable sequence of free chunks, marks them allocated, and returns a block spanning the chunks. To free a block, the chunks hosting the block are marked free. (Wilson et al., 1995a, p. 42) The block size may be stored in a block header, or in another bitmap which marks the ends of the chunk sequences.
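The bookkeeping side can be sketched as follows, assuming one bit per fixed-size chunk and illustrative names.

```c
/* Sketch of bitmapped-fits bookkeeping: bit i is 1 when chunk i is
   allocated. Allocation marks a run of chunks; freeing clears it. */
#include <stdint.h>

static void mark_run(uint32_t *bitmap, unsigned first, unsigned count,
                     int allocated)
{
    for (unsigned i = first; i < first + count; i++) {
        if (allocated)
            bitmap[i / 32] |= (1u << (i % 32));
        else
            bitmap[i / 32] &= ~(1u << (i % 32));
    }
}
```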
In addition to bookkeeping, bitmaps can also be used for indexing akin to indexed fits.
Many modern allocators use bitmap indexing: Half fit (Ogasawara, 1995), TLSF
(Masmano et al., 2004) and jemalloc (Evans, 2006). This is probably due to the availability of efficient processor bit-scan instructions.
3.2.6 Analysis on real-time use of basic mechanisms

Sequential fits allocation executes in O(N), where N is the number of blocks on the free list. The search time increases when the free list grows, which makes the mechanism unsuitable for real-time use. Sequential fit policies however produce low fragmentation, and the best fit policy is generally considered to produce the least fragmentation of all known policies. A best fit or good fit policy can be implemented efficiently with indexed or segregated fits. (Wilson et al., 1995a, pp. 30-31, 33; Masmano et al., 2008a, pp. 166, 168; Masmano et al., 2004, p. 81; Hasan & Chang, 2005, p. 36; Puaut, 2002, p. 48; Johnstone & Wilson, 1998, pp. 27, 33-34, 36; Wilson et al., 1995b, p. 30)
All segregated free list mechanisms, except for the size classes with range lists mechanism, are acceptable for real-time use, since their search time is independent of the number of free blocks. Segregated storage allocators are all fast. Simple segregated storage (SSS) is likely the fastest, but it also has the worst fragmentation. (Masmano et al., 2004, p. 81; Hasan & Chang, 2005, pp. 44, 46-47; Wilson et al., 1995a, pp. 36-38; Grunwald, Zorn & Henderson, 1993, p. 185)
The timing behavior of buddy systems is predictable. Buddy system operations have time complexity O(log2 N), and the allocators are suitable for real-time applications. Buddy systems however suffer from high internal fragmentation. Research shows that the internal fragmentation of binary, Fibonacci, double and weighted buddies is usually in the range of 25-40%, and roughly 50% in the worst case, and that binary and weighted buddies exhibit higher fragmentation than Fibonacci and double buddies. The overall fragmentation of buddy systems may be acceptable for real-time DMA. (Masmano et al., 2008a, p. 166; Masmano et al., 2004, p. 81; Puaut, 2002, pp. 46, 48; Yadav & Sharma, 2010, p. 66; Hasan & Chang, 2005, p. 37; Peterson & Norman, 1977, pp. 421, 429; Johnstone & Wilson, 1998, pp. 28, 34; Wilson et al., 1995a, pp. 38-40)
Indexed fits can perform better than segregated free lists in terms of WCET (Masmano et al., 2004, p. 81). Hence indexed fits are suitable for real-time DMA, but the indexing data structure used by the allocator must ensure a low and bounded WCET. Indexed fits with bitmapped indexing are common, and bitmapped indexes are suitable for real-time allocators: bitmapped indexing executes in constant time when bit manipulation and bit-scan instructions are used (Masmano et al., 2004, p. 81; Hasan & Chang, 2005, p. 37).
Bitmapped fits (a bitmap used for chunk bookkeeping) are normally unsuitable for real-time use, because a bitmap scan generally performs in O(N) time, where N is the size of the bitmap. The scan can however be improved, for example with the techniques mentioned in section 3.1.6. Bitmapped fits have a constant overhead per chunk⁷, which could be used to reduce wasted memory in some implementations. (Wilson et al., 1995a, p. 42; Masmano et al., 2006, p. 69; Wilson et al., 1995b, p. 35)
Table 1. Summary of the suitability of basic allocator mechanisms for real-time DMA.

Sequential fits: Not suitable. Sequential fits execute in O(N), and this is unacceptable for real-time DMA.

Segregated free lists: Suitable, except for the size classes with range lists mechanism. Search time is independent of the number of free blocks.

Indexed fits: Suitable, but this depends on the indexing data structure and its search cost. A low and bounded WCET is acceptable.

Bitmapped fits: Generally not suitable. A bitmap scan normally executes in O(N), but the scan can be improved. A low and bounded WCET is acceptable.
3.3 Other allocation mechanisms

3.3.1 BIBOP
Big bag of pages (BIBOP) is originally an object typing mechanism used in MACLISP (Steele, 1977) and later in Chez Scheme. In BIBOP, dynamically allocated objects are contained in aligned equal-size pages that contain objects of a single type. The high bits of an object's address represent a page index, which can be used to look up the object's type. Since the type information is stored in the page instead of the objects, overhead is greatly reduced, especially when allocating small blocks. Large block allocation however does not necessarily benefit from BIBOP. (Steele, 1977, pp. 3-4; Dybvig, Eby & Bruggeman, 1994, pp. 5, 10, 13; Wilson et al., 1995a, p. 36; Schneider, Antonopoulos & Nikolopoulos, 2006, p. 85)
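The address-to-metadata lookup can be sketched as follows, here using a fixed page table (one of the three storage variants discussed below); the 4 KB page size and the names are assumptions of the example.

```c
/* Sketch of a BIBOP lookup with a fixed page table: the page index is
   taken from the address's high bits, and per-page metadata (here, the
   block size used in the page) is read with one table access. Note
   that the fixed table itself costs memory: 2^20 entries for 4 KB
   pages in a 32-bit address space. */
#include <stdint.h>

#define PAGE_SHIFT 12u /* 4 KB pages, an assumption of this sketch */

static uint16_t page_block_size[1u << (32u - PAGE_SHIFT)];

static uint16_t block_size_of(uintptr_t addr)
{
    return page_block_size[addr >> PAGE_SHIFT];
}
```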
The BIBOP mechanism needs to store page information such as the object type or block size. Dybvig, Eby and Bruggeman (1994) present three ways to do this: fixed page tables, dynamic page tables and page headers. Fixed page tables use a static table to record the information of each page; since the full table is not usually utilized by all applications, the majority of the page table remains unused. A dynamic page table is similar to a fixed page table, except that the table can grow and relocate to waste less memory. Page headers store the relevant information in a page head. (Dybvig, Eby & Bruggeman, 1994, p. 8)

Only the fixed page table and page headers are suitable for real-time DMA, since they have predictable behavior. Dynamic page tables may need to relocate, and this requires a memory copy operation. Page headers are the most scalable alternative, but have worse locality characteristics (Dybvig, Eby & Bruggeman, 1994, p. 8).
BIBOP mechanism and its variations are used in many allocators. For example, region
(see 3.3.2) and reap mechanisms (see 3.3.3) generally assume BIBOP. Also general-
purpose allocators such as jemalloc (see 4.2.4) and Hoard (see 4.2.3) utilize BIBOP.
3.3.2 Regions
Regions (also known as arenas, groups or zones) allocate blocks simply by bumping a pointer across a range of memory. Blocks cannot be freed individually; instead, the entire region can be freed when none of its blocks are in use. Region allocation and free operations are very fast. (Berger, Zorn & McKinley, 2002, pp. 1-2, 5) Regions can be allocated in pages, and a free counter can be used to count the free operations in the region. Region allocation is illustrated in figure 6.
The inability to free individual blocks complicates the use of regions in some applications. Additionally, regions may considerably increase memory consumption compared to other mechanisms, since a region cannot be freed until all of its blocks are unused. Compilers and parsers however may benefit greatly from regions. (Berger, Zorn & McKinley, 2002, pp. 2, 4-7, 9)
3.3.3 Reaps
Reaps were introduced by Berger, Zorn and McKinley (2002). Reaps combine the favorable features of regions and heaps: they add to regions the possibility to free individual blocks anywhere inside the region without compromising performance. Reaps have been shown to reduce memory consumption compared to regions. The reaps mechanism is used in the Hoard allocator. (Berger, Zorn & McKinley, 2002, pp. 1, 11)

Reaps first allocate memory like regions, with pointer bumping. When an individual block is freed, it is put on an associated free list. Reaps allocate memory in pages; when a page becomes full, another page is allocated and the reap allocator returns to the region style of operation. The original reap method adds block headers to every block. (Berger, Zorn & McKinley, 2002, p. 7) However, block headers are not necessary if the BIBOP mechanism is used (Schneider, Antonopoulos & Nikolopoulos, 2006, pp. 85, 87). Reap allocation is illustrated in figure 7.
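The combined behavior can be sketched as follows, assuming one block size per reap page and illustrative names; the original reap design differs in details such as per-block headers.

```c
/* Sketch of reap-style operation: reuse a freed block from the free
   list when possible, otherwise fall back to region-style pointer
   bumping in the current page. Assumes one block size per page and
   block_size >= sizeof(reap_link). */
#include <stddef.h>
#include <stdint.h>

typedef struct reap_link { struct reap_link *next; } reap_link;

typedef struct {
    reap_link *free_list;  /* blocks freed inside the reap */
    uint8_t   *next;       /* bump pointer in the current page */
    uint8_t   *end;        /* end of the current page */
    size_t     block_size;
} reap_t;

static void *reap_alloc(reap_t *r)
{
    if (r->free_list) {                   /* heap-like reuse first */
        void *block = r->free_list;
        r->free_list = r->free_list->next;
        return block;
    }
    if ((size_t)(r->end - r->next) < r->block_size)
        return NULL;                      /* caller attaches a new page */
    void *block = r->next;
    r->next += r->block_size;             /* region-like pointer bumping */
    return block;
}

static void reap_free(reap_t *r, void *block)
{
    reap_link *l = block;                 /* reuse block memory as a link */
    l->next = r->free_list;
    r->free_list = l;
}
```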
4. Small Block Allocation Mechanisms in General-Purpose Allocators

This chapter contains an analysis of the mechanisms used for small block allocation in various general-purpose allocators. We will provide a short description of each general-purpose allocator and its small block allocation mechanisms. We will then summarize the mechanisms, and finally analyze the suitability of the mechanisms for real-time use. This analysis is used as the basis for selecting the mechanisms for the framework implementation in chapter 6.
We initially wanted to focus only on allocators designed for real-time systems, but because there exist only a few such allocators (Half fit and TLSF), we decided to broaden our scope to general-purpose allocators. We tried to select the most well-known or otherwise prominent allocators for analysis.
4.1 Motivation
It has been confirmed by multiple authors that modern programs make mostly small allocations (Wilson et al., 1995a, p. 36). Berger, Zorn and McKinley (2002, p. 8) measured memory use in various programs, and show that 88% of allocations are under 64 bytes and almost all (99.54%) are under 256 bytes. Similarly, Lee, Chang and Hasan (2000, p. 391) report that 90% of allocations are below 512 bytes, and that they usually have a short life-span. Small blocks are usually allocated in large quantities and large blocks in smaller quantities (Berger, Zorn & McKinley, 2002, p. 8; Hasan & Chang, 2005, pp. 45-46).
Multiple authors have studied the most common allocation sizes in programs. Wilson and others (1995a, p. 28) state that sizes average on the order of 10 machine words (40 bytes on a 32-bit machine). Measurements by Zorn and Grunwald (1992, p. 4) show that, for various programs, the most common allocation size is smaller than 32 bytes, and that the median block size is from 14 to 32 bytes. Later research by the same authors confirms the result with other programs (Grunwald, Zorn & Henderson, 1993, p. 184). Measurements by Detlefs, Dosser and Zorn (1994, p. 530) show that 39.6 bytes is the median of the average allocation sizes in various large C and C++ programs.
C++ programs naturally tend to allocate large quantities of objects⁸ from a few size classes. In C++ programs, the main sources of allocations are constructors, copy constructors and the overloaded assignment operator=. C++ programs may also use up to 38% of their total runtime in DMA. (Chang, Hasan & Lee, 2000, p. 7; Risco-Martin et al., 2011, p. 755; Lee, Chang & Hasan, 2000, pp. 387, 391)

⁸ A C++ application may allocate 20 times more memory than an equivalent C application (Hasan & Chang, 2005, p. 36).
Block headers are the main source of overhead when small blocks are allocated. A single word in a block header or footer can increase memory usage by 10% to 20% (Wilson et al., 1995a, pp. 28, 36). For example, if the block header is 4 bytes, the alignment is 8 bytes, and we allocate a 32-byte block, then the real size of the block is 40 bytes (= 32 + 4 + 4, since the alignment adds 4 padding bytes), and thus the resulting overhead from the header is 20% (= 8/40). Since a great majority of allocations are small, this scenario is very frequent. Wasted memory can however be reduced by using mechanisms with a smaller overhead per block, or by eliminating block headers altogether. Some of these mechanisms were described in section 3.3. Additionally, bitmapped fits have low overhead.
4.2 Allocator descriptions and analysis

The following allocators were excluded from the analysis: CustoMalloc (Grunwald & Zorn, 1993), PHKmalloc (Kamp, n.d.), QuickFit (Weinstock & Wulf, 1988), the Slab allocator (Bonwick, 1994), and the Zone allocator (Van Sciver & Rashid, 1990). According to Bonwick (1994, p. 4), the QuickFit and CustoMalloc allocators require a priori knowledge of the common allocation sizes. The Slab and Zone allocators also require client-driven (application-specific) customization, and because of this they are not general-purpose allocators by our definition. The Slab allocator is additionally a kernel allocator (Bonwick, 1994, p. 11). The excluded allocators either share the same functionality with the ones analyzed, or were omitted because of the limited scope of this study.
4.2.1 Dlmalloc
Doug Lea's general-purpose allocator is a well-known and established allocator, frequently addressed in the research literature. Dlmalloc is claimed to be an all-around general-purpose allocator with good average execution time and low fragmentation. It uses three categories of allocation sizes: small, medium and large, where small blocks are managed with segregated free lists. The allocator also defers the coalescing of freed blocks. (Johnstone & Wilson, 1998, pp. 28, 36; Berger, Zorn & McKinley, 2002, pp. 2, 11; Masmano et al., 2004, p. 81; Risco-Martin et al., 2011, p. 756; Chang, Hasan & Lee, 2000, p. 8)
Masmano and others (2008a) however show that dlmalloc has a very high WCET, and claim that it executes in O(N). Their measurements support a further claim that dlmalloc should not be used in real-time applications. (Masmano et al., 2008a, pp. 166, 168, 175) The current version of dlmalloc (Lea, 2011) however seems to have a configuration option for real-time systems. Unfortunately we noticed this too late, and the allocator was not included in our experiments.
4.2.2 Half fit

The Half fit allocator maintains segregated free lists with blocks of sizes 2^k .. 2^(k+1) − 1. It marks each of the lists empty or non-empty in a one-word bitmap, which is searched in constant time using bit-scan instructions. The bit-scan automatically uses the list of the next available size if the list of the requested size is empty. All blocks in the allocator have headers that contain at least the links for a doubly linked list. Efficient immediate coalescing and splitting is performed. The allocator has no special treatment for small blocks. (Ogasawara, 1995, p. 23; Masmano et al., 2008a, pp. 152-153, 175; Masmano et al., 2006, p. 69; Masmano et al., 2004, p. 81)
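For illustration, the list selection on allocation can be sketched as follows: since list i holds blocks of sizes [2^i, 2^(i+1) − 1], the first list guaranteed to satisfy a request is i = ceil(log2(size)). The intrinsic __builtin_clz stands in for a CLZ instruction, and the names are illustrative.

```c
/* Sketch of Half fit's list selection for an allocation of 'size'
   bytes: compute ceil(log2(size)) with a bit-scan, so the bitmap
   search can start from a list whose blocks are guaranteed to fit. */
#include <stdint.h>

static unsigned halffit_search_index(uint32_t size)
{
    if (size <= 1)
        return 0;
    return 32u - (unsigned)__builtin_clz(size - 1); /* ceil(log2(size)) */
}
```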
The Half fit allocator has a lower WCET than TLSF, but higher worst-case fragmentation. Otherwise it has significant similarities with TLSF. (Masmano et al., 2008a, pp. 150, 168; Masmano et al., 2006, pp. 73-74) Because of the similarities with TLSF, and Half fit's higher worst-case fragmentation, we do not perform simulations with the Half fit allocator in chapter 6.
4.2.3 Hoard
The Hoard allocator, introduced by Berger and others (2000), is a general-purpose allocator designed to deliver high performance in multiprocessor systems. Hoard allocates memory through the OS virtual memory system in large units called superblocks. A superblock can allocate blocks of one size class only, and it contains a free list to store and reuse freed blocks in LIFO fashion. (Berger et al., 2000, p. 118)
Hoard recycles its free superblocks to reduce external fragmentation. Small blocks are
allocated using superblocks, but blocks larger than half of superblock size are allocated
by using OS virtual memory system. The allocator uses size classes power of b apart,
where b is greater than 1. (Berger et al., 2000, pp. 119-120) Reaps mechanism was later
used in place of superblocks in Hoard (Berger, Zorn & McKinley, 2002, p. 11). The
allocator uses a mechanism similar to BIBOP to manage its superblocks.
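For illustration, a size class that is a power of b apart from its neighbors can be computed as below. The base b and the 8-byte minimum here are arbitrary example values, not Hoard's actual parameters; the point is that internal fragmentation is then bounded by the factor b.

#include <math.h>

/* Map a request to a size class where class i holds blocks of up to
 * 8 * b^i bytes (example values: 8-byte minimum, caller-chosen b > 1). */
static unsigned size_class_index(double size_bytes, double b)
{
    if (size_bytes <= 8.0)
        return 0;
    return (unsigned)ceil(log(size_bytes / 8.0) / log(b));
}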
4.2.4 Jemalloc
Jemalloc is a general-purpose allocator introduced by Jason Evans. It is an open-source
high-performance allocator focused on multithreaded scalability and cache locality
(Evans, 2006, p. 2). It is used in FreeBSD (Evans, 2006, p. 1), Mozilla Firefox and
Facebook's servers. We analyze version 3.0.0 of the allocator.
The measurements by Evans (2006, pp. 7-11) show that jemalloc has slightly better
overall performance compared to dlmalloc and PHKmalloc, and has good multithreaded
scalability. On multiprocessor systems, the allocator uses four arenas per processor,
issuing one arena for one thread at a time. Use of thread-specific arenas improves
multithreaded performance by eliminating locks. Single-processor systems use one
arena only. (Evans, 2006, pp. 1-4, 7-8)
The allocator handles its memory in fixed 2 MB memory chunks requested from the
underlying OS. The chunks are aligned in memory to allow constant-time calculation of
chunk index from memory address high bits. (Evans, 2006, p. 4) The jemalloc chunks
behave like large pages in BIBOP mechanism. Figure 8 illustrates chunk and arena
allocation.
Figure 8. Chunk and arena allocation in jemalloc. Huge allocations span multiple chunks.
(Evans, 2006, p. 4)
The allocator handles blocks in three size categories: small (1 B, 2048 B], large
(2 KB, 1024 KB], and huge (1 MB, +∞). The small category has three subcategories: tiny
2 .. 8, quantum-spaced 9 .. 512 and sub-page 513 .. 2048 bytes. (Evans, 2006, p. 5) Each
category is treated differently in the allocator. Chunks are divided into 4 KB pages when
they store blocks from small category. Pages form page runs which store blocks of one
size class. The page runs store a bitmap in their header for block bookkeeping. (Evans,
2006, pp. 5-6)
Upon small block allocation request, the allocator first calculates an index to a cache bin
using a lookup table. If the cache bin has free blocks, one is returned. This cache
operation is similar to SSS. Otherwise, if no block is found in the cache, the arena bin
pointer to a page run is checked, and if it is not null, the pointed page run's bitmap is
searched for a free block, which is then returned.
Otherwise, if the pointer to a page run was null (no page run), a binary buddy (red-black
tree) containing page runs is searched to find a suitable run for the bin size (Evans,
2006, p. 4). If a run is found, it is returned. Otherwise, if no run is found, a new run is
allocated from existing chunks by again searching the red-black tree. If this latter search
yields no chunks, a new chunk is allocated for the run. When a new run is created, its
bitmap is also initialized.
When a small block is freed, it is first put in the cache for quick reuse. When the
allocator has performed a certain number of operations (allocate and free), it performs a
deferred free cycle on one cache bin. This cycle involves locating the parent chunk and
run of the blocks in the bin, and using offset calculations and lookup techniques to mark
the blocks free in the page run bitmap. The next coalescing cycle is then performed on
the next bin.
Summary
Small block allocation and free are quite complex in jemalloc. Basic mechanisms
involve SSS cache with deferred free policy, bitmapped fits for actual block allocation
from page runs, and binary buddies (red-black trees) for page run allocation from
chunks. BIBOP is used to manage chunks on high-level.
The allocation operation in jemalloc is fairly complex in the worst case. However, the
regularities in the allocation request stream may drive the allocator to a state where
most small block allocations can be satisfied directly from the SSS cache. This happens
when the application allocates a large number of small blocks. In such a case jemalloc
allocation may have amortized complexity T(n) ∈ O(1).
4.2.5 Kingsley
The Kingsley allocator is suitable for real-time systems. We however note that the
Kingsley allocator strongly resembles Half fit with its power-of-two size classes and
segregated free lists. Because of the similarities, we do not analyze the Kingsley
allocator further in this study.
4.2.6 TLSF
Two-level segregated fit (TLSF) is a general-purpose DMA for real-time systems,
introduced by Masmano, Ripoll, Crespo and Real (2004). Its allocation and free
operations perform in O(1) and have low and bounded WCET. It uses the same
allocation mechanism regardless of block size, and only a small variation in execution
time can occur. The allocator implements a good fit policy. (Masmano et al., 2004, pp.
79, 83, 86-87; Masmano et al., 2008a, p. 175) TLSF can be seen as an extension to the
Half fit (Ogasawara, 1995) allocator (Masmano et al., 2008a, p. 150).
TLSF uses a large number of segregated lists containing blocks from different size
ranges, and it uses a novel two-level indexing structure to reduce the list selection to a
constant-time operation. This indexing is illustrated in figure 9. First-level index
contains size ranges in power-of-twos, for example 16 .. 31, 32 .. 63, 64 .. 127 bytes,
and so on, and then a second-level index divides these ranges linearly. For example, a
size range 32 .. 63 bytes can be divided to four sub-ranges: 32 .. 39, 40 .. 47, 48 .. 55,
and 56 .. 63 bytes. (Masmano et al., 2004, p. 83; Masmano et al., 2008a, pp. 157-158)
Figure 9. Illustration of TLSF indexing data structure. (Masmano et al., 2004, p. 82)
A word-size bitmap is used on both levels to mark free lists empty or non-empty. Bit-
scan instructions are then used to perform the list selection in constant-time. The bit-
scan search also automatically selects the free list of a larger size class if the free list of
the desired size class is empty. (Masmano et al. 2004, pp. 83-85; Masmano et al.,
2008a, p. 158)
The first and second level indexes are calculated from the requested allocation size.
First level index is first obtained by locating the most significant set bit in the size by
using bit-scan instructions. Second level index is then obtained from the following bits
by using basic bit manipulation. (Masmano et al., 2004, p. 84; Masmano et al., 2008a,
pp. 157-159) Figure 10 shows an example of the index calculation. First level index (f)
is 8 and second level index (s) is 12. Second level index is represented by the 4 bits
following the most significant set bit.
Figure 10. Example of first and second level index calculation from allocation size. (Masmano
et al., 2004, p. 84)
TLSF blocks have headers, and block splitting and coalescing is performed immediately
on allocation and free. Blocks are split if allocation occurs from a larger size free list
than was requested. The authors claim roughly 3% internal fragmentation, and a low
overall fragmentation. Memory requirement of internal data structures can also be
calculated offline. (Masmano et al., 2004, pp. 84-85; Masmano et al., 2008a, p. 161;
Masmano et al., 2006, p. 73) This makes the time and space costs of the allocator very
predictable.
Summary
TLSF uses the same mechanisms to manage all blocks regardless of block size. Its
allocate and free operations execute in O(1) time with very little variation in execution
time. It uses two levels of bitmapped indexing, and implements a good fit policy. All
blocks contain headers, and immediate coalescing and splitting is performed.
4.3 Summary
We have now described and analyzed small block allocation in various general-purpose
allocators. Our intent in this section is to distinguish mechanisms that are more
frequently used than others. A summary of the small block allocation mechanisms in the
analyzed general-purpose allocators is presented in table 2.
Table 2. Summary of small block allocation mechanisms in general-purpose allocators from
previous sections.
Allocator   Small block allocation mechanisms                              Real-time
Half fit    Segregated free lists, immediate coalescing and splitting,     Yes
            bitmapped indexing, block headers
TLSF        Segregated free lists, two-level bitmapped indexing, block     Yes
            headers, immediate coalescing and splitting
Segregated free lists are the most popular mechanism for small block allocation, and
they are used by almost all of the allocators. Segregated free lists are a constant-time
mechanism, offering a high performance and throughput necessary for small block
allocation. Simple segregated storage is a specific type of segregated free list
mechanism.
Block headers are also stored by many of the allocators. The main reason for the use of
block headers is probably the efficient coalescing they provide – links to the previous and
next block can be referenced quickly from the header. On the other hand, at least Hoard
and jemalloc use BIBOP or reaps to eliminate block headers and reduce per-block
overhead. Immediate coalescing is performed by at least Half fit and TLSF, while other
allocators perform either deferred coalescing, or no coalescing at all (Kingsley allocator,
Hoard). While deferred coalescing is a good mechanism to reduce average execution
time (to increase throughput), it is not suitable for real-time allocation (see section
3.1.4).
Based on the analysis, we are confident that both segregated free lists and bitmapped
indexing are good mechanisms for small block allocation, and also for real-time DMA
since both mechanisms have a low constant time cost. If block headers are used, they
should be as small as possible, since they increase per-block overhead. Using BIBOP
may be beneficial since it removes per-block overhead. While deferred coalescing
improves average performance, it should not be used in real-time DMA, and immediate
coalescing is preferred.
5. Bitframe allocator
This chapter introduces the Bitframe allocator, a new DMA aimed at small memory block
allocation. Its allocation and free operations perform in O(1) time and have bounded
WCET, making it suitable for real-time applications. The allocator was originally
created as a custom DMA for the Lua core in a released Nintendo DS game.
The Bitframe allocator is based on the bitmap allocator mechanism, where one bit stores
the allocated/free state of a single memory chunk. To eliminate bitmap scanning, the
Bitframe allocator divides the bitmap into 8-bit bitframes and uses lookup tables to locate
the longest free chunk sequence in each bitframe9.
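As an illustration, the tables can be built for all 256 possible bitframe values as below. We assume here that a set bit marks an allocated chunk; the table names follow the pseudo code used later in this chapter.

#include <stdint.h>

/* lookup_longest_len[b]: length of the longest run of free (clear) bits
 * in the 8-bit bitframe value b; lookup_longest_idx[b]: index of the
 * first bit of that run. */
static uint8_t lookup_longest_len[256];
static uint8_t lookup_longest_idx[256];

static void init_lookup_tables(void)
{
    for (int b = 0; b < 256; b++) {
        int best_len = 0, best_idx = 0, run = 0;
        for (int i = 0; i < 8; i++) {
            if (b & (1 << i)) {
                run = 0;                 /* allocated chunk breaks the run */
            } else if (++run > best_len) {
                best_len = run;
                best_idx = i - run + 1;  /* start index of the new best run */
            }
        }
        lookup_longest_len[b] = (uint8_t)best_len;
        lookup_longest_idx[b] = (uint8_t)best_idx;
    }
}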
To allocate blocks spanning more than 8 chunks, the allocator uses a larger chunk size
depending on the size class. A number of bitframes and their associated memory chunks
are stored together in pages, where each page contains only bitframes having the same
chunk size. The allocator manages pages with the BIBOP mechanism.
In the next section we describe the lookup tables in the allocator. Sections 5.2, 5.3 and
5.4 describe the data structures, and sections 5.5 and 5.6 describe the allocation and free
operations. We conclude the chapter with an analysis of the allocator.
9 The 8-bit division is the smallest choice, but this depends entirely on the implementation. For example, 16 bits is also a
reasonable choice, but will have 256 times larger lookup tables (2^16 elements). This may be too much for some
applications. To my knowledge there exist no suitable CPU instructions in common hardware which could be used in
place of the lookup tables.
• Allocation termination bit for each chunk to store the information of the
allocation lengths (1 x 8 = 8 bits)
This totals 32 bits10, resulting in 4 bits for each of the 8 chunks in the bitframe. Figure 12
shows the data structure of the bitframe in more detail.
10 Again, an implementation could use more or less bits depending on the requirements.
typedef struct {
    cdl_node   size_class_pages;   /* node in the size class page list           */
    bitframe  *bitframes;          /* bitframes, including 8 list heads           */
    u8         head_bits;          /* bitmap: which bitframe lists are non-empty  */
    u8         chunk_size_shift;   /* log2 of the page's chunk size               */
    void      *chunks;             /* the chunk storage of the page               */
} page_header;
The bitframes array contains the bitframes of the page and in addition 8 dummy
bitframes at indexes 0...7 to serve as heads to circular doubly linked bitframe lists. This
is to simplify the list management, since bitframes use 8-bit indexes as links to save
space. The circular doubly linked lists store bitframes sharing the same longest free
chunk length. The head_bits bitmap marks each of the lists empty or non-empty.
Bitframes without free chunks are orphan and are not stored in any of the lists. Figure
13 illustrates the bitframe lists inside a page.
Each page in the allocator belongs to a circular doubly linked list (size_class_pages)
determined by its size class, or to no list if the page is full. The page size class is
determined by the longest free chunk sequence of all bitframes in the page. This can be
queried quickly from the index of the most significant set bit of the head_bits.
Since we have an 8-byte quantum and bitframes can store up to 8-chunk sequences, the first 8
size classes are 8, 16, 24 … 64 bytes. The next size classes use a 64-byte chunk size,
which was the maximum of the previous chunk size. This results in size classes 64, 128,
192 … 512 bytes. The allocator now has two chunk sizes, 8 and 64 bytes, and 16 different
size classes. Notice however that there are two 64-byte size classes, so the allocator
must decide which to prefer on allocation.
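A possible size class calculation under this scheme is sketched below; it resolves the ambiguity by preferring the 8-byte chunk class for exactly 64-byte requests, which is our choice since that class wastes no chunk space.

/* A sketch of calculate_size_class for the two chunk sizes above:
 * classes 0..7 use 8-byte chunks (8 .. 64 bytes) and classes 8..15
 * use 64-byte chunks; e.g. a 65-byte request maps to class 9, which
 * spans two 64-byte chunks. */
static unsigned calculate_size_class(unsigned size_bytes)
{
    if (size_bytes <= 64)
        return (size_bytes + 7) / 8 - 1;     /* 8-byte chunks  */
    return 8 + (size_bytes + 63) / 64 - 1;   /* 64-byte chunks */
}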
The Bitframe allocator has an array of circular doubly linked lists to store one list for
each size class (size_class_lists), and a bitmap to mark the lists non-empty or
empty (size_class_bits). The lists link together pages sharing the associated size
class. By using the size class bitmap, the allocator can rapidly locate a page with a suitable
sequence of free chunks to satisfy the allocation.
Upon allocation, the size class of the requested block size is first calculated. Then the
size_class_bits in the allocator are scanned starting from the calculated size class bit
to find the best suitable page. The resulting page will have size class greater than or
equal to the requested block size class and is guaranteed to satisfy the allocation. The
page bitframe head contains a bitframe which satisfies the allocation request. The
following pseudo code explains this procedure.
class_idx = calculate_size_class(size_bytes)                // map request to a size class
class_idx = bsf(size_class_bits >> class_idx) + class_idx   // first non-empty class >= request
page = size_class_lists[class_idx].next                     // a page guaranteed to fit the request
head_idx = (size_bytes - 1) >> page.chunk_size_shift        // list head for the needed run length
bitframe_idx = page.bitframes[head_idx].next                // first bitframe with such a free run
After the suitable bitframe has been located, a chunk index for the block is looked up
using the bitframe bits. Then, a linear mapping is performed to obtain a pointer to the
block returned by the allocator. The following pseudo code illustrates the chunk index
lookup and pointer mapping.
bitframe_bits = page.bitframes[bitframe_idx].bits    // allocated/free bits of the frame
chunk_idx = lookup_longest_idx[bitframe_bits]        // table lookup: start of longest free run
11 Naively this involves initializing every bitframe in the page, which is a costly operation. Alternatively a pointer-
bumping mechanism can be used to initialize new bitframes in the page when they are needed.
After this, the bitframe bits and block terminator bits are modified accordingly. The
longest free chunk sequence in the bitframe has also likely changed, and the bitframe is
transferred to a list matching the new longest free sequence length. The change in the
bitframe list state may also affect the size class of the page, in which case the page is
transferred to a list matching its new size class.
All the previously mentioned steps of the allocation operation execute in O(1) time,
taking advantage of lookup tables, bit manipulation, bit-scan instructions and circular
linked lists. Since total time spent in an algorithm is the sum of all its steps, the resulting
time complexity of the allocation operation is O(1).
Similarly to the final steps in the allocation operation, the free operation alters both
longest free chunk sequence in the bitframe and the page size class. The allocator needs
to transfer the bitframe and the page to a list matching the new state. All the previously
mentioned steps execute in O(1) time, including bitmap scanning which is performed
with bit-scan instructions. Thus as with the allocation operation, free also performs in
O(1) time.
5.7 Analysis
Like all DMA, the Bitframe allocator has its shortcomings. A major limitation is set by the
bitframe data structure, which constrains the maximum allocatable sequence to 8
chunks. This reduces the flexibility of the allocator.
There are issues in the current size class calculation which considerably increase internal
fragmentation when allocating specific sizes. A worst-case example is the allocation of
65 bytes. This would use two 64-byte chunks to store the 65-byte block causing roughly
50% memory to be wasted. One solution to this problem would be to constrain the
smallest chunk sequence length allowed in a bitframe.
The bitframe data structure also limits the organization of chunk sequences. A chunk
sequence must end at every 8th chunk and cannot continue to the first chunk on the next
bitframe. This places a serious limitation on the maximum number of possible allocations
of sequences longer than 4 chunks (half of the bitframe's 8 bits). A bitframe can only
contain one sequence that is longer than 4 chunks. Thus a page with M bitframes can
only contain M blocks spanning more than 4 chunks. This is a notable limitation
especially for blocks spanning 5 chunks.
One solution to the previous problem is to limit the maximum sequences to 4 chunks
while keeping the bitframe size of 8 bits. This solution is only partial, since
sequences of 4 chunks would then have the same maximum number of allocations as
sequences of 3 chunks (maximum two such sequences per bitframe). This solution also
increases the number of chunk sizes for pages which complicates the size class
calculation.
A better solution to the previous problem would be to allow allocations to cross
bitframe boundaries. This would involve reading bits from consecutive bitframes and
merging them before bit manipulation. However, as a result the lookup tables would need
to have more items to cover the new merged bits. For example, with a maximum
sequence length of 8 chunks, the new lookup tables would require 2^15 items, while for a
maximum sequence length of 4 chunks (discussed in the previous paragraph), the new
lookup tables would require 2^11 elements.
5.8 Conclusion
The Bitframe allocator is an O(1) DMA with bounded WCET suitable for real-time
applications. The allocator is designed for rapid allocation of small memory blocks.
There are possibilities for reducing the fragmentation in the allocator. One possibility
would be to allow the allocation of chunk sequences across frame boundaries, and
another would be to limit the sequence lengths.
For the simulation, we created an ad hoc framework with a total of seven different
allocators. Six of the allocators were implemented from scratch, and for the seventh we
used an open-source implementation by Masmano, Ripoll, Brugge and Scislowicz
(2008b). The framework was written in C, but had occasional blocks of inline assembly.
The source code is available at http://bitbucket.org/tsone/memwork.
Our framework operates similarly to the trace processor described in (Johnstone &
Wilson, 1998, p. 31). Memory allocation traces are given as input to the framework, which
performs simulation and produces output logs of either timing (cycles) or memory use.
The output logs contain information on each allocation and free operation of the input
trace. The simulation used 3 MB heap size for all traces.
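As a sketch, a trace processor of this kind reduces to a simple replay loop; the record layout and allocator hooks below are our assumptions, not the framework's actual format.

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    unsigned char op;      /* 0 = allocate, 1 = free        */
    unsigned int  id;      /* block identifier in the trace */
    unsigned int  size;    /* request size for allocations  */
} trace_record;

static void *blocks[1 << 20];   /* trace id -> live block pointer */

static void replay(FILE *trace, void *(*allocfn)(size_t), void (*freefn)(void *))
{
    trace_record r;
    while (fread(&r, sizeof r, 1, trace) == 1) {
        if (r.op == 0)
            blocks[r.id] = allocfn(r.size);  /* timing/memory logged here */
        else
            freefn(blocks[r.id]);
    }
}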
A separate program processes the logs and produces plots and analysis results.
Following the evaluation methodology in (Masmano et al., 2008a, p. 162), the program
calculated worst-case, mean and standard deviation of execution time from trace
simulations. It created plots containing information on allocation and free operation
cycles, allocated memory in the trace, and memory use, internal fragmentation,
implementation overhead by the allocators. It calculated fragmentation with the
equation 1 described in section 2.4.1. Following this method, it also calculated the ratio
of implementation overhead on fragmentation with the following equation.
I = L / (H − M)    (2)
The simulations were performed on Acer Aspire One ZG5 netbook, running an Intel
Atom N270 CPU at 1.6 GHz with 512 KB of L2 cache. The machine had 512 MB of RAM,
and the front-side bus speed was 533 MHz. Fedora 17 LXDE GNU/Linux was used as the
OS, and a prebuilt kernel version 3.6.7 was used. We did not modify the kernel. The test
framework was compiled with GCC version 4.7.212.
The next three sections will describe the memory traces, the simulated allocator
implementations, and an analysis on the worst-case behavior of the implementations.
The sections are followed by description of the timing and fragmentation measurement
methodology. This is followed by sections presenting simulation results and analysis for
both timing and memory measurements separately. We end the chapter with an analysis
of the overall efficiency of the implemented mechanisms.
12 We also tried Clang version 3.0. While Clang produced faster code for some allocators, GCC provided better overall
performance.
The uniform and small traces were synthesized from probabilistic distributions, and
neither tries to mimic real application behavior. Instead, we use synthetic traces to give
information on the worst-case fragmentation behavior of allocators when allocating small
block sizes, since it is mandatory to reveal worst-case behavior in a real-time context
(see section 2.5). Both of the traces have random object lifetimes; for every other
allocated block a block is randomly freed from the live blocks. The uniform trace
allocates all sizes uniformly, and the small trace allocates mainly small blocks following
a normal distribution. Heap sizes in both traces exhibit linear growth.
We modeled the normal distribution for the small trace according to the results in (Zorn &
Grunwald, 1992, p. 4) and (Berger, Zorn & McKinley, 2002, p. 8). Their results show
that roughly 90% of allocations in real programs are below 64 bytes (see section 4.1).
We used a normal distribution with a mean 32 bytes and standard deviation of 19.455
bytes to ensure 90% of allocations fall under 64 bytes. We additionally ensured that
every block size over 108 bytes was allocated at least once.
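For illustration, the small trace sizes can be synthesized as below. Only the mean and standard deviation come from the text; the Box-Muller sampling, the bounds and the redraw policy are our assumptions.

#include <math.h>
#include <stdlib.h>

/* Draw an allocation size from a normal distribution with mean 32 and
 * standard deviation 19.455 bytes, redrawing out-of-range samples. */
static unsigned sample_small_size(void)
{
    for (;;) {
        double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double n = sqrt(-2.0 * log(u1)) * cos(6.283185307179586 * u2);
        double size = 32.0 + 19.455 * n;
        if (size >= 1.0 && size <= 512.0)   /* assumed valid size range */
            return (unsigned)size;
    }
}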
The boot and stable traces are both subsets of a larger trace recorded from a real-time
embedded system allocator, and reflect the boot and stable phases mentioned in
(Masmano et al., 2008a, p. 175). The large trace recorded all allocation and free
operations from boot to a system stable phase. The boot trace is a subset of this trace,
and captures the boot phase of the system, starting from its beginning and ending at a
position where heap size stops growing. The stable trace is also a subset of the larger
trace, and captures the stable phase of the system, containing roughly 500000 allocation
and free operations.
Size class region allocator regis2 is similar to regis1, but maintains a region for every
size class. The classes are 8 bytes apart, with sizes 8, 16, 24 … 512 bytes, totaling 64
size classes. Each region is stored in its own page and will allocate only blocks in one
size class. The page headers store a live counter and a size class id. As with regis1, a
page is freed when its live counter becomes zero.
Reap allocator reaps uses reap mechanism to allocate blocks and manages pages with
the BIBOP mechanism. Each page contains a reap which allocates blocks of a single
size class. Size class sizes are 8, 16, 24 … 512 bytes, totaling 64 size classes. The page
headers store the reap data structure, a live counter, the page size class and a circular
doubly linked list node. The circular list is maintained to connect pages having free
blocks of the same size class. Pages are removed from the list when a page becomes full
(no free blocks in the page) or when its live counter becomes zero, in which case the page
is also freed. A page size of 1 KB was used.
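A page header along the lines described could look as follows; the field names and widths are our assumptions, not the exact implementation.

#include <stdint.h>

typedef struct reap_page {
    struct reap_page *prev, *next;  /* circular list of same-class pages */
    uint8_t  *bump_ptr;             /* reap: next unallocated byte       */
    void     *free_list;            /* freed blocks of this size class   */
    uint16_t  live_count;           /* blocks currently allocated        */
    uint16_t  size_class;           /* the one class served by this page */
} reap_page;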
Binary buddy allocator bbuddy is based on the description by Knuth (1973, pp. 442-
445). Our implementation however divides the heap initially to 1024-byte buddies
because larger than 512-byte blocks are not allocated in our experiments 13. Buddies
have a 16-byte header, so to host both the buddy header and a minimum 1 byte block, a
buddy must have a 32-byte minimum size. This means the buddy size range is between 32
and 1024 bytes (or 2^5 and 2^10), and so the maximum number of splits and merges by the buddy
system is 5. This means allocate and free operations in bbuddy execute in O(1) time.
Simple segregated storage allocator sss was implemented following the description in
section 3.2.2. Our implementation uses size classes of 8, 16, 24 … 512 bytes. When a
free block is not found on a size class list, our implementation allocates a new block
13 Notice that in our implementation a 1024-byte buddy is used to store a 512-byte allocation. This is because the buddy
also needs to store its header. So, in the case of a 512-byte allocation, almost 50% of memory is wasted to internal
fragmentation. This is an example of worst-case fragmentation by the binary buddy mechanism.
from heap by pointer-bumping. Block headers contain only a pointer to a size class data
structure matching the block size. When a block is freed, it is placed on the segregated
free list of the size class.
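The following sketch captures the described behavior; the data structures and globals are hypothetical, and initialization of the class table and heap is omitted.

#include <stddef.h>

typedef struct free_block { struct free_block *next; } free_block;
typedef struct { free_block *free_list; size_t block_size; } size_class;

static size_class classes[64];   /* 8, 16, 24 ... 512 bytes; assumed initialized */
static char *heap_ptr;           /* pointer-bumping frontier of the heap */

static void *sss_alloc(size_t size_bytes)
{
    size_class *sc = &classes[(size_bytes + 7) / 8 - 1];
    if (sc->free_list != NULL) {                 /* reuse a freed block      */
        free_block *b = sc->free_list;
        sc->free_list = b->next;
        return b;
    }
    size_class **hdr = (size_class **)heap_ptr;  /* pointer-bump a new block */
    *hdr = sc;                                   /* header: size class link  */
    heap_ptr += sizeof(size_class *) + sc->block_size;
    return hdr + 1;
}

static void sss_free(void *p)                    /* branchless constant time */
{
    size_class *sc = *((size_class **)p - 1);    /* class from block header  */
    ((free_block *)p)->next = sc->free_list;
    sc->free_list = (free_block *)p;
}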
Bbuddy allocation WCET occurs when a block of the minimum size is requested and
the heap is empty. This causes the allocator to perform a maximum number of splits to
satisfy the allocation. Free WCET occurs in the opposite case, when the last block is
freed, and the freed block has minimum size. This causes the allocator to perform a
maximum number of merges to coalesce the blocks. (Masmano et al., 2008a, p. 165)
Our implementation has a large block header (16 bytes), and this causes considerable
overhead. Also because of the header, sizes at or slightly below a power of two may cause
considerable internal fragmentation. For example, a 512-byte allocation would need a
1024-byte buddy, and waste roughly 500 bytes.
Bframe allocation WCET occurs when a page contains no free chunks, and a new page
must be allocated for the allocation. Our implementation is not well optimized for the
page allocation, and the operation takes a long time to complete. The allocator free
WCET occurs when the longest chunk sequence in a page changes, causing both a
bitframe and a page to be transferred to another doubly linked list. In our
implementation, worst-case internal fragmentation is caused by allocating blocks with
size of 65 bytes, which causes roughly 100% internal fragmentation. Worst-case overall
fragmentation occurs when pages are allocated but not used. This causes considerable
overhead from bitframe and page bookkeeping data structures and external
fragmentation.
Both regis1 and regis2 have high internal fragmentation. Pages in the allocators cannot
be freed unless all blocks of the page are freed. On the other hand, the allocator has
almost no implementation overhead, because block headers are not used and the page footer
is very small. WCET in the allocation operation occurs when a new page is allocated,
and similarly WCET in the free operation occurs when a page is freed. WCET bound is
very low overall.
The reaps allocator has low internal fragmentation and implementation overhead.
Similarly to regis1 and regis2, allocation and free operation WCET occurs when a new
page is allocated or freed.
The sss allocation WCET behavior occurs when no suitable free block is found on
segregated lists, and a new block must be allocated by pointer bumping. This situation
also causes fragmentation to accumulate, since the allocator is not effectively reusing
memory. The free operation contains no branches, thus it should always execute in the
same number of cycles14.
The tlsf allocation WCET behavior occurs when a small block is allocated and the
allocator has only one large free block. WCET for free operation occurs when a freed
block has two neighbors that are coalesced. Worst-case internal fragmentation is
expected to be 3%. (Masmano et al., 2008a, pp. 161, 166)
Based on this worst-case analysis, we are certain that the boot and stable traces will
contain allocation and free operations which will cause WCET behavior in all allocator
implementations except bframe. The page size in bframe is so large that the allocations
in boot and stable traces do not cause new pages to be allocated. We are certain
however that our traces will cause worst-case fragmentation behavior in all of the
implementations.
Our timing instrumentation code used the RDTSC instruction, which returns the number of
CPU cycles elapsed since boot. Proper use of RDTSC is however error-prone, and many sources can
interfere with the timing measurements: OS scheduling, simultaneous multithreading
and interrupt handling, CPU out-of-order execution, CPU power modes and cache
operation. Modern CPUs use every possible method to improve throughput (average
execution time), and these methods will interfere with the measurements (Masmano et
al., 2008a, p. 162).
CPU out-of-order execution is not a problem on the Intel Atom processor since it does
not reorder instructions. We can also disable CPU power mode changes and
simultaneous multithreading from the kernel. We cannot however disable the CPU
cache16, and CPU cache was probably the single most important source of interference
in our measurements.
14 CPU cache operation affects the execution time, and the measured execution time will fluctuate.
15 We observed that compiler optimizations had a large impact on our timing measurements. We chose aggressive
optimizations since we believe most real-time applications also want this level of optimizations. We are aware that
aggressive optimizations may introduce bugs.
16 We tried to disable the CPU cache from the CPU CR0 register, but this caused the measurements to become more erratic.
Disabling the CPU cache also slowed the system down to a grinding halt.
The following measures were taken to ensure a deterministic simulation process, and also
to minimize interference from various sources in the system:
• The system was booted to runlevel 1 and all possible OS daemons were stopped.
(IBM, 2011, pp. 5-6)
• Real-time scheduler throttling was disabled. This permits our framework's real-time
process to use 100% of CPU time. (IBM, 2011, p. 6)
• Test framework process was run as root, and process was elevated to highest
available scheduling priority with real-time SCHED_FIFO policy (IBM, 2011, p.
4). This ensures framework process will not be preempted unless a process with
greater or equal priority is issued (which is not often, since real-time throttling
was disabled).
• Interrupts were disabled during instrumentation with CLI and STI instructions.
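A minimal sketch of such instrumentation is shown below, assuming GCC inline assembly on IA-32. It illustrates the approach rather than our framework's exact code; executing CLI/STI in user space requires I/O privilege level 3 (e.g. via iopl()).

#include <stdint.h>

/* Read the time-stamp counter: CPU cycles elapsed since boot. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Measure one allocation with interrupts disabled around the call. */
static uint64_t measure_alloc_cycles(void *(*allocfn)(unsigned), unsigned size)
{
    __asm__ __volatile__ ("cli");   /* disable interrupts     */
    uint64_t start = rdtsc();
    allocfn(size);                  /* the measured operation */
    uint64_t end = rdtsc();
    __asm__ __volatile__ ("sti");   /* re-enable interrupts   */
    return end - start;
}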
Each allocation trace was run 1000 times, and the cycles for each allocation and free
operation were measured. We then took the minimum cycle count of each operation over the
runs, resulting in an optimistic estimate. Our aim was to eliminate interference from the CPU
cache, but this did not fully succeed. It only worked when heap size was well below
CPU cache size (512 KB). Otherwise we observed notable interference. It made timing
measurements from uniform and small traces unusable for our evaluation, and we
decided to omit them.
Maximum heap address was calculated from the highest address touched by the
allocator – from the highest address of any page or block allocated by the allocator. The
maximum heap address was calculated relative to a heap start address which was
aligned depending on the allocator in question. The resulting alignment padding was
omitted from the measurements.
Note that there is some interference in our measurements. We will address this issue in
the next section. The uniform and small trace measurements contained a much higher
amount of interference, and for this reason we decided to omit them.
Table 3. Measured maximum, mean and standard deviation of allocate and free operations in
boot trace. Units are in CPU cycles.
Allocator   Max (alloc / free)   Mean (alloc / free)   St.dev (alloc / free)
regis1      156 / 72             37 / 37               8 / 4
sss         96 / 60              49 / 48               4 / 1
Table 4. Measured maximum, mean and standard deviation of allocate and free operations in
stable trace. Units are in CPU cycles.
The results with uniform and small traces contained even more interference, and we
have omitted the results. The higher amount of interference is explained by the size of
the traces and the larger amount of allocated live memory. Since the amount of active
memory is higher during simulation, the probability of CPU cache misses is higher. Cache
misses cause long delays in memory accesses, causing the interference.
The cycle plots however show that all of the implementations have bounded and low
WCET, and fulfill this requirement for real-time DMA. The measured standard
deviation in execution times is also relatively small.
The timing measurements show that bbuddy, bframe and tlsf use more cycles than the
others. The standard deviation is highest with tlsf, but we believe this is primarily caused by
CPU cache interference. The highest mean execution time is with bframe, which however
has a comparably good WCET. The bbuddy performs surprisingly well compared to the others.
The regis1, regis2, reaps and sss allocators are shown to be the fastest and to have the
most predictable WCET.
The large page size in bframe plays a large role in the measurements. The boot and
stable traces are too small to cause new pages to be allocated by the implementation,
thus the bframe allocation WCET condition does not occur. Note that we did not have
time to implement a faster page allocation, although it would have been possible17. Because
of this, the new page allocation operation in bframe takes a very long time to complete,
which would have distorted the WCET results. We believe that our measurements give
a hint of the WCET performance of the Bitframe allocator.
Figures 26 and 27 show the implementation overhead of the allocators in the boot and
stable traces, and figures 28 to 33 show the sum of internal fragmentation and
implementation overhead as total wasted memory by the allocators in the boot and stable
traces. These figures display the ratio of implementation overhead to total wasted
memory and internal fragmentation for each allocator.
We have omitted the corresponding figures for the uniform and small traces for two reasons.
Firstly, the uniform and small trace figures only displayed linear growth of memory
usage without special cases, since the traces contain twice as many allocations as frees.
Secondly, the traces do not represent real-world memory usage, and they are
only used in the worst-case fragmentation analysis of small block allocation.
17 The problem with Bitframe page allocation is more precisely in page initialization. An efficient way to implement
Bitframe page initialization is discussed in footnote 11 in section 5.5.
Table 5. Maximum used memory (in bytes) and calculated fragmentation of each allocator in
boot trace. Maximum live memory in boot trace was 16729 bytes.
Table 6. Maximum used memory (in bytes) and calculated fragmentation of each allocator in
stable trace. Maximum live memory in stable trace was 9410 bytes.
Table 7. Maximum used and live memory (in bytes) and corresponding calculated
fragmentation for each allocator in uniform trace.
Table 8. Maximum used and live memory (in bytes) and corresponding calculated
fragmentation for each allocator in small trace.
The boot and stable trace measurements reflect the allocator behavior under normal
circumstances. The synthetic uniform and small traces however are only useful to
support analysis. The measurements from uniform trace show how allocators behave
when all block sizes are allocated with the same probability, whereas small trace shows
how fragmentation is affected when the majority of the allocations are small. By comparing
the fragmentation in small and uniform traces we observe that small block allocation
increases fragmentation and overhead in all cases except the reaps allocator.
The boot and stable traces allocate a low amount of live memory, below 17 KB,
whereas the uniform and small traces allocate above 1447 KB. This impacts the fragmentation
of reaps, regis1, and regis2, since these allocators use pages that are quite large in size.
The new page allocations are clearly visible as steps in figures 24 and 25. The sss and
tlsf do not use pages, and hence they are shown to have much more subtle and gradual
heap growth.
Also having separate traces for boot and stable phases impacts our fragmentation
measurements. The boot and stable traces have considerably different allocation
behavior. The boot trace allocates many blocks with extremely long lifetimes (some live
until the system is shut down), while stable trace mainly allocates blocks with short
lifetimes. We believe this increased the measured fragmentation of the implementations that
used paging (bframe, reaps, regis1, regis2), and especially bframe, since it leads to
lower utilization of page memory. We believe it is crucial to use traces that contain all
relevant allocation behavior, and not only to focus on the stable phase as suggested by
Masmano and others (2008a, p. 175).
The bbuddy will theoretically have roughly 100% worst-case internal fragmentation, but
we see more fragmentation due to the large amount of overhead from block headers. The
fragmentation and overhead is clearly worse when small blocks are allocated in small
and boot traces.
The bframe allocator seems to have a constant maximum heap size and implementation
overhead in the figures. The reason for this is that the boot and stable traces allocate too
little memory, and hence no new pages are allocated. This causes higher fragmentation
in the allocator because the currently allocated pages are not fully utilized.
Implementation overhead is embedded in the pages, and since no new pages are
allocated, the overhead remains constant. For these reasons, bframe fragmentation is high in
the smaller boot and stable traces, and lower in the larger uniform and small traces.
Similar to bframe, reaps fragmentation is also quite high in boot and stable traces
compared to uniform and small traces. As stated earlier, this is because reaps allocates
memory in pages which causes higher fragmentation when allocated live memory is
small. Reaps has a low relative overhead, which seems to stay under 20% even in the
small trace. The small trace displays the low fragmentation that is possible when large
numbers of small allocations are being made.
The regis1 and regis2 allocators have clearly the highest internal and overall
fragmentation. The main cause is the region mechanism, which prevents the region
from being freed unless all blocks in the region are also freed. Since the freed blocks
cannot be reused by the allocator before the whole region is freed, a large amount of
practically free memory is unusable (internal fragmentation). The implementation
overhead is however clearly the smallest.
The sss allocator has the most unpredictable behavior in terms of fragmentation. The
lowest fragmentation was 4.96% while the highest was 351.99%. The great majority of the
fragmentation is external, and is probably caused by the inability to reuse memory,
since no coalescing and splitting is done. We believe sss may have very high worst-case
fragmentation.
The tlsf allocator has the lowest fragmentation. This is clear from the real trace
measurements, and even with the synthetic traces tlsf performs very well. Our results
confirm the claim by the authors that tlsf has roughly 3% internal fragmentation. The
allocator has however the highest ratio of implementation overhead to fragmentation.
For all allocators except regis1 and regis2, the majority of wasted memory is from other
sources than internal fragmentation and implementation overhead. Almost all allocators
have trouble effectively reusing previously allocated memory. The best allocators
in terms of fragmentation seem to be tlsf, bbuddy and reaps, while reaps seems to be the
best allocator when large numbers of small blocks are allocated.
We find sss too unstable to be used for real-time DMA. Also regis1 and regis2 exhibit
such a high internal fragmentation that they cannot be used for real-time DMA. Other
allocators seem to function quite well in terms of fragmentation. The reaps and bframe
are more usable when many small blocks are allocated and the heap size is quite large.
The bbuddy and tlsf are both suitable for smaller heap sizes, and have stable behavior in
terms of fragmentation. The bbuddy however has notably higher fragmentation compared
to tlsf, especially when allocating small blocks. The reaps allocator has the lowest overall
fragmentation in small block allocation.
6.10 Evaluation
In the previous sections we analyzed and evaluated the implementations' worst-case timing
and fragmentation aspects separately. We will now analyze and evaluate the
implementations in both aspects in order to provide information on the space-time
tradeoffs and suitability for real-time DMA.
The measurements with sss show that while SSS mechanism is very fast, it should not
be used in real-time DMA. If the mechanism is used, great care needs to be taken to
prevent the fragmentation from accumulating. If deferred coalescing is used with SSS,
the coalescing needs to be implemented so that it does not interfere with the real-time
system scheduling.
The measurements from the regis1 and regis2 implementations show that region allocation
mechanisms should not be used in real-time DMA. While region allocators are
fast, their worst-case fragmentation is also extremely high. The reaps implementation
however shows that, while reaps have a slightly higher WCET, their worst-case
fragmentation is much smaller than that of regions, justifying their use. We believe that
reaps should always be used instead of regions. Additionally, reaps have very low
fragmentation when small blocks are allocated.
The measurements from bframe show that Bitframe allocator is suitable for real-time
DMA. The allocator has low and bounded WCET and predictable worst-case
fragmentation. This shows that bitmap allocators in general can be used effectively for
allocation. The fragmentation is however quite high even when a large number of small
blocks is allocated, and the reaps mechanism appears to be more effective. The Bitframe
allocator would benefit from a situation where a large number of small blocks are
allocated and stay allocated for a long duration. This is however unlikely to be common
with small blocks, since their lifetime is usually short. In some cases however, such as
dynamic language virtual machines, lifetimes of the blocks may be very unpredictable
and long.
We observed that bbuddy exhibits surprisingly low WCET, and that its worst-case
fragmentation is not terrible, yet still quite high. Our results are similar to the
results from previous research concerning binary buddies. The measurements also show
that binary buddies have higher WCET and fragmentation than TLSF in all cases.
The results from the tlsf implementation show that the TLSF allocator has extremely good
properties for real-time DMA. The allocator has a good overall WCET; lower than binary
buddies, but higher than the fast region and reap allocators. TLSF also has the most
predictable worst-case fragmentation of all allocators in the experiment, and low
fragmentation even when small blocks are allocated. We believe this is because TLSF
has a strategy to reuse previously allocated memory, whereas the other mechanisms
overly focus on reducing execution time and implementation overhead.
7. Conclusion
This study has thoroughly examined the research topic of DMA mechanisms for small
block allocation in real-time embedded systems. We answered
our first research question concerning the suitability of DMA mechanisms for real-time
embedded DMA in chapter 3 where we conducted a literature survey and analysis on
DMA mechanisms. We then answered the second research question concerning DMA
mechanism suitability for small block allocation in chapter 4 where we analyzed various
well-known general-purpose allocators and their source code.
To answer the third and fourth research questions, we implemented a set of allocation
mechanisms for experimentation based on the results from chapters 3 and 4. We
performed simulations and measurements on the implementations by using real and
synthetic traces, and determined the WCET and estimated worst-case fragmentation of
the allocation mechanisms. Finally we presented analysis on the suitability of the
mechanisms for small block allocation in real-time embedded systems. The simulation
experimentation, evaluation and analysis are described in chapter 6.
Based on our findings, we conclude that the reaps mechanism has low WCET and low
fragmentation when small blocks are allocated, and we recommend reaps for small
block allocation in real-time embedded systems. We are also confident that reaps
mechanism should be used almost universally in place of the region mechanism. Reaps
are shown to have lower fragmentation and only slightly higher execution time. Our
findings support the earlier findings concerning reaps.
Our findings also support earlier research concerning the TLSF allocator. We are
confident that TLSF is a fast and reliable general-purpose allocator for real-time DMA.
The efficiency of the simulated TLSF, binary buddy and Bitframe implementations also
show that bitmapped indexing is a useful mechanism for many DMA. Our measurements
also show that SSS mechanism has unpredictable worst-case fragmentation. We
discourage the use of SSS in real-time DMA.
There was some interference in our timing measurements which we believe was
primarily caused by the CPU cache operation. We emphasize that care needs to be taken
when timing measurements are performed with modern CPUs.
References
ARM. (2010). Realview compilation tools assembler guide (version 4.0). Retrieved
from:
http://infocenter.arm.com/help/topic/com.arm.doc.dui0204j/DUI0204J_rvct_assem
bler_guide.pdf
Berger, E. D., Zorn, B. G., & McKinley, K. S. (2002). Reconsidering custom memory
allocation. In Proceedings of the 17th ACM SIGPLAN conference on Object-
oriented programming, systems, languages, and applications (OOPSLA '02) (pp. 1-
12). New York, NY, USA: ACM. doi:10.1145/582419.582421
Chang, J. M., Hasan, Y., & Lee, W. H. (2000). A high-performance memory allocator
for memory intensive applications. In Proceedings of Fourth IEEE International
Conference on High Performance Computing in Asia-Pacific Region (pp. 6-12).
doi:10.1109/HPC.2000.846507
Comfort, W. T. (1964). Multiword list items. Communications of the ACM, 7(6), 357-
362. doi:10.1145/512274.512288
Cranston, B., & Thomas, R. (1975). A simplified recombination scheme for the
Fibonacci buddy system. Communications of the ACM, 18(6), 331-332.
Detlefs, D., Dosser, A., & Zorn, B. (1994). Memory allocation costs in large C and C++
programs. Software Practice and Experience, 24(6), 527-542.
Dybvig, R. K., Eby, D., & Bruggeman, C. (1994, March). Don't stop the BIBOP:
flexible and efficient storage management for dynamically-typed languages
(technical report #400). Indiana University Computer Science Department.
Retrieved from: ftp://www.cs.indiana.edu/pub/techreports/TR400.pdf
Grunwald, D., Zorn, B., & Henderson, R. (1993). Improving the cache locality of
memory allocation. In Proceedings of the ACM SIGPLAN 1993 conference on
Programming language design and implementation (PLDI '93) (pp. 177-186). New
York, NY, USA: ACM. doi:10.1145/173262.155107
Hasan, Y., & Chang, M. (2005). A study of best-fit allocators. Computer Languages,
Systems & Structures, 31(1), 35-48.
IBM. (2011). Best practices for tuning system latency. Retrieved from:
http://publib.boulder.ibm.com/infocenter/lnxinfo/v3r0m0/topic/performance/rtbestp
/rtbestp_pdf.pdf
Intel. (2011). Intel 64 and IA-32 architectures software developer's manual, volume 2A:
instruction set reference, A-M. Retrieved from:
http://download.intel.com/design/processor/manuals/253666.pdf
Johnstone, M. S., & Wilson, P. R. (1998). The memory fragmentation problem: solved?.
In Proceedings of the 1st international symposium on Memory management (ISMM
'98) (pp. 26-36). New York, NY, USA: ACM. doi:10.1145/286860.286864
Masmano, M., Ripoll, I., Balbastre, P., & Crespo, A. (2008a). A constant-time dynamic
storage allocator for real-time systems. Real-time systems, 40(2), 149-179.
doi:10.1007/s11241-008-9052-7
Masmano, M., Ripoll, I., Brugge, H., & Scislowicz, A. (2008b). Two levels segregated
fit memory allocator (TLSF) (Version 2.4.6) [source code]. Retrieved from:
http://wks.gii.upv.es/tlsf/files/src/TLSF-2.4.6.tbz2
Masmano, M., Ripoll, I., & Crespo, A. (2006). A comparison of memory allocators for
real-time applications. In Proceedings of the 4th international workshop on Java
technologies for real-time and embedded systems (JTRES '06) (pp. 68-76). New
York, NY, USA: ACM. doi:10.1145/1167999.1168012
Masmano, M., Ripoll, I., Crespo, A., & Real, J. (2004). TLSF: a new dynamic memory
allocator for real-time systems. In Proceedings of the 16th Euromicro Conference
on Real-Time Systems (ECRTS '04) (pp. 79-86). Washington, DC, USA: IEEE
Computer Society. doi:10.1109/ECRTS.2004.35
Nilsen, K. D., & Gao, H. (1995). The real-time behavior of dynamic memory
management in C++. In Proceedings of the 1st IEEE Real-Time Technology and
Applications Symposium (RTAS'95) (pp. 142-153).
doi:10.1109/RTTAS.1995.516211
Ogasawara, T. (1995). An algorithm with constant execution time for dynamic storage
allocation. In Proceedings of the 2nd International Workshop on Real-Time
Computing Systems and Applications (RTCSA'95) (pp. 21-25). Washington, DC,
USA: IEEE Computer Society.
Paoloni, G. (2010). How to benchmark code execution times on Intel IA-32 and IA-64
instruction set architectures. Retrieved from: http://edc.intel.com/Link.aspx?id=3954
Peterson, J. L., & Norman, T. A. (1977). Buddy systems. Communications of the ACM,
20(6), 421-431.
Purdom, P. W., Stigler, S. M., & Cheam, T.-O. (1971). Statistical investigation of
three storage allocation algorithms. BIT Numerical Mathematics, 11(2), 187-195.
doi:10.1007/BF01934367
Risco-Martin, J. L., Colmenar, J. M., Atienza, D., & Hidalgo, J. I. (2011). Simulation of
high-performance memory allocators. Microprocessors and Microsystems, 35(8),
755-765. doi: 10.1016/j.micpro.2011.08.003
Robson, J. M. (1977). Worst case fragmentation of first fit and best fit storage allocation
strategies. Computer Journal, 20(3), 242-244. doi: 10.1093/comjnl/20.3.242
Shen, K. K., & Peterson, J. L. (1974). A weighted buddy method for dynamic storage
allocation. Communications of the ACM, 17(10), 558-562.
Steele, G., Jr. (1977). Data representations in PDP-10 MACLISP. MIT AI Memo, 421.
Available online: http://hdl.handle.net/1721.1/6278
Stephenson, C. J. (1983). New methods for dynamic storage allocation (Fast fits). ACM
SIGOPS Operating Systems Review, 17(5), 30-32. doi:10.1145/773379.806613
Van Sciver, J. & Rashid, R. F. (1990). Zone garbage collection. In Proceedings of the
USENIX MACH Symposium (pp. 1-16).
Weinstock, C. B., & Wulf, W. A. (1988). Quickfit: an efficient algorithm for heap
storage allocation. ACM SIGPLAN Notices, 23(10), 141-144.
Wilson, P. R., Johnstone, M. S., Neely, M., & Boles, D. (1995a). Dynamic storage
allocation: a survey and critical review. In H. G. Baker (Ed.), Proceedings of the
International Workshop on Memory Management (IWMM '95) (pp. 1-116).
London, UK: Springer-Verlag.
Wilson, P. R., Johnstone, M. S., Neely, M., & Boles, D. (1995b). Memory allocation
policies reconsidered. Unpublished manuscript. Retrieved November 10, 2010,
from Richard Jones's Garbage Collection Bibliography:
ftp://ftp.cs.utexas.edu/pub/garbage/submit/PUT_IT_HERE/frag.ps
Wise, D. S. (1978). The double buddy-system (technical report #79). Indiana University
Computer Science Department. Retrieved from:
ftp://www.cs.indiana.edu/pub/techreports/TR79.pdf
Yadav, D., & Sharma, A. K. (2010). Tertiary buddy systems for efficient dynamic
memory allocation. In L. A. Zadeh, J. Kacprzyk, N. Mastorakis, A. Kuri-Morales,
P. Borne & L. Kazovsky (Eds.) Proceedings of the 9th WSEAS International
Conference on Software Engineering, Parallel and Distributed Systems
(SEPADS'10) (pp. 61-66). Stevens Point, WI, USA: World Scientific and
Engineering Academy and Society (WSEAS).