MDB: A Memory-Mapped Database and Backend for OpenLDAP
Howard Chu
Symas Corp., OpenLDAP Project
[email protected], [email protected]
http://www.symas.com, http://www.openldap.org
Abstract
This paper introduces MDB ("Memory-Mapped Database"), a read-optimized database library and slapd
backend developed for OpenLDAP. We discuss OpenLDAP's traditional primary database as well as some
other alternatives that were examined before arriving at the MDB implementation, and present early
results from testing the new MDB implementation. This work is in progress but essentially
complete, and will be integrated into the OpenLDAP public releases in the near future.
1. Introduction
While OpenLDAP already provides a reliable, high-performance transactional backend
database (using Oracle BerkeleyDB "BDB"[1]), it requires careful tuning to get good results
and the tuning aspects can be quite complex. Data comes through three separate layers of
caches before it may be used, and each cache layer has a significant footprint. Balancing the
three layers against each other can be a difficult juggling act. Additionally, there are two layers
of locking needed to safely manipulate these caches, and the locks severely limit the
scalability of the database on multi-processor machines.
Rather than continue to attempt to adapt other third-party database software into OpenLDAP,
the MDB library was written specifically for use in OpenLDAP. The library is fully transactional
and implements B+ trees[2] with Multi-Version Concurrency Control[3]. The entire database is
mapped into virtual memory and all data fetches are performed via direct access to the
mapped memory instead of through intermediate buffers and copies.
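As a concrete illustration of this access model, the sketch below shows a point lookup through
the library's C API. The type and function names (MDB_env, MDB_val, mdb_txn_begin, mdb_dbi_open,
mdb_get) and the header filename follow the library's public interface as it is known today and
should be read as assumptions rather than the definitive API of the version described here. A
read returns a pointer directly into the memory map, with no intermediate buffer or copy.

#include <stdio.h>
#include <string.h>
#include "lmdb.h"                     /* MDB library header; filename is an assumption */

/* Look up one key in an already-opened environment.  The data returned by
 * mdb_get() points directly into the memory map: no buffering, no copying. */
int print_value(MDB_env *env, const char *keystr)
{
    MDB_txn *txn;
    MDB_dbi dbi;
    MDB_val key, data;
    int rc;

    /* A read-only transaction is a snapshot of the database */
    rc = mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    if (rc) return rc;
    rc = mdb_dbi_open(txn, NULL, 0, &dbi);
    if (rc == 0) {
        key.mv_size = strlen(keystr);
        key.mv_data = (void *)keystr;
        rc = mdb_get(txn, dbi, &key, &data);
        if (rc == 0)  /* data.mv_data is a pointer into the mapped file */
            printf("%.*s\n", (int)data.mv_size, (char *)data.mv_data);
    }
    mdb_txn_abort(txn);               /* read transactions are simply discarded */
    return rc;
}

Because the returned pointer refers into the shared snapshot, it remains valid only until the
read transaction is ended.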
2. Background
Before describing the improvements offered by the MDB design, an overview of the existing
BDB-based backends (back-bdb and back-hdb) will be presented.
LDAP and BDB have a long history together; Netscape commissioned the 2.0 release of BDB
specifically for use in their LDAP server[4]. The OpenLDAP Project's first release using the
BDB-specific APIs was OpenLDAP 2.1 in June 2002. Since BDB maintains its own internal
cache, it was hoped that the back-bdb backend could be deployed without any backend-level
caching, but early benchmark results showed that retrieving entries directly from the database
on every query was too slow. Despite radical improvements in entry fetch and decoding
speed[5], the decision was made to introduce an entry cache for the backend, and the cache
management problems grew from there.
These problems include:
• Multiple caches that each need to be carefully configured. On top of the BDB cache,
there are caches for entries, DNs, and attribute indexing in the backend. All of these
waste memory since the same data may be present in three places - the filesystem
cache, the BDB cache, and the backend caches. Configuration is a tedious job
because each cache layer has different size and speed characteristics and it is difficult
to strike a balance that is optimal for all use cases.
• Caches with very complex lock dependencies. For speed, most of the backend caches
are protected by simple mutexes. However, when interacting with the BDB API, these
mutexes must be dropped and exchanged for (much slower) database locks.
Otherwise, deadlocks that BDB's deadlock detector cannot detect may occur, and such
deadlocks arise very frequently in routine operation of the backend.
• Caches with pathological behavior if they were smaller than the whole database. When
the cache size was small enough that a significant number of queries were not being
satisfied from the cache, extreme heap fragmentation was observed[6], as the cache
freed existing entries to make room for new entries. The fragmentation would cause
the size of the slapd process to rapidly grow, defeating the purpose of setting a small
cache size. The problem was worst with the memory allocator in GNU libc[7], and
could be mitigated by using alternatives such as Hoard[8] or Google tcmalloc[9], but
additional changes were made in slapd to reduce the number of calls to malloc() and
free() to delay the onset of this fragmentation issue[10].
• Caches with very low effectiveness. When multiple queries arrive whose result sets are
larger than the entry cache, the cache effectiveness drops to zero because entries are
constantly being freed before they ever get any chance of being re-used[11]. A great
deal of effort was expended exploring more advanced cache replacement algorithms to
combat this problem[12][13].
From the advent of the back-bdb backend until the present time, the majority of development
and debugging effort in these backends has been devoted to backend cache management.
The present state of affairs is difficult to configure, difficult to optimize, and extremely labor
intensive to maintain.
Another issue relates to administrative overhead in general. For example, BDB uses write-
ahead logs for its transaction support. These logs are written before database updates are
performed, so that in case an update is interrupted or aborted, sufficient information is present
to undo the updates and return the database to the state it was in before the update began.
The log files grow continuously as updates are made to a database, and can only be removed
after an expensive checkpoint operation is performed. Later versions of BDB added an auto-
remove option to delete old log files automatically, but if the system crashed while this option
was in use, generally the database could not be recovered successfully because the
necessary logs had been deleted.
3. Solutions
The problems with back-bdb and back-hdb can be summed up in two main areas: cache
management, and lock management. The approach to a solution with back-mdb is simple - do
no caching, and do no locking. The other issues of administrative overhead are handled as
side-effects of the main solutions.
4.2 Locking
For simplicity the MDB library allows only one writer at a time. Creating a write transaction
acquires a lock on a writer mutex; the mutex normally resides in a shared memory region so
that it can be shared between multiple processes. This shared memory is separate from the
region occupied by the main database. The lock region also contains a table with one slot for
every active reader in the database. The slots record the reader's process and thread ID, as
well as the ID of the transaction snapshot the reader is using. (The process and thread ID are
recorded to allow detection of stale entries in the table, e.g. threads that exited without
releasing their reader slot.) The table is constructed in processor cache-aligned memory such
that False Sharing[23] of cache lines is avoided.
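A slot layout along these lines is sketched below. This is an illustration of the idea rather
than the library's actual structure: each slot holds the owner's process and thread IDs plus the
snapshot transaction ID, and is padded to an assumed 64-byte cache line so that readers updating
their own slots never share a cache line. For use across processes, the mutexes in such a region
would need to be initialized with the PTHREAD_PROCESS_SHARED attribute.

#include <stddef.h>
#include <sys/types.h>
#include <pthread.h>

#define CACHE_LINE 64                 /* assumed cache line size */

/* One slot per active reader (illustrative only) */
typedef struct reader_slot {
    volatile pid_t     pid;           /* process owning the slot           */
    volatile pthread_t tid;           /* thread owning the slot            */
    volatile size_t    txnid;         /* snapshot in use; 0 means idle     */
} reader_slot;

/* Pad each slot to a full cache line to avoid false sharing */
typedef union reader_entry {
    reader_slot r;
    char pad[(sizeof(reader_slot) + CACHE_LINE - 1) & ~((size_t)CACHE_LINE - 1)];
} reader_entry;

/* Lock region layout: writer mutex, reader-table mutex, then the table.
 * (Both mutexes would be process-shared in a multi-process setup.) */
typedef struct lock_region {
    pthread_mutex_t writer_mutex;     /* only one writer at a time         */
    pthread_mutex_t table_mutex;      /* taken only when claiming a slot   */
    unsigned        num_slots;
    reader_entry    readers[1];       /* actually sized when the region is created */
} lock_region;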
Readers acquire a slot the first time a thread opens a read transaction. Acquiring an empty
slot in the table requires locking a mutex on the table. The slot address is saved in thread-
local storage and re-used the next time the thread opens a read transaction, so the thread
never needs to touch the table mutex again. The reader stores its transaction ID in the
slot at the start of the read transaction and zeroes the ID in the slot at the end of the
transaction. In normal operation, there is nothing that can block the operation of readers.
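Continuing the hypothetical sketch above, reader registration might look as follows; the table
mutex is taken only on a thread's first read transaction, after which the slot pointer cached in
thread-local storage makes beginning and ending a read transaction a pair of plain stores.

#include <unistd.h>
#include <pthread.h>

/* Continues the illustrative lock_region / reader_slot sketch above.
 * reader_key is created once, elsewhere, with pthread_key_create(). */
static pthread_key_t reader_key;

static reader_slot *get_reader_slot(lock_region *lr)
{
    reader_slot *slot = pthread_getspecific(reader_key);
    if (slot == NULL) {
        /* First read transaction in this thread: claim an empty slot under
         * the table mutex.  This is the only time the mutex is touched.
         * (Handling of a full table is omitted.) */
        pthread_mutex_lock(&lr->table_mutex);
        for (unsigned i = 0; i < lr->num_slots; i++) {
            reader_slot *r = &lr->readers[i].r;
            if (r->pid == 0) {        /* empty slot */
                r->pid = getpid();
                r->tid = pthread_self();
                slot = r;
                break;
            }
        }
        pthread_mutex_unlock(&lr->table_mutex);
        pthread_setspecific(reader_key, slot);
    }
    return slot;
}

/* Begin and end of a read transaction: no locks at all */
static void reader_begin(reader_slot *slot, size_t txnid) { slot->txnid = txnid; }
static void reader_end(reader_slot *slot)                 { slot->txnid = 0; }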
The reader table is used when a writer wants to allocate a page, and knows that the free list is
not empty. Writes are performed using copy-on-write semantics; whenever a page is to be
written, a copy is made and the copy is modified instead of the original. Once copied, the
original page's ID is added to an in-memory free list. When a transaction is committed, the in-
memory free list is saved as a single record in the free list DB along with the ID of the
transaction for this commit. When a writer wants to pull a page from the free list DB, it
compares the transaction ID of the oldest record in the free list DB with the transaction IDs of
all of the active readers. If the record in the free list DB is older than all of the readers, then
all of the pages in that record may be safely re-used because nothing else in the DB points to
them any more.
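In terms of the hypothetical structures sketched earlier, the writer's reclamation test reduces
to a comparison between the transaction ID recorded with a free-list entry and the oldest
snapshot still in use, roughly as follows (illustrative only; in the real design the free list is
itself stored as records in the free list DB):

#include <stddef.h>

/* Returns nonzero if the pages recorded as freed by transaction
 * `freed_txnid` may be recycled, i.e. no active reader still uses a
 * snapshot that old or older. */
static int can_reuse(const lock_region *lr, size_t freed_txnid)
{
    size_t oldest = (size_t)-1;

    /* Scan the reader table without taking any locks; a stale value only
     * makes the writer more conservative, never incorrect. */
    for (unsigned i = 0; i < lr->num_slots; i++) {
        size_t txnid = lr->readers[i].r.txnid;
        if (txnid != 0 && txnid < oldest)
            oldest = txnid;
    }

    /* Pages freed by a transaction older than every active reader are no
     * longer reachable from any live snapshot and may be reused. */
    return freed_txnid < oldest;
}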
The writer's scan of the reader table also requires no locks, so readers cannot block writers.
The only consequence of a reader holding onto an old snapshot for a long time is that page
reclaiming cannot be done; the writer will simply use newly allocated pages in the meantime.
4.3 Backend Features
The database layout in back-mdb is functionally identical to the one used in back-hdb so it is
also fully hierarchical. Entries are stored in a binary format based on the one used for back-
hdb, but with further encoding optimizations. The most significant optimization was to use a
mapping of AttributeDescriptions to small integers, so that their canonical names were no
longer stored in each entry. This saved a bit of space in the encoded entry, but more
importantly made Attribute decoding an O(1) operation instead of O(log N). Also, while the
MDB library doesn't need to allocate any memory to return data, entries still require Entry and
Attribute structures to be allocated. But since entries don't need to be kept persistently in a
cache, all allocations can be done from temporary thread-local memory. As a result of these
optimizations the entry decoder is 6.5 times faster overall than the one used in back-hdb.
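The effect of the AttributeDescription mapping can be illustrated with the following hypothetical
sketch (the structure names stand in for slapd's internal types): the encoded entry stores a
small integer per attribute, so decoding is a single array index into a per-database catalog
rather than an O(log N) lookup of the canonical name.

#include <stddef.h>

/* Stand-in for slapd's AttributeDescription */
typedef struct AttrDesc {
    const char *name;                 /* canonical attribute name, e.g. "cn" */
    /* ... syntax, matching rules, flags ... */
} AttrDesc;

/* Per-database catalog built at open time; the small integer stored in
 * each encoded entry is simply an index into this array. */
typedef struct AttrCatalog {
    AttrDesc **by_id;
    unsigned   count;
} AttrCatalog;

/* O(1) decode: one bounds check and one array index */
static AttrDesc *decode_attr(const AttrCatalog *cat, unsigned id)
{
    return (id < cat->count) ? cat->by_id[id] : NULL;
}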
Configuration for back-mdb is much simplified - there are no cache configuration directives.
The backend requires only a pathname for storing the database files, and a maximum allowed
size for the database. The configuration settings only affect the capacity of the database, not
its performance; there is nothing to tune.
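At the library level those two settings correspond to an environment open that needs only a
directory path and a maximum map size, roughly as sketched below. The mdb_env_* names follow the
library's public C API as it is known today and are used here as an assumption; the flag and mode
values are illustrative.

#include <stddef.h>
#include "lmdb.h"                     /* MDB library header; filename is an assumption */

/* Everything the backend needs to know: where the files live and how large
 * the map may grow.  There are no cache-size knobs to tune. */
int open_backend_env(MDB_env **envp, const char *dbdir, size_t maxsize)
{
    int rc = mdb_env_create(envp);
    if (rc) return rc;
    rc = mdb_env_set_mapsize(*envp, maxsize);      /* maximum database size */
    if (rc == 0)
        rc = mdb_env_open(*envp, dbdir, 0, 0600);  /* dbdir must already exist */
    if (rc)
        mdb_env_close(*envp);
    return rc;
}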
5. Results
Profiling was done using multiple tools, including FunctionCheck[24], valgrind callgrind[25],
and OProfile[26], to aid in optimization of MDB. OProfile has the least runtime overhead and
provides the best view of multi-threaded behavior, but since it is based on random samples it
tends to miss some data of interest. FunctionCheck runs four times slower than
normal, but since it uses instrumented code it always provides a complete profile of overall
function run times. callgrind is the slowest, at thirty times slower than normal, and only provides
relevant data for single-threaded operation, but since it does instruction-level profiling it gives
the most detailed view of program behavior. Since program behavior can vary wildly between
single-threaded and multi-processor operation, it was important to gather performance data
from a number of different perspectives.
Table 1 compares basic performance of back-mdb vs back-hdb for initially loading a test
database using slapadd in "quick" mode.
Concurrent searches          2            4            8            16
back-hdb (Debian BDB 4.7)    0m23.147s    0m30.384s    1m25.665s    17m15.114s
back-hdb                     0m24.617s    0m32.171s    1m04.817s    3m04.464s
back-mdb                     0m10.789s    0m10.842s    0m10.931s    0m12.023s
Table 3: Concurrent Search Times
The first run of this test with back-hdb yielded some extraordinarily poor results. Later
testing revealed that the test had accidentally been run using the stock build of BDB 4.7
provided by Debian, instead of the self-compiled build we usually use in our testing. The principal
difference is that we always build BDB with the configure option --with-mutex=POSIX/pthread,
whereas by default BDB uses a hybrid of spinlocks and pthread mutexes. The spinlocks are
fairly efficient within a single CPU socket, but they scale extremely poorly as the number of
processors increases. back-mdb's scaling is essentially flat across arbitrary numbers of
processors since it has no locking to slow it down. The performance degrades slightly at the
16 search case because at that point all of the processors on our test machine are busy so
the clients and slapd are competing with other system processes for CPU time. As another
point of reference, the time required to copy the MDB database to /dev/null using 'dd' was
10.20 seconds. Even with all of the decoding and filtering that slapd needed to do, scanning
the entire DB was only 6% slower than a raw copy operation.
The previous tests show worst-case performance for search operations. For more real-world
results, we move on to using SLAMD[27]. (SLAMD has known performance issues, but we've
gotten used to them, and staying with the same tool lets us compare with historical results
from our previous work as well.) Table 4 summarizes the results for back-hdb vs back-mdb
with randomly generated queries across the 5 million entry database.
6. Conclusions
The combination of memory-mapped operation with Multi-Version Concurrency
Control proves to be extremely potent for LDAP directories. The administrative
overhead is minimal since MDB databases require no periodic cleanup or garbage
collection, and no particular tuning is needed. Code size and complexity have
been drastically reduced, while read performance has been significantly raised.
Write performance has been traded for read performance, but this is acceptable
and can be addressed in more depth in the future.
6.1 Portability
While initial development was done on Linux, MDB and back-mdb have been
ported to Mac OS X and Windows. No special problems are anticipated in porting to
other platforms.
References
1: Oracle, BerkeleyDB, 2011, http://www.oracle.com/technetwork/database/berkeleydb/overview/index.html
2: Wikipedia, B+ trees, http://en.wikipedia.org/wiki/B+_tree
3: Wikipedia, MVCC, http://en.wikipedia.org/wiki/Multiversion_concurrency_control
4: Margo Seltzer and Keith Bostic, Berkeley DB, The Architecture of Open Source Applications, 2011, http://www.aosabook.org/en/bdb.html
5: Jong-Hyuk Choi and Howard Chu, Dissection of Search Latency, 2001, http://www.openldap.org/lists/openldap-devel/200111/msg00042.html
6: Howard Chu, Better malloc strategies?, 2006, http://www.openldap.org/lists/openldap-devel/200607/msg00005.html
7: Howard Chu, Malloc Benchmarking, 2006, http://highlandsun.com/hyc/malloc/
8: Emery Berger, The Hoard Memory Allocator, 2006-2010, http://www.hoard.org
9: Sanjay Ghemawat and Paul Menage, TCMalloc: Thread-Caching Malloc, 2005, http://goog-perftools.sourceforge.net/doc/tcmalloc.html
10: Howard Chu, Minimize malloc fragmentation, 2006, http://www.openldap.org/lists/openldap-devel/200608/msg00033.html
11: Wikipedia, Page replacement algorithms, http://en.wikipedia.org/wiki/Page_replacement_algorithm#Least_recently_used
12: Howard Chu, CLOCK-Pro cache replacement code, 2007, http://www.openldap.org/lists/openldap-bugs/200701/msg00039.html
13: Howard Chu, Cache-thrashing protection, 2007, http://www.openldap.org/lists/openldap-commit/200711/msg00068.html
14: Wikipedia, Single-Level Store, http://en.wikipedia.org/wiki/Single-level_store
15: Multicians, Multics General Information and FAQ, http://www.multicians.org/general.html
16: Apollo Computer Inc., Domain/OS Design Principles, 1989, http://bitsavers.org/pdf/apollo/014962-A00_Domain_OS_Design_Principles_Jan89.pdf
17: R.A. Gingell, J.P. Moran, and W.A. Shannon, Virtual Memory Architecture in SunOS, USENIX Summer Conference, 1987, http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.132.8931
18: Marshall Kirk McKusick, Keith Bostic, Michael J. Karels, and John S. Quarterman, Memory Management, The Design and Implementation of the 4.4BSD Operating System, 1996, http://www.freebsd.org/doc/en/books/design-44bsd/overview-memory-management.html
19: Linus Torvalds, Status of the buffer cache in 2.3.7+, 1999, http://lkml.indiana.edu/hypermail/linux/kernel/9906.3/0670.html
20: Apache Software Foundation, Apache CouchDB: Technical Overview, 2008-2011, http://couchdb.apache.org/docs/overview.html
21: Martin Hedenfalk, OpenBSD ldapd source repository, 2010-2011, http://www.openbsd.org/cgi-bin/cvsweb/src/usr.sbin/ldapd/
22: Martin Hedenfalk, How the Append-Only Btree Works, 2011, http://www.bzero.se/ldapd/btree.html
23: Suntorn Sae-eung, Analysis of False Cache Line Sharing Effects on Multicore CPUs, 2010, http://scholarworks.sjsu.edu/etd_projects/2
24: Howard Chu, FunctionCheck, 2005, http://highlandsun.com/hyc/#fncchk
25: Valgrind Developers, Callgrind: a call-graph generating cache and branch prediction profiler, 2011, http://valgrind.org/docs/manual/cl-manual.html
26: OProfile - A System Profiler for Linux, 2011, http://oprofile.sourceforge.net/news/
27: UnboundID Corp., SLAMD Distributed Load Generation Engine, 2010, http://www.slamd.com