MyRocks: LSM-Tree Database Storage Engine Serving Facebook's Social Graph

ABSTRACT
Facebook uses MySQL to manage tens of petabytes of data in its main database, the User Database (UDB). UDB serves social activities such as likes, comments, and shares. In the past, Facebook used InnoDB, a B+Tree-based storage engine, as the backend. The challenge was to find an index structure with lower space and write amplification [1]. The LSM-tree [2] has the potential to address both bottlenecks. RocksDB, an LSM-tree-based key/value store, was already widely used in a variety of applications but had a very low-level key/value interface. To overcome these limitations, MyRocks, a new MySQL storage engine, was built on top of RocksDB by adding relational capabilities. With MyRocks, using the RocksDB API, significant efficiency gains were achieved while still benefiting from all the MySQL features and tools. The transition was mostly transparent to client applications.

Facebook completed the UDB migration from InnoDB to MyRocks in 2017. Since then, ongoing improvements in production operations, together with additional enhancements to MySQL, MyRocks, and RocksDB, have provided even greater efficiency wins. MyRocks also reduced the instance size by 62.3% for UDB data sets and performed fewer I/O operations than InnoDB. Finally, MyRocks consumed less CPU time for serving the same production traffic workload. These gains enabled us to reduce the number of database servers in UDB to less than half, saving significant resources. In this paper, we describe our journey to build and run an OLTP LSM-tree SQL database at scale. We also discuss the features we implemented to keep pace with UDB workloads, what made the migration easier, and what operational and software development challenges we faced during the two years of running MyRocks in production.

Among the new features we introduced in RocksDB were transactional support, bulk loading, and prefix bloom filters, all of which are available for the benefit of all RocksDB users.

PVLDB Reference Format:
Yoshinori Matsunobu, Siying Dong, Herman Lee. MyRocks: LSM-Tree Database Storage Engine Serving Facebook's Social Graph. PVLDB, 13(12): 3217-3230, 2020.
DOI: https://doi.org/10.14778/3415478.3415546

1. INTRODUCTION
The Facebook UDB serves the most important social graph workloads [3]. The initial Facebook deployments used the InnoDB storage engine with MySQL as the backend. InnoDB was a robust, widely used database and it performed well. Meanwhile, hardware trends shifted from slow but affordable magnetic drives to fast but more expensive flash storage. Transitioning to flash storage in UDB shifted the bottleneck from Input/Output Operations Per Second (IOPS) to storage capacity. From a space perspective, InnoDB had three big challenges that were hard to overcome: index fragmentation, compression inefficiencies, and space overhead per row (13 bytes) for handling transactions. To further optimize space, as well as to serve reads and writes with appropriately low latency, we believed an LSM-tree database optimized for flash storage was a better fit for UDB. However, there were many different types of client applications accessing UDB. Rewriting client applications for a new database was going to take a long time, possibly multiple years, and we wanted to avoid that as well.

We decided to integrate RocksDB, a modern open source LSM-tree based key/value store library optimized for flash, into MySQL. As seen in Figure 1, by using the MySQL pluggable storage engine architecture, it was possible to replace the storage layer without changing the upper layers such as client protocols, SQL, and replication.

Figure 1: MySQL and MyRocks Storage Engine

We called this engine MyRocks. When we started the project, our goal was to reduce the number of UDB servers by 50%. That required the MyRocks space usage to be no more than 50% of the compressed InnoDB format, while maintaining comparable CPU and I/O utilization. We expected that achieving CPU utilization similar to InnoDB's would be the hardest challenge, since flash I/O had sufficient read IOPS capacity and the LSM-tree database had less write amplification. Since InnoDB was a fast, reliable database with many features on which our Engineering team relied, there were many challenges in ensuring there was no gap between InnoDB and MyRocks.
Among the significant challenges were: (1) Increased CPU, memory, and I/O pressure. MyRocks compresses the database to half its size, which requires more CPU, memory, and I/O to handle twice the number of instances on each host. (2) A larger gap between forward and backward range scans. The LSM-tree allows data blocks to be encoded in a more compact form, and as a result, forward scans are faster than backward scans. (3) Key comparisons. LSM-tree key comparisons are invoked more frequently than in a B-tree. (4) Query performance. MyRocks was slower than InnoDB in range query performance. (5) Bloom filter memory usage. Caching bloom filters in memory is important for LSM-tree performance, but this consumes a non-trivial amount of DRAM and increases memory pressure. (6) Tombstone management. With LSM-trees, deletes are processed by adding markers, which can sometimes cause performance problems with frequently updated/deleted rows. (7) Compactions, especially when triggered by burst writes, may cause stalls.

Section 3 provides the details of how those challenges were addressed. In short, the highlighted innovations are (1) the prefix bloom filter, so that range scans with equality predicates are faster (Section 3.2.2.1), (2) mem-comparable keys in MyRocks, allowing more efficient character comparisons (Section 3.2.1.1), (3) a new tombstone/deletion type to more efficiently handle secondary index maintenance (Section 3.2.2.2), (4) bulk loading to skip compactions on data loading (Section 3.2.3.4), (5) rate-limited compaction file generation and deletion to prevent stalls (Section 3.2.3.2), and (6) hybrid compression: using a faster compression algorithm for upper RocksDB levels and a stronger algorithm for the bottommost level, so that MemTable flush and compaction can keep up with write ingestion rates with minimal space overhead (Section 3.3.4).
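To make the filtering and compaction ideas above concrete, the sketch below shows how an engine built on RocksDB could enable a prefix bloom filter, per-level (hybrid) compression, and rate-limited background writes through the public RocksDB options. It is illustrative only; the prefix length, codec choices, and rate limit are assumptions, not the settings MyRocks uses in production.

#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/rate_limiter.h>
#include <rocksdb/slice_transform.h>
#include <rocksdb/table.h>

rocksdb::Options MakeOptions() {
  rocksdb::Options options;

  // Prefix bloom filter: build filters over a fixed-size key prefix so that
  // range scans with an equality predicate on the prefix can skip SST files
  // that cannot contain matching keys.
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(12));

  rocksdb::BlockBasedTableOptions table_options;
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
  table_options.whole_key_filtering = false;  // filter on prefixes only
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  // Hybrid compression: a cheaper codec for the frequently rewritten upper
  // levels, a stronger codec for the bottommost level that holds most data.
  options.compression = rocksdb::kLZ4Compression;
  options.bottommost_compression = rocksdb::kZSTD;

  // Rate-limit background flush/compaction writes so write bursts do not
  // stall foreground traffic on flash (64 MB/s is an arbitrary example).
  options.rate_limiter.reset(rocksdb::NewGenericRateLimiter(64 << 20));
  return options;
}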
MyRocks also has native performance benefits over InnoDB, such as not needing random reads for maintaining non-unique secondary indexes. More writes can be consolidated, with fewer total bytes written to flash. The read performance improvements and write performance benefits were evident when UDB was migrated from InnoDB to MyRocks with no degradation in CPU utilization.

Comprehensive correctness, performance, and reliability validations were needed prior to migration. We built two infrastructure services to help the migration. One was MyShadow, which captured production queries and replayed them against test instances. The other was a data correctness tool which compared full index data and query results between InnoDB and MyRocks instances. We ran these two tools to verify that MySQL instances running MyRocks did not return wrong results, did not return unexpected errors, did not regress CPU utilization, and did not cause outstanding stalls. After completing the validations, the InnoDB to MyRocks migration itself was relatively easy. Since MySQL replication is independent of the storage engine, adding MyRocks instances and removing InnoDB instances was simple. The bulk data loading feature in MyRocks greatly reduced data migration time, as it could load indexes directly into the LSM-tree and bypass all MemTable writes and compactions.
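The RocksDB side of such bulk loading can be sketched with the SstFileWriter and IngestExternalFile APIs, which write pre-sorted data into SST files and link them into the tree directly, skipping the MemTable, WAL, and normal compactions. This is a simplified illustration of the underlying mechanism, not the MyRocks bulk-load code path; the helper function and its parameters are hypothetical.

#include <string>
#include <utility>
#include <vector>
#include <rocksdb/db.h>
#include <rocksdb/env.h>
#include <rocksdb/options.h>
#include <rocksdb/sst_file_writer.h>

// Write already-sorted key/value pairs into an SST file, then ingest the file
// directly into the database, bypassing MemTable writes and compactions.
rocksdb::Status BulkLoad(rocksdb::DB* db, const rocksdb::Options& options,
                         const std::vector<std::pair<std::string, std::string>>& sorted_kvs,
                         const std::string& sst_path) {
  rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
  rocksdb::Status s = writer.Open(sst_path);
  if (!s.ok()) return s;
  for (const auto& kv : sorted_kvs) {   // keys must already be in sorted order
    s = writer.Put(kv.first, kv.second);
    if (!s.ok()) return s;
  }
  s = writer.Finish();
  if (!s.ok()) return s;

  rocksdb::IngestExternalFileOptions ifo;
  ifo.move_files = true;                // move/link the file instead of copying
  return db->IngestExternalFile({sst_path}, ifo);
}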
The InnoDB to MyRocks UDB migrations were completed in August 2017. For the same data sets, MyRocks, with its modern LSM-tree structure and compression techniques, reduced the instance size by 62.3% compared to compressed InnoDB. Lower secondary index maintenance overhead and overall read performance improvements resulted in slightly reduced CPU time. Bytes written to flash storage went down by 75%, which helped avoid IOPS bottlenecks and opened possibilities to use more affordable flash storage devices with lower write cycles. These gains enabled us to reduce the number of database servers in UDB to less than half with MyRocks. Since 2017, regressions have been continuously tracked via MyShadow and data correctness checks. We also improved compaction to guarantee the removal of stale data, meeting the increasing demands of data privacy.

This work is valuable because: (1) Since SQL databases built on LSM-trees are gaining popularity, practical techniques for tuning and improving LSM-trees are valuable. To the best of our knowledge, this is the first time these techniques have been implemented in a large-scale production system. (2) While some high-level B-tree vs LSM-tree comparisons are documented, our work exposed the implementation challenges for an LSM-tree to match B-tree performance, the extra benefits of an LSM-tree, and optimizations that can narrow the gap. (3) Migrating data across different databases or storage engines is common. This paper shares the processes used to migrate the database to a different storage engine. The experience is all the more interesting because the target storage engine was relatively immature.

In this paper, we describe three contributions:
1. An overview of UDB, the challenges with B-Tree indexes, and why we thought an LSM-tree database optimized for flash storage was suitable (Section 2).
2. How we optimized MyRocks for various read workloads and compactions (Section 3).
3. How we migrated to MyRocks in production (Section 4).

Then we show migration results in Section 5, followed by lessons learned in Section 6. Finally, we discuss related work in Section 7 and offer concluding remarks in Section 8.

2. BACKGROUND AND MOTIVATION
2.1 UDB Architecture
UDB is our massively sharded database service. We have customized MySQL with hundreds of features to operate the database for our needs. All customized extensions to MySQL are released as open source [4].

Facebook has many geographically distributed data centers across the world [5], and the UDB instances run in some of them. Where other distributed database solutions place up to three copies in the same region and synchronously replicate among them, the Facebook ecosystem is so large that adopting this architecture for UDB is not practical, as it would force us to maintain more than 10 database copies. We only maintain one database copy per region. However, many applications relied on short commit latency and did not function well with the tens of milliseconds needed for synchronous cross-region transaction commits. These constraints led us to deploy the MySQL distributed system architecture shown in Figure 2.

We used traditional asynchronous MySQL replication for cross-region replication. However, for in-region fault tolerance, we created a middleware called Binlog Server (Log Backup Unit) which can retrieve and serve the MySQL replication logs known as Binary Logs. Binlog Servers only retain a short period of recent transaction logs and do not maintain a full copy of the database. Each MySQL instance replicates its log to two Binlog Servers using the MySQL Semi-Synchronous protocol. All three servers are spread across different failure domains within the region. This architecture made it possible to achieve both short (in-region) commit latency and one database copy per region.

UDB is a persistent data store for our social graphs. On top of UDB, there is a huge cache tier called TAO [3]. TAO is a distributed write-through cache that handles social graphs and maps them to individual rows in UDB. Aside from legacy applications, most read and write requests to UDB originate from TAO. In general, applications do not directly issue queries to UDB, but instead issue requests to TAO. TAO provides a limited number of APIs to applications for handling social graphs. Limiting the access methods available to applications helped to prevent them from issuing bad queries to UDB and to stabilize workloads.

We use MySQL's Binary Logs not only for MySQL replication, but also for notifying external applications of updates. We created a pub-sub service called Wormhole [6] for this. One of the use cases of Wormhole is invalidating the TAO cache in remote regions by reading the Binary Log of the region's MySQL instance.

Figure 2: UDB Architecture
2.2 UDB Storage
UDB was one of the first database services built at Facebook. Both software and hardware trends have changed significantly since that time. Early versions of UDB ran on spinning hard drives that held a small amount of data because of low IOPS. The workload was carefully monitored to prevent the servers from overwhelming the disk drives. In 2010, we started adding solid state drives to our UDB servers to improve I/O throughput. The first iteration used a flash device as a cache for the HDD. While increasing server cost, Flashcache [7] improved IOPS capacity from hundreds per second to thousands per second, allowing us to support much more data on a single server. In 2013, we eliminated the HDD and switched to pure flash storage. This setup was no longer bounded by read I/O throughput, but overall cost per GB was significantly higher than HDD or Flashcache. Reducing the space used by the database became a priority.

The most straightforward solution was to compress the data. MySQL's InnoDB storage engine supports compression and we enabled it in 2011. Space reduction was approximately 50%, which was still insufficient. In studying the storage engine, we found that the B-Tree structure wasted space because of index fragmentation, a common issue for B-Tree databases; approximately 25% to 30% of each InnoDB block was wasted. We tried to mitigate the problem with B-tree defragmentation, but it was less effective on our workload than expected. In UDB, a continuous stream of mostly random writes would quickly fragment pages that were just defragmented. In order to keep space usage down, we needed to defragment constantly and aggressively, which, in turn, reduced server performance and wore out the flash faster. Flash durability was already becoming a concern, since higher-durability drives were more expensive.

Compression was also limited in InnoDB. The default InnoDB data block size was 16KB, and table-level compression required predefining the after-compression block size (key_block_size) as one of 1KB, 2KB, 4KB, or 8KB. This guarantees that pages can be individually updated, a basic requirement for a B-tree. For example, if key_block_size was 8KB, then even if 16KB of data compressed to 5KB, the actual space usage was still 8KB, so the storage savings were capped at 50%. Too small a block size results in high CPU overhead from increased page splits and compression attempts. We used 8KB for most tables, and 4KB for tables updated infrequently, so the overall space savings were limited.
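The arithmetic behind this limitation can be made explicit with a small illustrative snippet (not taken from InnoDB or MyRocks source): a compressed InnoDB page occupies the full configured key_block_size whenever it fits, while an LSM-tree engine such as RocksDB stores a compressed block at roughly its actual size.

#include <cstddef>

// Illustrative arithmetic for the example above: a 16KB page compressed to 5KB.
constexpr std::size_t kPageSize = 16 * 1024;
constexpr std::size_t kCompressed = 5 * 1024;
constexpr std::size_t kKeyBlockSize = 8 * 1024;

// InnoDB: the compressed page must fit into key_block_size and then occupies
// all of it (otherwise the page is split and re-compressed, not modeled here).
constexpr std::size_t kInnoDBOnDisk = kKeyBlockSize;   // 8KB, savings capped at 50%
// RocksDB: the compressed block is stored as-is, ignoring small metadata.
constexpr std::size_t kRocksDBOnDisk = kCompressed;    // about 5KB, roughly 69% savings

static_assert(kInnoDBOnDisk * 2 == kPageSize, "InnoDB savings capped at 50%");
static_assert(kRocksDBOnDisk < kInnoDBOnDisk, "LSM layout saves more space");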
Another issue we faced with InnoDB on flash storage was higher write amplification and occasional stalls caused by writes to flash. In InnoDB, a dirty page is eventually written back to a data file. Because TAO is responsible for most caching, to be efficient, MySQL runs on hardware where the working set is not cached in DRAM, so writing back dirty pages happens frequently. Even a single row modification in an InnoDB data block results in the entire 8KB page being written. InnoDB also has a "double-write" feature to protect against torn-page corruption during unexpected shutdowns. These factors amplified writes significantly. In some cases, we hit issues where burst write rates triggered I/O stalls on flash.

Based on the issues we faced with InnoDB in UDB, it was obvious we needed a database implementation that was better space optimized and had lower write amplification. We found that an LSM-tree database fit those two bottlenecks well. Another reason we became interested in the LSM-tree was its friendliness to tiered storage, as new storage technologies created more tiering opportunities. Although we have not yet taken advantage of this, we anticipate benefits from it in the future.

Despite the potential benefits, there were several challenges to adopting an LSM-tree database for UDB. First, back in 2010 there was no production-proven LSM-tree database that ran well on flash. The majority of popular LSM-tree databases ran on HDDs, and none had a proven case study of running on flash at scale.

Secondly, UDB still needed to serve a lot of read requests. While TAO has a high hit rate, read efficiency was still important because TAO often issued read requests to UDB for batch-style jobs that had a low TAO cache hit rate. Also, TAO often went through a "cold restart" to invalidate caches and refresh from UDB. Write requests also triggered read requests: all UDB tables had primary keys, so inserts needed to perform unique key constraint checks, and updates/deletes needed to find the previous rows. Deletes or updates via non-primary keys needed reads to find the primary keys.

For these reasons, it was important to serve reads efficiently as well. A B-Tree database like InnoDB is well suited for both read and write workloads, while the LSM-tree shifts more toward write and space optimizations. So, it was questionable whether LSM-tree databases could handle our read workloads on flash.

2.3 Common UDB Tables
UDB mainly has two types of tables for storing social data: one for objects and the other for associations between objects [3]. Each object and association has a type (fbtype and assoc_type) defining its characteristics. The fbtype or assoc_type determines the physical table that stores the object or association. The common object table is called fbobj_info; it stores common objects keyed by object type (fbtype) and identifier (fbid). The object itself is stored in a "data" column in a serialized format that depends on the fbtype. Association tables store associations between objects. For example, the assoc_comments table stores associations of comments on Facebook activities (e.g., posts), keyed by a pair of identifiers (id1 and id2) and the type of the association (assoc_type). The association tables have a secondary index called id1_type, designed to optimize range scans. Getting the list of ids (id2) associated with an id (id1) is common logic on Facebook, such as getting the list of people's identifiers who liked a certain post.

From a schema point of view, object tables are accessed more like a key/value store than a relational model. On the other hand, association tables have a meaningful schema, such as the pair of ids. We adopted an optimization called a "covering index" [8] for the id1_type secondary index: by including all relevant columns in the index, range scans can be completed without random reads against the primary key. Typical social graph updates modify both object tables and association tables for the same id1 in one database transaction, so having both tables inside one database instance makes sense to gain the benefits of the transactions' ACID capabilities.
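As an illustration of why the covering id1_type index avoids random reads, consider how an engine in the spirit of MyRocks could lay the index out in a key/value store: if the index key is a concatenation of (index id, id1, assoc_type, id2) and the covered columns live in the value, listing the associations of an id1 becomes a single forward prefix scan. The key layout and function below are hypothetical, not the actual MyRocks encoding.

#include <memory>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Scan all id1_type entries whose key starts with the encoded
// <index_id><id1><assoc_type> prefix. Every covered column is in it->value(),
// so no lookup back into the primary key is needed.
void ScanAssociations(rocksdb::DB* db, const std::string& prefix) {
  rocksdb::ReadOptions read_options;
  read_options.prefix_same_as_start = true;  // assumes a matching prefix extractor
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_options));
  for (it->Seek(prefix); it->Valid() && it->key().starts_with(prefix); it->Next()) {
    // it->key() ends with id2; it->value() holds the covered columns.
  }
}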
2.4 RocksDB: Optimized for Flash
Utilizing flash storage is not unique to MySQL; other applications at Facebook already had years of experience with it. To address the similar challenges faced by those applications, a new key/value store library, RocksDB [9], was created for flash in 2012. By the time we started to look for an alternative storage engine for MySQL, RocksDB was already used in a number of services, including ZippyDB [10], Laser [11], and Dragon [12].

RocksDB is a key/value store library optimized for the characteristics of flash-based SSDs. When choosing the main data structure of the engine, we studied several known data structures and chose the LSM-tree for its low write amplification and good balance of read performance [1]. The implementation is based on LevelDB [13].

2.4.1 RocksDB Architecture
Whenever data is written to RocksDB, it is added to an in-memory write buffer called the MemTable, as well as to the Write-Ahead Log (WAL). Once the MemTable reaches a predetermined size, its contents are flushed out to a "Sorted Strings Table" (SST) data file. Each SST stores data in sorted order, divided into blocks. Each SST also has an index block for binary search, with one key per SST block. SSTs are organized into a sequence of sorted runs of exponentially increasing size, called levels, where each level has multiple SSTs, as depicted in Figure 3. To maintain the size of each level, some SSTs in level L are selected and merged with the overlapping SSTs in level L+1. This process is called compaction. We call the last level Lmax.

Figure 3: RocksDB Architecture

In the read path, a key lookup occurs at each successive level until the key is found or it is determined that the key is not present in the last level. It begins by searching all MemTables, followed by all Level-0 SSTs, and then the SSTs at the following levels. At each of these successive levels, a full binary search is used. Bloom filters are kept in each SST file and are used to eliminate unnecessary searches within an SST file.
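The write and read paths described above surface directly in the basic RocksDB API. The minimal sketch below (the path and keys are arbitrary examples) annotates where each call touches the MemTable, WAL, and SST levels.

#include <cassert>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_example", &db);
  assert(s.ok());

  // Put(): the entry goes into the MemTable and the WAL; it reaches an SST
  // file only when the MemTable is flushed, and moves down the levels later
  // through background compaction.
  s = db->Put(rocksdb::WriteOptions(), "key1", "value1");
  assert(s.ok());

  // Get(): searches the MemTables first, then Level-0 SSTs, then each lower
  // level, using per-SST bloom filters to skip files that cannot hold the key.
  std::string value;
  s = db->Get(rocksdb::ReadOptions(), "key1", &value);
  assert(s.ok() && value == "value1");

  delete db;
  return 0;
}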
2.4.2 Why RocksDB?
As mentioned in Section 2.2, space utilization and write amplification are two bottlenecks of UDB. Write amplification was the initial optimization goal for RocksDB, so it is a perfect fit. The LSM-tree is more effective because it avoids in-place updates to pages, which in UDB caused whole page writes for small updates. Updates to an LSM-tree are batched, and when they are written out, pages contain only updated entries, except for the last sorted run. By the time updates are finally applied to the last sorted run, many updates have accumulated, so a good percentage of each page is newly updated data.

Besides write amplification, we still needed to address the other major bottleneck: space utilization. We noticed that the LSM-tree does significantly better than the B-tree on this metric too. For InnoDB, space amplification mostly comes from fragmentation and less efficient compression. As mentioned in Section 2.2, InnoDB wasted 25-30% of its space on fragmentation. The LSM-tree does not suffer from this problem; its equivalent is dead data not yet removed from the tree. The LSM-tree's dead data is removed by compaction, and by tuning compaction we are able to keep the ratio of dead data as low as 10% [14]. RocksDB also optimizes for space because it works well with compression: if 16KB of data is compressed to 5KB, RocksDB uses just 5KB while InnoDB aligns to 8KB, so RocksDB is much more space efficient. Also, InnoDB has significant space overhead per row for handling transactions (a 6-byte transaction id and a 7-byte roll pointer). RocksDB keeps a 7-byte sequence number for each row for snapshot reads. However, RocksDB converts sequence numbers to zero during compaction if no other transaction is referencing them. A zero sequence number uses very little space after compression. In practice, most rows in Lmax have zero sequence numbers, so the space saving is significant, especially if the average row size is small.

Since RocksDB is a good fit to address the performance and efficiency challenges of UDB workloads, we decided to build MyRocks, a new MySQL storage engine, on top of RocksDB. The engine, implemented in MySQL 5.6, performs well compared to InnoDB in TPC-C benchmark results [14]. As Oracle releases newer versions of MySQL, we will continue to port MyRocks forward. The development work can be substantial because of new requirements for storage engines.

3. MYROCKS/ROCKSDB DEVELOPMENT
3.1 Design Goals
Re-architecting and migrating a large production database is a big engineering project. Before starting, we created several goals. While increasing efficiency was a high priority, it was also important that many other factors, such as reliability, privacy, security, and simplicity, did not regress when transitioning to MyRocks.

3.1.1 Maintained Existing Behavior of Apps and Ops
Implementing a new database was only part of our project. Successfully migrating the continuously operating UDB was also important. Hence, we made the ease of migration and operation a goal. The pluggable storage engine architecture in MySQL enabled that goal. Using the same client and SQL interface meant UDB client applications did not have to change, and many of our automation tools, such as instance monitoring, backups, and failover, continued to function with no usability loss.

...made it clear we did not target such use cases (RUM Conjecture compromise [1]).

3.1.4 Design Choices
3.1.4.1 Contributions to RocksDB
We added features to RocksDB where possible. RocksDB is widely used open source software, and we thought these features would benefit other RocksDB applications. MyRocks used the RocksDB APIs.

3.1.4.2 Clustered Index Format
UDB took advantage of the InnoDB clustered index structure. Primary key lookups could be completed with a single read since all columns are present. We adopted the same clustered index structure for MyRocks. Secondary key entries include the primary key columns to reference the corresponding primary key entries. There is no row ID.
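A simplified picture of this key layout, with hypothetical encoding helpers (the real MyRocks format also uses an index id prefix and mem-comparable column encodings), is sketched below: the primary key entry carries the whole row in its value, and secondary key entries embed the primary key columns instead of a row ID.

#include <string>

// Illustrative key shapes only, not the exact MyRocks on-disk format.
//   Primary entry:   key = <index_id><pk columns>              value = <remaining columns>
//   Secondary entry: key = <index_id><sk columns><pk columns>  value = <covered columns, if any>
std::string EncodePrimaryKey(const std::string& index_id, const std::string& pk_columns) {
  return index_id + pk_columns;                 // full row is stored in the value
}

std::string EncodeSecondaryKey(const std::string& index_id, const std::string& sk_columns,
                               const std::string& pk_columns) {
  return index_id + sk_columns + pk_columns;    // no hidden row ID; PK locates the row
}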