MyRocks: LSM-Tree Database Storage Engine Serving Facebook's Social Graph

ABSTRACT
Facebook uses MySQL to manage tens of petabytes of data in its main database, the User Database (UDB). UDB serves social activities such as likes, comments, and shares. In the past, Facebook used InnoDB, a B+Tree-based storage engine, as the backend. The challenge was to find an index structure with lower space and write amplification [1]. The LSM-tree [2] has the potential to address both bottlenecks. RocksDB, an LSM-tree-based key/value store, was already widely used in a variety of applications but had a very low-level key/value interface. To overcome these limitations, MyRocks, a new MySQL storage engine, was built on top of RocksDB by adding relational capabilities. With MyRocks, using the RocksDB API, significant efficiency gains were achieved while still benefiting from all the MySQL features and tools. The transition was mostly transparent to client applications.

Facebook completed the UDB migration from InnoDB to MyRocks in 2017. Since then, ongoing improvements in production operations, together with additional enhancements to MySQL, MyRocks, and RocksDB, have provided even greater efficiency wins. MyRocks also reduced the instance size by 62.3% for UDB data sets and performed fewer I/O operations than InnoDB. Finally, MyRocks consumed less CPU time for serving the same production traffic workload. These gains enabled us to reduce the number of database servers in UDB to less than half, saving significant resources. In this paper, we describe our journey to build and run an OLTP LSM-tree SQL database at scale. We also discuss the features we implemented to keep pace with UDB workloads, what made the migration easier, and what operational and software development challenges we faced during the two years of running MyRocks in production.

Among the new features we introduced in RocksDB were transactional support, bulk loading, and prefix bloom filters, all of which are available for the benefit of all RocksDB users.

PVLDB Reference Format:
Yoshinori Matsunobu, Siying Dong, Herman Lee. MyRocks: LSM-Tree Database Storage Engine Serving Facebook's Social Graph. PVLDB, 13(12): 3217-3230, 2020.
DOI: https://doi.org/10.14778/3415478.3415546

1. INTRODUCTION
The Facebook UDB serves the most important social graph workloads [3]. The initial Facebook deployments used the InnoDB storage engine with MySQL as the backend. InnoDB was a robust, widely used database and it performed well. Meanwhile, hardware trends shifted from slow but affordable magnetic drives to fast but more expensive flash storage. Transitioning to flash storage in UDB shifted the bottleneck from Input/Output Operations Per Second (IOPS) to storage capacity. From a space perspective, InnoDB had three big challenges that were hard to overcome: index fragmentation, compression inefficiencies, and space overhead per row (13 bytes) for handling transactions. To further optimize space, as well as to serve reads and writes with appropriately low latency, we believed an LSM-tree database optimized for flash storage was a better fit for UDB. However, there were many different types of client applications accessing UDB. Rewriting client applications for a new database was going to take a long time, possibly multiple years, and we wanted to avoid that as well.

We decided to integrate RocksDB, a modern open source LSM-tree based key/value store library optimized for flash, into MySQL. As seen in Figure 1, by using the MySQL pluggable storage engine architecture, it was possible to replace the storage layer without changing the upper layers such as client protocols, SQL, and replication.

Figure 1: MySQL and MyRocks Storage Engine

We called this engine MyRocks. When we started the project, our goal was to reduce the number of UDB servers by 50%. That required the MyRocks space usage to be no more than 50% of the compressed InnoDB format, while maintaining comparable CPU and I/O utilization. We expected that achieving CPU utilization similar to InnoDB's would be the hardest challenge, since flash I/O had sufficient read IOPS capacity and the LSM-tree database had less write amplification. Since InnoDB was a fast, reliable database with many features on which our Engineering team relied, there were many challenges in ensuring there was no gap between InnoDB and MyRocks.
Among the significant challenges were: (1) Increased CPU, memory, and I/O pressure. MyRocks compresses the database to half its size, which requires more CPU, memory, and I/O to handle twice the number of instances on each host. (2) A larger gap between forward and backward range scans. The LSM-tree allows data blocks to be encoded in a more compact form, and as a result, forward scans are faster than backward scans. (3) Key comparisons. LSM-tree key comparisons are invoked more frequently than in a B-tree. (4) Query performance. MyRocks was slower than InnoDB in range query performance. (5) Bloom filter memory usage. Caching bloom filters in memory is important for LSM-tree performance, but this consumes a non-trivial amount of DRAM and increases memory pressure. (6) Tombstone management. With LSM-trees, deletes are processed by adding markers, which can sometimes cause performance problems with frequently updated/deleted rows. (7) Compactions, especially when triggered by burst writes, may cause stalls.

Section 3 provides the details of how those challenges were addressed. In short, the highlighted innovations are (1) the prefix bloom filter, so that range scans with equality predicates are faster (Section 3.2.2.1), (2) mem-comparable keys in MyRocks, allowing more efficient character comparisons (Section 3.2.1.1), (3) a new tombstone/deletion type to more efficiently handle secondary index maintenance (Section 3.2.2.2), (4) bulk loading to skip compactions on data loading (Section 3.2.3.4), (5) rate-limited compaction file generation and deletion to prevent stalls (Section 3.2.3.2), and (6) hybrid compression: using a faster compression algorithm for upper RocksDB levels and a stronger algorithm for the bottommost level, so that MemTable flush and compaction can keep up with write ingestion rates with minimal space overhead (Section 3.3.4).
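To make the filtering and compaction ideas above concrete, the sketch below shows how an engine built on RocksDB could enable a prefix bloom filter, per-level (hybrid) compression, and rate-limited background writes through the public RocksDB options. It is illustrative only; the prefix length, codec choices, and rate limit are assumptions, not the settings MyRocks uses in production.

#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/rate_limiter.h>
#include <rocksdb/slice_transform.h>
#include <rocksdb/table.h>

rocksdb::Options MakeOptions() {
  rocksdb::Options options;

  // Prefix bloom filter: build filters over a fixed-size key prefix so that
  // range scans with an equality predicate on the prefix can skip SST files
  // that cannot contain matching keys.
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(12));

  rocksdb::BlockBasedTableOptions table_options;
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
  table_options.whole_key_filtering = false;  // filter on prefixes only
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  // Hybrid compression: a cheaper codec for the frequently rewritten upper
  // levels, a stronger codec for the bottommost level that holds most data.
  options.compression = rocksdb::kLZ4Compression;
  options.bottommost_compression = rocksdb::kZSTD;

  // Rate-limit background flush/compaction writes so write bursts do not
  // stall foreground traffic on flash (64 MB/s is an arbitrary example).
  options.rate_limiter.reset(rocksdb::NewGenericRateLimiter(64 << 20));
  return options;
}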
MyRocks also has native performance benefits over InnoDB, such as not needing random reads for maintaining non-unique secondary indexes. More writes can be consolidated, with fewer total bytes written to flash. The read performance improvements and write performance benefits were evident when UDB was migrated from InnoDB to MyRocks with no degradation in CPU utilization.

Comprehensive correctness, performance, and reliability validations were needed prior to migration. We built two infrastructure services to help the migration. One was MyShadow, which captured production queries and replayed them against test instances. The other was a data correctness tool which compared full index data and query results between InnoDB and MyRocks instances. We ran these two tools to verify that MySQL instances running MyRocks did not return wrong results, did not return unexpected errors, did not regress CPU utilization, and did not cause outstanding stalls. After completing the validations, the InnoDB to MyRocks migration itself was relatively easy. Since MySQL replication is independent of the storage engine, adding MyRocks instances and removing InnoDB instances was simple. The bulk data loading feature in MyRocks greatly reduced data migration time, as it could load indexes directly into the LSM-tree and bypass all MemTable writes and compactions.
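The RocksDB side of such bulk loading can be sketched with the SstFileWriter and IngestExternalFile APIs, which write pre-sorted data into SST files and link them into the tree directly, skipping the MemTable, WAL, and normal compactions. This is a simplified illustration of the underlying mechanism, not the MyRocks bulk-load code path; the helper function and its parameters are hypothetical.

#include <string>
#include <utility>
#include <vector>
#include <rocksdb/db.h>
#include <rocksdb/env.h>
#include <rocksdb/options.h>
#include <rocksdb/sst_file_writer.h>

// Write already-sorted key/value pairs into an SST file, then ingest the file
// directly into the database, bypassing MemTable writes and compactions.
rocksdb::Status BulkLoad(rocksdb::DB* db, const rocksdb::Options& options,
                         const std::vector<std::pair<std::string, std::string>>& sorted_kvs,
                         const std::string& sst_path) {
  rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
  rocksdb::Status s = writer.Open(sst_path);
  if (!s.ok()) return s;
  for (const auto& kv : sorted_kvs) {   // keys must already be in sorted order
    s = writer.Put(kv.first, kv.second);
    if (!s.ok()) return s;
  }
  s = writer.Finish();
  if (!s.ok()) return s;

  rocksdb::IngestExternalFileOptions ifo;
  ifo.move_files = true;                // move/link the file instead of copying
  return db->IngestExternalFile({sst_path}, ifo);
}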
The InnoDB to MyRocks UDB migrations were completed in August 2017. For the same data sets, MyRocks, with its modern LSM-tree structure and compression techniques, reduced the instance size by 62.3% compared to compressed InnoDB. Lower secondary index maintenance overhead and overall read performance improvements resulted in slightly reduced CPU time. Bytes written to flash storage went down by 75%, which helped avoid IOPS bottlenecks and opened possibilities to use more affordable flash storage devices with lower write cycles. These gains enabled us to reduce the number of database servers in UDB to less than half with MyRocks. Since 2017, regressions have been continuously tracked via MyShadow and data correctness checks. We also improved compaction to guarantee the removal of stale data, meeting the increasing demands of data privacy.

This work is valuable because: (1) Since SQL databases built on LSM-trees are gaining popularity, practical techniques for tuning and improving LSM-trees are valuable. To the best of our knowledge, this is the first time these techniques have been implemented in a large-scale production system. (2) While some high-level B-tree vs LSM-tree comparisons are documented, our work exposed the implementation challenges for an LSM-tree to match B-tree performance, the extra benefits of an LSM-tree, and optimizations that can narrow the gap. (3) Migrating data across different databases or storage engines is common. This paper shares the processes used to migrate the database to a different storage engine. The experience is all the more interesting because the target storage engine was relatively immature.

In this paper, we describe three contributions:
1. An overview of UDB, the challenges with B-Tree indexes, and why we thought an LSM-tree database optimized for flash storage was suitable (Section 2).
2. How we optimized MyRocks for various read workloads and compactions (Section 3).
3. How we migrated to MyRocks in production (Section 4).

Then we show migration results in Section 5, followed by lessons learned in Section 6. Finally, we discuss related work in Section 7 and offer concluding remarks in Section 8.

2. BACKGROUND AND MOTIVATION
2.1 UDB Architecture
UDB is our massively sharded database service. We have customized MySQL with hundreds of features to operate the database for our needs. All customized extensions to MySQL are released as open source [4].

Facebook has many geographically distributed data centers across the world [5], and the UDB instances run in some of them. Where other distributed database solutions place up to three copies in the same region and synchronously replicate among them, the Facebook ecosystem is so large that adopting this architecture for UDB is not practical, as it would force us to maintain more than 10 database copies. We only maintain one database copy per region. However, many applications relied on short commit latency and did not function well with the tens of milliseconds needed for synchronous cross-region transaction commits. These constraints led us to deploy the MySQL distributed system architecture shown in Figure 2.

We used traditional asynchronous MySQL replication for cross-region replication. However, for in-region fault tolerance, we created a middleware called Binlog Server (Log Backup Unit) which can retrieve and serve the MySQL replication logs known as Binary Logs. Binlog Servers only retain a short period of recent transaction logs and do not maintain a full copy of the database. Each MySQL instance replicates its log to two Binlog Servers using the MySQL Semi-Synchronous protocol. All three servers are spread across different failure domains within the region. This architecture made it possible to achieve both short (in-region) commit latency and one database copy per region.

UDB is a persistent data store for our social graphs. On top of UDB, there is a huge cache tier called TAO [3]. TAO is a distributed write-through cache that handles social graphs and maps them to individual rows in UDB. Aside from legacy applications, most read and write requests to UDB originate from TAO. In general, applications do not directly issue queries to UDB, but instead issue requests to TAO. TAO provides a limited number of APIs to applications for handling social graphs. Limiting the access methods available to applications helped to prevent them from issuing bad queries to UDB and to stabilize workloads.

We use MySQL's Binary Logs not only for MySQL replication, but also for notifying external applications of updates. We created a pub-sub service called Wormhole [6] for this. One of the use cases of Wormhole is invalidating the TAO cache in remote regions by reading the Binary Log of the region's MySQL instance.

Figure 2: UDB Architecture
2.2 UDB Storage
UDB was one of the first database services built at Facebook. Both software and hardware trends have changed significantly since that time. Early versions of UDB ran on spinning hard drives that held a small amount of data because of low IOPS. The workload was carefully monitored to prevent the servers from overwhelming the disk drives. In 2010, we started adding solid state drives to our UDB servers to improve I/O throughput. The first iteration used a flash device as a cache for the HDD. While increasing server cost, Flashcache [7] improved IOPS capacity from hundreds per second to thousands per second, allowing us to support much more data on a single server. In 2013, we eliminated the HDD and switched to pure flash storage. This setup was no longer bounded by read I/O throughput, but overall cost per GB was significantly higher than HDD or Flashcache. Reducing the space used by the database became a priority.

The most straightforward solution was to compress the data. MySQL's InnoDB storage engine supports compression and we enabled it in 2011. Space reduction was approximately 50%, which was still insufficient. In studying the storage engine, we found that the B-Tree structure wasted space because of index fragmentation, a common issue for B-Tree databases; approximately 25% to 30% of each InnoDB block was wasted. We tried to mitigate the problem with B-tree defragmentation, but it was less effective on our workload than expected. In UDB, a continuous stream of mostly random writes would quickly fragment pages that were just defragmented. In order to keep space usage down, we needed to defragment constantly and aggressively, which, in turn, reduced server performance and wore out the flash faster. Flash durability was already becoming a concern, since higher-durability drives were more expensive.

Compression was also limited in InnoDB. The default InnoDB data block size was 16KB, and table-level compression required predefining the after-compression block size (key_block_size) as one of 1KB, 2KB, 4KB, or 8KB. This guarantees that pages can be individually updated, a basic requirement for a B-tree. For example, if key_block_size was 8KB, then even if 16KB of data compressed to 5KB, the actual space usage was still 8KB, so the storage savings were capped at 50%. Too small a block size results in high CPU overhead from increased page splits and compression attempts. We used 8KB for most tables, and 4KB for tables updated infrequently, so the overall space savings were limited.
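The arithmetic behind this limitation can be made explicit with a small illustrative snippet (not taken from InnoDB or MyRocks source): a compressed InnoDB page occupies the full configured key_block_size whenever it fits, while an LSM-tree engine such as RocksDB stores a compressed block at roughly its actual size.

#include <cstddef>

// Illustrative arithmetic for the example above: a 16KB page compressed to 5KB.
constexpr std::size_t kPageSize = 16 * 1024;
constexpr std::size_t kCompressed = 5 * 1024;
constexpr std::size_t kKeyBlockSize = 8 * 1024;

// InnoDB: the compressed page must fit into key_block_size and then occupies
// all of it (otherwise the page is split and re-compressed, not modeled here).
constexpr std::size_t kInnoDBOnDisk = kKeyBlockSize;   // 8KB, savings capped at 50%
// RocksDB: the compressed block is stored as-is, ignoring small metadata.
constexpr std::size_t kRocksDBOnDisk = kCompressed;    // about 5KB, roughly 69% savings

static_assert(kInnoDBOnDisk * 2 == kPageSize, "InnoDB savings capped at 50%");
static_assert(kRocksDBOnDisk < kInnoDBOnDisk, "LSM layout saves more space");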
Another issue we faced with InnoDB on flash storage was higher write amplification and occasional stalls caused by writes to flash. In InnoDB, a dirty page is eventually written back to a data file. Because TAO is responsible for most caching, to be efficient, MySQL runs on hardware where the working set is not cached in DRAM, so writing back dirty pages happens frequently. Even a single row modification in an InnoDB data block results in the entire 8KB page being written. InnoDB also has a "double-write" feature to protect against torn-page corruption during unexpected shutdowns. These factors amplified writes significantly. In some cases, we hit issues where burst write rates triggered I/O stalls on flash.

Based on the issues we faced with InnoDB in UDB, it was obvious we needed a database implementation that was better space optimized and had lower write amplification. We found that an LSM-tree database fit those two bottlenecks well. Another reason we became interested in the LSM-tree was its friendliness to tiered storage, as new storage technologies created more tiering opportunities. Although we have not yet taken advantage of this, we anticipate benefits from it in the future.

Despite the potential benefits, there were several challenges to adopting an LSM-tree database for UDB. First, back in 2010 there was no production-proven LSM-tree database that ran well on flash. The majority of popular LSM-tree databases ran on HDDs, and none had a proven case study of running on flash at scale.

Secondly, UDB still needed to serve a lot of read requests. While TAO has a high hit rate, read efficiency was still important because TAO often issued read requests to UDB for batch-style jobs that had a low TAO cache hit rate. Also, TAO often went through a "cold restart" to invalidate caches and refresh from UDB. Write requests also triggered read requests: all UDB tables had primary keys, so inserts needed to perform unique key constraint checks, and updates/deletes needed to find the previous rows. Deletes or updates via non-primary keys needed reads to find the primary keys.

For these reasons, it was important to serve reads efficiently as well. A B-Tree database like InnoDB is well suited for both read and write workloads, while the LSM-tree shifts more toward write and space optimizations. So, it was questionable whether LSM-tree databases could handle our read workloads on flash.

2.3 Common UDB Tables
UDB mainly has two types of tables for storing social data: one for objects and the other for associations between objects [3]. Each object and association has a type (fbtype and assoc_type) defining its characteristics. The fbtype or assoc_type determines the physical table that stores the object or association. The common object table is called fbobj_info; it stores common objects keyed by object type (fbtype) and identifier (fbid). The object itself is stored in a "data" column in a serialized format that depends on the fbtype. Association tables store associations between objects. For example, the assoc_comments table stores associations of comments on Facebook activities (e.g., posts), keyed by a pair of identifiers (id1 and id2) and the type of the association (assoc_type). The association tables have a secondary index called id1_type, designed to optimize range scans. Getting the list of ids (id2) associated with an id (id1) is common logic on Facebook, such as getting the list of people's identifiers who liked a certain post.

From a schema point of view, object tables are accessed more like a key/value store than a relational model. On the other hand, association tables have a meaningful schema, such as the pair of ids. We adopted an optimization called a "covering index" [8] for the id1_type secondary index: by including all relevant columns in the index, range scans can be completed without random reads against the primary key. Typical social graph updates modify both object tables and association tables for the same id1 in one database transaction, so having both tables inside one database instance makes sense to gain the benefits of the transactions' ACID capabilities.
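As an illustration of why the covering id1_type index avoids random reads, consider how an engine in the spirit of MyRocks could lay the index out in a key/value store: if the index key is a concatenation of (index id, id1, assoc_type, id2) and the covered columns live in the value, listing the associations of an id1 becomes a single forward prefix scan. The key layout and function below are hypothetical, not the actual MyRocks encoding.

#include <memory>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Scan all id1_type entries whose key starts with the encoded
// <index_id><id1><assoc_type> prefix. Every covered column is in it->value(),
// so no lookup back into the primary key is needed.
void ScanAssociations(rocksdb::DB* db, const std::string& prefix) {
  rocksdb::ReadOptions read_options;
  read_options.prefix_same_as_start = true;  // assumes a matching prefix extractor
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_options));
  for (it->Seek(prefix); it->Valid() && it->key().starts_with(prefix); it->Next()) {
    // it->key() ends with id2; it->value() holds the covered columns.
  }
}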
2.4 RocksDB: Optimized for Flash
Utilizing flash storage is not unique to MySQL; other applications at Facebook already had years of experience with it. To address the similar challenges faced by those applications, a new key/value store library, RocksDB [9], was created for flash in 2012. By the time we started to look for an alternative storage engine for MySQL, RocksDB was already used in a number of services, including ZippyDB [10], Laser [11], and Dragon [12].

RocksDB is a key/value store library optimized for the characteristics of flash-based SSDs. When choosing the main data structure of the engine, we studied several known data structures and chose the LSM-tree for its low write amplification and good balance of read performance [1]. The implementation is based on LevelDB [13].

2.4.1 RocksDB Architecture
Whenever data is written to RocksDB, it is added to an in-memory write buffer called the MemTable, as well as to the Write-Ahead Log (WAL). Once the MemTable reaches a predetermined size, its contents are flushed out to a "Sorted Strings Table" (SST) data file. Each SST stores data in sorted order, divided into blocks. Each SST also has an index block for binary search, with one key per SST block. SSTs are organized into a sequence of sorted runs of exponentially increasing size, called levels, where each level has multiple SSTs, as depicted in Figure 3. To maintain the size of each level, some SSTs in level L are selected and merged with the overlapping SSTs in level L+1. This process is called compaction. We call the last level Lmax.

Figure 3: RocksDB Architecture

In the read path, a key lookup occurs at each successive level until the key is found or it is determined that the key is not present in the last level. It begins by searching all MemTables, followed by all Level-0 SSTs, and then the SSTs at the following levels. At each of these successive levels, a full binary search is used. Bloom filters are kept in each SST file and are used to eliminate unnecessary searches within an SST file.
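The write and read paths described above surface directly in the basic RocksDB API. The minimal sketch below (the path and keys are arbitrary examples) annotates where each call touches the MemTable, WAL, and SST levels.

#include <cassert>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_example", &db);
  assert(s.ok());

  // Put(): the entry goes into the MemTable and the WAL; it reaches an SST
  // file only when the MemTable is flushed, and moves down the levels later
  // through background compaction.
  s = db->Put(rocksdb::WriteOptions(), "key1", "value1");
  assert(s.ok());

  // Get(): searches the MemTables first, then Level-0 SSTs, then each lower
  // level, using per-SST bloom filters to skip files that cannot hold the key.
  std::string value;
  s = db->Get(rocksdb::ReadOptions(), "key1", &value);
  assert(s.ok() && value == "value1");

  delete db;
  return 0;
}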
2.4.2 Why RocksDB?
As mentioned in Section 2.2, space utilization and write amplification are two bottlenecks of UDB. Write amplification was the initial optimization goal for RocksDB, so it is a perfect fit. The LSM-tree is more effective because it avoids in-place updates to pages, which in UDB caused whole page writes for small updates. Updates to an LSM-tree are batched, and when they are written out, pages contain only updated entries, except for the last sorted run. By the time updates are finally applied to the last sorted run, many updates have accumulated, so a good percentage of each page is newly updated data.

Besides write amplification, we still needed to address the other major bottleneck: space utilization. We noticed that the LSM-tree does significantly better than the B-tree on this metric too. For InnoDB, space amplification mostly comes from fragmentation and less efficient compression. As mentioned in Section 2.2, InnoDB wasted 25-30% of its space on fragmentation. The LSM-tree does not suffer from this problem; its equivalent is dead data not yet removed from the tree. The LSM-tree's dead data is removed by compaction, and by tuning compaction we are able to keep the ratio of dead data as low as 10% [14]. RocksDB also optimizes for space because it works well with compression: if 16KB of data is compressed to 5KB, RocksDB uses just 5KB while InnoDB aligns to 8KB, so RocksDB is much more space efficient. Also, InnoDB has significant space overhead per row for handling transactions (a 6-byte transaction id and a 7-byte roll pointer). RocksDB keeps a 7-byte sequence number for each row for snapshot reads. However, RocksDB converts sequence numbers to zero during compaction if no other transaction is referencing them. A zero sequence number uses very little space after compression. In practice, most rows in Lmax have zero sequence numbers, so the space saving is significant, especially if the average row size is small.

Since RocksDB is a good fit to address the performance and efficiency challenges of UDB workloads, we decided to build MyRocks, a new MySQL storage engine, on top of RocksDB. The engine, implemented in MySQL 5.6, performs well compared to InnoDB in TPC-C benchmark results [14]. As Oracle releases newer versions of MySQL, we will continue to port MyRocks forward. The development work can be substantial because of new requirements for storage engines.

3. MYROCKS/ROCKSDB DEVELOPMENT
3.1 Design Goals
Re-architecting and migrating a large production database is a big engineering project. Before starting, we created several goals. While increasing efficiency was a high priority, it was also important that many other factors, such as reliability, privacy, security, and simplicity, did not regress when transitioning to MyRocks.

3.1.1 Maintained Existing Behavior of Apps and Ops
Implementing a new database was only part of our project. Successfully migrating the continuously operating UDB was also important. Hence, we made the ease of migration and operation a goal. The pluggable storage engine architecture in MySQL enabled that goal. Using the same client and SQL interface meant UDB client applications did not have to change, and many of our automation tools, such as instance monitoring, backups, and failover, continued to function with no usability loss.

...made it clear we did not target such use cases (RUM Conjecture compromise [1]).

3.1.4 Design Choices
3.1.4.1 Contributions to RocksDB
We added features to RocksDB where possible. RocksDB is widely used open source software, and we thought these features would benefit other RocksDB applications. MyRocks used the RocksDB APIs.

3.1.4.2 Clustered Index Format
UDB took advantage of the InnoDB clustered index structure. Primary key lookups could be completed with a single read since all columns are present. We adopted the same clustered index structure for MyRocks. Secondary key entries include the primary key columns to reference the corresponding primary key entries. There is no row ID.
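A simplified picture of this key layout, with hypothetical encoding helpers (the real MyRocks format also uses an index id prefix and mem-comparable column encodings), is sketched below: the primary key entry carries the whole row in its value, and secondary key entries embed the primary key columns instead of a row ID.

#include <string>

// Illustrative key shapes only, not the exact MyRocks on-disk format.
//   Primary entry:   key = <index_id><pk columns>              value = <remaining columns>
//   Secondary entry: key = <index_id><sk columns><pk columns>  value = <covered columns, if any>
std::string EncodePrimaryKey(const std::string& index_id, const std::string& pk_columns) {
  return index_id + pk_columns;                 // full row is stored in the value
}

std::string EncodeSecondaryKey(const std::string& index_id, const std::string& sk_columns,
                               const std::string& pk_columns) {
  return index_id + sk_columns + pk_columns;    // no hidden row ID; PK locates the row
}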