python-rocksdb Documentation
Release 0.6.7
1 Overview
1.1 Installing
1.2 Basic Usage of python-rocksdb
1.3 Python driver for RocksDB
1.4 Changelog
2 Contributing
3 RoadMap/TODO
Index
CHAPTER
ONE
OVERVIEW
import rocksdb
db = rocksdb.DB("test.db", rocksdb.Options(create_if_missing=True))
db.put(b"a", b"b")
print(db.get(b"a"))
1.1 Installing
Building rocksdb
This section briefly describes how to build RocksDB on an ordinary Debian/Ubuntu system. For more details see https://github.com/facebook/rocksdb/blob/master/INSTALL.md
Systemwide rocksdb
The following command installs the shared library in /usr/lib/ and the header files in /usr/include/
rocksdb/:
To uninstall use:
Local rocksdb
If you don’t like the system-wide installation, or you don’t have the permissions, it is possible to set the following
environment variables. These variables are picked up by the compiler, linker and loader:
export CPLUS_INCLUDE_PATH=${CPLUS_INCLUDE_PATH}:`pwd`/../include
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:`pwd`
export LIBRARY_PATH=${LIBRARY_PATH}:`pwd`
Building python-rocksdb
1.2.1 Open
import rocksdb
db = rocksdb.DB("test.db", rocksdb.Options(create_if_missing=True))
import rocksdb
opts = rocksdb.Options()
opts.create_if_missing = True
opts.max_open_files = 300000
opts.write_buffer_size = 67108864
opts.max_write_buffer_number = 3
opts.target_file_size_base = 67108864
opts.table_factory = rocksdb.BlockBasedTableFactory(
    filter_policy=rocksdb.BloomFilterPolicy(10),
    block_cache=rocksdb.LRUCache(2 * (1024 ** 3)),
    block_cache_compressed=rocksdb.LRUCache(500 * (1024 ** 2)))
db = rocksdb.DB("test.db", opts)
This assigns a cache of 2.5 GB, uses a bloom filter for faster lookups, and keeps more data (64 MB) in memory before
writing an .sst file.
1.2.2 About Bytes And Unicode
RocksDB stores all data as uninterpreted byte strings. pyrocksdb behaves the same and uses byte strings nearly
everywhere too. In python2 this is the str type, in python3 the bytes type. Since the default string type for string
literals differs between python2 and python3, it is strongly recommended to use an explicit b prefix for all byte string
literals in both python2 and python3 code, for example b'this is a byte string'. This avoids ambiguity and
ensures that your code keeps working as intended if you switch between python2 and python3.
The only places where you can pass unicode objects are filesystem paths, such as
• Directory name of the database itself rocksdb.DB.__init__()
• rocksdb.Options.wal_dir
• rocksdb.Options.db_log_dir
To encode these path names, sys.getfilesystemencoding() is used.
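As a quick sketch of the byte/unicode split described above (plain Python, no rocksdb needed):

```python
import sys

# Keys and values are byte strings: use an explicit b prefix.
key = b"this is a byte string"
assert isinstance(key, bytes)

# Filesystem paths, in contrast, may be unicode; they end up encoded
# with the filesystem encoding before reaching the C++ layer.
encoded_path = "test.db".encode(sys.getfilesystemencoding())
assert isinstance(encoded_path, bytes)
```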
1.2.3 Access
# Store
db.put(b"key", b"value")
# Get
db.get(b"key")
# Delete
db.delete(b"key")
batch = rocksdb.WriteBatch()
batch.put(b"key", b"v1")
batch.delete(b"key")
batch.put(b"key", b"v2")
batch.put(b"key", b"v3")
db.write(batch)
db.put(b"key1", b"v1")
db.put(b"key2", b"v2")
ret = db.multi_get([b"key1", b"key3"])
# prints b"v1"
print(ret[b"key1"])
# prints None
print(ret[b"key3"])
1.2.4 Iteration
Iterators behave slightly differently than you might expect. By default they are not valid, so you have to call one of
the seek methods first:
db.put(b"key1", b"v1")
db.put(b"key2", b"v2")
db.put(b"key3", b"v3")
it = db.iterkeys()
it.seek_to_first()
it.seek_to_last()
# prints [b'key3']
print(list(it))
it.seek(b'key2')
# prints [b'key2', b'key3']
print(list(it))
it = db.itervalues()
it.seek_to_first()
it = db.iteritems()
it.seek_to_first()
Reversed iteration: seek to the last entry and iterate backwards with reversed():
it = db.iteritems()
it.seek_to_last()
# prints [(b'key3', b'v3'), (b'key2', b'v2'), (b'key1', b'v1')]
print(list(reversed(it)))
db.put(b'a1', b'a1_value')
db.put(b'a3', b'a3_value')
db.put(b'b1', b'b1_value')
db.put(b'b2', b'b2_value')
db.put(b'c2', b'c2_value')
db.put(b'c4', b'c4_value')
it = db.iteritems()
it.seek(b'a1')
assert it.get() == (b'a1', b'a1_value')
it.seek(b'a3')
assert it.get() == (b'a3', b'a3_value')
it.seek_for_prev(b'c4')
assert it.get() == (b'c4', b'c4_value')
it.seek_for_prev(b'c3')
assert it.get() == (b'c2', b'c2_value')
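The seek_for_prev behaviour above (position the iterator at the last key less than or equal to the target) can be mimicked with the standard library's bisect module. This stand-in is only an illustration of the semantics, not part of the pyrocksdb API:

```python
import bisect

# Sorted keys, matching the example database above.
keys = [b'a1', b'a3', b'b1', b'b2', b'c2', b'c4']

def seek_for_prev(sorted_keys, target):
    """Return the last key <= target, or None if no such key exists."""
    i = bisect.bisect_right(sorted_keys, target)
    return sorted_keys[i - 1] if i else None

assert seek_for_prev(keys, b'c4') == b'c4'  # exact hit
assert seek_for_prev(keys, b'c3') == b'c2'  # falls back to the previous key
```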
1.2.5 Snapshots
snapshot = db.snapshot()
db.put(b"a", b"2")
db.delete(b"b")

# iterates over the current state, including the writes above
it = db.iteritems()
it.seek_to_first()

# iterates over the state as of the snapshot, before the writes above
it = db.iteritems(snapshot=snapshot)
it.seek_to_first()
1.2.6 MergeOperator
Merge operators are useful for efficient read-modify-write operations. For more details see Merge Operator
A python merge operator must either implement the rocksdb.interfaces.AssociativeMergeOperator
or rocksdb.interfaces.MergeOperator interface.
The following example python merge operator implements a counter
class AssocCounter(rocksdb.interfaces.AssociativeMergeOperator):
    def merge(self, key, existing_value, value):
        if existing_value:
            s = int(existing_value) + int(value)
            return (True, str(s).encode('ascii'))
        return (True, value)

    def name(self):
        return b'AssocCounter'
opts = rocksdb.Options()
opts.create_if_missing = True
opts.merge_operator = AssocCounter()
db = rocksdb.DB('test.db', opts)
db.merge(b"a", b"1")
db.merge(b"a", b"1")
# prints b'2'
print(db.get(b"a"))
1.2.7 PrefixExtractor
According to the Prefix API, a prefix_extractor can reduce IO for scans within a prefix range. A python prefix
extractor must implement the rocksdb.interfaces.SliceTransform interface.
The following example presents a prefix extractor of a static size: the first 5 bytes are always used as the prefix.
class StaticPrefix(rocksdb.interfaces.SliceTransform):
    def name(self):
        return b'static'

    def transform(self, src):
        return (0, 5)

    def in_domain(self, src):
        return len(src) >= 5

    def in_range(self, dst):
        return len(dst) == 5

opts = rocksdb.Options()
opts.create_if_missing = True
opts.prefix_extractor = StaticPrefix()

db = rocksdb.DB('test.db', opts)
db.put(b'00001.x', b'x')
db.put(b'00001.y', b'y')
db.put(b'00001.z', b'z')
db.put(b'00002.x', b'x')
db.put(b'00002.y', b'y')
db.put(b'00002.z', b'z')
db.put(b'00003.x', b'x')
db.put(b'00003.y', b'y')
db.put(b'00003.z', b'z')
prefix = b'00002'
it = db.iteritems()
it.seek(prefix)
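After it.seek(prefix), you typically read entries only while they still carry the prefix. A sketch of that pattern using itertools.takewhile, with a plain list of keys standing in for the database iterator:

```python
import itertools

# Stand-in for db.iterkeys() positioned at the prefix: a sorted key stream.
keys = iter([b'00002.x', b'00002.y', b'00002.z', b'00003.x'])
prefix = b'00002'

# Consume entries only while the key still starts with the prefix.
in_range = list(itertools.takewhile(lambda k: k.startswith(prefix), keys))
assert in_range == [b'00002.x', b'00002.y', b'00002.z']
```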
import rocksdb
db = rocksdb.DB("test.db", rocksdb.Options(create_if_missing=True))
db.put(b'a', b'v1')
db.put(b'b', b'v2')
db.put(b'c', b'v3')
A backup is created like this. You can choose any path for the backup destination except the db path itself. If
flush_before_backup is True, the current memtable is flushed to disk before the backup.
backup = rocksdb.BackupEngine("test.db/backups")
backup.create_backup(db, flush_before_backup=True)
Restore is done like this. The two arguments are db_dir and wal_dir, which are usually the same.
backup = rocksdb.BackupEngine("test.db/backups")
backup.restore_latest_backup("test.db", "test.db")
As noted in MemtableFactories, RocksDB offers different implementations for the memtable representation. By
default rocksdb.SkipListMemtableFactory is used, but changing it to a different one is very easy.
Here is an example for the HashSkipList MemtableFactory. Keep in mind: to use the hash-based MemtableFactories
you must set rocksdb.Options.prefix_extractor. In this example all keys have a static prefix of length 5.
class StaticPrefix(rocksdb.interfaces.SliceTransform):
    def name(self):
        return b'static'

    def transform(self, src):
        return (0, 5)

    def in_domain(self, src):
        return len(src) >= 5

    def in_range(self, dst):
        return len(dst) == 5
opts = rocksdb.Options()
opts.prefix_extractor = StaticPrefix()
opts.allow_concurrent_memtable_write = False
opts.memtable_factory = rocksdb.HashSkipListMemtableFactory()
opts.create_if_missing = True
db = rocksdb.DB("test.db", opts)
db.put(b'00001.x', b'x')
db.put(b'00001.y', b'y')
db.put(b'00002.x', b'x')
opts = rocksdb.Options()
opts.allow_concurrent_memtable_write = False
opts.memtable_factory = rocksdb.VectorMemtableFactory()
opts.create_if_missing = True
db = rocksdb.DB("test.db", opts)
As noted in TableFactories, it is also possible to change the representation of the final data files. Here is an example
of how to use a PlainTable:
opts = rocksdb.Options()
opts.table_factory = rocksdb.PlainTableFactory()
opts.create_if_missing = True
db = rocksdb.DB("test.db", opts)
RocksDB has a compaction algorithm called universal. This style typically results in lower write amplification
but higher space amplification than level-style compaction. For more details see https://github.com/facebook/rocksdb/wiki/Rocksdb-Architecture-Guide#multi-threaded-compactions
Here is an example of how to switch to universal-style compaction:
opts = rocksdb.Options()
opts.compaction_style = "universal"
opts.compaction_options_universal = {"min_merge_width": 3}
In some cases you need to know which operations were performed on a WriteBatch. The pyrocksdb WriteBatch
supports the iterator protocol, see this example:
batch = rocksdb.WriteBatch()
batch.put(b"key1", b"v1")
batch.delete(b'a')
batch.merge(b'xxx', b'value')
for op, key, value in batch:
    print(op, key, value)
Options object
class rocksdb.Options
Important: The default values mentioned here describe the values of the C++ library only. This wrapper does
not set any default values itself, so as soon as the RocksDB developers change a default value this document could
be outdated. If you really depend on a default value, double-check it against the corresponding version of the C++
library.
https://github.com/facebook/rocksdb/blob/master/include/rocksdb/options.h
https://github.com/facebook/rocksdb/blob/master/util/options.cc
__init__(**kwargs)
All options mentioned below can also be passed as keyword-arguments in the constructor. For example:
import rocksdb
opts = rocksdb.Options(create_if_missing=True)
# is the same as
opts = rocksdb.Options()
opts.create_if_missing = True
create_if_missing
If True, the database will be created if it is missing.
Type: bool
Default: False
error_if_exists
If True, an error is raised if the database already exists.
Type: bool
Default: False
paranoid_checks
If True, the implementation will do aggressive checking of the data it is processing and will stop early if
it detects any errors. This may have unforeseen ramifications: for example, a corruption of one DB entry
may cause a large number of entries to become unreadable or for the entire DB to become unopenable. If
any of the writes to the database fails (Put, Delete, Merge, Write), the database will switch to read-only
mode and fail all other Write operations.
Type: bool
Default: True
write_buffer_size
Amount of data to build up in memory (backed by an unsorted log on disk) before converting to a sorted
on-disk file.
Larger values increase performance, especially during bulk loads. Up to max_write_buffer_number write
buffers may be held in memory at the same time, so you may wish to adjust this parameter to control
memory usage. Also, a larger write buffer will result in a longer recovery time the next time the database
is opened.
Type: int
Default: 4194304
max_write_buffer_number
The maximum number of write buffers that are built up in memory. The default is 2, so that when 1 write
buffer is being flushed to storage, new writes can continue to the other write buffer.
Type: int
Default: 2
min_write_buffer_number_to_merge
The minimum number of write buffers that will be merged together before writing to storage. If set to 1, then
all write buffers are flushed to L0 as individual files; this increases read amplification because a get request has
to check all of these files. Also, an in-memory merge may result in writing less data to storage if there are
duplicate records in each of these individual write buffers.
Type: int
Default: 1
max_open_files
Number of open files that can be used by the DB. You may need to increase this if your database has a large
working set. Value -1 means files opened are always kept open. You can estimate number of files based
on target_file_size_base and target_file_size_multiplier for level-based compaction. For universal-style
compaction, you can usually set it to -1.
Type: int
Default: 5000
compression
Compress blocks using the specified compression algorithm. This parameter can be changed dynamically.
num_levels
Number of levels for this database
Type: int
Default: 7
level0_file_num_compaction_trigger
Number of files to trigger level-0 compaction. A value <0 means that level-0 compaction will not be
triggered by number of files at all.
Type: int
Default: 4
level0_slowdown_writes_trigger
Soft limit on number of level-0 files. We start slowing down writes at this point. A value <0 means that no
writing slow down will be triggered by number of files in level-0.
Type: int
Default: 20
level0_stop_writes_trigger
Maximum number of level-0 files. We stop writes at this point.
Type: int
Default: 24
max_mem_compaction_level
Maximum level to which a new compacted memtable is pushed if it does not create overlap. We try to push
to level 2 to avoid the relatively expensive level 0=>1 compactions and to avoid some expensive manifest
file operations. We do not push all the way to the largest level since that can generate a lot of wasted disk
space if the same key space is being repeatedly overwritten.
Type: int
Default: 2
target_file_size_base
Type: int
Default: 2097152
target_file_size_multiplier
Type: int
Default: 1
max_bytes_for_level_base
Control maximum total data size for a level. max_bytes_for_level_base is the max total for level-1. The
maximum number of bytes for level L can be calculated as (max_bytes_for_level_base) *
(max_bytes_for_level_multiplier ^ (L-1)). For example, if max_bytes_for_level_base is 20MB and
max_bytes_for_level_multiplier is 10, the total data size for level-1 will be 20MB, the total file size for level-2
will be 200MB, and the total file size for level-3 will be 2GB.
Type: int
Default: 10485760
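The formula above can be checked quickly in plain Python (values taken from the 20MB / multiplier-10 example in the text):

```python
base = 20 * 1024 * 1024   # max_bytes_for_level_base: 20MB
multiplier = 10           # max_bytes_for_level_multiplier

def max_bytes_for_level(level):
    # (max_bytes_for_level_base) * (max_bytes_for_level_multiplier ** (L-1))
    return base * multiplier ** (level - 1)

assert max_bytes_for_level(1) == 20 * 1024 * 1024     # level-1: 20MB
assert max_bytes_for_level(2) == 200 * 1024 * 1024    # level-2: 200MB
assert max_bytes_for_level(3) == 2000 * 1024 * 1024   # level-3: ~2GB
```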
max_bytes_for_level_multiplier
See max_bytes_for_level_base
Type: int
Default: 10
max_bytes_for_level_multiplier_additional
Different max-size multipliers for different levels. These are multiplied by
max_bytes_for_level_multiplier to arrive at the max-size of each level.
Type: [int]
Default: [1, 1, 1, 1, 1, 1, 1]
max_compaction_bytes
We try to limit number of bytes in one compaction to be lower than this threshold. But it’s not guaranteed.
Value 0 will be sanitized.
Type: int
Default: target_file_size_base * 25
use_fsync
If True, then every store to stable storage will issue an fsync. If False, then every store to stable storage will
issue an fdatasync. This parameter should be set to True when storing data on a filesystem like ext3, which can
lose files after a reboot.
Type: bool
Default: False
db_log_dir
This specifies the info LOG dir. If it is empty, the log files will be in the same dir as the data. If it is non-empty,
the log files will be in the specified dir, and the db data dir’s absolute path will be used as the log file name’s
prefix.
Type: unicode
Default: ""
wal_dir
This specifies the absolute dir path for write-ahead logs (WAL). If it is empty, the log files will be in the same
dir as the data (dbname is used as the data dir by default). If it is non-empty, the log files will be kept in the
specified dir. When destroying the db, all log files in wal_dir and the dir itself are deleted.
Type: unicode
Default: ""
delete_obsolete_files_period_micros
The periodicity with which obsolete files get deleted. The default value is 6 hours. Files that go out of scope
through the compaction process will still get deleted automatically on every compaction, regardless of this setting.
Type: int
Default: 21600000000
max_background_compactions
Maximum number of concurrent background jobs, submitted to the default LOW priority thread pool
Type: int
Default: 1
max_background_flushes
Maximum number of concurrent background memtable flush jobs, submitted to the HIGH priority thread
pool. By default, all background jobs (major compaction and memtable flush) go to the LOW priority
pool. If this option is set to a positive number, memtable flush jobs will be submitted to the HIGH priority
pool. It is important when the same Env is shared by multiple db instances. Without a separate pool, long
running major compaction jobs could potentially block memtable flush jobs of other db instances, leading
to unnecessary Put stalls.
Type: int
Default: 1
max_log_file_size
Specify the maximal size of the info log file. If the log file is larger than max_log_file_size, a new info log
file will be created. If max_log_file_size == 0, all logs will be written to one log file.
Type: int
Default: 0
log_file_time_to_roll
Time for the info log file to roll (in seconds). If specified with non-zero value, log file will be rolled if it
has been active longer than log_file_time_to_roll. A value of 0 means disabled.
Type: int
Default: 0
keep_log_file_num
Maximum number of info log files to be kept.
Type: int
Default: 1000
soft_rate_limit
Puts are delayed 0-1 ms when any level has a compaction score that exceeds soft_rate_limit. This is
ignored when == 0.0. CONSTRAINT: soft_rate_limit <= hard_rate_limit. If this constraint does not hold,
RocksDB will set soft_rate_limit = hard_rate_limit. A value of 0 means disabled.
Type: float
Default: 0
hard_rate_limit
Puts are delayed 1 ms at a time when any level has a compaction score that exceeds hard_rate_limit. This
is ignored when <= 1.0. A value of 0 means disabled.
Type: float
Default: 0
rate_limit_delay_max_milliseconds
Max time a put will be stalled when hard_rate_limit is enforced. If 0, then there is no limit.
Type: int
Default: 1000
max_manifest_file_size
manifest file is rolled over on reaching this limit. The older manifest file be deleted. The default value is
MAX_INT so that roll-over does not take place.
Type: int
Default: (2**64) - 1
table_cache_numshardbits
Number of shards used for table cache.
Type: int
Default: 4
arena_block_size
Size of one block in arena memory allocation. If <= 0, a proper value is automatically calculated (usually
1/10 of write_buffer_size).
Type: int
Default: 0
disable_auto_compactions
Disable automatic compactions. Manual compactions can still be issued on this database.
Type: bool
Default: False
wal_ttl_seconds, wal_size_limit_mb
The following two fields affect how archived logs will be deleted.
1. If both are set to 0, logs will be deleted ASAP and will not get into the archive.
2. If wal_ttl_seconds is 0 and wal_size_limit_mb is not 0, WAL files will be checked every 10 min and
if the total size is greater than wal_size_limit_mb, they will be deleted starting with the earliest until
size_limit is met. All empty files will be deleted.
3. If wal_ttl_seconds is not 0 and wal_size_limit_mb is 0, then WAL files will be checked every
wal_ttl_seconds / 2 and those that are older than wal_ttl_seconds will be deleted.
4. If both are not 0, WAL files will be checked every 10 min and both checks will be performed, with ttl
being checked first.
Type: int
Default: 0
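The four cases can be summarized as a small decision function. This is only a sketch of the rules listed above, not pyrocksdb code:

```python
def wal_deletion_policy(wal_ttl_seconds, wal_size_limit_mb):
    """Summarize the archived-WAL deletion behaviour for a given setting."""
    if wal_ttl_seconds == 0 and wal_size_limit_mb == 0:
        return "delete ASAP, never archived"
    if wal_ttl_seconds == 0:
        return "size check every 10 min"
    if wal_size_limit_mb == 0:
        return "ttl check every ttl/2 seconds"
    return "both checks every 10 min, ttl first"

assert wal_deletion_policy(0, 0) == "delete ASAP, never archived"
assert wal_deletion_policy(3600, 0) == "ttl check every ttl/2 seconds"
assert wal_deletion_policy(0, 100) == "size check every 10 min"
assert wal_deletion_policy(3600, 100) == "both checks every 10 min, ttl first"
```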
manifest_preallocation_size
Number of bytes to preallocate (via fallocate) for the manifest files. Default is 4 MB, which is reasonable to
reduce random IO as well as to prevent overallocation for mounts that preallocate large amounts of data (such
as xfs’s allocsize option).
Type: int
Default: 4194304
purge_redundant_kvs_while_flush
Purge duplicate/deleted keys when a memtable is flushed to storage.
Type: bool
Default: True
allow_mmap_reads
Allow the OS to mmap file for reading sst tables
Type: bool
Default: True
allow_mmap_writes
Allow the OS to mmap file for writing
Type: bool
Default: False
is_fd_close_on_exec
Prevent child processes from inheriting open files.
Type: bool
Default: True
skip_log_error_on_recovery
Skip log corruption error on recovery (If client is ok with losing most recent changes)
Type: bool
Default: False
stats_dump_period_sec
If not zero, dump rocksdb.stats to LOG every stats_dump_period_sec
Type: int
Default: 3600
advise_random_on_open
If set to True, the underlying file system will be hinted that the file access pattern is random when an sst file is
opened.
Type: bool
Default: True
use_adaptive_mutex
Use an adaptive mutex, which spins in user space before resorting to the kernel. This can reduce context
switches when the mutex is not heavily contended. However, if the mutex is hot, we could end up wasting
spin time.
Type: bool
Default: False
bytes_per_sync
Allows the OS to incrementally sync files to disk while they are being written, asynchronously, in the background.
Issue one request for every bytes_per_sync written. 0 turns it off.
Type: int
Default: 0
compaction_style
The compaction style. Set to "level" for level-style compaction, "universal" for universal-style
compaction, "fifo" for FIFO compaction, and "none" to disable compaction.
Type: string
Default: level
compaction_pri
If compaction_style = "level" (kCompactionStyleLevel), determines which files within each level are
prioritized to be picked for compaction.
compaction_options_universal
Options to use for universal-style compaction. They only make sense if rocksdb.Options.compaction_style
is set to "universal".
It is a dict with the following keys.
• size_ratio: Percentage flexibility while comparing file sizes. If the candidate file(s) size is 1%
smaller than the next file’s size, then include the next file in this candidate set. Default: 1
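As an illustration of the size_ratio check described above (a sketch only; the helper name is made up and this is not pyrocksdb code):

```python
def include_next_file(candidate_size, next_size, size_ratio_pct=1):
    # Include the next file while the candidate set is within
    # size_ratio percent of the next file's size.
    return candidate_size * (100 + size_ratio_pct) >= next_size * 100

assert include_next_file(100, 101) is True    # within 1%: merged into the set
assert include_next_file(100, 150) is False   # too big: candidate set stops
```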
opts = rocksdb.Options()
opts.compaction_options_universal = {'stop_style': 'similar_size'}
max_sequential_skip_in_iterations
An iterator’s Next() call sequentially skips over keys with the same user key unless this option is set. This
number specifies the number of keys (with the same user key) that will be sequentially skipped before a
reseek is issued.
Type: int
Default: 8
memtable_factory
This is a factory that provides MemTableRep objects. Right now you can assign instances of the following
classes:
• rocksdb.VectorMemtableFactory
• rocksdb.SkipListMemtableFactory
• rocksdb.HashSkipListMemtableFactory
• rocksdb.HashLinkListMemtableFactory
Default: rocksdb.SkipListMemtableFactory
table_factory
Factory for the files forming the persistent data storage. Sometimes they are also called SST files. Right
now you can assign instances of the following classes:
• rocksdb.BlockBasedTableFactory
• rocksdb.PlainTableFactory
• rocksdb.TotalOrderPlainTableFactory
Default: rocksdb.BlockBasedTableFactory
inplace_update_support
Allows thread-safe in-place updates. An update happens in place only if
• the key exists in the current memtable
• sizeof(new_value) <= sizeof(old_value)
• the old_value for that key is a put, i.e. kTypeValue
Type: bool
Default: False
inplace_update_num_locks
Type: int
Default: 10000
comparator
Comparator used to define the order of keys in the table. A python comparator must implement the
rocksdb.interfaces.Comparator interface.
Requires: The client must ensure that the comparator supplied here has the same name and orders keys
exactly the same as the comparator provided to previous open calls on the same DB.
Default: rocksdb.BytewiseComparator
merge_operator
The client must provide a merge operator if the Merge operation needs to be used. Calling Merge on a DB
without a merge operator results in rocksdb.errors.NotSupported. The client must ensure
that the merge operator supplied here has the same name and exactly the same semantics as the merge
operator provided to previous open calls on the same DB. The only exception is reserved for upgrades,
where a DB previously without a merge operator is introduced to the Merge operation for the first time. It
is necessary to specify a merge operator when opening the DB in this case.
CompactionPri
class rocksdb.CompactionPri
Defines the supported compaction priorities
kByCompensatedSize
kOldestLargestSeqFirst
kOldestSmallestSeqFirst
kMinOverlappingRatio
CompressionTypes
class rocksdb.CompressionType
Defines the supported compression types
no_compression
snappy_compression
zlib_compression
bzip2_compression
lz4_compression
lz4hc_compression
xpress_compression
zstd_compression
zstdnotfinal_compression
disable_compression
BytewiseComparator
class rocksdb.BytewiseComparator
Wraps the rocksdb Bytewise Comparator, it uses lexicographic byte-wise ordering
BloomFilterPolicy
class rocksdb.BloomFilterPolicy
Wraps the rocksdb BloomFilter Policy
__init__(bits_per_key)
Parameters bits_per_key (int) – Specifies the approximate number of bits per key. A good
value for bits_per_key is 10, which yields a filter with a ~1% false positive rate.
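As a rough sanity check of the ~1% claim: for an optimally tuned bloom filter, the false positive rate is approximately 0.6185 ** bits_per_key (a standard approximation, not part of the pyrocksdb API):

```python
bits_per_key = 10
fp_rate = 0.6185 ** bits_per_key
# Roughly 1% false positives for 10 bits per key.
assert 0.005 < fp_rate < 0.015
```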
LRUCache
class rocksdb.LRUCache
Wraps the rocksdb LRUCache
__init__(capacity, shard_bits=None)
Create a new cache with a fixed size capacity (in bytes). The cache is sharded into 2^shard_bits shards,
by hash of the key. The total capacity is divided and evenly assigned to each shard.
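For example, the sharding arithmetic works out like this (illustration only; the shard_bits value is picked arbitrarily):

```python
capacity = 2 * 1024 ** 3   # 2 GB total cache capacity
shard_bits = 4             # cache split into 2**shard_bits shards

num_shards = 2 ** shard_bits
per_shard_capacity = capacity // num_shards
assert num_shards == 16
assert per_shard_capacity == 128 * 1024 ** 2   # each shard gets 128 MB
```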
TableFactories
Currently RocksDB supports two types of tables: plain table and block-based table. Instances of these classes can
be assigned to rocksdb.Options.table_factory.
• Block-based table: This is the default table type that RocksDB inherited from LevelDB. It was designed for
storing data in hard disk or flash device.
• Plain table: One of RocksDB’s SST file formats, optimized for low query latency on pure-memory or really
low-latency media.
Tutorial of rocksdb table formats is available here: https://github.com/facebook/rocksdb/wiki/
A-Tutorial-of-RocksDB-SST-formats
class rocksdb.BlockBasedTableFactory
Wraps BlockBasedTableFactory of RocksDB.
__init__(index_type='binary_search', hash_index_allow_collision=True, checksum='crc32', ...)
Parameters
• index_type (string) –
– binary_search a space efficient index block that is optimized for binary-search-based
index.
– hash_search the hash index. If enabled, will do hash lookup when Options.prefix_extractor is
provided.
class rocksdb.PlainTableFactory
Plain Table with prefix-only seek. It wraps rocksdb PlainTableFactory.
For this factory, you need to set rocksdb.Options.prefix_extractor properly to make it work.
Look-up will start with prefix hash lookup for key prefix. Inside the hash bucket found, a binary search is
executed for hash conflicts. Finally, a linear search is used.
__init__(user_key_len=0, bloom_bits_per_key=10, hash_table_ratio=0.75, index_sparseness=10,
huge_page_tlb_size=0, encoding_type='plain', full_scan_mode=False,
store_index_in_file=False)
Parameters
• user_key_len (int) – Plain table has an optimization for fixed-size keys, which can
be specified via user_key_len. Alternatively, you can pass 0 if your keys have variable
lengths.
• bloom_bits_per_key (int) – The number of bits used for the bloom filter per prefix.
You may disable it by passing 0.
• hash_table_ratio (float) – The desired utilization of the hash table used for prefix
hashing. hash_table_ratio = number of prefixes / #buckets in the hash table.
• index_sparseness (int) – Inside each prefix, the number of keys after which one index
record is built, for binary search inside each hash bucket. For encoding type prefix,
the value will be used when writing to determine an interval to rewrite the full key. It will
also be used as a suggestion and satisfied when possible.
• huge_page_tlb_size (int) – If <= 0, allocate hash indexes and blooms from malloc.
Otherwise from huge page TLB. The user needs to reserve huge pages for it to
be allocated, like: sysctl -w vm.nr_hugepages=20. See the linux doc
Documentation/vm/hugetlbpage.txt
• encoding_type (string) – How to encode the keys. The value will determine how
to encode keys when writing to a new SST file. This value will be stored inside the SST
file which will be used when reading from the file, which makes it possible for users to
choose different encoding type when reopening a DB. Files with different encoding types
can co-exist in the same DB and can be read.
– plain: Always write full keys without any special encoding.
– prefix: Find opportunity to write the same prefix once for multiple rows. In
some cases, when a key follows a previous key with the same prefix, instead of
writing out the full key, it just writes out the size of the shared prefix, as well as other
bytes, to save some bytes.
When using this option, the user is required to use the same prefix extractor to make
sure the same prefix will be extracted from the same key. The Name() value of the
prefix extractor will be stored in the file. When reopening the file, the name of the
options.prefix_extractor given will be bitwise compared to the prefix extractors stored
in the file. An error will be returned if the two don’t match.
• full_scan_mode (bool) – Mode for reading the whole file record by record without
using the index.
• store_index_in_file (bool) – Compute plain table index and bloom filter during
file building and store it in file. When reading file, index will be mmaped instead of
recomputation.
MemtableFactories
RocksDB has different classes to represent the in-memory buffer for the current operations. You have to assign
instances of the following classes to rocksdb.Options.memtable_factory. This page has a comparison of the
most popular ones: https://github.com/facebook/rocksdb/wiki/Hash-based-memtable-implementations
class rocksdb.VectorMemtableFactory
This creates MemTableReps that are backed by an std::vector. On iteration, the vector is sorted. This is useful
for workloads where iteration is very rare and writes are generally not issued after reads begin.
__init__(count=0)
Parameters count (int) – Passed to the constructor of the underlying std::vector of each
VectorRep. On initialization, the underlying array will be at least count bytes reserved for
usage.
class rocksdb.SkipListMemtableFactory
This uses a skip list to store keys.
__init__()
class rocksdb.HashSkipListMemtableFactory
This class contains a fixed array of buckets, each pointing to a skiplist (null if the bucket is empty).
__init__(bucket_count=50000)
Parameters bucket_count (int) – number of fixed array buckets
Database object
class rocksdb.DB

put(key, value, sync=False, disable_wal=False)
Set the database entry for “key” to “value”.
Parameters
• key (bytes) – Name for this entry
• value (bytes) – Data for this entry
• sync (bool) – If True, the write will be flushed from the operating system buffer cache
before the write is considered complete. If this flag is True, writes will be slower.
If this flag is False, and the machine crashes, some recent writes may be lost. Note
that if it is just the process that crashes (i.e., the machine does not reboot), no writes
will be lost even if sync == False.
In other words, a DB write with sync == False has similar crash semantics as the
“write()” system call. A DB write with sync == True has similar crash semantics
to a “write()” system call followed by “fdatasync()”.
• disable_wal (bool) – If True, writes will not first go to the write-ahead log, and
the write may get lost after a crash.
delete(key, sync=False, disable_wal=False)
Remove the database entry for “key”.
Parameters
• key (bytes) – Name to delete
• sync – See rocksdb.DB.put()
• disable_wal – See rocksdb.DB.put()
Raises rocksdb.errors.NotFound – If the key did not exist
merge(key, value, sync=False, disable_wal=False)
Merge the database entry for “key” with “value”. The semantics of this operation is determined by the
user provided merge_operator when opening DB.
See rocksdb.DB.put() for the parameters
Raises rocksdb.errors.NotSupported – If this is called and no
rocksdb.Options.merge_operator was set at creation
write(batch, sync=False, disable_wal=False)
Apply the specified updates to the database.
Parameters
• batch (rocksdb.WriteBatch) – Batch to apply
• sync – See rocksdb.DB.put()
• disable_wal – See rocksdb.DB.put()
get(key, verify_checksums=False, fill_cache=True, snapshot=None, read_tier='all')
Parameters
• key (bytes) – Name to get
• verify_checksums (bool) – If True, all data read from underlying storage will
be verified against corresponding checksums.
• fill_cache (bool) – Should the “data block”, “index block” or “filter block” read
for this iteration be cached in memory? Callers may wish to set this field to False
for bulk scans.
• snapshot (rocksdb.Snapshot) – If not None, read as of the supplied snapshot
(which must belong to the DB that is being read and which must not have been
released). If it is None, an implicit snapshot of the state at the beginning of this read
operation is used.
• read_tier (string) – Specify if this read request should process data that AL-
READY resides on a particular cache. If the required data is not found at the specified
cache, then rocksdb.errors.Incomplete is raised.
Returns None if not found, else the value for this key
multi_get(keys, verify_checksums=False, fill_cache=True, snapshot=None, read_tier='all')
Parameters keys (list of bytes) – Keys to fetch
For the other params see rocksdb.DB.get()
Returns A dict where the value is either bytes or None if not found
Raises an error if the fetch for a single key fails
Note: keys will not be “de-duplicated”. Duplicate keys will return duplicate values in order.
compact_range(begin=None, end=None, **options)
Compact the underlying storage for the key range [begin, end]. begin == None is treated as a
key before all keys in the database. end == None is treated as a key after all keys in the database.
Therefore the following call will compact the entire database: db.compact_range().
Note that after the entire database is compacted, all data are pushed down to the last level containing any
data. If the total data size after compaction is reduced, that level might not be appropriate for hosting all
the files. In this case, client could set change_level to True, to move the files back to the minimum level
capable of holding the data set or a given level (specified by non-negative target_level).
Parameters
• begin (bytes) – Key where to start compaction. If None start at the beginning of
the database.
• end (bytes) – Key where to end compaction. If None end at the last key of the
database.
• change_level (bool) – If True, compacted files will be moved to the minimum
level capable of holding the data or the given level (specified by a non-negative
target_level). If False you may end up with a larger level than configured. Default is
False.
• target_level (int) – If change_level is True and target_level has a non-negative
value, compacted files will be moved to target_level. Default is -1.
• bottommost_level_compaction (string) – For level based compaction, we
can configure if we want to skip/force bottommost level compaction. By default level
based compaction will only compact the bottommost level if there is a compaction
filter. It can be set to the following values.
skip Skip bottommost level compaction
if_compaction_filter Only compact bottommost level if there is a com-
paction filter. This is the default.
force Always compact bottommost level
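Putting the parameters together, a compaction call might look like this (a sketch assuming the python-rocksdb package is installed; "example.db" is a placeholder path):

```python
import rocksdb  # third-party: the python-rocksdb package

db = rocksdb.DB("example.db", rocksdb.Options(create_if_missing=True))

# Compact the whole database and move the result down to the minimum
# level capable of holding it.
db.compact_range(change_level=True)

# Compact only keys from b"a" up to b"m", forcing bottommost-level
# compaction even when no compaction filter is configured.
db.compact_range(begin=b"a", end=b"m", bottommost_level_compaction="force")
```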
options
Returns the associated rocksdb.Options instance.
Note: Changes to this object have no effect anymore. Consider it read-only.
Iterator
class rocksdb.BaseIterator
Base class for all iterators in this module. After creation an iterator is invalid. Call one of the seek methods
first before starting iteration.
seek_to_first()
Position at the first key in the source
seek_to_last()
Position at the last key in the source
seek(key)
Parameters key (bytes) – Position at the first key in the source that is at or past this key
Methods to support the python iterator protocol
__iter__()
__next__()
__reversed__()
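A sketch of the iteration protocol (assuming the python-rocksdb package is installed; "example.db" is a placeholder path):

```python
import rocksdb  # third-party: the python-rocksdb package

db = rocksdb.DB("example.db", rocksdb.Options(create_if_missing=True))
db.put(b"key1", b"one")
db.put(b"key2", b"two")

it = db.iterkeys()
it.seek_to_first()      # an iterator is invalid until one of the seeks
print(list(it))         # all keys, in comparator order

it = db.iteritems()
it.seek(b"key2")        # first key at or past b"key2"
for key, value in it:
    print(key, value)

it = db.iterkeys()
it.seek_to_last()
print(list(reversed(it)))   # walk backwards via __reversed__()
```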
Snapshot
class rocksdb.Snapshot
Opaque handle for a single Snapshot. The Snapshot is released if nobody holds a reference on it. Retrieved via
rocksdb.DB.snapshot()
WriteBatch
class rocksdb.WriteBatch
WriteBatch holds a collection of updates to apply atomically to a DB.
The updates are applied in the order in which they are added to the WriteBatch. For example, the
value of “key” will be “v3” after the following batch is written:
batch = rocksdb.WriteBatch()
batch.put(b"key", b"v1")
batch.delete(b"key")
batch.put(b"key", b"v2")
batch.put(b"key", b"v3")
__init__(data=None)
Creates a WriteBatch.
Parameters data (bytes) – A serialized version of a previous WriteBatch, as retrieved
from a previous .data() call. If None, an empty WriteBatch is created.
put(key, value)
Store the mapping “key->value” in the database.
Parameters
• key (bytes) – Name of the entry to store
• value (bytes) – Data of this entry
merge(key, value)
Merge “value” with the existing value of “key” in the database.
Parameters
• key (bytes) – Name of the entry to merge
• value (bytes) – Data to merge
delete(key)
If the database contains a mapping for “key”, erase it. Else do nothing.
Parameters key (bytes) – Key to erase
clear()
Clear all updates buffered in this batch.
Note: Don’t call this method if there is an outstanding iterator. Calling rocksdb.WriteBatch.
clear() with an outstanding iterator leads to a SEGFAULT.
data()
Retrieve the serialized version of this batch.
Return type bytes
count()
Returns the number of updates in the batch
Return type int
__iter__()
Returns an iterator over the current contents of the write batch.
If you add new items to the batch, they are not visible to this iterator. Create a new one if you need to see
them.
Note: Calling rocksdb.WriteBatch.clear() on the write batch invalidates the iterator. Using an
iterator whose corresponding write batch has been cleared leads to a SEGFAULT.
WriteBatchIterator
class rocksdb.WriteBatchIterator
__iter__()
Returns self.
__next__()
Returns the next item inside the corresponding write batch. The return value is always a tuple of size
three.
First item (Name of the operation):
• "Put"
• "Merge"
• "Delete"
Repair DB
repair_db(db_name, opts)
Parameters
• db_name (unicode) – Name of the database to open
• opts (rocksdb.Options) – Options for this specific database
If a DB cannot be opened, you may attempt to call this method to resurrect as much of the contents of the
database as possible. Some data may be lost, so be careful when calling this function on a database that contains
important information.
Errors
exception rocksdb.errors.NotFound
exception rocksdb.errors.Corruption
exception rocksdb.errors.NotSupported
exception rocksdb.errors.InvalidArgument
exception rocksdb.errors.RocksIOError
exception rocksdb.errors.MergeInProgress
exception rocksdb.errors.Incomplete
1.3.3 Interfaces
Comparator
class rocksdb.interfaces.Comparator
A Comparator object provides a total order across slices that are used as keys in an sstable or a database.
A Comparator implementation must be thread-safe since rocksdb may invoke its methods concurrently from
multiple threads.
compare(a, b)
Three-way comparison.
Parameters
• a (bytes) – First field to compare
• b (bytes) – Second field to compare
Returns
• -1 if a < b
• 0 if a == b
• 1 if a > b
Return type int
name()
The name of the comparator. Used to check for comparator mismatches (i.e., a DB created with one
comparator is accessed using a different comparator).
The client of this package should switch to a new name whenever the comparator implementation changes
in a way that will cause the relative ordering of any two keys to change.
Names starting with “rocksdb.” are reserved and should not be used by any clients of this package.
Return type bytes
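A minimal comparator sketch. It is written as a plain class so the snippet runs standalone; ReverseComparator is a made-up name, and in practice the instance would be passed via rocksdb.Options(comparator=...):

```python
class ReverseComparator:
    """Orders keys in reverse lexicographic byte order."""

    def compare(self, a, b):
        # Invert the natural byte-wise ordering of the two keys.
        if a > b:
            return -1
        if a < b:
            return 1
        return 0

    def name(self):
        # Must change whenever the ordering changes; names starting
        # with "rocksdb." are reserved.
        return b"example.ReverseComparator"
```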
Merge Operator
Essentially, a MergeOperator specifies the SEMANTICS of a merge, which only the client knows. It could
be numeric addition, list append, string concatenation, editing a data structure, whatever. The library, on
the other hand, is concerned with exercising this interface at the right time (during get, iteration,
compaction, . . . )
To use merge, the client needs to provide an object implementing one of the following interfaces:
• AssociativeMergeOperator – for most simple semantics (always take two values and merge them
into one value, which is then put back into rocksdb). Numeric addition and string concatenation are
examples.
• MergeOperator - the generic class for all the more complex operations. One method (FullMerge) to
merge a Put/Delete value with a merge operand. Another method (PartialMerge) that merges two
operands together. This is especially useful if your key values have a complex structure but you
would still like to support client-specific incremental updates.
AssociativeMergeOperator is simpler to implement. MergeOperator is simply more powerful.
See this page for more details https://github.com/facebook/rocksdb/wiki/Merge-Operator
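As an illustration, an associative merge operator implementing 64-bit counter addition might look like this. The class is written standalone here (Uint64AddOperator is a made-up name); to use it, pass an instance as rocksdb.Options(merge_operator=...) when opening the DB:

```python
import struct

class Uint64AddOperator:
    """Associative merge: treat values as little-endian uint64 counters."""

    def merge(self, key, existing_value, value):
        # existing_value is None when nothing is stored under key yet.
        if existing_value is None:
            return (True, value)
        current = struct.unpack("<Q", existing_value)[0]
        delta = struct.unpack("<Q", value)[0]
        return (True, struct.pack("<Q", current + delta))

    def name(self):
        return b"example.Uint64AddOperator"
```

With such an operator set, db.merge(key, struct.pack("<Q", 1)) can increment a counter without a read-modify-write round trip.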
AssociativeMergeOperator
class rocksdb.interfaces.AssociativeMergeOperator
MergeOperator
class rocksdb.interfaces.MergeOperator
Note: Presently there is no way to differentiate between error/corruption and simply “return false”. For
now, the client should simply return false in any case it cannot perform partial-merge, regardless of reason.
If there is corruption in the data, handle it in the FullMerge() function, and return false there.
name()
The name of the MergeOperator. Used to check for MergeOperator mismatches (i.e., a DB
created with one MergeOperator is accessed using a different MergeOperator).
Return type bytes
FilterPolicy
class rocksdb.interfaces.FilterPolicy
create_filter(keys)
Create a bytestring which can act as a filter for keys.
Parameters keys (list of bytes) – list of keys (potentially with duplicates) that are
ordered according to the user supplied comparator.
Returns A filter that summarizes keys
Return type bytes
key_may_match(key, filter)
Check if the key is maybe in the filter.
Parameters
• key (bytes) – Key for a single entry inside the database
• filter (bytes) – Contains the data returned by a preceding call to create_filter on
this class
Returns This method must return True if the key was in the list of keys passed to cre-
ate_filter(). This method may return True or False if the key was not on the list, but it
should aim to return False with a high probability.
Return type bool
name()
Return the name of this policy. Note that if the filter encoding changes in an incompatible way, the name
returned by this method must be changed. Otherwise, old incompatible filters may be passed to methods
of this type.
Return type bytes
SliceTransform
class rocksdb.interfaces.SliceTransform
SliceTransform is currently used to implement the ‘prefix-API’ of rocksdb. https://github.com/facebook/
rocksdb/wiki/Proposal-for-prefix-API
transform(src)
Parameters src (bytes) – Full key to extract the prefix from.
Returns A tuple of two integers (offset, size), where the first integer is the offset
within src and the second is the size of the prefix after the offset. This means the
prefix is generated by src[offset:offset+size]
Return type (int, int)
in_domain(src)
Decide if a prefix can be extracted from src. transform() will only be called if this method
returns True.
Parameters src (bytes) – Full key to check.
Return type bool
in_range(prefix)
Checks if prefix is a valid prefix
Parameters prefix (bytes) – Prefix to check.
Returns True if prefix is a valid prefix.
Return type bool
name()
Return the name of this transformation.
Return type bytes
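A fixed-length prefix extractor is the canonical example. It is written as a plain class so the snippet runs standalone (StaticPrefix is a made-up name); in practice the instance is set via rocksdb.Options(prefix_extractor=...):

```python
class StaticPrefix:
    """Extracts a fixed five-byte prefix from each key."""

    PREFIX_LEN = 5

    def name(self):
        return b"example.StaticPrefix"

    def transform(self, src):
        # The prefix is src[0:5], expressed as an (offset, size) tuple.
        return (0, self.PREFIX_LEN)

    def in_domain(self, src):
        # Only keys long enough to contain a full prefix qualify.
        return len(src) >= self.PREFIX_LEN

    def in_range(self, prefix):
        return len(prefix) == self.PREFIX_LEN
```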
BackupEngine
class rocksdb.BackupEngine
__init__(backup_dir)
Creates an object to manage backups of a single database.
Parameters backup_dir (unicode) – Where to keep the backup files. Has to be different
than db.db_name. For example db.db_name + ‘/backups’.
create_backup(db, flush_before_backup=False)
Triggers the creation of a backup.
Parameters
• db (rocksdb.DB) – Database object to backup.
• flush_before_backup (bool) – If True the current memtable is flushed.
restore_backup(backup_id, db_dir, wal_dir)
Restores the backup from the given id.
Parameters
• backup_id (int) – id of the backup to restore.
• db_dir (unicode) – Target directory to restore backup.
• wal_dir (unicode) – Target directory to restore the backed-up WAL files.
restore_latest_backup(db_dir, wal_dir)
Restores the latest backup.
Parameters
• db_dir (unicode) – see restore_backup()
• wal_dir (unicode) – see restore_backup()
stop_backup()
Can be called from another thread to stop the current backup process.
purge_old_backups(num_backups_to_keep)
Deletes all backups (oldest first) until “num_backups_to_keep” are left.
Parameters num_backups_to_keep (int) – Number of backup files to keep.
delete_backup(backup_id)
Parameters backup_id (int) – Delete the backup with the given id.
get_backup_info()
Returns information about all backups.
It returns a list of dicts where each dict has the following keys:
backup_id (int): id of this backup.
timestamp (int): Seconds since epoch, when the backup was created.
size (int): Size in bytes of the backup.
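A typical backup workflow might be sketched like this (assuming the python-rocksdb package is installed; paths are placeholders):

```python
import rocksdb  # third-party: the python-rocksdb package

db = rocksdb.DB("example.db", rocksdb.Options(create_if_missing=True))
db.put(b"a", b"1")

# The backup directory must differ from the database directory.
backup = rocksdb.BackupEngine("example-backups")
backup.create_backup(db, flush_before_backup=True)

# Inspect all existing backups.
for info in backup.get_backup_info():
    print(info["backup_id"], info["timestamp"], info["size"])

# Keep only the two most recent backups.
backup.purge_old_backups(2)

# To restore, drop the DB object first so the database is closed,
# then restore into its directory.
del db
backup.restore_latest_backup("example.db", "example.db")
```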
1.4 Changelog
Prefix Seeks:
According to this page https://github.com/facebook/rocksdb/wiki/Prefix-Seek-API-Changes, all the prefix-related
parameters on ReadOptions are removed. RocksDB now detects whether Options.prefix_extractor is set and
then uses prefix seeks automatically. This means the following changes in pyrocksdb:
• DB.iterkeys, DB.itervalues, DB.iteritems have no prefix parameter anymore.
• DB.get, DB.multi_get, DB.key_may_exist, DB.iterkeys, DB.itervalues, DB.iteritems have no prefix_seek
parameter anymore.
This means all iterators now always walk to the end of the database. So if you need to stay within a prefix, write
your own code to ensure that. For DB.iterkeys and DB.iteritems, itertools.takewhile is a possible solution:
it = db.iterkeys()
it.seek(b'00002')
print(list(takewhile(lambda key: key.startswith(b'00002'), it)))

it = db.iteritems()
it.seek(b'00002')
print(dict(takewhile(lambda item: item[0].startswith(b'00002'), it)))
New:
This version works with RocksDB version 2.8.fb. You now have access to the more advanced options of RocksDB,
like changing the memtable or SST representation. It is also now possible to enable Universal Style Compaction.
• Fixed issue 3, which addressed the change of prefix_extractor from a raw pointer to a smart pointer.
• Support the new rocksdb.Options.verify_checksums_in_compaction option.
• Add rocksdb.Options.table_factory option, so you can use the new ‘PlainTableFactories’, which
are optimized for in-memory databases.
– https://github.com/facebook/rocksdb/wiki/PlainTable-Format
– https://github.com/facebook/rocksdb/wiki/How-to-persist-in-memory-RocksDB-database%3F
• Add rocksdb.Options.memtable_factory option.
• Add options rocksdb.Options.compaction_style and rocksdb.Options.
compaction_options_universal to change the compaction style.
• Update documentation to the new default values
– allow_mmap_reads=true
– allow_mmap_writes=false
– max_background_flushes=1
– max_open_files=5000
– paranoid_checks=true
– disable_seek_compaction=true
– level0_stop_writes_trigger=24
– level0_slowdown_writes_trigger=20
• Document new property names for rocksdb.DB.get_property().
CHAPTER
TWO
CONTRIBUTING
Source can be found on GitHub. Feel free to fork and send pull requests or create issues on the GitHub issue tracker.
CHAPTER
THREE
ROADMAP/TODO