Lecture 4 LSBM Tree
Xiaodong Zhang
The Ohio State University, USA
• Relational tables
– Tables must be partitioned/placed among many nodes, e.g. Apache Hive
– How to minimize data transfers among nodes and from local disks?
• Key-value store
– Key indexing becomes a bottleneck as the number of concurrent requests increases
– How to accelerate data accesses for an in-memory key-value store?
Fast Accesses to Sequentially Archived Data in Both Memory and Disks
What is LSM-tree?
It is a Log-structured merge-tree (1996):
• Multiple levels of sorted data (e.g. each organized as a B+ tree)
• Level sizes increase exponentially, forming a “pyramid”
• The smallest level is in memory and the rest on disk
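The three properties above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a production design: the class name TinyLSM, the parameters c0_capacity and growth, and the use of sorted lists in place of on-disk B+ trees are all hypothetical choices made for illustration.

```python
import bisect


class TinyLSM:
    """Toy LSM-tree: C0 is an in-memory dict; C1..Cn are sorted lists
    standing in for on-disk levels (assumption: no real disk I/O)."""

    def __init__(self, c0_capacity=4, growth=10, num_disk_levels=2):
        self.c0 = {}
        # Capacity grows exponentially with depth: C0, C1, C2, ...
        self.capacity = [c0_capacity * growth ** i
                         for i in range(num_disk_levels + 1)]
        # levels[i] models C(i+1) as a key-sorted list of (key, value).
        self.levels = [[] for _ in range(num_disk_levels)]

    def put(self, key, value):
        self.c0[key] = value
        if len(self.c0) > self.capacity[0]:
            self._merge_down(sorted(self.c0.items()), 0)
            self.c0 = {}

    def _merge_down(self, run, i):
        merged = dict(self.levels[i])
        merged.update(run)  # entries from the upper level are newer and win
        self.levels[i] = sorted(merged.items())
        # Spill into the next level when this one exceeds its capacity.
        if i + 1 < len(self.levels) and len(self.levels[i]) > self.capacity[i + 1]:
            overflow, self.levels[i] = self.levels[i], []
            self._merge_down(overflow, i + 1)

    def get(self, key):
        if key in self.c0:  # the newest data lives in memory
            return self.c0[key]
        for level in self.levels:  # then search level by level
            j = bisect.bisect_left(level, (key,))
            if j < len(level) and level[j][0] == key:
                return level[j][1]
        return None
```

A lookup checks C0 first and then each deeper level, so the newest version of a key is always found before any older one.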
[Figure: the LSM-tree “pyramid”, with C0 in DRAM and C1, C2 on HDD]
LSM-tree is widely used in production systems
RocksDB
Why Log-structured merge-tree (LSM-tree)?
• B+ tree
– In-place update
– Random I/Os
– Low write throughput
• LSM-tree
– Log-structured update
– Merge/compaction for sorting
– Sequential I/Os (batched I/O operations)
– High write throughput
[Figure: inserting key 2.5 into a B+ tree requires a seek; in an LSM-tree, writes go to DRAM and are merged into HDD]
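The merge/compaction step can be illustrated with a short Python sketch. It assumes each run is a key-sorted list of (key, value) pairs with unique keys within a run; the function name compact is hypothetical. Because both inputs are read front to back and the output is produced in order, the step maps onto sequential, batched I/O.

```python
import heapq


def compact(newer, older):
    """Merge two key-sorted runs of (key, value) pairs into one sorted run.
    When both runs hold the same key, the entry from `newer` wins."""
    # Tag entries with an age so that, for equal keys, the newer one sorts first.
    tagged_new = ((k, 0, v) for k, v in newer)
    tagged_old = ((k, 1, v) for k, v in older)
    out = []
    for key, _age, value in heapq.merge(tagged_new, tagged_old):
        if not out or out[-1][0] != key:  # skip the stale duplicate
            out.append((key, value))
    return out
```

For example, compact([(2, "b2"), (5, "e")], [(1, "a"), (2, "b1"), (9, "z")]) keeps "b2" for key 2 and drops the older "b1".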
[Figure: plot with phases labeled “effective caching”, “warming up”, and “disabled caching”]
*https://www.datastax.com/dev/blog/compaction-improvements-in-cassandra-21
Basic Functions of Buffer Cache
[Figure: blocks a, b, c on disk and their copies in the buffer cache]
Buffer Cache in LSM-tree
• Queries are conducted level by level
• In each level, an index is maintained in memory to map keys to disk blocks
• The index is checked to locate the disk block for the key/key range
• The found disk block(s) will be loaded into the buffer cache and served from there
[Figure: a “get a” request descends C0 (DRAM), C1, and C2 (HDD); the index structure (root a-f, children a-c and d-f) maps keys to single-page or multi-page blocks on disk, which are loaded into the buffer cache]
Buffer Cache in LSM-tree
• Any future request on the same key/key range still needs to go through the index
• The found disk block(s) can be served from the buffer cache
[Figure: a repeated “get a” request passes through the index structure before the block is served from the buffer cache]
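The read path described on the two slides above can be sketched as follows. This is a minimal illustration, not the code of any real system: the names Level, BufferCache, and get, the block ids, and the dict-based "disk" are all hypothetical. The point it shows is that every request first goes through the per-level index to obtain a block id, and only then is the buffer cache consulted for that block.

```python
import bisect


class Level:
    """One on-disk level: blocks of sorted keys plus an in-memory index
    holding the first key of each block."""

    def __init__(self, blocks):
        # blocks: {block_id: {key: value}}; ids must be unique across levels.
        self.disk = blocks
        ordered = sorted(blocks.items(), key=lambda kv: min(kv[1]))
        self.first_keys = [min(block) for _, block in ordered]
        self.block_ids = [block_id for block_id, _ in ordered]

    def locate(self, key):
        """Index lookup: id of the block whose key range may contain `key`."""
        i = bisect.bisect_right(self.first_keys, key) - 1
        return self.block_ids[i] if i >= 0 else None


class BufferCache:
    """Caches disk blocks by block id and counts hits and misses."""

    def __init__(self):
        self.blocks = {}
        self.hits = 0
        self.misses = 0

    def read(self, level, block_id):
        if block_id in self.blocks:
            self.hits += 1
        else:
            self.misses += 1  # simulated disk read
            self.blocks[block_id] = level.disk[block_id]
        return self.blocks[block_id]


def get(key, levels, cache):
    """Query level by level: index first, then the buffer cache."""
    for level in levels:
        block_id = level.locate(key)
        if block_id is None:
            continue
        block = cache.read(level, block_id)
        if key in block:
            return block[key]
    return None
```

Since the cache is keyed by block id, a cached block is only useful while the index still points at that block.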
Key-Value store: No address mapping to Disk
[Figure: a hash function maps keys a, b, d to slots that point to a_value, b_value, d_value]
[Figure: writes enter C0 in DRAM; merges move data into C1 and C2 on HDD]
[Figure: underlying LSM-tree (C1, C2) beside the compaction buffer (B1, B2); merges append data to the compaction buffer; label: “Read & write”]
• Only frequently visited data are kept in the compaction buffer (dynamically alive)
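The idea of a compaction buffer that keeps only frequently visited data can be sketched as below. This is a loose illustration under explicit assumptions, not the LSbM-tree algorithm itself: the class name CompactionBuffer, the per-table hit counter, the min_hits threshold, and the trim step are all hypothetical stand-ins for whatever policy the real system uses.

```python
class CompactionBuffer:
    """Toy compaction buffer: holds sorted tables that a merge has retired
    from the underlying tree, and keeps them only while they stay hot."""

    def __init__(self, min_hits=2):
        self.tables = []          # appended in merge order, not re-sorted
        self.min_hits = min_hits  # hypothetical hotness threshold

    def append(self, table):
        """Record a table (dict of key -> value) retired by a merge.
        The table is referenced, not copied or rewritten."""
        self.tables.append({"data": table, "hits": 0})

    def get(self, key):
        for entry in reversed(self.tables):  # newest table first
            if key in entry["data"]:
                entry["hits"] += 1
                return entry["data"][key]
        return None

    def trim(self):
        """Drop tables that were visited fewer than min_hits times since
        the last trim, then reset the counters."""
        self.tables = [e for e in self.tables if e["hits"] >= self.min_hits]
        for entry in self.tables:
            entry["hits"] = 0
```

In this sketch, a table that keeps being read survives each trim, while a table that is rarely read is dropped from the buffer.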
Why LSbM-tree is effective?
Underlying LSM-tree
• Contains the entire dataset
• Fully sorted at each level
• Efficient for on-disk range queries
• Updated frequently by merges
• LSM-tree induced buffer cache misses are high
Compaction buffer
• Attempts to keep cached data
• Not fully sorted at each level
• Not used for on-disk range queries
• Not updated frequently
• LSM-tree induced buffer cache misses are minimized
[Figure: query flowchart with outcomes “Use the block found in Ci” and “Use the block found in Bi”]
Dataset and Workloads
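[Figure: a 20GB dataset split into a 3GB hot range receiving 98% of reads and a 17GB remainder receiving 2% of reads; 100% of writes are spread over the whole dataset]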
• Dataset
– 20GB unique data
• Write workload
– 100% writes uniformly distributed on the entire dataset
• Read workload (RangeHot workload)
– 98% reads uniformly distributed on a 3GB hot range
– 2% reads uniformly distributed on the rest of the data range
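The workload above can be sketched as a small generator. The slides give only the sizes and percentages, so several details here are assumptions: requests are modeled as byte offsets into the dataset, the hot range is contiguous, and its position (hot_start) defaults to the beginning of the dataset. The function names are hypothetical.

```python
import random

GB = 1 << 30
DATASET_BYTES = 20 * GB  # 20GB of unique data
HOT_BYTES = 3 * GB       # 3GB hot range


def next_read_offset(rng, hot_start=0):
    """RangeHot read: 98% uniform over the hot range, 2% uniform over the rest."""
    if rng.random() < 0.98:
        return hot_start + rng.randrange(HOT_BYTES)
    # Pick a position in the 17GB remainder, then skip over the hot range.
    cold = rng.randrange(DATASET_BYTES - HOT_BYTES)
    return cold if cold < hot_start else cold + HOT_BYTES


def next_write_offset(rng):
    """Write: uniform over the entire dataset."""
    return rng.randrange(DATASET_BYTES)
```

Passing a seeded random.Random instance as rng makes a run reproducible.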
LSM-tree induced cache invalidation
• Test on LSM-tree
• Writes
– Fixed write throughput: 1000 writes per second
• Reads
– RangeHot workload
Ineffectiveness of the Dedicated-server solution
• Test on LSbM-tree
• Writes
– Fixed write throughput: 1000 writes per second
• Reads
– RangeHot workload
Random access performance
Range query performance
Conclusion
• LSM-tree is a widely used storage data structure in
production systems to maximize the write throughput
• LSM-tree induced cache misses happen for workloads of
mixed reads and writes
• Several solutions have been proposed to address this
problem, but they raise other issues
• LSbM-tree is an effective solution that retains all the merits
of LSM-tree while re-enabling buffer caching
• LSbM-tree is being implemented and tested in
production systems, e.g. Cassandra and LevelDB
Thank you
dbsize
Zipfian workload
20GB
100% read/writes
Zipfian workload on bLSM
Zipfian workload on LSbM