Lecture 4: LSbM-Tree

The document discusses the LSbM-tree, a data storage structure designed to optimize both read and write operations for big data applications. It highlights the challenges faced by traditional storage systems like LSM-trees and B+-trees, and introduces the LSbM-tree's innovative use of a compaction buffer to enhance cache performance and reduce invalidations. Experimental results demonstrate that LSbM-tree outperforms existing solutions in terms of random access and range query performance.


LSbM-tree: A Big Data Storage Structure That Performs Well for Both Reads and Writes

Xiaodong Zhang
The Ohio State University, USA

1
2

Major Data Formats in Storage Systems


• Sequentially archived data
– Indexed data, e.g. sorted data by a defined key, …
– Read/write largely by B+-tree and LSM-tree
• Relational tables
– Structured data formats for relational databases, e.g. MySQL
– Read/write operations by relational algebra/calculus
• Key-value store
– A pair of key/value for a data item, e.g. redis, memcached
– Read/write: request -> index -> fetching data
• Graph-databases
• Free-style text files
– A file may be retrieved by KV-store, indexed directory, …
3

New Challenges to Access Performance in Big Data

• Sequentially archived data


– Can we massively process both reads and writes concurrently?
– But LSM-tree favors writes while B+-tree favors reads

• Relational tables
– Tables must be partitioned/placed among many nodes, e.g. Apache Hive
– How to minimize data transfers among nodes and from local disks?

• Key-value store
– Key indexing becomes a bottleneck as # concurrent requests increase
– How to accelerate data accesses for in-memory key-value store?
Fast Accesses to Sequentially Archived Data
in both memory and disks

4
What is LSM-tree?
It is a Log-structured merge-tree (1996):
• Multiple levels of sorted data (e.g., each level organized as a B+-tree)
• Each level grows exponentially in size, forming a “pyramid”
• The smallest level is in memory and the rest on disk
[Figure: LSM-tree levels, with C0 in DRAM and C1, C2, ... on HDD, each level larger than the one above]
5
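The layout above can be made concrete with a small sketch. The following is a minimal, hypothetical Python illustration (not the paper's or any production system's code) of a multi-level LSM-tree: a tiny in-memory C0, on-disk levels that grow by an assumed size ratio, and merges that cascade downward. The class name, capacity, and ratio are illustrative assumptions.

```python
import bisect

C0_CAPACITY = 4     # tiny in-memory level, for illustration only
SIZE_RATIO = 10     # assumed growth ratio between adjacent levels

class ToyLSMTree:
    """Toy LSM-tree: one sorted run per level, levels grow exponentially."""

    def __init__(self):
        self.c0 = {}        # C0 kept in DRAM
        self.levels = []    # C1, C2, ...: each a sorted list of (key, value) pairs

    def put(self, key, value):
        self.c0[key] = value
        if len(self.c0) >= C0_CAPACITY:          # C0 full: merge it down into C1
            self._merge_down(sorted(self.c0.items()), 0)
            self.c0 = {}

    def _merge_down(self, run, i):
        # Merge `run` into level i; if level i overflows, cascade into level i+1.
        if i == len(self.levels):
            self.levels.append([])
        merged = dict(self.levels[i])
        merged.update(dict(run))                 # newer entries overwrite older ones
        self.levels[i] = sorted(merged.items())
        if len(self.levels[i]) > C0_CAPACITY * SIZE_RATIO ** (i + 1):
            self._merge_down(self.levels[i], i + 1)
            self.levels[i] = []

    def get(self, key):
        if key in self.c0:                       # newest data first
            return self.c0[key]
        for level in self.levels:                # then C1, C2, ... level by level
            keys = [k for k, _ in level]
            j = bisect.bisect_left(keys, key)
            if j < len(keys) and keys[j] == key:
                return level[j][1]
        return None
```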
LSM-tree is widely used in production systems

RocksDB

6
Why Log-structured merge-tree (LSM-tree)?
• B+-tree
  – In-place updates (e.g., inserting key 2.5 requires a seek to the target leaf)
  – Random I/Os
  – Low write throughput
• LSM-tree
  – Log-structured updates
  – Merge/compaction for sorting
  – Sequential, batched I/O operations
  – High write throughput

[Figure: inserting key 2.5, a B+-tree seeks to and updates a leaf in place on disk, while an LSM-tree logs the write in DRAM and later merges it to HDD sequentially]
7
8

SQLite Experiences of LSM-tree by Richard Hipp


A buffer cache problem reported by Cassandra in 2014

[Figure: read throughput over time, annotated with phases of effective caching, warming up, and disabled caching]

*https://www.datastax.com/dev/blog/compaction-improvements-in-cassandra-21
9
Basic Functions of Buffer Cache

• Buffer cache is in DRAM or other fast devices


• Data entries in the buffer cache refer to disk blocks (pages)
• Cache the frequently read disk blocks for reuse

[Figure: buffer cache in DRAM holding copies of disk blocks a, b, and c]

10
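A minimal sketch of the buffer-cache behavior described above, in hypothetical Python: blocks are cached by their disk address and reused on later reads. The `read_block` callable stands in for an actual disk read, and the LRU eviction policy is an assumption for illustration.

```python
from collections import OrderedDict

class BufferCache:
    """Tiny LRU buffer cache keyed by disk block address."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block     # callable: block address -> block contents (disk read)
        self.blocks = OrderedDict()      # block address -> cached block contents

    def get(self, addr):
        if addr in self.blocks:          # cache hit: reuse the in-memory copy
            self.blocks.move_to_end(addr)
            return self.blocks[addr]
        block = self.read_block(addr)    # cache miss: read the block from disk
        self.blocks[addr] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict the least recently used block
        return block

# Example: cache = BufferCache(capacity=100, read_block=my_disk_read); cache.get("blk-a")
```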
Buffer Cache in LSM-tree
[Figure: get(a) on an LSM-tree, with C0 in DRAM and C1, C2 on HDD; an in-memory index structure (root a-f, multi-page blocks a-c and d-f, single-page blocks a through f) maps keys to disk blocks, which are loaded into the buffer cache]

• Queries are conducted level by level
• In each level, an index is maintained in memory to map keys to disk blocks
• The index is checked to locate the disk block for the key/key range
• The found disk block(s) will be loaded into the buffer cache and served from there
11
Buffer Cache in LSM-tree
[Figure: the same get(a) path; the index structure is consulted again, but the block for a is already resident in the buffer cache]

• Any future request on the same key/key range still needs to go through the index
• The found disk block(s) can then be served directly from the buffer cache
12


LSM-tree induced Buffer Cache invalidations
[Figure: writes enter C0 in DRAM while merges rewrite C1 and C2 on disk; the blocks held in the buffer cache no longer match the newly compacted disk blocks]

• The read buffer (buffer cache) and the LSM-tree write buffer (C0) are separate
• Frequent compactions for sorting
  – Referenced disk addresses are changed
  – Cache invalidations => misses (see the sketch below)
13
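A toy illustration (hypothetical, not from the slides) of the invalidation problem: the cache is keyed by block address, and compaction rewrites the same keys into new blocks at new addresses, so previously cached blocks are never referenced again. The block ids are made up.

```python
# Cached entries are keyed by block address; compaction rewrites the same keys into
# brand-new blocks at new addresses, so the old cached addresses stop being referenced.

buffer_cache = {"C1/blk-3": ["a", "c", "e"]}     # a block of C1, cached by its address

def compact(c1_blocks, c2_blocks):
    """Merge-sort C1 into C2 and emit new blocks at new addresses (simplified)."""
    merged = sorted(set(k for blk in c1_blocks + c2_blocks for k in blk))
    return {f"C2/new-blk-{i}": merged[i:i + 3] for i in range(0, len(merged), 3)}

new_c2 = compact([["a", "c", "e"]], [["b", "d", "f"]])
# Reads of key "a" now resolve to a new address such as "C2/new-blk-0", so the entry
# cached under "C1/blk-3" is dead weight and the next read of "a" misses and hits disk.
print("C2/new-blk-0" in buffer_cache)    # False: a miss despite "a" having been cached
```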
Existing representative solutions
• Building a Key-Value store cache
– E.g. row cache in Cassandra and RocksDB, Mega-KV (VLDB 2015)

• Providing a Dedicated Compaction Server


– “Compaction management in distributed key-value data stores”
(VLDB’ 2015)

• Lazy Compaction: e.g., stepped merge


– “Incremental Organization for Data Recording and Warehousing”
(VLDB’ 1997)

15
Key-Value store: No address mapping to Disk
[Figure: an in-memory hash table maps keys directly to cached values (a_value, b_value, d_value), independently of the LSM-tree blocks on disk]

• An independent in-memory hash table is used as a key-value cache (sketched below)
• +: the KV-cache is not affected by disk compaction
• -: the hash table cannot support fast in-memory range queries
16
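A rough sketch of this approach in hypothetical Python: the cache is a plain hash table over user keys, so point lookups survive compaction, but a range query degenerates into a scan of the cache and cannot tell which keys are missing. The keys and values are illustrative.

```python
# An in-memory hash table keyed by user keys, independent of on-disk block addresses,
# so compaction does not invalidate it.
kv_cache = {"a": "a_value", "b": "b_value", "d": "d_value"}

def point_get(key):
    return kv_cache.get(key)             # O(1) point lookup, unaffected by compaction

def range_get(lo, hi):
    # A hash table keeps no key order, so a range query must scan every cached entry,
    # and it still cannot tell which keys in [lo, hi] are missing from the cache.
    return sorted((k, v) for k, v in kv_cache.items() if lo <= k <= hi)

print(point_get("a"))         # fast hit
print(range_get("a", "c"))    # full scan of the cache: slow and possibly incomplete
```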
Dedicated server: prefetching for buffer cache
[Figure: a dedicated compaction server carries out the merges between levels; the buffer cache holds blocks from the LSM-tree levels on disk]

• A dedicated server is used for compactions
• +: after each compaction, old data blocks in the buffer cache will be replaced by newly compacted data
17
Dedicated server: prefetching for buffer cache
[Figure: after the compaction server merges C1 into C2, the newly compacted blocks ab, cd, ef are prefetched into the buffer cache]

• Buffer cache replacement is done by comparing the key ranges of blocks
• -: the prefetching based on compaction data may load unnecessary data
18
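A simplified sketch of this replacement policy (hypothetical Python; the block ids and key ranges are made up): newly compacted blocks whose key ranges overlap any cached block are prefetched, which can pull in data that was never hot.

```python
# After a compaction, every newly written block whose key range overlaps a cached old
# block is prefetched into the buffer cache in place of that old block.

def overlaps(r1, r2):
    return r1[0] <= r2[1] and r2[0] <= r1[1]

def refresh_cache(cached_blocks, new_blocks):
    """Both arguments map a block id to its (min_key, max_key) range."""
    refreshed = {}
    for new_id, new_range in new_blocks.items():
        if any(overlaps(new_range, old_range) for old_range in cached_blocks.values()):
            refreshed[new_id] = new_range      # prefetch the newly compacted block
    return refreshed

cached = {"old-1": ("a", "c")}                 # only keys a..c were actually hot
new = {"new-1": ("a", "f")}                    # compaction produced one wide block
print(refresh_cache(cached, new))              # the whole a..f block gets loaded,
                                               # although only a..c was ever needed
```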
Stepped-merge: slowing down compaction
[Figure: runs flushed from C0 are appended to C1 and C2 instead of being merge-sorted into them, so existing disk blocks are not rewritten]

• The merging is changed to appending (see the sketch below)
• +: compaction-induced cache invalidations are reduced
• -: since each level is not fully sorted, range queries are slow
19
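A toy sketch of the stepped-merge idea in hypothetical Python: runs are appended to a level rather than merge-sorted into it, so existing blocks are never rewritten, but a range query has to consult every run in the level.

```python
# Instead of merge-sorting a level, incoming runs are appended to it, so existing block
# addresses stay stable (few cache invalidations) but the level holds overlapping runs.

level1 = []                          # a level is now a list of appended sorted runs

def append_run(level, run):
    level.append(sorted(run))        # no rewrite of existing runs: addresses unchanged

def range_query(level, lo, hi):
    # Every run may overlap [lo, hi], so all runs must be searched and merged.
    hits = []
    for run in level:
        hits.extend(k for k in run if lo <= k <= hi)
    return sorted(hits)

append_run(level1, ["c", "a", "e"])
append_run(level1, ["b", "d", "f"])
print(range_query(level1, "b", "e"))   # touches both runs: slower than one sorted level
```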
LSbM-tree: aided by a compaction buffer

• LSbM-tree
  – Retains all the merits of an LSM-tree by maintaining its structure
  – A compaction buffer re-enables buffer caching
    • The compaction buffer maps directly to the buffer cache
    • Slow data movement reduces cache invalidations
    • Cached data is kept in the compaction buffer for cache hits

[Figure: writes flow into the underlying LSM-tree (C0, C1, C2) as usual, while the same data is appended into the compaction buffer levels B1, B2, which back the buffer cache]
20
Compaction Buffer: a cushion between LSM-tree and cache
[Figure: as C0 is merged into C1 and C1 into C2 in the underlying LSM-tree, the same data is appended into B1 and B2 of the compaction buffer; the buffer cache maps onto the compaction buffer blocks]

• During a compaction from Ci to Ci+1, Ci is also appended to Bi+1 (as sketched below)
  – E.g., while data in C0 is merged into C1 in the LSM-tree, it is also appended to B1 in the compaction buffer
• No additional I/O, only index modification and additional space
21
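The append step can be sketched as follows (hypothetical Python, not the authors' implementation): while Ci is compacted into Ci+1 as usual, the existing Ci blocks are additionally linked into B(i+1) by reference, so only the compaction-buffer index changes and no extra data is written. The block ids and level dictionaries are illustrative assumptions.

```python
# While Ci is compacted into Ci+1 as usual, the existing Ci blocks are also linked into
# B(i+1) of the compaction buffer by reference, so only an index update is needed.

lsm_levels = {1: ["C1/blk-ace"], 2: ["C2/blk-bdf"]}    # block ids already on disk
compaction_buffer = {1: [], 2: []}                     # B1, B2: lists of block references

def compact_with_buffer(i):
    src_blocks = lsm_levels[i]
    # 1. Normal LSM-tree compaction: merge Ci into Ci+1, writing new blocks at new
    #    addresses (the actual merge of contents is omitted in this sketch).
    lsm_levels[i + 1] = [f"C{i + 1}/merged-{b.split('/')[-1]}" for b in src_blocks] + lsm_levels[i + 1]
    lsm_levels[i] = []
    # 2. Compaction-buffer append: reuse the old Ci blocks in B(i+1); no data is copied,
    #    only the in-memory index of the compaction buffer changes.
    compaction_buffer[i + 1].extend(src_blocks)

compact_with_buffer(1)
print(lsm_levels[2])           # newly merged blocks of the underlying LSM-tree
print(compaction_buffer[2])    # the old C1 blocks, still reachable at their old addresses
```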
Compaction Buffer: a cushion between LSM-tree and cache
[Figure: after C1 has been merged into C2, the old C1 data remains reachable through B1 and B2 of the compaction buffer, and the buffer cache keeps serving from those blocks]

• During a compaction from Ci to Ci+1, Ci is also appended to Bi+1
  – E.g., while data in C0 is merged into C1 in the LSM-tree, it is also appended to B1 in the compaction buffer
• No additional I/O, only index modification and additional space
• As Ci is merged into Ci+1, the data blocks in Bi are removed gradually
22
Buffer Trimming: Who should stay or leave?
[Figure: in the compaction buffer, the newly appended blocks of a level stay while removed blocks are dropped; the buffer cache keeps holding the blocks it has cached]

• To keep only the cached data, the compaction buffer is periodically trimmed (see the sketch below)
• In each level, the most recently appended data blocks stay
• Other data blocks stay only if they are cached in the buffer cache
• The removed data blocks are noted in the index for future queries
23
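A small sketch of the trimming rule above (hypothetical Python; `keep_recent` and the block ids are assumed for illustration): the most recently appended blocks of a level stay, older blocks stay only if they are still in the buffer cache, and the rest are removed and recorded.

```python
# In each compaction-buffer level, keep the most recently appended blocks and any older
# block that is still in the buffer cache; drop the rest and record the removal.

def trim_level(blocks, cached_blocks, keep_recent=1):
    """blocks: list of block ids, oldest first; returns (kept, removed)."""
    recent = set(blocks[-keep_recent:]) if keep_recent else set()
    kept, removed = [], []
    for b in blocks:
        if b in recent or b in cached_blocks:    # recently appended or still cached: stay
            kept.append(b)
        else:                                    # cold and not cached: remove, note in index
            removed.append(b)
    return kept, removed

b2 = ["blk-ace", "blk-bdf", "blk-new"]           # oldest ... newest appends in B2
buffer_cache = {"blk-bdf"}                       # only this older block is still hot
print(trim_level(b2, buffer_cache))              # (['blk-bdf', 'blk-new'], ['blk-ace'])
```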
Birth and Death of the compaction buffer

Workload        Dynamics in the compaction buffer
Read only       No data is appended into the compaction buffer (no birth)
Write only      All data are deleted from the compaction buffer by the trim process (dying soon)
Read & write    Only frequently visited data are kept in the compaction buffer (dynamically alive)

24
Why LSbM-tree is effective?
Underlying LSM-tree:
• Contains the entire dataset
• Fully sorted at each level
• Efficient for on-disk range queries
• Updated frequently by merges
• LSM-tree induced buffer cache misses are high

Compaction buffer:
• Attempts to keep only the cached data
• Not fully sorted at each level
• Not used for on-disk range queries
• Not updated frequently
• LSM-tree induced buffer cache misses are minimized

LSbM best utilizes both the underlying LSM-tree and


the compaction buffer for queries of different access patterns
25
Querying LSM-tree
[Flowchart: point query for key k in an LSM-tree]
1. Start with i = 0.
2. If some block in Ci holds k: load that block into the buffer cache if it is not already cached, then parse the K-V pair out of the block (Found).
3. Otherwise, if level i is the last level, report Not found; else set i = i + 1 and repeat step 2.
26
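The flowchart corresponds to roughly the following logic (hypothetical Python sketch; the per-level key-to-address indexes and the `read_block` disk reader are assumptions, not part of the original slides).

```python
def lsm_get(levels, buffer_cache, read_block, key):
    """levels: list of per-level indexes, each mapping a key to a disk block address."""
    for index in levels:                              # i = 0, 1, 2, ... down to the last level
        addr = index.get(key)
        if addr is None:
            continue                                  # no block for key in this level: go deeper
        if addr not in buffer_cache:
            buffer_cache[addr] = read_block(addr)     # miss: load the block into the cache
        return buffer_cache[addr][key]                # parse the K-V pair out of the block
    return None                                       # last level reached: not found
```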


Querying LSbM-tree
[Flowchart: point query for key k in an LSbM-tree]
1. Start with i = 0.
2. If some block in Ci holds k: if some block in Bi also holds k, use the block found in Bi; otherwise use the block found in Ci. Load the chosen block into the buffer cache if it is not already cached, then parse the K-V pair out of the block (Found).
3. Otherwise, if level i is the last level, report Not found; else set i = i + 1 and repeat step 2.
27


Querying LSbM-tree
[Figure: get(a) probes the underlying LSM-tree and the compaction buffer level by level; the block holding a is still present in the compaction buffer, so the read is served from there]

• Step 1: the underlying LSM-tree is checked level by level to locate the disk block(s) holding the requested key/key range
• Step 2: the same level(s) of the compaction buffer are checked to locate the disk block(s) holding the requested key/key range
• Step 3: if any block is found in Step 2, the request is served by the block(s) in the compaction buffer
28
Querying LSbM-tree
[Figure: get(f) probes the underlying LSM-tree and the compaction buffer level by level; the block holding f has been removed from the compaction buffer, so the read falls back to the underlying LSM-tree]

• Step 1: the underlying LSM-tree is checked level by level to locate the disk block(s) holding the requested key/key range
• Step 2: the same level(s) of the compaction buffer are checked to locate the disk block(s) holding the requested key/key range
• Step 3: if no block is found in Step 2, the request is served with the block(s) found in the underlying LSM-tree
29
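Steps 1-3 can be summarized in a sketch that extends the LSM-tree query above (hypothetical Python; the per-level indexes and `read_block` helper are again assumptions): the compaction-buffer block is preferred whenever the same level still holds one for the key.

```python
def lsbm_get(lsm_levels, buf_levels, buffer_cache, read_block, key):
    """lsm_levels / buf_levels: per-level indexes mapping a key to a block address."""
    for c_index, b_index in zip(lsm_levels, buf_levels):
        c_addr = c_index.get(key)
        if c_addr is None:
            continue                          # Step 1: this LSM-tree level has no block for key
        b_addr = b_index.get(key)             # Step 2: check the same compaction-buffer level
        addr = b_addr if b_addr is not None else c_addr   # Step 3: prefer the compaction buffer
        if addr not in buffer_cache:
            buffer_cache[addr] = read_block(addr)
        return buffer_cache[addr][key]
    return None
```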
Experiments
• Setup
– Linux kernel 4.4.0-64
– Two quad-core Intel E5354 processors
– 8 GB main memory
– Two Seagate hard disk drives (Seagate
Cheetah 15K.7, 450GB) are configured as
RAID0

30
Dataset and Workloads
[Figure: the 20GB dataset contains a 3GB hot range receiving 98% of reads and a 17GB range receiving the remaining 2%; writes cover 100% of the dataset]

• Dataset
  – 20GB of unique data
• Write workload
  – 100% writes uniformly distributed on the entire dataset
• Read workload (RangeHot workload)
  – 98% reads uniformly distributed on a 3GB hot range
  – 2% reads uniformly distributed on the rest of the data range
31
LSM-tree induced cache invalidation

• Test on LSM-tree
• Writes
– Fixed write throughput 1000 writes per second
• Reads
– RangeHot workload 32
Ineffectiveness of the Dedicated-server solution

• Test on LSM-tree with dedicated compaction server


• Writes
– Fixed write throughput 1000 writes per second
• Reads
– RangeHot workload 33
Effectiveness of LSbM-tree

• Test on LSbM-tree
• Writes
– Fixed write throughput 1000 writes per second
• Reads
– RangeHot workload 34
Random access performance

• Buffer cache cannot be effectively used by LSM-tree


• The Dedicated server solution doesn’t work for RangeHot
workload
• LSbM-tree effectively re-enables the buffer caching and
achieves the best random access performance

35
Range query performance

• LSM-tree is efficient on range query


– Each level is fully sorted
– The invalidated disk blocks in cache can be loaded back quickly by
range query
• Key-Value store cache cannot support fast in-memory range query
• The stepped-merge tree (SM-tree) is inefficient for on-disk range queries
• LSbM-tree achieves the best range query performance by best
utilizing the underlying LSM-tree and the compaction buffer 36
Where are these methods positioned?
[Figure: the methods positioned along two axes, on-disk range query efficiency and buffer caching efficiency]

37
Conclusion
• LSM-tree is a widely used storage data structure in
production systems to maximize the write throughput
• LSM-tree induced cache misses happen for workloads of
mixed reads and writes
• Several solutions have been proposed to address this problem, but they raise other issues
• LSbM-tree is an effective solution that retains all the merits of the LSM-tree while also re-enabling buffer caching
• LSbM-tree is being implemented and tested in
production systems, e.g. Cassandra and LevelDB

38
Thank you

39
dbsize

40
Zipfian workload

[Figure: 20GB dataset, 100% reads/writes following a Zipfian distribution]

41
Zipfian workload on bLSM

42
Zipfian workload on LSbM

43
