10 Data Structures That Make Databases Fast and Scalable
From B-Trees to Bloom Filters

ASHISH PRATAP SINGH

NOV 21, 2024

Have you ever wondered why modern databases are so fast and efficient, even when
managing terabytes of data?

The answer lies in their underlying data structures and indexing techniques that enable
efficient storage, retrieval, and management of data.

In this article, we'll look at 10 important data structures that make modern databases
fast, reliable, and scalable.

1. Hash Indexes
A hash index is a data structure that maps keys to values using a hash function.

The hash function converts a key into an integer, which is used as an index in a hash
table (buckets) to locate the corresponding value.

It is designed for fast point operations, such as inserting or finding a record with id = 123.

This structure provides O(1) average-time complexity for insertions, deletions, and
lookups.

Hash indexes are widely used in key-value stores (e.g., DynamoDB) and caching
systems (e.g., Redis).
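
To make the bucket idea concrete, here is a minimal sketch of a hash index in Python. It assumes a fixed number of buckets and resolves collisions by chaining; the class name `HashIndex` and its methods are illustrative, not taken from any particular database.

```python
class HashIndex:
    """Toy hash index: fixed bucket count, collisions resolved by chaining."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # The hash function turns the key into an integer;
        # the modulo maps that integer to one of the buckets.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                del bucket[i]
                return True
        return False


index = HashIndex()
index.put(123, {"id": 123, "name": "Alice"})
found = index.get(123)
missing = index.get(999)
```

Each operation touches only one bucket, which is where the O(1) average cost comes from. A production hash index would also grow the bucket array as it fills up, which this sketch leaves out.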

2. B-Trees
A B-tree is a self-balancing tree data structure designed to store sorted data in a way
that optimizes reads, writes, and queries on large datasets.

It minimizes disk I/O by storing multiple keys in a single node and automatically
balances itself during insertions and deletions.

Unlike binary search trees, where each node has at most two children, B-Trees allow
multiple children per node. The maximum number of children is defined by the order of
the B-Tree.

Internal nodes contain keys and pointers to child nodes, and leaf nodes contain keys and
pointers to the actual data.

Keys in each node are stored in sorted order, enabling fast binary searches.

B-Trees are widely used for indexing in relational databases (e.g., MySQL).

While many NoSQL databases favor LSM Trees for write-heavy workloads, some use
B-Trees for read-heavy scenarios or as part of their indexing strategy.
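
The lookup path through such a tree can be sketched in a few lines of Python. This is only a search over a hand-built tree, assuming the layout described above (internal nodes hold separator keys, leaves hold keys plus values); insertion, node splitting, and rebalancing are left out. `BTreeNode` and `btree_search` are hypothetical names.

```python
from bisect import bisect_left, bisect_right


class BTreeNode:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys                # sorted keys within the node
        self.children = children or []  # child nodes (empty for a leaf)
        self.values = values or []      # leaf only: one value per key


def btree_search(node, key):
    """Walk from the root to a leaf, doing a binary search in each node."""
    while node.children:
        # Separator convention assumed here: child i holds keys < keys[i],
        # child i + 1 holds keys >= keys[i].
        node = node.children[bisect_right(node.keys, key)]
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None


# A two-level tree built by hand for illustration.
root = BTreeNode(
    keys=[20, 40],
    children=[
        BTreeNode(keys=[5, 10], values=["row-5", "row-10"]),
        BTreeNode(keys=[20, 30], values=["row-20", "row-30"]),
        BTreeNode(keys=[40, 50], values=["row-40", "row-50"]),
    ],
)
hit = btree_search(root, 30)
miss = btree_search(root, 25)
```

Because every node holds many keys, the tree stays shallow, so a lookup reads only a handful of nodes (and therefore disk pages) even for very large tables.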

3. Skip Lists
A skip list is a probabilistic data structure that extends the functionality of linked lists
by adding multiple levels of "shortcuts" to enable fast search, insertion, and deletion
operations.

It is designed to offer performance comparable to balanced binary trees or B-Trees
while being simpler to implement and manage.

A skip list consists of multiple levels, with each level being a subset of the level below.

The bottom-most layer contains all the elements in sorted order, and higher layers are
sparser subsets that provide shortcuts to quickly navigate the lower layers.

Nodes are promoted to higher levels probabilistically, ensuring an even distribution
without the need for rebalancing.

Skip lists are particularly well-suited for in-memory storage and dynamic datasets
where updates are frequent.

Redis uses skip lists to implement its sorted sets (ZSET), enabling fast insertions,
deletions, and range queries while maintaining sorted order.
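
The following Python sketch shows how the levels and probabilistic promotion fit together. It is a simplified illustration, not how Redis implements sorted sets: the promotion probability of 0.5, the cap of 16 levels, and all class and method names are assumptions made for the example.

```python
import random


class SkipNode:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # forward[i] is the next node on level i


class SkipList:
    MAX_LEVEL = 16

    def __init__(self, p=0.5, seed=None):
        self.p = p
        self.rng = random.Random(seed)
        self.head = SkipNode(None, self.MAX_LEVEL)  # sentinel, holds no key
        self.level = 1                              # levels currently in use

    def _random_level(self):
        # Promote a node one level at a time with probability p.
        level = 1
        while level < self.MAX_LEVEL and self.rng.random() < self.p:
            level += 1
        return level

    def _predecessors(self, key):
        """For each level, the last node whose key is smaller than `key`."""
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in range(self.level - 1, -1, -1):  # start at the sparsest level
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        return update

    def insert(self, key):
        update = self._predecessors(key)
        level = self._random_level()
        self.level = max(self.level, level)
        new_node = SkipNode(key, level)
        for i in range(level):  # splice the node into each of its levels
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node

    def contains(self, key):
        node = self._predecessors(key)[0].forward[0]
        return node is not None and node.key == key

    def range_query(self, lo, hi):
        """All keys k with lo <= k <= hi, in sorted order."""
        node = self._predecessors(lo)[0].forward[0]
        result = []
        while node is not None and node.key <= hi:
            result.append(node.key)
            node = node.forward[0]
        return result

    def __iter__(self):
        node = self.head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]


scores = SkipList(seed=42)
for value in [30, 10, 20, 50, 40]:
    scores.insert(value)
in_order = list(scores)
between = scores.range_query(15, 40)
```

The range query shows why skip lists suit sorted sets: the upper levels locate the start of the range quickly, and the bottom level is then walked like an ordinary sorted linked list.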

4. Memtables
A memtable is an in-memory data structure used in modern databases to temporarily
store write operations before they are flushed to disk.

It plays a critical role in optimizing write performance in databases designed for
high-throughput workloads, such as Cassandra, RocksDB, and HBase. Durability comes
from the Write-Ahead Log that is written alongside it, since the memtable itself lives
only in memory.

Memtables are typically implemented as a sorted structure like a Red-Black Tree or Skip
List, enabling efficient lookups and ordered writes to disk.

When data is written to the database, it is logged in the Write-Ahead Log (to persist the
change) and added to the memtable.

For reads, the memtable is checked first, since it holds the most recent writes. If the key
is not found in the memtable, the query searches on-disk SSTables (Sorted String Tables)
or other storage files.

When the memtable reaches its size limit, it is flushed to disk as an immutable SSTable
and a new memtable is initialized for subsequent writes.
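
The write, read, and flush steps described above can be sketched as follows. This is a toy model kept entirely in memory: a Python list stands in for the on-disk log, a dict stands in for the sorted memtable, and the names `TinyLSM` and `memtable_limit` are made up for the example.

```python
from bisect import bisect_left


class TinyLSM:
    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit
        self.wal = []        # stand-in for an append-only log file on disk
        self.memtable = {}   # real memtables keep keys sorted (tree or skip list)
        self.sstables = []   # each entry: an immutable, sorted tuple of (key, value)

    def put(self, key, value):
        self.wal.append((key, value))  # 1. log the change first
        self.memtable[key] = value     # 2. then apply it in memory
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out in key order as a new immutable SSTable,
        # then start over with an empty memtable and an empty log.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.wal.clear()

    def get(self, key):
        if key in self.memtable:               # newest data first
            return self.memtable[key]
        for table in reversed(self.sstables):  # then newest SSTable to oldest
            i = bisect_left(table, (key,))     # binary search on the sorted keys
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # reaches the limit, so the memtable is flushed
db.put("a", 10)  # newer value for "a" now lives in the fresh memtable
latest_a = db.get("a")
flushed_b = db.get("b")
```

Reading the memtable before the SSTables is what guarantees that the newest value for a key wins, even when an older value for the same key still sits in a flushed table.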

5. SSTables
An SSTable (Sorted String Table) is an immutable, on-disk data structure used in
modern databases like Cassandra, RocksDB, and HBase to store sorted key-value pairs.

SSTables are primarily used in Log-Structured Merge Tree (LSM Tree) databases to
optimize read and write performance.

They enable efficient sequential writes, fast lookups, and range queries.

Key Characteristics of SSTables:

Once written, an SSTable is never modified. New writes are handled by creating
new SSTables.

SSTables are written as sequential blocks to disk, minimizing fragmentation and
improving I/O performance.

SSTables include an in-memory index or Bloom filter to quickly determine whether
a key might exist without scanning the entire table.

Older SSTables are periodically merged into larger tables, removing duplicates and
reclaiming storage space.
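
The merge step (often called compaction) can be illustrated with a small k-way merge in Python. This is a sketch under simplifying assumptions: each SSTable is a sorted list of (key, value) pairs with unique keys, tables are passed oldest first, and delete markers (tombstones) are ignored. The function names are hypothetical.

```python
import heapq


def _tagged(table, age):
    # Tag each entry with the negated table age so that, for equal keys,
    # the entry from the newest table sorts first in the merge.
    for key, value in table:
        yield (key, -age, value)


def compact(sstables):
    """Merge sorted SSTables (ordered oldest to newest) into one sorted table."""
    streams = [_tagged(table, age) for age, table in enumerate(sstables)]
    merged = []
    for key, _, value in heapq.merge(*streams):
        if not merged or merged[-1][0] != key:
            merged.append((key, value))  # first occurrence is the newest version
        # older versions of the same key are dropped, reclaiming space
    return merged


older = [("a", 1), ("b", 2), ("d", 4)]
newer = [("b", 20), ("c", 3)]
compacted = compact([older, newer])
```

Because every input is already sorted, the merge reads each table sequentially from start to end, which matches the sequential I/O pattern that SSTables are built around.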

6. Inverted Index
An inverted index is a data structure that maps terms (words or tokens) to the
documents or locations where they appear.

It is called "inverted" because it reverses the conventional relationship of an index:
instead of mapping documents to the terms they contain, it maps terms to the
documents that contain them.

This structure allows for efficient handling of full-text searches like:

Find all documents that contain the terms "database" and "powerful".

How an Inverted Index Is Created:

1. Tokenization: Text is split into individual tokens (words or terms).

Example: "Database systems are powerful" → ["database", "systems", "are", "powerful"]

2. Normalization: Tokens are standardized (e.g., lowercased, stemmed, or lemmatized).

Example: "Databases" → "database"

3. Index Construction and Storage: For each term, a postings list is created or updated
with the document ID and metadata (e.g., term frequency, positions).

Inverted indexes are widely used in databases, search engines, and information
retrieval systems to enable efficient keyword lookups, Boolean queries, and relevance
ranking.
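
The three construction steps can be sketched in Python as below. The example is deliberately minimal: normalization is limited to lowercasing (no stemming or lemmatization), postings store only token positions, and the sample documents and function names are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Steps 1 and 2: split into tokens and normalize by lowercasing.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    # Step 3: postings map each term to {doc_id: [positions of the term]}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search_all(index, query):
    """Return IDs of documents that contain every term in the query."""
    postings = [set(index.get(term, {})) for term in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []


docs = {
    1: "Database systems are powerful",
    2: "A database stores data",
    3: "Powerful search needs a powerful database",
}
index = build_index(docs)
matches = search_all(index, "database powerful")
```

A multi-term query becomes a set intersection over postings lists, so the engine never has to scan the text of documents that do not contain the terms.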

7. Bloom Filters
A Bloom Filter is a space-efficient, probabilistic data structure that answers the
question: "Does this element exist in a set?"

It starts as a bit array of size m, initialized with all bits set to 0. It also requires k
independent hash functions, each of which maps an element to one of the m positions in
the bit array.

To insert an element into the Bloom filter, you pass it through each of the k hash
functions to get k positions in the bit array. The bits at these positions are set to 1.

To check if an element is in the set, you again pass it through the k hash functions to get
k positions.

If all the bits at these positions are set to 1, the element is probably in the set
(though there's a chance it might be a false positive).

If any bit at these positions is 0, the element is definitely not in the set.

Unlike traditional data structures, it does not store the actual elements, making it
extremely memory-efficient.

Bloom filters allow databases to quickly check if a key might exist in a specific data
structure (e.g., an SSTable or a database partition). They avoid unnecessary disk
lookups in places where the key is guaranteed to be absent.
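
A Bloom filter is short enough to sketch in full. The Python version below assumes m = 1024 and k = 3 as arbitrary defaults, and derives the k hash functions by salting SHA-256 with the function number, which is one simple way to get several hash functions and not the only one. It also spends a whole byte per bit to keep the code readable.

```python
import hashlib


class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m = m                # number of bits in the array
        self.k = k                # number of hash functions
        self.bits = bytearray(m)  # one byte per bit, all initialized to 0

    def _positions(self, item):
        # Derive k positions by salting one hash function with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[position] for position in self._positions(item))


seen_keys = BloomFilter()
seen_keys.add("user:123")
probably_present = seen_keys.might_contain("user:123")
```

A database would consult `might_contain` before opening an SSTable and skip the disk read whenever it returns False.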

8. Bitmap Indexes
A bitmap index is a specialized indexing technique that encodes the values of a column
as a series of bitmaps, where each bitmap corresponds to a unique value in the column.

Each bit in the bitmap represents whether a row in the dataset contains that value.

Consider a dataset with a column Color:

In the bitmap index:

Each row corresponds to a bit in the bitmap.

A 1 indicates that the value is present in the row, while a 0 indicates its absence.

Bitmap indexes use bitwise operations (AND, OR, NOT, XOR) to efficiently filter data.

Example Query: Find rows with Color = 'Red' OR Color = 'Green'

Retrieve the bitmap for 'Red' → 1 0 0 1 0

Retrieve the bitmap for 'Green' → 0 0 1 0 1

Perform a bitwise OR → 1 0 1 1 1

Bitmap indexes are widely used in data warehouses, columnar databases, and OLAP
(Online Analytical Processing) systems for their ability to speed up complex queries
like filtering, aggregations, and joins when dealing with large datasets containing
low-cardinality columns (columns with few unique values).
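
The example query can be reproduced with a few lines of Python. The column contents below are an assumption chosen to match the 'Red' and 'Green' bitmaps shown above; in particular, the value 'Blue' in the second row is made up, since the bitmaps only tell us that row is neither Red nor Green.

```python
def build_bitmap_index(column):
    """Map each distinct value to a bitmap with one bit per row."""
    index = {}
    for row, value in enumerate(column):
        index.setdefault(value, [0] * len(column))[row] = 1
    return index


def bitmap_or(a, b):
    return [x | y for x, y in zip(a, b)]


# Hypothetical column: the second value ('Blue') is an assumption.
colors = ["Red", "Blue", "Green", "Red", "Green"]
index = build_bitmap_index(colors)

red_or_green = bitmap_or(index["Red"], index["Green"])
matching_rows = [row for row, bit in enumerate(red_or_green) if bit]
```

Real systems pack the bits into machine words and usually compress them, so a single CPU instruction combines 64 rows at a time. The list-of-integers form here is only for readability.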

9. R-Trees
An R-Tree (short for Rectangle Tree) is a tree-based data structure designed for indexing
multidimensional spatial data, such as geographic locations, geometric shapes, or
bounding boxes.

It is particularly effective for queries involving spatial relationships, such as
intersections, containment, and nearest neighbors.

Each node is represented by a Minimum Bounding Rectangle (MBR) that encloses all
its child nodes or objects. Leaf nodes store the actual spatial objects.

PostGIS, an extension of PostgreSQL, builds its spatial indexes as R-Trees implemented
on top of PostgreSQL's GiST framework, and uses them for queries like:

Find all locations within a rectangular region.

SELECT * FROM locations
WHERE ST_Intersects(
    geometry,
    ST_MakeEnvelope(-74.02, 40.70, -73.93, 40.80)
);
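
To show how the bounding rectangles let a query skip whole branches, here is a small Python sketch of the search side of an R-Tree. It works on a hand-built two-level tree and leaves out insertion and node splitting. The place names and coordinates are invented, and points are stored as zero-area rectangles. This is an illustration of the idea, not PostGIS code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Rect:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other):
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)


@dataclass
class Node:
    mbr: Rect
    children: List["Node"] = field(default_factory=list)
    name: Optional[str] = None  # set only on leaf entries


def leaf(name, x, y):
    return Node(mbr=Rect(x, y, x, y), name=name)  # a point as a zero-area MBR


def group(children):
    # The parent's MBR is the smallest rectangle enclosing all child MBRs.
    mbr = Rect(min(c.mbr.xmin for c in children), min(c.mbr.ymin for c in children),
               max(c.mbr.xmax for c in children), max(c.mbr.ymax for c in children))
    return Node(mbr=mbr, children=children)


def search(node, query):
    if not node.mbr.intersects(query):
        return []  # prune: nothing below this node can match
    if node.name is not None:
        return [node.name]
    results = []
    for child in node.children:
        results.extend(search(child, query))
    return results


# Hypothetical data laid out around the query envelope used above.
west = group([leaf("cafe", -74.00, 40.72), leaf("park", -73.97, 40.78)])
east = group([leaf("airport", -73.78, 40.64), leaf("stadium", -73.85, 40.83)])
root = group([west, east])

query = Rect(-74.02, 40.70, -73.93, 40.80)
inside = search(root, query)
```

The `east` group is rejected with a single rectangle comparison, so neither of its entries is ever examined. That pruning is what makes spatial lookups fast on large datasets.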

10. Write-Ahead Logs (WAL)

A Write-Ahead Log is an append-only persistent log file that records all changes made
to a database before they are applied to the actual database.

This ensures that even if the system crashes during a write operation, the database can
recover to a consistent state by replaying or rolling back these logged changes.

Structure of a WAL Entry:

Transaction ID (TXID): A unique identifier for the transaction.

Operation Type: The action performed (INSERT, UPDATE, DELETE).

Table Name: The table where the operation occurred.

Key: The primary key or unique identifier of the record.

Old Value: The previous value of the record (used in UPDATE and DELETE).

New Value: The new value of the record (used in INSERT and UPDATE).

Timestamp: The time the operation was logged.

Sample WAL Record:

TXID: 1002
Operation: UPDATE
Table: users
Key: id=1
Old Value: {"id": 1, "name": "Alice", "email": "alice@[Link]", "age": 30}
New Value: {"id": 1, "name": "Alice", "email": "[Link]@[Link]", "age": 31}
Timestamp: 2024-11-19T[Link]Z

By logging changes before they are applied to the main database, WAL enables
databases to recover from crashes and maintain ACID (Atomicity, Consistency,
Isolation, Durability) properties.

Periodically, the database truncates or archives old log entries after ensuring that the
changes are safely written to the main database file.
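
The append-then-replay cycle can be sketched in Python as follows. This is a simplified model, assuming one JSON record per line and shortened, hypothetical field names (`txid`, `op`, `key`, `new`). It has no transaction commit markers, checksums, or log truncation, all of which a real WAL needs.

```python
import json
import os
import tempfile


def wal_append(path, record):
    """Append one record and force it to disk before returning."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
        log.flush()
        os.fsync(log.fileno())  # the change is durable once this returns


def wal_replay(path):
    """Rebuild the table state by re-applying every logged change in order."""
    table = {}
    with open(path, encoding="utf-8") as log:
        for line in log:
            record = json.loads(line)
            if record["op"] in ("INSERT", "UPDATE"):
                table[record["key"]] = record["new"]
            elif record["op"] == "DELETE":
                table.pop(record["key"], None)
    return table


with tempfile.TemporaryDirectory() as directory:
    wal_path = os.path.join(directory, "wal.log")
    wal_append(wal_path, {"txid": 1001, "op": "INSERT", "key": 1,
                          "new": {"name": "Alice", "age": 30}})
    wal_append(wal_path, {"txid": 1002, "op": "UPDATE", "key": 1,
                          "new": {"name": "Alice", "age": 31}})
    wal_append(wal_path, {"txid": 1003, "op": "INSERT", "key": 2,
                          "new": {"name": "Bob", "age": 25}})
    wal_append(wal_path, {"txid": 1004, "op": "DELETE", "key": 2})

    # Simulate recovery after a crash: only the log is read.
    recovered = wal_replay(wal_path)
```

Because the log is append-only, each write is a sequential disk operation, and recovery only has to read the file from start to end.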
