
Quick Tour of ClickHouse Internals

Aleksey Zatelepin, Yandex

v1.1
ClickHouse use cases
A stream of events
› Actions of website visitors
› Ad impressions
› Financial transactions
› DNS queries
› …
We want to save information about these events and then glean insights from it

ClickHouse philosophy
› Interactive queries on data updated in real time
› Clean, structured data is needed
› Try hard not to pre-aggregate anything
› Query language: a dialect of SQL + lots of extensions

Sample query in a web analytics system
Top-10 referers for a counter for the last week.

SELECT Referer, count(*) AS count
FROM hits
WHERE CounterID = 1234 AND Date >= today() - 7
GROUP BY Referer
ORDER BY count DESC
LIMIT 10

How to execute a query fast?
Read data fast
› Only needed columns: CounterID, Date, Referer
› Locality of reads (an index is needed!)
› Data compression
Process data fast
› Vectorized execution (block-based processing)
› Parallelize to all available cores and machines
› Specialization and low-level optimizations
Index needed!
The principle is the same as in classic DBMSes
› A majority of queries will contain conditions on
  CounterID and (possibly) Date
› (CounterID, Date) fits the bill
› Check this by mentally sorting the table by the primary key
Differences from classic DBMSes
› The table will be physically sorted on disk
› The primary key is not a unique constraint
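As a sketch, using the legacy MergeTree syntax from the time of this
talk (the column set is illustrative), such a table could be declared
like this:

CREATE TABLE hits
(
    Date Date,
    CounterID UInt32,
    Referer String
)
ENGINE = MergeTree(Date, (CounterID, Date), 8192)

Here Date is the partitioning date column, (CounterID, Date) is the
primary key, and 8192 is the index granularity discussed below.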
Index internals
Primary key: (CounterID, Date)
Files on disk: primary.idx, plus a .mrk (marks) and a .bin (compressed
data) file for each column: CounterID, Date, Referer

primary.idx contents (one entry per 8192 rows):

  row        CounterID  Date
  …          …          …
             1234       2017-09-11
  N          1234       2017-09-21
  N+8192     1234       2017-09-27
  N+16384    1235       2013-02-16
             1235       2013-03-12
  …          …          …
Things to remember about indexes
Index is sparse
› Must fit into memory
› Default value of granularity (8192) is good enough
› Does not create a unique constraint
› Poor performance for point queries
Table is sorted according to the index
› There can be only one
› Using the index is always beneficial
How to keep the table sorted
Inserted events are (almost) sorted by time,
but we need them sorted by the primary key!
MergeTree: maintain a small set of sorted parts
(an idea similar to an LSM tree)
How to keep the table sorted
(Diagram: parts shown as intervals of insertion numbers, each part
sorted internally by primary key)

› An INSERT creates a new sorted part [N+1] next to the existing
  part [M, N] on disk
› In the background, parts [M, N] and [N+1] are merged into a single
  sorted part [M, N+1]
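To inspect the current set of parts, you can query the system.parts
table (a sketch; the table name is illustrative):

SELECT partition, name, rows
FROM system.parts
WHERE table = 'hits' AND active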
Things to do while merging
Replace/update records
› ReplacingMergeTree
› CollapsingMergeTree
Pre-aggregate data
› AggregatingMergeTree
Metrics rollup
› GraphiteMergeTree

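For example, a ReplacingMergeTree table (a sketch in the legacy
syntax; table, column, and version names are illustrative) keeps only
the latest version of each row once its parts have been merged:

CREATE TABLE user_profiles
(
    Date Date,
    UserID UInt64,
    Name String,
    Version UInt32
)
ENGINE = ReplacingMergeTree(Date, (UserID, Date), 8192, Version)

Replacement happens only at merge time, so duplicates may still be
visible until the parts are merged.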
MergeTree partitioning
ENGINE = MergeTree(Date, …)
› The table is partitioned by month or (soon) by any expression
› Parts from different partitions are never merged
› Easy manipulation of partitions:
    ALTER TABLE DROP PARTITION
    ALTER TABLE DETACH/ATTACH PARTITION
› MinMax index on the partition columns
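With monthly partitioning, partitions are named YYYYMM, so a sketch of
partition manipulation (the table name is illustrative) looks like:

ALTER TABLE hits DROP PARTITION 201709
ALTER TABLE hits DETACH PARTITION 201708
ALTER TABLE hits ATTACH PARTITION 201708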
Things to remember about MergeTree
Merging runs in the background
› Even when there are no queries!
Control the total number of parts
› Rate of INSERTs
› The MaxPartsCountForPartition and DelayedInserts
  metrics are your friends (see the sketch below)
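A sketch of how to watch them (which system table holds each metric,
and its exact spelling, may vary between versions):

SELECT value FROM system.asynchronous_metrics
WHERE metric LIKE '%CountForPartition%'

SELECT value FROM system.events
WHERE event = 'DelayedInserts'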
When one server is not enough
› The data won’t fit on a single server…
› You want to increase performance by adding more servers…
› Multiple simultaneous queries are competing for resources…

ClickHouse: Sharding + Distributed tables!

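A sketch of a Distributed table definition (the cluster name comes
from the server config; other names are illustrative):

CREATE TABLE distributed_hits AS hits
ENGINE = Distributed(my_cluster, default, hits, rand())

rand() spreads rows uniformly; a key such as intHash64(CounterID)
would keep each counter's data on a single shard.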
Reading from a Distributed table

SELECT FROM distributed_table GROUP BY column

is forwarded to every shard (Shard 1, Shard 2, Shard 3) as

SELECT FROM local_table GROUP BY column

Each shard sends back a partially aggregated result; the initiating
server merges them into the full result.
NYC taxi benchmark
CSV 227 GB, ~1.3 billion rows

SELECT passenger_count, avg(total_amount)
FROM trips GROUP BY passenger_count

Shards     1      3      140
Time, s    1.224  0.438  0.043
Speedup           x2.8   x28.5
Inserting into a Distributed table

INSERT INTO distributed_table

By default, the rows are split by sharding_key % 3 (with 3 shards) and
forwarded to each shard asynchronously as

INSERT INTO local_table

With SETTINGS insert_distributed_sync=1, the rows are split by
sharding_key and inserted into all shards synchronously.
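A sketch of a synchronous insert through the Distributed table (the
table name and values are illustrative):

SET insert_distributed_sync = 1

INSERT INTO distributed_hits (CounterID, Date, Referer)
VALUES (1234, today(), 'https://example.com/page')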
Things to remember about Distributed tables
It is just a view
› Doesn't store any data by itself
› Always queries all shards
Ensure that the data is divided between shards uniformly
› either by inserting directly into the local tables
› or by letting the Distributed table do it
  (but beware: inserts are asynchronous by default)
When failure is not an option
› Protection against hardware failure
› Data must always be available for reading and writing

ClickHouse: the ReplicatedMergeTree engine!
› Async master-master replication
› Works on a per-table basis
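A sketch of a replicated table in the legacy syntax (the ZooKeeper
path and the {shard} and {replica} macros come from each server's
config; other names are illustrative):

CREATE TABLE hits_replicated
(
    Date Date,
    CounterID UInt32,
    Referer String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/hits',
                             '{replica}',
                             Date, (CounterID, Date), 8192)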
Replication internals
(Diagram: an INSERT arrives at Replica 1; an entry with the inserted
block number is added to the replication queue in ZooKeeper; Replicas
2 and 3 fetch the new part from Replica 1; merges are likewise
announced in the queue and performed by every replica)
Replication and the CAP theorem
What happens in case of a network failure (partition)?
› Not consistent
  As is any system with async replication
  (but you can turn sequential consistency on; see the sketch below)
› Highly available (almost)
  Tolerates the failure of one datacenter, if the ClickHouse replicas
  are in at least 2 DCs and the ZooKeeper replicas are in 3 DCs
  A server partitioned from the ZK quorum is unavailable for writes
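A sketch of the settings involved (the quorum size depends on your
replica count):

SET insert_quorum = 2
SET select_sequential_consistency = 1

With these, SELECTs see all quorum-written data, at the price of
INSERTs failing when the quorum cannot be assembled.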
Putting it all together

SELECT FROM distributed_table

fans out to

SELECT FROM replicated_table

on one replica of each shard.

(Diagram: 3 shards x 2 replicas; the Distributed table sends the
query to a single replica of each of Shard 1, Shard 2, Shard 3)
Things to remember about replication
Use it!
› Replicas check each other
› Unsure whether an INSERT went through?
  Simply retry: the blocks will be deduplicated
› ZooKeeper is needed, but only for INSERTs
  (no added latency for SELECTs)
Monitor replica lag
› the system.replicas and system.replication_queue
  tables are your friends
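A sketch of a lag check (the threshold is illustrative):

SELECT database, table, is_readonly, absolute_delay, queue_size
FROM system.replicas
WHERE absolute_delay > 60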
Brief recap
› Column-oriented
› Fast interactive queries on real-time data
› SQL dialect + extensions
› A bad fit for OLTP, key-value, or blob storage workloads
› Scales linearly
› Fault tolerant
› Open source!
Thank you
Start using ClickHouse today!
Questions? Or reach us at:
[email protected]
› Telegram: https://t.me/clickhouse_en
› GitHub: https://github.com/yandex/ClickHouse/
› Google group: https://groups.google.com/group/clickhouse