Quick Tour of ClickHouse Internals
v1.1
ClickHouse use cases
A stream of events
› Actions of website visitors
› Ad impressions
› Financial transactions
› DNS queries
› …
We want to save info about these events and then glean some insights from it
ClickHouse philosophy
› Interactive queries on data updated in real time
› Clean, structured data is needed
› Try hard not to pre-aggregate anything
› Query language: a dialect of SQL + lots of extensions
Sample query in a web analytics system
Top 10 referrers for a given counter over the last week.
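A query of this shape could look like the following (a sketch; the table name hits and the CounterID value are assumptions, the columns are the ones listed on the next slide):

SELECT Referer, count() AS visits
FROM hits
WHERE CounterID = 34
  AND Date >= today() - 7
GROUP BY Referer
ORDER BY visits DESC
LIMIT 10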
How to execute a query fast?
Read data fast
› Only needed columns: CounterID, Date, Referer
› Locality of reads (an index is needed!)
› Data compression
Process data fast
› Vectorized execution (block-based processing)
› Parallelize to all available cores and machines
› Specialization and low-level optimizations
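As a small illustration of the parallelism knobs (a sketch; max_threads caps the number of cores a single query uses on one server, the table name is an assumption):

SELECT CounterID, count()
FROM hits
GROUP BY CounterID
SETTINGS max_threads = 16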
Index needed!
Index internals
[Diagram: a sparse primary index on (CounterID, Date) pointing into the sorted CounterID, Date and Referer columns]
Things to remember about indexes
Index is sparse
› Must fit into memory
› Default value of granularity (8192) is good enough
› Does not create a unique constraint
› Poor performance for point queries
Table is sorted according to the index
› There can be only one
› Using the index is always beneficial
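A minimal table definition in the classic MergeTree syntax (a sketch; the table name and column types are assumptions, matching the sample query above):

CREATE TABLE hits
(
    CounterID UInt32,
    Date Date,
    Referer String
) ENGINE = MergeTree(Date, (CounterID, Date), 8192)
-- (CounterID, Date): the primary key the table is sorted by
-- 8192: index granularity, i.e. rows between two index marks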
How to keep the table sorted
Inserted events are (almost) sorted by time
But we need to sort by primary key!
MergeTree: maintain a small set of sorted parts
Similar idea to an LSM tree
[Diagram: each INSERT creates a new part on disk covering a range of insertion numbers, sorted by primary key; a background merge combines parts [M, N] and [N+1] into a single part [M, N+1]]
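The current set of parts can be inspected directly (a sketch; system.parts is the system table listing data parts, the database and table names are assumptions):

SELECT name, rows, active
FROM system.parts
WHERE database = 'default' AND table = 'hits'
ORDER BY name

Active parts are the ones visible to queries; inactive parts have already been merged into bigger ones and will be deleted.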
Things to do while merging
Replace/update records
› ReplacingMergeTree
› CollapsingMergeTree
Pre-aggregate data
› AggregatingMergeTree
Metrics rollup
› GraphiteMergeTree
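For example, a ReplacingMergeTree table keeps only the latest version of each key (a sketch in the classic syntax; the table, columns and Version column are assumptions):

CREATE TABLE user_profiles
(
    Date Date,
    UserID UInt64,
    Name String,
    Version UInt32
) ENGINE = ReplacingMergeTree(Date, (UserID), 8192, Version)
-- during merges, rows with the same key within a partition are collapsed,
-- keeping the row with the highest Version

Note that duplicates may still be visible until the background merge actually runs.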
MergeTree partitioning
ENGINE = MergeTree(Date,… )
› Table is partitioned by month or (soon) by any expression
› Parts from different partitions are not merged
› Easy manipulation of partitions
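Manipulating partitions comes down to one-line ALTERs (a sketch; the table name and months are illustrative):

ALTER TABLE hits DETACH PARTITION 201706  -- take a month offline (moved to the detached directory)
ALTER TABLE hits DROP PARTITION 201705    -- delete a whole month instantly, with no row-by-row deletes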
Things to remember about MergeTree
Merging runs in the background
› Even when there are no queries!
Control total number of parts
› Rate of INSERTs
› The MaxPartsCountForPartition and DelayedInserts metrics are your friends (see the query below)
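Next to those metrics, a quick way to watch part counts (a sketch; counts active parts per partition straight from system.parts):

SELECT table, partition, count() AS parts
FROM system.parts
WHERE active
GROUP BY table, partition
ORDER BY parts DESC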
When one server is not enough
› The data won’t fit on a single server…
› You want to increase performance by adding more servers…
› Multiple simultaneous queries are competing for resources…
Reading from a Distributed table
[Diagram: each shard returns a partially aggregated result; the initiating server combines them into the full result]
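A Distributed table is a thin definition over the local tables on each shard (a sketch; the cluster name my_cluster and the default.hits local table are assumptions):

CREATE TABLE hits_distributed AS hits
ENGINE = Distributed(my_cluster, default, hits, rand())
-- my_cluster: a cluster described in the server config (remote_servers)
-- default.hits: the local MergeTree table that exists on every shard
-- rand(): sharding key used when writing through this table

A SELECT against hits_distributed is forwarded to every shard; each shard aggregates its local data and the initiator merges the partial results.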
NYC taxi benchmark
CSV: 227 GB, ~1.3 billion rows
SELECT passenger_count, avg(total_amount)
FROM trips GROUP BY passenger_count
Shards: 1, 3, 140
Inserting into a Distributed table
-- wait until the data is actually written on the remote shards,
-- instead of the default asynchronous forwarding:
INSERT INTO distributed_table
SETTINGS insert_distributed_sync = 1
VALUES (…)
Things to remember about Distributed tables
It is just a view
› Doesn’t store any data by itself
Will always query all shards
When failure is not an option
› Protection against hardware failure
› Data must always be available for reading and writing
Replication internals
[Diagram: an INSERT arrives at one replica and the block gets a number; fetch and merge entries go into the replication queue in ZooKeeper, and the other replicas either fetch the new block or perform the same merge themselves]
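A replicated table is declared on every replica with the same ZooKeeper path and its own replica name (a sketch in the classic syntax; the path layout and the {shard}/{replica} macros are the usual convention, not something specified in this deck):

CREATE TABLE hits
(
    CounterID UInt32,
    Date Date,
    Referer String
) ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/hits',  -- per-shard path in ZooKeeper
    '{replica}',                        -- this replica's name, from the macros config
    Date, (CounterID, Date), 8192)      -- the usual MergeTree parameters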
Replication and the CAP theorem
What happens in case of a network failure (partition)?
› Not consistent
❋ As is any system with async replication
❋ But you can turn sequential consistency on
› Highly available (almost)
❋ Tolerates the failure of one datacenter, if ClickHouse replicas are in at least 2 DCs and ZK replicas are in 3 DCs
❋ A server partitioned from the ZK quorum is unavailable for writes
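Sequential consistency is opt-in via settings (a sketch; insert_quorum and select_sequential_consistency are the relevant settings, the values and table name are illustrative):

INSERT INTO hits SETTINGS insert_quorum = 2 VALUES (34, today(), 'https://example.com')
-- the INSERT returns only after 2 replicas have the block

SELECT count() FROM hits SETTINGS select_sequential_consistency = 1
-- only answers if this replica already has everything that was written with the quorum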
Putting it all together
SELECT … FROM distributed_table
[Diagram: the query fans out through the Distributed table to a replica of each shard, which reads its local replicated table]
Things to remember about replication
Use it!
› Replicas check each other
› Unsure if an INSERT went through?
Simply retry: identical blocks are deduplicated (see the sketch below)
› ZooKeeper needed, but only for INSERTs
(No added latency for SELECTs)
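The retry pattern is safe precisely because of block deduplication (a sketch; the second, byte-identical INSERT is recognized by its checksum stored in ZooKeeper and skipped):

INSERT INTO hits VALUES (34, today(), 'https://example.com')
-- the connection dropped before we saw the answer, so run the very same statement again:
INSERT INTO hits VALUES (34, today(), 'https://example.com')
-- the block is identical to a recently inserted one, so it is deduplicated and no duplicate rows appear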
Thank you
Start using ClickHouse today!
Questions? Or reach us at:
› [email protected]
› Telegram: https://t.me/clickhouse_en
› GitHub: https://github.com/yandex/ClickHouse/
› Google group: https://groups.google.com/group/clickhouse