Opensource Column Store Databases - MariaDB ColumnStore vs. ClickHouse
Alexander Rubin
VirtualHealth
About me
2
MariaDB ColumnStore, ClickHouse and Storage Formats
Caution:
1. This talk is not about specifics of implementation
○ A number of presentations about Clickhouse and MariaDB @ Percona Live 2019
2. This is all about:
○ What? -- what is the problem
○ Why? -- why queries are slow
○ How? -- how to solve
3. Examples are from a real-world use case: medical insurance records
○ (but no actual PII data shown)
3
Intro: MySQL and Slow Queries
Simple query - top 10 - clients who visited doctors most often (data from 2017-2019)
mysql> SELECT
-> client_id,
-> min(date) as first_visit,
-> max(date) as last_visit,
-> count(distinct date) as days_visited,
-> count(cv.id) as visits,
-> count(distinct cv.service_location_name) as locations
-> FROM client_visit cv
-> GROUP BY client_id
-> ORDER by visits desc
-> LIMIT 10;
+-----------+-------------+------------+--------------+--------+-----------+
| client_id | first_visit | last_visit | days_visited | visits | locations |
+-----------+-------------+------------+--------------+--------+-----------+
| ......... | 2017-08-07 | 2019-05-24 | .. | ... | .. |
4
What exactly is slow?
Is 47 seconds slow?
… depends on expectations
● In the Data Science world it is blazing fast
● For a realtime report/dashboard it is extremely slow
5
What to do?
Some ideas:
1. Use the index, Luke!
2. Table per report
3. Pre-aggregate - table per group of reports
4. Something else
6
Use index
But, it is already using an index:
id: 1
select_type: SIMPLE
table: cv
partitions: NULL
type: index
possible_keys: FK_client_visit
key: FK_client_visit
key_len: 5
ref: NULL
rows: 10483873
filtered: 100.00
Extra: Using temporary; Using filesort
1 row in set, 1 warning (0.00 sec)
7
Ok, a better index: a covering index
mysql> alter table client_visit add key comb(client_id, date, service_location_name);
Query OK, 0 rows affected (38.48 sec)
Records: 0 Duplicates: 0 Warnings: 0
table: cv
partitions: NULL
type: index
possible_keys: FK_client_id,comb
key: comb
key_len: 776
ref: NULL
rows: 10483873
filtered: 100.00
Extra: Using index; Using temporary; Using filesort
Still slow!
8
Ok, how large is the table?
9
Ok, other options in MySQL?
10
Ok, other options in MySQL?
Pre-aggregate in a table:
● group by client_id + avg, sum, …
● group by date + avg,sum
Final report will do another aggregation if needed (see the sketch below)
Problems:
1. Some aggregates can’t be re-aggregated
2. Still too many tables
3. Hard to maintain
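A minimal sketch of the pre-aggregation idea, using the client_visit table from the earlier query (the summary table name and columns are assumptions for illustration):

-- Daily per-client summary, refreshed by a batch job
CREATE TABLE client_visit_daily (
  client_id  INT NOT NULL,
  visit_date DATE NOT NULL,
  visits     INT NOT NULL,
  locations  INT NOT NULL,   -- count(distinct service_location_name) for that day
  PRIMARY KEY (client_id, visit_date)
);

INSERT INTO client_visit_daily
SELECT client_id, date, count(id), count(distinct service_location_name)
FROM client_visit
GROUP BY client_id, date;

-- The report then re-aggregates the (much smaller) summary table
SELECT client_id,
       min(visit_date) AS first_visit,
       max(visit_date) AS last_visit,
       count(*)        AS days_visited,
       sum(visits)     AS visits
FROM client_visit_daily
GROUP BY client_id
ORDER BY visits DESC
LIMIT 10;

-- Problem 1 in practice: a period-wide count(distinct service_location_name)
-- cannot be rebuilt from the per-day "locations" counts.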
11
And it was only the beginning… now this (highly normalized schema):
SELECT
cv.client_id as client_id,
min(date) as first_visit,
max(date) as last_visit,
count(distinct date) as days_visited,
count(distinct cv.id) as visits,
count(distinct cp.cpt_code) as procedures,
count(distinct cv.service_location_name) as locations,
sum(billed_amount) as total_billed,
max(billed_amount) as max_price,
avg(billed_amount) as avg_price
FROM
client_visit cv
join client_procedure cp on cp.encounter_id = cv.encounter_id
join client_procedure_claim cpc on cp.id = cpc.client_procedure_id
join client_claim cc on cc.id = cpc.client_claim_id
GROUP BY client_id
ORDER BY total_billed desc
LIMIT 10
12
4 table JOIN, all large tables
+-----------+-------------+------------+--------------+--------+------------+-----------+--------------+-----------+-------------+
| client_id | first_visit | last_visit | days_visited | visits | procedures | locations | total_billed | max_price | avg_price |
+-----------+-------------+------------+--------------+--------+------------+-----------+--------------+-----------+-------------+
| ....... | 2018-02-14 | 2019-04-22 | 64 | 64 | .. | .. | 200K | 11K | 449.34 |
...
13
Why is MySQL slow for such queries?
14
Why is MySQL slow for such queries?
MySQL (InnoDB) is a row store: it reads whole rows even when a query needs only a few columns, so large aggregations and multi-table JOINs scan far more data than necessary.
15
https://clickhouse.yandex/docs/en/
Column Store Databases
MariaDB Columnstore
https://mariadb.com/kb/en/library/mariadb-columnstore/
16
Column Store Databases
Yandex Clickhouse
https://clickhouse.yandex/
17
Column-store tests
Testing box 1:
● AWS ec2 instance, c5d.4xlarge
● RAM: 32.0 GiB
● vCPU: 16
● Disk: NVMe SSD + EBS
Testing box 2:
● AWS ec2 instance, c5d.18xlarge
● RAM: 144.0 GiB
● vCPU: 72
● Disk: NVMe SSD + EBS
18
Is it worth using column store: Q1
19
Is it worth using column store: Q2
                          MySQL             Clickhouse   ColumnStore
Response time             5 min 18.16 sec   33.83 sec    1 min 2.16 sec
Speed increase vs MySQL   -                 9x (940%)    5x (511%)
20
Table sizes on disk (bytes)
                          MySQL            Clickhouse      ColumnStore
client_visit              5,876,219,904    793,976,832     3,606,462,464
client_procedure          13,841,203,200   2,253,180,928   9,562,865,664
client_procedure_claim    2,466,250,752    292,007,936     335,683,584
client_claim              11,710,496,768   2,400,182,272   6,720,749,568
Total                     33,894,170,624   5,739,347,968   20,225,761,280
Compression: smaller than MySQL (x)        5.91            1.68
21
Exporting from MySQL
Usually 3 options
1. ETL to Star Schema
2. ETL to flat de-normalized tables
3. Copy / replicate realtime (as is)
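A rough sketch of option 2, flattening the four tables from the earlier JOIN into one wide table (column choice is illustrative; billed_amount is assumed to come from client_claim):

CREATE TABLE client_visit_flat AS
SELECT
    cv.client_id,
    cv.date,
    cv.service_location_name,
    cp.cpt_code,
    cc.billed_amount
FROM client_visit cv
JOIN client_procedure cp        ON cp.encounter_id = cv.encounter_id
JOIN client_procedure_claim cpc ON cp.id = cpc.client_procedure_id
JOIN client_claim cc            ON cc.id = cpc.client_claim_id;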
22
Yandex Clickhouse
23
Clickhouse: export from mysql (schema)
https://github.com/Altinity/clickhouse-mysql-data-reader
1. Schema import
$ clickhouse-mysql --create-table-sql \
--src-host=mysql-replica-host \
--src-user=export \
--src-password=xxxxxx \
--src-schemas=main \
--src-tables=client_condition,client_procedure,client_visit
0 rows in set. Elapsed: 17.821 sec. Processed 37.40 million rows, 299.18
MB (2.10 million rows/s., 16.79 MB/s.)
25
Clickhouse: connect using MySQL client
https://github.com/sysown/proxysql/wiki/ClickHouse-Support
$ wget
https://github.com/sysown/proxysql/releases/download/v2.0.4/p
roxysql_2.0.4-ubuntu18_amd64.deb
$ dpkg -i proxysql_2.0.4-clickhouse-ubuntu18_amd64.deb
$ proxysql --clickhouse-server
26
Clickhouse: connect using MySQL client
$ mysql -h 127.0.0.1 -P 6032 ...
Admin> SELECT * FROM clickhouse_users;
Empty set (0.00 sec)
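The table starts empty, so a ClickHouse user has to be added through the admin interface first. A sketch following the usual ProxySQL LOAD/SAVE pattern (credentials match the connection example on the next slide; check the wiki page for the exact steps):

Admin> INSERT INTO clickhouse_users (username, password) VALUES ('clicku', 'clickp');
Admin> LOAD CLICKHOUSE USERS TO RUNTIME;
Admin> SAVE CLICKHOUSE USERS TO DISK;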
https://github.com/sysown/proxysql/wiki/ClickHouse-Support
27
Clickhouse: connect using MySQL client
mysql -h 127.0.0.1 -P 6090 -uclicku -pclickp
...
Server version: 5.5.30 (ProxySQL ClickHouse Module)
29
Clickhouse - joining MySQL source
SELECT count(*)
FROM clients_mysql
┌─count()─┐
│ 1035284 │
└─────────┘
30
Clickhouse - joining MySQL source
First, a view over the MySQL table is created with the mysql() table function:

CREATE VIEW clients_mysql AS
    SELECT * FROM mysql('mysql_host', 'db', 'client', 'export', 'xxxx');

Then it can be JOINed from ClickHouse:

SELECT cv.client_id as client_id ...
FROM
    ...
    INNER JOIN
    (
        SELECT
            user_id,
            is_active
        FROM clients_mysql
    ) AS c ON cv.client_id = c.user_id
WHERE c.is_active = 1
...
32
MariaDB ColumnStore: export from mysql
Schema import - need to create custom schema
mysqldump --no-data
… change engine=InnoDB to engine=Columnstore
33
MariaDB ColumnStore: export from mysql
Schema import - need to create custom schema
mysqldump --no-data does not work out of the box
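One way to script the engine swap (a sketch; the dump usually still needs manual cleanup, e.g. removing index definitions, which ColumnStore does not support):

$ mysqldump --no-data main client_visit client_procedure \
    client_procedure_claim client_claim \
  | sed 's/ENGINE=InnoDB/ENGINE=Columnstore/g' > schema_columnstore.sql
$ mysql main < schema_columnstore.sql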
34
MariaDB ColumnStore: export from mysql (data)
Fastest way -
1. mysql > select into outfile …
2. $ cpimport ...
Easiest way:
1. Import into InnoDB locally (ColumnStore includes a MariaDB server)
2. Run “insert into columnstore_table select * from innodb_table”
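A sketch of the “fastest way” above (the file path is a placeholder; '|' is cpimport's default field delimiter):

mysql> SELECT * INTO OUTFILE '/tmp/client_visit.psv'
    ->   FIELDS TERMINATED BY '|'
    -> FROM client_visit;

$ cpimport main client_visit /tmp/client_visit.psv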
35
MariaDB ColumnStore: Joining MySQL source
1. Export using InnoDB storage engine
2. JOIN across engines
SELECT ...
FROM
client_visit cv
join client_procedure cp on cp.encounter_id = cv.encounter_id
join client_procedure_claim cpc on cp.id = cpc.client_procedure_id
join client_claim cc on cc.id = cpc.client_claim_id
join client_innodb c on cv.client_id = c.user_id
WHERE c.is_active = 1
GROUP BY client_id
ORDER BY total_billed desc
limit 10;
36
Clickhouse: replication from MySQL
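A sketch of continuous replication with the same clickhouse-mysql tool used for the export; flag names follow the Altinity README and may differ between versions (MySQL must use ROW binlog format, and the export user needs replication privileges):

$ clickhouse-mysql \
    --src-server-id=1 \
    --src-wait \
    --src-resume \
    --src-host=mysql-replica-host \
    --src-user=export --src-password=xxxxxx \
    --src-schemas=main \
    --src-tables=client_visit \
    --dst-host=127.0.0.1 \
    --pump-data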
37
ColumnStore: replication from MySQL
Coming soon
https://jira.mariadb.org/browse/MCOL-498
https://jira.mariadb.org/browse/MCOL-593
38
Update / delete - MariaDB ColumnStore
DMLs are usually the slowest in Columnar Stores
Single row:
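For illustration, a single-row change uses ordinary SQL syntax (the id value is a placeholder); the point of the slide is that this is far slower than the same statement on a row store:

UPDATE client_visit SET service_location_name = 'OFFICE' WHERE id = 12345;
DELETE FROM client_visit WHERE id = 12345;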
39
Update / delete - Clickhouse
Implemented as “Alter table update” (mutations)
Asynchronous
https://clickhouse.yandex/docs/en/query_language/alter/#alter-mutations
Single row:
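For illustration, the equivalent single-row change in ClickHouse is written as a mutation (the id value is a placeholder); it is applied asynchronously, and its progress can be watched in system.mutations:

ALTER TABLE client_visit UPDATE service_location_name = 'OFFICE' WHERE id = 12345;
ALTER TABLE client_visit DELETE WHERE id = 12345;

SELECT command, is_done FROM system.mutations WHERE table = 'client_visit';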
40
When NOT to use Column Store db (1)
Single row (full row) by id: select * from client_claim where id = <num>;
41
When NOT to use Column Store db (2)
When MySQL can use an index + LIMIT
42
When NOT to use Column Store db (2)
mysql> explain SELECT * FROM client_claim WHERE place_of_service = 'OFFICE'
ORDER BY received_date DESC LIMIT 10\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: client_claim
partitions: NULL
type: ref
possible_keys: place_of_service_received_date
key: place_of_service_received_date
key_len: 768
ref: const
rows: 13072594
filtered: 100.00
Extra: Using where
43
Other idea: daily online snapshots
● Size of the column store DB is significantly smaller
● We can load daily snapshots
○ I.e. store 30 days of data that can be queried without restoring from backup (see the sketch below)
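One way to sketch this in ClickHouse: partition each daily load by its snapshot date and drop partitions older than 30 days (table and columns are illustrative):

CREATE TABLE client_visit_snap
(
    snapshot_date         Date,
    client_id             UInt32,
    visit_date            Date,
    service_location_name String,
    billed_amount         Decimal(12, 2)
)
ENGINE = MergeTree
PARTITION BY snapshot_date
ORDER BY (client_id, visit_date);

-- retention: drop the oldest daily snapshot
ALTER TABLE client_visit_snap DROP PARTITION '2019-04-01';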
44
Summary: Clickhouse
Advantages
● Fastest queries
● Very efficient storage
Disadvantages
● For JOINs, needs enough RAM to hold all JOINed tables
● No native MySQL protocol (only via ProxySQL)
● Limited standard SQL support (data types, etc.)
45
Summary: MariaDB ColumnStore
Advantages
● Native MySQL protocol - easier to integrate
● Native shared nothing cluster
Disadvantages
● Slower queries
46
Conclusion (1)
47
Conclusion (2)
48
Conclusion (3)
49
Thank you!
50