© 2022 Altinity, Inc.
All About JSON and ClickHouse
Tips, Tricks, and New Features
Robert Hodges and Diego Nieto
26 July 2022
Let’s make some introductions
ClickHouse support and services including Altinity.Cloud
Authors of Altinity Kubernetes Operator for ClickHouse and other open source projects
Robert Hodges: Database geek with 30+ years on DBMS systems. Day job: Altinity CEO
Diego Nieto: Database engineer focused on ClickHouse, PostgreSQL, and DBMS applications
Reading and writing JSON - the basics
JSON is pervasive as raw data
head http_logs.json
{"@timestamp": 895873059, "clientip":"54.72.5.0", "request":
"GET /images/home_bg_stars.gif HTTP/1.1", "status": 200,
"size": 2557}
{"@timestamp": 895873059, "clientip":"53.72.5.0", "request":
"GET /images/home_tool.gif HTTP/1.0", "status": 200, "size":
327}
...
Web server log data
Reading and writing JSON data to/from tables
Diagram: JSON documents map to rows of a SQL table, and every key is a column.
{"@timestamp":"1998-05-22 21:37:39","clientip":"54.72.5.0",...}
Loading raw JSON using JSONEachRow input format
CREATE TABLE http_logs_tabular (
`@timestamp` DateTime,
`clientip` IPv4,
`status` UInt16,
`request` String,
`size` UInt32
) ENGINE = MergeTree
PARTITION BY toStartOfDay(`@timestamp`)
ORDER BY `@timestamp`
clickhouse-client --query \
'INSERT INTO http_logs_tabular Format JSONEachRow' \
< http_logs.json
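As a rough illustration of what the JSONEachRow input format expects, the Python sketch below treats the input as one JSON object per line and maps each key to a column of the table above. The function name `rows_from_json_each_row` is made up for this example, and reading `@timestamp` as UTC epoch seconds is an assumption; ClickHouse does this parsing itself during the INSERT.

```python
import json
from datetime import datetime, timezone

# Column order of http_logs_tabular, taken from the CREATE TABLE above.
COLUMNS = ["@timestamp", "clientip", "status", "request", "size"]

def rows_from_json_each_row(text):
    """Turn newline-delimited JSON into tuples, one value per column."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        obj = json.loads(line)  # each line is a complete JSON object
        row = []
        for col in COLUMNS:
            value = obj[col]
            if col == "@timestamp":
                # Assumption: epoch seconds, interpreted as UTC.
                value = datetime.fromtimestamp(value, tz=timezone.utc)
            row.append(value)
        rows.append(tuple(row))
    return rows

sample = (
    '{"@timestamp": 895873059, "clientip":"54.72.5.0", "request": '
    '"GET /images/home_bg_stars.gif HTTP/1.1", "status": 200, "size": 2557}\n'
    '{"@timestamp": 895873059, "clientip":"53.72.5.0", "request": '
    '"GET /images/home_tool.gif HTTP/1.0", "status": 200, "size": 327}\n'
)
rows = rows_from_json_each_row(sample)
```

Keys can appear in any order inside each line; the column list decides where each value ends up.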
Writing JSON using JSONEachRow output format
SELECT * FROM http_logs_tabular
LIMIT 2
FORMAT JSONEachRow
{"@timestamp":"1998-05-22 21:37:39","clientip":"54.72.5.0","status":200,"request":"GET /images/home_bg_stars.gif HTTP/1.1","size":2557}
{"@timestamp":"1998-05-22 21:37:39","clientip":"53.72.5.0","status":200,"request":"GET /images/home_tool.gif HTTP/1.0","size":327}
Storing JSON data in Strings
Mapping JSON to a blob with optional derived columns
Diagram: the JSON document {"@timestamp":"1998-05-22 21:37:39","clientip":"54.72.5.0",...} is stored in a SQL table as a JSON String (“blob”) with derived header values.
Start by storing the JSON as a String
CREATE TABLE http_logs
(
`file` String,
`message` String
)
ENGINE = MergeTree
PARTITION BY file
ORDER BY tuple()
SETTINGS index_granularity = 8192
-- “Blob”: the message column holds the whole JSON document as a String
Load data whatever way is easiest...
head http_logs.csv
"file","message"
"documents-211998.json","{""@timestamp"": 895873059,
""clientip"":""54.72.5.0"", ""request"": ""GET
/images/home_bg_stars.gif HTTP/1.1"", ""status"": 200, ""size"":
2557}"
"documents-211998.json","{""@timestamp"": 895873059,
""clientip"":""53.72.5.0"", ""request"": ""GET /images/home_tool.gif
HTTP/1.0"", ""status"": 200, ""size"": 327}"
...
clickhouse-client --query \
'INSERT INTO http_logs Format CSVWithNames' \
< http_logs.csv
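The doubled quotes in the CSV sample are ordinary CSV escaping of the JSON text. A minimal Python sketch, assuming the standard `csv` module as a stand-in for whatever tool produced `http_logs.csv`, shows how a JSON document survives the round trip as a single quoted field. The helper names `to_csv` and `from_csv` are hypothetical.

```python
import csv
import io
import json

def to_csv(records):
    """Write (file, JSON object) pairs as CSV with a header row, quoting every field."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["file", "message"])
    for file_name, obj in records:
        # The whole JSON document becomes one CSV field (the "blob").
        writer.writerow([file_name, json.dumps(obj)])
    return buf.getvalue()

def from_csv(text):
    """Read the CSV back and parse the message field as JSON again."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["file"], json.loads(row["message"])) for row in reader]

records = [
    ("documents-211998.json",
     {"@timestamp": 895873059, "clientip": "54.72.5.0",
      "request": "GET /images/home_bg_stars.gif HTTP/1.1",
      "status": 200, "size": 2557}),
]
csv_text = to_csv(records)
round_trip = from_csv(csv_text)
```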
You can query using JSON* functions
-- Get a JSON string value
SELECT JSONExtractString(message, 'request') AS request
FROM http_logs LIMIT 3
-- Get a JSON numeric value
SELECT JSONExtractInt(message, 'status') AS status
FROM http_logs LIMIT 3
-- Use values to answer useful questions.
SELECT JSONExtractInt(message, 'status') AS status, count() AS count
FROM http_logs
WHERE status >= 400
AND toDateTime(JSONExtractUInt(message, '@timestamp')) BETWEEN
'1998-05-20 00:00:00' AND '1998-05-20 23:59:59'
GROUP BY status ORDER BY status
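To make the cost of this style of query concrete, here is a small Python sketch of the same logic: every stored message is parsed again, the two fields are pulled out, and only then can rows be filtered and counted. The function names and sample rows are invented for illustration, and timestamps are assumed to be UTC.

```python
import json
from collections import Counter
from datetime import datetime, timezone

def error_counts(messages, start, end):
    """Count rows per status for status >= 400 inside [start, end]."""
    counts = Counter()
    for message in messages:
        doc = json.loads(message)  # every row is parsed again on every query
        status = int(doc.get("status", 0))  # a missing key counts as 0 here
        ts = datetime.fromtimestamp(doc["@timestamp"], tz=timezone.utc)
        if status >= 400 and start <= ts <= end:
            counts[status] += 1
    return sorted(counts.items())

def make_message(ts, status):
    """Build a hypothetical log line with only the two fields used above."""
    return json.dumps({"@timestamp": int(ts.timestamp()), "status": status})

day = datetime(1998, 5, 20, tzinfo=timezone.utc)
messages = [
    make_message(day.replace(hour=1), 404),
    make_message(day.replace(hour=2), 200),
    make_message(day.replace(hour=3), 404),
    make_message(day.replace(hour=4), 500),
    make_message(datetime(1998, 5, 21, 9, tzinfo=timezone.utc), 404),
]
result = error_counts(messages, day, day.replace(hour=23, minute=59, second=59))
```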
JSON* vs visitParam functions
-- Get using JSON* function: complete JSON parser, SLOWER
SELECT JSONExtractString(message, 'request')
FROM http_logs LIMIT 3
-- Get using visitParam function: FASTER, but cannot distinguish
-- the same name in different structures
SELECT visitParamExtractString(message, 'request')
FROM http_logs LIMIT 3
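The trade-off can be sketched in Python. The "simple" extractor below just scans the text for the first `"key":"..."` pair at any depth, while the "full" one parses the document and looks only at the top level. Both functions are hypothetical stand-ins written for this example, not the actual ClickHouse implementations.

```python
import json
import re

def simple_extract_string(blob, key):
    """Naive scan: value of the first "key":"..." pair anywhere in the text."""
    pattern = r'"%s"\s*:\s*"((?:[^"\\]|\\.)*)"' % re.escape(key)
    match = re.search(pattern, blob)
    return match.group(1) if match else ""

def full_extract_string(blob, key):
    """Full parse: only a top-level string value under this key counts."""
    value = json.loads(blob).get(key)
    return value if isinstance(value, str) else ""

# The same name appears in a nested object and at the top level.
blob = '{"proxy": {"request": "GET /health"}, "request": "GET /index.html"}'
simple_value = simple_extract_string(blob, "request")
full_value = full_extract_string(blob, "request")
```

The scan never builds the document structure, which is why it is cheaper and also why it picks up the nested `request` first.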
We can improve usability by ordering data
CREATE TABLE http_logs_sorted (
`file` String,
`message` String,
timestamp DateTime DEFAULT
toDateTime(JSONExtractUInt(message, '@timestamp'))
)
ENGINE = MergeTree
PARTITION BY toStartOfMonth(timestamp)
ORDER BY timestamp
INSERT INTO http_logs_sorted
SELECT file, message FROM http_logs
And still further by adding more columns
ALTER TABLE http_logs_sorted
ADD COLUMN `status` Int16 DEFAULT JSONExtractInt(message,
'status') CODEC(ZSTD(1))
ALTER TABLE http_logs_sorted
ADD COLUMN `request` String DEFAULT
JSONExtractString(message, 'request')
-- Force columns to be materialized
ALTER TABLE http_logs_sorted
UPDATE status=status, request=request
WHERE 1
Our query is now simpler...
SELECT
status, count() as count
FROM http_logs_sorted WHERE status >= 400 AND
timestamp BETWEEN
'1998-05-20 00:00:00' AND '1998-05-20 23:59:59'
GROUP BY status ORDER BY status
And MUCH faster!
SELECT
status, count() as count
FROM http_logs_sorted WHERE status >= 400 AND
timestamp BETWEEN
'1998-05-20 00:00:00' AND '1998-05-20 23:59:59'
GROUP BY status ORDER BY status
0.014 seconds vs 9.8 seconds!
● Can use primary key index to drop blocks
● 100x less I/O to read
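Why ordering by timestamp lets the engine drop blocks can be shown with a toy sparse index in Python: keep the first key of each block of sorted rows, then use binary search to find the few blocks a range query can touch. This is a simplified model with made-up function names and a tiny block size, not the MergeTree implementation.

```python
from bisect import bisect_left, bisect_right

def build_sparse_index(sorted_keys, granule_size):
    """Keep the first key of every granule (block of consecutive rows)."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule_size)]

def granules_to_read(index, lo, hi):
    """Indices of granules that may contain keys in [lo, hi]."""
    # Step one granule back from lo to stay safe when keys repeat
    # across a granule boundary.
    first = max(bisect_left(index, lo) - 1, 0)
    last = bisect_right(index, hi) - 1
    return list(range(first, last + 1)) if last >= first else []

keys = list(range(100))                 # stand-in for sorted timestamps
index = build_sparse_index(keys, 10)    # 10 granules of 10 rows each
hit = granules_to_read(index, 35, 47)   # range inside granules 3 and 4
miss = granules_to_read(index, -5, -1)  # range before the first key
```

Only 2 of the 10 granules need to be read for the range 35 to 47; without the sort order, every row would have to be parsed and checked.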
Using paired arrays and maps for JSON
Representing JSON as paired arrays and maps
Diagram: two ways to store {"@timestamp":"1998-05-22 21:37:39","clientip":"54.72.5.0",...} in a SQL table.
Arrays: header values with key-value pairs, held in an array of keys and an array of values
Map: header values with mapped key-value pairs, held in a map with key/values
Storing JSON in paired arrays
CREATE TABLE http_logs_arrays (
`file` String,
`keys` Array(String),
`values` Array(String),
timestamp DateTime CODEC(Delta, ZSTD(1))
)
ENGINE = MergeTree
PARTITION BY toStartOfMonth(timestamp)
ORDER BY timestamp
Loading JSON to paired arrays
-- Load data. Might be better to format outside ClickHouse.
INSERT into http_logs_arrays(file, keys, values, timestamp)
SELECT file,
arrayMap(x -> x.1,
JSONExtractKeysAndValues(message, 'String')) keys,
arrayMap(x -> x.2,
JSONExtractKeysAndValues(message, 'String')) values,
toDateTime(JSONExtractUInt(message, '@timestamp'))
timestamp
FROM http_logs limit 30000000
Querying values in arrays
-- Run a query.
SELECT values[indexOf(keys, 'status')] status, count()
FROM http_logs_arrays
GROUP BY status ORDER BY status
status|count() |
------|--------|
200 |24917090|
206 | 64935|
302 | 1941|
304 | 4899616|
400 | 888|
404 | 115005|
500 | 525|
4-5x faster than accessing JSON string objects
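The `values[indexOf(keys, 'status')]` idiom can be modelled in a few lines of Python. The sketch assumes 1-based positions, a position of 0 for a key that is not found, and an empty string when the position is out of range; the helper names and sample rows are made up.

```python
import json
from collections import Counter

def to_paired_arrays(message):
    """Split a flat JSON object into parallel key and value arrays of strings."""
    doc = json.loads(message)
    keys = list(doc)
    values = [v if isinstance(v, str) else json.dumps(v) for v in doc.values()]
    return keys, values

def index_of(arr, item):
    """1-based position of item, or 0 when it is absent."""
    try:
        return arr.index(item) + 1
    except ValueError:
        return 0

def element(arr, position, default=""):
    """1-based array access; out-of-range positions give the default."""
    return arr[position - 1] if 1 <= position <= len(arr) else default

def lookup(keys, values, key):
    return element(values, index_of(keys, key))

rows = [to_paired_arrays(m) for m in [
    '{"clientip": "54.72.5.0", "status": 200}',
    '{"clientip": "53.72.5.0", "status": 404}',
    '{"clientip": "53.72.5.1"}',
]]
status_counts = sorted(Counter(lookup(k, v, "status") for k, v in rows).items())
```

Because the document is split once at load time, a query only searches a short key array per row instead of parsing JSON text.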
Another way to store JSON objects: Maps
CREATE TABLE http_logs_map (
`file` String, `message` Map(String, String),
timestamp DateTime
DEFAULT toDateTime(toUInt32(message['@timestamp']))
CODEC(Delta, ZSTD(1))
)
ENGINE = MergeTree
PARTITION BY toStartOfMonth(timestamp)
ORDER BY timestamp
Loading and querying JSON in Maps
-- Load data
INSERT into http_logs_map(file, message)
SELECT file,
JSONExtractKeysAndValues(message, 'String') message
FROM http_logs
-- Run a query.
SELECT message['status'] status, count()
FROM http_logs_map
GROUP BY status ORDER BY status
4-5x faster than accessing JSON string objects
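A Python dictionary gives a fair picture of a Map(String, String) column: every value is kept as text, and typed columns such as the timestamp are derived by converting that text. The sketch below assumes UTC and an empty string for missing keys; `to_string_map` and `derived_timestamp` are hypothetical helper names.

```python
import json
from datetime import datetime, timezone

def to_string_map(message):
    """Model Map(String, String): every top-level value is kept as text."""
    doc = json.loads(message)
    return {k: v if isinstance(v, str) else json.dumps(v) for k, v in doc.items()}

def derived_timestamp(string_map):
    """Model the DEFAULT expression: text, then integer, then date and time."""
    return datetime.fromtimestamp(int(string_map["@timestamp"]), tz=timezone.utc)

row = to_string_map(
    '{"@timestamp": 895873059, "clientip": "54.72.5.0", "status": 200}'
)
status = row.get("status", "")    # values come back as strings
missing = row.get("referer", "")  # assumed default for an absent key
timestamp = derived_timestamp(row)
```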
The JSON Data Type
New in 22.3
Mapping complex data to a JSON data type column
Diagram: {Complex JSON} is stored in a SQL table as a JSON data type column (“blob”) alongside other column values.
How did JSON work until now?
● JSON stored in String columns
● Two parsers:
○ Simple parser
○ Full-fledged parser
● One set of functions for each parser:
○ Family of simpleJSON* functions that only work for simple, non-nested JSON objects
■ visitParamExtractUInt = simpleJSONExtractUInt
○ Family of JSONExtract* functions that can parse any JSON object completely
■ JSONExtractUInt, JSONExtractString, JSONExtractArrayRaw …
Query time!
How did JSON work until now?
WITH JSONExtract(json, 'Tuple(a UInt32, b UInt32, c Nested(d UInt32, e
String))') AS parsed_json
SELECT JSONExtractUInt(json, 'a') AS a, JSONExtractUInt(json, 'b') AS b,
JSONExtractArrayRaw(json, 'c') AS array_c, tupleElement(parsed_json, 'a')
AS a_tuple, tupleElement(parsed_json, 'b') AS b_tuple,
tupleElement(parsed_json, 'c') AS array_c_tuple,
tupleElement(tupleElement(parsed_json, 'c'), 'd') AS `c.d`,
tupleElement(tupleElement(parsed_json, 'c'), 'e') AS `c.e`
FROM ( SELECT '{"a":1,"b":2,"c":[{"d":3,"e":"str_1"},
{"d":4,"e":"str_2"}, {"d":3,"e":"str_1"}, {"d":4,"e":"str_1"},
{"d":7,"e":"str_9"}]}' AS json )
FORMAT Vertical
Let’s dive in!
How did JSON work until now?
1. Approach A: Using tuples
1.1. Parse the JSON with the JSONExtract function in a CTE (WITH clause) to get a tuple with the structure of the document
1.2. Use the tupleElement function to extract fields from the tuple; nest tupleElement calls to reach nested fields
2. Approach B: Direct
2.1. Use JSONExtractUInt/JSONExtractArrayRaw to extract the values directly
Both require multiple passes:
● Tuple approach = 2 passes (CTE + query)
● Direct approach = 3 passes: two ints (a and b) and an array (array_c)
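The difference between the two approaches can be made visible with a counting parser in Python. Note that this sketch counts only JSON parses, so the tuple approach shows a single parse where the slide counts the CTE and the query over the tuple as two passes. `CountingParser` and both functions are invented for this illustration.

```python
import json

class CountingParser:
    """Wraps json.loads and counts how often the text is parsed."""
    def __init__(self):
        self.passes = 0

    def parse(self, text):
        self.passes += 1
        return json.loads(text)

doc_text = '{"a":1,"b":2,"c":[{"d":3,"e":"str_1"},{"d":4,"e":"str_2"}]}'

def tuple_approach(parser, text):
    """Parse once into a structure, then read elements from it."""
    parsed = parser.parse(text)
    return parsed["a"], parsed["b"], [item["d"] for item in parsed["c"]]

def direct_approach(parser, text):
    """One extraction call per field, each parsing the text again."""
    a = parser.parse(text)["a"]
    b = parser.parse(text)["b"]
    c = parser.parse(text)["c"]
    return a, b, [item["d"] for item in c]

tuple_parser = CountingParser()
direct_parser = CountingParser()
tuple_result = tuple_approach(tuple_parser, doc_text)
direct_result = direct_approach(direct_parser, doc_text)
```

The direct approach parses once per extracted field, which matches the three passes listed above for a, b, and array_c.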
New JSON
● ClickHouse parses JSON data at INSERT time.
● Automatic inference and creation of the underlying table structure
● JSON object stored in a columnar ClickHouse native format
● Named tuple and array notation to query JSON objects: array[x] | tuple.element
Diagram: Ingestor, Parsing, Conversion, Storage Layer. Raw JSON becomes extracted fields, which become columns with ClickHouse type definitions.
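A toy version of "parse at INSERT time and store by column" can be written in Python: each document is flattened into dotted paths, and every path gets its own list of values. This is a simplified sketch with hypothetical names and sample documents; in particular, it backfills missing values with None, whereas a real column would hold a type-specific default.

```python
import json

def flatten(obj, prefix):
    """Yield (dotted_path, value) for every leaf of a nested JSON object."""
    for key, value in obj.items():
        path = prefix + "." + key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value

def insert(columns, row_count, line):
    """Parse one JSON document and append each leaf to its own column."""
    leaves = dict(flatten(json.loads(line), "raw"))
    for path in leaves:
        # A key seen for the first time: backfill the earlier rows.
        columns.setdefault(path, [None] * row_count)
    for path, column in columns.items():
        column.append(leaves.get(path))
    return row_count + 1

columns = {}
row_count = 0
for line in [
    '{"qid": "1", "tag": ["clickhouse", "json"], "owner": {"name": "x"}}',
    '{"qid": "2", "title": "t"}',
]:
    row_count = insert(columns, row_count, line)
```

After the two inserts there are four columns (`raw.qid`, `raw.tag`, `raw.owner.name`, `raw.title`), and a query that needs only `raw.tag` never touches the others.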
New JSON storage format
New JSON
SET allow_experimental_object_type = 1;
CREATE TABLE json_test.stack_overflow_js (`raw` JSON)
ENGINE = MergeTree ORDER BY tuple();
INSERT INTO stack_overflow_js
SELECT json
FROM file('stack_overflow_nested.json.gz', JSONAsObject);
SELECT count(*) FROM stack_overflow_js;
11203029 rows in set. Elapsed: 2.323 sec. Processed 11.20 million rows, 3.35 GB (4.82
million rows/s., 1.44 GB/s.)
New JSON useful settings
SET describe_extend_object_types = 1;
DESCRIBE TABLE stack_overflow_js;
--Basic structure
SET describe_include_subcolumns = 1;
DESCRIBE TABLE stack_overflow_js FORMAT Vertical;
--Columns included
SET output_format_json_named_tuples_as_objects = 1;
SELECT raw FROM stack_overflow_js LIMIT 1 FORMAT JSONEachRow;
--JSON full structure
New vs Old-school
stack_overflow_js vs stack_overflow_str:
CREATE TABLE nested_json.stack_overflow_js (`raw` JSON)
ENGINE = MergeTree ORDER BY tuple();
CREATE TABLE nested_json.stack_overflow_str (`raw` String)
ENGINE = MergeTree ORDER BY tuple();
● topK stack_overflow_str:
SELECT topK(100)(arrayJoin(JSONExtract(raw, 'tag','Array(String)')))
FROM stack_overflow_str;
1 rows in set. Elapsed: 2.101 sec. Processed 11.20 million rows, 3.73 GB (5.33 million rows/s., 1.77 GB/s.)
● topK stack_overflow_js:
SELECT topK(100)(arrayJoin(raw.tag)) FROM stack_overflow_js
1 rows in set. Elapsed: 0.331 sec. Processed 11.20 million rows, 642.07 MB (33.90 million rows/s., 1.94 GB/s.)
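For readers unfamiliar with the `topK(...)(arrayJoin(...))` pattern, the Python sketch below shows the shape of the computation: unfold each row's tag array into individual tags, then keep the most frequent ones. It uses an exact count as a stand-in, whereas topK returns approximate results; the function name and sample rows are made up.

```python
from collections import Counter
from itertools import chain

def top_k(rows, k):
    """Exact top-k over all array elements (stand-in for approximate topK)."""
    # chain.from_iterable plays the role of arrayJoin: one item per tag.
    counts = Counter(chain.from_iterable(rows))
    return [tag for tag, _ in counts.most_common(k)]

tag_rows = [
    ["sql", "json"],
    ["json"],
    ["json", "clickhouse", "sql"],
]
top_two = top_k(tag_rows, 2)
```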
Limitations:
● What happens if there are schema changes?
○ Column type changes, new keys, deleted keys …
○ Insert a new JSON document like this: { "foo": "10", "bar": 10 }
■ ClickHouse will create a new part for this JSON
■ ClickHouse will create a tuple structure: raw.foo and raw.bar
■ OPTIMIZE TABLE FINAL
● New mixed tuple = stack_overflow tuple + foobar tuple
● Problems:
○ No errors or warnings during insertions
○ Malformed JSON will pollute our data
○ We cannot select slices like raw.answers.*
○ ClickHouse creates a dynamic column per JSON key (our JSON has 1K keys, so 1K columns)
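The "mixed tuple" idea can be pictured as a union of the fields found in each part. The Python sketch below only records which types were seen per field and does not try to guess how ClickHouse resolves a conflict. The field lists are shortened, hypothetical versions of the stack_overflow and foo/bar structures.

```python
def merge_structures(*parts):
    """Union of the fields of several parts: name -> sorted list of types seen."""
    seen = {}
    for part in parts:
        for name, type_name in part.items():
            seen.setdefault(name, set()).add(type_name)
    return {name: sorted(types) for name, types in sorted(seen.items())}

# Shortened field lists, assumed for illustration only.
stack_overflow_part = {
    "creationDate": "String", "qid": "String",
    "tag": "Array(String)", "title": "String", "user": "String",
}
foobar_part = {"bar": "String", "foo": "Int8"}

merged = merge_structures(stack_overflow_part, foobar_part)
# A later part where foo arrives as text instead of a number.
changed = merge_structures(foobar_part, {"foo": "String"})
```

Any field that ends up with more than one type, such as `foo` in `changed`, is exactly the kind of silent schema drift the list above warns about.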
Check tuple structure:
INSERT INTO stack_overflow_js VALUES ('{ "bar": "hello", "foo": 1 }');
SELECT table,
column,
name AS part_name,
type,
subcolumns.names,
subcolumns.types
FROM system.parts_columns
WHERE table = 'stack_overflow_js'
FORMAT Vertical
Check tuple structure:
Row 1:
──────
table: stack_overflow_js
column: raw
part_name: all_12_22_5
type: Tuple(answers Nested(date String, user String), creationDate String, qid String, tag
Array(String), title String, user String)
subcolumns.names:
['answers','answers.size0','answers.date','answers.user','creationDate','qid','tag','tag.size0','title','user']
subcolumns.types: ['Nested(date String, user
String)','UInt64','Array(String)','Array(String)','String','String','Array(String)','UInt64','String','String']
subcolumns.serializations:
['Default','Default','Default','Default','Default','Default','Default','Default','Default','Default']
Row 2:
──────
table: stack_overflow_js
column: raw
part_name: all_23_23_0
type: Tuple(bar String, foo Int8)
subcolumns.names: ['bar','foo']
subcolumns.types: ['String','Int8']
subcolumns.serializations: ['Default','Default']
Improvements:
● CODEC changes: LZ4 vs ZSTD
SELECT table, column,
formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed
FROM system.parts_columns
WHERE table IN ('stack_overflow_js', 'stack_overflow_str') AND column IN ('raw')
GROUP BY table, column
● ALTER TABLEs
ALTER TABLE stack_overflow_str MODIFY COLUMN raw CODEC(ZSTD(3));
ALTER TABLE stack_overflow_js MODIFY COLUMN raw CODEC(ZSTD(3));
table             |column|LZ4     |ZSTD      |uncompressed|
------------------|------|--------|----------|------------|
stack_overflow_str|raw   |1.73 GiB|1.23 GiB  |3.73 GiB    |
stack_overflow_js |raw   |1.30 GiB|886.77 MiB|2.29 GiB    |
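The sizes in the table translate into compression ratios with a little arithmetic, sketched here in Python. The calculation assumes the 886.77 figure for the JSON table under ZSTD is in MiB, since a value in GiB would exceed the uncompressed size; the dictionary layout and function name are made up for the example.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# Sizes in bytes, copied from the table above (886.77 assumed to be MiB).
sizes = {
    "stack_overflow_str": {
        "LZ4": 1.73 * GIB, "ZSTD": 1.23 * GIB, "uncompressed": 3.73 * GIB,
    },
    "stack_overflow_js": {
        "LZ4": 1.30 * GIB, "ZSTD": 886.77 * MIB, "uncompressed": 2.29 * GIB,
    },
}

def ratio(table, codec):
    """Uncompressed size divided by compressed size, rounded to 2 decimals."""
    entry = sizes[table]
    return round(entry["uncompressed"] / entry[codec], 2)

ratios = {
    (table, codec): ratio(table, codec)
    for table in sizes
    for codec in ("LZ4", "ZSTD")
}
```

Under these assumptions, ZSTD improves the ratio for both tables, and the JSON column is smaller than the String column in absolute terms under either codec.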
Improvements
● Query times: LZ4 vs ZSTD
○ LZ4
■ 0.3s New vs 2.1s Old
○ ZSTD
■ 0.4s New vs 2.8s Old
table             |column|LZ4 |ZSTD|comp.ratio|
------------------|------|----|----|----------|
stack_overflow_str|raw   |2.1s|2.8s|12%       |
stack_overflow_js |raw   |0.3s|0.4s|10%       |
Wrap-up and References
Secrets to JSON happiness in ClickHouse
● Use JSON formats to read and write JSON data
● Fetch JSON String data with JSONExtract*/visitParam* functions
● Store JSON in paired arrays or maps
● (NEW) The new JSON data type stores data efficiently and offers convenient query syntax
○ It’s still experimental
More things to look at by yourself
● Using materialized views to populate JSON data
● Indexing JSON data
○ Indexes on JSON data type columns
○ Bloom filters on blobs
● More compression and codec tricks
Where to get more information
ClickHouse Docs: https://clickhouse.com/docs/
Altinity Knowledge Base: https://kb.altinity.com/
Altinity Blog: https://altinity.com
ClickHouse Source Code and Tests: https://github.com/ClickHouse/ClickHouse
● Especially tests
Thank you!
Questions?
https://altinity.com
Altinity.Cloud
Altinity Support
Altinity Stable Builds
We’re hiring!

All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINAL.pdf

  • 1. © 2022 Altinity, Inc. All About JSON and ClickHouse Tips, Tricks, and New Features Robert Hodges and Diego Nieto 26 July 2022 1 Copyright © Altinity Inc 2022
  • 2. © 2022 Altinity, Inc. Let’s make some introductions ClickHouse support and services including Altinity.Cloud Authors of Altinity Kubernetes Operator for ClickHouse and other open source projects Robert Hodges Database geek with 30+ years on DBMS systems. Day job: Altinity CEO Diego Nieto Database engineer focused on ClickHouse, PostgreSQL, and DBMS applications 2
  • 3. © 2022 Altinity, Inc. Reading and writing JSON - the basics 3
  • 4. © 2022 Altinity, Inc. JSON is pervasive as raw data head http_logs.json {"@timestamp": 895873059, "clientip":"54.72.5.0", "request": "GET /images/home_bg_stars.gif HTTP/1.1", "status": 200, "size": 2557} {"@timestamp": 895873059, "clientip":"53.72.5.0", "request": "GET /images/home_tool.gif HTTP/1.0", "status": 200, "size": 327} ... Web server log data
  • 5. © 2022 Altinity, Inc. Reading and writing JSON data to/from tables SQL Table Every key is a column {"@timestamp":" 1998-05-22 21:37:39","clienti p":"54.72.5.0",...} {"@timestamp":" 1998-05-22 21:37:39","clienti p":"54.72.5.0",...}
  • 6. © 2022 Altinity, Inc. Loading raw JSON using JSONEachRow input format CREATE TABLE http_logs_tabular ( `@timestamp` DateTime, `clientip` IPv4, `status` UInt16, `request` String, `size` UInt32 ) ENGINE = MergeTree PARTITION BY toStartOfDay(`@timestamp`) ORDER BY `@timestamp` clickhouse-client --query 'INSERT INTO http_logs_tabular Format JSONEachRow' < http_logs_tabular
  • 7. © 2022 Altinity, Inc. Writing JSON using JSONEachRow output format SELECT * FROM http_logs_tabular LIMIT 2 FORMAT JSONEachRow {"@timestamp":"1998-05-22 21:37:39","clientip":"54.72.5.0","status":200,"request":"GET /images/home_bg_stars.gif HTTP/1.1","size":2557} {"@timestamp":"1998-05-22 21:37:39","clientip":"53.72.5.0","status":200,"request":"GET /images/home_tool.gif HTTP/1.0","size":327}
  • 8. © 2022 Altinity, Inc. Storing JSON data in Strings 8
  • 9. © 2022 Altinity, Inc. Mapping JSON to a blob with optional derived columns {"@timestamp":" 1998-05-22 21:37:39","clienti p":"54.72.5.0",...} SQL Table JSON String JSON String (“blob”) with derived header values
  • 10. © 2022 Altinity, Inc. Start by storing the JSON as a String CREATE TABLE http_logs ( `file` String, `message` String ) ENGINE = MergeTree PARTITION BY file ORDER BY tuple() SETTINGS index_granularity = 8192 “Blob”
  • 11. © 2022 Altinity, Inc. Load data whatever way is easiest... head http_logs.csv "file","message" "documents-211998.json","{""@timestamp"": 895873059, ""clientip"":""54.72.5.0"", ""request"": ""GET /images/home_bg_stars.gif HTTP/1.1"", ""status"": 200, ""size"": 2557}" "documents-211998.json","{""@timestamp"": 895873059, ""clientip"":""53.72.5.0"", ""request"": ""GET /images/home_tool.gif HTTP/1.0"", ""status"": 200, ""size"": 327}" ... clickhouse-client --query 'INSERT INTO http_logs Format CSVWithNames' < http_logs.csv
  • 12. © 2022 Altinity, Inc. You can query using JSON* functions -- Get a JSON string value SELECT JSONExtractString(message, 'request') AS request FROM http_logs LIMIT 3 -- Get a JSON numeric value SELECT JSONExtractInt(message, 'status') AS status FROM http_logs LIMIT 3 -- Use values to answer useful questions. SELECT JSONExtractInt(message, 'status') AS status, count() as count FROM http_logs WHERE status >= 400 WHERE toDateTime(JSONExtractUInt32(message, '@timestamp') BETWEEN '1998-05-20 00:00:00' AND '1998-05-20 23:59:59' GROUP BY status ORDER BY status
  • 13. JSON* vs visitParam functions
      -- Get using JSON function: SLOWER, complete JSON parser
      SELECT JSONExtractString(message, 'request')
      FROM http_logs LIMIT 3
      -- Get using visitParam function: FASTER, but cannot distinguish the same name in different structures
      SELECT visitParamExtractString(message, 'request')
      FROM http_logs LIMIT 3
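The trade-off can be illustrated with a rough Python model. The scanner below is a simplified stand-in for a visitParam-style extractor (it looks for the first textual match of the key at any depth), while `json.loads` stands in for a complete parser. `simple_extract_uint` is a hypothetical helper written for this sketch, not the ClickHouse implementation.

```python
import json

def simple_extract_uint(text, name):
    """Find the first '"name":' occurrence and read the digits that follow it."""
    needle = '"' + name + '":'
    start = text.find(needle)
    if start == -1:
        return 0
    pos = start + len(needle)
    digits = ""
    while pos < len(text) and text[pos].isdigit():
        digits += text[pos]
        pos += 1
    return int(digits) if digits else 0

# The same key name appears in a nested object and at the top level.
doc = '{"upstream":{"status":404},"status":200}'
scanned = simple_extract_uint(doc, "status")  # first textual match, nested one here
parsed = json.loads(doc)["status"]            # top-level key after a full parse
print(scanned, parsed)
```

The scan never builds a document tree, which is why it is cheap, and also why it returns the nested value rather than the top-level one.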
  • 14. We can improve usability by ordering data
      CREATE TABLE http_logs_sorted (
        `file` String,
        `message` String,
        timestamp DateTime DEFAULT toDateTime(JSONExtractUInt(message, '@timestamp'))
      ) ENGINE = MergeTree
      PARTITION BY toStartOfMonth(timestamp)
      ORDER BY timestamp;
      INSERT INTO http_logs_sorted (file, message)
      SELECT file, message FROM http_logs
  • 15. And still further by adding more columns
      ALTER TABLE http_logs_sorted
        ADD COLUMN `status` Int16 DEFAULT JSONExtractInt(message, 'status') CODEC(ZSTD(1));
      ALTER TABLE http_logs_sorted
        ADD COLUMN `request` String DEFAULT JSONExtractString(message, 'request');
      -- Force columns to be materialized
      ALTER TABLE http_logs_sorted
        UPDATE status=status, request=request WHERE 1
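The idea behind these DEFAULT columns can be modeled in a few lines of Python: when a row arrives with only `file` and `message`, the remaining columns are computed from the blob. This is a sketch under assumptions. `DEFAULTS` and `insert_row` are hypothetical names, and the timestamp conversion assumes UTC.

```python
import json
from datetime import datetime, timezone

# Hypothetical DEFAULT expressions, keyed by column name; each reads the raw message.
DEFAULTS = {
    "timestamp": lambda msg: datetime.fromtimestamp(
        json.loads(msg)["@timestamp"], tz=timezone.utc
    ),
    "status": lambda msg: int(json.loads(msg).get("status", 0)),
    "request": lambda msg: str(json.loads(msg).get("request", "")),
}

def insert_row(file, message, **explicit):
    """Build a full row; columns not supplied explicitly fall back to their default."""
    row = {"file": file, "message": message}
    for column, default in DEFAULTS.items():
        row[column] = explicit[column] if column in explicit else default(message)
    return row

msg = '{"@timestamp": 895873059, "request": "GET /images/home_tool.gif HTTP/1.0", "status": 200}'
row = insert_row("documents-211998.json", msg)
print(row["status"], row["request"])
```

Once the derived values are stored next to the blob, queries can read them without parsing the JSON again.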
  • 16. Our query is now simpler...
      SELECT status, count() AS count
      FROM http_logs_sorted
      WHERE status >= 400
        AND timestamp BETWEEN '1998-05-20 00:00:00' AND '1998-05-20 23:59:59'
      GROUP BY status ORDER BY status
  • 17. And MUCH faster!
      SELECT status, count() AS count
      FROM http_logs_sorted
      WHERE status >= 400
        AND timestamp BETWEEN '1998-05-20 00:00:00' AND '1998-05-20 23:59:59'
      GROUP BY status ORDER BY status
      0.014 seconds vs 9.8 seconds!
      ● Can use primary key index to drop blocks
      ● 100x less I/O to read
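Why sorting by timestamp lets the engine skip most of the data can be shown with a simplified Python model of a sparse index: keep the first key of every block of rows, then use binary search to find the only blocks that can contain the requested range. The function names and the tiny granularity are assumptions for illustration, and real MergeTree indexes have more moving parts.

```python
from bisect import bisect_left, bisect_right

def build_marks(sorted_keys, granularity):
    """Keep the first key of each block of `granularity` rows."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granularity)]

def granules_to_read(marks, lo, hi):
    """Return indexes of the blocks that may contain keys in [lo, hi]."""
    first = max(bisect_left(marks, lo) - 1, 0)
    last = bisect_right(marks, hi) - 1
    return range(first, last + 1) if last >= first else range(0)

keys = list(range(100))                 # 100 sorted timestamps, modeled as integers
marks = build_marks(keys, 10)           # 10 blocks of 10 rows each
selected = list(granules_to_read(marks, 25, 37))
print(selected, "of", len(marks), "blocks")
```

In this toy case the range filter touches 2 of 10 blocks. Without the sort order every block would have to be read and its JSON parsed.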
  • 18. Using paired arrays and maps for JSON
  • 19. Representing JSON as paired arrays and maps
      [Diagram: a JSON record {"@timestamp":"1998-05-22 21:37:39","clientip":"54.72.5.0",...} maps to one of two SQL table layouts.]
      ● Arrays: header values with key-value pairs (an array of keys and an array of values)
      ● Map: header values with mapped key-value pairs (a map with key/values)
  • 20. Storing JSON in paired arrays
      CREATE TABLE http_logs_arrays (
        `file` String,
        `keys` Array(String),
        `values` Array(String),
        timestamp DateTime CODEC(Delta, ZSTD(1))
      ) ENGINE = MergeTree
      PARTITION BY toStartOfMonth(timestamp)
      ORDER BY timestamp
  • 21. Loading JSON to paired arrays
      -- Load data. Might be better to format outside ClickHouse.
      INSERT INTO http_logs_arrays(file, keys, values, timestamp)
      SELECT file,
        arrayMap(x -> x.1, JSONExtractKeysAndValues(message, 'String')) keys,
        arrayMap(x -> x.2, JSONExtractKeysAndValues(message, 'String')) values,
        toDateTime(JSONExtractUInt(message, '@timestamp')) timestamp
      FROM http_logs
      LIMIT 30000000
  • 22. Querying values in arrays
      -- Run a query.
      SELECT values[indexOf(keys, 'status')] status, count()
      FROM http_logs_arrays
      GROUP BY status ORDER BY status
      status|count() |
      ------|--------|
      200   |24917090|
      206   |   64935|
      302   |    1941|
      304   | 4899616|
      400   |     888|
      404   |  115005|
      500   |     525|
      4-5x faster than accessing JSON string objects
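The paired-array layout and the `values[indexOf(keys, ...)]` lookup can be mimicked in Python as follows. This is only a sketch of the data structure: `to_paired_arrays` and `lookup` are hypothetical helpers, values are kept as strings to match the Array(String) column, and a missing key returns None here as a simplification.

```python
import json
from collections import Counter

def to_paired_arrays(message):
    """Split one JSON object into parallel key and value lists (values as strings)."""
    record = json.loads(message)
    keys = list(record.keys())
    values = [v if isinstance(v, str) else json.dumps(v) for v in record.values()]
    return keys, values

def lookup(keys, values, name):
    """Mimic values[indexOf(keys, name)]; return None when the key is absent."""
    try:
        return values[keys.index(name)]
    except ValueError:
        return None

messages = [
    '{"@timestamp": 895873059, "clientip": "54.72.5.0", "status": 200}',
    '{"@timestamp": 895873059, "clientip": "53.72.5.0", "status": 200}',
    '{"@timestamp": 895873060, "clientip": "53.72.5.0", "status": 404}',
]
rows = [to_paired_arrays(m) for m in messages]
counts = Counter(lookup(keys, values, "status") for keys, values in rows)
print(sorted(counts.items()))
```

The JSON is parsed once at load time, and each query then does a positional lookup instead of re-parsing the string.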
  • 23. Another way to store JSON objects: Maps
      CREATE TABLE http_logs_map (
        `file` String,
        `message` Map(String, String),
        timestamp DateTime
          DEFAULT toDateTime(toUInt32(message['@timestamp']))
          CODEC(Delta, ZSTD(1))
      ) ENGINE = MergeTree
      PARTITION BY toStartOfMonth(timestamp)
      ORDER BY timestamp
  • 24. Loading and querying JSON in Maps
      -- Load data
      INSERT INTO http_logs_map(file, message)
      SELECT file, JSONExtractKeysAndValues(message, 'String') message
      FROM http_logs
      -- Run a query.
      SELECT message['status'] status, count()
      FROM http_logs_map
      GROUP BY status ORDER BY status
      4-5x faster than accessing JSON string objects
  • 25. The JSON Data Type (new in 22.3)
  • 26. Mapping complex data to a JSON data type column
      [Diagram: {Complex JSON} maps to a SQL table that holds a JSON data type column ("blob") alongside other column values.]
  • 27. How did JSON work until now?
      ● Storing JSON using String datatypes
      ● 2 parsers:
        ○ Simple parser
        ○ Full-fledged parser
      ● 2 sets of functions, one per parser:
        ○ Family of simpleJSON functions that only work for simple non-nested JSON
          ■ visitParamExtractUInt = simpleJSONExtractUInt
        ○ Family of JSONExtract* functions that can parse any JSON object completely
          ■ JSONExtractUInt, JSONExtractString, JSONExtractArrayRaw …
      ● Parsing happens at query time!
  • 28. How did JSON work until now?
      WITH JSONExtract(json, 'Tuple(a UInt32, b UInt32, c Nested(d UInt32, e String))') AS parsed_json
      SELECT
        JSONExtractUInt(json, 'a') AS a,
        JSONExtractUInt(json, 'b') AS b,
        JSONExtractArrayRaw(json, 'c') AS array_c,
        tupleElement(parsed_json, 'a') AS a_tuple,
        tupleElement(parsed_json, 'b') AS b_tuple,
        tupleElement(parsed_json, 'c') AS array_c_tuple,
        tupleElement(tupleElement(parsed_json, 'c'), 'd') AS `c.d`,
        tupleElement(tupleElement(parsed_json, 'c'), 'e') AS `c.e`
      FROM (
        SELECT '{"a":1,"b":2,"c":[{"d":3,"e":"str_1"}, {"d":4,"e":"str_2"}, {"d":3,"e":"str_1"}, {"d":4,"e":"str_1"}, {"d":7,"e":"str_9"}]}' AS json
      )
      FORMAT Vertical
      Let’s dive in!
  • 29. How did JSON work until now?
      1. Approach A: Using tuples
         1.1. Parse the JSON with the JSONExtract function into a tuple structure, using a CTE (WITH clause)
         1.2. Use the tupleElement function to extract fields; nest the calls (tupleElement inside tupleElement) to reach nested fields
      2. Approach B: Direct
         2.1. Use JSONExtractUInt/JSONExtractArrayRaw to extract the values directly
      Both require multiple passes:
      ● Tuple approach = 2 passes (CTE + query)
      ● Direct approach = 3 passes: two ints (a and b) and an array (array_c)
  • 30. New JSON
      ● ClickHouse parses JSON data at INSERT time.
      ● Automatic inference and creation of the underlying table structure
      ● JSON object stored in a columnar ClickHouse native format
      ● Named tuple and array notation to query JSON objects: array[x] | tuple.element
      [Diagram: Ingestor → Parsing → Conversion → Storage Layer. Raw JSON becomes extracted fields, then columns with ClickHouse type definitions.]
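The inference step can be pictured with a rough Python sketch that walks one JSON object and derives a column path and a type label for each leaf. This is an illustration of the idea only. The type labels are ClickHouse-like names picked for readability (the engine's actual choices, such as integer widths, may differ), and `infer_columns` and `type_name` are hypothetical helpers.

```python
def type_name(item):
    """Return an illustrative, ClickHouse-like type label for a JSON value."""
    if isinstance(item, bool):
        return "Bool"
    if isinstance(item, int):
        return "Int64"
    if isinstance(item, float):
        return "Float64"
    if isinstance(item, str):
        return "String"
    if isinstance(item, list):
        inner = type_name(item[0]) if item else "Nothing"
        return "Array(" + inner + ")"
    if isinstance(item, dict):
        fields = ", ".join(key + " " + type_name(value) for key, value in item.items())
        return "Tuple(" + fields + ")"
    return "Nothing"

def infer_columns(record, prefix=""):
    """Flatten nested objects into dotted column paths with inferred type labels."""
    columns = {}
    for key, item in record.items():
        path = prefix + "." + key if prefix else key
        if isinstance(item, dict):
            columns.update(infer_columns(item, path))
        else:
            columns[path] = type_name(item)
    return columns

sample = {
    "qid": "1",
    "tag": ["clickhouse", "json"],
    "answers": [{"date": "2022-07-26", "user": "a"}],
    "meta": {"score": 3},
}
schema = infer_columns(sample)
for path, label in schema.items():
    print(path, label)
```

Each inferred path can then be stored as its own column, which is what makes `tuple.element` style access cheap at query time.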
  • 31. New JSON storage format
  • 32. New JSON
      SET allow_experimental_object_type = 1;
      CREATE TABLE json_test.stack_overflow_js (`raw` JSON)
      ENGINE = MergeTree ORDER BY tuple();
      INSERT INTO stack_overflow_js
      SELECT json FROM file('stack_overflow_nested.json.gz', JSONAsObject);
      SELECT count(*) FROM stack_overflow_js;
      11203029 rows in set. Elapsed: 2.323 sec. Processed 11.20 million rows, 3.35 GB (4.82 million rows/s., 1.44 GB/s.)
  • 33. New JSON useful settings
      -- Basic structure
      SET describe_extend_object_types = 1;
      DESCRIBE TABLE stack_overflow_js;
      -- Columns included
      SET describe_include_subcolumns = 1;
      DESCRIBE TABLE stack_overflow_js FORMAT Vertical;
      -- JSON full structure
      SET output_format_json_named_tuples_as_objects = 1;
      SELECT raw FROM stack_overflow_js LIMIT 1 FORMAT JSONEachRow;
  • 34. New vs Old-school
      stack_overflow_js vs stack_overflow_str:
      CREATE TABLE nested_json.stack_overflow_js (`raw` JSON)
      ENGINE = MergeTree ORDER BY tuple();
      CREATE TABLE nested_json.stack_overflow_str (`raw` String)
      ENGINE = MergeTree ORDER BY tuple();
      ● topK stack_overflow_str:
      SELECT topK(100)(arrayJoin(JSONExtract(raw, 'tag', 'Array(String)')))
      FROM stack_overflow_str;
      1 rows in set. Elapsed: 2.101 sec. Processed 11.20 million rows, 3.73 GB (5.33 million rows/s., 1.77 GB/s.)
      ● topK stack_overflow_js:
      SELECT topK(100)(arrayJoin(raw.tag))
      FROM stack_overflow_js
      1 rows in set. Elapsed: 0.331 sec. Processed 11.20 million rows, 642.07 MB (33.90 million rows/s., 1.94 GB/s.)
  • 35. Limitations:
      ● What happens if there are schema changes?
        ○ Column type changes, new keys, deleted keys …
        ○ Insert a new JSON object like this: { "foo": "10", "bar": 10 }
          ■ ClickHouse will create a new part for this JSON
          ■ ClickHouse will create a tuple structure: raw.foo and raw.bar
          ■ After OPTIMIZE TABLE FINAL: new mixed tuple = stack_overflow tuple + foobar tuple
      ● Problems:
        ○ No errors or warnings during insertions
        ○ Malformed JSON will pollute our data
        ○ We cannot select slices like raw.answers.*
        ○ ClickHouse creates a dynamic column per JSON key (our JSON has 1K keys, so 1K columns)
  • 36. Check tuple structure:
      INSERT INTO stack_overflow_js VALUES ('{ "bar": "hello", "foo": 1 }');
      SELECT table, column, name AS part_name, type, subcolumns.names, subcolumns.types
      FROM system.parts_columns
      WHERE table = 'stack_overflow_js'
      FORMAT Vertical
  • 37. Check tuple structure:
      Row 1:
      ──────
      table: stack_overflow_js
      column: raw
      part_name: all_12_22_5
      type: Tuple(answers Nested(date String, user String), creationDate String, qid String, tag Array(String), title String, user String)
      subcolumns.names: ['answers','answers.size0','answers.date','answers.user','creationDate','qid','tag','tag.size0','title','user']
      subcolumns.types: ['Nested(date String, user String)','UInt64','Array(String)','Array(String)','String','String','Array(String)','UInt64','String','String']
      subcolumns.serializations: ['Default','Default','Default','Default','Default','Default','Default','Default','Default','Default']
      Row 2:
      ──────
      table: stack_overflow_js
      column: raw
      part_name: all_23_23_0
      type: Tuple(bar String, foo Int8)
      subcolumns.names: ['bar','foo']
      subcolumns.types: ['String','Int8']
      subcolumns.serializations: ['Default','Default']
  • 38. Improvements:
      ● CODEC changes: LZ4 vs ZSTD
      SELECT table, column,
        formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
        formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed
      FROM system.parts_columns
      WHERE table IN ('stack_overflow_js', 'stack_overflow_str') AND column IN ('raw')
      GROUP BY table, column
      ● ALTER TABLEs
      ALTER TABLE stack_overflow_str MODIFY COLUMN raw CODEC(ZSTD(3));
      ALTER TABLE stack_overflow_js MODIFY COLUMN raw CODEC(ZSTD(3));
      table             |column|LZ4     |ZSTD      |uncompressed|
      ------------------|------|--------|----------|------------|
      stack_overflow_str|raw   |1.73 GiB|1.23 GiB  |3.73 GiB    |
      stack_overflow_js |raw   |1.30 GiB|886.77 MiB|2.29 GiB    |
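The sizes on this slide translate into compression ratios with a little arithmetic, sketched here in Python. The figures are copied from the table, reading the 886.77 value as MiB (an assumption, since a compressed size cannot exceed the 2.29 GiB uncompressed size). The `compression_ratio` helper and the `sizes` layout are hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# Sizes in bytes, taken from the slide; 886.77 is assumed to be MiB.
sizes = {
    "stack_overflow_str": {"LZ4": 1.73 * GIB, "ZSTD": 1.23 * GIB, "uncompressed": 3.73 * GIB},
    "stack_overflow_js": {"LZ4": 1.30 * GIB, "ZSTD": 886.77 * MIB, "uncompressed": 2.29 * GIB},
}

def compression_ratio(uncompressed, compressed):
    """Return how many times smaller the compressed data is."""
    return uncompressed / compressed

ratios = {
    (table, codec): compression_ratio(entry["uncompressed"], entry[codec])
    for table, entry in sizes.items()
    for codec in ("LZ4", "ZSTD")
}
for (table, codec), ratio in ratios.items():
    print(table, codec, round(ratio, 2))
```

Under that reading, ZSTD shrinks both columns noticeably more than LZ4, and the JSON-typed column stays smaller than the String column in absolute terms under either codec.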
  • 39. Improvements
      ● Query times: LZ4 vs ZSTD
        ○ LZ4
          ■ 0.3s New vs 2.1s Old
        ○ ZSTD
          ■ 0.4s New vs 2.8s Old
      table             |column|LZ4 |ZSTD|comp.ratio|
      ------------------|------|----|----|----------|
      stack_overflow_js |raw   |0.3s|0.4s|12%       |
      stack_overflow_str|raw   |2.1s|2.8s|10%       |
  • 40. Wrap-up and References
  • 41. Secrets to JSON happiness in ClickHouse
      ● Use JSON formats to read and write JSON data
      ● Fetch JSON String data with JSONExtract*/visitParam* functions
      ● Store JSON in paired arrays or maps
      ● (NEW) The new JSON data type stores data efficiently and offers convenient query syntax
        ○ It’s still experimental
  • 42. More things to look at by yourself
      ● Using materialized views to populate JSON data
      ● Indexing JSON data
        ○ Indexes on JSON data type columns
        ○ Bloom filters on blobs
      ● More compression and codec tricks
  • 43. Where to get more information
      ClickHouse Docs: https://clickhouse.com/docs/
      Altinity Knowledge Base: https://kb.altinity.com/
      Altinity Blog: https://altinity.com
      ClickHouse Source Code and Tests: https://github.com/ClickHouse/ClickHouse
      ● Especially tests
  • 44. Thank you! Questions?
      https://altinity.com
      Altinity.Cloud
      Altinity Support
      Altinity Stable Builds
      We’re hiring!
      Copyright © Altinity Inc 2022