CrateDB PostgreSQL Query Benchmark
June 2017
Overview
CrateDB is an open core, distributed SQL database. It combines the ease of use of SQL with the
horizontal scaling and data model flexibility people associate with NoSQL databases like MongoDB
or Apache Cassandra.
CrateDB can run fast queries over very large datasets in real time, while new data is still being
ingested. This makes CrateDB particularly well suited to storing and processing machine data
(e.g. data collected by IoT devices).
To demonstrate this, we put together a benchmark that compares CrateDB vs PostgreSQL query
performance. The results of this benchmark show that CrateDB performs significantly better than
PostgreSQL with selects, aggregations, and grouping.
33x better price/performance ratio for CrateDB: CrateDB queried 314 million rows of data up to
22x faster than PostgreSQL, running on hardware that cost 30% less than the hardware on which
PostgreSQL ran.
To help you achieve similar results, we have put together this white paper which explains how the
CrateDB and PostgreSQL databases were set up and benchmarked.
The data set contains 314,496,000 records that simulate sensor readings gathered over the
period of one year. Each record contains information about the sensor and the sensor readings:
CREATE TRIGGER sensor_insert BEFORE INSERT OR UPDATE ON t1
    FOR EACH ROW EXECUTE PROCEDURE insert_sensor_reading();
CREATE TRIGGER generate_taxonomy BEFORE INSERT OR UPDATE ON t1
    FOR EACH ROW EXECUTE PROCEDURE generate_taxonomy();
Differences
There are two differences to point out, having to do with table partitioning and with enabling text
searches.
We also divided the CrateDB data into three physical shards using the CLUSTERED BY clause in the
CREATE TABLE statement. Rows having the same value in the routing column are stored in the same
shard. Queries that have the routing column in the WHERE clause (tenant_id in this case) are routed
directly to the shard that contains the relevant data.
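The routing behaviour described above can be illustrated with a minimal sketch. It is not CrateDB's implementation: CRC32 is assumed here as a stand-in for the database's internal hash function, and the tenant identifiers and rows are hypothetical. It only shows why rows with the same routing value land in the same shard, and why a query filtered on that value needs to visit a single shard.

```python
import zlib

NUM_SHARDS = 3  # the benchmark table was divided into three shards


def shard_for(routing_value: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a routing-column value to a shard number.

    Assumption: CRC32 stands in for the database's internal hash function.
    """
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % num_shards


# Hypothetical rows: (tenant_id, sensor_type)
rows = [
    ("tenant-a", "temperature"),
    ("tenant-b", "humidity"),
    ("tenant-a", "pressure"),
]

# Distribute the rows across shards by hashing the routing column.
shards = {n: [] for n in range(NUM_SHARDS)}
for tenant_id, sensor_type in rows:
    shards[shard_for(tenant_id)].append((tenant_id, sensor_type))

# A query with tenant_id in the WHERE clause only needs to read one shard.
target = shard_for("tenant-a")
matches = [row for row in shards[target] if row[0] == "tenant-a"]
```

Because the hash is deterministic, every row for a given tenant ends up in the same shard, so the lookup never has to consult the other two.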
We considered partitioning the data in PostgreSQL, but the DDL required to create 52 individual
weekly partitions was impractical. Refer to the appendix of this document for more detail on
partitioning a PostgreSQL table.
Text Searches
The CrateDB DDL defines a FULLTEXT index on the sensor_type field. This allows CrateDB to run
a full-text search query on a hierarchical taxonomy of sensors. We used the ltree module to perform
the same text search queries in PostgreSQL.
With the exception of the full-text query, the SQL syntax was the same for both CrateDB and
PostgreSQL.
Query #1
Query #2
Query #3
CrateDB:
SELECT sensor_type, COUNT(*) as sensor_count
FROM t1
WHERE taxonomy = ? AND tenant_id = ?
GROUP BY sensor_type;
PostgreSQL:
SELECT sensor_type, COUNT(*) as sensor_count
FROM t1
WHERE taxonomy <@ ?::ltree AND tenant_id = ?
GROUP BY sensor_type;
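The `<@` operator in the PostgreSQL query tests whether the left-hand ltree path is a descendant of (or equal to) the right-hand path. The sketch below mimics that check in plain Python to make the semantics concrete. The function name and the taxonomy values are hypothetical and are not taken from the benchmark data.

```python
def is_descendant(path: str, ancestor: str) -> bool:
    """Return True if `path` equals `ancestor` or lies beneath it.

    Paths are dot-separated labels, as in ltree. The comparison is made
    label by label, so "sensors.temperature" is NOT a descendant of
    "sensors.temp".
    """
    path_labels = path.split(".")
    ancestor_labels = ancestor.split(".")
    return path_labels[: len(ancestor_labels)] == ancestor_labels


# Hypothetical taxonomy values, for illustration only
readings = [
    ("building.hvac.temperature", "tenant-a"),
    ("building.hvac.humidity", "tenant-a"),
    ("building.lighting.power", "tenant-a"),
]

# Equivalent in spirit to: WHERE taxonomy <@ 'building.hvac'
hvac = [taxonomy for taxonomy, _ in readings if is_descendant(taxonomy, "building.hvac")]
```

Matching whole labels rather than raw string prefixes is what distinguishes a hierarchical lookup from a simple `LIKE 'prefix%'` comparison.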
The total cost of the PostgreSQL hardware, $9,516 USD, was 30% higher than the CrateDB
hardware cost of $6,102. (Pricing source: Thinkmate.com).
These results show the average duration (in milliseconds) for each query to process 314 million
rows:
If you would like to discuss your use case or need help with CrateDB, please contact us. We can
provide guidance on database setup, hardware recommendations, data modelling, query design
advice, performance tuning, and so on.
PostgreSQL does not have any built-in partitioning based on columns. Instead, you must partition your
data manually in a series of steps, as shown here.
First, you must create a master table from which all the partitions will inherit:
Then, you must create all the partition tables you need.
Since our scenario involves a year's worth of data partitioned over the corresponding weeks, we need
52 tables. These tables must inherit the schema from the previously defined master table.
CREATE TABLE t1_1 (CHECK (EXTRACT(WEEK FROM week_generated) = 1)) INHERITS (t1);
CREATE TABLE t1_2 (CHECK (EXTRACT(WEEK FROM week_generated) = 2)) INHERITS (t1);
...
CREATE TABLE t1_51 (CHECK (EXTRACT(WEEK FROM week_generated) = 51)) INHERITS (t1);
CREATE TABLE t1_52 (CHECK (EXTRACT(WEEK FROM week_generated) = 52)) INHERITS (t1);
You can then create an index on each of these tables for the column you are using to partition.
This leads to a very large schema (300+ lines in this case), which might be manageable if
the partitions can be defined up front. Unfortunately, because dates are non-repeating, it is impossible
to pre-define all possible partitions unless you have a definite end date.
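To give a sense of how repetitive this DDL is, the following sketch generates the 52 child-table statements, plus one index per table, from a template. It is an illustration only, not the script used for the benchmark. The function name and the index naming scheme are assumptions; the table name, column name, and CHECK constraint follow the statements shown above.

```python
from typing import List


def partition_ddl(master: str = "t1",
                  column: str = "week_generated",
                  weeks: int = 52) -> List[str]:
    """Build CREATE TABLE and CREATE INDEX statements for weekly partitions."""
    statements = []
    for week in range(1, weeks + 1):
        child = f"{master}_{week}"
        # Child table inherits the master schema and is constrained to one week.
        statements.append(
            f"CREATE TABLE {child} "
            f"(CHECK (EXTRACT(WEEK FROM {column}) = {week})) "
            f"INHERITS ({master});"
        )
        # Index on the partitioning column; the index name is hypothetical.
        statements.append(
            f"CREATE INDEX {child}_{column}_idx ON {child} ({column});"
        )
    return statements


ddl = partition_ddl()
print("\n".join(ddl[:4]))  # show the statements for the first two weeks
```

Generating the statements reduces the typing, but it does not remove the underlying limitation: the set of partitions still has to be known before the data arrives.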