Why PostgreSQL for Analytics Infrastructure (DW)?
Huy Nguyen
CTO, Cofounder - Holistics.io
● Cofounder, Holistics.io
○ Data Reporting (BI) and Infrastructure SaaS
● Previously
○ Growth Team at Facebook (US)
[Diagram: analytics data flow. Production side: live databases (operational data), imported CSVs / Excels / Google Sheets, and event logs (behavioural data). Analytics side: an analytics database (data warehouse) fed by daily snapshots, imports, and pre-aggregation, with a modify / transform step, serving reporting / BI, reporting / analysis, and data science / ML.]
What database should we pick?
Transactional Applications vs Analytics Applications

Transactional Applications
Data:
● Many single-row writes
● Current data, single source
Queries:
● Generated by user activities; 10 to 1000 users
● < 1s response time
● Short queries

Analytics Applications
Data:
● Few large batch imports
● Years of data, many sources
Queries:
● Generated by large reports; 1 to 10 users
● Queries run for hours
● Long queries
Postgres:
● Free (open-source)
● Easy to set up
[Diagram: tables in the live production databases are copied into tables in the analytics database (data warehouse).]
Columns: date_d | country | user_id | browser | page_name | views
Solution: Split (partition) into multiple tables
● pageviews_2015_06
● pageviews_2015_07
● pageviews_2015_08
● pageviews_2015_09
Problem: Difficult to query data across multiple months
ALTER TABLE pageviews_2015_09 INHERIT pageviews;
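A minimal sketch of how table inheritance can address the cross-month query problem. The parent table name `pageviews` and the column types are assumptions for illustration; only the column names and the monthly table names come from the slides. PostgreSQL 10 and later also offer declarative partitioning (`PARTITION BY`), which this sketch does not use.

```sql
-- Parent table (name and column types are assumed): it holds no rows
-- itself and only defines the columns shared by every monthly table.
CREATE TABLE pageviews (
    date_d    date,
    country   text,
    user_id   bigint,
    browser   text,
    page_name text,
    views     integer
);

-- A child created directly under the parent. The CHECK constraint lets
-- the planner skip this table when a date filter cannot match it
-- (constraint exclusion).
CREATE TABLE pageviews_2015_08 (
    CHECK (date_d >= DATE '2015-08-01' AND date_d < DATE '2015-09-01')
) INHERITS (pageviews);

-- A table built separately (for example, loaded in bulk first) can be
-- attached afterwards, as long as its columns match the parent's.
CREATE TABLE pageviews_2015_09 (
    LIKE pageviews,
    CHECK (date_d >= DATE '2015-09-01' AND date_d < DATE '2015-10-01')
);
ALTER TABLE pageviews_2015_09 INHERIT pageviews;

-- Querying the parent reads all attached children, so a query spanning
-- several months needs no manual UNION ALL.
SELECT country, SUM(views) AS views
FROM pageviews
WHERE date_d >= DATE '2015-08-15' AND date_d < DATE '2015-09-15'
GROUP BY country;
```

Dropping an old month then becomes a cheap `DROP TABLE` on one child rather than a large `DELETE` on a single big table.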
● Hot Data
● Warm Data
● Cold Data
pageviews_2015_08
pageviews_2015_09
● UNLOGGED TABLE
○ Skips the WAL (write-ahead log)
○ Improved Write Performance
http://pgsnaga.blogspot.com/2011/10/data-loading-into-unlogged-tables-and.html
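A minimal sketch of using an unlogged table as a staging area for bulk loads. The table name `staging_pageviews`, the column types, and the generated rows are hypothetical. The trade-off to keep in mind: because writes skip the WAL, an unlogged table is truncated after a crash and is not replicated to standbys, so it suits data that can be reloaded.

```sql
-- Hypothetical staging table; UNLOGGED means writes bypass the WAL.
CREATE UNLOGGED TABLE staging_pageviews (
    date_d    date,
    country   text,
    user_id   bigint,
    browser   text,
    page_name text,
    views     integer
);

-- Stand-in for a bulk import: generate 100,000 synthetic rows.
-- In practice this would be a COPY or an INSERT ... SELECT from a source.
INSERT INTO staging_pageviews (date_d, country, user_id, browser, page_name, views)
SELECT DATE '2015-08-01' + (i % 31), 'VN', i, 'chrome', 'home', 1
FROM generate_series(1, 100000) AS s(i);

-- PostgreSQL 9.5 and later can convert the table to a regular, durable
-- one after loading; this rewrites the table and writes it to the WAL.
ALTER TABLE staging_pageviews SET LOGGED;
```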
● Extract / transform
● Aggregate / summarize
● Statistical analysis
[Diagram: the analytics database (data warehouse) serves data science / ML, reporting / BI, and reporting / analysis.]
Problems with nested subqueries:
a) hard to read
b) cannot be reused
SELECT ...
FROM (
    SELECT ...
    FROM t1
    JOIN (SELECT ... FROM ...) a
        ON (...)
) b
JOIN (SELECT ... FROM ...) c ON (...)
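One way PostgreSQL lets you avoid deeply nested subqueries is a common table expression (`WITH`), which names each intermediate result so it reads top to bottom and can be referenced more than once. A minimal sketch, assuming a hypothetical `users` schema with `id`, `gender`, and `created_at` columns (only the `users` table and `gender` column appear in the slides):

```sql
-- Hypothetical schema, created here only to keep the example self-contained.
CREATE TEMP TABLE users (
    id         bigint,
    gender     text,
    created_at timestamp
);

WITH daily_signups AS (
    -- Signups per day and gender.
    SELECT created_at::date AS signup_date,
           gender,
           COUNT(*) AS signups
    FROM users
    GROUP BY 1, 2
),
daily_totals AS (
    -- Reuses daily_signups instead of repeating the subquery.
    SELECT signup_date,
           SUM(signups) AS total_signups
    FROM daily_signups
    GROUP BY 1
)
SELECT d.signup_date,
       d.gender,
       d.signups,
       round(100.0 * d.signups / t.total_signups, 1) AS pct_of_day
FROM daily_signups d
JOIN daily_totals t USING (signup_date)
ORDER BY d.signup_date, d.gender;
```

Here `daily_signups` is defined once and used twice, which a nested subquery would have to spell out in both places.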
SELECT
    gender,
    COUNT(1) AS signups
FROM users
GROUP BY 1
● Examples:
○ Calculate moving average
○ Cumulative sum
○ Ranking by partition
○ …
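The three examples listed above can be sketched with window functions. The `daily_views` table and its columns are hypothetical, and the moving average assumes one row per country per day with no missing days.

```sql
-- Hypothetical table, created here only to keep the example self-contained.
CREATE TEMP TABLE daily_views (
    date_d  date,
    country text,
    views   integer
);

SELECT
    date_d,
    country,
    views,
    -- Moving average: the current row plus the 6 preceding rows per country.
    AVG(views) OVER (
        PARTITION BY country
        ORDER BY date_d
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS views_7d_avg,
    -- Cumulative sum: running total per country in date order.
    SUM(views) OVER (
        PARTITION BY country
        ORDER BY date_d
    ) AS views_running_total,
    -- Ranking by partition: rank of each country within its day.
    RANK() OVER (
        PARTITION BY date_d
        ORDER BY views DESC
    ) AS country_rank_in_day
FROM daily_views
ORDER BY country, date_d;
```

Unlike `GROUP BY`, each window function keeps every input row and adds a computed column alongside it.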
● CitusDB Extension
○ Automated data sharding and parallelization
○ Columnar Storage Format (better storage and performance)
● Vertica (HP)
○ Columnar Storage, Parallel Execution
○ Started by Michael Stonebraker (Postgres original author)
● Amazon Redshift
○ Based on ParAccel DB, itself derived from PostgreSQL 8.0.2
○ Columnar Storage & Parallel Execution
Other Proprietary DW Databases (Relational)
● Paraccel (Postgres fork)
● Greenplum
Related to Postgres
License / Cost: Free / Open-source | Free / Open-source | Expensive | Expensive