Why PostgreSQL for Analytics Infrastructure (DW)?

Huy Nguyen
CTO, Cofounder - Holistics.io

Grokking TechTalk - Database Systems


Ho Chi Minh City - Aug 2016
About Me

● Cofounder of Holistics.io
○ Data Reporting (BI) and Infrastructure SaaS

● Cofounder of Grokking Vietnam


○ Building community of world-class engineers in Vietnam

● Previous
○ Growth Team at Facebook (US)

○ Built Data Pipeline at Viki (Singapore)


Background: What is Analytics/DW?
- A Typical Web Application

Data-related Business Problems:

• Daily/weekly registered users by different platforms, countries?


• How many video uploads do we have every day?
A Typical Data Pipeline
[Diagram: three kinds of operational data (event logs / behavioural data, production live databases, and CSVs / Excels / Google Sheets) flow into the analytics database (data warehouse) through import, daily snapshot, pre-aggregate, and modify / transform steps. The warehouse then serves reporting / BI and data science / ML.]
What database should we pick?
Transactional Applications vs Analytics Applications

Transactional applications
Data:
● Many single-row writes
● Current, single data
Queries:
● Generated by user activities; 10 to 1000 users
● < 1s response time
● Short queries

Analytics applications
Data:
● Few large batch imports
● Years of data, many sources
Queries:
● Generated by large reports; 1 to 10 users
● Queries run for hours
● Long queries

Ref: http://www.slideshare.net/PGExperts/really-big-elephants-postgresql-dw-15833438 (slide 5)


Complex Query...

Ref: http://www.slideshare.net/PGExperts/really-big-elephants-postgresql-dw-15833438 (slide 8)


Why start with Postgres?
Data Growth

1. Simple to Get Started


2. Rich Features for Analytics
– Data Pipeline (ETL)
– Data Analysis
3. Scale Up

(1) Start (2) Grow (3) Scale




1 Simple to Get Started
● Data requests grow gradually as your company grows
● Business users care about results (not backend)

→ Need something quick to start, easy to fine-tune along the way

Postgres:

● Free (open-source)
● Easy to set up

1. Simple start 2. Rich features 3. Scale up




[Diagram: the same pipeline, annotated with two stages. Data Pipeline (ETL) covers getting operational data into the data warehouse; Data Analysis covers reporting / BI and data science / ML on top of it.]


[Diagram: inside the data warehouse, data from event logs, production live databases, and CSVs / Excels / Google Sheets is stored as many tables in the analytics database.]


2 a- Data Pipeline (ETL) & Performance

● Managing Table Data: table partitioning

● Managing Disk Space: tablespace

● Write Performance: unlogged table

● Others: foreign data wrapper, point-in-time recovery



Managing Data Tables
Analytics tables hold lots of data (+100k records a day)

Example table: pageviews
date_d | country | user_id | browser | page_name | views

⇒ The table grows big quickly and becomes difficult to manage!

Solution: Split (partition) into multiple tables

● pageviews_2015_06
● pageviews_2015_07
● pageviews_2015_08
● pageviews_2015_09

Problem: Difficult to query data across multiple months


Managing Data Tables: parent table

pageviews_parent (parent table)

● pageviews_2015_06
● pageviews_2015_07
● pageviews_2015_08
● pageviews_2015_09

ALTER TABLE pageviews_2015_09 INHERIT pageviews_parent;

ALTER TABLE pageviews_2015_09 ADD CHECK
  (date_d >= '2015-09-01' AND date_d < '2015-10-01');
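
The parent-table approach can be sketched end to end as follows. This is a minimal illustration, not the speaker's exact setup: the column types are assumptions (the slide gives only column names), and only two monthly children are shown.

```sql
-- Parent table: normally holds no rows itself, it only defines the shared schema.
-- Column types are assumed for illustration.
CREATE TABLE pageviews_parent (
    date_d    DATE,
    country   TEXT,
    user_id   INT,
    browser   TEXT,
    page_name TEXT,
    views     INT
);

-- One child table per month. The CHECK constraint records which dates
-- each child is allowed to contain.
CREATE TABLE pageviews_2015_08 (
    CHECK (date_d >= '2015-08-01' AND date_d < '2015-09-01')
) INHERITS (pageviews_parent);

CREATE TABLE pageviews_2015_09 (
    CHECK (date_d >= '2015-09-01' AND date_d < '2015-10-01')
) INHERITS (pageviews_parent);

-- A query on the parent also reads its children, so a single query
-- can span several months.
SELECT date_d, SUM(views) AS total_views
FROM pageviews_parent
WHERE date_d >= '2015-08-15' AND date_d < '2015-09-15'
GROUP BY date_d
ORDER BY date_d;
```

With the default `constraint_exclusion = partition` setting, the planner can use the CHECK constraints to skip children whose date range cannot match the WHERE clause.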



Managing Disk-spaces
Analytics DB holds lots of data; disk space is limited

● SSD: fast, expensive
● SATA: cheap, slow

Data has different access frequencies

● Hot Data
● Warm Data
● Cold Data


Managing Disk-spaces: tablespace
Tablespace: Define where your tables are stored on disk

CREATE TABLESPACE hot_data LOCATION '/disk0/ssd/';
CREATE TABLESPACE warm_data LOCATION '/disk1/sata2/';

-- beginning of the month
CREATE TABLE pageviews_2016_08 (...) TABLESPACE hot_data;
ALTER TABLE pageviews_2016_07 SET TABLESPACE warm_data;


Combining TABLESPACE and PARENT TABLE

pageviews_2015_06 pageviews_parent (parent table)


pageviews_2015_07

pageviews_2015_08

pageviews_2015_09
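
One way the two features might be combined at the start of a new month is sketched below. It assumes that `pageviews_parent` and the `hot_data` / `warm_data` tablespaces already exist; the table name `pageviews_2015_10` is a hypothetical next month added for illustration.

```sql
-- Start of October: the new month's partition is created on fast storage
-- and attached to the parent in one statement.
CREATE TABLE pageviews_2015_10 (
    CHECK (date_d >= '2015-10-01' AND date_d < '2015-11-01')
) INHERITS (pageviews_parent) TABLESPACE hot_data;

-- The previous month is queried less often, so it moves to cheaper disks.
-- This rewrites the table's files and locks the table while it runs.
ALTER TABLE pageviews_2015_09 SET TABLESPACE warm_data;

-- Inspect where each table currently lives.
-- A NULL tablespace means the database's default tablespace.
SELECT tablename, tablespace
FROM pg_tables
WHERE tablename LIKE 'pageviews%';
```

Queries against `pageviews_parent` are unaffected by the move, because a tablespace only changes where a table's files are stored.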



[Diagram: the data warehouse as many tables built from event logs, production live databases, and CSVs / Excels / Google Sheets.]

Analytics tables can be rebuilt from source


Write Performance: unlogged table
● Transactional Safety: Every update is 2 writes:
○ Update data inside table
○ Write WAL (Write Ahead Log)

● UNLOGGED TABLE
○ Skips the WAL
○ Improved write performance
○ Not crash-safe: contents are truncated after a crash, which is acceptable when the table can be rebuilt from source

CREATE UNLOGGED TABLE daily_summary (...);

INSERT INTO daily_summary …;

http://pgsnaga.blogspot.com/2011/10/data-loading-into-unlogged-tables-and.html
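
A slightly fuller sketch of the same idea, assuming the summary is rebuilt from a `pageviews_parent` table with `date_d`, `country` and `views` columns; the columns chosen for `daily_summary` are hypothetical.

```sql
-- Summary table that can always be rebuilt from the source data,
-- so skipping the WAL is an acceptable trade-off.
CREATE UNLOGGED TABLE daily_summary (
    date_d  DATE,
    country TEXT,
    views   BIGINT
);

-- Bulk load: no WAL records are written for these rows.
INSERT INTO daily_summary (date_d, country, views)
SELECT date_d, country, SUM(views)
FROM pageviews_parent
GROUP BY date_d, country;

-- Optional (PostgreSQL 9.5 and later): make the table crash-safe
-- once the load has finished.
ALTER TABLE daily_summary SET LOGGED;
```

Unlogged tables are also not replicated to standby servers, which is worth checking before reports are pointed at a replica.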



2 b- Data Analysis (writing SQLs)

● Extract / transform
● Aggregate / summarize
● Statistical analysis

[Diagram: the analytics database (data warehouse) serving data science / ML and reporting / BI.]


2 b- Data Analysis with Postgres
● SQL features
○ WITH clause
○ Window functions
○ Aggregation functions
○ Statistical functions
● Data structures
○ JSON / JSONB
○ Arrays
○ PostGIS (geo data)
○ Geometry (point, line, etc)
○ HyperLogLog (extension)
● PL/pgSQL
● Full-text search (n-gram)
● Performance:
○ Parallel queries (pg9.6)
○ Materialized views
○ BRIN index
● Others:
○ DISTINCT ON
○ VALUES
○ generate_series()
○ Support FULL OUTER JOIN
○ Better EXPLAIN
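
Two of the smaller items, DISTINCT ON and generate_series(), are easiest to see in a query. The sketch below assumes a `users` table with `id`, `gender` and `created_at` columns; the date range and the alias `series_day` are arbitrary choices for illustration.

```sql
-- DISTINCT ON: keep one row per gender, here the most recent signup.
SELECT DISTINCT ON (gender) gender, id, created_at
FROM users
ORDER BY gender, created_at DESC;

-- generate_series(): build a calendar of days, then LEFT JOIN onto it
-- so that days without signups still appear with a count of 0.
SELECT g.series_day::date AS date_d,
       COUNT(u.id)        AS signups
FROM generate_series(
         TIMESTAMP '2016-08-01',
         TIMESTAMP '2016-08-07',
         INTERVAL '1 day'
     ) AS g(series_day)
LEFT JOIN users u ON u.created_at::date = g.series_day::date
GROUP BY 1
ORDER BY 1;
```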



CTE - Problem with Nested Queries
Nested queries

a) are hard to read
b) cannot be reused

SELECT ...
FROM (SELECT ...
      FROM t1
      JOIN (SELECT ... FROM ...) a
        ON (...)
     ) b
JOIN (SELECT ... FROM ...) c ON (...)


CTE - Common Table Expressions (WITH clause)

● SQL’s “private methods”
● A WITH view can be referenced multiple times
● Allows chaining instead of nesting

WITH a AS (
  SELECT ... FROM ...
), b AS (
  SELECT ...
  FROM t1 JOIN a ON (...)
), c AS (
  SELECT ... FROM ...
)
SELECT ... FROM b JOIN c ON ...
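
As a concrete instance of the skeleton above, the sketch below lists the days whose signups were above the overall daily average. It assumes a `users` table with a `created_at` timestamp column; the CTE names `daily` and `avg_daily` are made up for illustration.

```sql
WITH daily AS (
    -- Step 1: signups per day.
    SELECT created_at::date AS date_d,
           COUNT(*)         AS signups
    FROM users
    GROUP BY 1
), avg_daily AS (
    -- Step 2: reuse "daily" to compute the average day.
    SELECT AVG(signups) AS avg_signups
    FROM daily
)
-- Step 3: reuse "daily" a second time and compare against the average.
SELECT d.date_d, d.signups
FROM daily d
CROSS JOIN avg_daily a
WHERE d.signups > a.avg_signups
ORDER BY d.date_d;
```

Here `daily` is written once and referenced twice, which the nested-subquery form cannot do without repeating the subquery.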



CTE (cont.)
● Recursive CTE
● Writable CTE

-- move data from A to B
WITH deleted_rows AS (
  DELETE FROM a WHERE ...
  RETURNING *
)
INSERT INTO b
SELECT * FROM deleted_rows;
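
The slide shows only the writable form. For the recursive form, a minimal sketch is a CTE that generates one row per day; the dates and the name `days` are arbitrary.

```sql
WITH RECURSIVE days AS (
    -- Anchor row: where the recursion starts.
    SELECT DATE '2016-08-01' AS date_d
    UNION ALL
    -- Recursive step: add one day (date + integer yields a date)
    -- until the end date is reached.
    SELECT date_d + 1
    FROM days
    WHERE date_d < DATE '2016-08-07'
)
SELECT date_d
FROM days;
```

The same pattern is commonly used to walk hierarchies such as category trees, with the recursive step joining each row to its children.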



Limitation of GROUP BY aggregate

● GROUP BY aggregate: reduces a partition of data to 1 value

SELECT
  gender,
  COUNT(1) AS signups
FROM users
GROUP BY 1

What if we want to work through each row of each partition?


Window functions

● Window functions: a moving frame over 1 partition of data

● Examples:
○ Calculate moving average
○ Cumulative sum
○ Ranking by partition
○ …
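
The moving-average case can be sketched as below, assuming a `users` table with a `created_at` timestamp column. The 7-row frame and the output name `moving_avg_7d` are illustrative choices.

```sql
SELECT
    date_d,
    daily_signups,
    -- Frame: the current row plus the 6 rows before it, ordered by date.
    AVG(daily_signups) OVER (
        ORDER BY date_d
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d
FROM (
    -- One row per day with its signup count.
    SELECT created_at::date AS date_d,
           COUNT(*)         AS daily_signups
    FROM users
    GROUP BY 1
) daily
ORDER BY date_d;
```

The frame counts rows rather than calendar days, so it matches a 7-day window only when every day has at least one signup.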



Example: Cumulative Sum

CREATE TABLE users (
  id INT,
  gender VARCHAR(10),
  created_at TIMESTAMP
);

SELECT
  created_at::date AS date_d,
  COUNT(1) AS daily_signups,
  SUM(COUNT(1)) OVER
    (ORDER BY created_at::date) AS cumulative_signups
FROM users U
GROUP BY 1
ORDER BY 1;

| date_d     | daily_signups | cumulative_signups |
| 2016-08-01 | 100           | 100                |
| 2016-08-02 | 50            | 150                |
| 2016-08-03 | 80            | 230                |


Example: Group by Gender and rank by signup time

CREATE TABLE users (
  id INT,
  name VARCHAR,
  gender VARCHAR(10),
  created_at TIMESTAMP
);

SELECT
  gender,
  name,
  RANK() OVER (PARTITION BY gender
               ORDER BY created_at) AS signup_rnk
FROM users U
ORDER BY 1, 3;

| gender | name  | signup_rnk |
| female | Lan   | 1          |
| female | Tuyet | 2          |
| ...                         |
| male   | Hung  | 1          |
| male   | Son   | 2          |


2 b- Data Analysis with Postgres

PostgreSQL is well suited for data analysis!




3- Scaling Up
● PostgreSQL downsides:
○ Optimized for transactional applications
○ Single-core execution per query (before parallel queries in pg9.6); row-based storage

● CitusDB Extension
○ Automated data sharding and parallelization
○ Columnar Storage Format (better storage and performance)

● Vertica (HP)
○ Columnar Storage, Parallel Execution
○ Started by Michael Stonebraker (original author of Postgres)

● Amazon Redshift
○ Based on ParAccel DB, a fork of PostgreSQL 8.0.2
○ Columnar Storage & Parallel Execution
Other Proprietary DW Databases (Relational)

● ParAccel (Postgres fork)
● Vertica (from Postgres author)
● CitusDB (Postgres extension)
● Amazon Redshift (from ParAccel)
● Greenplum
● Teradata
● Infobright
● Google BigQuery
● Aster Data

Related to Postgres: ParAccel, Vertica, CitusDB, Amazon Redshift


Compare: Popular SQL Databases

|                | PostgreSQL         | MySQL              | Oracle    | SQL Server |
| License / Cost | Free / Open-source | Free / Open-source | Expensive | Expensive  |
| DW features    | Strong             | Weak               | Strong    | Strong     |


Summary
Data Growth

1. Simple to Get Started


2. Rich Features for Analytics
– Data Pipeline (ETL)
– Data Analysis
3. Easy to Scale Up

(1) Start (2) Grow (3) Scale


Summary (cont)

● Why start with Postgres
● Scaling up to DW databases
● Comparing with other transactional DBs
● Not covered:
○ How to set up PostgreSQL for DW
○ Performance Optimizations
○ Behavioural Data: Hadoop, Spark, HDFS
Huy Nguyen