Scalable Uniques in Postgres
Craig Kerstiens
Heroku Postgres
postgresql-hll
Truviso
‱ Extended Postgres to do streaming
‱ Various markets
‱ Ad space
‱ Wanted unique impressions
‱ Sort of wanted unique impressions
SELECT count(*)
Approx Top K
Compressed Bitmap
HyperLogLog
HyperLogLog
‱ KMV - K minimum values
‱ Bit-pattern observables
‱ Stochastic averaging
‱ Harmonic averaging
‱ Implemented by Aggregate Knowledge
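To make the bullets above concrete, here is a minimal Python sketch of the HyperLogLog idea. It is an illustration only, not the postgresql-hll implementation: the function names, the use of SHA-1 as the hash, and the 64-bit hash width are assumptions made for this example.

```python
import hashlib
import math


def hash64(value):
    """Hash any value to a 64-bit integer (SHA-1 is an arbitrary choice for this sketch)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hll_estimate(values, log2m=11):
    """Estimate the number of distinct values using 2**log2m registers."""
    m = 1 << log2m
    registers = [0] * m
    for value in values:
        h = hash64(value)
        # Stochastic averaging: the low log2m bits pick one register.
        index = h & (m - 1)
        rest = h >> log2m  # the remaining 64 - log2m bits
        # Bit-pattern observable: 1-based position of the lowest set bit.
        if rest == 0:
            rank = 64 - log2m + 1
        else:
            rank = (rest & -rest).bit_length()
        registers[index] = max(registers[index], rank)

    # Harmonic averaging of 2**register over all registers.
    alpha = 0.7213 / (1 + 1.079 / m)  # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: use linear counting while some registers are empty.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


if __name__ == "__main__":
    print(round(hll_estimate(range(100000))))
```

Duplicates never change a register, so feeding the same values again leaves the estimate untouched. That is the property that makes the structure usable for counting uniques.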
Craig Kerstiens - Scalable Uniques in Postgres @ Postgres Open
HyperLogLog
Probabilistic uniques with small footprint
Close-enough distinct counts with a small footprint
Use cases
‱ Semi distinct count
‱ Think pg_stat_statements
‱ Ad networks
‱ Web traffic
‱ With rollups/groupings
Digging in
CREATE EXTENSION hll;

CREATE TABLE helloworld (
    id   integer,
    set  hll
);
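Inserting data
-- Start from an empty hll so the updates below have a row to modify
INSERT INTO helloworld (id, set) VALUES (1, hll_empty());

-- Add a hashed integer to the set
UPDATE helloworld
SET set = hll_add(set, hll_hash_integer(12345))
WHERE id = 1;

-- Add a hashed string to the set
UPDATE helloworld
SET set = hll_add(set, hll_hash_text('hello world'))
WHERE id = 1;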
Real world
CREATE TABLE daily_uniques (
    date   date UNIQUE,
    users  hll
);
Real world
INSERT INTO daily_uniques(date, users)
    SELECT date, hll_add_agg(hll_hash_integer(user_id))
    FROM users
    GROUP BY 1;
Real world
SELECT
    EXTRACT(MONTH FROM date) AS month,
    hll_cardinality(hll_union_agg(users))
FROM daily_uniques
WHERE date >= '2012-01-01' AND
      date <  '2013-01-01'
GROUP BY 1;
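The monthly rollup depends on daily sketches being mergeable: uniques for a month are the union of the daily sets, not the sum of the daily counts. The Python sketch below illustrates that property conceptually. The register layout, the SHA-1 hash, and the names `registers_for` and `union` are assumptions for this example and do not reflect the extension's storage format.

```python
import hashlib

LOG2M = 11
M = 1 << LOG2M


def registers_for(values):
    """Build a HyperLogLog register array for one group (e.g. one day) of values."""
    registers = [0] * M
    for value in values:
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h & (M - 1)  # low bits choose the register
        rest = h >> LOG2M
        # 1-based position of the lowest set bit in the remaining bits.
        rank = (rest & -rest).bit_length() if rest else 64 - LOG2M + 1
        registers[index] = max(registers[index], rank)
    return registers


def union(a, b):
    """Merge two sketches by keeping the larger value in each register."""
    return [max(x, y) for x, y in zip(a, b)]


monday = registers_for(range(0, 6000))       # user ids seen on day one
tuesday = registers_for(range(4000, 10000))  # overlaps with day one
both_days = registers_for(range(0, 10000))   # raw events of both days together

merged = union(monday, tuesday)
print(merged == both_days)  # prints True
```

Because the merged registers match the ones built from the raw events, users who appear on both days are counted once. This is why storing one hll per day is enough to answer weekly, monthly, or yearly unique counts later.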
Good practices
‱ It uses UPDATE
‱ Load in batches in most cases
‱ Tweak the config
Tuning Parameters
‱ log2m - log base 2 of registers
‱ Between 4 and 17
‱ Each 1 increase doubles storage
‱ regwidth - bits per register
‱ expthresh - threshold for explicit vs sparse
‱ sparseon - on/off for sparse
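The storage effect of log2m and regwidth is simple arithmetic, sketched below in Python. The helper name `dense_hll_bytes` is hypothetical, the figure covers only the full register array (no header or other per-value overhead), and log2m = 11 with regwidth = 5 is assumed here as a typical setting.

```python
def dense_hll_bytes(log2m, regwidth):
    """Size of the full register array: 2**log2m registers of regwidth bits each."""
    registers = 1 << log2m
    return registers * regwidth // 8


for log2m in (10, 11, 12):
    # Each step up in log2m doubles the number of registers and the bytes used.
    print(log2m, dense_hll_bytes(log2m, regwidth=5))
# prints:
# 10 640
# 11 1280
# 12 2560
```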
Is it better?
1280 bytes
Estimates counts in the tens of billions
Few percent error
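The "few percent" figure is consistent with the usual HyperLogLog rule of thumb that the standard error is about 1.04 / sqrt(m) for m registers. The Python snippet below works that out for a few register counts. The function name is hypothetical and log2m = 11 is an assumed setting, chosen because 2048 five-bit registers occupy 1280 bytes.

```python
import math


def hll_relative_error(log2m):
    """Approximate standard error of a HyperLogLog estimate with 2**log2m registers."""
    return 1.04 / math.sqrt(1 << log2m)


for log2m in (4, 11, 17):
    print(log2m, f"{hll_relative_error(log2m):.2%}")
```

With log2m = 11 this comes to roughly 2.3%, and the error shrinks only with the square root of the register count, so halving it costs four times the storage.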
Resources
‱ https://github.com/aggregateknowledge/postgresql-hll
‱ http://blog.aggregateknowledge.com/2013/02/04/open-source-release-postgresql-hll/
‱ http://tapoueh.org/blog/2013/02/25-postgresql-hyperloglog
Questions
