0% found this document useful (0 votes)

9 views7 pages

Getting List of Unique Elements

Uploaded by

Clailson

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

9 views7 pages

Getting List of Unique Elements

Uploaded by

Clailson

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 7

Getting list of unique elements

Every so often you need to get list of unique elements in some column. The standard way to do it is:

select distinct column from table;

select column from table group by column;

The only problem is that it's slow – as it has to seq scan whole table. Can it be done faster?

Usually when somebody wants to speed up such query, (s)he is advised to add triggers which will keep list of values in some side table – dictionary.

This is nice, but a bit cumbersome, and relatively hard to do correctly. Let's see how it would work.

Let's assume we have this table:

Now, let's insert to it some data:

# insert into test (test_field)

select proname::text from pg_proc, generate_series(1,10) where random() < 0.33;
INSERT 0 7480

Now, let's see how the data looks:

# select count(*), count(distinct test_field) from test;

count | count
-------+-------
7480 | 1821
(1 row)

Now, to get list of all possible test_field values, I do:

# explain analyze select distinct test_field from test;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------
HashAggregate (cost=135.50..153.71 rows=1821 width=12) (actual time=38.275..42.255 rows=1821 loops=1)
-> Seq Scan on test (cost=0.00..116.80 rows=7480 width=12) (actual time=0.060..17.386 rows=7480 loops=1)
Total runtime: 45.635 ms
(3 rows)

( I tested it several times, and it generally runs in 44-47 ms )

So, let's add our dictionary table, and triggers.

Now, we will need 3 triggers (on insert, update and delete), but internally they will be using 2 specialized code blocks:

 add new elements to dictionary, or increment element_count

 decrement element_count, or remove element from dictionary

Because of this we will start with 2 simple functions:

CREATE OR REPLACE FUNCTION add_to_dictionary(use_value TEXT) RETURNS void as $$

DECLARE
BEGIN
LOOP
UPDATE test_dictionary SET element_count = element_count + 1 WHERE test_field = use_value;
IF found THEN
RETURN;
END IF;
BEGIN
INSERT INTO test_dictionary(test_field) VALUES (use_value);
RETURN;
EXCEPTION WHEN unique_violation THEN
-- do nothing, and loop to try the UPDATE again
END;
END LOOP;
END;
$$ language plpgsql;

(the code is taken from the manual).

CREATE OR REPLACE FUNCTION remove_from_dictionary(use_value TEXT) RETURNS void as $$
DECLARE
tempint int4;
BEGIN
UPDATE test_dictionary SET element_count = element_count - 1 WHERE test_field = use_value RETURNING element_count INTO
tempint;
IF NOT FOUND THEN
raise exception 'remove_from_dictionary() called with element that doesn''t exist in test_dictionary ?! [%]', use_value;
END IF;
IF tempint = 0 THEN
DELETE FROM test_dictionary WHERE test_field = use_value;
END IF;
RETURN;
END;
$$ language plpgsql;

Now, let's test if the functions work:

First, the table is empty:

# select * from test_dictionary ;

test_field | element_count
------------+---------------
(0 rows)

Now, let's add some values:

# select add_to_dictionary('added');
add_to_dictionary
-------------------

(1 row)

# select add_to_dictionary('added');
add_to_dictionary
-------------------

(1 row)

# select add_to_dictionary('xxx');
add_to_dictionary
-------------------

(1 row)

And let's see how it looks:

# select * from test_dictionary ;

test_field | element_count
------------+---------------
added | 2
xxx | 1
(2 rows)

Looks ok.

Now, let's remove the elements:

# select remove_from_dictionary('xxx');
remove_from_dictionary
------------------------

(1 row)

# select * from test_dictionary ;

test_field | element_count
------------+---------------
added | 2
(1 row)

# select remove_from_dictionary('added');
remove_from_dictionary
------------------------

(1 row)

# select * from test_dictionary ;

test_field | element_count
------------+---------------
added | 1
(1 row)

# select remove_from_dictionary('added');
remove_from_dictionary
------------------------

(1 row)

# select * from test_dictionary ;

test_field | element_count
------------+---------------
(0 rows)

Last test – check if it will correctly fail when trying to remove element not in dictionary:
# select remove_from_dictionary('added');
ERROR: remove_from_dictionary() called with element that doesn't exist in test_dictionary ?! [added]

All works. Now for the triggers.

Insert and delete triggers are trivial:

CREATE OR REPLACE FUNCTION test_dictionary_i() RETURNS TRIGGER AS

$BODY$
BEGIN
IF NEW.test_field IS NOT NULL THEN
perform add_to_dictionary(NEW.test_field);
END IF;
RETURN NEW;
END;
$BODY$
LANGUAGE 'plpgsql';
CREATE TRIGGER test_dictionary_i AFTER INSERT ON test FOR EACH ROW EXECUTE PROCEDURE test_dictionary_i();
CREATE OR REPLACE FUNCTION test_dictionary_d() RETURNS TRIGGER AS
$BODY$
BEGIN
IF OLD.test_field IS NOT NULL THEN
perform remove_from_dictionary(OLD.test_field);
END IF;
RETURN OLD;
END;
$BODY$
LANGUAGE 'plpgsql';
CREATE TRIGGER test_dictionary_d AFTER DELETE ON test FOR EACH ROW EXECUTE PROCEDURE test_dictionary_d();

Update is a bit more complex:

CREATE OR REPLACE FUNCTION test_dictionary_u() RETURNS TRIGGER AS

$BODY$
BEGIN
IF NOT (OLD.test_field IS DISTINCT FROM NEW.test_field) THEN
RETURN NEW;
END IF;
IF OLD.test_field IS NULL THEN
-- it means that NEW.test_field IS NOT NULL
perform add_to_dictionary(NEW.test_field);
ELSIF NEW.test_field IS NULL THEN
-- it means that OLD.test_field IS NOT NULL
perform remove_from_dictionary(OLD.test_field);
ELSIF OLD.test_field < NEW.test_field THEN
perform remove_from_dictionary(OLD.test_field);
perform add_to_dictionary(NEW.test_field);
ELSE
perform add_to_dictionary(NEW.test_field);
perform remove_from_dictionary(OLD.test_field);
END IF;
RETURN NEW;
END;
$BODY$
LANGUAGE 'plpgsql';
CREATE TRIGGER test_dictionary_u AFTER UPDATE ON test FOR EACH ROW EXECUTE PROCEDURE test_dictionary_u();

You might wonder why there is the last elsif and else – with mixed order of function calls. If you do wonder about it – please check this blogpost, especially the
part where it explains deadlock problem.

And the final step – we have to prefill dictionary table. Like this:

insert into test_dictionary (test_field, element_count)

select test_field, count(*) from test where test_field is not null group by test_field;

And that's basically all.

First of all, let's check if it is faster:

# explain analyze select test_field from test_dictionary;

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Seq Scan on test_dictionary (cost=0.00..24.53 rows=1353 width=32) (actual time=0.023..4.042 rows=1821 loops=1)
Total runtime: 7.412 ms
(2 rows)

No surprise here.

Getting the list of elements now depends on count of distinct elements, and not count of all rows in our main (test) table.

The larger source table – the faster (in comparison) access to dictionary will be.

Of course we should check if the triggers work ok. doing this on 1800 distinct values is complicated, so let's clear the table:

# delete from test;

DELETE 7480

Interesting – after this delete, test_dictionary should be empty. Is it?

# select count(*) from test_dictionary ;

count
-------
0
(1 row)

So, let's insert some rows, update some rows, and check if it's ok:

# insert into test (test_field) values ('a'),('b'),('c'),('a'),('e');

INSERT 0 5

# update test set test_field = 'x' where id = (select min(id) from test where test_field = 'a');
UPDATE 1

# update test set test_field = 'x' where test_field = 'e';

UPDATE 1

# select * from test;

id | test_field
------+------------
7482 | b
7483 | c
7484 | a
7481 | x
7485 | x
(5 rows)

# select * from test_dictionary ;

test_field | element_count
------------+---------------
b | 1
c | 1
a | 1
x | 2
(4 rows)

Looks good to me.

Now, above solution is great – it's relatively simple, keeps the dictionary updates, is (as far as i can tell) deadlock free.

But sometimes you just need something a bit different.

I was very recently in situation when I needed to get list of distinct values in some table.

Important facts:

 table was ~ 5 milion rows

 number of distinct values was ~ 20
 new values hardly ever occur
 there was index on the column that I wanted to get distinct values from

Given the fact that the list of fields hardly ever changes, and that I need the list of values relatively rarely, I thought about getting the list in a bit different way.

First, let's simulate my test table:

# create table test ( id serial primary key, test_field text );

(I cleaned previous test table, so the create actually worked

Now, let's insert some data:

# insert into test (test_field) select chr(i) from generate_series(65,84) i, generate_series(1,250000) x;

As you can see we have 20 different values, each repeated 250000 times:

# select test_field, count(*) from test group by test_field;

test_field | count
------------+--------
L | 250000
J | 250000
I | 250000
D | 250000
F | 250000
K | 250000
H | 250000
N | 250000
R | 250000
B | 250000
M | 250000
G | 250000
E | 250000
C | 250000
S | 250000
P | 250000
T | 250000
O | 250000
A | 250000
Q | 250000
(20 rows)

Now, getting list of test_field values is long, even when we have index:

# create index q on test (test_field);

CREATE INDEX
# analyze test;
ANALYZE

# explain analyze select distinct test_field from test;

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.00..153523.69 rows=20 width=2) (actual time=0.141..15163.846 rows=20 loops=1)
-> Index Scan using q on test (cost=0.00..141023.69 rows=5000000 width=2) (actual time=0.135..8337.897 rows=5000000 loops=1)
Total runtime: 15163.943 ms
(3 rows)

# explain analyze select test_field from test group by test_field;

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=84624.00..84624.20 rows=20 width=2) (actual time=14552.031..14552.061 rows=20 loops=1)
-> Seq Scan on test (cost=0.00..72124.00 rows=5000000 width=2) (actual time=0.043..6734.036 rows=5000000 loops=1)
Total runtime: 14552.155 ms
(3 rows)

14-15 seconds. Not really cool. But, we can write a simple function:

CREATE OR REPLACE FUNCTION get_list_of_unique_elements() RETURNS setof TEXT as $$

DECLARE
previous TEXT;
temptext TEXT;
BEGIN
SELECT min(test_field) INTO temptext FROM test;
IF NOT FOUND THEN
RETURN;
END IF;
previous := temptext;
RETURN NEXT previous;
LOOP
SELECT min(test_field) INTO temptext FROM test WHERE test_field > previous;
IF NOT FOUND OR temptext IS NULL THEN
RETURN;
END IF;
previous := temptext;
RETURN NEXT previous;
END LOOP;
END;
$$ language plpgsql;

And how it works?

# select * from get_list_of_unique_elements();

get_list_of_unique_elements
-----------------------------
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
(20 rows)

And time?

# explain analyze select * from get_list_of_unique_elements();

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Function Scan on get_list_of_unique_elements (cost=0.00..260.00 rows=1000 width=32) (actual time=4.611..4.650 rows=20 loops=1)
Total runtime: 4.724 ms
(2 rows)

Interesting facts:

 This function will return elements in sorted order

 It should be faster than distinct/group by, by only as long as you have relatively low number of distinct values. My guess is that it's worth using for
situations where you have over 10 times as many rows as unique values
 It is possible to write the function so that it will accept table/column name as argument (using plpgsql EXECUTE), but it's not-trivial, if the column is not
of TEXT type.

Posted on 2009-07-10|Tags distinct, plpgsql, postgresql, unique|

11 thoughts on “Getting list of unique elements”

1. gregj says:
2009-07-10 at 12:42

about the first example, doing count like that on default transactional level is dangerous.
In reality you would require all transactions on that table to be serializable, which is true for any concurrent math operations.

2. Mac says:

2009-07-10 at 13:37

Thanks for this hint.

What I’m really looking for is a DynaEnum or AutoEnum datatype which supports the same features as an enum, plus:

– Behaves strictly like a varchar in all aspects…

– Automatically creates new values when needed – up to a limited number of new values (a few hundreds / thousands would probably be OK)
– Takes much less space
– Gives access to an quick function to count the distinct values

For instance in the case of Credit Card transactions processing I would have the following:

CREATE TABLE CreditCard (brand AUTOENUM, […], transactionOutcome AUTOENUM);

INSERT INTO CreditCardTransaction VALUES (‘VISA’, …, ‘OK’);

INSERT INTO CreditCardTransaction VALUES (‘VISA’, …, ‘OK’);
INSERT INTO CreditCardTransaction VALUES (‘AMEX’, …, ‘DENIED’);

-> this would automatically create 2 AutoEnums, one with (‘VISA’, ‘AMEX’) and the other one with (‘OK’, ‘DENIED’);

The storage need for those 2 enums would probably be really small.

I have many tables which would benefit from this. Especially tables which are used to log events, where the number of possible values is small but not
necessarily known beforehand – typically error messages -.

3. Scott Bailey says:

2009-07-10 at 17:48

Yeah I was thinking about this yesterday. I was reading that in 8.4 the optimizer can use bitmap indexes internally but you still can’t create a bitmap index
on a table. The bitmap index is ideal for these high volume, low cardinality tables.

But I think we could actually mimic the behavior with a table that stored the index name and an array of values. As each row is indexed, it would look up
the position of the value in the array and append it to the array if not found. (I’m guessing this is pretty close to how the enum type works internally.)

You could use that approach to make the above mentioned AutoEnum type (DynaEnum sounds like it might blow up on you) and for bitmap indexes.

4. Mac says:

2009-07-10 at 23:22

Yeah, the approach of a table would work… but It’s not backwards compatible with legacy code, it requires code rewrite… and it’s just cumbersome. And
it makes the whole DB schema more complex for not much benefit.

5. alvherre says:

2009-07-13 at 21:16

I think what you implemented in plpgsql in your last solution is called “skip scan” or something like that. I think this is something that should be
considered in the optimizer — TODO for 8.5?

6. depesz says:

2009-07-13 at 21:18

@alvherre:
I would *love* to see it in optimizer, but I’m definitely not the right person to ask about being in TODO – my C skills are next to none.

7. Jeff Davis says:

2009-07-15 at 02:34

The function remove_from_dictionary() appears unsafe. After “tmpint” is set, and before the DELETE is executed, the item may be added by some
concurrent process. You may be able to make it safe by adding a “WHERE element_count = 0” to the DELETE.
8. depesz says:

2009-07-16 at 07:50

@Jeff Davis:
I’m not sure. Wouldn’t UPDATE obtain lock on the row? So the concurrent addition would have to wait for transaction end.

9. Thomas says:

2009-07-25 at 22:00

If you have only a few distinct values, wouldn’t it be most efficient to query pg_stats for the histogram of the column?

That requires no additional coding and should be quite accurate assuming statistics_target is high enough.

Thomas

10. depesz says:

2009-07-25 at 22:04

@Thomas:
great idea, with 2 small problems:
1. statistics can (and usually are) not up to date
2. it would require the number of values to be *really* low. in terms of absolute numbers. the method I showed in the post works well for low number of
values but in relation to number of rows in table. i.e. .it will work quite well for 10000 values in 1 million row table.

11. Thomas says:

2009-07-25 at 22:59

re 1) yes, I am aware of that, but in your example you said the table was not updated very often. If “not very” is something like once a week, then the
histogram can probably be used without problems. Autovacuum should take care of that.

re 2) right. I was thinking about an absolute number (not more than 100)

Comments are closed.

Dbms Unit 3
No ratings yet
Dbms Unit 3
46 pages
DBMS Record
No ratings yet
DBMS Record
88 pages
DBMS - Sarthak Luthra
No ratings yet
DBMS - Sarthak Luthra
31 pages
PL SQL Cheatsheet
100% (3)
PL SQL Cheatsheet
3 pages
DDM Ii Year Print Out
No ratings yet
DDM Ii Year Print Out
41 pages
DBMS - All Pracs
No ratings yet
DBMS - All Pracs
42 pages
Non-Relational Postgres
No ratings yet
Non-Relational Postgres
71 pages
DBMS 5 To 8
No ratings yet
DBMS 5 To 8
16 pages
IE 171 Lecture 7 - SQL (Part 1)
No ratings yet
IE 171 Lecture 7 - SQL (Part 1)
42 pages
Adbms Front TR
No ratings yet
Adbms Front TR
40 pages
DBMS Lab Record
No ratings yet
DBMS Lab Record
34 pages
Database Systems
No ratings yet
Database Systems
46 pages
PLSQL Answers Set2 Full-1
No ratings yet
PLSQL Answers Set2 Full-1
9 pages
RDBMS Lab Record-IV Sem-1
No ratings yet
RDBMS Lab Record-IV Sem-1
39 pages
Dbms 2
No ratings yet
Dbms 2
9 pages
DBMS Lab Report
No ratings yet
DBMS Lab Report
20 pages
Fatima Code
No ratings yet
Fatima Code
24 pages
Algorithm and Programs
No ratings yet
Algorithm and Programs
21 pages
RDBMS Lab Record-IV Sem
No ratings yet
RDBMS Lab Record-IV Sem
39 pages
Cs 502 Lab Manual Final
No ratings yet
Cs 502 Lab Manual Final
27 pages
PostgreSQL CHEAT SHEET
No ratings yet
PostgreSQL CHEAT SHEET
8 pages
Sql Plsql Oracle
From Everand
Sql Plsql Oracle
Andrew Igla
No ratings yet
Very Short Notes
No ratings yet
Very Short Notes
13 pages
Data Visualization Python Tutorial
100% (1)
Data Visualization Python Tutorial
9 pages
Cassandra Query Language by Examples - Puzzles with Answers
From Everand
Cassandra Query Language by Examples - Puzzles with Answers
Cristian Scutaru
No ratings yet
Dbms
No ratings yet
Dbms
29 pages
React Interview Questions
No ratings yet
React Interview Questions
109 pages
Database1 Final Revision ٠٤٥٢٢٤
100% (1)
Database1 Final Revision ٠٤٥٢٢٤
14 pages
Test 2 - Teachers
No ratings yet
Test 2 - Teachers
8 pages
Plcopen Coding Guidelines Version 1.0
No ratings yet
Plcopen Coding Guidelines Version 1.0
127 pages
Redux Documentation 2016-12-06
No ratings yet
Redux Documentation 2016-12-06
322 pages
DBMS
No ratings yet
DBMS
15 pages
Profound Python Data Science
From Everand
Profound Python Data Science
Onder Teker
No ratings yet
CS3481 All 10 Experiments
No ratings yet
CS3481 All 10 Experiments
5 pages
DBMS Lab File-1
No ratings yet
DBMS Lab File-1
15 pages
DBMS Question Bank Answers
No ratings yet
DBMS Question Bank Answers
3 pages
Oracle SQL and PLSQL Topics
No ratings yet
Oracle SQL and PLSQL Topics
11 pages
Normalization
No ratings yet
Normalization
42 pages
Points To Remember - DB
No ratings yet
Points To Remember - DB
9 pages
Dbms Exp9
No ratings yet
Dbms Exp9
6 pages
Learn Programming Using C#
From Everand
Learn Programming Using C#
Taurius Litvinavicius
No ratings yet
Refresher
No ratings yet
Refresher
10 pages
SQL Experiment Ans
No ratings yet
SQL Experiment Ans
16 pages
Honeywell Placement Papers
No ratings yet
Honeywell Placement Papers
29 pages
Dbms Lab-1
No ratings yet
Dbms Lab-1
17 pages
Oracle SQL and PL/SQL
From Everand
Oracle SQL and PL/SQL
Niraj Gupta
4.5/5 (8)
SQL
No ratings yet
SQL
9 pages
SQL Server: Tips and Tricks - 2
From Everand
SQL Server: Tips and Tricks - 2
Priyanka Agarwal
4.5/5 (3)
Normalization-Simple Queries
No ratings yet
Normalization-Simple Queries
2 pages
SQL Tutorial PDF
No ratings yet
SQL Tutorial PDF
31 pages
Unit 1
No ratings yet
Unit 1
85 pages
Data Bases Cheatsheet
No ratings yet
Data Bases Cheatsheet
2 pages
SQL Note
No ratings yet
SQL Note
3 pages
Database Management System QB
No ratings yet
Database Management System QB
5 pages
Excel Techniques
From Everand
Excel Techniques
Online Trainees
2/5 (1)
SVG++ Documentation: Release 1.0
No ratings yet
SVG++ Documentation: Release 1.0
53 pages
Decode
100% (1)
Decode
9 pages
DBMS Lab Manual
From Everand
DBMS Lab Manual
Jitendra Patel
1.5/5 (3)
SQL Syntax
No ratings yet
SQL Syntax
2 pages
Structured Query Language
No ratings yet
Structured Query Language
13 pages
SQL Interview Questions and Answers: What Is SQL and Where Does It Come From?
No ratings yet
SQL Interview Questions and Answers: What Is SQL and Where Does It Come From?
9 pages
EXISTS Conditions: An Condition Tests For Existence of Rows in A Subquery
No ratings yet
EXISTS Conditions: An Condition Tests For Existence of Rows in A Subquery
5 pages
Introduction To Computation and Problem Solving
No ratings yet
Introduction To Computation and Problem Solving
20 pages
PL SQL
No ratings yet
PL SQL
4 pages
Amazing Java: Learn Java Quickly
From Everand
Amazing Java: Learn Java Quickly
Andrei Besedin
No ratings yet
Student Result Management System: BY Kazi Hasnayeen Emad ID: 191902025
No ratings yet
Student Result Management System: BY Kazi Hasnayeen Emad ID: 191902025
16 pages
File Operations in Java 11
No ratings yet
File Operations in Java 11
9 pages
Advanced SAS Interview Questions You'll Most Likely Be Asked
From Everand
Advanced SAS Interview Questions You'll Most Likely Be Asked
Vibrant Publishers
No ratings yet
Aditya Institute of Technology and Management: Tekkali
No ratings yet
Aditya Institute of Technology and Management: Tekkali
64 pages
Java Server Faces
No ratings yet
Java Server Faces
32 pages
UNIX Shell Programming Interview Questions You'll Most Likely Be Asked
From Everand
UNIX Shell Programming Interview Questions You'll Most Likely Be Asked
Vibrant Publishers
No ratings yet
Red Beans
No ratings yet
Red Beans
126 pages
Library Management System Final Report
No ratings yet
Library Management System Final Report
38 pages
Oracle PLSQL Quick Reference Card
No ratings yet
Oracle PLSQL Quick Reference Card
3 pages
Notes 1
No ratings yet
Notes 1
20 pages
Big M Method
0% (1)
Big M Method
28 pages
Python DATA VISUALIZATION
No ratings yet
Python DATA VISUALIZATION
31 pages
SQL Cheat
No ratings yet
SQL Cheat
1 page
Raghava New
No ratings yet
Raghava New
3 pages
ch-3 Flow of Control, Types of Statement, If, If Else, If Elifladder, Nested If, Algorithm, Flowchart, Pseudocode
No ratings yet
ch-3 Flow of Control, Types of Statement, If, If Else, If Elifladder, Nested If, Algorithm, Flowchart, Pseudocode
13 pages
Terraform Notes by Tahira (CEO of DevOps Mastery Hub)
No ratings yet
Terraform Notes by Tahira (CEO of DevOps Mastery Hub)
13 pages
2023 11 13 CBCS Bap 23 Sem.i, Iii, V
No ratings yet
2023 11 13 CBCS Bap 23 Sem.i, Iii, V
18 pages
Abstract - AOAU ZOSO F083 W
No ratings yet
Abstract - AOAU ZOSO F083 W
3 pages
Abstract Method in Class: Prof. Kumar Anand Singh
No ratings yet
Abstract Method in Class: Prof. Kumar Anand Singh
8 pages
CS3271 - Laboratory Exercise List
No ratings yet
CS3271 - Laboratory Exercise List
3 pages
Brute Force Approach
No ratings yet
Brute Force Approach
3 pages
XI IP Sample Paper
No ratings yet
XI IP Sample Paper
2 pages
116 BSCS 19
No ratings yet
116 BSCS 19
3 pages
(Ebook) Introduction To Programming With C++ by Y Daniel Liang ISBN 9780273793243, 0273793241 Instant Download
100% (6)
(Ebook) Introduction To Programming With C++ by Y Daniel Liang ISBN 9780273793243, 0273793241 Instant Download
47 pages
Computer Network Assignment Help
No ratings yet
Computer Network Assignment Help
8 pages

Getting List of Unique Elements

Uploaded by

Getting List of Unique Elements

Uploaded by

Getting list of unique elements

select distinct column from table;

select column from table group by column;

Let's assume we have this table:

Now, let's insert to it some data:

# insert into test (test_field)

Now, let's see how the data looks:

# select count(*), count(distinct test_field) from test;

Now, to get list of all possible test_field values, I do:

# explain analyze select distinct test_field from test;

( I tested it several times, and it generally runs in 44-47 ms )

So, let's add our dictionary table, and triggers.

 add new elements to dictionary, or increment element_count

Because of this we will start with 2 simple functions:

CREATE OR REPLACE FUNCTION add_to_dictionary(use_value TEXT) RETURNS void as $$

(the code is taken from the manual).

Now, let's test if the functions work:

First, the table is empty:

# select * from test_dictionary ;

Now, let's add some values:

And let's see how it looks:

# select * from test_dictionary ;

Now, let's remove the elements:

# select * from test_dictionary ;

# select * from test_dictionary ;

# select * from test_dictionary ;

All works. Now for the triggers.

Insert and delete triggers are trivial:

CREATE OR REPLACE FUNCTION test_dictionary_i() RETURNS TRIGGER AS

Update is a bit more complex:

CREATE OR REPLACE FUNCTION test_dictionary_u() RETURNS TRIGGER AS

insert into test_dictionary (test_field, element_count)

And that's basically all.

First of all, let's check if it is faster:

# explain analyze select test_field from test_dictionary;

# delete from test;

Interesting – after this delete, test_dictionary should be empty. Is it?

# select count(*) from test_dictionary ;

# insert into test (test_field) values ('a'),('b'),('c'),('a'),('e');

# update test set test_field = 'x' where test_field = 'e';

# select * from test;

# select * from test_dictionary ;

Looks good to me.

But sometimes you just need something a bit different.

 table was ~ 5 milion rows

First, let's simulate my test table:

# create table test ( id serial primary key, test_field text );

(I cleaned previous test table, so the create actually worked

Now, let's insert some data:

# insert into test (test_field) select chr(i) from generate_series(65,84) i, generate_series(1,250000) x;

# select test_field, count(*) from test group by test_field;

# create index q on test (test_field);

# explain analyze select distinct test_field from test;

# explain analyze select test_field from test group by test_field;

CREATE OR REPLACE FUNCTION get_list_of_unique_elements() RETURNS setof TEXT as $$

And how it works?

# select * from get_list_of_unique_elements();

# explain analyze select * from get_list_of_unique_elements();

 This function will return elements in sorted order

Posted on 2009-07-10|Tags distinct, plpgsql, postgresql, unique|

11 thoughts on “Getting list of unique elements”

Thanks for this hint.

– Behaves strictly like a varchar in all aspects…

CREATE TABLE CreditCard (brand AUTOENUM, […], transactionOutcome AUTOENUM);

INSERT INTO CreditCardTransaction VALUES (‘VISA’, …, ‘OK’);

3. Scott Bailey says:

7. Jeff Davis says:

10. depesz says:

11. Thomas says:

Comments are closed.

You might also like