Getting List of Unique Elements
Every so often you need to get a list of unique elements in some column. The standard way to do it is:
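-- assuming the test(test_field) table described below:
SELECT DISTINCT test_field FROM test;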
or
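SELECT test_field FROM test GROUP BY test_field;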
The only problem is that it's slow – it has to seq scan the whole table. Can it be done faster?
Usually when somebody wants to speed up such a query, (s)he is advised to add triggers which will keep the list of values in some side table – a dictionary.
This is nice, but a bit cumbersome, and relatively hard to do correctly. Let's see how it would work.
Table "public.test"
Column | Type | Modifiers
------------+---------+---------------------------------------------------
id | integer | not null default nextval('test_id_seq'::regclass)
test_field | text |
Indexes:
"test_pkey" PRIMARY KEY, btree (id)
# \d test_dictionary
        Table "public.test_dictionary"
    Column     |  Type   |     Modifiers
---------------+---------+--------------------
 test_field    | text    | not null
 element_count | integer | not null default 1
Indexes:
    "test_dictionary_pkey" PRIMARY KEY, btree (test_field)
Now, we will need 3 triggers (on insert, update and delete), but internally they will be using 2 specialized code blocks:
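They could look more or less like this (a sketch – the function names match the calls below, but details such as handling of two concurrent first inserts of the same value are simplified):

CREATE OR REPLACE FUNCTION add_to_dictionary(in_value TEXT) RETURNS void AS $$
BEGIN
    -- try to bump the counter of an existing dictionary entry
    UPDATE test_dictionary
        SET element_count = element_count + 1
        WHERE test_field = in_value;
    IF NOT FOUND THEN
        -- first occurrence of this value - create the entry
        -- (a concurrent first insert of the same value could still
        -- collide on the primary key; that case is left out of this sketch)
        INSERT INTO test_dictionary (test_field, element_count)
            VALUES (in_value, 1);
    END IF;
    RETURN;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION remove_from_dictionary(in_value TEXT) RETURNS void AS $$
DECLARE
    tmpint INT4;
BEGIN
    -- decrement the counter, remembering the new value
    UPDATE test_dictionary
        SET element_count = element_count - 1
        WHERE test_field = in_value
        RETURNING element_count INTO tmpint;
    IF NOT FOUND THEN
        RAISE EXCEPTION 'remove_from_dictionary() called with element that doesn''t exist in test_dictionary ?! [%]', in_value;
    END IF;
    -- last copy of the value is gone - drop the dictionary entry
    IF tmpint = 0 THEN
        DELETE FROM test_dictionary WHERE test_field = in_value;
    END IF;
    RETURN;
END;
$$ LANGUAGE plpgsql;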
# select add_to_dictionary('added');
add_to_dictionary
-------------------
(1 row)
# select add_to_dictionary('added');
add_to_dictionary
-------------------
(1 row)
# select add_to_dictionary('xxx');
add_to_dictionary
-------------------
(1 row)
Looks ok.
# select remove_from_dictionary('xxx');
remove_from_dictionary
------------------------
(1 row)
# select remove_from_dictionary('added');
remove_from_dictionary
------------------------
(1 row)
# select remove_from_dictionary('added');
remove_from_dictionary
------------------------
(1 row)
Last test – check if it will correctly fail when trying to remove an element that's not in the dictionary:
# select remove_from_dictionary('added');
ERROR: remove_from_dictionary() called with element that doesn't exist in test_dictionary ?! [added]
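And the triggers themselves – a sketch, with trigger and function names being my choice:

CREATE OR REPLACE FUNCTION test_insert_trigger() RETURNS trigger AS $$
BEGIN
    IF NEW.test_field IS NOT NULL THEN
        PERFORM add_to_dictionary(NEW.test_field);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION test_delete_trigger() RETURNS trigger AS $$
BEGIN
    IF OLD.test_field IS NOT NULL THEN
        PERFORM remove_from_dictionary(OLD.test_field);
    END IF;
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION test_update_trigger() RETURNS trigger AS $$
BEGIN
    IF OLD.test_field IS NOT DISTINCT FROM NEW.test_field THEN
        RETURN NEW;
    END IF;
    IF OLD.test_field IS NULL THEN
        PERFORM add_to_dictionary(NEW.test_field);
    ELSIF NEW.test_field IS NULL THEN
        PERFORM remove_from_dictionary(OLD.test_field);
    ELSIF NEW.test_field < OLD.test_field THEN
        -- touch dictionary rows in ascending order of value ...
        PERFORM add_to_dictionary(NEW.test_field);
        PERFORM remove_from_dictionary(OLD.test_field);
    ELSE
        -- ... so that concurrent updates always lock them in the same
        -- order - this is what prevents deadlocks
        PERFORM remove_from_dictionary(OLD.test_field);
        PERFORM add_to_dictionary(NEW.test_field);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER dictionary_i AFTER INSERT ON test FOR EACH ROW EXECUTE PROCEDURE test_insert_trigger();
CREATE TRIGGER dictionary_u AFTER UPDATE ON test FOR EACH ROW EXECUTE PROCEDURE test_update_trigger();
CREATE TRIGGER dictionary_d AFTER DELETE ON test FOR EACH ROW EXECUTE PROCEDURE test_delete_trigger();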
You might wonder why there is the last elsif and else – with mixed order of function calls. If you do wonder about it – please check this blogpost, especially the part where it explains the deadlock problem.
And the final step – we have to prefill the dictionary table. Like this:
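-- a sketch; assuming NULLs should not get dictionary entries:
INSERT INTO test_dictionary (test_field, element_count)
    SELECT test_field, count(*)
    FROM test
    WHERE test_field IS NOT NULL
    GROUP BY test_field;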
No surprise here.
Getting the list of elements now depends on the number of distinct elements, not on the number of rows in our main (test) table. The larger the source table, the faster (in comparison) access to the dictionary will be.
Of course we should check if the triggers work OK. Doing this on 1800 distinct values is complicated, so let's clear the table:
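For example with a plain DELETE – unlike TRUNCATE it fires the per-row triggers, so test_dictionary gets emptied along the way:

# delete from test;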
So, let's insert some rows, update some rows, and check if it's ok:
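For example (sample values are mine – any mix that exercises both code paths will do):

# insert into test (test_field) values ('a'), ('a'), ('b');
INSERT 0 3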
# update test set test_field = 'x' where id = (select min(id) from test where test_field = 'a');
UPDATE 1
Now, the above solution is great – it's relatively simple, keeps the dictionary updated, and is (as far as I can tell) deadlock free.
I was very recently in a situation where I needed to get the list of distinct values in some table.
Important facts:
- the table was large (millions of rows)
- the interesting column contained only a small number of distinct values
- the table was not updated very often
- I needed the list of values relatively rarely
Given the fact that the list of values hardly ever changes, and that I need it relatively rarely, I thought about getting the list in a bit different way.
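A test setup along these lines matches the numbers below (value set and index name are my choice):

CREATE TABLE test (
    id         serial PRIMARY KEY,
    test_field text
);
-- 20 distinct values ('a' .. 't'), each repeated 250000 times
INSERT INTO test (test_field)
    SELECT chr(97 + (i % 20))
    FROM generate_series(1, 5000000) AS i;
CREATE INDEX test_field_idx ON test (test_field);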
As you can see, we have 20 different values, each repeated 250000 times:
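# select count(distinct test_field), count(*) from test;
 count |  count
-------+---------
    20 | 5000000
(1 row)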
Now, getting the list of test_field values takes long, even when we have an index:
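(either of the queries from the beginning of the post will do, for example:)

# select distinct test_field from test;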
14-15 seconds. Not really cool. But we can write a simple function:
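Something along these lines (the function name is my choice – the trick is to use the index to jump from each value straight to the next bigger one, the "skip scan" mentioned in the comments below):

CREATE OR REPLACE FUNCTION get_unique_test_fields() RETURNS SETOF TEXT AS $$
DECLARE
    current_value TEXT;
BEGIN
    -- smallest value in the column - a single cheap index probe
    SELECT min(test_field) INTO current_value FROM test;
    WHILE current_value IS NOT NULL LOOP
        RETURN NEXT current_value;
        -- jump straight to the next distinct value, again via the index
        SELECT min(test_field) INTO current_value
            FROM test
            WHERE test_field > current_value;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;

# select * from get_unique_test_fields();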
And time? With only 20 distinct values the function performs just 21 cheap index probes, so it returns in milliseconds instead of 14-15 seconds.
Interesting facts:
- the function does one index probe per distinct value (plus one final probe that finds nothing), so its runtime depends on the number of distinct values, not on the number of rows in the table
- since min() ignores NULLs, NULL will never show up in the result
1. gregj says:
2009-07-10 at 12:42
About the first example: doing a count like that at the default transaction isolation level is dangerous.
In reality you would require all transactions on that table to be serializable, which is true for any concurrent math operations.
2. Mac says:
2009-07-10 at 13:37
What I’m really looking for is a DynaEnum or AutoEnum datatype which supports the same features as an enum, plus:
For instance in the case of Credit Card transactions processing I would have the following:
-> this would automatically create 2 AutoEnums, one with (‘VISA’, ‘AMEX’) and the other one with (‘OK’, ‘DENIED’);
The storage need for those 2 enums would probably be really small.
I have many tables which would benefit from this. Especially tables which are used to log events, where the number of possible values is small but not necessarily known beforehand – typically error messages.
2009-07-10 at 17:48
Yeah I was thinking about this yesterday. I was reading that in 8.4 the optimizer can use bitmap indexes internally but you still can’t create a bitmap index
on a table. The bitmap index is ideal for these high volume, low cardinality tables.
But I think we could actually mimic the behavior with a table that stored the index name and an array of values. As each row is indexed, it would look up
the position of the value in the array and append it to the array if not found. (I’m guessing this is pretty close to how the enum type works internally.)
You could use that approach to make the above mentioned AutoEnum type (DynaEnum sounds like it might blow up on you) and for bitmap indexes.
4. Mac says:
2009-07-10 at 23:22
Yeah, the approach of a table would work… but it's not backwards compatible with legacy code, it requires a code rewrite… and it's just cumbersome. And it makes the whole DB schema more complex for not much benefit.
5. alvherre says:
2009-07-13 at 21:16
I think what you implemented in plpgsql in your last solution is called “skip scan” or something like that. I think this is something that should be
considered in the optimizer — TODO for 8.5?
6. depesz says:
2009-07-13 at 21:18
@alvherre:
I would *love* to see it in optimizer, but I’m definitely not the right person to ask about being in TODO – my C skills are next to none.
7. Jeff Davis says:
2009-07-15 at 02:34
The function remove_from_dictionary() appears unsafe. After “tmpint” is set, and before the DELETE is executed, the item may be added by some
concurrent process. You may be able to make it safe by adding a “WHERE element_count = 0” to the DELETE.
8. depesz says:
2009-07-16 at 07:50
@Jeff Davis:
I’m not sure. Wouldn’t the UPDATE obtain a lock on the row? So the concurrent addition would have to wait for transaction end.
9. Thomas says:
2009-07-25 at 22:00
If you have only a few distinct values, wouldn’t it be most efficient to query pg_stats for the histogram of the column?
That requires no additional coding and should be quite accurate assuming statistics_target is high enough.
Thomas
10. depesz says:
2009-07-25 at 22:04
@Thomas:
great idea, with 2 small problems:
1. statistics can be (and usually are) not up to date
2. it would require the number of values to be *really* low, in terms of absolute numbers. The method I showed in the post works well for a number of values that is low in relation to the number of rows in the table, i.e. it will work quite well for 10000 values in a 1 million row table.
11. Thomas says:
2009-07-25 at 22:59
re 1) yes, I am aware of that, but in your example you said the table was not updated very often. If “not very” is something like once a week, then the
histogram can probably be used without problems. Autovacuum should take care of that.
re 2) right. I was thinking about an absolute number (not more than 100)