Text Search and Pattern Matching in PostgreSQL
Overview by Method
Use Cases
Questions
Joe Conway
[email protected]
Crunchy Data
Where’s Waldo?
Agenda
Summary of methods
Overview by method
Example use cases
Caveats
Note on Extensions
Sample Data
LIKE Syntax
Negating LIKE
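A minimal sketch of LIKE and its negation, assuming the same general form as the SIMILAR TO syntax shown later in the deck. The string literals are made up for illustration.

```sql
-- General form:
--   string LIKE pattern [ESCAPE escape-character]
--   string NOT LIKE pattern [ESCAPE escape-character]
SELECT 'Joe Conway' LIKE 'Joe%'     AS starts_with_joe,      -- true
       'Joe Conway' NOT LIKE 'Joe%' AS not_starts_with_joe,  -- false
       'Joe Conway' LIKE 'Joe'      AS whole_string_only;    -- false: the pattern must cover the entire string
```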
Wildcards
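As a short illustration of the two LIKE wildcards (percent and underscore), the following sketch uses throwaway literals rather than the deck's sample data.

```sql
-- %  matches any sequence of zero or more characters
-- _  matches exactly one character
SELECT 'abc' LIKE 'a%'  AS percent_any,   -- true
       'abc' LIKE '_b_' AS underscores,   -- true
       'abc' LIKE 'a_'  AS too_short,     -- false: pattern allows only two characters
       'abc' LIKE 'c'   AS no_wildcard;   -- false: no wildcard, so the whole string must equal 'c'
```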
ESCAPE Example
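One way to match a literal % or _ is sketched below. It assumes standard_conforming_strings is on (the default in current releases), so a backslash inside a string literal is passed through unchanged; the '#' escape character is an arbitrary choice for illustration.

```sql
SELECT '100%' LIKE '100\%'            AS default_escape,  -- true: backslash is the default escape character
       '1000' LIKE '100\%'            AS literal_percent, -- false: % is no longer a wildcard
       '100%' LIKE '100#%' ESCAPE '#' AS custom_escape,   -- true
       'a_b'  LIKE 'a#_b'  ESCAPE '#' AS literal_under,   -- true
       'axb'  LIKE 'a#_b'  ESCAPE '#' AS not_a_wildcard;  -- false: _ must match itself
```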
SIMILAR TO Syntax
Similar to LIKE
Interprets pattern using the SQL standard's definition of a regular expression
string SIMILAR TO pattern [ESCAPE escape-character]
string NOT SIMILAR TO pattern [ESCAPE escape-character]
Wildcards
Same as LIKE
Also supports meta-characters borrowed from POSIX REs
pipe ("|"): either of two alternatives
asterisk ("*"): repetition >= 0 times
plus ("+"): repetition >= 1 time
question mark ("?"): repetition 0 or 1 time
"{m}": repetition exactly m times
"{m,}": repetition >= m times
"{m,n}": repetition >= m and <= n times
parentheses ("()"): group items into a single logical item
SIMILAR TO Examples
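The meta-characters listed above can be exercised with a few literal comparisons. This is a sketch with made-up strings, not the deck's original example.

```sql
SELECT 'abc'  SIMILAR TO 'abc'      AS exact,         -- true
       'abc'  SIMILAR TO 'a'        AS partial,       -- false: pattern must match the whole string
       'abc'  SIMILAR TO '%(b|d)%'  AS alternation,   -- true
       'abc'  SIMILAR TO '(b|c)%'   AS wrong_start,   -- false
       'abbb' SIMILAR TO 'ab+'      AS one_or_more,   -- true
       'a'    SIMILAR TO 'ab*'      AS zero_or_more,  -- true
       'aaa'  SIMILAR TO 'a{3}'     AS exactly_three, -- true
       'aaa'  SIMILAR TO 'a{2}'     AS exactly_two;   -- false
```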
Soundex
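A minimal sketch of Soundex, assuming the fuzzystrmatch extension is available to install. The names are illustrative only.

```sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- soundex() maps a string to a four-character code;
-- difference() counts matching code positions (0 = none, 4 = all).
SELECT soundex('Anne')           AS anne,   -- A500
       soundex('Ann')            AS ann,    -- A500
       difference('Anne', 'Ann') AS diff;   -- 4
```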
Levenshtein
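A minimal sketch of Levenshtein edit distance, again assuming the fuzzystrmatch extension. The input strings are illustrative.

```sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Number of single-character inserts, deletes, and substitutions
-- needed to turn the first string into the second.
SELECT levenshtein('GUMBO', 'GAMBOL')          AS default_costs,  -- 2
       levenshtein('GUMBO', 'GAMBOL', 2, 1, 1) AS custom_costs,   -- 3 (insert cost 2, delete 1, substitute 1)
       levenshtein('Joseph', 'Josef')          AS name_distance;  -- 2 (p -> f, drop h)
```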
Metaphone
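A minimal sketch of Metaphone, assuming the fuzzystrmatch extension. The second argument caps the length of the returned code.

```sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- metaphone(source, max_output_length) returns a phonetic code.
SELECT metaphone('GUMBO', 4) AS code;  -- KM
```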
Double Metaphone
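A minimal sketch of Double Metaphone, assuming the fuzzystrmatch extension. dmetaphone() returns the primary code and dmetaphone_alt() the alternate one; the Fuzzy use case later in the deck shows 'josef konwei' reduced to 'JSFK'.

```sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

SELECT dmetaphone('gumbo')            AS primary_code,   -- KMP
       dmetaphone_alt('gumbo')        AS alternate_code,
       dmetaphone('josef konwei')     AS query_code,
       dmetaphone('Joseph Conway')    AS candidate_code;
```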
Trigram Matching
\timing
SELECT set_limit(0.6); -- defaults to 0.3
SELECT DISTINCT _from, -- uses trgm_gist_idx
similarity(_from, 'Josef Konway <[email protected]>') AS sml
FROM messages WHERE _from % 'Josef Konway <[email protected]>'
ORDER BY sml DESC, _from;
_from | sml
------------------------------------+----------
Joseph Conway <[email protected]> | 0.724138
Joe Conway <[email protected]> | 0.703704
jconway <[email protected]> | 0.678571
"Joe Conway" <[email protected]> | 0.62963
(4 rows)
Time: 502.002 ms
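The query above mentions trgm_gist_idx without showing its definition. One plausible definition is sketched here; the operator class and the assumption that messages._from is a text column are guesses, not taken from the slides.

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Assumed definition of the index named in the query comment.
CREATE INDEX IF NOT EXISTS trgm_gist_idx
    ON messages USING gist (_from gist_trgm_ops);

-- show_limit() reports the threshold used by the % operator;
-- similarity() returns a score between 0 and 1.
SELECT show_limit();
SELECT similarity('Joe Conway', 'Josef Konway') AS sml;
```

A GIN index with gin_trgm_ops is the other option pg_trgm provides; which one performs better depends on the data and workload.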
Overview
Preprocessing
Weighting
Creating tsvector
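As a sketch of preprocessing and weighting, the statements below build tsvector values from literal text. The sentences are made up, and how the deck's fti column was actually populated is not shown in the slides.

```sql
-- Preprocessing: tokens are normalized to lexemes and stop words are dropped.
SELECT to_tsvector('english', 'The quick brown foxes were writing queries') AS preprocessed;

-- Weighting: label each part (A is the highest weight, D the lowest), then concatenate.
SELECT setweight(to_tsvector('english', 'Text search in PostgreSQL'), 'A') ||
       setweight(to_tsvector('english', 'Pattern matching with LIKE and regular expressions'), 'D')
       AS weighted;
```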
Writing tsquery
to_tsquery() to create
Normalizes tokens to lexemes
Discards stop words
Can also cast to tsquery
Tokens taken at face value
No weight labels in this example (a cast accepts them too)
SELECT to_tsquery('(hello:A & the:A) | writing:D') AS to_tsq,
'(hello & the) | writing'::tsquery AS cast_tsq
UNION ALL
SELECT to_tsquery('postgres:*'),
'postgres:*'::tsquery;
        to_tsq         |          cast_tsq
-----------------------+-----------------------------
 'hello':A | 'write':D | 'hello' & 'the' | 'writing'
 'postgr':*            | 'postgres':*
Match Operator
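A minimal sketch of the @@ match operator, which returns true when a tsvector satisfies a tsquery. The sentence is made up for illustration.

```sql
SELECT to_tsvector('english', 'a fat cat sat on a mat')
           @@ to_tsquery('english', 'cat & mat') AS both_present,  -- true
       to_tsvector('english', 'a fat cat sat on a mat')
           @@ to_tsquery('english', 'cat & dog') AS dog_missing;   -- false
```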
Relevance Ranking
ts_rank(): based on frequency of matching lexemes
ts_rank_cd(): lexeme proximity taken into consideration
WITH ts(q) AS
(
SELECT 'multixact:A & (crash:D | (data:D & loss:D))'::tsquery
)
SELECT ts_rank(m.fti, ts.q) as tsrank
FROM messages m, ts
WHERE m.fti @@ ts.q
ORDER BY tsrank DESC LIMIT 4;
tsrank
----------
0.999997
0.999997
0.999997
0.999997
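For comparison, the same query can report both ranking functions side by side. This sketch assumes the deck's messages table and its tsvector column fti, and that fti retains positional information, which ts_rank_cd() needs.

```sql
WITH ts(q) AS
(
  SELECT 'multixact:A & (crash:D | (data:D & loss:D))'::tsquery
)
SELECT ts_rank(m.fti, ts.q)    AS tsrank,     -- frequency based
       ts_rank_cd(m.fti, ts.q) AS tsrank_cd   -- cover density: proximity matters
FROM messages m, ts
WHERE m.fti @@ ts.q
ORDER BY tsrank_cd DESC LIMIT 4;
```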
Highlighting
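A minimal sketch of highlighting with ts_headline(), which returns an excerpt of the document with the matching terms marked. The document text is a made-up literal; by default matches are wrapped in <b> and </b>, and the second call swaps in other delimiters.

```sql
SELECT ts_headline('english',
         'The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query.',
         to_tsquery('english', 'query & similarity')) AS default_markup;

SELECT ts_headline('english',
         'The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query.',
         to_tsquery('english', 'query & similarity'),
         'StartSel = <<, StopSel = >>') AS custom_markup;
```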
Equal
Anchored
Anchored case-insensitive
Reverse Anchored case-insensitive
Unanchored case-insensitive
Fuzzy
Complex Search with Relevancy Ranking
Equal
With an index
CREATE INDEX from_idx ON messages(_from);
EXPLAIN ANALYZE SELECT date FROM messages
WHERE _from = 'Joseph Conway <[email protected]>';
QUERY PLAN
-------------------------------------------------------------------
Bitmap Heap Scan on messages
[...]
-> Bitmap Index Scan on from_idx
(cost=0.00..4.88 rows=61 width=0)
(actual time=0.051..0.051 rows=14 loops=1)
Index Cond: (_from = 'Joseph Conway <[email protected]>'::text)
Planning time: 0.267 ms
Execution time: 0.161 ms
Anchored
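For the case-sensitive anchored search ('<pattern>%'), a sketch along the lines of the case-insensitive example that follows might look like this. The index name from_pattern_idx is hypothetical; text_pattern_ops is what lets a B-tree index serve LIKE prefix matches when the database does not use the C locale.

```sql
-- Hypothetical index name; assumes the deck's messages table.
CREATE INDEX from_pattern_idx
    ON messages (_from text_pattern_ops);

EXPLAIN ANALYZE SELECT date FROM messages
 WHERE _from LIKE 'Joseph Conway%';
```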
Anchored Case-Insensitive
Find all rows where the column matches '<pattern>%'
⇒ but in a case-insensitive way
LIKE with a suitable expression index works well
CREATE INDEX lower_pattern_idx
ON messages(lower(_from) text_pattern_ops);
EXPLAIN ANALYZE SELECT date FROM messages
WHERE lower(_from) LIKE 'joseph conway%';
QUERY PLAN
-------------------------------------------------------------------
Bitmap Heap Scan on messages
[...]
-> Bitmap Index Scan on lower_pattern_idx
(cost=0.00..214.76 rows=5433 width=0)
(actual time=0.074..0.074 rows=14 loops=1)
Index Cond: ((lower(_from) ~>=~ 'joseph conway'::text)
AND (lower(_from) ~<~ 'joseph conwaz'::text))
Planning time: 0.505 ms
Execution time: 0.258 ms
Unanchored Case-Insensitive
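One way to serve an unanchored, case-insensitive search ('%<pattern>%') is a trigram index, since pg_trgm indexes can support LIKE and ILIKE. This is a sketch under that assumption, reusing the trgm_gist_idx name from the trigram example; its definition here is a guess, not the deck's.

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Assumed index definition; a GIN index with gin_trgm_ops is the alternative.
CREATE INDEX IF NOT EXISTS trgm_gist_idx
    ON messages USING gist (_from gist_trgm_ops);

EXPLAIN ANALYZE SELECT date FROM messages
 WHERE _from ILIKE '%conway%';
```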
Fuzzy
Find all rows where the column matches '<pattern>'
⇒ but in an inexact way
Use the dmetaphone function with an expression index
Might also use Soundex, Levenshtein, Metaphone, or pg_trgm
CREATE INDEX dmet_expr_idx ON messages(dmetaphone(_from));
EXPLAIN ANALYZE SELECT _from FROM messages
WHERE dmetaphone(_from) = dmetaphone('josef konwei');
QUERY PLAN
-------------------------------------------------------------------
Bitmap Heap Scan on messages
[...]
-> Bitmap Index Scan on dmet_expr_idx
(cost=0.00..101.17 rows=5433 width=0)
(actual time=0.085..0.085 rows=108 loops=1)
Index Cond: (dmetaphone(_from) = 'JSFK'::text)
Planning time: 0.272 ms
Execution time: 0.445 ms
Complex Requirements
Full Text Search
Complex multi-word searching
Relevancy Ranking
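Putting the pieces together, a multi-word search with relevancy ranking might be sketched as follows. It assumes the deck's messages table with its tsvector column fti; the index name fti_gin_idx and the search terms are illustrative.

```sql
-- Hypothetical index name; GIN is a common choice for tsvector columns.
CREATE INDEX fti_gin_idx ON messages USING gin (fti);

SELECT m._from,
       ts_rank_cd(m.fti, q) AS rank
FROM messages m,
     to_tsquery('english', 'multixact & (crash | (data & loss))') AS q
WHERE m.fti @@ q
ORDER BY rank DESC
LIMIT 10;
```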
Questions?
Thank You!
[email protected]