Full-Text Search in PostgreSQL in Milliseconds
Олег Бартунов
Александр Коротков
Full-text search in DB
● Full-text search
– Find documents that satisfy the query
– Return results in some order (optional)
● Requirements for FTS
– Full integration with DB core
● transaction support
● concurrency and recovery
● online index
– Linguistic support
– Flexibility, Scalability
What is a document?
3. Sort documents
Can we improve native FTS?
156676 Wikipedia articles:
We'll be FINE !
● Additional benefit
– T(rare_word & frequent_word) ~ T(rare_word)
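Why the cost of `rare_word & frequent_word` can stay close to the cost of `rare_word` alone: a minimal sketch, assuming both posting lists are sorted arrays of document ids. The rare list drives the scan, and the frequent list is only probed by binary search. `DocId`, `lower_bound` and `intersect_rare_frequent` are hypothetical names used for illustration; the slides do not say which skipping strategy the patch uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical document id; real GIN posting lists store ItemPointers. */
typedef uint32_t DocId;

/* First index i in [lo, n) with list[i] >= target, found by binary search. */
static size_t lower_bound(const DocId *list, size_t lo, size_t n, DocId target)
{
    size_t hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (list[mid] < target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/*
 * Intersect a short sorted list (rare word) with a long sorted list
 * (frequent word). Only the rare list is walked item by item; the
 * frequent list is skipped through, never scanned in full.
 * out must have room for n_rare entries. Returns the number of matches.
 */
static size_t intersect_rare_frequent(const DocId *rare, size_t n_rare,
                                      const DocId *freq, size_t n_freq,
                                      DocId *out)
{
    size_t pos = 0;
    size_t n_out = 0;

    assert(rare != NULL && freq != NULL && out != NULL);

    for (size_t i = 0; i < n_rare && pos < n_freq; i++)
    {
        pos = lower_bound(freq, pos, n_freq, rare[i]);
        if (pos < n_freq && freq[pos] == rare[i])
            out[n_out++] = rare[i];
    }
    return n_out;
}
```

In this sketch the work is about n_rare binary searches, so it grows with the rare list and only logarithmically with the frequent one.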
Inverted Index
[Figure: inverted index structure. An entry tree whose entries point to posting lists or posting trees.]
Summary of changes
• GIN
– storage
– search
– ORDER BY
– interface
• planner
GIN structure changes
Add additional information
(word positions)
ItemPointer
typedef struct ItemPointerData
{
    BlockIdData  ip_blkid;
    OffsetNumber ip_posid;
} ItemPointerData;    /* 6 bytes */
typedef struct BlockIdData
{
    uint16 bi_hi;
    uint16 bi_lo;
} BlockIdData;
WordEntryPos (2 bytes)
/*
 * Equivalent to
 * typedef struct {
 *     uint16
 *         weight:2,
 *         pos:14;
 * }
 */
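The 2-bit weight and 14-bit position can be packed into one 16-bit value with shifts and masks. The sketch below assumes the weight occupies the two high bits; `wep_make`, `wep_weight` and `wep_pos` are hypothetical helper names chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* One word occurrence: 2 bits of weight plus 14 bits of position. */
typedef uint16_t WordEntryPos;

/* Pack a weight (0..3) and a position (0..16383) into 16 bits. */
static WordEntryPos wep_make(unsigned weight, unsigned pos)
{
    assert(weight <= 3 && pos <= 0x3fff);
    return (WordEntryPos) ((weight << 14) | pos);
}

/* Extract the weight from the two high bits. */
static unsigned wep_weight(WordEntryPos w)
{
    return (unsigned) w >> 14;
}

/* Extract the position from the low 14 bits. */
static unsigned wep_pos(WordEntryPos w)
{
    return (unsigned) w & 0x3fff;
}
```

With 14 bits the largest representable position is 16383, so each stored occurrence costs only 2 bytes next to the 6-byte ItemPointer.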
Datum *extractValue
(
Datum itemValue,
int32 *nkeys,
bool **nullFlags,
Datum *addInfo,
bool *addInfoIsNull
)
extractQuery
Datum *extractQuery
(
Datum query,
int32 *nkeys,
StrategyNumber n,
bool **pmatch,
Pointer **extra_data,
bool **nullFlags,
int32 *searchMode,
???bool **required???
)
consistent
bool consistent
(
bool check[],
StrategyNumber n,
Datum query,
int32 nkeys,
Pointer extra_data[],
bool *recheck,
Datum queryKeys[],
bool nullFlags[],
Datum addInfo[],
bool addInfoIsNull[]
)
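As an illustration of how the two extra arguments might be consumed, here is a toy consistent-style check for an AND query. It is a simplified sketch, not the patch's implementation: `Datum` is a local stand-in, the parameter list is reduced, and `and_consistent` is a hypothetical name. The assumption is that a key whose additional information is missing forces a heap recheck.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for PostgreSQL's pointer-sized Datum. */
typedef uintptr_t Datum;

/*
 * Toy AND-consistent: the item matches only if every query key is
 * present. If any key lacks additional information, the index alone
 * cannot confirm the match, so *recheck is set (an assumption of
 * this sketch).
 */
static bool and_consistent(const bool check[], int nkeys,
                           const Datum addInfo[], const bool addInfoIsNull[],
                           bool *recheck)
{
    assert(check != NULL && addInfoIsNull != NULL && recheck != NULL);
    (void) addInfo;     /* a real opclass would decode word positions here */

    *recheck = false;
    for (int i = 0; i < nkeys; i++)
    {
        if (!check[i])
            return false;       /* a required key is missing */
        if (addInfoIsNull[i])
            *recheck = true;    /* no positions stored for this key */
    }
    return true;
}
```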
calcRank
float8 calcRank
(
bool check[],
StrategyNumber n,
Datum query,
int32 nkeys,
Pointer extra_data[],
bool *recheck,
Datum queryKeys[],
bool nullFlags[],
Datum addInfo[],
bool addInfoIsNull[]
)
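A ranking callback with this shape can compute a score from the index alone. The toy below assumes each `addInfo[i]` carries a single packed 16-bit word position with the weight in its two high bits, and sums a per-weight coefficient for every matched key. The coefficients follow the default `ts_rank` weights {0.1, 0.2, 0.4, 1.0} for D, C, B, A, but the formula itself is a deliberate simplification and `toy_calc_rank` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for PostgreSQL's pointer-sized Datum. */
typedef uintptr_t Datum;

/* Coefficients indexed by the 2-bit weight: D, C, B, A. */
static const double weight_coeff[4] = {0.1, 0.2, 0.4, 1.0};

/*
 * Toy rank: add the coefficient of each matched key's weight.
 * Keys that did not match or carry no additional information
 * contribute nothing.
 */
static double toy_calc_rank(const bool check[], int nkeys,
                            const Datum addInfo[], const bool addInfoIsNull[])
{
    double rank = 0.0;

    assert(check != NULL && addInfo != NULL && addInfoIsNull != NULL);

    for (int i = 0; i < nkeys; i++)
    {
        unsigned weight;

        if (!check[i] || addInfoIsNull[i])
            continue;
        weight = (unsigned) ((addInfo[i] >> 14) & 0x3);
        rank += weight_coeff[weight];
    }
    return rank;
}
```

Because the score needs only data stored in the index, results can be ordered without fetching each matching row from the heap first.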
???joinAddInfo???
Datum joinAddInfo
(
Datum addInfos[]
)
Planner optimization
Before
test=# EXPLAIN (ANALYZE, VERBOSE) SELECT * FROM test ORDER BY slow_func(x,y) LIMIT 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..3.09 rows=10 width=16) (actual time=11.344..103.443 rows=10 loops=1)
  Output: x, y, (slow_func(x, y))
  -> Index Scan using test_idx on public.test (cost=0.00..309.25 rows=1000 width=16) (actual time=11.341..103.422 rows=10 loops=1)
       Output: x, y, slow_func(x, y)
Total runtime: 103.524 ms
(5 rows)
After
test=# EXPLAIN (ANALYZE, VERBOSE) SELECT * FROM test ORDER BY slow_func(x,y) LIMIT 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..3.09 rows=10 width=16) (actual time=0.062..0.093 rows=10 loops=1)
  Output: x, y
  -> Index Scan using test_idx on public.test (cost=0.00..309.25 rows=1000 width=16) (actual time=0.058..0.085 rows=10 loops=1)
       Output: x, y
Total runtime: 0.164 ms
(5 rows)
Testing results
avito.ru: 6.7 million documents
With tsvector column
SELECT
    itemid, title
FROM
    items
WHERE
    fts @@ plainto_tsquery('russian', 'угловой шкаф')
ORDER BY
    ts_rank(fts, plainto_tsquery('russian', 'угловой шкаф')) DESC
LIMIT
    10;
With tsvector column without patch,
Limit (cost=2341.92..2341.94 rows=10 width=398) (actual time=38.532..38.5
Buffers: shared hit=12830
-> Sort (cost=2341.92..2343.37 rows=581 width=398) (actual time=38.53
Sort Key: (ts_rank(fts, '''углов'' & ''шкаф'''::tsquery))
Sort Method: top-N heapsort Memory: 26kB
Buffers: shared hit=12830
-> Bitmap Heap Scan on items (cost=48.50..2329.36 rows=581 widt
Recheck Cond: (fts @@ '''углов'' & ''шкаф'''::tsquery)
Buffers: shared hit=12830
-> Bitmap Index Scan on fts_idx (cost=0.00..48.36 rows=58
Index Cond: (fts @@ '''углов'' & ''шкаф'''::tsquery)
Buffers: shared hit=116
Total runtime: 38.569 ms
With tsvector column with patch,