Fts Internals

This document discusses the internals of SQLite FTS4 and compares it to FTS5. FTS4 and FTS5 are virtual tables that maintain a full text index on their contents to support text searches. FTS4 is the released version that is widely used, while FTS5 is unreleased but incorporates lessons learned from FTS4. The document then covers the structure of the FTS index, the underlying database tables, auxiliary functions, administration and tuning parameters, and how common tokens are handled differently between FTS4 and FTS5.

Uploaded by

mahesh_rampalli

SQLite FTS4 Internals

And comparison with FTS5


FTS4 and FTS5
● FTS4 and FTS5 are both virtual tables that
maintain a “full text index” on their contents.
● They provide similar functionality, but
– FTS4 is released and is widely used.
– FTS5 is unreleased but incorporates a few lessons
learned during FTS4's lifetime.
● Most of these slides are about FTS4, with a few
comments regarding FTS5.
Presentation Structure
1. The FTS Index
• what it contains, the types of queries it supports, tokenizers

2. The Underlying Database Tables


• what is stored on disk, how this can be configured/optimized

3. Auxiliary Functions
• what are they, what they can do and how they might be extended

4. Administration and Tuning Parameters


• 'optimize', 'automerge', 'rebuild' and other commands – what and why

5. Common Tokens and FTS5


• how common tokens cause problems for FTS4, and why they are less
of a problem with FTS5
Part 1: The FTS Index
• what it contains, the types of queries it supports, tokenizers
Table Creation
● An FTS index is automatically created and
populated along with each FTS table.
● Create an FTS table using:
CREATE VIRTUAL TABLE ft USING fts4(a, b);

● Populate it using regular INSERT, UPDATE and DELETE statements:
INSERT INTO ft(rowid, a, b) VALUES(?, ?, ?);
INSERT INTO ft(docid, a, b) VALUES(?, ?, ?);

DELETE FROM ft WHERE rowid = ?;


UPDATE ft SET a=? WHERE docid=?;
Example FTS Index
INSERT INTO ft(rowid, a, b) VALUES
(1, 'Purple Cyan', 'orange blue purple cyan.'),
(2, 'Yellow', 'Orange BLUE yellow purple yellow.'),
(3, 'Purple Cyan', 'Gold purple green.'),
(4, 'Yellow', '[red purple, grey]');

Terms -> Doclists (the terms are the keys of a b-tree structure):
“blue” -> (1: b1) (2: b1)
“cyan” -> (1: a1 b3) (3: a1)
“gold” -> (3: b0)
“green” -> (3: b2)
“grey” -> (4: b2)
“orange” -> (1: b0) (2: b0)
“purple” -> (1: a0 b2) (2: b3) (3: a0 b1) (4: b1)
“red” -> (4: b0)
“yellow” -> (2: a0 b2 b4) (4: a0)
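The index above can be reproduced with a short Python model. This is only a sketch: the `tokenize` helper (case folding, splitting on non-alphanumeric characters) is an assumption standing in for a built-in tokenizer, and a sorted dict stands in for the b-tree.

```python
import re
from collections import defaultdict

rows = {
    1: {"a": "Purple Cyan", "b": "orange blue purple cyan."},
    2: {"a": "Yellow", "b": "Orange BLUE yellow purple yellow."},
    3: {"a": "Purple Cyan", "b": "Gold purple green."},
    4: {"a": "Yellow", "b": "[red purple, grey]"},
}

def tokenize(text):
    """Hypothetical stand-in tokenizer: fold case, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(rows):
    """Map each term to a doclist: {rowid: [(column, token_offset), ...]}."""
    index = defaultdict(dict)
    for rowid, columns in sorted(rows.items()):
        for column, text in columns.items():
            for offset, term in enumerate(tokenize(text)):
                index[term].setdefault(rowid, []).append((column, offset))
    return dict(sorted(index.items()))  # keep terms in sorted order, like b-tree keys

index = build_index(rows)
for term, doclist in index.items():
    print(term, "->", doclist)
```

Each printed line corresponds to one term and its doclist, with `("b", 3)` playing the role of "b3" in the slide notation.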


Example Query
● So it's easy to see how FTS answers queries
for “the set of rowid values for rows that contain
'cyan'”:
SELECT rowid FROM ft WHERE ft MATCH 'cyan'

● It searches the b-tree for “cyan”, and finds:


(1: a1 b3) (3: a1)

● So returns rowid values 1 and 3.


Tokenizers
● A “tokenizer” extracts tokens or terms from blocks of text. e.g. transforms:
“orange BLUE Yellow purple yellow.”
● To (note the case folding):
“orange”, “blue”, “yellow”, “purple”, “yellow”

● FTS4 and FTS5 both have a couple of built-in tokenizers (simple, unicode61, porter).
● And an API allowing users to implement more.
Tokenizers
● A single FTS table has a single tokenizer*.
● Used to extract tokens from both table content
and query text.
● It's important to use the same tokenizer on content and queries, so that a query with an upper case “C” still works:
SELECT rowid FROM ft WHERE ft MATCH 'Cyan';

● Tokenizers may also transform terms to a more normal form – this is “stemming”.
* Not entirely true for tables that use “languageid”
Stemmer Tokenizers
● Stemmers are language specific – the built-in
“porter” tokenizer is a stemmer for English.

● With porter, “require”, “requirement”, “requirements”, and “required” are all considered the same term.
Custom Stemmer Tokenizers
● A tokenizer could also map common sets of
synonyms or abbreviations to a single token.
● e.g. tokenize these strings as follows:
“1st road, Somerset” -> “first”, “road”, “somerset”
“first Rd., Lancashire” -> “first”, “road”, “lancashire”

● Then, if the user runs:


SELECT … WHERE MATCH '1st Rd.'

● The tokenizer tokenizes the query as:


“first AND road”

● Which matches both rows.
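A minimal Python sketch of the synonym idea follows. The `SYNONYMS` table and the `tokenize` helper are hypothetical; a real custom tokenizer would be implemented against the FTS tokenizer API rather than in Python.

```python
import re

# Hypothetical synonym table mapping abbreviations to one canonical token.
SYNONYMS = {"1st": "first", "rd": "road"}

def tokenize(text):
    """Fold case, split on non-alphanumerics, then map synonyms to a single token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [SYNONYMS.get(tok, tok) for tok in tokens]

docs = {1: "1st road, Somerset", 2: "first Rd., Lancashire"}

# The same tokenizer is applied to the query text and to the content.
query = tokenize("1st Rd.")
matches = [rowid for rowid, text in docs.items() if set(query) <= set(tokenize(text))]
print(query, matches)
```

Because both the content and the query pass through the same mapping, the query tokens are found in both rows.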


More Queries: AND, OR
● As well as querying for all documents
containing a specified token, the FTS index
supports logical AND and OR operations:
SELECT rowid FROM ft WHERE ft MATCH 'yellow AND grey'
SELECT rowid FROM ft WHERE ft MATCH 'yellow OR grey'

● Retrieve doclists for each token:


“grey” -> (4: b2)
“yellow” -> (2: a0 b2 b4) (4: a0)

● For “AND”, return the intersection of the two sets of rowids (just 4). For “OR”, the union (2 and 4).
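In a toy model, AND and OR reduce to set operations on the rowids of the two doclists. The sketch below assumes the doclists are already loaded into dicts and ignores the position data.

```python
# Doclists from the example index: term -> {rowid: positions}.
doclists = {
    "grey": {4: ["b2"]},
    "yellow": {2: ["a0", "b2", "b4"], 4: ["a0"]},
}

def rowids(term):
    """Return the set of rowids in a term's doclist (empty if the term is absent)."""
    return set(doclists.get(term, {}))

and_result = sorted(rowids("yellow") & rowids("grey"))  # intersection
or_result = sorted(rowids("yellow") | rowids("grey"))   # union
print(and_result, or_result)
```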
Implicit AND operators
● If there is no operator between two tokens, an
implicit AND is inserted. Equivalent:
SELECT rowid FROM ft WHERE ft MATCH 'wal performance'
SELECT rowid FROM ft WHERE ft MATCH 'wal AND performance'

● This leads to intuitive results in UIs.


Why Implicit AND is important
● Say a document contains the text:
“The sqlite3_prepare API...”

● And the query:


SELECT rowid FROM ft WHERE ft MATCH 'sqlite3_prepare'

● Depending on the tokenizer, “sqlite3_prepare” might be one or two tokens.
● If it is two tokens, the query is equivalent to:
... MATCH 'sqlite3 AND prepare'

● Which will still match the document.


FTS4 Has Two Query Syntaxes
● FTS4 actually supports two slightly different query syntaxes.
● The compile-time switch:
-DSQLITE_ENABLE_FTS3_PARENTHESIS=1

● enables the new syntax, which supports parentheses and the “NOT” operator.
● Always build with this switch!
More Queries: NOT operator
● The “NOT” operator works like an SQL
EXCEPT. This:
SELECT rowid FROM ft WHERE ft MATCH 'yellow NOT grey'

● Is “all rowids for documents that contain 'yellow' but do not contain 'grey'”.
● Same again: Retrieve doclists for each token:
“grey” -> (4: b2)
“yellow” -> (2: a0 b2 b4) (4: a0)

● Then remove the 'grey' rowids (4) from the 'yellow' rowids (2 and 4), leaving just 2.
Precedence & Parentheses
● Precedence, from tightest to loosest grouping:
– NOT
– AND
– OR
● You can use parentheses. So these are the same:
SELECT * FROM ft WHERE ft MATCH 'yellow AND grey OR red'
SELECT * FROM ft WHERE ft MATCH 'red OR yellow AND grey'
SELECT * FROM ft WHERE ft MATCH 'red OR (yellow AND grey)'

● But this is different:


SELECT * FROM ft WHERE ft MATCH '(red OR yellow) AND grey'
More Queries: Phrases
● Can also use the index for “phrase” queries:
SELECT rowid FROM ft WHERE ft MATCH '"blue yellow"'

● FTS retrieves the doclists for each separate token:
“blue” -> (1: b1) (2: b1)
“yellow” -> (2: a0 b2 b4) (4: a0)

● Filters as for “AND” (leaving just rowid 2), then filters for the phrase match: in row 2, “blue” at b1 is immediately followed by “yellow” at b2, so rowid 2 is returned.
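The two filtering steps can be sketched in Python as follows. This is an illustrative model over in-memory dicts, not FTS4's actual merge code; positions are written as `(column, offset)` pairs.

```python
doclists = {
    "blue": {1: [("b", 1)], 2: [("b", 1)]},
    "yellow": {2: [("a", 0), ("b", 2), ("b", 4)], 4: [("a", 0)]},
}

def phrase_query(tokens):
    """Return rowids where the tokens occur consecutively within a single column."""
    # Step 1: filter as for AND, keeping rows present in every token's doclist.
    candidates = set.intersection(*(set(doclists[t]) for t in tokens))
    hits = []
    for rowid in sorted(candidates):
        # Step 2: token i must sit i places after the first token, in the same column.
        if any(
            all((col, off + i) in doclists[tok][rowid] for i, tok in enumerate(tokens))
            for col, off in doclists[tokens[0]][rowid]
        ):
            hits.append(rowid)
    return hits

print(phrase_query(["blue", "yellow"]))
```

Reversing the phrase to "yellow blue" passes the AND step for row 2 but fails the position check, so it returns nothing.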
More Queries: NEAR
● NEAR queries are similar:
SELECT rowid FROM ft WHERE ft MATCH 'orange NEAR cyan'

● As are queries that restrict matches to a specified column:
SELECT rowid FROM ft WHERE ft MATCH 'b:cyan'

● All implemented by extra filtering after index entries have been loaded from disk.
Prefix Queries
● We can also do prefix queries:
SELECT rowid FROM ft WHERE ft MATCH 'g*'

● “all rows that contain at least one term that begins with 'g'”
● Scan and merge the range of terms that begin with 'g' (“gold”, “green”, “grey”):

“blue” -> (1: b1) (2: b1)


“cyan” -> (1: a1 b3) (3: a1)
“gold” -> (3: b0)
“green” -> (3: b2)
“grey” -> (4: b2)
“orange” -> (1: b0) (2: b0)
“purple” -> (1: a0 b2) (2: b3) (3: a0 b1) (4: b1)
“red” -> (4: b0)
“yellow” -> (2: a0 b2 b4) (4: a0)
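A rough Python model of the scan-and-merge step is shown below. A sorted list with `bisect` stands in for seeking within the b-tree, and only rowids are kept; both simplifications are assumptions made for illustration.

```python
from bisect import bisect_left

# Sorted terms with the rowids from their doclists (positions omitted for brevity).
terms = ["blue", "cyan", "gold", "green", "grey", "orange", "purple", "red", "yellow"]
rowids = {
    "blue": [1, 2], "cyan": [1, 3], "gold": [3], "green": [3], "grey": [4],
    "orange": [1, 2], "purple": [1, 2, 3, 4], "red": [4], "yellow": [2, 4],
}

def prefix_query(prefix):
    """Seek to the first term >= prefix, then scan forward while terms share the prefix."""
    merged = set()
    i = bisect_left(terms, prefix)
    while i < len(terms) and terms[i].startswith(prefix):
        merged.update(rowids[terms[i]])  # merge this term's doclist into the result
        i += 1
    return sorted(merged)

print(prefix_query("g"))
```

The cost grows with the number of terms in the range and the size of their doclists, which is what makes short prefixes expensive.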
Prefix Indexes
● Scanning and merging doclists can be slow.
● The “prefix=” option can be used to create prefix indexes. e.g.:
CREATE VIRTUAL TABLE ft USING fts4(a, b, prefix="1");

● Then, as well as the main term index, there is an index of 1-character prefixes:

“b” -> (1: b1) (2: b1)
“c” -> (1: a1 b3) (3: a1)
“g” -> (3: b0 b2) (4: b2)
“o” -> (1: b0) (2: b0)
“p” -> (1: a0 b2) (2: b3) (3: a0 b1) (4: b1)
“r” -> (4: b0)
“y” -> (2: a0 b2 b4) (4: a0)
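One way to picture a prefix index is as the main index with every term truncated to N characters and the doclists of colliding terms merged. The helper below is a hypothetical sketch of that idea, using a subset of the example terms.

```python
from collections import defaultdict

# A few entries of the main term index: term -> {rowid: [(column, offset), ...]}.
index = {
    "gold": {3: [("b", 0)]},
    "green": {3: [("b", 2)]},
    "grey": {4: [("b", 2)]},
    "red": {4: [("b", 0)]},
}

def build_prefix_index(index, length):
    """Merge the doclists of all terms that share the same first `length` characters."""
    merged = defaultdict(lambda: defaultdict(list))
    for term, doclist in index.items():
        if len(term) < length:
            continue  # a term shorter than the prefix length has no such prefix
        for rowid, positions in doclist.items():
            merged[term[:length]][rowid].extend(positions)
    return {
        prefix: {rowid: sorted(pos) for rowid, pos in sorted(doclist.items())}
        for prefix, doclist in sorted(merged.items())
    }

prefix_index = build_prefix_index(index, 1)
print(prefix_index["g"])
```

With this structure a query for 'g*' becomes a single lookup of "g" instead of a range scan.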
Prefix Indexes
● Multiple prefix indexes can be added:
CREATE VIRTUAL TABLE ft USING fts4(a, b, prefix="1,2,3");

● Each additional prefix index is between half and the same size on disk as the main term index.
● Adding a prefix index reduces the CPU used by prefix queries significantly. And IO by a little.
Part 2: The Underlying Database Tables
• what is stored on disk, how this can be configured/optimized
Data stored on disk
● For each virtual table, FTS4 creates between 2
and 5 native tables on disk:
sqlite> CREATE VIRTUAL TABLE ft USING fts4(a, b);
sqlite> .schema
CREATE VIRTUAL TABLE ft USING fts4(a, b);

CREATE TABLE 'ft_content' (docid IPK, 'c0a', 'c1b');


CREATE TABLE 'ft_segments'(blockid IPK, block BLOB);
CREATE TABLE 'ft_segdir' (level INTEGER, idx INTEGER, ....
CREATE TABLE 'ft_docsize'(docid IPK, size BLOB);
CREATE TABLE 'ft_stat'(id IPK, value BLOB);
Data stored on disk
● Big tables:
– The “%_content” table stores the actual content inserted
into the table, verbatim.
– The “%_segments” table stores (most of) the FTS index
data.
– The “%_docsize” table stores the size, in tokens, of each
column value in the table. This is used by matchinfo().
● Small tables:
– %_segdir stores a small amount of FTS index data.
– %_stat contains a single record – the sum of the
%_docsize values.
Example 1: Enron Database
● Consists of 517424 separate emails (1.4 GiB).
● sqlite3_analyzer says:
Table         1024 byte pages   % of DB
%_content     1524691           65.5%
%_segments    797885            34.3%
%_docsize     6105              0.25%
%_segdir      7                 0.0%

● After adding a prefix index (prefix=1), the FTS index is now 1.38 times as large:
Table         1024 byte pages   % of DB
%_content     1524691           57.9%
%_segments    1103621           41.9%
%_docsize     6105              0.23%
%_segdir      16                0.0%
Example 2: POI Database
● 1.3 million rows, 28 columns, but just a few tokens per row (most columns contain NULL):
Table         1024 byte pages   % of DB
%_content     101246            55.6%
%_segments    30803             16.9%
%_docsize     50035             27.5%
%_segdir      4                 0.0%

● The %_docsize table is unusually large here.
● The %_docsize table is only used by the matchinfo 'l' option. It can be omitted with:
CREATE VIRTUAL TABLE ft USING fts4(a, b, matchinfo=fts3);
Compressing the %_content table
● Each column value stored in an FTS4 table
may be individually compressed.
● Application provides SQL scalar functions to
compress and uncompress values.
● Compress function takes one argument –
returns compressed version.
● Uncompress function also takes one argument
– returns uncompressed version.
Compressing the %_content table
● Configuring an FTS4 table to use
compress/uncompress scalar functions:
CREATE VIRTUAL TABLE ft USING fts4(
a, b, compress=cmp, uncompress=uncmp
);

● Then, instead of reading and writing with:
SELECT c1 AS a, c2 AS b ...
INSERT INTO %_content VALUES($rowid, ?, ?);

● It uses:
SELECT uncmp(c1) AS a, uncmp(c2) AS b ...
INSERT INTO %_content VALUES($rowid, cmp(?), cmp(?));

● May not help if using ZipVFS already.
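As a sketch of what the application has to provide, the following registers a pair of scalar functions from Python. The use of zlib and UTF-8 text is an assumption made for illustration; the names `cmp` and `uncmp` simply match the CREATE VIRTUAL TABLE statement above.

```python
import sqlite3
import zlib

def cmp(value):
    """Compress one column value (assumed to be text); NULL passes through unchanged."""
    if value is None:
        return None
    return zlib.compress(value.encode("utf-8"))

def uncmp(blob):
    """Invert cmp(): turn the stored blob back into the original text."""
    if blob is None:
        return None
    return zlib.decompress(blob).decode("utf-8")

conn = sqlite3.connect(":memory:")
conn.create_function("cmp", 1, cmp)      # one argument in, compressed value out
conn.create_function("uncmp", 1, uncmp)  # one argument in, original value out

# Round-trip a value through both SQL functions.
original = "orange blue purple cyan. " * 40
stored = conn.execute("SELECT cmp(?)", (original,)).fetchone()[0]
restored = conn.execute("SELECT uncmp(?)", (stored,)).fetchone()[0]
print(len(original), len(stored), restored == original)
```

The functions must be registered on every connection that touches the table, since FTS4 calls them by name whenever it reads or writes content.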


Contentless Tables
● The %_content table can be left out altogether,
as follows:
CREATE VIRTUAL TABLE ft USING fts4(a, b, content='');

● Works like any FTS table, except:


– UPDATE and DELETE are not supported
(because %_content is required to determine
which entries need to be removed from FTS
index).
– Reading from any column other than “rowid”
returns NULL.
External Content Tables
● FTS4 can also index content stored in regular
tables – but the index is not kept up to date
automatically.
CREATE TABLE tbl(a, b);
CREATE VIRTUAL TABLE ft USING fts4(a, b, content='tbl');

● Whenever content values are required, FTS tries to obtain them with:
SELECT a, b FROM tbl WHERE rowid=?

● The same thing it would do if the %_content table did exist.
External Content Tables
● To insert a row (order doesn't matter):
INSERT INTO tbl(rowid, a, b) VALUES(?,?,?);
INSERT INTO ft (rowid, a, b) VALUES(?,?,?);

● To delete a row (order matters!):
DELETE FROM ft WHERE rowid=?;
DELETE FROM tbl WHERE rowid=?;

● To update a row (order matters!):
UPDATE ft SET a=?, b=? WHERE rowid=?;
UPDATE tbl SET a=?, b=? WHERE rowid=?;
External Content Tables
● The external content table doesn't actually have
to be a table. Just something (a table, a view, a
virtual table) that supports the following:

– SELECT * FROM obj WHERE rowid=?;


– SELECT * FROM obj ORDER BY rowid ASC;
– SELECT * FROM obj ORDER BY rowid DESC;
The notindexed= option
● Entire columns can be omitted from the FTS
index using the “notindexed” option:
CREATE VIRTUAL TABLE ft USING fts4(a, b, notindexed='a');

● Multiple “notindexed” options are permitted.


● Works with external content tables.
● And contentless tables too (not really useful)
Part 3: Auxiliary Functions
• what are they, what they can do and how they might be extended
Auxiliary Functions
● Functions run as part of FTS queries that
operate on:
– the position-lists for search terms
– the original document text,
– document sizes,
– and other things.
● FTS4 has “offsets”, “snippet” and “matchinfo”
● FTS5 has an API that allows applications to
implement custom auxiliary functions.
Auxiliary Function Example
● Say the query is:
SELECT snippet(ft) FROM ft WHERE ft MATCH 'purple AND yellow'

● snippet() returns a fragment of text from each matching row, located using the position lists in the doclists:
“purple” -> (1: a0 b2) (2: b3) (3: a0 b1) (4: b1)
“yellow” -> (2: a0 b2 b4) (4: a0)

● snippet() also accesses the original document text (from the %_content table) and the tokenizer.
The matchinfo() function
● Matchinfo exposes some of the data available to
aux. functions as an array of integers. e.g.
SELECT matchinfo(ft, 'ly') FROM ft WHERE ft MATCH 'red blue'

● Return value is an SQL blob – an array of 32-bit integers.
● Each character in the second argument adds
one or more integers to the output blob.
The matchinfo() function
● The 'l' flag appends the size of each column in
tokens to the output.
● For each phrase/column combination, the 'y' flag
appends the number of phrase hits in the column
to the output. So:
SELECT matchinfo(ft, 'ly') FROM ft WHERE ft MATCH 'red blue'

● Returns a blob of 6 integers (2 from 'l', 4 from 'y').


● And there are many other flags too...
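On the application side, the blob has to be unpacked before use. The sketch below decodes a hand-built blob with Python's `struct` module; the values are made up for illustration (they are not output from SQLite), and the phrase-major ordering of the 'y' values is an assumption to check against the matchinfo documentation.

```python
import struct

def decode_matchinfo(blob):
    """Unpack a matchinfo() blob into unsigned 32-bit integers (native byte order)."""
    count = len(blob) // 4
    return list(struct.unpack(f"={count}I", blob))

# Hand-built stand-in for matchinfo(ft, 'ly') on a 2-column table, 2-phrase query.
# Made-up values: 2 integers from 'l', then 4 integers from 'y'.
example_blob = struct.pack("=6I", 1, 3, 0, 1, 0, 0)

values = decode_matchinfo(example_blob)
n_cols, n_phrases = 2, 2
column_sizes = values[:n_cols]   # from 'l': size of each column in tokens
hits = values[n_cols:]           # from 'y': one integer per phrase/column pair
# Assumed layout: all columns for phrase 0, then all columns for phrase 1.
hits_by_phrase = [hits[p * n_cols:(p + 1) * n_cols] for p in range(n_phrases)]
print(column_sizes, hits_by_phrase)
```

A ranking function would typically combine these two arrays, for example by weighting hits against column length.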
The matchinfo() function
● Matchinfo allows FTS to be extended in similar,
but more limited, ways to adding new aux.
functions – for ranking and so on.
● Tip: If you're using the 'x' option to matchinfo,
take a look at recently added option 'y'. 'y'
provides similar information, but is quicker.
Auxiliary Functions
● In general, it is easier and safer to add auxiliary
functions or matchinfo() modes than it is to add
other features to FTS4.
Part 4: Administration and Tuning Parameters
• 'optimize', 'automerge', 'rebuild' and other commands – what and why
Multiple Tree Structures
● Instead of a single tree structure, FTS uses an
array of trees
● This is to work around the “write amplification”
problem (see also – OTA).
● A new tree is written either:
– At the end of each transaction, or
– For large transactions, roughly once for each 1MB
of FTS index data
● When querying, FTS has to query all trees in
the array and merge the result.
Multiple Tree Structures

● New trees are added to level 0.
● Once there are 16 trees in level 0, their contents are merged into a single big level 1 tree (and the original level 0 trees discarded).
● And once there are 16 trees in level 1, a level 2 tree... And so on.
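The level scheme can be simulated in a few lines of Python. This is a toy model in which each "tree" is a dict from term to rowids; the real on-disk segment format is quite different.

```python
MERGE_THRESHOLD = 16  # trees per level before they are merged, as described above

def add_tree(levels, tree):
    """Add a new tree to level 0, cascading merges whenever a level fills up."""
    levels.setdefault(0, []).append(tree)
    level = 0
    while len(levels.get(level, [])) >= MERGE_THRESHOLD:
        merged = {}
        for t in levels[level]:
            for term, rowids in t.items():
                merged.setdefault(term, set()).update(rowids)
        levels[level] = []                               # the original trees are discarded
        levels.setdefault(level + 1, []).append(merged)  # one big tree on the next level
        level += 1
    return levels

levels = {}
for rowid in range(1, 257):
    add_tree(levels, {"purple": {rowid}})  # one tiny tree per transaction
print({lvl: len(trees) for lvl, trees in levels.items()})
```

After 256 single-row transactions the model holds one level 2 tree and nothing else, having performed 16 level 0 merges and one level 1 merge along the way.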
FTS Index Details: 'optimize'
● Querying multiple trees is slower than querying
a single tree.
● To merge all trees in an FTS index to a single
tree:
INSERT INTO ft(ft) VALUES('optimize');

● 'optimize' tends to help queries that retrieve smaller doclists more than others.
The 'automerge' setting 1
● When a level reaches 16 trees, FTS
immediately merges them together into a single
tree.
● If the input trees are large, this might take a
long time.
● From the user's point of view, this means that
an unlucky FTS write might inexplicably take a
very long time.
The 'automerge' setting 2
● With automerge, after creating a new Level 0 tree, FTS (sometimes) does some work towards merging existing trees too.
– New trees are still added to level 0.
– After adding a level 0 tree, FTS also does some work merging (say) level 1 trees into a level 2 tree.
– FTS can query the partially merged trees.
The 'automerge' setting 3
● Automerge prevents a level from ever having
as many as 16 trees, avoiding the problems
associated with large merge operations.
● Set automerge as follows:
INSERT INTO ft(ft) VALUES('automerge=4');

● The parameter (4) is the minimum number of trees to merge at a time.
● A value of 0 turns automerge off. As does 16 or
greater.
The 'rebuild' command
● The 'rebuild' command rebuilds the FTS index
based on the current contents of the FTS table.
INSERT INTO ft(ft) VALUES('rebuild');

● For “external content” tables, the current contents are read from the external table.
● Contentless FTS tables may not be rebuilt.
● This is useful when:
– The index may be corrupt, or
– The tokenizer has changed somehow.
Part 5: Common Tokens and FTS5
• how common tokens cause problems for FTS4, and why they are less of a problem with FTS5
Large Doclists in FTS4
● Consider:
... poiFtsTable MATCH 'am faltenbach'

● The two doclists are loaded and merged to determine the query result.
● But:
– Doclist for “am” contains 35,000 entries.
– Doclist for “faltenbach” contains 2 or 3.
● Making the query much, much slower than just:
... poiFtsTable MATCH 'faltenbach'
Large Doclists in FTS4
● Each Doclist in FTS4 is stored as a single blob.
● May only be read sequentially.
● Can be read incrementally, so the following can run without loading much data:
... poiFtsTable MATCH 'am' LIMIT 10

● But not much else can be done without loading the entire doclist into memory.
● Large doclists cause many performance
problems.
Large Doclists in FTS5
● FTS4: the doclist for “am” is a single large blob.
● FTS5: the doclist for “am” is divided into a sequence of blobs, and there is a b-tree to index it by docid.
Large Doclists in FTS5
● So, when querying for:
SELECT count(*) FROM poiFtsTable
WHERE poiFtsTable MATCH 'am faltenbach'

● FTS5 effectively loads the small doclist for 'faltenbach' and then queries the b-tree to check which of its entries also match 'am'.
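The difference in strategy can be sketched in Python. Here a binary search over a sorted list stands in for the b-tree lookup by docid, and the rowids for 'faltenbach' are made up; only the 35,000-entry size of the 'am' doclist comes from the slides.

```python
from bisect import bisect_left

# Hypothetical doclists (sorted rowids only): a common token and a rare one.
am = list(range(1, 35001))         # 35,000 entries
faltenbach = [1200, 20417, 34999]  # made-up rowids

def contains(sorted_rowids, rowid):
    """Binary search stands in for the b-tree lookup by docid."""
    i = bisect_left(sorted_rowids, rowid)
    return i < len(sorted_rowids) and sorted_rowids[i] == rowid

def and_by_probing(small, large):
    """Walk the small doclist and probe the large one, rather than merging both."""
    return [rowid for rowid in small if contains(large, rowid)]

result = and_by_probing(faltenbach, am)
print(len(result))
```

The work done is proportional to the size of the small doclist times the cost of one probe, instead of the combined size of both doclists.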
                     FTS4                   FTS5
Memory Used          301808 (max 446392)    120704 (max 127264)
Largest Allocation   136829                 64000
Cache Misses         151                    25
Pager Heap Usage     195192                 33912
Another large doclist problem
● Say a table contains:
poiName              Country
Kath. Kindergarten   Deutschland
Deutsch Bank         Deutschland
Jim Knopf            Deutschland
Velo Shop Well       Deutschland
And many more rows...

● And the query is for 'poiName: de*'
● FTS4 (and FTS5) both have to do a linear scan
of the huge doclist for 'de*'.
● No solution yet for this one.
Finally...
● An FTS table maintains an FTS index mapping from
each term to a list of term occurrences.
● This can be queried for terms, prefixes and
phrases. AND, OR, NOT and NEAR are supported.
● Auxiliary functions do stuff with the position list data
for each row (and sometimes all rows).
● There are actually multiple trees on disk.
● Large doclists are something to watch out for.
