11 FM-Index
1. P. Ferragina, G. Manzini (2000) Opportunistic data structures with applications, Proceedings of the 41st IEEE Symposium on Foundations of Computer Science
2. P. Ferragina, G. Manzini (2001) An experimental study of an opportunistic index, Proceedings of the 12th ACM-SIAM Symposium on Discrete Algorithms, pp. 269-278
3. J. Fischer (2010) Skriptum VL Text-Indexierung (lecture notes), SoSe 2010, KIT
4. A. Andersson (1996) Sorting and searching revisited, Proceedings of the 5th Scandinavian Workshop on Algorithm Theory, pp. 185-197
We will present approaches to compress L, Occ, and pos, but omit compressing C, assuming that σ and log n are tolerably small.
11.4 Compressing L
Burrows and Wheeler proposed a move-to-front (MTF) coding in combination with Huffman or arithmetic coding. In the move-to-front encoding each character is encoded by its index in a list of characters; this list changes over the course of the algorithm, as each encoded character is moved to its front. It works as follows:
Observation 1. The BWT tends to group characters together so that the probability of finding a character close
to another instance of the same character is increased substantially:
final char (L)   sorted rotations
a n to decompress. It achieves compression
o n to perform only comparisons to a depth
o n transformation} This section describes
o n transformation} We use the example and
o n treats the right-hand side as the most
a n tree for each 16 kbyte input block, enc
a n tree in the output stream, then encodes
i n turn, set $L[i]$ to be the
i n turn, set $R[i]$ to the
o n unusual data. Like the algorithm of Man
a n use a single set of probabilities table
e n using the positions of the suffixes in
i n value at a given point in the vector $R
e n we present modifications that improve t
e n when the block size is quite large. Ho
i n which codes that have not been seen in
i n with $ch$ appear in the {\em same order
i n with $ch$. In our exam
o n with Huffman or arithmetic coding. Bri
o n with figures given by Bell˜\cite{bell}.
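The move-to-front step can be sketched as follows. This is a minimal illustration, not the original implementation; the function names and the explicit alphabet parameter are ours:

```python
def mtf_encode(s, alphabet):
    # Each character is encoded by its current index in the list;
    # the character is then moved to the front, so recently seen
    # characters get small indices.
    table = list(alphabet)
    codes = []
    for ch in s:
        idx = table.index(ch)
        codes.append(idx)
        table.insert(0, table.pop(idx))
    return codes

def mtf_decode(codes, alphabet):
    # Decoding replays the same list updates, which is why the whole
    # prefix R[1..i-1] is needed to decode R[i].
    table = list(alphabet)
    chars = []
    for idx in codes:
        ch = table.pop(idx)
        chars.append(ch)
        table.insert(0, ch)
    return ''.join(chars)
```

Because the BWT groups equal characters together, runs of the same character in L become runs of zeros in the MTF output, which Huffman or arithmetic coding then compresses well.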
The Huffman encoding builds a binary tree whose leaves are the alphabet characters. The tree is constructed bottom-up by repeatedly merging the two subtrees with the smallest sums of occurrences, so that for every node the leaves in the left and right subtree have a similar sum of occurrences.
11002 Compressing the FM Index, by David Weese, May 31, 2013, 13:20
character           0    1    2    3
occurrences in R    10   3    2    5
bit code of x       0    110  111  10

[Huffman tree with edges labeled 0 (left) and 1 (right); character 0 at depth 1, character 3 at depth 2, characters 1 and 2 at depth 3.]
Left and right children are labeled with 0 and 1, respectively. The labels on the path to each leaf define its bit code; the more frequent a character, the shorter its bit code. The final sequence H is the bitwise concatenation of the bit codes of the characters of R from left to right.
The final sequence of bits H is:
L = ao oooa ai i...
R = 03 0001 03 0...
H = 0100001100100...
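The bottom-up construction can be sketched as follows; this is an illustrative sketch (names ours), and tie-breaking between equally frequent subtrees may swap codes of the same length:

```python
import heapq

def huffman_codes(freq):
    # Repeatedly merge the two subtrees with the smallest total
    # occurrence counts; the second tuple entry breaks ties.
    heap = [(f, j, ch) for j, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    nxt = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, nxt, (a, b)))
        nxt += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node
            walk(node[0], prefix + "0")      # left edge labeled 0
            walk(node[1], prefix + "1")      # right edge labeled 1
        else:                                # leaf: alphabet character
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes
```

For the occurrence counts 10, 3, 2, 5 of the example, the resulting code lengths are 1, 3, 3, 2, matching the table above.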
One property of the MTF coding is that the whole prefix R[1..i − 1] is required to decode character R[i]; the same holds for H. For encoding and decoding this is fine (practical assignment).
However, we want to search in the compressed FM index and hence need random access to L in algorithm locate, which would take O(n) time. Manzini and Ferragina achieve this directly on the Huffman-encoded R; however, their algorithm, albeit optimal in theory, is not practical.
We will proceed differently, using a simple trick: we can determine L[i] via the Occ function. Clearly, the values Occ(c, i) and Occ(c, i − 1) differ only for c = L[i]. Thus we can determine both L[i] and Occ(L[i], i) using σ Occ-queries.
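As a sketch (the function name and the callable occ parameter are ours), the σ-query lookup reads:

```python
def char_and_occ(i, alphabet, occ):
    # L[i] is the unique character c whose count changes between the
    # prefixes L[1..i-1] and L[1..i]; return it with Occ(L[i], i).
    for c in alphabet:
        o = occ(c, i)
        if o != occ(c, i - 1):
            return c, o
```

With any Occ implementation, e.g. the naive occ = lambda c, i: L[:i].count(c) over a 1-indexed L, this recovers L[i] without storing L itself.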
Let's now discuss the possible space-time tradeoffs. The two simplest ideas are:
1. Avoid storing an Occ-table and scan L every time an Occ-query has to be answered. This occupies no
space, but needs O(n) time for answering a single Occ-query, leading to a total query time of O(mn) for
backwards search.
2. Store all answers to Occ(c, i) in a two-dimensional table. This table occupies O(nσ log n) bits of space, but
allows constant-time Occ-queries and makes the storage of L obsolete. Total time for backwards search
is optimal O(m).
For every character c ∈ Σ we define a bitvector Bc[1..n] with

    Bc[i] = 1 if L[i] = c, and Bc[i] = 0 otherwise.
Definition 3. For a bitvector B we define rank1 (B, i) to be the number of 1’s in the prefix B[1..i]. rank0 (B, i) is
defined analogously.
We will see that it is possible to answer a rank query on a bitvector of length n in constant time using additional tables of o(n) bits. Hence the σ bitvectors are an implementation of Occ that allows answering Occ-queries in constant time with an overall memory consumption of O(σn + o(σn)) bits. In the following, let B = B[1..n] be a bitvector.
We compute the block length ℓ = ⌊(log n)/2⌋ and divide B into blocks of length ℓ and superblocks of length ℓ², i.e. each superblock corresponds to ℓ consecutive blocks.

[Figure: B divided into blocks of length ℓ, grouped into superblocks of length ℓ².]

1. For the i-th superblock we store the number of 1's from the beginning of B to the end of the superblock in M′[i] = rank1(B, iℓ²). M′ has ⌊n/ℓ²⌋ entries of values at most n and can be stored in O(n/ℓ² · log n) = O(n/log n) = o(n) bits.
2. For the i-th block we count the number of 1's from the beginning of the overlapping superblock to the end of the block in M[i] = rank1(B[1 + kℓ..n], (i − k)ℓ), where k = ⌊(i − 1)/ℓ⌋ · ℓ is the number of blocks left of the overlapping superblock. M has ⌊n/ℓ⌋ entries of values at most ℓ² and can be stored in O(n/ℓ · log ℓ²) = O(n log log n / log n) = o(n) bits.
3. Let P be a precomputed lookup table such that for each possible bitvector V of length ℓ and each i ∈ [1..ℓ] it holds P[V][i] = rank1(V, i). P has 2^ℓ × ℓ entries of values at most ℓ and can thus be stored in

    O(2^ℓ · ℓ · log ℓ) = O(2^((log n)/2) · log n · log log n) = O(√n · log n · log log n) = o(n)

bits.
We now decompose a rank-query into 3 subqueries using the precomputed tables. For a position i we determine the index p = ⌊(i − 1)/ℓ⌋ of the next block left of i and the index q = ⌊(p − 1)/ℓ⌋ of the next superblock left of block p. Then it holds:

    rank1(B, i) = M′[q] + M[p] + P[B[1 + pℓ..(p + 1)ℓ]][i − pℓ].

Note that B[1 + pℓ..(p + 1)ℓ] fits into a single CPU register and can therefore be extracted in O(1) time. Thus a rank-query can be answered in O(1) time.
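The three tables can be sketched as follows; the class name is ours, bits are kept as a Python string, and the in-block lookup table P is simulated by counting within a single block (constant time for ℓ = O(log n)):

```python
class RankSupport:
    def __init__(self, bits, ell):
        self.bits, self.ell = bits, ell
        sb = ell * ell                         # superblock length ell^2
        nblocks = (len(bits) + ell - 1) // ell
        self.M2, self.M = [], []               # superblock / block counts
        for p in range(nblocks):
            if p % ell == 0:                   # block p starts a superblock
                self.M2.append(bits[:p * ell].count('1'))
            start = (p // ell) * sb            # overlapping superblock start
            self.M.append(bits[start:p * ell].count('1'))

    def rank1(self, i):
        # number of 1's in bits[0..i) via three lookups
        p = min(i // self.ell, len(self.M) - 1)
        return (self.M2[p // self.ell] + self.M[p]
                + self.bits[p * self.ell:i].count('1'))
```

In a real implementation the final count is answered by the table P or by a CPU popcount instruction on the register holding block p.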
The wavelet tree over L is constructed as follows:
1. We create a root node v, where we divide Σ into two halves Σl and Σr of roughly equal size, where the left half contains the lexicographically smaller characters.
2. At v we store a bit-vector Bv of length n (together with data structures for O(1) rank-queries), where a 0 at position i indicates that character L[i] belongs to Σl , and a 1 indicates that it belongs to Σr .
3. This defines two (virtual) sequences Lv and Rv , where Lv is obtained from L by concatenating all characters
L[i] where Bv [i] = 0, in the order as they appear in L. Sequence Rv is obtained in a similar manner for
positions i with Bv [i] = 1.
4. The left child lv is recursively defined to be the root of the wavelet tree for Lv , and the right child rv to be
the root of the wavelet tree for Rv . This process continues until a sequence consists of only one symbol,
in which case we create a leaf.
Note that the sequences themselves are not stored explicitly; node v only stores a bit-vector Bv and structures
for O(1) rank-queries.
Theorem 4. The wavelet tree for a sequence of length n over an alphabet of size σ can be stored in n log σ · (1 + o(1)) bits.
Proof: We concatenate all bit-vectors at the same depth d into a single bit-vector Bd of length n, and prepare
it for O(1)-rank-queries. Hence, at any level, the space needed is n + o(n) bits. Because the depth of the tree
is dlog σe the claim on the space follows. In order to determine the sub-interval of a particular node v in the
concatenated bit-vector Bd at level d, we can store two indices αv and βv such that Bd [αv , βv ] is the bit-vector Bv
associated to node v. This accounts for additional O(σ log n) bits. Then a rank-query is answered as follows
(b ∈ {0, 1}):
rankb (Bv , i) = rankb (Bd , αv + i − 1) − rankb (Bd , αv − 1),
where it is assumed that i ≤ βv − αv + 1, for otherwise the result is not defined.
How does the wavelet tree help for implementing the Occ-function? Suppose we want to compute Occ(c, i),
i.e., the number of occurrences of c ∈ Σ in L[1, i]. We start at the root r of the wavelet tree, and check if c belongs
to the first or to the second half of the alphabet.
In the first case, we know that the cs are in the left child of the root, namely Lr . Hence, the number of cs in L[1, i] corresponds to the number of cs in Lr [1, rank0 (Br , i)]. If, on the other hand, c belongs to the second half of the alphabet, we know that the cs are in the subsequence Rr that corresponds to the right child of r, and hence compute the number of occurrences of c in Rr [1, rank1 (Br , i)] as the number of cs in L[1, i].
This leads to the following recursive procedure for computing Occ(c, i), to be invoked with WT-occ(c, i, 1, σ, r), where r is the root of the wavelet tree. (Recall that we assume that the characters in Σ can be accessed as Σ[1], . . . , Σ[σ].)
(1) WT-occ(c, i, σl , σr , v)
(2)   if σl = σr then return i; fi
(3)   σm = ⌊(σl + σr )/2⌋;
(4)   if c ≤ Σ[σm ] then
(5)     return WT-occ(c, rank0 (Bv , i), σl , σm , lv );
(6)   else
(7)     return WT-occ(c, rank1 (Bv , i), σm + 1, σr , rv );
(8)   fi
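The pseudocode can be turned into a small executable sketch. All names are ours; rank is done by counting instead of an O(1) structure, and the bit-vectors Bv are stored per node rather than concatenated per level:

```python
def build_wt(L, lo, hi, sigma):
    # Node over the sub-alphabet sigma[lo..hi] (0-indexed, inclusive).
    if lo == hi:
        return None                              # leaf: one symbol left
    mid = (lo + hi) // 2
    right = set(sigma[mid + 1:hi + 1])           # Sigma_r
    bv = [1 if ch in right else 0 for ch in L]   # B_v
    Lv = [ch for ch, b in zip(L, bv) if b == 0]  # virtual left sequence
    Rv = [ch for ch, b in zip(L, bv) if b == 1]  # virtual right sequence
    return (bv, build_wt(Lv, lo, mid, sigma), build_wt(Rv, mid + 1, hi, sigma))

def wt_occ(c, i, lo, hi, node, sigma):
    # Occ(c, i): number of occurrences of c in the first i characters.
    if lo == hi:
        return i
    bv, left, right = node
    mid = (lo + hi) // 2
    r1 = sum(bv[:i])                             # rank1(B_v, i)
    if c <= sigma[mid]:
        return wt_occ(c, i - r1, lo, mid, left, sigma)   # rank0 = i - rank1
    return wt_occ(c, r1, mid + 1, hi, right, sigma)
```

For example, for L = "annb$aa" over sigma = "$abn", wt_occ('a', 7, 0, 3, ...) descends left (rank0 = 4) and then right, returning the count of 'a' in the whole sequence.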
Due to the depth of the wavelet tree, the time for WT-occ(·) is O(log σ). This leads to the following theorem.
Theorem 5. With backward search and a wavelet tree on T^bwt , we can answer counting queries in O(m log σ) time. The space (in bits) is

    O(σ log n) + n log σ + o(n log σ),

where the first term accounts for |C| plus the space for the αv , the second term accounts for the wavelet tree, and the third term accounts for the rank data structures.
If we marked every η-th row of the matrix (η > 1), we could easily decide whether row i is marked, e.g. iff i ≡ 1 (mod η). Unfortunately this approach still has worst cases where a single pos-query takes O((η − 1)/η · n) time (exercise).
Instead we mark the matrix row for every η-th text position, i.e. for all j ∈ [0..⌈n/η⌉) the row i with Mi = T(1+jη) is marked with the text position pos(i) = 1 + jη. To determine whether a row is marked we could store all marked pairs (i, 1 + jη) in a hash map or a binary search tree with key i.
Instead we can again use our bitvector with O(1) rank-query support.
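Putting the pieces together, the sampling scheme can be sketched end to end. This is a naive, 0-indexed illustration (all names ours): the suffix array is built by plain sorting, Occ and rank are done by counting, and every η-th text position is sampled; pos(i) is then recovered by LF-stepping to the next marked row:

```python
def locate_all(T, eta):
    # Build SA and BWT naively (T must end in a unique smallest char).
    sa = sorted(range(len(T)), key=lambda k: T[k:])
    L = [T[k - 1] for k in sa]
    C = {c: sum(x < c for x in T) for c in set(T)}
    occ = lambda c, i: L[:i].count(c)            # stands in for Occ
    # Mark rows whose suffix starts at a sampled text position.
    marked = [s % eta == 0 for s in sa]
    samples = [s for s in sa if s % eta == 0]    # sampled pos, in row order
    rank1 = lambda i: sum(marked[:i])            # O(1) with rank support

    def pos(i):
        steps = 0
        while not marked[i]:
            i = C[L[i]] + occ(L[i], i)           # one LF step: position - 1
            steps += 1
        return samples[rank1(i)] + steps

    return [pos(i) for i in range(len(sa))]
```

Each LF step moves one position to the left in the text, so the sampled position plus the number of steps taken gives pos(i); since text position 0 is always sampled, every walk terminates after at most η − 1 steps.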