
11.1 Compressing the FM Index


This exposition has been developed by David Weese. It is based on the following sources, which are all
recommended reading:

1. P. Ferragina, G. Manzini (2000) Opportunistic data structures with applications, Proceedings of the 41st IEEE
Symposium on Foundations of Computer Science
2. P. Ferragina, G. Manzini (2001) An experimental study of an opportunistic index, Proceedings of the 12th
ACM-SIAM Symposium on Discrete Algorithms, pp. 269-278
3. Johannes Fischer (2010), Skriptum VL Text-Indexierung, SoSe 2010, KIT
4. A. Andersson (1996) Sorting and searching revisited, Proceedings of the 5th Scandinavian Workshop on
Algorithm Theory, pp. 185-197

11.2 RAM Model


From now on we assume the RAM model, in which we model a computer with a CPU that has registers of w
bits which can be modified with logical and arithmetical operations in O(1) time. The CPU can directly access
a memory of at most 2^w words.
In the following we assume n ≤ 2^w so that the whole input can be addressed. To have a more precise
measure, we count memory consumption in bits. The uncompressed suffix array then requires not O(n)
memory words but O(n log n) bits, as ⌈log₂ n⌉ bits are required to represent any number in [1..n].
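
As a small worked example of this accounting (the concrete n below is an arbitrary choice, not from the
script), the per-entry cost ⌈log₂ n⌉ can be computed directly:

import math

n = 10**6                                 # example text length (arbitrary)
bits_per_entry = math.ceil(math.log2(n))  # ceil(log2 n) = 20 bits
total_bits = n * bits_per_entry           # O(n log n) bits for the suffix array
print(bits_per_entry, total_bits / 8 / 2**20)  # 20 bits per entry, ~2.38 MiB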

11.3 Tables of the FM Index


Let T be a text of length n over the alphabet Σ and let σ = |Σ| be the alphabet size. We have seen that the
algorithms count and locate need L and the tables C and Occ. Without compression their memory
consumption is as follows:

• L = Tbwt is a string of length n over Σ and requires O(n log σ) bits


• C is an array of length σ over [0..n] and requires O(σ log n) bits
• Occ is a table with σ × n entries over [0..n] and requires O(σ · n log n) bits
• pos (if every row is marked) is a suffix array of length n over [1..n] and requires O(n log n) bits

We will present approaches to compress L, Occ and pos, but omit compressing C, assuming that σ and log n
are tolerably small.

11.4 Compressing L
Burrows and Wheeler proposed a move-to-front coding in combination with Huffman or arithmetic coding. In
the context of the move-to-front encoding each character is encoded by its index in a list, which changes over
the course of the algorithm. It works as follows:

1. Initialize a list Y of characters to contain each character in Σ exactly once


2. Scan L with i = 1, . . . , n
(a) Set R[i] to the number of characters preceding character L[i] in the list Y
(b) Move character L[i] to the front of Y

R is the MTF encoding of L. R can again be decoded to L in a similar way (Exercise).


The function move_to_front(L) below implements the encoding. Instead of using Y directly, the array M
maintains for every alphabet character the number of characters preceding it in Y.

def move_to_front(L, sigma, ord_):
    # MTF-encode L; ord_ maps a character to its rank 0..sigma-1 in the alphabet
    # M[x] = number of characters currently preceding character x in the list Y
    M = list(range(sigma))        # Y initially lists the alphabet in rank order
    R = []
    for ch in L:
        x = ord_(ch)
        R.append(M[x])            # R[i] = index of L[i] in the current list Y
        for j in range(sigma):    # every character in front of L[i] ...
            if M[j] < M[x]:
                M[j] += 1         # ... moves back by one position
        M[x] = 0                  # move L[i] to the front of Y
    return R
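
The decoder (the exercise above) is symmetric: it maintains the same list and reads indices instead of
characters. A minimal sketch; for clarity it uses the explicit list Y instead of the counter array M:

def move_to_front_decode(R, Sigma):
    # invert move_to_front; Sigma is the alphabet in rank order, e.g. a string
    Y = list(Sigma)               # same initial list as in the encoder
    L = []
    for r in R:
        ch = Y.pop(r)             # R[i] characters precede L[i] in Y
        L.append(ch)
        Y.insert(0, ch)           # move the decoded character to the front
    return "".join(L)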

Observation 1. The BWT tends to group equal characters together, so that the probability of finding a character
close to another instance of the same character is increased substantially:

final char (L)   sorted rotations
a                n to decompress.  It achieves compression
o                n to perform only comparisons to a depth
o                n transformation} This section describes
o                n transformation} We use the example and
o                n treats the right-hand side as the most
a                n tree for each 16 kbyte input block, enc
a                n tree in the output stream, then encodes
i                n turn, set $L[i]$ to be the
i                n turn, set $R[i]$ to the
o                n unusual data. Like the algorithm of Man
a                n use a single set of probabilities table
e                n using the positions of the suffixes in
i                n value at a given point in the vector $R
e                n we present modifications that improve t
e                n when the block size is quite large. Ho
i                n which codes that have not been seen in
i                n with $ch$ appear in the {\em same order
i                n with $ch$. In our exam
o                n with Huffman or arithmetic coding. Bri
o                n with figures given by Bell~\cite{bell}.

Figure 1: Example of sorted rotations. Twenty consecutive rotations from the sorted list of rotations of a
version of the original Burrows-Wheeler paper are shown, together with the final character of each rotation.

Observation 2. The move-to-front encoding replaces equal characters that are "close together" in L by "small
values" in R. In practice, the most important effect is that zeroes tend to occur in runs in R. These can be
compressed using an order-0 compressor, e.g. Huffman encoding.

 i   L[i]   R[i]   Y_next
                   aeio   (initial list Y)
 1   a      0      aeio
 2   o      3      oaei
 3   o      0      oaei
 4   o      0      oaei
 5   o      0      oaei
 6   a      1      aoei
 7   a      0      aoei
 8   i      3      iaoe
 9   i      0      iaoe
10   o      2      oiae
11   a      2      aoie
12   e      3      eaoi
13   i      3      ieao
14   e      1      eiao
15   e      0      eiao
16   i      1      ieao
17   i      0      ieao
...
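
Running the encoder above on this example L reproduces the column R (a quick consistency check;
character ranks follow the alphabetical order a, e, i, o):

L = "aooooaaiioaeieeii"
R = move_to_front(L, 4, "aeio".index)
print(R)   # [0, 3, 0, 0, 0, 1, 0, 3, 0, 2, 2, 3, 3, 1, 0, 1, 0]
print(move_to_front_decode(R, "aeio") == L)   # True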

The Huffman encoding builds a binary tree whose leaves are the alphabet characters. The tree is constructed
bottom-up by repeatedly merging the two subtrees whose leaves have the smallest total number of
occurrences, so that frequent characters end up close to the root.

character          0   1   2   3
occurrences in R  10   3   2   5

        *
      0/ \1
      0   *
        0/ \1
        3   *
          0/ \1
          1   2

x   bit code of x
0   0
1   110
2   111
3   10

Left and right children are labeled with 0 and 1. The labels on the path to each leaf define its bit code. The
more frequent a character, the shorter its bit code. The final sequence H is the bitwise concatenation of the
bit codes of the characters of R from left to right.
The final sequence of bits H is:

L = ao oooa ai i...
R = 03 0001 03 0...
H = 0100001100100...
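
The code table can be derived with the standard greedy construction on a min-heap. The following sketch
(an illustration, not the script's implementation) reproduces the code lengths above; the exact 0/1
assignment within the tree depends on tie-breaking:

import heapq
from itertools import count

def huffman_code(freq):
    # build a Huffman code table {symbol: bitstring} from {symbol: count}
    tie = count()                       # tie-breaker so dicts are never compared
    heap = [(n, next(tie), {x: ""}) for x, n in sorted(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        n0, _, c0 = heapq.heappop(heap)     # the two subtrees with the
        n1, _, c1 = heapq.heappop(heap)     # smallest occurrence counts ...
        merged = {x: "0" + b for x, b in c0.items()}
        merged.update({x: "1" + b for x, b in c1.items()})
        heapq.heappush(heap, (n0 + n1, next(tie), merged))  # ... are merged
    return heap[0][2]

code = huffman_code({0: 10, 1: 3, 2: 2, 3: 5})
print(code)                  # e.g. {0: '0', 3: '10', 2: '110', 1: '111'}
R = [0, 3, 0, 0, 0, 1, 0, 3, 0]
print("".join(code[x] for x in R))   # 13 bits; same length as H above,
                                     # codes within a level may be swapped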

One property of the MTF coding is that the whole prefix R[1..i − 1] is required to decode character R[i]; the
same holds for H. For encoding and decoding this is fine (practical assignment).
However, we want to search in the compressed FM index and hence need random access to L in algorithm
locate, which naively would take O(n) time. Ferragina and Manzini achieve random access directly on the
Huffman-encoded R; their algorithm, however, albeit optimal in theory, is not practical.
We will proceed differently: using a simple trick, we can determine L[i] with the Occ function. Clearly, the
values Occ(c, i) and Occ(c, i − 1) differ only for c = L[i].
Thus we can determine both L[i] and Occ(L[i], i) using σ Occ-queries.
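
In code this is a loop over the alphabet (a sketch; Occ is assumed to be any callable implementing the
Occ-function, with Occ(c, 0) = 0):

def access_L(Occ, Sigma, i):
    # recover L[i] and Occ(L[i], i) with at most sigma Occ-queries:
    # Occ(c, i) and Occ(c, i - 1) differ exactly for the character c = L[i]
    for c in Sigma:
        o = Occ(c, i)
        if o != Occ(c, i - 1):
            return c, o
    raise ValueError("position i out of range")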
Let us now discuss the possible space-time tradeoffs. The two simplest ideas are:

1. Avoid storing an Occ-table and scan L every time an Occ-query has to be answered. This occupies no
space, but needs O(n) time for answering a single Occ-query, leading to a total query time of O(mn) for
backwards search.
2. Store all answers to Occ(c, i) in a two-dimensional table. This table occupies O(nσ log n) bits of space, but
allows constant-time Occ-queries and makes the storage of L obsolete. Total time for backwards search
is optimal O(m).

For a more practical implementation we can proceed as follows:

11.5 Compressing Occ


We reduce the problem of counting the occurrences of a character in a prefix of L to counting 1's in a prefix of
a bitvector. To this end we construct a bitvector Bc of length n for each c ∈ Σ such that:

    Bc[i] = 1 if L[i] = c, and Bc[i] = 0 otherwise.

Definition 3. For a bitvector B we define rank1 (B, i) to be the number of 1’s in the prefix B[1..i]. rank0 (B, i) is
defined analogously.

As each 1 in the bitvector Bc indicates an occurrence of c in L, it holds:

Occ(c, i) = rank1 (Bc , i) .
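
This reduction is easy to write down (a sketch; rank1 is a naive O(i) scan here and will be replaced by the
constant-time structure developed below):

def build_bitvectors(L, Sigma):
    # one bitvector per character: Bc[i] = 1 iff L[i] = c (lists are 0-based)
    return {c: [1 if ch == c else 0 for ch in L] for c in Sigma}

def rank1(B, i):
    # number of 1's in the prefix B[1..i] -- naive scan for now
    return sum(B[:i])

B = build_bitvectors("aooooaaiioaeieeii", "aeio")
print(rank1(B["o"], 5))   # Occ('o', 5) = 4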

We will see that it is possible to answer a rank-query on a bitvector of length n in constant time using
additional tables of o(n) bits. Hence the σ bitvectors are an implementation of Occ that allows answering
Occ-queries in constant time with an overall memory consumption of O(σn + o(σn)) bits. In the following,
let B = B[1..n] be a bitvector.
We compute the length ℓ = ⌊(log n)/2⌋ and divide B into blocks of length ℓ and superblocks of length ℓ²,
i.e. every superblock spans ℓ consecutive blocks.

[Figure: B partitioned into blocks of length ℓ; every ℓ consecutive blocks form one superblock of length ℓ².]

1. For the i-th superblock we count the number of 1's from the beginning of B to the end of the superblock
   in M′[i] = rank1(B, i · ℓ²). As there are ⌊n/ℓ²⌋ superblocks, M′ can be stored in
   O(n/ℓ² · log n) = O(n/log n) = o(n) bits.

2. For the i-th block we count the number of 1's from the beginning of the overlapping superblock to the
   end of the block in M[i] = rank1(B[1 + kℓ..n], (i − k)ℓ), where k = ⌊(i−1)/ℓ⌋ · ℓ is the number of blocks
   left of the overlapping superblock. M has ⌊n/ℓ⌋ entries of value at most ℓ² and can be stored in
   O(n/ℓ · log ℓ²) = O(n · log log n / log n) = o(n) bits.

3. Let P be a precomputed lookup table such that for each possible bitvector V of length ℓ and each
   i ∈ [1..ℓ] it holds that P[V][i] = rank1(V, i). P has 2^ℓ × ℓ entries with values at most ℓ and can thus be
   stored in

       O(2^ℓ · ℓ · log ℓ) = O(2^((log n)/2) · log n · log log n) = O(√n · log n · log log n) = o(n)

   bits.

We now decompose a rank-query into 3 subqueries using the precomputed tables. For a position i we
determine the index p = ⌊(i−1)/ℓ⌋ of the next block left of i and the index q = ⌊(p−1)/ℓ⌋ of the next
superblock left of block p. Then it holds:

    rank1(B, i) = M′[q] + M[p] + P[B[1 + pℓ..(p+1)ℓ]][i − pℓ].
Note that B[1 + p`..(p + 1)`] fits into a single CPU register and can therefore be determined in O(1) time. Thus
a rank-query can be answered in O(1) time.
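
The scheme translates almost literally into code. The sketch below uses plain Python lists for M′, M and P,
so it demonstrates the query decomposition rather than the o(n)-bit packing:

import math

class RankDirectory:
    # constant-time rank1 via superblock/block counters and a lookup table P
    def __init__(self, B):            # B: list of 0/1 bits
        self.B = B
        n = len(B)
        self.l = l = max(1, int(math.log2(n)) // 2)
        # M1[q] = number of 1's in B[1..q*l*l]; M[p] = number of 1's from the
        # start of the superblock overlapping block p to the end of block p
        self.M1, self.M = [0], [0]
        ones = since_sb = 0
        for i, bit in enumerate(B, 1):
            ones += bit
            since_sb += bit
            if i % l == 0:
                self.M.append(since_sb)
            if i % (l * l) == 0:
                self.M1.append(ones)
                since_sb = 0
        # P[V][i] = rank1(V, i) for every possible l-bit vector V
        self.P = [[0] * (l + 1) for _ in range(1 << l)]
        for V in range(1 << l):
            for i in range(1, l + 1):
                self.P[V][i] = self.P[V][i - 1] + ((V >> (l - i)) & 1)

    def rank1(self, i):               # number of 1's in B[1..i]
        l = self.l
        p = (i - 1) // l              # blocks completely left of position i
        q = (p - 1) // l if p > 0 else 0   # superblocks left of block p
        word = self.B[p * l:(p + 1) * l]   # in-block bits, zero-padded below
        V = 0
        for bit in word + [0] * (l - len(word)):
            V = (V << 1) | bit
        return self.M1[q] + self.M[p] + self.P[V][i - p * l]

rd = RankDirectory(B["o"])            # bitvector of 'o' from the example above
print(rd.rank1(5))                    # 4, as with the naive scan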

11.6 Compressing Occ with Wavelet trees


Armed with constant-time rank-queries, we now develop a more space-efficient implementation of the Occ-
function, sacrificing the optimal query time. The idea is to use a wavelet tree on the BW-transformed text.
The wavelet tree of a sequence L[1, n] over an alphabet Σ is a balanced binary search tree of height O(log σ).
It is obtained as follows (a construction sketch in code follows the list):

1. We create a root node v, where we divide Σ into two halves Σl and Σr of roughly equal size, where the
left half contains the lexicographically smaller characters.
2. At v we store a bit-vector Bv of length n (together with data structures for O(1) rank-queries), where a 0
at position i indicates that character L[i] belongs to Σl , and a 1 indicates that it belongs to Σr .

3. This defines two (virtual) sequences Lv and Rv , where Lv is obtained from L by concatenating all characters
L[i] where Bv [i] = 0, in the order as they appear in L. Sequence Rv is obtained in a similar manner for
positions i with Bv [i] = 1.
4. The left child lv is recursively defined to be the root of the wavelet tree for Lv , and the right child rv to be
the root of the wavelet tree for Rv . This process continues until a sequence consists of only one symbol,
in which case we create a leaf.
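
A minimal construction sketch in Python (alphabet bounds are 1-based as in WT_occ below; plain lists stand
in for rank-indexed bitvectors, and the sequences Lv, Rv are only passed down, not stored):

class WaveletNode:
    # one inner node: a bitvector Bv plus left and right children
    def __init__(self, B, left, right):
        self.B, self.left, self.right = B, left, right

def build_wavelet(L, lo, hi, Sigma):
    # wavelet tree of sequence L over the alphabet slice Sigma[lo..hi] (1-based)
    if lo == hi:
        return None                          # leaf: only one symbol remains
    mid = (lo + hi) // 2
    left_half = set(Sigma[lo - 1:mid])       # lexicographically smaller half
    B = [0 if ch in left_half else 1 for ch in L]
    Lv = [ch for ch in L if ch in left_half]
    Rv = [ch for ch in L if ch not in left_half]
    return WaveletNode(B, build_wavelet(Lv, lo, mid, Sigma),
                          build_wavelet(Rv, mid + 1, hi, Sigma))

root = build_wavelet("aooooaaiioaeieeii", 1, 4, "aeio")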

Note that the sequences themselves are not stored explicitly; node v only stores a bit-vector Bv and structures
for O(1) rank-queries.
Theorem 4. The wavelet tree for a sequence of length n over an alphabet of size σ can be stored in
n log σ · (1 + o(1)) bits.

Proof: We concatenate all bit-vectors at the same depth d into a single bit-vector Bd of length n, and prepare
it for O(1)-rank-queries. Hence, at any level, the space needed is n + o(n) bits. Because the depth of the tree
is ⌈log σ⌉, the claim on the space follows. In order to determine the sub-interval of a particular node v in the
concatenated bit-vector Bd at level d, we can store two indices αv and βv such that Bd [αv , βv ] is the bit-vector Bv
associated to node v. This accounts for additional O(σ log n) bits. Then a rank-query is answered as follows
(b ∈ {0, 1}):
rankb (Bv , i) = rankb (Bd , αv + i − 1) − rankb (Bd , αv − 1),
where it is assumed that i ≤ βv − αv + 1, for otherwise the result is not defined.
How does the wavelet tree help for implementing the Occ-function? Suppose we want to compute Occ(c, i),
i.e., the number of occurrences of c ∈ Σ in L[1, i]. We start at the root r of the wavelet tree, and check if c belongs
to the first or to the second half of the alphabet.
In the first case, we know that the c's are in the left child of the root, namely in Lr . Hence, the number of c's
in L[1, i] corresponds to the number of c's in Lr [1, rank0 (Br , i)]. If, on the other hand, c belongs to the second
half of the alphabet, we know that the c's are in the subsequence Rr that corresponds to the right child of r,
and hence compute the number of occurrences of c in Rr [1, rank1 (Br , i)] as the number of c's in L[1, i].
This leads to the following recursive procedure for computing Occ(c, i), to be invoked as
WT_occ(c, i, 1, σ, root). (Recall that we assume that the characters in Σ can be accessed as
Σ[1], . . . , Σ[σ].)

def rank0(B, i):
    # number of 0's in B[1..i], complementing the naive rank1 from above
    return i - rank1(B, i)

def WT_occ(c, i, sigma_l, sigma_r, v):
    # Occ(c, i) via the wavelet tree; v covers the 1-based alphabet
    # range Sigma[sigma_l..sigma_r]
    if sigma_l == sigma_r:
        return i                        # leaf: every counted character equals c
    sigma_m = (sigma_l + sigma_r) // 2  # = floor((sigma_l + sigma_r) / 2)
    if c <= Sigma[sigma_m - 1]:         # c lies in the left (smaller) half
        return WT_occ(c, rank0(v.B, i), sigma_l, sigma_m, v.left)
    else:
        return WT_occ(c, rank1(v.B, i), sigma_m + 1, sigma_r, v.right)

Sigma = "aeio"
print(WT_occ("o", 5, 1, len(Sigma), root))  # Occ('o', 5) = 4

Due to the depth of the wavelet tree, the time for WT_occ(·) is O(log σ). This leads to the following theorem.
Theorem 5. With backward-search and a wavelet tree on Tbwt, we can answer counting queries in O(m log σ) time.
The space (in bits) is

    O(σ log n) + n log σ + o(n log σ),

where the first term accounts for |C| plus the space for the αv , the second term accounts for the wavelet tree, and the
third term accounts for the rank data structures.

11.7 Compressing pos


To compress pos we mark only a subset of the rows of the matrix M and store their text positions. Hence we
need a data structure that efficiently decides whether a row Mi = T(j) is marked and that retrieves j for a
marked row i.

If we marked every η-th row of the matrix (η > 1), we could easily decide whether row i is marked, e.g. iff
i ≡ 1 (mod η). Unfortunately this approach still has worst cases in which a single pos-query takes
O(((η−1)/η) · n) time (exercise).
Instead we mark the matrix row for every η-th text position, i.e. for all j ∈ [0..⌈n/η⌉) the row i with
Mi = T(1 + jη) is marked with the text position pos(i) = 1 + jη. To determine whether a row is marked we
could store all marked pairs (i, 1 + jη) in a hash map or a binary search tree with key i.
Instead we can again use our bitvector with O(1) rank-query support.

11.8 Compressing pos


We can use a bitvector Bpos with rank-query support in conjunction with an array Pos of size ⌈n/η⌉.
If we still have the suffix array A during the construction of the BWT, we can simply scan through the array
maintaining an index k which we initialize with 0. Whenever A[i] ≡ 1 (mod η), we set the i-th bit of the
bitvector, store Pos[k] = A[i] and increment k.
If the suffix array is not given, we use the BWT and the L-to-F mapping to traverse the text backwards as in
the reconstruction algorithm. While doing this, we keep counting the number of steps. After every η
backward steps we are at text positions n − η, n − 2η, . . . and mark the corresponding bits in the bitvector.
After setting all bits we traverse the BWT again, maintaining a counter m which we initialize with 0.
Whenever the bit of the current row is set, we increment m, obtain the rank k of the bit and set
Pos[k] = n − m · η.
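
Putting the pieces together, a pos-query walks backwards until it hits a marked row (a sketch; LF is assumed
to be a callable implementing the L-to-F mapping, and Bpos a 0/1 list indexed by row, queried with the
naive rank1 from above):

def pos(i, Bpos, Pos, LF, eta):
    # text position of row i: walk backwards until a marked row is reached;
    # at most eta - 1 LF-steps, so a query costs O(eta) Occ-queries
    steps = 0
    while not Bpos[i - 1]:          # rows are 1-based, the list is 0-based
        i = LF(i)                   # one step backwards in the text
        steps += 1
    k = rank1(Bpos, i)              # rank of the marked row i
    return Pos[k - 1] + steps       # its stored text position, plus the steps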
