Conference Paper in Proceedings of the Data Compression Conference, March 2014. DOI: 10.1109/DCC.2014.74
Enhanced Variable-Length Codes:
Improved Compression with Efficient Random Access
M. Oğuzhan Külekci
TÜBİTAK - BİLGEM - UEKAE
Kocaeli, 41470, Turkey
[email protected]
Abstract
We investigate the usage of the wavelet tree and the rank/select–dictionary data struc-
tures on hybrid-structured variable–length codes, which represent an integer in the form of
a unary code section followed by a binary section. We propose to handle unary and binary
partitions as separate streams and create wavelet trees or R/S dictionaries over the unary
streams, which grants us the opportunity to directly access any codeword. Particularly con-
centrating on Elias and Rice schemes, we introduce several solutions that i) improve the
compression significantly, and ii) provide random access in constant or logarithmic time.
Experiments are conducted to compare the performances of the proposed codes against
Elias/Rice schemes and more recent state-of-the-art codings such as Simple9, PForDelta,
DACs, and improved-AC techniques. We observed that the newly introduced methods
outperform the original Elias/Rice codecs by ≈30% and the others by ≈10% in terms of
compression ratios. The methods described in this study are generic and may further be
extended to some other hybrid structure (unary/binary) variable–length codes as well.
1 Introduction
Hybrid usage of unary and binary codes for compact representation of integers has
been the backbone of many variable–length coding schemes [1, 2] such as Elias [3],
Golomb [4], and Rice [5]. A major difficulty of those codes is the lack of efficient
methods to support random access to the ith element, whose codeword’s beginning
bit address is not known on the encoded stream, and thus, one needs to decode all
previous (i − 1) values to reach it.
One alternative to tackle this problem is sampling. By using an additional array
that contains the beginning bit positions of every h-integer block on the encoded
stream, the time complexity of random access becomes O(h): we can begin from the
sampled bit address of the ⌊i/h⌋th block to retrieve the ith integer value, which
requires at most h integer decodings. This strategy defines a time/memory trade-off
between the extra space spent to store the sampled positions and the access time.
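To make the trade-off concrete, the sampling strategy above can be sketched as follows. This is a minimal Python illustration (not the paper's C implementation); the encoded stream is modeled as a character string for clarity, and all function names are ours:

```python
# Illustrative sketch: Elias-gamma coding with block sampling, giving
# O(h) random access at the cost of one stored bit offset per h integers.
# Codeword of x >= 0: floor(log2(x+1)) zeros, then the bits of (x+1)
# whose leading 1 terminates the unary part.

def elias_gamma_encode(values, h):
    bits, samples = [], []
    for i, x in enumerate(values):
        if i % h == 0:
            samples.append(len(bits))      # bit offset of each block start
        b = bin(x + 1)[2:]                 # minimal binary of (x+1)
        bits.extend('0' * (len(b) - 1))    # unary length part
        bits.extend(b)                     # leading 1 ends the unary part
    return ''.join(bits), samples

def access(bits, samples, h, i):
    pos = samples[i // h]                  # jump to the sampled block start
    for _ in range(i % h):                 # decode at most h-1 codewords
        l = bits.index('1', pos) - pos
        pos += 2 * l + 1
    l = bits.index('1', pos) - pos
    return int(bits[pos + l:pos + 2 * l + 1], 2) - 1
```

Larger h means fewer sampled positions but more sequential decoding per access, which is exactly the time/memory trade-off described above.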
In this study, we investigate how compressed data structures may help in variable-
size coding, and particularly focusing on Elias [3] and Rice [5] codes, we present
methods that i) significantly improve the compression ratios, and ii) provide efficient
methods to directly access the ith codeword.
We begin first by investigating the usage of rank/select (R/S) dictionary data
structures on variable-length coding. Previously, Brisaboa et al. [6] introduced DACs
(directly accessible codes) by integrating R/S dictionaries into the Vbyte [7] byte-aligned
coding scheme. Alternatively, we introduce several schemes that consider building
R/S data structures over the concatenated unary partitions of Elias and Rice codings,
such that we can reach the ith codeword in O(1) time by two select queries.
The second dimension we study is the use of wavelet trees. We propose to first label
each integer xi of the input sequence by the length of its unary code section in the
corresponding Elias or Rice coding, which is ⌊log xi⌋ and ⌊xi/2^k⌋, respectively. By
creating a wavelet tree over these labels and storing the corresponding binary-code
sections at the leaf nodes, it becomes possible to extract the ith codeword in log r
steps, where r is the number of distinct labels occurring in the sequence. This idea
is akin to the alphabet partitioning [8] that aims to improve compressed rank/select
over text sequences by reducing the alphabet size by mapping several symbols into
a single one according to their frequencies. We use different labeling strategies other
than [8] with the aim to enhance compression on integer sequences while providing
direct access at the same time.
According to the experiments conducted, both the R/S dictionary and wavelet
tree solutions introduced in this study reduce the space consumptions of original
Elias/Rice schemes by ≈ 30%. On the test data, we also observed ≈ 10% improve-
ment in compression against the more recent compact integer representation tech-
niques [6, 9–11]. Notice that efficient random access is another gain beside improved
compression. Throughout the study, although we mainly work on Elias and Rice
schemes, which are amongst the most widely preferred fixed-to-variable codes in com-
pressed integer representations, the newly introduced methods define a generic idea
that is applicable to other hybrid (unary+binary) variable-length coding schemes.
2 Elias and Rice Codes with R/S Dictionaries
Let X = ⟨x0, x1, x2, . . . , xn−1⟩ be a given sequence of non-negative integers, where
xi ∈ {0, 1, 2, . . . , u − 1} for all 0 ≤ i < n. Each xi > 1 can be represented by ⌊log xi⌋
bits by omitting the leftmost bit that is always set to 1, and each xi ∈ {0, 1} can be
shown by reserving one bit in raw format. Notice that the bit stream formed by
concatenating the minimal binary representations of the xi values is not uniquely
decodable.
Definition: We denote by Υ = Σ_{∀i: xi>1} ⌊log xi⌋ + Σ_{∀i: xi=0 ∨ xi=1} 1 the total
minimal binary representation length of sequence X in bits.
Throughout the paper, we assume a word RAM model with a word length of at
least log u bits, so that we can extract that many bits with a single read operation.
EliasH Coding: The Elias-γ [3] coding of xi consists of 2·⌊log(xi + 1)⌋ + 1
bits: the first ⌊log(xi + 1)⌋ bits, set to 0, encode the actual length in unary,
followed by a 1 bit to indicate the end of the unary part, and the last ⌊log(xi + 1)⌋ bits
contain the minimal binary representation of the (xi + 1) value. In EliasH coding, we store
the unary and binary sections of Elias-γ encoded integers in separate bit-arrays, and
represent the unary bit stream with an entropy-compressed R/S dictionary.
Lemma 2.1 EliasH coding of n non-negative integers approximately requires
n log(Υ/n) + o(Υ + n) + Υ bits of space and supports O(1)-time random access.
[Figure 1: Coding and access to elements on X = ⟨127, 32, 3, 56, 201, . . .⟩ via EliasH, RiceH, and RiceV coding.
a) EliasH. Access(x3): select1(U, 3) = 16; select1(U, 4) = 22; binary-code length = 22 − 16 − 1 = 5; binary-code address = 16 − 3 + 1 = 14; B[14 . . . 18] = 11001; x3 + 1 = (1)11001 = 57; x3 = 56.
b) RiceH (k = 4). Access(x3): select1(U, 3) = 11; select1(U, 4) = 15; quotient = 15 − 11 − 1 = 3; remainder = B[3·4 . . . 4·4 − 1] = B[12 . . . 15] = 1000 = 8; x3 = 2^4 · 3 + 8 = 56.
c) RiceV (k = 4), with the unary sections split level-wise into bitmaps U0, U1, U2, . . . Access(x3): getbit(3, U0) = 0, rank1(3, U0) = 1; getbit(3 − 1, U1) = 0, rank1(2, U1) = 0; getbit(2 − 0, U2) = 0, rank1(2, U2) = 1; getbit(2 − 1, U3) = 1 → quotient = 3; remainder = B[12 . . . 15] = 1000; x3 = 2^4 · 3 + 8 = 56.]
Proof The unary section U includes Υ 0s and n 1s. The zeroth-order entropy of
that sequence is log C(Υ + n, n) ≈ n log(Υ/n) bits. Hence, a compressed representation of U
supporting O(1)-time R/S queries occupies n log(Υ/n) + o(Υ + n) bits [12]. Summing
this value with the binary section B of length Υ gives the total space consumption.
While accessing a random xi in EliasH, we run 2 select queries to obtain the positions
of the ith and (i + 1)th 1 bits on U. The difference between these gives us the code
length of (xi + 1) on B. The bit address of the minimal binary encoded (xi + 1) on
B is then calculated by subtracting i from the result of the first select query, since
the number of bits reserved on B to encode the previous integers is equal to
the number of 0s observed before the ith set bit on U. We move to that location on
B and extract the value of (xi + 1). In total, random access takes exactly 2 select
queries that can be performed as constant-time operations via the appropriate R/S
dictionary data structures. Figure 1–a gives a sketch of EliasH coding.
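The two-select access procedure of Lemma 2.1 can be sketched as follows. This is an illustrative Python model under simplifying assumptions: plain bit lists stand in for the entropy-compressed R/S dictionary, and select1 is a linear scan rather than the O(1)-time structure the lemma assumes; the function names are ours:

```python
# Sketch of the EliasH layout: unary lengths in U, leading-1-stripped
# binary codes in B, random access via two select queries on U.

def eliash_encode(values):
    U, B = [], []
    for x in values:
        b = bin(x + 1)[2:]                  # minimal binary of (x+1)
        U.extend([0] * (len(b) - 1) + [1])  # unary length, 1-terminated
        B.extend(int(c) for c in b[1:])     # drop the leading 1
    return U, B

def select1(U, j):
    """0-based position of the j-th (1-based) set bit of U (linear scan)."""
    seen = 0
    for pos, bit in enumerate(U):
        seen += bit
        if seen == j:
            return pos
    raise IndexError("fewer than j ones")

def eliash_access(U, B, i):                 # 0-based index i
    s_prev = select1(U, i) if i > 0 else -1 # end of the previous codeword
    s_cur = select1(U, i + 1)               # end of x_i's codeword
    length = s_cur - s_prev - 1             # bits of (x_i+1) minus leading 1
    addr = s_prev - i + 1                   # zeros on U before x_i's code
    code = ''.join(map(str, B[addr:addr + length]))
    return int('1' + code, 2) - 1 if length else 0
```

Running this on the sequence of Figure 1 reproduces the worked example: select1(U, 3) = 16, select1(U, 4) = 22, and x3 = 56.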
RiceH Coding: In Rice [5] coding, xi is represented by the unary encoded ⌊xi/2^k⌋
value (quotient) and the binary encoded xi − 2^k·⌊xi/2^k⌋ value (remainder) according
to a predetermined k. Notice that the remainder is of fixed length, k bits long. Fol-
lowing the same idea as in EliasH, we propose in RiceH coding to represent the unary
encoded quotient values via an entropy-compressed R/S dictionary, and keep the
binary encoded remainder values in a separate array.
Lemma 2.2 The RiceH coding of X occupies n log(ω/(n·2^k)) + o(ω/2^k + n) + nk bits of space
and supports O(1)-time random access, where ω = Σ_{∀i} xi.
Proof The length of the unary section is |U| = Σ_{i=0}^{n−1} (⌊xi/2^k⌋ + 1) ≈ ω/2^k + n, where n
bits are set to 1 and the rest to 0. The zero-order entropy of U may be approximated
as log C(ω/2^k + n, n) ≈ n log(ω/(n·2^k)), and thus, the space needed for the entropy-compressed R/S
dictionary is approximately n log(ω/(n·2^k)) + o(ω/2^k + n) bits. The remainder values of all xi
are represented with the bit array B of size n·k bits. In random access to xi,
first the indices of the ith and (i + 1)th 1 bits are detected by running two select
queries on the unary stream, whose difference gives us the quotient value ⌊xi/2^k⌋. By
directly extracting the remainder from the B[k·i . . . k·(i + 1) − 1] bits on the B array,
xi = (quotient · 2^k + remainder) is retrieved. See Figure 1–b for a detailed sketch.
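The RiceH layout can be sketched analogously. As before, this is an illustrative Python model (our names, not the paper's code): plain lists model the unary stream U and the fixed-width remainder array B, and select1 is a linear scan standing in for the O(1)-time dictionary:

```python
# Sketch of RiceH: unary quotients in U, k-bit remainders in B,
# random access via two select queries plus one fixed-width read.

def riceh_encode(values, k):
    U, B = [], []
    for x in values:
        q, r = x >> k, x & ((1 << k) - 1)
        U.extend([0] * q + [1])             # unary quotient, 1-terminated
        B.extend((r >> (k - 1 - j)) & 1 for j in range(k))  # k-bit remainder
    return U, B

def select1(U, j):
    """0-based position of the j-th (1-based) set bit of U (linear scan)."""
    seen = 0
    for pos, bit in enumerate(U):
        seen += bit
        if seen == j:
            return pos
    raise IndexError("fewer than j ones")

def riceh_access(U, B, k, i):
    s_prev = select1(U, i) if i > 0 else -1
    q = select1(U, i + 1) - s_prev - 1      # quotient from two selects
    r = int(''.join(map(str, B[k * i:k * (i + 1)])), 2)
    return (q << k) + r
```

Note that, unlike EliasH, the remainder address k·i is independent of the select results, which is the simplification discussed in Section 4.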
RiceV Coding: In RiceH coding, instead of keeping a single unary bit-stream,
it is possible to represent the unary sections of each codeword in a vertical fashion
by concatenating the first (leftmost) bits of every individual unary codeword in one
bitmap, the second bits in another bitmap, and so on. Obviously, the bit arrays will
be of varying lengths. This amounts to creating the bit-arrays in a vertical
fashion as depicted in Figure 1–c, hence we refer to it as RiceV coding.
Retrieving the xi value in a RiceV encoded sequence begins with extracting the
ith bit of the first bit-stream, which includes the first bits of the unary codes of all
elements in X. A 0 bit here means that the regarding unary section of xi has not yet
finished and continues on the next-level bitmap. In such a case, we perform a rank
query to count how many 1s appear up to position i, which specifies how many of the
integers are not represented in the next bitmap. We decrement i by the result of that
rank query and perform the same operation on the next bitmap, which includes the
second bits of the unary sections of all xi such that ⌊xi/2^k⌋ > 0. The number of times we
need to perform this iteration until detecting the 1 bit marking the end of the unary
section of the target xi is ⌈xi/2^k⌉. Extracting the remainder value on the binary
part of xi is trivial by fetching the ith k-bit block on B. The random access time
complexity of RiceV becomes O(u/2^k) (remember that (u − 1) is the largest integer in
X) as opposed to O(1) in RiceH. However, as can be observed from the experimental
results, RiceV generally achieves better compression than RiceH.
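The vertical layout and its rank-guided descent can be sketched as follows; again a Python illustration under simplifying assumptions (the levels are plain bit lists, and rank1 is a linear scan standing in for an O(1)-time succinct structure):

```python
# Sketch of RiceV: level d holds one bit per integer whose unary section
# reaches depth d; a 1 marks that the quotient equals d. Remainders are
# stored as in RiceH.

def ricev_encode(values, k):
    quotients = [x >> k for x in values]
    levels, alive, d = [], list(range(len(values))), 0
    while alive:
        levels.append([1 if quotients[i] == d else 0 for i in alive])
        alive = [i for i in alive if quotients[i] > d]  # survivors continue
        d += 1
    B = [(x >> (k - 1 - j)) & 1 for x in values for j in range(k)]
    return levels, B

def rank1(bits, pos):                      # ones in bits[0..pos-1]
    return sum(bits[:pos])

def ricev_access(levels, B, k, i):
    d, pos = 0, i
    while levels[d][pos] == 0:             # unary section continues below
        pos -= rank1(levels[d], pos)       # finished items drop out
        d += 1
    r = int(''.join(map(str, B[k * i:k * (i + 1)])), 2)
    return (d << k) + r                    # quotient is the depth reached
```

The number of rank queries grows with the quotient of the accessed value, matching the O(u/2^k) access bound, while the shorter and sparser per-level bitmaps are what the compressed R/S dictionaries exploit.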
3 Elias and Rice Codes with Wavelet Trees
This section describes two methods that integrate wavelet trees into the Elias-γ and Rice
coding schemes. The main idea in both (which is akin to [8]) is to label each integer in
the sequence with its unary-code value, which is ⌊log xi⌋ in Elias-γ and ⌊xi/2^k⌋ in Rice.
The wavelet trees are constructed over the label sequences and include the regarding
binary-code values at their leaves.
EliasW(avelet) Coding: Assume set L = {ℓ0, ℓ1, ℓ2, . . . , ℓr−1} includes the dis-
tinct ⌊log xi⌋ values observed on X. We propose to create a wavelet tree by recursively
splitting X into two according to the corresponding ℓi values. Each leaf node in
the corresponding wavelet tree is devoted to a unique bit-length value ℓ, and
all xi values such that ℓ = ⌊log xi⌋ are stored in the leaf as a bit string composed of
the concatenated minimal binary representations of those xi. Such a wavelet tree will be
of height ⌈log r⌉, and provides access to any item in at most ⌈log r⌉ steps. Figure 2–a
shows the proposed scheme on a sample sequence. Notice that we assume log 0 = 0,
and keep the xi values equal to 0 and 1 in a separate node dedicated to ℓ = 0. While
retrieving integers from that special node, we do not concatenate a 1 to their beginnings.
[Figure 2: The i) EliasW and ii) RiceW (with k = 2) coding of the sample sequence
X = {3, 6, 0, 11, 5, 1, 3, 15, 9, 13}, with EliasW labels ⌊log xi⌋ = {1, 2, 0, 3, 2, 0, 1, 3, 3, 3}
and RiceW labels ⌊xi/2^k⌋ = {0, 1, 0, 2, 1, 0, 0, 3, 2, 3}.]
This wavelet tree can be stored as a single bit stream by concatenating the bitmaps
of the nodes in a predetermined (depth-first or breadth-first) tree traversal. No de-
limiters (or pointers) in between the individual bitmaps are necessary since we can
restore the tree topology along with the lengths of the bitmaps at each node once set
L and n are given beforehand.
Lemma 3.1 Given a sequence X = ⟨x0, x1, . . . , xn−1⟩ of non-negative integers such
that 0 ≤ xi < u, and L = {ℓ0, ℓ1, . . . , ℓr−1} is the set of distinct ⌊log xi⌋ values observed
in X, the EliasW coding of X occupies at most n·⌈log r⌉ + Υ bits of space plus a few
bits to encode the value n and set L for correct decoding.
Proof The wavelet tree generated for coding sequence X includes at most ⌈log r⌉
internal levels including the root. Since the internal nodes at each level contain
exactly n bits in total, we need at most n·⌈log r⌉ bits. The actual
xi values are encoded at the leaves with their minimal binary representations, which
require Υ bits in total. We need to know the value n and set L for proper decoding.
Following the encodings of n and the set size r = |L| in a universal coding scheme,
the items in set L can be represented simply by r·log u bits, which in total does not cause
an increase in space complexity, and can be handled via a few bytes in practice.
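As an illustration of the construction of Lemma 3.1, the following Python sketch builds a balanced wavelet tree over the labels and decodes by rank-guided descent. It works under stated simplifying assumptions: nested Python tuples replace the single concatenated bit stream, rank is a linear scan rather than a succinct O(1)-time structure, and the names are ours:

```python
# Illustrative EliasW sketch: labels are floor(log2 x) (log 0 taken as 0);
# leaves store minimal binary codes with the leading 1 dropped (kept raw
# for the special label-0 leaf holding values 0 and 1).

def label(x):
    return x.bit_length() - 1 if x > 1 else 0

def build(values, labels):
    if len(labels) == 1:                   # leaf: minimal binary codes
        l = labels[0]
        return ('leaf', l, [x - (1 << l) if l else x for x in values])
    mid = len(labels) // 2
    left_set = set(labels[:mid])
    bitmap = [0 if label(x) in left_set else 1 for x in values]
    lvals = [x for x in values if label(x) in left_set]
    rvals = [x for x in values if label(x) not in left_set]
    return ('node', bitmap,
            build(lvals, labels[:mid]), build(rvals, labels[mid:]))

def access(tree, i):
    while tree[0] == 'node':
        _, bitmap, left, right = tree
        b = bitmap[i]
        i = sum(1 for c in bitmap[:i] if c == b)   # rank_b(i) in the bitmap
        tree = left if b == 0 else right
    _, l, codes = tree
    return codes[i] + (1 << l) if l else codes[i]  # re-prepend the leading 1

def eliasw(values):
    labels = sorted({label(x) for x in values})
    return build(values, labels)
```

On the sample sequence of Figure 2, the labels come out as {1, 2, 0, 3, 2, 0, 1, 3, 3, 3}, and every element is recovered after at most ⌈log r⌉ rank steps. RiceW follows the same pattern with ⌊xi/2^k⌋ as the label function.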
Remark When L = {0, 1, 2, . . . , ⌊log(u − 1)⌋}, which means every possible code
length has occurred in X, the space requirement is at most n·⌈log⌈log u⌉⌉ + Υ bits.
RiceW(avelet) Coding: Following the same idea as in EliasW coding, we may
construct set L as the unique ⌊xi/2^k⌋ values observed in X, computed with
a predetermined k, and build the wavelet tree accordingly. Each leaf is dedicated to
a unique quotient value, and the remainder values of the xi having the same quotient
value are concatenated to form the bit stream at each leaf. Figure 2–b shows a sample
encoding of X with this scheme.
Lemma 3.2 The RiceW coding of X occupies at most n·⌈log r⌉ + nk bits of space
plus a few bits to encode the value n and the set of r distinct ⌊xi/2^k⌋ values observed
in X for some parameter k < log u.
Proof The n·⌈log r⌉ bits are spent in the wavelet tree as described in Lemma 3.1.
The leaves require n·k bits in total as the remainder values are all k bits long.
Remark The space usage becomes n·⌈log(⌈log u⌉ − k)⌉ + nk bits assuming all possible
⌊xi/2^k⌋ values appear in L = {0, 1, . . . , ⌊log(u − 1)⌋ − k}, and hence, r = ⌈log u⌉ − k.
Lemma 3.3 In EliasW and RiceW codings, any element of sequence X can be ac-
cessed in O(⌈log r⌉) time by reserving an additional o(n·⌈log r⌉) bits.
Proof The maximum number of internal nodes we need to traverse to reach the leaves
is ⌈log r⌉. At each visited node we access the qth bit bq in constant time and run a
rank(bq, q, B) query on the corresponding bitmap to count the number of bq s up to
position q on B. We know that rank queries on an n-bit binary string can be exe-
cuted in O(1) time using n + o(n) space via succinct data structures [13, 14]. Thus,
augmenting the bitmaps of every node with the succinct data structures to support
constant-time rank queries, the ith item of the input sequence X can be reached in
O(⌈log r⌉) time. This brings an overhead of o(n·⌈log r⌉) bits as we have at most
⌈log r⌉ levels, each of which contains n bits in total.
The rank queries can be performed in O(1) time using compressed space n·H0 + o(n)
bits [15] on binary strings, where H0 represents the 0th-order empirical entropy of the source
bitmap. Thus, the space requirement in accessing EliasW and RiceW codings may
be further improved in practice. Another way to reach compressed space might
be to use Huffman-shaped wavelet trees [16] instead of balanced trees. Notice that
in that case the height of the tree will be greater, but the access time will be shorter for
frequent elements.
4 Implementation and Experimental Results
We have implemented our proposals in C and run benchmarks against the original
Elias and Rice schemes as well as the more recent Simple9 [9], PForDelta [10], DACs
[6], and improved-AC [11] codes, which have reported improved compression ratios and
random-access times. All experiments were conducted on a machine running 64-bit Linux
Mint (nadia) with an Intel Core i7-3770 @ 3.40GHz processor, 16GB of main memory,
and 8192KB cache size. All reported results are the means of one hundred runs.
In the EliasW and RiceW implementations, we split the set L into two such that the
input set of integers is roughly clustered into two halves, instead of splitting L into two
equal-cardinality parts. Although the number of levels then becomes more than ⌈log r⌉,
we provide faster random access to frequent items. Since at each level the number of
0s is roughly equal to the number of 1s, using a compressed bitmap for rank queries does not make
much sense, and thus we keep the bitmaps in raw format and apply the basic rank
structure by using 5% extra space over that raw representation. However, for the EliasH,
RiceH and RiceV schemes we used the compressed-format RSDic library¹.
¹ http://code.google.com/p/rsdic/, an implementation of R/S in compressed space based on [17].
a) Results obtained on uniformly distributed integers:

range        Elias–γ  Elias–δ  EliasW   EliasH   Rice        RiceW       RiceH       RiceV       DAC      Simple9  PforDelta  iAC
[2^0,2^8)    13.319   12.225   8.120    11.366   8.740 [6]   8.001 [8]   8.203 [8]   8.203 [8]   8.001    10.700   8.033      10.63
             (0.448)  (0.420)  (0.052)  (0.192)  (0.372)     (0.008)     (0.128)     (0.052)     (0.008)  (0.048)  (0.312)    (0.068)
[2^8,2^16)   29.314   22.286   16.119   21.964   16.742 [15] 16.001 [16] 16.203 [16] 16.203 [16] 16.001   30.681   16.034     18.5947
             (0.612)  (0.444)  (0.056)  (0.188)  (0.260)     (0.008)     (0.128)     (0.052)     (0.008)  (0.076)  (0.368)    (0.068)
[2^16,2^24)  45.323   31.282   24.119   32.069   24.752 [23] 24.001 [24] 24.203 [24] 24.203 [24] 24.001   32.250   25.449     26.4697
             (0.788)  (0.344)  (0.060)  (0.190)  (0.260)     (0.012)     (0.132)     (0.052)     (0.016)  (0.080)  (7.360)    (0.072)

b) Results obtained on normally distributed integers with mean µ and deviation µ/4:

µ      Elias–γ  Elias–δ  EliasW   EliasH   Rice        RiceW       RiceH       RiceV       DAC      Simple9  PforDelta  iAC
2^8    16.213   14.683   9.127    13.404   9.743 [8]   8.330 [5]   9.808 [8]   8.726 [7]   10.001   10.907   9.034      11.652
       (0.476)  (0.320)  (0.040)  (0.188)  (0.256)     (0.096)     (0.116)     (0.180)     (0.008)  (0.048)  (0.328)    (0.064)
2^10   20.207   16.718   11.128   16.079   11.741 [10] 10.332 [7]  11.805 [10] 10.725 [9]  12.001   16.240   11.035     13.652
       (0.528)  (0.312)  (0.044)  (0.188)  (0.256)     (0.092)     (0.116)     (0.188)     (0.012)  (0.056)  (0.348)    (0.068)
2^12   24.203   18.726   13.128   18.596   13.740 [11] 12.331 [9]  13.804 [12] 12.725 [11] 14.001   16.240   13.035     15.652
       (0.544)  (0.304)  (0.044)  (0.192)  (0.288)     (0.092)     (0.120)     (0.180)     (0.016)  (0.064)  (0.364)    (0.068)
2^14   28.202   20.726   15.128   21.226   15.740 [13] 14.332 [11] 15.804 [14] 14.725 [13] 16.001   26.917   16.033     17.621
       (0.584)  (0.308)  (0.040)  (0.188)  (0.288)     (0.100)     (0.120)     (0.180)     (0.012)  (0.076)  (0.376)    (0.064)
2^16   32.201   24.680   17.128   23.852   17.750 [15] 16.332 [13] 17.804 [16] 16.725 [15] 18.001   32.250   18.100     19.574
       (0.624)  (0.356)  (0.044)  (0.188)  (0.292)     (0.096)     (0.116)     (0.180)     (0.012)  (0.080)  (0.374)    (0.072)
2^18   36.211   26.725   19.128   26.383   19.750 [18] 18.332 [15] 19.804 [18] 17.725 [17] 20.005   32.250   20.035     21.481
       (0.656)  (0.336)  (0.044)  (0.189)  (0.288)     (0.096)     (0.120)     (0.176)     (0.016)  (0.080)  (0.380)    (0.068)
2^20   40.211   28.725   21.128   28.914   21.750 [19] 20.332 [17] 21.804 [20] 20.725 [19] 22.005   32.250   22.101     23.480
       (0.704)  (0.340)  (0.048)  (0.188)  (0.284)     (0.092)     (0.120)     (0.180)     (0.016)  (0.084)  (3.772)    (0.072)
2^22   44.211   30.726   23.128   31.392   23.750 [21] 22.332 [19] 23.804 [22] 22.725 [21] 24.005   32.250   24.413     25.480
       (0.768)  (0.344)  (0.044)  (0.184)  (0.288)     (0.096)     (0.116)     (0.180)     (0.016)  (0.080)  (5.668)    (0.072)

Table 1: Space consumption in bits/integer computed on one million non-negative integers
with a) uniform and b) Gaussian distributions. Random access times in microseconds are
given in parentheses on secondary lines, and the bracketed values in the Rice variants represent
the parameters giving the best compression ratios. Sampling frequency is set to 100 for the Elias-γ,
Elias-δ, Rice, Simple9, and PforDelta schemes.
We first tested the newly introduced schemes on uniformly distributed integers to
observe their performance on small, medium and large integers. The results are shown
in Table 1a, where we see that RiceW showed the best performance along with the DACs,
both in compactness and random access time. Note that on a uniformly distributed integer
set in range [2^i, 2^j), the distribution of the ⌊log xi⌋ values is highly skewed towards
j, which might create a biased scheme, especially for our proposals.
Thus, we performed a second test on normally distributed sets of integers having
mean µ and deviation µ/4, whose results are given in Table 1b. The RiceW scheme seems to
be the unique leader in this case, reflecting approximately 10% enhanced space usage
against the best competitor among the previously known results (PForDelta on small
µ values, and DACs on larger ones). RiceV is consistently the second best throughout the
experiments, and outperforms the others when µ = 2^18. In general, RiceV and RiceW
coded integers roughly occupy 1.3 to 1.6 bits less than the best previously known
scheme. However, in terms of random access time, DACs is the absolute leader.
We ran a third experiment to measure the performance on real-world data,
where we preferred to use the same data set used in [6], namely the longest common
              dblp              dna               protein
              space    time     space    time     space     time
Elias–γ       10.083   –        7.751    –        8.289     –
Elias50–γ     10.683   0.196    8.351    0.200    8.889     0.236
Elias–δ       9.542    –        8.390    –        7.863     –
Elias50–δ     10.142   0.192    8.990    0.196    8.463     0.216
EliasW        7.311    0.044    4.728    0.044    5.932     0.060
EliasH        9.765    0.184    8.273    0.156    8.035     0.144
Rice          (5) 6.919  –      (4) 6.049  –      (7) 9.555   –
Rice50        (5) 7.519  0.116  (4) 6.649  0.112  (7) 10.155  0.100
RiceW         (5) 7.015  0.052  (3) 4.600  0.048  (3) 6.597   0.102
RiceH         (6) 6.836  0.164  (4) 5.413  0.156  (6) 8.458   0.160
RiceV         (4) 5.679  0.304  (3) 4.614  0.376  (6) 8.202   0.372
DACs          7.522    0.024   5.543    0.028    6.579     0.044
Simple950     7.956    0.040   6.254    0.040    8.213     0.044
PForDelta50   6.283    0.800   5.141    0.964    6.732     2.864
improved–AC   9.112    0.068   8.040    0.064    9.110     0.068

Table 2: Experimental results obtained on LCP arrays computed from the first 100M of the
dblp, dna, and protein files of the Pizza–Chili corpus. The superscript 50 represents the sampling
frequency for random access wherever applicable, and the parameters of the Rice codings are
given in parentheses.
prefix (LCP) arrays computed for the first 100M bytes of the dblp, dna, and protein files
obtained from the Pizza-Chili corpus². The maximum integer values in each of these
LCP sequences are 1084, 17772, and 35246, with medians 32, 13, and 6, respectively³.
Results are shown in Table 2. On the dblp, dna, and protein LCP arrays, RiceV, RiceW,
and EliasW outperform their competitors in compression, with approximately a
10% gain in space when compared to the next best (PForDelta in the former two, and
DACs in the protein LCP), while a gain of more than 30% is observed against the original
Elias-γ, Elias-δ, and Rice codings. In terms of random-access time, DACs again shows
the best performance in all cases. However, the newly introduced methods also stay
competitive.
The improvements in compression with the schemes EliasH, RiceH, and
RiceV, which are based on representing the unary sections of the codes by R/S dic-
tionaries, are due to the sparseness of these unary code segments, which in
practice include exactly n 1 bits and many more 0 bits than that. Thus, entropy-compressed
R/S dictionary data structures become very beneficial. Besides this space reduction,
since rank and select can be performed in constant time, it becomes possible to access
any arbitrary item in the encoded list with the described methods.
When we compare EliasH with RiceH, we observe that the RiceH coding is more
advantageous both in terms of compactness and access time. Although both schemes
require two select queries in their random access procedures, the bit-address calcu-
lation on the binary code stream depends on the results of these select operations in
EliasH coding, whereas it is independent of them (and simpler) in RiceH coding.
This dependency in EliasH causes an inefficiency in the optimized execution of the
access procedure, which makes RiceH run faster.
² http://pizzachili.dcc.uchile.cl/
³ See Table 1 in [6] for more details.
The difference in the compactness of EliasH and RiceH stems from the fact that,
in general, Rice compresses better than Elias. The lengths of the binary and unary
segments in Rice encoded sequences are shorter than their counterparts in Elias en-
coding, and hence, we observe this inherited property also in their modified versions.
If we compare RiceV with RiceH, we see that RiceV represents sequences with
fewer bits (except for uniformly distributed sequences, where no difference was observed),
but access is much slower. In RiceV the access time is not constant and depends on
the value of the to-be-extracted integer, being O(xi/2^k), which causes the decrease in
speed. However, splitting the unary code partition into smaller partitions in RiceV
increases the sparseness ratio of these sub-bitmaps, which can be better represented
by the compressed R/S dictionaries, and results in better compression ratios. Notice
that there are no differences in the binary sections of RiceV and RiceH.
On the other hand, when we analyze the wavelet-tree-based codings EliasW versus
RiceW, the experiments showed that in almost all cases RiceW performs better in
compression, but EliasW is slightly faster. Notice that instead of a balanced wavelet
tree implementation, we preferred to use a Huffman-like topology to provide faster
access to the items having more frequent labels, which are the minimal code-length
values in EliasW and the quotient values in RiceW. On uniformly distributed data, the
height of the tree in RiceW is much smaller than that of the EliasW tree according to the
best-performing k value selection. Thus, we see that on uniform data RiceW is much
faster than EliasW. When we look at the experiments on normally distributed
integer sequences, we see that EliasW runs faster, since there exist more levels (due to
more distinct labels) in the RiceW tree. On the LCP array sequence data, the access times
are more or less comparable. We observe that the original tendency of Rice coding to
achieve better compression than Elias is also preserved in their wavelet tree enhanced
implementations.
Comparing the R/S dictionary based solutions with the wavelet tree based
proposals, the improvement in compression with respect to the original Elias or Rice
schemes is almost guaranteed in the R/S dictionary based solutions thanks to the sparse-
ness of the unary code segments, which is the case in general. On the other hand,
the compression performance of the wavelet tree solutions, which varies according to the
number of distinct labels, depends highly on the target data.
5 Conclusion
We have investigated the usage of R/S dictionaries and wavelet trees on hybrid
(unary+binary) structure variable-length codes, particularly Elias-γ and Rice. The
proposed variants not only provide efficient random access capability, which is highly
desired in practical applications, but also improve the compression ratios signifi-
cantly. Experiments conducted on uniformly and normally distributed integer se-
quences, as well as on sample LCP arrays, showed that the newly introduced variants
outperform their competitors, including the most recent state-of-the-art codecs, in
terms of compression, and also support efficient random access within constant or
logarithmic time complexities.
The methods introduced in this work, especially the wavelet tree construction
over relabeled integers, can further be extended to other variable-length codes. For
example, the lengths of the codewords in Fibonacci codes may also be used as the
labels on which the wavelet tree is constructed, supporting efficient logarithmic-
time access. Seeking alternative labeling strategies serving purposes
other than the compact integer representation that we exploited in this study
is another line of research that we believe is worth spending time on.
Acknowledgement: The author expresses his gratitude to Susanna Ladra and
Jukka Teuhola for sharing their codes.
6 References
[1] A. Moffat, Compressing integer sequences and sets, chapter Encyclopedia of algorithms,
pp. 178–182, Springer, 2008.
[2] P. Fenwick, Lossless Compression Handbook, chapter 3, pp. 55–78, 2003.
[3] P. Elias, “Efficient storage and retrieval by content and address of static files,” Journal
of the ACM, vol. 21, no. 2, pp. 246–260, 1974.
[4] S. W. Golomb, “Run-length encodings,” IEEE Transactions on Information Theory,
vol. 12, pp. 399–401, 1966.
[5] R. F. Rice, “Some Practical Universal Noiseless Coding Techniques Part III,” JPL
Publication, vol. 83-17, 1983.
[6] N. R. Brisaboa, S. Ladra, and G. Navarro, “DACs: Bringing direct access to variable
length codes,” Information Processing and Management, vol. 49, no. 1, 2013.
[7] H. E. Williams and J. Zobel, “Compressing integers for fast file access,” The Computer
Journal, vol. 42, pp. 193–201, 1999.
[8] J. Barbay, T. Gagie, G. Navarro, and Y. Nekrich, “Alphabet partitioning for com-
pressed rank/select and applications,” in Algorithms and Computation, vol. 6507 of
Lecture Notes in Computer Science, pp. 315–326. 2010.
[9] V. N. Anh and A. Moffat, “Inverted index compression using word-aligned binary
codes,” Information Retrieval, vol. 8, no. 1, pp. 151–166, 2005.
[10] M. Zukowski, S. Heman, N. Nes, and P. Boncz, “Super–scalar RAM-CPU cache com-
pression,” in Proceedings of the 22nd International Conference on Data Engineering
(ICDE), 2006, pp. 59–59.
[11] A. Elmasry, J. Katajainen, and J. Teuhola, “Improved address–calculation coding of
integer arrays,” in SPIRE 2012: String Processing and Information Retrieval, vol.
7608 of Lecture Notes in Computer Science, pp. 205–216. 2012.
[12] D. Okanohara and K. Sadakane, “Practical entropy-compressed rank/select dictio-
nary,” in Proceedings of ALENEX. SIAM, 2007.
[13] G. Jacobson, “Space-efficient static trees and graphs,” in Proceedings of the 30th
Annual Symposium on Foundations of Computer Science, 1989, pp. 549–554.
[14] D. R. Clark and J. I. Munro, “Efficient suffix trees on secondary storage,” in Proceedings
of the ACM-SIAM Symposium on Discrete algorithms, 1996, SODA, pp. 383–391.
[15] R. Raman, V. Raman, and S. S. Rao, “Succinct indexable dictionaries with applications
to encoding k–ary trees and multisets,” in Proceedings of the ACM-SIAM Symposium
on Discrete algorithms (SODA), 2002, pp. 233–242.
[16] R. González, S. Grabowski, V. Mäkinen, and G. Navarro, “Practical implementation
of rank and select queries,” in Poster Proceedings Volume of 4th Workshop on Efficient
and Experimental Algorithms (WEA05)(Greece, 2005), 2005, pp. 27–38.
[17] G. Navarro and E. Providel, “Fast, small, simple rank/select on bitmaps,” in
Experimental Algorithms, vol. 7276 of Lecture Notes in Computer Science, pp. 295–306, 2012.