Conference Paper in Proceedings of the Data Compression Conference, March 2014. DOI: 10.1109/DCC.2014.74
Enhanced Variable-Length Codes:
Improved Compression with Efficient Random Access
M. Oğuzhan Külekci
TÜBİTAK - BİLGEM - UEKAE
Kocaeli, 41470, Turkey
[email protected]
Abstract
We investigate the usage of the wavelet tree and the rank/select–dictionary data struc-
tures on hybrid-structured variable–length codes, which represent an integer in the form of
a unary code section followed by a binary section. We propose to handle unary and binary
partitions as separate streams and create wavelet trees or R/S dictionaries over the unary
streams, which grants us the opportunity to directly access any codeword. Particularly con-
centrating on Elias and Rice schemes, we introduce several solutions that i) improve the
compression significantly, and ii) provide random access in constant or logarithmic time.
Experiments are conducted to compare the performances of the proposed codes against
Elias/Rice schemes and more recent state-of-the-art codings such as Simple9, PForDelta,
DACs, and improved-AC techniques. We observed that the newly introduced methods
outperform the original Elias/Rice codecs by ≈30% and the others by ≈10% in terms of
compression ratios. The methods described in this study are generic and may further be
extended to some other hybrid structure (unary/binary) variable–length codes as well.
1 Introduction
Hybrid usage of unary and binary codes for compact representation of integers has
been the backbone of many variable–length coding schemes [1, 2] such as Elias [3],
Golomb [4], and Rice [5]. A major difficulty of those codes is the lack of efficient
methods to support random access to the ith element, whose codeword’s beginning
bit address is not known on the encoded stream, and thus, one needs to decode all
previous (i − 1) values to reach it.
One alternative to tackle this problem is sampling. By using an additional array
that contains the beginning bit positions of every h-integer block on the encoded
stream, the time complexity of random access becomes O(h): we can begin from the
sampled bit address of the ⌊i/h⌋th block to retrieve the ith integer value, which
requires at most h integer decodings. This strategy defines a time/memory trade-off
between the extra space spent to store the sampled positions and the access time.
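To make the trade-off concrete, the sampling strategy above can be sketched as follows. This is a minimal Python illustration (not the paper's C implementation); the encoded stream is modeled as a character string for clarity, and all function names are ours:

```python
# Illustrative sketch: Elias-gamma coding with block sampling, giving
# O(h) random access at the cost of one stored bit offset per h integers.
# Codeword of x >= 0: floor(log2(x+1)) zeros, then the bits of (x+1)
# whose leading 1 terminates the unary part.

def elias_gamma_encode(values, h):
    bits, samples = [], []
    for i, x in enumerate(values):
        if i % h == 0:
            samples.append(len(bits))      # bit offset of each block start
        b = bin(x + 1)[2:]                 # minimal binary of (x+1)
        bits.extend('0' * (len(b) - 1))    # unary length part
        bits.extend(b)                     # leading 1 ends the unary part
    return ''.join(bits), samples

def access(bits, samples, h, i):
    pos = samples[i // h]                  # jump to the sampled block start
    for _ in range(i % h):                 # decode at most h-1 codewords
        l = bits.index('1', pos) - pos
        pos += 2 * l + 1
    l = bits.index('1', pos) - pos
    return int(bits[pos + l:pos + 2 * l + 1], 2) - 1
```

Larger h means fewer sampled positions but more sequential decoding per access, which is exactly the time/memory trade-off described above.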
In this study, we investigate how compressed data structures may help in variable-
size coding, and particularly focusing on Elias [3] and Rice [5] codes, we present
methods that i) significantly improve the compression ratios, and ii) provide efficient
methods to directly access the ith codeword.
We begin first by investigating the usage of rank/select (R/S) dictionary data
structures on variable-length coding. Previously, Brisaboa et al. [6] introduced DACs
(directly accessible codes) by integrating R/S dictionaries into the Vbyte [7] byte-aligned
coding scheme. Alternatively, we introduce several schemes that consider building
R/S data structures over the concatenated unary partitions of Elias and Rice codings,
such that we can reach the ith codeword in O(1) time by two select queries.
The second dimension we study is the use of wavelet trees. We propose to first label
each integer xi of the input sequence by the length of its unary code section in the
corresponding Elias or Rice coding, which is ⌊log xi⌋ and ⌊xi/2^k⌋, respectively. By
creating a wavelet tree over these labels and storing the corresponding binary-code
sections at the leaf nodes, it becomes possible to extract the ith codeword in log r
steps, where r is the number of distinct labels occurring in the sequence. This idea
is akin to the alphabet partitioning [8] that aims to improve compressed rank/select
over text sequences by reducing the alphabet size by mapping several symbols into
a single one according to their frequencies. We use different labeling strategies other
than [8] with the aim to enhance compression on integer sequences while providing
direct access at the same time.
According to the experiments conducted, both the R/S dictionary and wavelet
tree solutions introduced in this study reduce the space consumptions of original
Elias/Rice schemes by ≈ 30%. On the test data, we also observed ≈ 10% improve-
ment in compression against the more recent compact integer representation tech-
niques [6, 9–11]. Notice that efficient random access is another gain beside improved
compression. Throughout the study, although we mainly work on Elias and Rice
schemes, which are amongst the most widely preferred fixed-to-variable codes in com-
pressed integer representations, the newly introduced methods define a generic idea
that is applicable to other hybrid (unary+binary) variable-length coding schemes.
2 Elias and Rice Codes with R/S Dictionaries
Let X = ⟨x0, x1, x2, . . . , xn−1⟩ be a given sequence of non-negative integers, where
xi ∈ {0, 1, 2, . . . , u − 1} for all 0 ≤ i < n. Each xi > 1 can be represented by ⌊log xi⌋
bits by omitting the leftmost bit that is always set to 1, and each xi ∈ {0, 1} can be
shown by reserving one bit in raw format. Notice that the bit stream formed by
concatenating the minimal binary representations of the xi values is not uniquely
decodable.
Definition: We denote by Υ = Σ_{∀i: xi>1} ⌊log xi⌋ + Σ_{∀i: xi=0 ∨ xi=1} 1 the total
minimal binary representation length of sequence X in bits.
Throughout the paper, we assume a word RAM model with a word length of at
least log u bits, so that we can extract that many bits with a single read operation.
EliasH Coding: The Elias-γ [3] coding of xi consists of 2·⌊log(xi + 1)⌋ + 1
bits: the first ⌊log(xi + 1)⌋ bits, set to 0, encode the actual length in unary,
followed by a 1 bit to indicate the end of the unary part, and the last ⌊log(xi + 1)⌋ bits
contain the minimal binary representation of the (xi + 1) value. In EliasH coding, we store
the unary and binary sections of Elias-γ encoded integers in separate bit-arrays, and
represent the unary bit stream with an entropy-compressed R/S dictionary.
Lemma 2.1 EliasH coding of n non-negative integers approximately requires
n log(Υ/n) + o(Υ + n) + Υ bits of space and supports O(1)-time random access.
[Figure 1: Coding and access to elements on X = ⟨127, 32, 3, 56, 201, . . .⟩ via EliasH, RiceH, and RiceV coding.
a) EliasH. Access(x3): select1(U, 3) = 16; select1(U, 4) = 22; binary-code length = 22 − 16 − 1 = 5; binary-code address = 16 − 3 + 1 = 14; B[14 . . . 18] = 11001; x3 + 1 = (1)11001 = 57; x3 = 56.
b) RiceH (k = 4). Access(x3): select1(U, 3) = 11; select1(U, 4) = 15; quotient = 15 − 11 − 1 = 3; remainder = B[3·4 . . . 4·4 − 1] = B[12 . . . 15] = 1000 = 8; x3 = 2^4 · 3 + 8 = 56.
c) RiceV (k = 4), with the unary sections split level-wise into bitmaps U0, U1, U2, . . . Access(x3): getbit(3, U0) = 0, rank1(3, U0) = 1; getbit(3 − 1, U1) = 0, rank1(2, U1) = 0; getbit(2 − 0, U2) = 0, rank1(2, U2) = 1; getbit(2 − 1, U3) = 1 → quotient = 3; remainder = B[12 . . . 15] = 1000; x3 = 2^4 · 3 + 8 = 56.]
Proof The unary section U includes Υ 0s and n 1s. The zeroth-order entropy of
that sequence is log C(Υ + n, n) ≈ n log(Υ/n) bits. Hence, a compressed representation of U
supporting O(1)-time R/S queries occupies n log(Υ/n) + o(Υ + n) bits [12]. Summing
this value with the binary section B of length Υ gives the total space consumption.
While accessing a random xi in EliasH, we run 2 select queries to obtain the positions
of the ith and (i + 1)th 1 bits on U. The difference between these gives us the code
length of (xi + 1) on B. The bit address of the minimal binary encoded (xi + 1) on
B is then calculated by subtracting i from the result of the first select query, since
the number of bits reserved on B to encode the previous integers is equal to
the number of 0s observed before the ith set bit on U. We move to that location on
B and extract the value of (xi + 1). In total, random access takes exactly 2 select
queries that can be performed as constant-time operations via the appropriate R/S
dictionary data structures. Figure 1–a gives a sketch of EliasH coding.
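The two-select access procedure of Lemma 2.1 can be sketched as follows. This is an illustrative Python model under simplifying assumptions: plain bit lists stand in for the entropy-compressed R/S dictionary, and select1 is a linear scan rather than the O(1)-time structure the lemma assumes; the function names are ours:

```python
# Sketch of the EliasH layout: unary lengths in U, leading-1-stripped
# binary codes in B, random access via two select queries on U.

def eliash_encode(values):
    U, B = [], []
    for x in values:
        b = bin(x + 1)[2:]                  # minimal binary of (x+1)
        U.extend([0] * (len(b) - 1) + [1])  # unary length, 1-terminated
        B.extend(int(c) for c in b[1:])     # drop the leading 1
    return U, B

def select1(U, j):
    """0-based position of the j-th (1-based) set bit of U (linear scan)."""
    seen = 0
    for pos, bit in enumerate(U):
        seen += bit
        if seen == j:
            return pos
    raise IndexError("fewer than j ones")

def eliash_access(U, B, i):                 # 0-based index i
    s_prev = select1(U, i) if i > 0 else -1 # end of the previous codeword
    s_cur = select1(U, i + 1)               # end of x_i's codeword
    length = s_cur - s_prev - 1             # bits of (x_i+1) minus leading 1
    addr = s_prev - i + 1                   # zeros on U before x_i's code
    code = ''.join(map(str, B[addr:addr + length]))
    return int('1' + code, 2) - 1 if length else 0
```

Running this on the sequence of Figure 1 reproduces the worked example: select1(U, 3) = 16, select1(U, 4) = 22, and x3 = 56.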
RiceH Coding: In Rice [5] coding, xi is represented by the unary encoded ⌊xi/2^k⌋
value (quotient) and the binary encoded xi − 2^k·⌊xi/2^k⌋ value (remainder) according
to a predetermined k. Notice that the remainder is of fixed length, k bits long. Fol-
lowing the same idea as in EliasH, we propose in RiceH coding to represent the unary
encoded quotient values via an entropy-compressed R/S dictionary, and keep the
binary encoded remainder values in a separate array.
Lemma 2.2 The RiceH coding of X occupies n log(ω/(n·2^k)) + o(ω/2^k + n) + nk bits of space
and supports O(1)-time random access, where ω = Σ_{∀i} xi.
Proof The length of the unary section is |U| = Σ_{i=0}^{n−1} (⌊xi/2^k⌋ + 1) ≈ ω/2^k + n, where n
bits are set to 1 and the rest to 0. The zero-order entropy of U may be approximated
as log C(ω/2^k + n, n) ≈ n log(ω/(n·2^k)), and thus, the space needed for the entropy-compressed R/S
dictionary is approximately n log(ω/(n·2^k)) + o(ω/2^k + n) bits. The remainder values of all xi
are represented with the bit array B of size n·k bits. In random access to xi,
first the indices of the ith and (i + 1)th 1 bits are detected by running two select
queries on the unary stream, whose difference gives us the quotient value ⌊xi/2^k⌋. By
directly extracting the remainder from the B[k·i . . . k·(i + 1) − 1] bits on the B array,
xi = (quotient · 2^k + remainder) is retrieved. See Figure 1–b for a detailed sketch.
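The RiceH layout can be sketched analogously. As before, this is an illustrative Python model (our names, not the paper's code): plain lists model the unary stream U and the fixed-width remainder array B, and select1 is a linear scan standing in for the O(1)-time dictionary:

```python
# Sketch of RiceH: unary quotients in U, k-bit remainders in B,
# random access via two select queries plus one fixed-width read.

def riceh_encode(values, k):
    U, B = [], []
    for x in values:
        q, r = x >> k, x & ((1 << k) - 1)
        U.extend([0] * q + [1])             # unary quotient, 1-terminated
        B.extend((r >> (k - 1 - j)) & 1 for j in range(k))  # k-bit remainder
    return U, B

def select1(U, j):
    """0-based position of the j-th (1-based) set bit of U (linear scan)."""
    seen = 0
    for pos, bit in enumerate(U):
        seen += bit
        if seen == j:
            return pos
    raise IndexError("fewer than j ones")

def riceh_access(U, B, k, i):
    s_prev = select1(U, i) if i > 0 else -1
    q = select1(U, i + 1) - s_prev - 1      # quotient from two selects
    r = int(''.join(map(str, B[k * i:k * (i + 1)])), 2)
    return (q << k) + r
```

Note that, unlike EliasH, the remainder address k·i is independent of the select results, which is the simplification discussed in Section 4.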
RiceV Coding: In RiceH coding, instead of keeping a single unary bit-stream,
it is possible to represent the unary sections of each codeword in a vertical fashion
by concatenating the first (leftmost) bits of every individual unary codeword in one
bitmap, the second bits in another bitmap, and so on. Obviously, the bit arrays will
be of varying lengths. This amounts to creating the bit-arrays in a vertical
fashion as depicted in Figure 1–c, hence we refer to it as RiceV coding.
Retrieving the xi value in a RiceV encoded sequence begins with extracting the
ith bit of the first bit-stream, which includes the first bits of the unary codes of all
elements in X. A 0 bit here means that the regarding unary section of xi has not yet
finished and continues on the next-level bitmap. In such a case, we perform a rank
query to count how many 1s appear up to position i, which specifies how many of the
integers are not represented in the next bitmap. We decrement i by the result of that
rank query and perform the same operation on the next bitmap, which includes the
second bits of the unary sections of all xi such that ⌊xi/2^k⌋ > 0. The number of times we
need to perform this iteration until detecting the 1 bit marking the end of the unary
section of the target xi is ⌈xi/2^k⌉. Extracting the remainder value on the binary
part of xi is trivial by fetching the ith k-bit block on B. The random access time
complexity of RiceV becomes O(u/2^k) (remember that (u − 1) is the largest integer in
X) as opposed to O(1) in RiceH. However, as can be observed from the experimental
results, RiceV generally achieves better compression than RiceH.
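The vertical layout and its rank-guided descent can be sketched as follows; again a Python illustration under simplifying assumptions (the levels are plain bit lists, and rank1 is a linear scan standing in for an O(1)-time succinct structure):

```python
# Sketch of RiceV: level d holds one bit per integer whose unary section
# reaches depth d; a 1 marks that the quotient equals d. Remainders are
# stored as in RiceH.

def ricev_encode(values, k):
    quotients = [x >> k for x in values]
    levels, alive, d = [], list(range(len(values))), 0
    while alive:
        levels.append([1 if quotients[i] == d else 0 for i in alive])
        alive = [i for i in alive if quotients[i] > d]  # survivors continue
        d += 1
    B = [(x >> (k - 1 - j)) & 1 for x in values for j in range(k)]
    return levels, B

def rank1(bits, pos):                      # ones in bits[0..pos-1]
    return sum(bits[:pos])

def ricev_access(levels, B, k, i):
    d, pos = 0, i
    while levels[d][pos] == 0:             # unary section continues below
        pos -= rank1(levels[d], pos)       # finished items drop out
        d += 1
    r = int(''.join(map(str, B[k * i:k * (i + 1)])), 2)
    return (d << k) + r                    # quotient is the depth reached
```

The number of rank queries grows with the quotient of the accessed value, matching the O(u/2^k) access bound, while the shorter and sparser per-level bitmaps are what the compressed R/S dictionaries exploit.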
3 Elias and Rice Codes with Wavelet Trees
This section describes two methods that integrate wavelet trees into the Elias-γ and Rice
coding schemes. The main idea in both (which is akin to [8]) is to label each integer in
the sequence with its unary-code value, which is ⌊log xi⌋ in Elias-γ and ⌊xi/2^k⌋ in Rice.
The wavelet trees are constructed over the label sequences and include the regarding
binary-code values at their leaves.
EliasW(avelet) Coding: Assume set L = {ℓ0, ℓ1, ℓ2, . . . , ℓr−1} includes the dis-
tinct ⌊log xi⌋ values observed on X. We propose to create a wavelet tree by recursively
splitting X into two according to the corresponding ℓi values. Each leaf node in
the corresponding wavelet tree is devoted to a unique bit-length value ℓ, and
all xi values such that ℓ = ⌊log xi⌋ are stored in the leaf as a bit string composed of
the concatenated minimal binary representations of those xi. Such a wavelet tree will be
of height ⌈log r⌉, and provides access to any item in at most ⌈log r⌉ steps. Figure 2–a
shows the proposed scheme on a sample sequence. Notice that we assume log 0 = 0,
and keep the xi values equal to 0 and 1 in a separate node dedicated to ℓ = 0. While
retrieving integers from that special node, we do not concatenate a 1 to their beginnings.
[Figure 2: The i) EliasW and ii) RiceW (with k = 2) coding of the sample sequence
X = {3, 6, 0, 11, 5, 1, 3, 15, 9, 13}, with EliasW labels ⌊log xi⌋ = {1, 2, 0, 3, 2, 0, 1, 3, 3, 3}
and RiceW labels ⌊xi/2^k⌋ = {0, 1, 0, 2, 1, 0, 0, 3, 2, 3}.]
This wavelet tree can be stored as a single bit stream by concatenating the bitmaps
of the nodes in a predetermined (depth-first or breadth-first) tree traversal. No de-
limiters (or pointers) in between the individual bitmaps are necessary since we can
restore the tree topology along with the lengths of the bitmaps at each node once set
L and n are given beforehand.
Lemma 3.1 Given a sequence X = ⟨x0, x1, . . . , xn−1⟩ of non-negative integers such
that 0 ≤ xi < u, and L = {ℓ0, ℓ1, . . . , ℓr−1} is the set of distinct ⌊log xi⌋ values observed
in X, the EliasW coding of X occupies at most n·⌈log r⌉ + Υ bits of space plus a few
bits to encode the value n and set L for correct decoding.
Proof The wavelet tree generated for coding sequence X includes at most ⌈log r⌉
internal levels including the root. Since the internal nodes at each level contain
exactly n bits in total, we need at most n·⌈log r⌉ bits. The actual
xi values are encoded at the leaves with their minimal binary representations, which
require Υ bits in total. We need to know the value n and set L for proper decoding.
Following the encodings of n and the set size r = |L| in a universal coding scheme,
the items in set L can be represented simply by r·log u bits, which in total does not cause
an increase in space complexity, and can be handled via a few bytes in practice.
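As an illustration of the construction of Lemma 3.1, the following Python sketch builds a balanced wavelet tree over the labels and decodes by rank-guided descent. It works under stated simplifying assumptions: nested Python tuples replace the single concatenated bit stream, rank is a linear scan rather than a succinct O(1)-time structure, and the names are ours:

```python
# Illustrative EliasW sketch: labels are floor(log2 x) (log 0 taken as 0);
# leaves store minimal binary codes with the leading 1 dropped (kept raw
# for the special label-0 leaf holding values 0 and 1).

def label(x):
    return x.bit_length() - 1 if x > 1 else 0

def build(values, labels):
    if len(labels) == 1:                   # leaf: minimal binary codes
        l = labels[0]
        return ('leaf', l, [x - (1 << l) if l else x for x in values])
    mid = len(labels) // 2
    left_set = set(labels[:mid])
    bitmap = [0 if label(x) in left_set else 1 for x in values]
    lvals = [x for x in values if label(x) in left_set]
    rvals = [x for x in values if label(x) not in left_set]
    return ('node', bitmap,
            build(lvals, labels[:mid]), build(rvals, labels[mid:]))

def access(tree, i):
    while tree[0] == 'node':
        _, bitmap, left, right = tree
        b = bitmap[i]
        i = sum(1 for c in bitmap[:i] if c == b)   # rank_b(i) in the bitmap
        tree = left if b == 0 else right
    _, l, codes = tree
    return codes[i] + (1 << l) if l else codes[i]  # re-prepend the leading 1

def eliasw(values):
    labels = sorted({label(x) for x in values})
    return build(values, labels)
```

On the sample sequence of Figure 2, the labels come out as {1, 2, 0, 3, 2, 0, 1, 3, 3, 3}, and every element is recovered after at most ⌈log r⌉ rank steps. RiceW follows the same pattern with ⌊xi/2^k⌋ as the label function.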
Remark When L = {0, 1, 2, . . . , ⌊log(u − 1)⌋}, which means every possible code
length has occurred in X, the space requirement is at most n·⌈log⌈log u⌉⌉ + Υ bits.
RiceW(avelet) Coding: Following the same idea as in EliasW coding, we may
construct set L as the unique ⌊xi/2^k⌋ values observed in X, computed with
a predetermined k, and build the wavelet tree accordingly. Each leaf is dedicated to
a unique quotient value, and the remainder values of the xi having the same quotient
value are concatenated to form the bit stream at each leaf. Figure 2–b shows a sample
encoding of X with this scheme.
Lemma 3.2 The RiceW coding of X occupies at most n·⌈log r⌉ + nk bits of space
plus a few bits to encode the value n and the set of r distinct ⌊xi/2^k⌋ values observed
in X for some parameter k < log u.
Proof The n·⌈log r⌉ bits are spent in the wavelet tree as described in Lemma 3.1.
The leaves require n·k bits in total as the remainder values are all k bits long.
Remark The space usage becomes n·⌈log(⌈log u⌉ − k)⌉ + nk bits assuming all possible
⌊xi/2^k⌋ values appear in L = {0, 1, . . . , ⌊log(u − 1)⌋ − k}, and hence, r = ⌈log u⌉ − k.
Lemma 3.3 In EliasW and RiceW codings, any element of sequence X can be ac-
cessed in O(⌈log r⌉) time by reserving an additional o(n·⌈log r⌉) bits.
Proof The maximum number of internal nodes we need to traverse to reach the leaves
is ⌈log r⌉. At each visited node we access the qth bit bq in constant time and run a
rank(bq, q, B) query on the corresponding bitmap to count the number of bq s up to
position q on B. We know that rank queries on an n-bit binary string can be exe-
cuted in O(1) time using n + o(n) space via succinct data structures [13, 14]. Thus,
augmenting the bitmaps of every node with the succinct data structures to support
constant-time rank queries, the ith item of the input sequence X can be reached in
O(⌈log r⌉) time. This brings an overhead of o(n·⌈log r⌉) bits as we have at most
⌈log r⌉ levels, each of which contains n bits in total.
The rank queries can be performed in O(1) time using compressed space n·H0 + o(n)
bits [15] on binary strings, where H0 represents the 0th-order empirical entropy of the source
bitmap. Thus, the space requirement in accessing EliasW and RiceW codings may
be further improved in practice. Another way to reach compressed space might
be to use Huffman-shaped wavelet trees [16] instead of balanced trees. Notice that
in that case the height of the tree will be greater, but the access time will be shorter for
frequent elements.
4 Implementation and Experimental Results
We have implemented our proposals in C and run benchmarks against the original
Elias and Rice schemes as well as the more recent Simple9 [9], PForDelta [10], DACs
[6], and improved-AC [11] codes, which have reported improved compression ratios and
random-access times. All experiments were conducted on a machine running 64-bit Linux
Mint (nadia) with an Intel Core i7-3770 @ 3.40GHz processor, 16GB of main memory,
and 8192KB cache size. All reported results are the means of one hundred runs.
In the EliasW and RiceW implementations, we split the set L into two such that the
input set of integers is roughly clustered into two halves, instead of splitting L into two
equal-cardinality parts. Although the number of levels then becomes more than ⌈log r⌉,
we provide faster random access to frequent items. Since at each level the number of
0s is roughly equal to the number of 1s, using a compressed bitmap for rank queries does not make
much sense, and thus we keep the bitmaps in raw format and apply the basic rank
structure by using 5% extra space over that raw representation. However, for the EliasH,
RiceH and RiceV schemes we used the compressed-format RSDic library¹.
¹ http://code.google.com/p/rsdic/, an implementation of R/S in compressed space based on [17].
a) Results obtained on uniformly distributed integers:

range        Elias–γ  Elias–δ  EliasW   EliasH   Rice        RiceW       RiceH       RiceV       DAC      Simple9  PforDelta  iAC
[2^0,2^8)    13.319   12.225   8.120    11.366   8.740 [6]   8.001 [8]   8.203 [8]   8.203 [8]   8.001    10.700   8.033      10.63
             (0.448)  (0.420)  (0.052)  (0.192)  (0.372)     (0.008)     (0.128)     (0.052)     (0.008)  (0.048)  (0.312)    (0.068)
[2^8,2^16)   29.314   22.286   16.119   21.964   16.742 [15] 16.001 [16] 16.203 [16] 16.203 [16] 16.001   30.681   16.034     18.5947
             (0.612)  (0.444)  (0.056)  (0.188)  (0.260)     (0.008)     (0.128)     (0.052)     (0.008)  (0.076)  (0.368)    (0.068)
[2^16,2^24)  45.323   31.282   24.119   32.069   24.752 [23] 24.001 [24] 24.203 [24] 24.203 [24] 24.001   32.250   25.449     26.4697
             (0.788)  (0.344)  (0.060)  (0.190)  (0.260)     (0.012)     (0.132)     (0.052)     (0.016)  (0.080)  (7.360)    (0.072)

b) Results obtained on normally distributed integers with mean µ and deviation µ/4:

µ      Elias–γ  Elias–δ  EliasW   EliasH   Rice        RiceW       RiceH       RiceV       DAC      Simple9  PforDelta  iAC
2^8    16.213   14.683   9.127    13.404   9.743 [8]   8.330 [5]   9.808 [8]   8.726 [7]   10.001   10.907   9.034      11.652
       (0.476)  (0.320)  (0.040)  (0.188)  (0.256)     (0.096)     (0.116)     (0.180)     (0.008)  (0.048)  (0.328)    (0.064)
2^10   20.207   16.718   11.128   16.079   11.741 [10] 10.332 [7]  11.805 [10] 10.725 [9]  12.001   16.240   11.035     13.652
       (0.528)  (0.312)  (0.044)  (0.188)  (0.256)     (0.092)     (0.116)     (0.188)     (0.012)  (0.056)  (0.348)    (0.068)
2^12   24.203   18.726   13.128   18.596   13.740 [11] 12.331 [9]  13.804 [12] 12.725 [11] 14.001   16.240   13.035     15.652
       (0.544)  (0.304)  (0.044)  (0.192)  (0.288)     (0.092)     (0.120)     (0.180)     (0.016)  (0.064)  (0.364)    (0.068)
2^14   28.202   20.726   15.128   21.226   15.740 [13] 14.332 [11] 15.804 [14] 14.725 [13] 16.001   26.917   16.033     17.621
       (0.584)  (0.308)  (0.040)  (0.188)  (0.288)     (0.100)     (0.120)     (0.180)     (0.012)  (0.076)  (0.376)    (0.064)
2^16   32.201   24.680   17.128   23.852   17.750 [15] 16.332 [13] 17.804 [16] 16.725 [15] 18.001   32.250   18.100     19.574
       (0.624)  (0.356)  (0.044)  (0.188)  (0.292)     (0.096)     (0.116)     (0.180)     (0.012)  (0.080)  (0.374)    (0.072)
2^18   36.211   26.725   19.128   26.383   19.750 [18] 18.332 [15] 19.804 [18] 17.725 [17] 20.005   32.250   20.035     21.481
       (0.656)  (0.336)  (0.044)  (0.189)  (0.288)     (0.096)     (0.120)     (0.176)     (0.016)  (0.080)  (0.380)    (0.068)
2^20   40.211   28.725   21.128   28.914   21.750 [19] 20.332 [17] 21.804 [20] 20.725 [19] 22.005   32.250   22.101     23.480
       (0.704)  (0.340)  (0.048)  (0.188)  (0.284)     (0.092)     (0.120)     (0.180)     (0.016)  (0.084)  (3.772)    (0.072)
2^22   44.211   30.726   23.128   31.392   23.750 [21] 22.332 [19] 23.804 [22] 22.725 [21] 24.005   32.250   24.413     25.480
       (0.768)  (0.344)  (0.044)  (0.184)  (0.288)     (0.096)     (0.116)     (0.180)     (0.016)  (0.080)  (5.668)    (0.072)

Table 1: Space consumption in bits/integer computed on one million non-negative integers
with a) uniform and b) Gaussian distributions. Random access times in microseconds are
given in parentheses on secondary lines, and the bracketed values in the Rice variants represent
the parameters giving the best compression ratios. Sampling frequency is set to 100 for the Elias-γ,
Elias-δ, Rice, Simple9, and PforDelta schemes.
We first tested the newly introduced schemes on uniformly distributed integers to
observe their performance on small, medium and large integers. The results are shown
in Table 1a, where we see that RiceW showed the best performance along with the DACs,
both in compactness and random access time. Note that on a uniformly distributed integer
set in range [2^i, 2^j), the distribution of the ⌊log xi⌋ values is highly skewed towards
j, which might create a biased scheme, especially for our proposals.
Thus, we performed a second test on normally distributed sets of integers having
mean µ and deviation µ/4, whose results are given in Table 1b. The RiceW scheme seems to
be the unique leader in this case, reflecting approximately 10% enhanced space usage
against the best competitor among the previously known results (PForDelta on small
µ values, and DACs on larger ones). RiceV is consistently the second best throughout the
experiments, and outperforms the others when µ = 2^18. In general, RiceV and RiceW
coded integers roughly occupy 1.3 to 1.6 bits less than the best previously known
scheme. However, in terms of random access time, DACs is the absolute leader.
We ran a third experiment to measure the performance on real-world data,
where we preferred to use the same data set used in [6], namely the longest common
              dblp              dna               protein
              space    time     space    time     space     time
Elias–γ       10.083   –        7.751    –        8.289     –
Elias50–γ     10.683   0.196    8.351    0.200    8.889     0.236
Elias–δ       9.542    –        8.390    –        7.863     –
Elias50–δ     10.142   0.192    8.990    0.196    8.463     0.216
EliasW        7.311    0.044    4.728    0.044    5.932     0.060
EliasH        9.765    0.184    8.273    0.156    8.035     0.144
Rice          (5) 6.919  –      (4) 6.049  –      (7) 9.555   –
Rice50        (5) 7.519  0.116  (4) 6.649  0.112  (7) 10.155  0.100
RiceW         (5) 7.015  0.052  (3) 4.600  0.048  (3) 6.597   0.102
RiceH         (6) 6.836  0.164  (4) 5.413  0.156  (6) 8.458   0.160
RiceV         (4) 5.679  0.304  (3) 4.614  0.376  (6) 8.202   0.372
DACs          7.522    0.024   5.543    0.028    6.579     0.044
Simple950     7.956    0.040   6.254    0.040    8.213     0.044
PForDelta50   6.283    0.800   5.141    0.964    6.732     2.864
improved–AC   9.112    0.068   8.040    0.064    9.110     0.068

Table 2: Experimental results obtained on LCP arrays computed from the first 100M of the
dblp, dna, and protein files of the Pizza–Chili corpus. The superscript 50 represents the sampling
frequency for random access wherever applicable, and the parameters of the Rice codings are
given in parentheses.
prefix (LCP) arrays computed for the first 100M bytes of the dblp, dna, and protein files
obtained from the Pizza-Chili corpus². The maximum integer values in each of these
LCP sequences are 1084, 17772, and 35246, with medians 32, 13, and 6, respectively³.
Results are shown in Table 2. On the dblp, dna, and protein LCP arrays, RiceV, RiceW,
and EliasW outperform their competitors in compression, with approximately a
10% gain in space when compared to the next best (PForDelta in the former two, and
DACs in the protein LCP), while a gain of more than 30% is observed against the original
Elias-γ, Elias-δ, and Rice codings. In terms of random-access time, DACs again shows
the best performance in all cases. However, the newly introduced methods also stay
competitive.
The improvements in compression with the schemes EliasH, RiceH, and
RiceV, which are based on representing the unary sections of the codes by R/S dic-
tionaries, are due to the sparseness of these unary code segments, which in
practice include exactly n 1 bits and many more 0 bits than that. Thus, entropy-compressed
R/S dictionary data structures become very beneficial. Besides this space reduction,
since rank and select can be performed in constant time, it becomes possible to access
any arbitrary item in the encoded list with the described methods.
When we compare EliasH with RiceH, we observe that the RiceH coding is more
advantageous both in terms of compactness and access time. Although both schemes
require two select queries in their random access procedures, the bit-address calcu-
lation on the binary code stream depends on the results of these select operations in
EliasH coding, whereas it is independent of them (and simpler) in RiceH coding.
This dependency in EliasH causes an inefficiency in the optimized execution of the
access procedure, which makes RiceH run faster.
² http://pizzachili.dcc.uchile.cl/
³ See Table 1 in [6] for more details.
The difference in the compactness of EliasH and RiceH stems from the fact that,
in general, Rice compresses better than Elias. The lengths of the binary and unary
segments in Rice encoded sequences are shorter than their counterparts in Elias en-
coding, and hence, we observe this inherited property also in their modified versions.
If we compare RiceV with RiceH, we see that RiceV represents sequences with
fewer bits (except for uniformly distributed sequences, where no difference was observed),
but access is much slower. In RiceV the access time is not constant and depends on
the value of the to-be-extracted integer, being O(xi/2^k), which causes the decrease in
speed. However, splitting the unary code partition into smaller partitions in RiceV
increases the sparseness ratio of these sub-bitmaps, which can be better represented
by the compressed R/S dictionaries, and results in better compression ratios. Notice
that there are no differences in the binary sections of RiceV and RiceH.
On the other hand, when we analyze the wavelet-tree-based codings EliasW versus
RiceW, the experiments showed that in almost all cases RiceW performs better in
compression, but EliasW is slightly faster. Notice that instead of a balanced wavelet
tree implementation, we preferred to use a Huffman-like topology to provide faster
access to the items having more frequent labels, which are the minimal code-length
values in EliasW and the quotient values in RiceW. On uniformly distributed data, the
height of the tree in RiceW is much smaller than that of the EliasW tree according to the
best-performing k value selection. Thus, we see that on uniform data RiceW is much
faster than EliasW. When we look at the experiments on normally distributed
integer sequences, we see that EliasW runs faster, since there exist more levels (due to
more distinct labels) in the RiceW tree. On the LCP array sequence data, the access times
are more or less comparable. We observe that the original tendency of Rice coding to
achieve better compression than Elias is also preserved in their wavelet tree enhanced
implementations.
Comparing the R/S dictionary based solutions with the wavelet tree based
proposals, the improvement in compression with respect to the original Elias or Rice
schemes is almost guaranteed in the R/S dictionary based solutions thanks to the sparse-
ness of the unary code segments, which is the case in general. On the other hand,
the compression performance of the wavelet tree solutions, which varies according to the
number of distinct labels, depends highly on the target data.
5 Conclusion
We have investigated the usage of R/S dictionaries and wavelet trees on hybrid
(unary+binary) structure variable-length codes, particularly Elias-γ and Rice. The
proposed variants not only provide efficient random access capability, which is highly
desired in practical applications, but also improve the compression ratios signifi-
cantly. Experiments conducted on uniformly and normally distributed integer se-
quences, as well as on sample LCP arrays, showed that the newly introduced variants
outperform their competitors, including the most recent state-of-the-art codecs, in
terms of compression, and also support efficient random access within constant or
logarithmic time complexities.
The methods introduced in this work, especially the wavelet tree construction
over relabeled integers, can further be extended to other variable-length codes. For
example, the lengths of the codewords in Fibonacci codes may also be used as the
labels on which the wavelet tree is constructed, supporting efficient logarithmic-
time access. Seeking alternative labeling strategies serving purposes
other than the compact integer representation that we exploited in this study
is another line of research that we believe is worth spending time on.
Acknowledgement: The author expresses his gratitude to Susanna Ladra and
Jukka Teuhola for sharing their codes.
6 References
[1] A. Moffat, Compressing integer sequences and sets, chapter Encyclopedia of algorithms,
pp. 178–182, Springer, 2008.
[2] P. Fenwick, Lossless Compression Handbook, chapter 3, pp. 55–78, 2003.
[3] P. Elias, “Efficient storage and retrieval by content and address of static files,” Journal
of the ACM, vol. 21, no. 2, pp. 246–260, 1974.
[4] S. W. Golomb, “Run-length encodings,” IEEE Transactions on Information Theory,
vol. 12, pp. 399–401, 1966.
[5] R. F. Rice, “Some Practical Universal Noiseless Coding Techniques Part III,” JPL
Publication, vol. 83-17, 1983.
[6] N. R. Brisaboa, S. Ladra, and G. Navarro, “DACs: Bringing direct access to variable
length codes,” Information Processing and Management, vol. 49, no. 1, 2013.
[7] H. E. Williams and J. Zobel, “Compressing integers for fast file access,” The Computer
Journal, vol. 42, pp. 193–201, 1999.
[8] J. Barbay, T. Gagie, G. Navarro, and Y. Nekrich, “Alphabet partitioning for com-
pressed rank/select and applications,” in Algorithms and Computation, vol. 6507 of
Lecture Notes in Computer Science, pp. 315–326. 2010.
[9] V. N. Anh and A. Moffat, “Inverted index compression using word-aligned binary
codes,” Information Retrieval, vol. 8, no. 1, pp. 151–166, 2005.
[10] M. Zukowski, S. Heman, N. Nes, and P. Boncz, “Super–scalar RAM-CPU cache com-
pression,” in Proceedings of the 22nd International Conference on Data Engineering
(ICDE), 2006, pp. 59–59.
[11] A. Elmasry, J. Katajainen, and J. Teuhola, “Improved address–calculation coding of
integer arrays,” in SPIRE 2012: String Processing and Information Retrieval, vol.
7608 of Lecture Notes in Computer Science, pp. 205–216. 2012.
[12] D. Okanohara and K. Sadakane, “Practical entropy-compressed rank/select dictio-
nary,” in Proceedings of ALENEX. SIAM, 2007.
[13] G. Jacobson, “Space-efficient static trees and graphs,” in Proceedings of the 30th
Annual Symposium on Foundations of Computer Science, 1989, pp. 549–554.
[14] D. R. Clark and J. I. Munro, “Efficient suffix trees on secondary storage,” in Proceedings
of the ACM-SIAM Symposium on Discrete algorithms, 1996, SODA, pp. 383–391.
[15] R. Raman, V. Raman, and S. S. Rao, “Succinct indexable dictionaries with applications
to encoding k–ary trees and multisets,” in Proceedings of the ACM-SIAM Symposium
on Discrete algorithms (SODA), 2002, pp. 233–242.
[16] R. González, S. Grabowski, V. Mäkinen, and G. Navarro, “Practical implementation
of rank and select queries,” in Poster Proceedings Volume of 4th Workshop on Efficient
and Experimental Algorithms (WEA05)(Greece, 2005), 2005, pp. 27–38.
[17] G. Navarro and E. Providel, “Fast, small, simple rank/select on bitmaps,” in
Experimental Algorithms, vol. 7276 of Lecture Notes in Computer Science, pp. 295–306, 2012.