Gamma
1. Motivation
Projects 4 and 5 aimed to build a real search engine; however, inefficiencies both in
retrieving the documents relevant to the queried phrase and in the time required to answer the
query kept it far from practical. We noticed that these two problems originate from the same
source: in order to make the query response time reasonable, we ignored words with
corpus frequency higher than 50,000, and this hurt the retrieval of relevant documents by
discarding many words that should not normally be ignored, because they are key words for
performing a finer search. For instance, in Project 5 the phrase "search engine will retrieve"
was equivalent to "retrieve", even though "search" and "engine" essentially characterize the
phrase and should not be thrown away. Unfortunately, when we tried to increase the threshold we
were faced with unacceptably long response times. For instance, for the phrase "search engine
will retrieve", our server took (ms) to respond even if we threw away the word "will".
We realized that almost all of the query response time is spent reading the posting lists
for the query terms from disk, and that this time grows with the corpus frequency of a term,
leading to very poor performance. We therefore thought that an elegant encoding might reduce
the length of the posting lists and hence their loading time. The method we came up with is
called gamma coding.
2. Gamma Coding
Gamma coding minimizes the space needed to store numeric data in a file by minimizing the
"wasted" space, adapting the length of the code at the finer-grained level of individual bits.
When numeric data is stored in a predefined format (e.g. byte, integer, etc.), each number takes
a predefined amount of space (e.g. 1 byte for a byte, 4 bytes for an integer, etc.). A lot of
space is therefore "wasted", simply because most of the numbers being stored require much less
space. Gamma encoding enables us to store an integer $n$ using exactly $2\lfloor\log_2 n\rfloor + 1$
bits. This saves a huge amount of space when the numbers being stored are small.
Note that since everything is stored as a sequence of 0's and 1's, one cannot simply shift the
numbers in the sequence to fill up the "wasted" space, because that would lose track of the
starting and ending points of the numbers.
In order to encode a number with gamma coding, two values are stored consecutively: length and
offset. Length is the bit-size of the number, and offset is the difference between the number
and the largest power of two that is smaller than or equal to the number. More precisely, for an
integer $k$, the length value in the gamma encoding of $k$ is $\lfloor\log_2 k\rfloor$ and the
offset value is $k - 2^{\lfloor\log_2 k\rfloor}$. To keep track of the starting and ending points
of the numbers, the length is stored in unary followed by a 0, and the offset is stored in binary
using exactly length bits. For example, the length value of 13 is 3, stored in unary followed by
a 0 as "1110", and the offset value of 13 is 5 (13 - 8 = 5), represented in binary as "101".
Table 1 shows some examples.

Table 1: Examples of gamma codes
number | length (unary) | offset (binary) | gamma code
1      | 0              | (empty)         | 0
2      | 10             | 0               | 100
4      | 110            | 00              | 11000
13     | 1110           | 101             | 1110101
24     | 11110          | 1000            | 111101000

Decoding is much simpler: after length and offset are read, the number $n$ is reconstructed by
computing $n = 2^{\text{length}} + \text{offset}$ [2].
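To make these definitions concrete, here is a minimal sketch in Java (the class and method names
are our own for illustration, not part of our actual program) that encodes a positive integer
into its gamma code as a bit string and decodes it back:

    public class GammaCode {
        // Encode a positive integer as its gamma code, returned as a bit string.
        public static String encode(int k) {
            if (k < 1) throw new IllegalArgumentException("gamma encodes positive integers only");
            int length = 31 - Integer.numberOfLeadingZeros(k);   // floor(log2 k)
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < length; i++) sb.append('1');     // length in unary...
            sb.append('0');                                      // ...terminated by a 0
            for (int i = length - 1; i >= 0; i--)                // offset = k - 2^length,
                sb.append((k >> i) & 1);                         // in exactly `length` bits
            return sb.toString();
        }

        // Decode a single gamma code from the start of a bit string.
        public static int decode(String bits) {
            int length = 0;
            while (bits.charAt(length) == '1') length++;         // read the unary length
            int offset = 0;
            for (int i = 0; i < length; i++)                     // read `length` offset bits
                offset = (offset << 1) | (bits.charAt(length + 1 + i) - '0');
            return (1 << length) + offset;                       // n = 2^length + offset
        }

        public static void main(String[] args) {
            System.out.println(encode(13));          // prints 1110101
            System.out.println(decode("1110101"));   // prints 13
        }
    }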
3. Implementation
In order to compress our posting lists, we first eliminated unnecessary characters (e.g. "[", "]",
",", etc.) from our posting file. Then, for each posting list, we stored the first docId in the
list as the base, and for the rest of the docIds we stored only the increments between consecutive
docIds rather than the docIds themselves. Since the docIds in the posting lists are stored in
sorted order, the numbers we store this way are much smaller than the docIds. Finally, to store
all the numbers in the posting lists, including the base docId and the increments, we used the
gamma coding technique.
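As a small illustration of this step (with names of our own choosing; it assumes the docIds are
strictly increasing, so every stored increment is at least 1 and therefore gamma-encodable):

    import java.util.Arrays;

    public class Gaps {
        // Replace a sorted list of docIds by the base docId followed by
        // the increments between consecutive docIds.
        public static int[] toGaps(int[] sortedDocIds) {
            int[] gaps = new int[sortedDocIds.length];
            gaps[0] = sortedDocIds[0];   // base docId
            for (int i = 1; i < sortedDocIds.length; i++)
                gaps[i] = sortedDocIds[i] - sortedDocIds[i - 1];
            return gaps;
        }

        public static void main(String[] args) {
            // {824, 829, 215406} becomes {824, 5, 214577}: most numbers get small
            System.out.println(Arrays.toString(toGaps(new int[]{824, 829, 215406})));
        }
    }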
To compress the posting lists using gamma encoding, we compute the gamma encoding of each
number and store it in a buffer (e.g. an int[] or byte[]); when the encoding of the next number
is added to the buffer, we use bitwise shifts and bitwise ORs to use up the space that would
otherwise be wasted in the array. Finally, we write the buffer to the file whenever the total
length of the encoded numbers reaches a multiple of 32.
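The following is a minimal sketch of this bit packing in Java, under simplifying assumptions of
our own: the class is hypothetical, each code is at most 32 bits long, and the final partially
filled word is zero-padded when flushed:

    import java.io.DataOutputStream;
    import java.io.IOException;

    public class BitPacker {
        private final DataOutputStream out;
        private long buffer = 0;   // pending bits, kept in the low `used` bits
        private int used = 0;      // number of pending bits

        public BitPacker(DataOutputStream out) { this.out = out; }

        // Append the low `nbits` bits of `code` after the bits already buffered.
        public void write(int code, int nbits) throws IOException {
            buffer = (buffer << nbits) | (code & ((1L << nbits) - 1));  // shift + OR packs codes back to back
            used += nbits;
            while (used >= 32) {                         // a full 32-bit word is ready
                used -= 32;
                out.writeInt((int) (buffer >>> used));   // emit the oldest 32 bits
            }
        }

        // Zero-pad the remaining bits and write the last word.
        public void flush() throws IOException {
            if (used > 0) out.writeInt((int) (buffer << (32 - used)));
            used = 0;
            out.flush();
        }
    }

For example, after gamma-encoding 13 as "1110101", one would call write(0b1110101, 7);
successive codes are packed back to back, and a 32-bit word is written out whenever it fills up.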
With this technique, we were able to compress our 3.31 GB posting file to 399 MB, which gives a
compression ratio of about 8.2.
We note that we performed the above process after our posting lists had already been created;
however, it could also be done in MapReduce while the posting lists are being created.
As we mentioned, the decompression part is much easier and requires only computing
$n = 2^{\text{length}} + \text{offset}$ after the length and offset have been read. Besides this,
we also need to compute the actual docIds from the base docId and the increments. Overall, the
time complexity of this process is O(n), and since the computations are done in main memory, it
is much faster than reading the longer, uncompressed posting lists from disk.
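A matching decoder sketch (again with hypothetical names, and assuming the number of postings in
each list is known, e.g. stored alongside the list) reads the gamma codes bit by bit and rebuilds
the actual docIds with a running sum:

    import java.io.DataInputStream;
    import java.io.IOException;

    public class GammaReader {
        private final DataInputStream in;
        private int word = 0;   // current 32-bit word being consumed
        private int left = 0;   // unread bits remaining in `word`

        public GammaReader(DataInputStream in) { this.in = in; }

        private int readBit() throws IOException {
            if (left == 0) { word = in.readInt(); left = 32; }
            left--;
            return (word >>> left) & 1;   // consume bits from the most significant end
        }

        // Read one gamma code: unary length, a terminating 0, then `length` offset bits.
        public int readGamma() throws IOException {
            int length = 0;
            while (readBit() == 1) length++;
            int offset = 0;
            for (int i = 0; i < length; i++) offset = (offset << 1) | readBit();
            return (1 << length) + offset;   // n = 2^length + offset
        }

        // Decode `count` numbers and rebuild the docIds: the base comes first,
        // then each increment is added to a running sum.
        public int[] readPostingList(int count) throws IOException {
            int[] docIds = new int[count];
            int current = 0;
            for (int i = 0; i < count; i++) {
                current += readGamma();
                docIds[i] = current;
            }
            return docIds;
        }
    }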
4. Comparison
With this enhancement, we were able to reduce the response time drastically and to answer almost
all queries instantly. Figure 1 shows how much this method speeds up the program. The x-axis is
the average corpus frequency for certain groups of terms, and the y-axis is the retrieval time in
milliseconds. The red curve shows the retrieval time when no compression is used, and the blue
curve shows the retrieval time when gamma compression is used.
To produce this figure, we first computed the response time for all 6 million terms in our index
file, both with and without the gamma compression technique. However, since we are only
interested in the relation between retrieval time and corpus frequency, we took the 13,000
distinct corpus frequencies that occur in our index and bucketized them into 50 groups: all the
terms (out of the 6 million) whose corpus frequency is among the 260 lowest values went into the
first bucket, all the terms whose corpus frequency is among the next 260 lowest values went into
the second bucket, and so forth. Finally, the average retrieval time (with and without gamma
compression) was computed over all the terms in a given bucket, as sketched below. For
completeness, we also show what happens in the worst case: Figure 2 shows the change in retrieval
time for the terms that fall into the high end of the corpus-frequency spectrum in our data set.
Both of these graphs are also plotted by our program.
Figure 1: Retrieval Time vs. Corpus Frequency
Figure 2: Retrieval time for some sample terms that have high corpus frequency
5. Conclusion
Using the gamma encoding technique and storing only the increments of the docIds in each sorted
posting list, we were able to decrease the size of our posting lists, which led to a posting file
8 times smaller. Loading the compressed posting lists from disk to main memory also requires less
time, and this decreased the response time of our program drastically. In fact, the response time
becomes so low that we are able to remove the threshold for ignoring common words and take more
of the words in the query phrase into account, which leads to the retrieval of more relevant
documents.
6. References
1- http://en.wikipedia.org/wiki/Elias_gamma_coding
2- http://nlp.stanford.edu/IR-book/html/htmledition/gamma-codes-1.html