Introduction to Information Retrieval
Hinrich Schütze and Christina Lioma
Lecture 5: Index Compression
Overview
❶ Recap
❷ Compression
❸ Term statistics
❹ Dictionary compression
❺ Postings compression
Blocked Sort-Based Indexing
Single-pass in-memory indexing
Abbreviation: SPIMI
Key idea 1: Generate separate dictionaries for each block – no
need to maintain term-termID mapping across blocks.
Key idea 2: Don’t sort. Accumulate postings in postings lists as
they occur.
With these two ideas we can generate a complete inverted
index for each block.
These separate indexes can then be merged into one big index.
SPIMI-Invert
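A minimal Python sketch of the SPIMI-Invert idea just described, assuming the input is a stream of (term, docID) pairs; write_block_to_disk is a hypothetical helper that writes one block index to disk:

    # SPIMI-Invert sketch: one dictionary per block, postings accumulated
    # as they occur (no sorting of postings, no global term-termID map).
    from collections import defaultdict

    def spimi_invert(token_stream, block_size):
        index = defaultdict(list)              # term -> postings list
        for n, (term, doc_id) in enumerate(token_stream, start=1):
            index[term].append(doc_id)         # just accumulate; don't sort
            if n % block_size == 0:            # memory "full": flush block
                write_block_to_disk(sorted(index.items()))  # hypothetical helper
                index.clear()
        if index:                              # flush the final partial block
            write_block_to_disk(sorted(index.items()))

The per-block indexes written this way are then merged into the final index.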
MapReduce for index construction
Dynamic indexing: Simplest approach
Maintain big main index on disk
New docs go into small auxiliary index in memory.
Search across both, merge results
Periodically, merge auxiliary index into big index
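A minimal sketch of this scheme, assuming both indexes are in-memory dicts mapping a term to a sorted list of docIDs (names are illustrative):

    # Search across both indexes and merge the results.
    def search(term, main_index, aux_index):
        return sorted(set(main_index.get(term, [])) |
                      set(aux_index.get(term, [])))

    # Periodic merge: fold the auxiliary index into the main index.
    def merge_aux_into_main(main_index, aux_index):
        for term, postings in aux_index.items():
            main_index[term] = sorted(set(main_index.get(term, [])) |
                                      set(postings))
        aux_index.clear()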
Roadmap
Today: index compression
Next 2 weeks: perspective of the user: how can we give the
user relevant results, how can we measure relevance, what
types of user interactions are effective?
After Pentecost: statistical classification and clustering in
information retrieval
Last 3 weeks: web information retrieval
Take-away today
Motivation for compression in information retrieval systems
How can we compress the dictionary component of the
inverted index?
How can we compress the postings component of the inverted
index?
Term statistics: how are terms distributed in document
collections?
Outline
❶ Recap
❷ Compression
❸ Term statistics
❹ Dictionary compression
❺ Postings compression
Why compression? (in general)
Use less disk space (saves money)
Keep more stuff in memory (increases speed)
Increase speed of transferring data from disk to memory
(again, increases speed)
[read compressed data and decompress in memory] is faster than [read uncompressed data]
Premise: Decompression algorithms are fast.
This is true of the decompression algorithms we will use.
Why compression in information retrieval?
First, we will consider space for dictionary
Main motivation for dictionary compression: make it small
enough to keep in main memory
Then for the postings file
Motivation: reduce disk space needed, decrease time needed to
read from disk
Note: Large search engines keep a significant part of the postings in memory.
We will devise various compression schemes for dictionary and
postings.
Lossy vs. lossless compression
Lossy compression: Discard some information
Several of the preprocessing steps we frequently use can be
viewed as lossy compression:
case folding, stop word removal, Porter stemming, number elimination
Lossless compression: All information is preserved.
What we mostly do in index compression
Outline
❶ Recap
❷ Compression
❸ Term statistics
❹ Dictionary compression
❺ Postings compression
Model collection: The Reuters collection
symbol   statistic                                        value
N        documents                                        800,000
L        avg. # tokens per document                       200
M        word types                                       400,000
         avg. # bytes per token (incl. spaces/punct.)     6
         avg. # bytes per token (without spaces/punct.)   4.5
         avg. # bytes per term (= word type)              7.5
T        non-positional postings                          100,000,000
Effect of preprocessing for Reuters
How big is the term vocabulary?
That is, how many distinct words are there?
Can we assume there is an upper bound?
Not really: At least 70^20 ≈ 10^37 different words of length 20.
The vocabulary will keep growing with collection size.
Heaps’ law: M = kT^b
M is the size of the vocabulary, T is the number of tokens in the collection.
Typical values for the parameters k and b are: 30 ≤ k ≤ 100 and b ≈ 0.5.
Heaps’ law is linear in log-log space.
It is the simplest possible relationship between collection size and vocabulary size in log-log
space.
Empirical law
Heaps’ law for Reuters
Vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1.
For these data, the dashed line log10 M = 0.49 · log10 T + 1.64 is the best least-squares fit.
Thus M = 10^1.64 · T^0.49, with k = 10^1.64 ≈ 44 and b = 0.49.
Empirical fit for Reuters
Good, as we just saw in the graph.
Example: for the first 1,000,020 tokens Heaps’ law predicts
38,323 terms:
44 × 1,000,020^0.49 ≈ 38,323
The actual number is 38,365 terms, very close to the prediction.
Empirical observation: fit is good in general.
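The prediction is easy to reproduce, e.g. in Python, using the fitted parameters from the previous slide:

    # Heaps' law M = k * T^b for Reuters-RCV1 (k = 44, b = 0.49)
    k, b = 44, 0.49
    print(round(k * 1_000_020 ** b))   # ≈ 38,323; the actual count is 38,365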
Exercise
❶What is the effect of including spelling errors vs.
automatically correcting spelling errors on Heaps’ law?
❷Compute vocabulary size M
Looking at a collection of web pages, you find that there are
3000 different terms in the first 10,000 tokens and 30,000
different terms in the first 1,000,000 tokens.
Assume a search engine indexes a total of 20,000,000,000
(2 × 10^10) pages, containing 200 tokens on average
What is the size of the vocabulary of the indexed collection as predicted by Heaps’ law?
Zipf’s law
Now we have characterized the growth of the vocabulary in
collections.
We also want to know how many frequent vs. infrequent
terms we should expect in a collection.
In natural language, there are a few very frequent terms and
very many very rare terms.
Zipf’s law: The ith most frequent term has frequency cf_i proportional to 1/i.
cf_i is the collection frequency: the number of occurrences of the term t_i in the collection.
Zipf’s law
Zipf’s law: The ith most frequent term has frequency cf_i proportional to 1/i.
cf_i is the collection frequency: the number of occurrences of the term in the collection.
So if the most frequent term (the) occurs cf_1 times, then the second most frequent term (of) has half as many occurrences . . .
. . . and the third most frequent term (and) has a third as many occurrences.
Equivalent: cf_i = c · i^k and log cf_i = log c + k log i (for k = −1)
Example of a power law
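A tiny numeric illustration, assuming (hypothetically) that the most frequent term occurs 1,000,000 times:

    # Zipf's law with k = -1: cf_i = c / i
    c = 1_000_000                        # assumed collection frequency at rank 1
    for i, term in enumerate(["the", "of", "and"], start=1):
        print(f"{term}: {c / i:.0f}")    # the: 1000000, of: 500000, and: 333333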
Zipf’s law for Reuters
Fit is not great. What is important is the key insight: few frequent terms, many rare terms.
Outline
❶ Recap
❷ Compression
❸ Term statistics
❹ Dictionary compression
❺ Postings compression
Dictionary compression
The dictionary is small compared to the postings file.
But we want to keep it in memory.
Also: memory competition with other applications, devices such as cell phones and onboard computers, and fast startup time.
So compressing the dictionary is important.
Recall: Dictionary as array of fixed-width entries
Space needed: 20 bytes per term, 4 bytes per document frequency, 4 bytes per pointer to the postings list.
For Reuters: (20 + 4 + 4) × 400,000 = 11.2 MB
Fixed-width entries are bad.
Most of the bytes in the term column are wasted.
We allot 20 bytes for terms of length 1.
We can’t handle HYDROCHLOROFLUOROCARBONS and
SUPERCALIFRAGILISTICEXPIALIDOCIOUS
Average length of a term in English: 8 characters
How can we use on average 8 characters per term?
Dictionary as a string
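Here all terms are concatenated into one long string with per-term pointers into it. A sketch of that structure with illustrative sample terms; a term ends where the next one begins:

    # Dictionary as a string: store one offset per term into the big string.
    terms = ["systile", "syzygetic", "syzygial", "syzygy"]   # illustrative
    string = "".join(terms)
    offsets, pos = [], 0
    for t in terms:
        offsets.append(pos)              # the 3-byte pointer in the real layout
        pos += len(t)

    def term_at(i):
        # the term ends where the next term's offset begins
        end = offsets[i + 1] if i + 1 < len(offsets) else len(string)
        return string[offsets[i]:end]

    print(term_at(1))   # syzygetic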
Space for dictionary as a string
4 bytes per term for frequency
4 bytes per term for pointer to postings list
8 bytes (on average) for term in string
3 bytes per pointer into string (need log2(8 · 400,000) ≈ 21.6 < 24 bits to resolve 8 · 400,000 = 3,200,000 positions)
Space: 400,000 × (4 + 4 + 3 + 8) = 7.6 MB (compared to 11.2 MB for the fixed-width array)
Dictionary as a string with blocking
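A sketch of the blocked layout, assuming one length byte per term; only one string pointer per block of k terms is kept, and lookup scans forward within a block:

    # Blocked storage: each term is stored as <length byte><characters>.
    def pack_block(terms):
        out = bytearray()
        for t in terms:
            out.append(len(t))                   # 1 length byte per term
            out += t.encode("ascii")
        return bytes(out)

    def unpack_block(buf):
        terms, i = [], 0
        while i < len(buf):
            n = buf[i]                           # read the length byte
            terms.append(buf[i + 1:i + 1 + n].decode("ascii"))
            i += 1 + n                           # skip to the next term
        return terms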
Space for dictionary as a string with blocking
Example block size k = 4
Where we used 4 × 3 bytes for term pointers without
blocking . . .
. . . we now use 3 bytes for one pointer plus 4 bytes (one per term) for indicating the length of each term.
We save 12 − (3 + 4) = 5 bytes per block.
Total savings: (400,000/4) × 5 bytes = 0.5 MB
This reduces the size of the dictionary from 7.6 MB to 7.1
MB.
Lookup of a term without blocking
Lookup of a term with blocking: (slightly) slower
Front coding
One block in blocked compression (k = 4) . . .
8 a u t o m a t a 8 a u t o m a t e 9 a u t o m a t i c 10 a u t o m a t i o n
⇓
. . . further compressed with front coding.
8automat∗a1⋄e2⋄ic3⋄ion
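A sketch of front coding for one block; the '⋄' separator is written as '|' here to stay in ASCII:

    import os

    def front_code_block(terms):
        # The shared prefix is written once, followed by '*' and the first
        # term's remainder; each later term contributes <suffix length>|<suffix>.
        prefix = os.path.commonprefix(terms)
        out = f"{len(terms[0])}{prefix}*{terms[0][len(prefix):]}"
        for t in terms[1:]:
            suffix = t[len(prefix):]
            out += f"{len(suffix)}|{suffix}"
        return out

    print(front_code_block(["automata", "automate", "automatic", "automation"]))
    # -> 8automat*a1|e2|ic3|ion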
Dictionary compression for Reuters: Summary
data structure                            size in MB
dictionary, fixed-width                         11.2
dictionary, term pointers into string            7.6
∼, with blocking, k = 4                          7.1
∼, with blocking & front coding                  5.9
Exercise
Which prefixes should be used for front coding? What are
the tradeoffs?
Input: list of terms (= the term vocabulary)
Output: list of prefixes that will be used in front coding
Outline
❶ Recap
❷ Compression
❸ Term statistics
❹ Dictionary compression
❺ Postings compression
Postings compression
The postings file is much larger than the dictionary, by a factor of at least 10.
Key desideratum: store each posting compactly
A posting for our purposes is a docID.
For Reuters (800,000 documents), we would use 32 bits per
docID when using 4-byte integers.
Alternatively, we can use log2 800,000 ≈ 19.6 < 20 bits per docID.
Our goal: use a lot less than 20 bits per docID.
Key idea: Store gaps instead of docIDs
Each postings list is ordered in increasing order of docID.
Example postings list: COMPUTER: 283154, 283159, 283202, . . .
It suffices to store gaps: 283159 − 283154 = 5, 283202 − 283159 = 43
Example postings list using gaps: COMPUTER: 283154, 5, 43, . . .
Gaps for frequent terms are small.
Thus: We can encode small gaps with fewer than 20 bits.
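A sketch of the gap transformation; as on this slide, the first docID is stored as-is:

    # Delta-encode a sorted postings list into gaps, and decode it back.
    def to_gaps(docids):
        gaps, prev = [], 0
        for d in docids:
            gaps.append(d - prev)
            prev = d
        return gaps

    def from_gaps(gaps):
        docids, total = [], 0
        for g in gaps:
            total += g
            docids.append(total)
        return docids

    print(to_gaps([283154, 283159, 283202]))   # [283154, 5, 43]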
Gap encoding
Variable length encoding
Aim:
For ARACHNOCENTRIC and other rare terms, we will use about
20 bits per gap (= posting).
For THE and other very frequent terms, we will use only a
few bits per gap (= posting).
In order to implement this, we need to devise some form
of variable length encoding.
Variable length encoding uses few bits for small gaps and
many bits for large gaps.
Variable byte (VB) code
Used by many commercial/research systems
Good low-tech blend of variable-length coding and sensitivity to alignment (in contrast to bit-level codes, see later).
Dedicate 1 bit (high bit) to be a continuation bit c.
If the gap G fits within 7 bits, binary-encode it in the 7
available bits and set c = 1.
Else: encode lower-order 7 bits and then use one or more
additional bytes to encode the higher order bits using the
same algorithm.
At the end set the continuation bit of the last byte to 1
(c = 1) and of the other bytes to 0 (c = 0).
VB code examples
docIDs    824                 829         215406
gaps                          5           214577
VB code   00000110 10111000   10000101    00001101 00001100 10110001
VB code encoding algorithm
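A Python sketch of the scheme just described: 7 payload bits per byte, continuation bit set to 1 on the last byte of each number:

    def vb_encode_number(n):
        chunks = []                    # 7-bit chunks, most significant first
        while True:
            chunks.insert(0, n % 128)
            if n < 128:
                break
            n //= 128
        chunks[-1] += 128              # set continuation bit on the final byte
        return bytes(chunks)

    def vb_encode(gaps):
        return b"".join(vb_encode_number(g) for g in gaps)

    print(vb_encode([824, 5]).hex())   # 06b885 = 00000110 10111000 10000101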
VB code decoding algorithm
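The matching decoder accumulates 7-bit chunks until a byte with the continuation bit set ends the current number. A sketch:

    def vb_decode(bytestream):
        numbers, n = [], 0
        for b in bytestream:
            if b < 128:                          # c = 0: more bytes follow
                n = 128 * n + b
            else:                                # c = 1: last byte of this number
                numbers.append(128 * n + (b - 128))
                n = 0
        return numbers

    print(vb_decode(bytes([0b00000110, 0b10111000, 0b10000101])))   # [824, 5]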
Other variable codes
Instead of bytes, we can also use a different “unit of alignment”: 32 bits (words), 16 bits, 4 bits (nibbles), etc.
Variable byte alignment wastes space if you have many
small gaps – nibbles do better on those.
Recent work on word-aligned codes that efficiently “pack”
a variable number of gaps into one word – see resources at
the end
Gamma codes for gap encoding
You can get even more compression with another type of variable-length encoding: bit-level codes.
Gamma code is the best known of these.
First, we need unary code to be able to introduce gamma
code.
Unary code
Represent n as n 1s with a final 0.
Unary code for 3 is 1110
Unary code for 40 is
11111111111111111111111111111111111111110
Unary code for 70 is:
11111111111111111111111111111111111111111111111111111111111111111111110
Gamma code
Represent a gap G as a pair of length and offset.
Offset is the gap in binary, with the leading bit chopped off.
For example 13 → 1101 → 101 = offset
Length is the length of offset.
For 13 (offset 101), this is 3.
Encode length in unary code: 1110.
Gamma code of 13 is the concatenation of length and offset:
1110101.
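A sketch of the gamma encoder, returning the code as a bit string for readability (gaps are positive, so G ≥ 1):

    def unary(n):
        return "1" * n + "0"

    def gamma_encode(g):
        offset = bin(g)[3:]                  # binary of g with the leading 1 chopped off
        return unary(len(offset)) + offset   # length in unary, then offset

    print(gamma_encode(13))   # 1110101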
Gamma code examples
Exercise
Compute the variable byte code of 130
Compute the gamma code of 130
Length of gamma code
The length of offset is ⌊log2 G⌋ bits.
The length of length is ⌊log2 G⌋ + 1 bits,
So the length of the entire code is 2 × ⌊log2 G⌋ + 1 bits.
γ codes are always of odd length.
Gamma codes are within a factor of 2 of the optimal encoding length log2 G.
(assuming the frequency of a gap G is proportional to log2 G – not really true)
Gamma code: Properties
Gamma code is prefix-free: a valid code word is not a prefix
of any other valid code.
Encoding is optimal within a factor of 3 (and within a factor
of 2 making additional assumptions).
This result is independent of the distribution of gaps!
We can use gamma codes for any distribution. Gamma code
is universal.
Gamma code is parameter-free.
Gamma codes: Alignment
Machines have word boundaries – 8, 16, 32 bits
Compressing and manipulating at granularity of bits can be
slow.
Variable byte encoding is aligned and thus potentially more
efficient.
Regardless of efficiency, variable byte is conceptually simpler
at little additional space cost.
Compression of Reuters
data structure                            size in MB
dictionary, fixed-width                         11.2
dictionary, term pointers into string            7.6
∼, with blocking, k = 4                          7.1
∼, with blocking & front coding                  5.9
collection (text, xml markup etc)             3600.0
collection (text)                              960.0
T/D incidence matrix                        40,000.0
postings, uncompressed (32-bit words)          400.0
postings, uncompressed (20 bits)               250.0
postings, variable byte encoded                116.0
postings, γ encoded                            101.0
Term-document incidence matrix
Entry is 1 if the term occurs. Example: CALPURNIA occurs in Julius Caesar.
Entry is 0 if the term doesn’t occur. Example: CALPURNIA doesn’t occur in The Tempest.
Summary
We can now create an index for highly efficient Boolean
retrieval that is very space efficient.
Only 10-15% of the total size of the text in the collection.
However, we’ve ignored positional and frequency
information.
For this reason, space savings are less in reality.
Take-away today
Motivation for compression in information retrieval systems
How can we compress the dictionary component of the
inverted index?
How can we compress the postings component of the
inverted index?
Term statistics: how are terms distributed in document
collections?
Resources
Chapter 5 of IIR
Resources at http://ifnlp.org/ir
Original publication on word-aligned binary codes by Anh and
Moffat (2005); also: Anh and Moffat (2006a)
Original publication on variable byte codes by Scholer,
Williams, Yiannis and Zobel (2002)
More details on compression (including compression of
positions and frequencies) in Zobel and Moffat (2006)