Week 4 - Information Retrieval Indexing
Information Retrieval- 1
Architecture of Text Retrieval Systems
[Figure: the user interface passes the user need as text to the text operations (1. feature extraction); searching is then performed against the index]
Term Search
Problem: text retrieval algorithms need to find words in
documents efficiently
– Boolean, probabilistic, and vector space retrieval
– Given an index term ki, find the documents dj containing it
1.3.1 Inverted Files
An inverted file is a word-oriented mechanism for indexing a
text collection in order to speed up the term search task
– Addressing of documents and word positions within documents
– Most frequently used indexing technique for large text databases
– Appropriate when text collection is large and semi-static
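As a minimal sketch of the idea (a plain Python dict standing in for the index file, using a few of the book titles from the example that follows):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Word-oriented index: map each term to the sorted list of documents
    containing it, so term search no longer scans the full text."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)

docs = {
    1: "A Course on Integral Equations",
    2: "Attractors for Semigroups and Evolution Equations",
    10: "Ordinary Differential Equations",
}
index = build_inverted_index(docs)
print(index["equations"])  # [1, 2, 10]
```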
Inverted Files
Inverted list for a term
Example: Documents
B1 A Course on Integral Equations
B2 Attractors for Semigroups and Evolution Equations
B3 Automatic Differentiation of Algorithms: Theory, Implementation, and Application
B4 Geometrical Aspects of Partial Differential Equations
B5 Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and
Commutative Algebra
B6 Introduction to Hamiltonian Dynamical Systems and the N-Body Problem
B7 Knapsack Problems: Algorithms and Computer Implementations
B8 Methods of Solving Singular Systems of Ordinary Differential Equations
B9 Nonlinear Systems
B10 Ordinary Differential Equations
B11 Oscillation Theory for Neutral Differential Equations with Delay
B12 Oscillation Theory of Delay Differential Equations
B13 Pseudodifferential Operators and Nonlinear Partial Differential Equations
B14 Sinc Methods for Quadrature and Differential Equations
B15 Stability of Stochastic Differential Equations with Respect to Semi-Martingales
B16 The Boundary Integral Approach to Static and Dynamic Contact Problems
B17 The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory
Example
(term number, term, frequency : list of documents)
1 Algorithms 3 : 3 5 7
2 Application 2 : 3 17
3 Delay 2 : 11 12
4 Differential 8 : 4 8 10 11 12 13 14 15
5 Equations 10 : 1 2 4 8 10 11 12 13 14 15
6 Implementation 2 : 3 7
7 Integral 2 : 16 17
8 Introduction 2 : 5 6
9 Methods 2 : 8 14
10 Nonlinear 2 : 9 13
11 Ordinary 2 : 8 10
12 Oscillation 2 : 11 12
13 Partial 2 : 4 13
14 Problem 2 : 6 7
15 Systems 3 : 6 8 9
16 Theory 4 : 3 11 12 17
Physical Organization of Inverted Files
Index file (vocabulary): one entry per term, storing the key ki, the number of documents fi, and a pointer pi into the posting file
– The access structure to the vocabulary can be a B+-tree, hashing, or a sorted array
– Kept in main memory; space requirement O(n^β) with 0.4 < β < 0.6 (Heaps' law)
Posting file: the occurrences of the words, stored ordered lexicographically by term
– Space requirement O(n)
Document file: the documents D1, ..., Dn, stored in a contiguous file on secondary storage
– Space requirement O(n)
Heaps’ Law
An empirical law that describes the relation between the size n
of a collection and the size m of its vocabulary: m = K·n^β
[Figure: log10 m (vocabulary size) grows linearly in log10 n]
• Stemming and lower-casing decrease the vocabulary size
• Numbers and spelling errors increase it
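A quick numerical illustration (K and β below are made-up values within the stated range, not measurements):

```python
# Heaps' law: vocabulary size m = K * n**beta for a collection of size n.
K, beta = 10.0, 0.5          # illustrative constants, 0.4 < beta < 0.6

def vocabulary_size(n):
    return K * n ** beta

# Sub-linear growth: doubling the collection grows the vocabulary by ~41%.
print(vocabulary_size(1_000_000))                                # 10000.0
print(vocabulary_size(2_000_000) / vocabulary_size(1_000_000))   # ~1.414
```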
Example
[Figure: an index file, posting file, and document file for the example collection]
Addressing Granularity
By default, the entries in the postings file point to the
documents in the document file
Payload of Postings
Postings consist of the document identifier
•For Boolean retrieval this is sufficient
A posting indicates ...
1. The frequency of a term in the vocabulary
2. The frequency of a term in a document
3. The occurrence of a term in a document
4. The list of terms occurring in a document
When indexing a document collection using an inverted file,
the main space requirement is implied by ...
1. The access structure
2. The vocabulary
3. The index file
4. The postings file
Searching the Inverted File
Step 1: Vocabulary search
– the words present in the query are searched in the index file
Step 2: Retrieval of occurrences
– the lists of the occurrences of all words found are retrieved from
the posting file
Step 3: Manipulation of occurrences
– the occurrences are processed in the document file to answer
the query
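The three steps can be sketched as follows for a conjunctive (AND) Boolean query; the dict literal stands in for the index and posting files:

```python
def search(index, query):
    """Conjunctive (AND) Boolean search over an inverted file.

    Step 1: look up every query term in the index (vocabulary search).
    Step 2: retrieve the posting list of each term found.
    Step 3: intersect the posting lists to answer the query.
    """
    postings = []
    for term in query.lower().split():
        if term not in index:        # a missing term means an empty result
            return []
        postings.append(set(index[term]))
    return sorted(set.intersection(*postings))

index = {
    "differential": [4, 8, 10, 11, 12, 13, 14, 15],
    "ordinary": [8, 10],
}
print(search(index, "Ordinary Differential"))  # [8, 10]
```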
Example
Searching “Differential”: Step 1 finds the term in the index file with frequency 8; Step 2 retrieves its posting list 4, 8, 10, 11, 12, 13, 14, 15 from the posting file; Step 3 accesses the corresponding documents B4, B8, B10, B11, B12, B13, B14, B15 in the document file.
Construction of the Inverted File – Step 1
Step 1: Search phase
– The vocabulary is kept in an ordered data structure, e.g., a trie or
sorted array, storing for each word a list of its occurrences
– Each word of the text is read sequentially and searched in the
vocabulary
– If a word is not found in the ordered data structure, it is added to
the vocabulary with an empty list of occurrences
– If a word is found, the word position is added to the end of its list
of occurrences
Construction of the Inverted File – Step 2
Step 2: Storage phase (once the text is exhausted)
– The list of occurrences is written contiguously to the disk (posting
file)
– The vocabulary is stored in lexicographical order (index file) in
main memory together with a pointer for each word to its list in
the posting file
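A compact sketch of the two phases, with a plain dict in place of the trie and word counts as positions:

```python
def construct_inverted_file(text):
    """Two-phase construction: search phase, then storage phase.

    Phase 1 scans the text word by word, recording occurrences; phase 2
    writes all occurrence lists contiguously (posting file) and keeps the
    vocabulary in lexicographic order with a pointer per word (index file).
    """
    vocabulary = {}                              # stand-in for the trie
    for position, word in enumerate(text.lower().split(), start=1):
        vocabulary.setdefault(word, []).append(position)
    index_file, posting_file = {}, []
    for word in sorted(vocabulary):
        index_file[word] = (len(vocabulary[word]), len(posting_file))
        posting_file.extend(vocabulary[word])
    return index_file, posting_file

idx, post = construct_inverted_file("the house has a garden")
print(idx["the"], post)  # (1, 4) [4, 5, 3, 2, 1]
```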
Tries
A trie is a tree data structure to index strings
• For each prefix of each length in the set of strings a
separate path is created
• Strings are looked up by following the prefix path
Example: trie for {a, to, tea, ted, ten, i, in, inn}
[Figure: trie for {a, to, tea, ted, ten, i, in, inn} — shared prefixes such as te- are stored on a single path; each word ends at a marked node]
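A minimal trie sketch (one node per character of each prefix):

```python
class Trie:
    """Minimal trie: strings are inserted and looked up character by
    character along a shared prefix path."""
    def __init__(self):
        self.children = {}   # next character -> child node
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:                       # one edge per prefix character
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def lookup(self, word):
        node = self
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_word

trie = Trie()
for w in ["a", "to", "tea", "ted", "ten", "i", "in", "inn"]:
    trie.insert(w)
print(trie.lookup("ten"), trie.lookup("te"))  # True False
```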
Example
Word positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
the house has a garden. the garden has many flowers. the flowers are beautiful
(each word = one document, position = document identifier)
[Figure: the trie is built incrementally while scanning the text word by word]
Resulting inverted file I:
a: 16
are: 66
beautiful: 70
flowers: 45, 58
garden: 18, 29
has: 12, 36
house: 6
many: 40
the: 1, 25, 54
Postings file (occurrence lists written contiguously, in vocabulary order):
16, 66, 70, 45, 58, 18, 29, 12, 36, 6, 40, 1, 25, 54
Example
Word positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
the house has a garden. the garden has many flowers. the flowers are beautiful
The index file stores, for each term, a pointer (1-based offset) to its occurrence list in the postings file, e.g.:
a: 1, beautiful: 3, flowers: 4, garden: 6, many: 11, the: 12
Postings file:
16, 66, 70, 45, 58, 18, 29, 12, 36, 6, 40, 1, 25, 54
Index Construction in Practice
When using a single node, not all index information can be kept in
main memory → index merging
– When no more memory is available, a partial index Ii is written to disk
– The main memory is erased before continuing with the rest of the text
– Once the text is exhausted, a number of partial indices Ii exist on disk
– The partial indices are merged to obtain the final index
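A sketch of the final merge step, assuming each partial index was written to disk as a term-sorted list of (term, postings) pairs:

```python
import heapq

def merge_partial_indices(partials):
    """Merge term-sorted partial inverted indices into one final index.

    Posting lists for the same term are concatenated; document order is
    preserved because the blocks were processed in document order.
    """
    merged = {}
    for term, postings in heapq.merge(*partials):
        merged.setdefault(term, []).extend(postings)
    return merged

I1 = [("garden", [18, 29]), ("the", [1, 25])]
I2 = [("beautiful", [70]), ("the", [54])]
print(merge_partial_indices([I1, I2])["the"])  # [1, 25, 54]
```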
Index Merging
BSBI: Blocked Sort-Based Indexing
[Figure: binary merge tree — the initial partial indices I1..I8, built from document blocks D1..D8, are merged pairwise (merges 1, 2, 4, 5 at level 1), the results are merged into I1...4 and I5...8 (merges 3, 6 at level 2), and a final merge (7, at level 3) produces the final index I1...8]
Example
Word positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
the house has a garden. the garden has many flowers. the flowers are beautiful
Partial inverted file I1 (first part of the text):
a: 16 | garden: 18, 29 | has: 12, 36 | house: 6 | the: 1, 25
Partial inverted file I2 (rest of the text):
are: 66 | beautiful: 70 | flowers: 45, 58 | many: 40 | the: 54
Merging concatenates the inverted lists of each term, e.g., the: 1, 25 + 54 → 1, 25, 54
Merged index:
a: 16
are: 66
beautiful: 70
flowers: 45, 58
garden: 18, 29
has: 12, 36
house: 6
many: 40
the: 1, 25, 54
Total cost: O(n log2(n/M)), where M is the size of main memory
Addressing Granularity
Documents can be addressed at different granularities
– coarser: text blocks spanning multiple documents
– finer: paragraph, sentence, word level
General rule
– the finer the granularity the less post-processing but the larger
the index
Example: index size in % of the document collection size

Index                     Small (1 Mb)   Medium (200 Mb)   Large (2 Gb)
Addressing words          73%            64%               63%
Addressing documents      26%            32%               47%
Addressing 256K blocks    25%            2.4%              0.7%
Index Compression
Documents are ordered and each document identifier dij is
replaced by the difference to the preceding document
identifier (gap encoding)
– Document identifiers are encoded using fewer bits for smaller,
common numbers

Inverted list for term lk:   lk   fk : d_i1, ..., d_ifk
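A sketch of gap encoding combined with a variable-byte code (one common encoding choice; the function names are illustrative), using the posting list of “Differential” from the earlier example:

```python
def gap_encode(doc_ids):
    """Replace each sorted document id by the gap to its predecessor."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def vbyte(n):
    """Variable-byte code: 7 payload bits per byte; in this sketch the
    high bit marks the final byte of a number."""
    out = []
    while True:
        out.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    out.reverse()
    out[-1] |= 0x80          # mark the last byte
    return bytes(out)

postings = [4, 8, 10, 11, 12, 13, 14, 15]
gaps = gap_encode(postings)          # [4, 4, 2, 1, 1, 1, 1, 1]
encoded = b"".join(vbyte(g) for g in gaps)
print(gaps, len(encoded))            # small gaps -> one byte each
```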
Using a trie in index construction …
1. Helps to quickly find words that have been seen before
2. Helps to quickly decide whether a word has not been seen
before
3. Helps to maintain the lexicographic order of words seen in
the documents
4. All of the above
1.3.2 Web-Scale Indexing: Map-Reduce
Pioneered by Google: 20PB of data per day (2008)
– Scan 100 TB on 1 node @ 50 MB/s = 23 days
– Scan on 1000-node cluster = 33 minutes
Cost-efficiency
– Commodity nodes, network (cheap, but unreliable)
– Automatic fault-tolerance (fewer admins)
– Easy to use (fewer programmers)
Map-Reduce Programming Model
Data type: key-value pairs
Map function:
Analyses part of the input and produces a list of intermediate
key-value pairs
Reduce function:
Takes all intermediate results belonging to one key and
computes an aggregate
Example
Basic word counter program
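The program itself is not reproduced on the slide; a self-contained sketch in map-reduce style, with a small driver simulating the platform’s shuffle phase, could look like this:

```python
from itertools import groupby
from operator import itemgetter

def mapper(document, text):
    for word in text.split():
        yield (word, 1)                    # emit one pair per occurrence

def reducer(word, counts):
    yield (word, sum(counts))              # aggregate all values for one key

def run(docs):
    """Driver that simulates map, shuffle (sort + group by key), reduce."""
    pairs = [kv for doc, text in docs for kv in mapper(doc, text)]
    pairs.sort(key=itemgetter(0))          # shuffle: group pairs by key
    out = {}
    for word, group in groupby(pairs, key=itemgetter(0)):
        for w, total in reducer(word, (c for _, c in group)):
            out[w] = total
    return out

print(run([("d1", "to be or not to be")]))  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```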
Map-Reduce Processing Model
[Figure: the input data is partitioned into subsets; mappers extract word occurrences, emitting pairs with count 1; reducers aggregate the occurrences per word (e.g., totals 2, 2, 3); the output is written to stable storage]
Important: the reducers can only start after all mappers have finished!
Refined Map-Reduce Programming Model
[Figure not reproduced]
What the Programmer Controls (and not)
The programmer controls
• Key-value data structures (can be complex)
• Maintenance of state in mappers and reducers
• Sort order of intermediate key-value pairs
• Partitioning scheme on the key space
The map-reduce platform controls
• Where the mappers and reducers run
• When a mapper and a reducer start and terminate
• Which input data is assigned to a specific mapper
• Which intermediate key-value pairs are processed by a specific
reducer
Inverted File Construction Using Map-Reduce
Inverted File Construction Program

def mapper(document, text):
    f = {}
    for word in text.split():
        f[word] = f.get(word, 0) + 1
    for word in f:
        output(word, (document, f[word]))
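The slides show only the mapper; a matching reducer (a sketch — `output` is the platform’s emit function, stubbed out below for illustration) could be:

```python
def reducer(word, values):
    # values: all (document, frequency) pairs emitted for this word.
    # Sorting by document identifier yields the term's posting list.
    postings = sorted(values)
    output(word, postings)

# Illustrative driver stub (not part of a real map-reduce platform):
collected = {}
def output(key, value):
    collected[key] = value

reducer("differential", [(10, 1), (4, 2), (8, 1)])
print(collected["differential"])  # [(4, 2), (8, 1), (10, 1)]
```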
Maintaining the order of document identifiers for
vocabulary construction when partitioning the document
collection is important ...
1.3.3 Distributed Retrieval
Centralized retrieval
– Aggregate the weights for ALL documents by scanning the
posting lists of the query terms
– Scanning is relatively efficient
– Computationally quite expensive (memory, processing)
[Figure: the posting lists of t1, t2, and t3 are scanned in full]
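A sketch of the accumulator-based scan (the documents and weights are made up):

```python
def centralized_retrieval(index, query, k):
    """Aggregate weights for ALL documents by scanning the posting lists
    of the query terms; scanning is efficient, but an accumulator entry
    is needed per scored document (memory, processing)."""
    scores = {}
    for term in query:
        for doc, weight in index.get(term, []):   # scan one posting list
            scores[doc] = scores.get(doc, 0.0) + weight
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

index = {
    "t1": [("d1", 0.75), ("d2", 0.5)],
    "t2": [("d2", 0.5), ("d3", 0.25)],
}
print(centralized_retrieval(index, ["t1", "t2"], 2))  # [('d2', 1.0), ('d1', 0.75)]
```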
Distributed Retrieval
Distributed retrieval
– Posting lists for different terms stored on different nodes
– The transfer of complete posting lists can become prohibitively
expensive in terms of bandwidth consumption
Example 1
Finding the top-2 elements for a two-term query
[Figure not reproduced]
Example 2
Finding the top-2 elements for a two-term query
[Figure not reproduced; visible entries include d2: 0.1 and d7: 0.0 in one list, d7: 0.15 and d4: 0.0 in the other]
Example 3
Finding the top-2 elements for a two-term query
[Figure not reproduced]
Discussion
Complexity
– O((k·n)^(1/2)) entries are read in each list for n documents,
assuming that entries are uncorrelated
– Improves if they are positively correlated
In distributed settings, optimizations reduce the number of
roundtrips
– Send a longer prefix of one list to the other node
Fagin’s algorithm may behave poorly in practical cases
– Alternative: the Threshold Algorithm
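A sketch of Fagin’s algorithm with sum aggregation (the lists must be sorted by descending weight; the document ids and weights below are made up):

```python
def fagin(lists, k):
    """Fagin's algorithm sketch: sequential round-robin access until k
    documents have been seen in ALL lists, then random-access lookups
    of the missing weights for every document seen so far."""
    weights = [dict(l) for l in lists]      # for random-access lookups
    seen = [set() for _ in lists]
    for pos in range(max(len(l) for l in lists)):
        for i, l in enumerate(lists):       # round-robin sequential access
            if pos < len(l):
                seen[i].add(l[pos][0])
        if len(set.intersection(*seen)) >= k:
            break                           # k docs seen in every list
    candidates = set.union(*seen)           # random-access phase
    scored = [(sum(w.get(d, 0.0) for w in weights), d) for d in candidates]
    return sorted(scored, reverse=True)[:k]

l1 = [("d1", 0.75), ("d2", 0.5), ("d3", 0.25)]
l2 = [("d2", 0.75), ("d3", 0.5), ("d1", 0.25)]
print(fagin([l1, l2], 2))  # [(1.25, 'd2'), (1.0, 'd1')]
```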
1.3.3.2 Threshold Algorithm
Threshold Algorithm
– Access the elements of each list sequentially
– At each round
– look up the missing weights of the current elements in the other
lists using random access
– keep the top-k elements seen so far
– compute the threshold as the aggregate value of the elements
seen in the current round
– If at least k documents have an aggregate value equal to or higher
than the threshold, halt
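A sketch of the Threshold Algorithm with sum aggregation (same made-up lists as in the Fagin sketch above; a full random-access lookup per seen document keeps the code short):

```python
def threshold_algorithm(lists, k):
    """Threshold Algorithm sketch: per round, read one element from each
    list sequentially, fetch its missing weights by random access, and
    halt once the k-th best aggregate reaches the round's threshold."""
    weights = [dict(l) for l in lists]           # random-access lookups
    best = {}                                    # doc -> aggregate weight
    for pos in range(max(len(l) for l in lists)):
        threshold = 0.0
        for i, l in enumerate(lists):
            if pos >= len(l):
                continue
            doc, w = l[pos]
            threshold += w                       # aggregate of current row
            best[doc] = sum(d.get(doc, 0.0) for d in weights)
        top = sorted(best.items(), key=lambda kv: -kv[1])[:k]
        if len(top) == k and top[-1][1] >= threshold:
            return top                           # k docs reach the threshold
    return sorted(best.items(), key=lambda kv: -kv[1])[:k]

l1 = [("d1", 0.75), ("d2", 0.5), ("d3", 0.25)]
l2 = [("d2", 0.75), ("d3", 0.5), ("d1", 0.25)]
print(threshold_algorithm([l1, l2], 2))  # [('d2', 1.25), ('d1', 1.0)]
```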
Example
Finding the top-2 elements for a two-term query
[Figure: in the first rounds, d1 and d6 have aggregate weights lower than the threshold, therefore the algorithm continues; it halts once the top-2 documents reach the threshold]
Applications
Beyond distributed document retrieval these algorithms
have wider applications
– Multimedia, image retrieval
– Top-k processing in relational databases
– Document filtering
– Sensor data processing
When applying Fagin’s algorithm for a query with
three different terms for finding the k top
documents, the algorithm will scan ...
1. 2 different lists
2. 3 different lists
3. k different lists
4. It depends on how many
rounds are taken
With Fagin’s algorithm, once k documents have
been identified that occur in all of the lists ...
1. These are the top-k
documents
2. The top-k documents are
among the documents seen
so far
3. The search has to continue
in round-robin till the top-k
documents are identified
4. Other documents have to be
searched to complete the
top-k list
References
Lin, J., Dyer, C. (2010). Inverted Indexing for Text Retrieval. In: Data-Intensive Text Processing with MapReduce. Synthesis Lectures on Human Language Technologies. Springer, Cham.