Lecture 4: Index construction
Introduction to Information Retrieval
Index Construction
Index construction
How do we construct an index?
What strategies can we use with limited main
memory?
Hardware basics
Many design decisions in information retrieval are
based on the characteristics of hardware
We begin by reviewing hardware basics
Hardware basics
Access to data in memory is much faster than
access to data on disk.
Disk seeks: No data is transferred from disk while
the disk head is being positioned.
Therefore: Transferring one large chunk of data
from disk to memory is faster than transferring
many small chunks.
Disk I/O is block-based: Reading and writing of
entire blocks (as opposed to smaller chunks).
Block sizes: 8 KB to 256 KB.
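The claim that one large transfer beats many small ones can be checked with a simple cost model: each chunk pays one seek, and every byte pays the transfer cost. The sketch below is illustrative only. The function name `read_time` and the two constants (5 ms per seek, 0.02 microseconds per byte) are assumptions chosen for the example, not values taken from these slides.

```python
def read_time(total_bytes, num_chunks, seek_s, transfer_s_per_byte):
    """Estimated seconds to read total_bytes split into num_chunks separate reads."""
    # One seek per chunk, plus the raw transfer time for all bytes.
    return num_chunks * seek_s + total_bytes * transfer_s_per_byte


# Assumed, illustrative hardware figures (not from the slides).
SEEK_S = 5e-3
TRANSFER_S_PER_BYTE = 2e-8

one_chunk = read_time(10_000_000, 1, SEEK_S, TRANSFER_S_PER_BYTE)
many_chunks = read_time(10_000_000, 100, SEEK_S, TRANSFER_S_PER_BYTE)
print(f"1 chunk: {one_chunk:.3f} s, 100 chunks: {many_chunks:.3f} s")
```

Under these assumed numbers the transfer cost is identical in both cases; the whole difference comes from the extra seeks.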
Hardware basics
Servers used in IR systems now typically have
several GB of main memory, sometimes tens of GB.
[Table: statistic | value (rows not preserved)]
Doc 1
I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2
So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Term        Doc #
I           1
did         1
enact       1
julius      1
caesar      1
I           1
was         1
killed      1
i'          1
the         1
capitol     1
brutus      1
killed      1
me          1
so          2
let         2
it          2
be          2
with        2
caesar      2
the         2
noble       2
brutus      2
hath        2
told        2
you         2
caesar      2
was         2
ambitious   2
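The parsing step that produces this list of (term, doc #) pairs can be sketched in a few lines of Python. This is a minimal sketch, not the tokenizer used in the lecture: the names `DOCS`, `tokenize` and `emit_pairs` are hypothetical, and the sketch lowercases every token and strips only trailing sentence punctuation, so "I" comes out as "i".

```python
DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def tokenize(text):
    """Split on whitespace, strip surrounding sentence punctuation, lowercase."""
    tokens = []
    for raw in text.split():
        tok = raw.strip(".,;:!?").lower()  # keeps the apostrophe in "i'"
        if tok:
            tokens.append(tok)
    return tokens


def emit_pairs(docs):
    """Return (term, doc_id) pairs in document order, one pair per token."""
    pairs = []
    for doc_id in sorted(docs):
        for term in tokenize(docs[doc_id]):
            pairs.append((term, doc_id))
    return pairs


pairs = emit_pairs(DOCS)
print(len(pairs), pairs[:3])
```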
Key step
After all documents have
been parsed, the inverted
file is sorted by terms.
Term        Doc #
ambitious   2
be          2
brutus      1
brutus      2
caesar      1
caesar      2
caesar      2
capitol     1
did         1
enact       1
hath        2
I           1
I           1
i'          1
it          2
julius      1
killed      1
killed      1
let         2
me          1
noble       2
so          2
the         1
the         2
told        2
was         1
was         2
with        2
you         2
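Sorting by term and then by doc #, followed by grouping, turns the pair list into postings lists. The sketch below assumes the pairs fit in memory. The name `build_postings` and the small sample list are illustrative, and duplicate (term, doc) pairs are collapsed into a single posting.

```python
from itertools import groupby

# A small illustrative subset of (term, doc_id) pairs in parse order.
sample_pairs = [
    ("brutus", 1), ("caesar", 1), ("killed", 1), ("killed", 1),
    ("caesar", 2), ("brutus", 2), ("caesar", 2), ("ambitious", 2),
]


def build_postings(pairs):
    """Sort pairs by (term, doc_id), then group them into postings lists."""
    ordered = sorted(pairs)  # tuples sort by term first, then by doc_id
    index = {}
    for term, group in groupby(ordered, key=lambda pair: pair[0]):
        # Collapse repeated occurrences of a term within one document.
        index[term] = sorted({doc_id for _, doc_id in group})
    return index


index = build_postings(sample_pairs)
for term, postings in index.items():
    print(term, postings)
```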
Bottleneck
Parse and build postings entries one doc at a time.
Now sort postings entries by term (then by doc within each term).
Doing this with random disk seeks would be too slow: we must sort T = 100M records.
If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take?
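The question can be answered with a quick calculation once a seek time is fixed. The sketch below assumes 5 ms per disk seek, which is an illustrative figure and not a value stated on this slide; `sort_time_seconds` is a hypothetical helper name.

```python
import math


def sort_time_seconds(n_records, seeks_per_comparison, seek_s):
    """Time for a comparison sort in which every comparison costs disk seeks."""
    comparisons = n_records * math.log2(n_records)
    return comparisons * seeks_per_comparison * seek_s


T = 100_000_000   # records to sort, from the slide
ASSUMED_SEEK_S = 5e-3  # assumed average seek time, not from the slide

seconds = sort_time_seconds(T, 2, ASSUMED_SEEK_S)
days = seconds / 86_400
print(f"{seconds:.3g} s, about {days:.0f} days")
```

With that assumed seek time the sort takes on the order of 300 days, which is why a disk-seek-bound sort is not an option.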
SPIMI-Invert
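A minimal Python sketch of a SPIMI-style inversion step, under stated assumptions: each block builds its own dictionary from scratch, postings are appended directly to per-term lists, and terms are sorted only when the block is written out. The memory limit is approximated by a cap on the number of (term, doc) pairs consumed per block, blocks are written as JSON files, and the names `spimi_invert` and `write_blocks` are hypothetical. The final merge of blocks is not shown.

```python
import json
import os
from itertools import islice


def spimi_invert(token_stream, max_postings):
    """Consume up to max_postings (term, doc_id) pairs and return one sorted block."""
    dictionary = {}
    for term, doc_id in islice(token_stream, max_postings):
        postings = dictionary.setdefault(term, [])  # new term: fresh postings list
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    # Terms are sorted once per block, just before writing.
    return sorted(dictionary.items())


def write_blocks(pairs, max_postings, out_dir):
    """Invert the stream block by block and write each block to its own file."""
    stream = iter(pairs)
    paths = []
    while True:
        block = spimi_invert(stream, max_postings)
        if not block:  # stream exhausted
            break
        path = os.path.join(out_dir, f"block{len(paths)}.json")
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(block, fh)
        paths.append(path)
    return paths
```

Usage: `write_blocks([("b", 1), ("a", 1), ("b", 2), ("a", 2), ("c", 2)], 3, some_dir)` writes two block files, the first covering the first three pairs.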
SPIMI: Compression
Compression makes SPIMI even more
efficient.
Compression of terms
Compression of postings
Distributed indexing
For web-scale indexing:
must use a distributed computing cluster
Distributed indexing
Maintain a master machine directing the
indexing job; the master is considered "safe".
Break up indexing into sets of (parallel)
tasks.
Master machine assigns each task to an
idle machine from a pool.
Parallel tasks
We will use two sets of parallel tasks
Parsers
Inverters
Parsers
Master assigns a split to an idle parser
machine
Parser reads a document at a time and
emits (term, doc) pairs
Parser writes pairs into j partitions
Each partition is for a range of terms' first
letters
(e.g., a-f, g-p, q-z; here j = 3).
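The parser's job can be sketched as follows, assuming the three first-letter ranges from the example (j = 3). The names `PARTITIONS`, `partition_of` and `parse_split` are hypothetical, a split is modelled as a list of (doc_id, text) tuples, and the segment files are modelled as in-memory lists. Terms that do not start with a letter a-z are rejected in this sketch.

```python
PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]  # j = 3 term ranges


def partition_of(term):
    """Return the index of the partition whose letter range covers the term."""
    first = term[0].lower()
    for idx, (low, high) in enumerate(PARTITIONS):
        if low <= first <= high:
            return idx
    raise ValueError(f"no partition for term {term!r}")


def parse_split(split):
    """Read one document at a time and route (term, doc_id) pairs to partitions."""
    segments = [[] for _ in PARTITIONS]
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments


segments = parse_split([(1, "caesar killed"), (2, "brutus told rome")])
for (low, high), segment in zip(PARTITIONS, segments):
    print(f"{low}-{high}: {segment}")
```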
Inverters
An inverter collects all (term,doc) pairs (=
postings) for one term-partition.
Sorts and writes to postings lists
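An inverter for one term partition might look like the sketch below. It assumes the segment files for that partition, one per parser, are available as lists of (term, doc_id) pairs and fit in memory; the name `invert` is hypothetical.

```python
def invert(segment_files):
    """Collect all pairs for one term partition, sort them, build postings lists."""
    pairs = [pair for segment in segment_files for pair in segment]
    pairs.sort()  # by term, then by doc_id
    postings = {}
    for term, doc_id in pairs:
        doc_ids = postings.setdefault(term, [])
        if not doc_ids or doc_ids[-1] != doc_id:  # skip repeats within a document
            doc_ids.append(doc_id)
    return postings


# Two parsers each produced a segment file for the a-f partition.
a_to_f = invert([
    [("caesar", 1), ("brutus", 2)],
    [("brutus", 1), ("caesar", 1)],
])
print(a_to_f)
```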
Data flow
[Figure: the master assigns splits to the parsers (map phase); the parsers write segment files partitioned a-f, g-p, q-z; the master assigns each partition to an inverter (reduce phase), which writes the postings.]
MapReduce
The index construction algorithm we just
described is an instance of MapReduce.
MapReduce (Dean and Ghemawat 2004) is
a robust and conceptually simple
framework for distributed computing
without having to write code for the
distribution part.
They describe the Google indexing system
(ca. 2002) as consisting of a number of
phases, each implemented in MapReduce.
MapReduce
Index construction was just one phase.
Another phase: transforming a term-partitioned index into a document-partitioned index.
Term-partitioned: one machine handles a
subrange of terms
Document-partitioned: one machine handles a
subrange of documents
Reduce:
(<C,(d1,d2,d1)>, <died,(d2)>, <came,(d1)>, <ced,(d1)>)
→ (<C,(d1:2,d2:1)>, <died,(d2:1)>, <came,(d1:1)>, <ced,(d1:1)>)
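The reduce step above can be reproduced with a small single-process sketch. It is not the MapReduce framework itself: `map_fn`, `shuffle`, `reduce_fn` and `build_index` are hypothetical names, and the two document texts are assumed inputs chosen so that the grouped doc lists match the example.

```python
from collections import Counter, defaultdict


def map_fn(doc_id, text):
    """Map: emit one (term, doc_id) pair per token."""
    return [(term, doc_id) for term in text.split()]


def shuffle(pairs):
    """Group doc ids by term, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_fn(term, doc_ids):
    """Reduce: turn a list of doc ids into (doc_id, term frequency) postings."""
    return term, sorted(Counter(doc_ids).items())


def build_index(docs):
    pairs = [pair for doc_id, text in docs.items() for pair in map_fn(doc_id, text)]
    return dict(reduce_fn(term, ids) for term, ids in shuffle(pairs).items())


index = build_index({"d1": "C came C ced", "d2": "C died"})
print(index)
```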
Dynamic indexing
Up to now, we have assumed that
collections are static.
They rarely are:
Documents come in over time and need to be
inserted.
Documents are deleted and modified.
Simplest approach
Logarithmic merge
Maintain a series of indexes, each twice as
large as the previous one
At any time, some of these powers of 2 are
instantiated
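One way to realise this scheme is sketched below, under simplifying assumptions: postings are plain comparable values, the in-memory buffer holds n postings, and the on-disk indexes are modelled as sorted Python lists keyed by level i, with level i holding n * 2^i postings. The class name `LogMergeIndex` and its attributes are hypothetical.

```python
import heapq


class LogMergeIndex:
    """Series of indexes I_0, I_1, ... where I_i holds n * 2**i postings."""

    def __init__(self, n):
        self.n = n
        self.z0 = []       # in-memory buffer of at most n postings
        self.levels = {}   # level i -> sorted list of postings ("on disk")

    def add(self, posting):
        self.z0.append(posting)
        if len(self.z0) < self.n:
            return
        z = sorted(self.z0)
        self.z0 = []
        i = 0
        # Carry upward like binary addition: merge while level i is occupied.
        while i in self.levels:
            z = list(heapq.merge(self.levels.pop(i), z))
            i += 1
        self.levels[i] = z


index = LogMergeIndex(n=2)
for posting in range(6):
    index.add(posting)
print({level: len(postings) for level, postings in index.levels.items()})
```

After six postings with n = 2, levels 0 and 1 are instantiated, matching the binary representation of 3 full buffers.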
Logarithmic merge
Auxiliary and main index: index construction time
is O(T^2), as each posting is touched in each merge.
Logarithmic merge: each posting is merged O(log
T) times, so complexity is O(T log T).
So logarithmic merge is much more efficient for
index construction.
But query processing now requires the merging of
O(log T) indexes,
whereas it is O(1) if you just have a main and
auxiliary index.
Here T is the total number of postings.
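The two construction costs can be compared by counting how many postings are written to disk under each scheme. The sketch below is a simplified accounting, assuming an in-memory buffer of n postings and counting only the output size of each disk write; `aux_main_cost` and `log_merge_cost` are hypothetical names.

```python
def aux_main_cost(T, n):
    """Postings written when each full buffer is merged into one main index."""
    cost = 0
    main = 0
    for _ in range(T // n):
        main += n      # main index grows by one buffer
        cost += main   # the whole main index is rewritten in every merge
    return cost


def log_merge_cost(T, n):
    """Postings written under logarithmic merging with buffer size n."""
    cost = 0
    levels = set()     # indexes I_i currently instantiated
    for _ in range(T // n):
        size, i = n, 0
        while i in levels:
            levels.remove(i)
            size *= 2
            cost += size   # output of merging I_i with Z_i
            i += 1
        if i == 0:
            cost += size   # buffer written to disk unmerged as I_0
        levels.add(i)
    return cost


for T in (8, 1024, 65536):
    print(T, aux_main_cost(T, 2), log_merge_cost(T, 2))
```

For very small T the two counts coincide; as T grows, the auxiliary-and-main count grows quadratically while the logarithmic-merge count stays within a log factor of T.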
Why?