Module 3 Indexing Part A
• Retrieval variables:
– queries, documents, terms, relevance judgements, users, information needs
• Boolean queries
– Used by Boolean retrieval model and in other models
– Boolean query ≠ Boolean model
• Probabilistic models
Other Model Dimensions
• Logical View of Documents
– Index terms
– Full text
– Full text + Structure (e.g. hypertext)
• User Task
– Retrieval
– Browsing
Issues for Vector Space Model
• How to determine important words in a document?
– Word sense?
– Word n-grams (and phrases, idioms, …) as terms
• How to determine the degree of importance of a term
within a document and within the entire collection?
• How to determine the degree of similarity between
a document and the query?
• In the case of the web, what is a collection and
what are the effects of links, formatting
information, etc.?
The Vector-Space Model
• Assume t distinct terms remain after pre-
processing (the index terms or the vocabulary).
• These “orthogonal” terms form a vector space.
Dimension = t = |vocabulary|
• Each term, i, in a document or query, j, is
given a real-valued weight, wij.
• Both documents and queries are expressed as
t-dimensional vectors:
dj = (w1j, w2j, …, wtj)    q = (w1q, w2q, …, wtq)
Document Collection
• Collection of n documents can be represented in the vector
space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a
term in the document;
– zero means the term has no significance in the document or it
simply doesn’t exist in the document.
      T1   T2   …   Tt
D1   w11  w21   …  wt1
D2   w12  w22   …  wt2
 :     :    :        :
Dn   w1n  w2n   …  wtn
• tf-idf weighting is typical: wij = tfij · idfi = tfij · log2(N / dfi)
Graphic Representation
Example (three index terms T1, T2, T3):
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2, and Q drawn as vectors in the 3-dimensional term space]
wik = tfik · log(N / nk)
where:
Tk = term k in document Di
tfik = frequency of term Tk in document Di
idfk = inverse document frequency of term Tk in C, idfk = log(N / nk)
N = total number of documents in the collection C
nk = the number of documents in C that contain Tk
Inverse Document Frequency
• IDF provides high values for rare words and low values
for common words. The most frequent words are not the
most descriptive.
For a collection of 10,000 documents (N = 10000):
idf = log(10000 / 10000) = 0
idf = log(10000 / 5000) = 0.301
idf = log(10000 / 20) = 2.699
idf = log(10000 / 1) = 4
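A quick check of these values in Python (a minimal sketch; base-10 logarithms, as in the example above):

import math

def idf(N, df):
    # inverse document frequency: log(N / df)
    return math.log10(N / df)

N = 10000
for df in (10000, 5000, 20, 1):
    print(df, round(idf(N, df), 3))
# prints: 10000 0.0, 5000 0.301, 20 2.699, 1 4.0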
Query Vector
• The query vector is typically treated as a document
and is also tf-idf weighted.
• Similarity is then computed between vectors, where wij is the
weight of term i in document j and wiq is the weight of
term i in the query.
Binary:
– D = 1, 1, 1, 0, 1, 1, 0
– Q = 1, 0, 1, 0, 0, 1, 1
– Size of vector = size of vocabulary = 7
– 0 means the corresponding term is not found in the
document or query
sim(D, Q) = 3
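As a sanity check, the binary inner product is simply a count of shared terms (a minimal sketch):

D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]

# inner product: count the positions where both vectors have a 1
print(sum(d * q for d, q in zip(D, Q)))  # 3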
Weighted:
D1 = 2T1 + 3T2 + 5T3 D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3
Inner product: sim(dj, q) = Σ (i = 1 to t) wij · wiq

CosSim(dj, q) = (dj · q) / (|dj| |q|)
             = Σ (i = 1 to t) wij · wiq / ( √(Σ wij²) · √(Σ wiq²) )

[Figure: D1, D2, and Q as vectors in the term space t1, t2, t3]

D1 = 2T1 + 3T2 + 5T3   D2 = 3T1 + 7T2 + 1T3   Q = 0T1 + 0T2 + 2T3
Inner products: D1 · Q = 10, D2 · Q = 2
CosSim(D1, Q) = 10 / √((4+9+25) · (0+0+4)) = 0.81
CosSim(D2, Q) = 2 / √((9+49+1) · (0+0+4)) = 0.13
D1 is 6 times better than D2 using cosine similarity but only 5 times better using
inner product.
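A minimal sketch reproducing these numbers in Python (vectors written over T1, T2, T3):

import math

def cos_sim(d, q):
    # inner product normalized by the lengths of both vectors
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cos_sim(D1, Q), 2))  # 0.81
print(round(cos_sim(D2, Q), 2))  # 0.13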
Simple Implementation
1. Convert all documents in collection D to tf-idf weighted
vectors, dj, for keyword vocabulary V.
2. Convert query to a tf-idf-weighted vector q.
3. For each dj in D do
Compute score sj = cosSim(dj, q)
4. Sort documents by decreasing score.
5. Present top ranked documents to the user.
Time complexity: O(|V|·|D|) Bad for large V & D !
|V| = 10,000; |D| = 100,000; |V|·|D| = 1,000,000,000
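A sketch of steps 3–5 (illustrative only; cos_sim as defined above, and doc_vectors assumed to be a dict mapping document IDs to tf-idf vectors):

def retrieve(doc_vectors, q, k=10):
    # step 3: score every document against the query
    scores = [(cos_sim(d, q), doc_id) for doc_id, d in doc_vectors.items()]
    # step 4: sort by decreasing score
    scores.sort(reverse=True)
    # step 5: present the top-ranked documents
    return scores[:k]

This brute-force loop is exactly the O(|V|·|D|) cost noted above; the inverted indexes discussed later avoid touching documents that share no terms with the query.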
Comments on Vector Space Models
• Simple, mathematically based approach.
• Considers both local (tf) and global (idf) word
occurrence frequencies.
• Provides partial matching and ranked results.
• Tends to work quite well in practice despite
obvious weaknesses.
• Allows efficient implementation for large
document collections.
Problems with Vector Space Model
• Assumption of term independence
• Missing semantic information (e.g. word sense).
• Missing syntactic information (e.g. phrase
structure, word order, proximity information).
• Lacks the control of a Boolean model (e.g.,
requiring a term to appear in a document).
– Given a two-term query “A B”,
• may prefer a document containing A frequently but not B,
over a document that contains both A and B, but both less
frequently.
Mechanism of Query Processing
1. The relevant inverted indices are found
– Typically the indexes are in memory; otherwise this step
alone could take a full half second
2. If they are bit vectors, they are ANDed or ORed, then
materialized; then the lists are handled
• The result is many URLs.
• The next step is to determine their rank so the highest-ranked
URLs can be delivered to the user.
Ranking Pages
• Indexes have returned pages. Which ones
are most relevant to you?
There are many criteria for ranking pages:
– Presence of all words
– All words close together
– Words in important locations and formats on
the page
– Words near anchor text of links in referring
pages
• But the killer criterion is PageRank
PageRank Intuition
• You need to find a plumber. How do you do it?
1. Call plumbers and talk to them
2. Better: call friends and ask for plumber references
• Then choose the plumbers who have the most references
3. Better still: call friends who know a lot about plumbers (important friends)
and ask them for plumber references
• Then choose the plumbers who have the most references from important
people.
• Technique 1 was used before Google.
• Google introduced technique 2 to search engines
• Google also introduced technique 3
• Techniques 2, and especially 3, wiped out the competition.
• The big challenge: determine which pages are important
What does this mean for pages?
1. Most search engines look for pages
containing the word "plumber"
2. Google searches for pages that are linked to
by pages containing "plumber".
3. Google searches for pages that are linked to
by important pages containing "plumber".
• A web page is important if many important pages link to it.
– This is a recursive equation.
– Google solves it by imagining a web walker/crawler.
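A minimal power-iteration sketch of that recursive definition (the toy graph and the damping factor 0.85 are illustrative assumptions, not Google's actual parameters):

def pagerank(links, iters=50, d=0.85):
    # links: page -> list of pages it links to
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            targets = outs if outs else pages  # dangling pages spread rank evenly
            for t in targets:
                new[t] += d * rank[p] / len(targets)
        rank = new
    return rank

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
# C ends up with the most rank: it is linked to by both A and B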
Inverted Files
• Inverted files are used to index text
• The indices are appropriate when the
text collection is large and semi-static
• If the text collection is volatile, online
searching is the only option
• Some techniques combine online and
indexed searching
IR System: What Do You Need?
• Vocabulary List
– Text preprocessing modules
• lexical analysis, stemming, stopwords
• Occurrences of Vocabulary Terms
– Inverted index creation
• term frequency in documents, document frequency
• Retrieval and Ranking Algorithm
• Query and Ranking Interfaces
• Browsing/Visualization Interface
Pros and cons of Indexing
Advantages
– Can be searched quickly, e.g., by binary search: O(log n)
– Good for sequential processing, e.g., comp*
– Convenient for batch updating
– Economical use of storage
Disadvantages
– Index must be rebuilt if an extra term is added
Inverted index
• The inverted index of a document collection is
basically a data structure that
– associates each distinct term with a list of all
documents that contain the term.
• Thus, in retrieval, it takes constant time to
– find the documents that contain a query term.
– Multiple query terms are also easily handled.
An example
[Figure: an inverted index for a small example collection]
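A minimal sketch of such an index in Python, using a dict from term to a sorted postings list of document IDs (the three toy documents are invented for illustration):

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

index = {}
for doc_id, text in docs.items():
    for term in text.split():
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:  # record each doc once
            postings.append(doc_id)

print(index["home"])  # [1, 2, 3]
print(index["july"])  # [2, 3]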
Search using inverted index
Given a query q, search has the following steps:
• Step 1 (vocabulary search): find each
term/word in q in the inverted index.
• Step 2 (results merging): Merge results to find
documents that contain all or some of the
words/terms in q.
• Step 3 (rank score computation): rank the
resulting documents/pages using
– content-based ranking
– link-based ranking
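A sketch of steps 1 and 2 for an all-terms (AND) query, intersecting the sorted postings lists built in the previous sketch:

def intersect(p1, p2):
    # merge two sorted postings lists, keeping the common doc IDs
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def search_and(query, index):
    postings = [index.get(t, []) for t in query.split()]  # step 1: vocabulary search
    postings.sort(key=len)  # intersect the shortest lists first
    result = postings[0]
    for p in postings[1:]:  # step 2: results merging
        result = intersect(result, p)
    return result

print(search_and("home july", index))  # [2, 3]

Step 3 would then order [2, 3] by a content-based score such as the cosine similarity shown earlier, possibly combined with a link-based score.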
Index construction
Step 1: Building the vocabulary trie
• The construction of an inverted index is done efficiently
using a trie data structure.
• The time complexity of the index construction is O(T),
where T is the number of all terms (including duplicates) in
the document collection (after pre-processing).
• For each document, the algorithm scans it sequentially and,
for each term, looks the term up in the trie.
• If it is found, the document ID and other information (e.g.,
the offset of the term) are added to the inverted list of the
term.
• If the term is not found, a new leaf is created to represent
the term.
Step 2: Merging partial indices
• The partial index I1 obtained at some point in time is written to disk.
• Then, we process the subsequent documents and build the partial index
I2 in memory, and so on.
• After all documents have been processed, we have k partial indices, I1,
I2, …, Ik, on disk. We then merge the partial indices in a hierarchical
manner.
• That is, we first perform pair-wise merges of I1 and I2, I3 and I4, and
so on. This gives us larger indices I1-2, I3-4 and so on.
• After the first level merging is complete, we proceed to the second
level merging, i.e., we merge I1-2 and I3-4, I5-6 and I7-8 and so on.
This process continues until all the partial indices are merged into a
single index.
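A sketch of one pairwise merge (partial indices as term → sorted postings dicts; in a real system the merge streams sorted runs from disk rather than holding both indices in memory):

def merge_indices(i1, i2):
    merged = {}
    for term in i1.keys() | i2.keys():  # union of the two vocabularies
        # the two partial indices cover disjoint document ranges,
        # so sorting the concatenated postings restores doc-ID order
        merged[term] = sorted(i1.get(term, []) + i2.get(term, []))
    return merged

I12 = merge_indices({"home": [1, 2], "sales": [1]}, {"home": [3], "july": [3]})
print(I12["home"])  # [1, 2, 3]

The hierarchical scheme applies merge_indices pairwise: I1-2 = merge(I1, I2), I3-4 = merge(I3, I4), then merge(I1-2, I3-4), and so on until a single index remains.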
Index Compression
• An inverted index can be very large, so reducing the index
size becomes an important issue, both to save space and to
speed up the search.
• A natural solution is index compression, which aims to
represent the same information with fewer bits or bytes.
• Using compression, the size of an inverted index can be
reduced dramatically.
• With lossless compression, the original index can be
reconstructed exactly from the compressed version.
Index compression techniques
• There are two classes of compression schemes for inverted
lists: variable-bit schemes and variable-byte schemes.
• Variable bit scheme
– Unary Encoding
– Elias delta
– Elias gamma
• Variable byte scheme
Unary Encoding
• Unary encoding represents a number x with x − 1 zero bits
followed by a single one bit. For example, 5 is
represented as 00001.
• This scheme is effective for very small
numbers, but wasteful for large numbers.
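A minimal sketch using bit strings for readability:

def unary_encode(x):
    return "0" * (x - 1) + "1"  # x-1 zeros, then a one

def unary_decode(bits):
    return bits.index("1") + 1  # position of the first one

print(unary_encode(5))        # 00001
print(unary_decode("00001"))  # 5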
Elias Gamma Coding
• Coding, in 2 steps:
1. Write x in binary.
2. Subtract 1 from the number of bits written in step 1 and
prepend that many zeros.
• Example: the number 9 (binary 1001) is represented by 0001001.
• Decoding: we decode an Elias gamma-coded integer in two steps:
1. Read and count zeroes from the stream until we reach the
first one. Let this count of zeroes be K.
2. Consider the one that was reached to be the first digit of
the integer, with a value of 2^K, and read the remaining K
bits of the integer.
• Example: To decompress 0001001, we first read all zero bits
from the beginning until we see a bit of 1. We have K = 3 zero
bits. We then include the 1 bit with the following 3 bits,
which gives us 1001 (binary for 9).
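A minimal sketch of both directions (bit strings for readability):

def gamma_encode(x):
    b = bin(x)[2:]                 # step 1: x in binary
    return "0" * (len(b) - 1) + b  # step 2: prepend len-1 zeros

def gamma_decode(bits):
    k = bits.index("1")               # step 1: count the leading zeros
    return int(bits[k:2 * k + 1], 2)  # step 2: the one plus the next k bits

print(gamma_encode(9))          # 0001001
print(gamma_decode("0001001"))  # 9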
Elias Delta Coding
• Coding: In Elias delta coding, a positive integer x is stored
with the gamma code representation of 1 + ⌊log2 x⌋, followed by
the binary representation of x less the most significant bit.
• Example: Let us code the number 9. Since 1 + ⌊log2 9⌋ = 4,
we have gamma code 00100 for 4. 9's binary representation less
the most significant bit is 001. Appending the two, 00100 and
001, yields the delta code 00100001 for 9.
• Decoding:
1. Read and count zeroes from the stream until you reach the
first one. Let this count of zeroes be L.
2. Considering the one that was reached to be the first bit of
an integer with a value of 2^L, read the remaining L digits of
the integer. This is the integer M.
3. Put a one in the first place of the final output,
representing the value 2^(M−1). Read and append the
following M − 1 bits.
• Example: We want to decode 00100001. We see that L = 2 after
step 1, and after step 2 we have read and consumed 5 bits and
obtained M = 4 (100 in binary). Finally, we prepend 1 to the
M − 1 remaining bits (001) to give 1001, which is 9 in binary.
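A minimal sketch, reusing gamma_encode from the previous sketch:

def delta_encode(x):
    n = x.bit_length()                   # n = 1 + floor(log2 x)
    return gamma_encode(n) + bin(x)[3:]  # gamma code of n, then x minus its leading 1

def delta_decode(bits):
    l = bits.index("1")
    m = int(bits[l:2 * l + 1], 2)     # gamma-decode the bit length m
    rest = bits[2 * l + 1:2 * l + m]  # the remaining m-1 bits of x
    return int("1" + rest, 2)         # restore the leading 1

print(delta_encode(9))           # 00100001
print(delta_decode("00100001"))  # 9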
Variable-Byte Coding
• Coding: In this method, seven bits in each byte are used to
code an integer, with the least significant bit set to 0 in the
last byte, or to 1 if further bytes follow. In this way, small
integers are represented efficiently.
• Example: 135 is represented in two bytes, since it lies
between 2^7 and 2^14, as 00000011 00001110.
• Decoding is performed in two steps:
1. Read all bytes until a byte with a zero last bit is seen.
2. Remove the least significant bit from each byte read so far
and concatenate the remaining bits.
• For example, 00000011 00001110 is decoded to 00000010000111,
which is 135 in binary.
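A minimal sketch of this scheme (note the continuation flag is in the least significant bit, as described above, rather than the high bit used in some other variable-byte variants):

def vb_encode(x):
    groups = []
    while True:
        groups.append(x & 0x7F)  # peel off 7 bits at a time
        x >>= 7
        if x == 0:
            break
    groups.reverse()
    # LSB = 1 means "more bytes follow"; LSB = 0 marks the last byte
    return bytes([(g << 1) | 1 for g in groups[:-1]] + [groups[-1] << 1])

def vb_decode(data):
    x = 0
    for b in data:
        x = (x << 7) | (b >> 1)  # drop the flag bit, append the 7 payload bits
        if b & 1 == 0:           # last byte reached
            break
    return x

print(vb_encode(135))          # b'\x03\x0e' = 00000011 00001110
print(vb_decode(b"\x03\x0e"))  # 135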
Comparison of compression techniques
• A suitable compression technique can allow retrieval to be
up to twice as fast as without compression, while the space
requirement averages 20% – 25% of the cost of storing
uncompressed integers.
• Variable-byte integers are faster to decode than variable-bit
integers, despite having higher storage costs, because fewer
CPU operations are required to decode them and they are
byte-aligned on disk.