Module 3 - Retrieval Models and Indexing

Boolean, Vector Space, Probabilistic


Outline
• Retrieval models
• Search algorithm
• Indexing process and inverted list
• Index Compression
What is a retrieval model?
• An idealization or abstraction of an actual process (retrieval)
– results in a measure of similarity between query and document

• May describe the computational process


– e.g. how documents are ranked
– note that inverted file is an implementation not a model

• May attempt to describe the human process


– e.g. the information need, search strategy, etc

• Retrieval variables:
– queries, documents, terms, relevance judgements, users, information needs

• Have an explicit or implicit definition of relevance


Mathematical models
• Study the properties of the process
• Draw conclusions or make predictions
– Conclusions derived depend on whether model is a
good approximation of the actual situation

• Statistical models represent repetitive processes


– predict frequencies of interesting events
– use probability as the fundamental tool
Exact Match Retrieval
• Retrieving documents that satisfy a Boolean expression
constitutes the Boolean exact match retrieval model
– query specifies precise retrieval criteria
– every document either matches or fails to match query
– result is a set of documents (no order)
• Advantages:
– efficient
– predictable, easy to explain
– structured queries
– work well when the user knows exactly what documents are
required
Exact-match Retrieval Model
• Disadvantages:
– query formulation difficult for most users
– difficulty increases with collection size (why?)
– indexing vocabulary same as query vocabulary
– acceptable precision generally means unacceptable
recall
– ranking models are consistently better
• Best-match or ranking models are now more
common
Boolean retrieval
• Most common exact-match model
– queries: logic expressions with doc features as operands
– retrieved documents are generally not ranked
– query formulation difficult for novice users

• Boolean queries
– Used by Boolean retrieval model and in other models
– Boolean query ≠ Boolean model

• “Pure” Boolean operators: AND, OR, and NOT


• Most systems have proximity operators
• Most systems support simple regular expressions as search
terms to match spelling variants
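To make the exact-match behaviour concrete, here is a minimal Python sketch; the toy inverted index and the query are illustrative assumptions, not part of the slides.

# Toy inverted index: term -> set of matching document IDs (assumed for illustration).
index = {
    "information": {1, 2, 4},
    "retrieval":   {1, 4},
    "database":    {2, 3},
}

# Boolean query: information AND retrieval AND NOT database
result = (index["information"] & index["retrieval"]) - index["database"]
print(sorted(result))   # [1, 4] -- an unordered set of matches, no ranking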
Classes of Retrieval Models
• Boolean models (set theoretic)
– Extended Boolean

• Vector space models (statistical/algebraic)


– Generalized VS
– Latent Semantic Indexing

• Probabilistic models
Other Model Dimensions
• Logical View of Documents
– Index terms
– Full text
– Full text + Structure (e.g. hypertext)

• User Task
– Retrieval
– Browsing
Issues for Vector Space Model
• How to determine important words in a document?
– Word sense?
– Word n-grams (and phrases, idioms, …) → terms
• How to determine degree of importance of a term
within a document and within the entire collection?
• How to determine the degree of similarity between
a document and the query?
• In the case of the web, what is a collection and
what are the effects of links, formatting
information, etc.?
The Vector-Space Model
• Assume t distinct terms remain after pre-
processing (the index terms or the vocabulary).
• These “orthogonal” terms form a vector space.
Dimension = t = |vocabulary|
• Each term, i, in a document or query, j, is
given a real-valued weight, wij.
• Both documents and queries are expressed as
t-dimensional vectors:
dj = (w1j, w2j, …, wtj)
Document Collection
• Collection of n documents can be represented in the vector
space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a
term in the document;
– zero means the term has no significance in the document or it
simply doesn’t exist in the document.
T1 T2 …. Tt
D1 w11 w21 … wt1
D2 w12 w22 … wt2
: : : :
: : : :
Dn w1n w2n … wtn
• tf-idf weighting is typical: wij = tfij · idfi = tfij · log2(N / dfi)
Graphic Representation
Example (terms T1, T2, T3):
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2 and Q plotted as vectors in the 3-dimensional term space spanned by T1, T2, T3]
Is D1 or D2 more similar to Q?
How to measure the degree of similarity? Distance? Angle? Projection?
Term Weights: Term Frequency
• More frequent terms in a document are more
important, i.e. more indicative of the topic.
fij = frequency of term i in document j

• May want to normalize term frequency (tf)


across the entire corpus:
tfij = fij / max{fij}
Term Weights: IDF
• Terms that appear in many different documents
are less indicative of overall topic.
dfi = document frequency of term i
    = number of documents containing term i
idfi = inverse document frequency of term i
     = log2(N / dfi)
(N: total number of documents)
• Recall: indication of a term’s discrimination
power.
• Log used to dampen the effect relative to tf.
Simple tf*idf

wik = tfik · log(N / nk)

Tk   = term k in document Di
tfik = frequency of term Tk in document Di
idfk = inverse document frequency of term Tk in C
N    = total number of documents in the collection C
nk   = the number of documents in C that contain Tk
idfk = log(N / nk)
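A short Python sketch of this weighting follows; the three toy documents are an assumption for illustration, and log base 2 is used as on the earlier slide.

import math
from collections import Counter

docs = [
    "information retrieval models",
    "boolean retrieval",
    "vector space models for retrieval",
]
N = len(docs)                                              # total number of documents
tokenized = [d.split() for d in docs]
df = Counter(t for toks in tokenized for t in set(toks))   # nk: documents containing term k

def tfidf(tokens):
    tf = Counter(tokens)                                   # tfik: raw term frequency
    return {t: tf[t] * math.log2(N / df[t])                # wik = tfik * log(N / nk)
            for t in tf if df[t] > 0}                      # skip terms absent from the collection

print(tfidf(tokenized[0]))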
Inverse Document Frequency
• IDF provides high values for rare words and low values for common words; the most frequent words are not the most descriptive.
• For a collection of 10,000 documents (N = 10000; base-10 logarithms shown):
  log(10000 / 10000) = 0
  log(10000 / 5000) = 0.301
  log(10000 / 20) = 2.699
  log(10000 / 1) = 4
Query Vector
• Query vector is typically treated as a document
and also tf-idf weighted.

• Alternative is for the user to supply weights for


the given query terms.
Similarity Measure
• A similarity measure is a function that
computes the degree of similarity between two
vectors.
• Using a similarity measure between the query
and each document:
– It is possible to rank the retrieved documents in the
order of presumed relevance.

– It is possible to enforce a certain threshold so that


the size of the retrieved set can be controlled.
Similarity Measure - Inner Product
• The similarity between the vectors for document dj and query q can
be computed as the vector inner product:

sim(dj, q) = dj • q = Σ (i = 1 to t) wij · wiq

where wij is the weight of term i in document j and wiq is the weight of
term i in the query

• For binary vectors, the inner product is the number of matched


query terms in the document (size of intersection).

• For weighted term vectors, it is the sum of the products of the


weights of the matched terms.
Inner Product -- Examples

Binary:
– D = 1, 1, 1, 0, 1, 1, 0
Size of vector = size of vocabulary = 7
– Q = 1, 0 , 1, 0, 0, 1, 1
0 means corresponding term not found in
document or query
sim(D, Q) = 3

Weighted:
D1 = 2T1 + 3T2 + 5T3 D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3

sim(D1 , Q) = 2*0 + 3*0 + 5*2 = 10


sim(D2 , Q) = 3*0 + 7*0 + 1*2 = 2
Cosine Similarity Measure
• Cosine similarity measures the cosine of the angle between two vectors.
• It is the inner product normalized by the vector lengths:

CosSim(dj, q) = (dj • q) / (|dj| · |q|)
              = Σ (i = 1 to t) wij · wiq / ( sqrt(Σ wij^2) · sqrt(Σ wiq^2) )

D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3

D1 is 6 times better than D2 using cosine similarity but only 5 times better using
inner product.
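The worked example above can be reproduced with a few lines of Python (vectors given as plain lists over T1, T2, T3):

import math

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]

def inner(d, q):
    return sum(wd * wq for wd, wq in zip(d, q))                 # inner product

def cos_sim(d, q):
    return inner(d, q) / math.sqrt(inner(d, d) * inner(q, q))   # normalize by vector lengths

print(inner(D1, Q), inner(D2, Q))                               # 10 2
print(round(cos_sim(D1, Q), 2), round(cos_sim(D2, Q), 2))       # 0.81 0.13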
Simple Implementation
1. Convert all documents in collection D to tf-idf weighted
vectors, dj, for keyword vocabulary V.
2. Convert query to a tf-idf-weighted vector q.
3. For each dj in D do
Compute score sj = cosSim(dj, q)
4. Sort documents by decreasing score.
5. Present top ranked documents to the user.
Time complexity: O(|V|·|D|) Bad for large V & D !
|V| = 10,000; |D| = 100,000; |V|·|D| = 1,000,000,000
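A sketch of these five steps, reusing the tfidf() helper from the earlier sketch together with a sparse (dict-based) cosine similarity; docs and query are assumed inputs.

import math

def cos_sim_sparse(d, q):                                  # cosine over {term: weight} dicts
    num = sum(w * q.get(t, 0.0) for t, w in d.items())
    den = math.sqrt(sum(w * w for w in d.values())) * math.sqrt(sum(w * w for w in q.values()))
    return num / den if den else 0.0

def rank(docs, query, k=10):
    doc_vecs = [tfidf(d.split()) for d in docs]            # 1. weight every document
    q_vec = tfidf(query.split())                           # 2. weight the query the same way
    scores = [(cos_sim_sparse(dv, q_vec), i)               # 3. score each document
              for i, dv in enumerate(doc_vecs)]
    scores.sort(reverse=True)                              # 4. sort by decreasing score
    return scores[:k]                                      # 5. present the top-ranked documents

print(rank(docs, "retrieval models"))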
Comments on Vector Space Models
• Simple, mathematically based approach.
• Considers both local (tf) and global (idf) word
occurrence frequencies.
• Provides partial matching and ranked results.
• Tends to work quite well in practice despite
obvious weaknesses.
• Allows efficient implementation for large
document collections.
Problems with Vector Space Model
• Assumption of term independence
• Missing semantic information (e.g. word sense).
• Missing syntactic information (e.g. phrase
structure, word order, proximity information).
• Lacks the control of a Boolean model (e.g.,
requiring a term to appear in a document).
– Given a two-term query “A B”,
• may prefer a document containing A frequently but not B,
over a document that contains both A and B, but both less
frequently.
Mechanism of Query Processing
1. The relevant inverted index entries are found
   – Typically the indexes are in memory; otherwise this step could take up to half a second
2. If they are bit vectors, they are ANDed or ORed, then materialized; otherwise the posting lists are processed
• The result is a set of many URLs.
• The next step is to determine their rank so the highest-ranked
URLs can be delivered to the user.
Ranking Pages
• Indexes have returned pages. Which ones
are most relevant to you?
• There are many criteria for ranking pages:
– Presence of all words
– All words close together
– Words in important locations and formats on
the page
– Words near anchor text of links in reference
pages
• But the killer criterion is PageRank
PageRank Intuition
• You need to find a plumber. How do you do it?
1. Call plumbers and talk to them
2. ! Call friends and ask for plumber references
• Then choose plumbers who have the most references
3. !! Call friends who know a lot about plumbers (important friends) and
ask them for plumber references
• Then choose plumbers who have the most references from important
people.
• Technique 1 was used before Google.
• Google introduced technique 2 to search engines
• Google also introduced technique 3
• Techniques 2, and especially 3, wiped out the competition.
• The big challenge: determine which pages are important
What does this mean for pages?
1. Most search engines look for pages
containing the word "plumber"
2. Google searches for pages that are linked to
by pages containing "plumber".
3. Google searches for pages that are linked to
by important pages containing "plumber".
• A web page is important if many
important pages link to it.
– This is a recursive equation.
– Google solves it by imagining a web
walker/Crawler.
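The slides do not give the formula, but the recursive idea can be sketched as a simple power iteration over a tiny assumed link graph (the damping factor 0.85 is the usual convention):

links = {1: [2, 3], 2: [3], 3: [1], 4: [3]}      # page -> pages it links to (assumed example)
pages = list(links)
rank = {p: 1 / len(pages) for p in pages}        # start with uniform importance

for _ in range(50):                              # iterate until the scores stabilise
    new = {p: 0.15 / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            new[q] += 0.85 * rank[p] / len(outs) # a page shares its importance over its out-links
    rank = new

print(sorted(rank.items(), key=lambda kv: -kv[1]))   # page 3, with the most in-links, ranks highest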
Indexing vs. online searching
• Inverted files are used to index text
• The indices are appropriate when the
text collection is large and semi-static
• If the text collection is volatile, online searching is the only option
• Some techniques combine online and
indexed searching
IR System: What Do You Need?
• Vocabulary List
– Text preprocessing modules
• lexical analysis, stemming, stopwords
• Occurrences of Vocabulary Terms
– Inverted index creation
• term frequency in documents, document frequency
• Retrieval and Ranking Algorithm
• Query and Ranking Interfaces
• Browsing/Visualization Interface
Pros and cons of Indexing
Advantages:
– Can be searched quickly, e.g., by binary search, O(log n)
– Good for sequential processing, e.g., comp*
– Convenient for batch updating
– Economical use of storage
Disadvantages:
– Index must be rebuilt if an extra term is added
Inverted index
• The inverted index of a document collection is
basically a data structure that
– associates each distinct term with a list of all
documents that contain the term.
• Thus, in retrieval, it takes constant time to
– find the documents that contain a query term;
– multiple query terms are also easily handled

An example
[Figure: an example document collection and its inverted index]
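Since the original figure is not reproduced here, the following small Python example illustrates the structure; the three documents are an assumption.

from collections import defaultdict

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "new home construction",
}

index = defaultdict(list)
for doc_id, text in docs.items():
    for term in text.split():
        if doc_id not in index[term]:     # each distinct term -> list of documents containing it
            index[term].append(doc_id)

print(index["home"])    # [1, 2, 3]
print(index["sales"])   # [1, 2]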
Search using inverted index
Given a query q, search has the following steps:
• Step 1 (vocabulary search): find each
term/word in q in the inverted index.
• Step 2 (results merging): Merge results to find
documents that contain all or some of the
words/terms in q.
• Step 3 (Rank score computation): rank the
resulting documents/pages using
– content-based ranking
– link-based ranking
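A minimal sketch of steps 1 and 2, reusing the toy index built in the previous example; the rank-score computation of step 3 is only indicated.

def search(query, index):
    postings = [set(index.get(term, [])) for term in query.split()]   # step 1: vocabulary lookup
    if not postings:
        return []
    all_terms = set.intersection(*postings)          # step 2: documents containing all query terms
    some_terms = set.union(*postings)                #         ...or at least some of them
    return sorted(all_terms) or sorted(some_terms)   # step 3 would rank these results

print(search("new sales", index))   # [1]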
Index construction
[Figure: vocabulary trie and inverted lists]
Step 1: Vocabulary search
• The construction of an inverted index is done efficiently
using a trie data structure.
• The time complexity of the index construction is O(T),
where T is the number of all terms (including duplicates) in
the document collection (after pre-processing).
• For each document, the algorithm scans it sequentially and,
for each term, looks the term up in the trie.
• If it is found, the document ID and other information (e.g.,
the offset of the term) are added to the inverted list of the
term.
• If the term is not found, a new leaf is created to represent
the term.
Step2: Results merging
• The partial index I1 obtained at a point of time is written on the disk.
• Then, we process the subsequent documents and build the partial index
I2 in memory, and so on.
• After all documents have been processed, we have k partial indices, I1,
I2, …, Ik, on disk. We then merge the partial indices in a hierarchical
manner.
• That is, we first perform pair-wise merges of I1 and I2, I3 and I4, and
so on. This gives us larger indices I1-2, I3-4 and so on.
• After the first level merging is complete, we proceed to the second
level merging, i.e., we merge I1-2 and I3-4, I5-6 and I7-8 and so on.
This process continues until all the partial indices are merged into a
single index.
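A sketch of this pair-wise, hierarchical merge, assuming each partial index is an in-memory dict {term: sorted posting list}; a real system would merge sorted runs on disk instead.

def merge_two(a, b):
    merged = dict(a)
    for term, postings in b.items():             # union the posting lists term by term
        merged[term] = sorted(set(merged.get(term, [])) | set(postings))
    return merged

def merge_all(partials):
    while len(partials) > 1:                     # one level of pair-wise merges per pass
        nxt = [merge_two(partials[i], partials[i + 1])
               for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:                    # an odd index out is carried to the next level
            nxt.append(partials[-1])
        partials = nxt
    return partials[0]

# merge_all([I1, I2, I3, I4]) first builds I1-2 and I3-4, then merges those into one index.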
Index Compression
• An inverted index can be very large, so reducing the index
size becomes an important issue and also speeds up the search.
• A natural solution is index compression, which
aims to represent the same information with fewer bits or
bytes.
• Using compression, the size of an inverted index can be
reduced dramatically.
• With lossless compression, the original index can be
reconstructed exactly from the compressed version.
Index compression techniques
• The two classes of compression schemes for inverted lists:
the variable-bit scheme and the variable-byte scheme.
• Variable bit scheme
– Unary Encoding
– Elias delta
– Elias gamma
• Variable byte scheme
Unary Encoding
• It represents a number x with x-1 zero bits
followed by a single one bit. For example, 5 is
represented as 00001.
• This scheme is effective for very small
numbers, but wasteful for large numbers.
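As a sketch (bit strings stand in for a real bit stream):

def unary_encode(x: int) -> str:
    return "0" * (x - 1) + "1"       # x-1 zeros followed by a single one

def unary_decode(bits: str) -> int:
    return bits.index("1") + 1       # count the zeros before the first one

print(unary_encode(5))               # 00001
print(unary_decode("00001"))         # 5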
Elias Gamma Coding
• Coding (in 2 steps):
1. Write x in binary.
2. Subtract 1 from the number of bits written in step 1 and
prepend that many zeros.
Example: the number 9 is represented by 0001001.
• Decoding: we decode an Elias gamma-coded integer in two steps:
1. Read and count zeroes from the stream until we reach the
first one. Let this count of zeroes be K.
2. Consider the one that was reached to be the first digit of
the integer, with a value of 2^K, and read the remaining K bits of
the integer.
Example: to decompress 0001001, we first read all zero bits
from the beginning until we see a bit of 1. We have K = 3 zero
bits. We then include the 1 bit with the following 3 bits, which
gives us 1001 (binary for 9).
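The two steps above translate directly into a short sketch:

def gamma_encode(x: int) -> str:
    b = bin(x)[2:]                      # step 1: x in binary, e.g. '1001' for 9
    return "0" * (len(b) - 1) + b       # step 2: prepend (number of bits - 1) zeros

def gamma_decode(bits: str) -> int:
    k = bits.index("1")                 # count the leading zeros: K
    return int(bits[k:2 * k + 1], 2)    # the one plus the following K bits

print(gamma_encode(9))                  # 0001001
print(gamma_decode("0001001"))          # 9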
Elias Delta Coding
• Coding: in Elias delta coding, a positive integer x is stored as the
gamma code representation of 1 + floor(log2 x), followed by the
binary representation of x less its most significant bit.
Example: let us code the number 9. Since 1 + floor(log2 9) = 4,
we have the gamma code 00100 for 4. 9's binary representation
less the most significant bit is 001. Appending the two, 00100
and 001, yields the delta code 00100001 for 9.
• Decoding:
1. Read and count zeroes from the stream until you reach the first
one. Let this count of zeroes be L.
2. Considering the one that was reached to be the first bit of an
integer with a value of 2^L, read the remaining L digits of the
integer. This is the integer M.
3. Put a one in the first place of the final output, representing the
value 2^(M-1), then read and append the following M-1 bits.
Example: we want to decode 00100001. We see that L = 2 after
step 1, and after step 2 we have read and consumed 5 bits and
obtained M = 4 (100 in binary). Finally, we prepend 1 to the
M-1 = 3 following bits (001) to give 1001, which is 9 in binary.
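A corresponding sketch, building on gamma_encode/gamma_decode from the previous example:

def delta_encode(x: int) -> str:
    b = bin(x)[2:]                                  # binary representation of x
    return gamma_encode(len(b)) + b[1:]             # gamma code of the bit length, then x less its MSB

def delta_decode(bits: str) -> int:
    l = bits.index("1")                             # leading zeros of the embedded gamma code
    m = int(bits[l:2 * l + 1], 2)                   # gamma-decoded value: the bit length of x
    return int("1" + bits[2 * l + 1:2 * l + m], 2)  # re-attach the implicit leading one

print(delta_encode(9))                              # 00100001
print(delta_decode("00100001"))                     # 9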
Variable-Byte Coding
• Coding: in this method, seven bits in each byte are used to
code an integer, with the least significant bit set to 0 in the last
byte, or to 1 if further bytes follow. In this way, small integers
are represented efficiently.
Example: 135 is represented in two bytes, since it lies between
2^7 and 2^14, as 00000011 00001110.
• Decoding is performed in two steps:
1. Read all bytes until a byte with a zero last bit is seen.
2. Remove the least significant bit from each byte read so far
and concatenate the remaining bits.
Example: 00000011 00001110 is decoded to 00000010000111,
which is 135.
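A sketch using the convention above (continuation flag in the least significant bit: 1 means more bytes follow, 0 marks the last byte):

def vbyte_encode(x: int) -> bytes:
    groups = []
    while True:
        groups.append(x & 0x7F)                          # peel off 7 bits at a time
        x >>= 7
        if x == 0:
            break
    groups.reverse()                                     # most significant group first
    out = bytearray((g << 1) | 1 for g in groups[:-1])   # flag 1: more bytes follow
    out.append(groups[-1] << 1)                          # flag 0: last byte
    return bytes(out)

def vbyte_decode(data: bytes) -> int:
    x = 0
    for b in data:
        x = (x << 7) | (b >> 1)                          # drop the flag bit, keep the 7 payload bits
        if b & 1 == 0:
            break
    return x

print(" ".join(format(b, "08b") for b in vbyte_encode(135)))  # 00000011 00001110
print(vbyte_decode(vbyte_encode(135)))                        # 135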
Comparison of compression techniques
[Comparison table/figure not reproduced]
• A suitable compression technique can allow
retrieval to be up to twice as fast as without
compression.
• The space requirement averages 20% - 25% of the
cost of storing uncompressed integers.
• Variable byte integers are faster than variable-bit
integers, despite having higher storage costs,
because fewer CPU operations are required to
decode variable-byte integers and they are byte-
aligned on disk.
