Module 3 Indexing Part A
• Retrieval variables:
– queries, documents, terms, relevance judgements, users, information needs
• Boolean queries
– Used by Boolean retrieval model and in other models
– Boolean query ≠ Boolean model
• Probabilistic models
Other Model Dimensions
• Logical View of Documents
– Index terms
– Full text
– Full text + Structure (e.g. hypertext)
• User Task
– Retrieval
– Browsing
Issues for Vector Space Model
• How to determine important words in a document?
– Word sense?
– Word n-grams (and phrases, idioms, …) as terms
• How to determine the degree of importance of a term
within a document and within the entire collection?
• How to determine the degree of similarity between
a document and the query?
• In the case of the web, what is a collection and
what are the effects of links, formatting
information, etc.?
The Vector-Space Model
• Assume t distinct terms remain after pre-
processing (the index terms or the vocabulary).
• These “orthogonal” terms form a vector space.
Dimension = t = |vocabulary|
• Each term, i, in a document or query, j, is
given a real-valued weight, wij.
• Both documents and queries are expressed as
t-dimensional vectors:
dj = (w1j, w2j, …, wtj)    q = (w1q, w2q, …, wtq)
Document Collection
• Collection of n documents can be represented in the vector
space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a
term in the document;
– zero means the term has no significance in the document or it
simply doesn’t exist in the document.
      T1   T2   …   Tt
D1   w11  w21   …  wt1
D2   w12  w22   …  wt2
 :     :    :        :
Dn   w1n  w2n   …  wtn
• tf-idf weighting is typical: wij = tfij · idfi = tfij · log2(N / dfi)
Graphic Representation
Example (three index terms T1, T2, T3):
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2, and Q drawn as vectors in the 3-dimensional term space]
wik = tfik · log(N / nk)
where:
Tk = term k in document Di
tfik = frequency of term Tk in document Di
idfk = inverse document frequency of term Tk in C, idfk = log(N / nk)
N = total number of documents in the collection C
nk = the number of documents in C that contain Tk
Inverse Document Frequency
• IDF provides high values for rare words and low values
for common words. The most frequent words are not the
most descriptive.
For a collection of 10,000 documents (N = 10000):
idf = log(10000 / 10000) = 0
idf = log(10000 / 5000) = 0.301
idf = log(10000 / 20) = 2.699
idf = log(10000 / 1) = 4
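A quick check of these values in Python (a minimal sketch; base-10 logarithms, as in the example above):

import math

def idf(N, df):
    # inverse document frequency: log(N / df)
    return math.log10(N / df)

N = 10000
for df in (10000, 5000, 20, 1):
    print(df, round(idf(N, df), 3))
# prints: 10000 0.0, 5000 0.301, 20 2.699, 1 4.0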
Query Vector
• The query vector is typically treated as a document
and is also tf-idf weighted.
• Similarity is then computed between vectors, where wij is the
weight of term i in document j and wiq is the weight of
term i in the query.
Binary:
– D = 1, 1, 1, 0, 1, 1, 0
– Q = 1, 0, 1, 0, 0, 1, 1
– Size of vector = size of vocabulary = 7
– 0 means the corresponding term is not found in the
document or query
sim(D, Q) = 3
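As a sanity check, the binary inner product is simply a count of shared terms (a minimal sketch):

D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]

# inner product: count the positions where both vectors have a 1
print(sum(d * q for d, q in zip(D, Q)))  # 3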
Weighted:
D1 = 2T1 + 3T2 + 5T3 D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3
Inner product: sim(dj, q) = Σ (i = 1 to t) wij · wiq

CosSim(dj, q) = (dj · q) / (|dj| |q|)
             = Σ (i = 1 to t) wij · wiq / ( √(Σ wij²) · √(Σ wiq²) )

[Figure: D1, D2, and Q as vectors in the term space t1, t2, t3]

D1 = 2T1 + 3T2 + 5T3   D2 = 3T1 + 7T2 + 1T3   Q = 0T1 + 0T2 + 2T3
Inner products: D1 · Q = 10, D2 · Q = 2
CosSim(D1, Q) = 10 / √((4+9+25) · (0+0+4)) = 0.81
CosSim(D2, Q) = 2 / √((9+49+1) · (0+0+4)) = 0.13
D1 is 6 times better than D2 using cosine similarity but only 5 times better using
inner product.
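A minimal sketch reproducing these numbers in Python (vectors written over T1, T2, T3):

import math

def cos_sim(d, q):
    # inner product normalized by the lengths of both vectors
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cos_sim(D1, Q), 2))  # 0.81
print(round(cos_sim(D2, Q), 2))  # 0.13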
Simple Implementation
1. Convert all documents in collection D to tf-idf weighted
vectors, dj, for keyword vocabulary V.
2. Convert query to a tf-idf-weighted vector q.
3. For each dj in D do
Compute score sj = cosSim(dj, q)
4. Sort documents by decreasing score.
5. Present top ranked documents to the user.
Time complexity: O(|V|·|D|) Bad for large V & D !
|V| = 10,000; |D| = 100,000; |V|·|D| = 1,000,000,000
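A sketch of steps 3–5 (illustrative only; cos_sim as defined above, and doc_vectors assumed to be a dict mapping document IDs to tf-idf vectors):

def retrieve(doc_vectors, q, k=10):
    # step 3: score every document against the query
    scores = [(cos_sim(d, q), doc_id) for doc_id, d in doc_vectors.items()]
    # step 4: sort by decreasing score
    scores.sort(reverse=True)
    # step 5: present the top-ranked documents
    return scores[:k]

This brute-force loop is exactly the O(|V|·|D|) cost noted above; the inverted indexes discussed later avoid touching documents that share no terms with the query.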
Comments on Vector Space Models
• Simple, mathematically based approach.
• Considers both local (tf) and global (idf) word
occurrence frequencies.
• Provides partial matching and ranked results.
• Tends to work quite well in practice despite
obvious weaknesses.
• Allows efficient implementation for large
document collections.
Problems with Vector Space Model
• Assumption of term independence
• Missing semantic information (e.g. word sense).
• Missing syntactic information (e.g. phrase
structure, word order, proximity information).
• Lacks the control of a Boolean model (e.g.,
requiring a term to appear in a document).
– Given a two-term query “A B”,
• may prefer a document containing A frequently but not B,
over a document that contains both A and B, but both less
frequently.
Mechanism of Query Processing
1. The relevant inverted indices are found
– Typically the indexes are in memory; otherwise this step
alone could take a full half second
2. If they are bit vectors, they are ANDed or ORed, then
materialized; then the lists are handled
• The result is many URLs.
• The next step is to determine their rank so the highest-ranked
URLs can be delivered to the user.
Ranking Pages
• Indexes have returned pages. Which ones
are most relevant to you?
There are many criteria for ranking pages:
– Presence of all words
– All words close together
– Words in important locations and formats on
the page
– Words near anchor text of links in referring
pages
• But the killer criterion is PageRank
PageRank Intuition
• You need to find a plumber. How do you do it?
1. Call plumbers and talk to them
2. Better: call friends and ask for plumber references
• Then choose the plumbers who have the most references
3. Better still: call friends who know a lot about plumbers (important friends)
and ask them for plumber references
• Then choose the plumbers who have the most references from important
people.
• Technique 1 was used before Google.
• Google introduced technique 2 to search engines
• Google also introduced technique 3
• Techniques 2, and especially 3, wiped out the competition.
• The big challenge: determine which pages are important
What does this mean for pages?
1. Most search engines look for pages
containing the word "plumber"
2. Google searches for pages that are linked to
by pages containing "plumber".
3. Google searches for pages that are linked to
by important pages containing "plumber".
• A web page is important if many important pages link to it.
– This is a recursive equation.
– Google solves it by imagining a web walker/crawler.
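A minimal power-iteration sketch of that recursive definition (the toy graph and the damping factor 0.85 are illustrative assumptions, not Google's actual parameters):

def pagerank(links, iters=50, d=0.85):
    # links: page -> list of pages it links to
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            targets = outs if outs else pages  # dangling pages spread rank evenly
            for t in targets:
                new[t] += d * rank[p] / len(targets)
        rank = new
    return rank

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
# C ends up with the most rank: it is linked to by both A and B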
Inverted Files
• Inverted files are used to index text
• The indices are appropriate when the
text collection is large and semi-static
• If the text collection is volatile, online
searching is the only option
• Some techniques combine online and
indexed searching
IR System: What Do You Need?
• Vocabulary List
– Text preprocessing modules
• lexical analysis, stemming, stopwords
• Occurrences of Vocabulary Terms
– Inverted index creation
• term frequency in documents, document frequency
• Retrieval and Ranking Algorithm
• Query and Ranking Interfaces
• Browsing/Visualization Interface
Pros and cons of Indexing
Advantages
– Can be searched quickly, e.g., by binary search: O(log n)
– Good for sequential processing, e.g., comp*
– Convenient for batch updating
– Economical use of storage
Disadvantages
– Index must be rebuilt if an extra term is added
Inverted index
• The inverted index of a document collection is
basically a data structure that
– associates each distinct term with a list of all
documents that contain the term.
• Thus, in retrieval, it takes constant time to
– find the documents that contain a query term.
– Multiple query terms are also easily handled.
An example
[Figure: an inverted index for a small example collection]
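A minimal sketch of such an index in Python, using a dict from term to a sorted postings list of document IDs (the three toy documents are invented for illustration):

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

index = {}
for doc_id, text in docs.items():
    for term in text.split():
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:  # record each doc once
            postings.append(doc_id)

print(index["home"])  # [1, 2, 3]
print(index["july"])  # [2, 3]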
Search using inverted index
Given a query q, search has the following steps:
• Step 1 (vocabulary search): find each
term/word in q in the inverted index.
• Step 2 (results merging): Merge results to find
documents that contain all or some of the
words/terms in q.
• Step 3 (rank score computation): rank the
resulting documents/pages using
– content-based ranking
– link-based ranking
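A sketch of steps 1 and 2 for an all-terms (AND) query, intersecting the sorted postings lists built in the previous sketch:

def intersect(p1, p2):
    # merge two sorted postings lists, keeping the common doc IDs
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def search_and(query, index):
    postings = [index.get(t, []) for t in query.split()]  # step 1: vocabulary search
    postings.sort(key=len)  # intersect the shortest lists first
    result = postings[0]
    for p in postings[1:]:  # step 2: results merging
        result = intersect(result, p)
    return result

print(search_and("home july", index))  # [2, 3]

Step 3 would then order [2, 3] by a content-based score such as the cosine similarity shown earlier, possibly combined with a link-based score.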
Index construction
Step 1: Building the vocabulary trie
• The construction of an inverted index is done efficiently
using a trie data structure.
• The time complexity of the index construction is O(T),
where T is the number of all terms (including duplicates) in
the document collection (after pre-processing).
• For each document, the algorithm scans it sequentially and,
for each term, looks the term up in the trie.
• If it is found, the document ID and other information (e.g.,
the offset of the term) are added to the inverted list of the
term.
• If the term is not found, a new leaf is created to represent
the term.
Step 2: Merging partial indices
• The partial index I1 obtained at some point in time is written to disk.
• Then, we process the subsequent documents and build the partial index
I2 in memory, and so on.
• After all documents have been processed, we have k partial indices, I1,
I2, …, Ik, on disk. We then merge the partial indices in a hierarchical
manner.
• That is, we first perform pair-wise merges of I1 and I2, I3 and I4, and
so on. This gives us larger indices I1-2, I3-4 and so on.
• After the first level merging is complete, we proceed to the second
level merging, i.e., we merge I1-2 and I3-4, I5-6 and I7-8 and so on.
This process continues until all the partial indices are merged into a
single index.
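A sketch of one pairwise merge (partial indices as term → sorted postings dicts; in a real system the merge streams sorted runs from disk rather than holding both indices in memory):

def merge_indices(i1, i2):
    merged = {}
    for term in i1.keys() | i2.keys():  # union of the two vocabularies
        # the two partial indices cover disjoint document ranges,
        # so sorting the concatenated postings restores doc-ID order
        merged[term] = sorted(i1.get(term, []) + i2.get(term, []))
    return merged

I12 = merge_indices({"home": [1, 2], "sales": [1]}, {"home": [3], "july": [3]})
print(I12["home"])  # [1, 2, 3]

The hierarchical scheme applies merge_indices pairwise: I1-2 = merge(I1, I2), I3-4 = merge(I3, I4), then merge(I1-2, I3-4), and so on until a single index remains.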
Index Compression
• An inverted index can be very large, so reducing the index
size becomes an important issue, both to save space and to
speed up the search.
• A natural solution is index compression, which aims to
represent the same information with fewer bits or bytes.
• Using compression, the size of an inverted index can be
reduced dramatically.
• With lossless compression, the original index can be
reconstructed exactly from the compressed version.
Index compression techniques
• There are two classes of compression schemes for inverted
lists: variable-bit schemes and variable-byte schemes.
• Variable bit scheme
– Unary Encoding
– Elias delta
– Elias gamma
• Variable byte scheme
Unary Encoding
• Unary encoding represents a number x with x − 1 zero bits
followed by a single one bit. For example, 5 is
represented as 00001.
• This scheme is effective for very small
numbers, but wasteful for large numbers.
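A minimal sketch using bit strings for readability:

def unary_encode(x):
    return "0" * (x - 1) + "1"  # x-1 zeros, then a one

def unary_decode(bits):
    return bits.index("1") + 1  # position of the first one

print(unary_encode(5))        # 00001
print(unary_decode("00001"))  # 5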
Elias Gamma Coding
• Coding, in 2 steps:
1. Write x in binary.
2. Subtract 1 from the number of bits written in step 1 and
prepend that many zeros.
• Example: the number 9 (binary 1001) is represented by 0001001.
• Decoding: we decode an Elias gamma-coded integer in two steps:
1. Read and count zeroes from the stream until we reach the
first one. Let this count of zeroes be K.
2. Consider the one that was reached to be the first digit of
the integer, with a value of 2^K, and read the remaining K
bits of the integer.
• Example: To decompress 0001001, we first read all zero bits
from the beginning until we see a bit of 1. We have K = 3 zero
bits. We then include the 1 bit with the following 3 bits,
which gives us 1001 (binary for 9).
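A minimal sketch of both directions (bit strings for readability):

def gamma_encode(x):
    b = bin(x)[2:]                 # step 1: x in binary
    return "0" * (len(b) - 1) + b  # step 2: prepend len-1 zeros

def gamma_decode(bits):
    k = bits.index("1")               # step 1: count the leading zeros
    return int(bits[k:2 * k + 1], 2)  # step 2: the one plus the next k bits

print(gamma_encode(9))          # 0001001
print(gamma_decode("0001001"))  # 9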
Elias Delta Coding
• Coding: In Elias delta coding, a positive integer x is stored
with the gamma code representation of 1 + ⌊log2 x⌋, followed by
the binary representation of x less the most significant bit.
• Example: Let us code the number 9. Since 1 + ⌊log2 9⌋ = 4,
we have gamma code 00100 for 4. 9's binary representation less
the most significant bit is 001. Appending the two, 00100 and
001, yields the delta code 00100001 for 9.
• Decoding:
1. Read and count zeroes from the stream until you reach the
first one. Let this count of zeroes be L.
2. Considering the one that was reached to be the first bit of
an integer with a value of 2^L, read the remaining L digits of
the integer. This is the integer M.
3. Put a one in the first place of the final output,
representing the value 2^(M−1). Read and append the
following M − 1 bits.
• Example: We want to decode 00100001. We see that L = 2 after
step 1, and after step 2 we have read and consumed 5 bits and
obtained M = 4 (100 in binary). Finally, we prepend 1 to the
M − 1 remaining bits (001) to give 1001, which is 9 in binary.
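A minimal sketch, reusing gamma_encode from the previous sketch:

def delta_encode(x):
    n = x.bit_length()                   # n = 1 + floor(log2 x)
    return gamma_encode(n) + bin(x)[3:]  # gamma code of n, then x minus its leading 1

def delta_decode(bits):
    l = bits.index("1")
    m = int(bits[l:2 * l + 1], 2)     # gamma-decode the bit length m
    rest = bits[2 * l + 1:2 * l + m]  # the remaining m-1 bits of x
    return int("1" + rest, 2)         # restore the leading 1

print(delta_encode(9))           # 00100001
print(delta_decode("00100001"))  # 9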
Variable-Byte Coding
• Coding: In this method, seven bits in each byte are used to
code an integer, with the least significant bit set to 0 in the
last byte, or to 1 if further bytes follow. In this way, small
integers are represented efficiently.
• Example: 135 is represented in two bytes, since it lies
between 2^7 and 2^14, as 00000011 00001110.
• Decoding is performed in two steps:
1. Read all bytes until a byte with a zero last bit is seen.
2. Remove the least significant bit from each byte read so far
and concatenate the remaining bits.
• For example, 00000011 00001110 is decoded to 00000010000111,
which is 135 in binary.
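A minimal sketch of this scheme (note the continuation flag is in the least significant bit, as described above, rather than the high bit used in some other variable-byte variants):

def vb_encode(x):
    groups = []
    while True:
        groups.append(x & 0x7F)  # peel off 7 bits at a time
        x >>= 7
        if x == 0:
            break
    groups.reverse()
    # LSB = 1 means "more bytes follow"; LSB = 0 marks the last byte
    return bytes([(g << 1) | 1 for g in groups[:-1]] + [groups[-1] << 1])

def vb_decode(data):
    x = 0
    for b in data:
        x = (x << 7) | (b >> 1)  # drop the flag bit, append the 7 payload bits
        if b & 1 == 0:           # last byte reached
            break
    return x

print(vb_encode(135))          # b'\x03\x0e' = 00000011 00001110
print(vb_decode(b"\x03\x0e"))  # 135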
Comparison of compression techniques
• A suitable compression technique can allow retrieval to be
up to twice as fast as without compression, while the space
requirement averages 20% – 25% of the cost of storing
uncompressed integers.
• Variable-byte integers are faster to decode than variable-bit
integers, despite having higher storage costs, because fewer
CPU operations are required to decode them and they are
byte-aligned on disk.