
UNIT II

Boolean and vector-space retrieval models- Term weighting – TF-IDF weighting-


cosine similarity – Preprocessing – Inverted indices – efficient processing with sparse
vectors – Language Model based IR – Probabilistic IR –Latent Semantic Indexing –
Relevance feedback and query expansion.

Introduction
IR Model
Definition:

An IR model is a quadruple [D, Q, F, R(qi, dj)] where


1. D is a set of logical views for the documents in the collection
2. Q is a set of logical views for the user queries
3. F is a framework for modeling documents and queries
4. R(qi, dj) is a ranking function

Basic Concepts
 Each document is represented by a set of representative keywords or index terms
 Index term:
In a restricted sense: it is a keyword that has some meaning on its
own; usually plays the role of a noun
In a more general form: it is any word that appears in a document
 Let, t be the number of index terms in the document
collection ki be a generic index term Then,
 The vocabulary V = {k1, . . . , kt} is the set of all distinct index
terms in the collection

The Term-Document Matrix


 The occurrence of a term ti in a document dj establishes a relation between ti
and dj
 A term-document relation between ti and dj can be quantified by the
frequency of the term in the document
 In matrix form, this can be written as a term-document matrix, where each fi,j element stands for the frequency of term ti in document dj
 Logical view of a document: from full text to a set of index terms
________________________________________________________________________________________

Boolean Retrieval Models


Definition : The Boolean retrieval model is a model for information retrieval in
which the query is in the form of a Boolean expression of terms, combined with
the operators AND, OR, and NOT. The model views each document as just a set
of words.
 Simple model based on set theory and Boolean algebra
 The Boolean model predicts that each document is either relevant or non-
relevant
Example :
A fat book which many people own is Shakespeare's Collected Works.

Problem : To determine which plays of Shakespeare contain the words Brutus


AND Caesar AND NOT Calpurnia.

Method1 : Using Grep


The simplest form of document retrieval is for a computer to do a linear scan
through documents. This process is commonly referred to as grepping through text,
after the Unix command grep. Grepping allows useful possibilities for wildcard
pattern matching through the use of regular expressions. However, this method is
time-consuming and not suitable for large collections.

Method2: Using Boolean Retrieval Model


 The Boolean retrieval model is a model for information retrieval in which we
can pose any query which is in the form of a Boolean expression of terms, that
is, in which terms are combined with the operators AND, OR, and NOT. The
model views each document as just a set of words.
 Terms are the indexed units. We have a vector for each term, which shows
the documents it appears in, or a vector for each document, showing the
terms that occur in it. The result is a binary term-document incidence matrix, as in
Figure .

Term        Antony and  Julius   The      Hamlet  Othello  Macbeth
            Cleopatra   Caesar   Tempest
Antony          1          1        0        0       0        1
Brutus          1          1        0        1       0        0
Caesar          1          1        0        1       1        1
Calpurnia       0          1        0        0       0        0
Cleopatra       1          0        0        0       0        0
Mercy           1          0        1        1       1        1
Worser          1          0        1        1       1        0

A term-document incidence matrix. Matrix element (t, d) is 1 if the play in
column d contains the word in row t, and is 0 otherwise.

 To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the
vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a
bitwise AND:
110100 AND 110111 AND 101111 = 100100

Solution: Antony and Cleopatra, Hamlet

Results from Shakespeare for the query Brutus AND Caesar AND
NOT Calpurnia.
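A minimal Python sketch of this computation, treating each incidence row from the matrix above as a bit vector (the bit-vector encoding is an illustrative choice, not prescribed by the text):

plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

incidence = {                    # one bit per play, leftmost bit = first play
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

def answer(bits, n_docs=6):
    """Return the plays whose bit is set in the result vector."""
    return [plays[i] for i in range(n_docs) if bits & (1 << (n_docs - 1 - i))]

# Brutus AND Caesar AND NOT Calpurnia
mask = (1 << 6) - 1              # 0b111111, to complement within 6 bits
result = incidence["Brutus"] & incidence["Caesar"] & (~incidence["Calpurnia"] & mask)
print(format(result, "06b"))     # 100100
print(answer(result))            # ['Antony and Cleopatra', 'Hamlet']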
Inverted index / inverted File
It is the most efficient structure for supporting ad hoc text search. For each term t, we
store a list of all documents that contain t. An inverted index has two parts: the
dictionary, which is commonly kept in memory, with pointers to each postings list,
which is stored on disk.

i) Dictionary / vocabulary / lexicon: we use dictionary for the data structure and
vocabulary for the set of terms. The dictionary in the figure has been sorted
alphabetically.
ii) Postings: for each term, we keep a list of the IDs of the documents in which the term
appears. This list is called a postings list (or inverted list), and each postings list is
sorted by document ID.

Building an Inverted Index


To gain the speed benefits of indexing at retrieval time, we have to build the index in
advance. Building a basic inverted index uses sort-based indexing.

The major steps in this are:


1. Collect the documents to be indexed.
2. Tokenize the text, turning each document into a list of tokens.
3. Do linguistic preprocessing, producing a list of normalized tokens, which are
the indexing terms.
4. Index the documents that each term occurs in by creating an inverted index,
consisting of a dictionary and postings.

DocID:
Each document has a unique serial number, known as the document
identifier (docID).
Input → dictionary & postings:
The input to indexing is a list of normalized tokens for each document,
which we can equally think of as a list of term-docID pairs. The core
indexing step is sorting this list. Multiple occurrences of the same term
from the same document are then merged, instances of the same term are
grouped, and the result is split into a dictionary and postings.
Document frequency:
The dictionary records some statistics, such as the number of documents
which contain each term. This provides the basis for efficient query
processing.
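A minimal Python sketch of these indexing steps, assuming the documents have already been tokenized and normalized (the three toy documents are illustrative):

from collections import defaultdict

docs = {
    1: ["new", "home", "sales", "top", "forecasts"],
    2: ["home", "sales", "rise", "in", "july"],
    3: ["increase", "in", "home", "sales", "in", "july"],
}

# Collect (term, docID) pairs, sort them, and merge duplicate occurrences
pairs = sorted({(term, doc_id) for doc_id, terms in docs.items() for term in terms})

index = defaultdict(list)        # dictionary -> postings list (sorted docIDs)
for term, doc_id in pairs:
    index[term].append(doc_id)

df = {term: len(postings) for term, postings in index.items()}  # document frequency
print(index["sales"])            # [1, 2, 3]
print(df["july"])                # 2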

Storage (dictionary & postings lists):
1. A fixed-length array would be wasteful, as some words occur in many
documents and others in very few.
2. For an in-memory postings list there are two good alternatives:
a. singly linked lists
b. variable-length arrays
3. A hybrid scheme with a linked list of fixed-length arrays for each term.
Processing Boolean queries

To process a query using an inverted index and the basic Boolean retrieval
model, consider processing the simple conjunctive query Brutus AND Calpurnia
over the inverted index partially shown in the figure.
Steps:
1. Locate Brutus in the dictionary
2. Retrieve its postings
3. Locate Calpurnia in the dictionary
4. Retrieve its postings
5. Intersect the two postings lists, as shown in the figure

Intersecting the postings lists for Brutus and Calpurnia


Algorithm for the intersection of two postings lists p1 and p2.
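The intersection algorithm is the standard linear merge over two sorted lists; a Python sketch (the docIDs below are illustrative):

def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])     # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                   # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))  # [2, 31]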

Drawbacks of the Boolean Model


 Retrieval based on binary decision criteria with no notion of partial matching
 No ranking of the documents is provided (absence of a grading scale)
 Information need has to be translated into a Boolean expression, which most
users find awkward
 The Boolean queries formulated by the users are most often too simplistic
 The model frequently returns either too few or too many documents in response to a
user query
________________________________________________________________________________________

Term weighting
A search engine should return the documents most likely to be useful to the
searcher, in order. Ordering documents with respect to a query is called ranking.

Term-Document Incidence Matrix


A Boolean model only records term presence or absence. Instead, assign a score – say
in [0, 1] – to each document that measures how well the document and query "match".

For a one-term query such as BRUTUS, the score is 1 if the term is present in the
document and 0 otherwise; more appearances of the term in the document should give
a higher score.

Term        Antony and  Julius   The      Hamlet  Othello  Macbeth
            Cleopatra   Caesar   Tempest
Antony          1          1        0        0       0        1
Brutus          1          1        0        1       0        0
Caesar          1          1        0        1       1        1
Calpurnia       0          1        0        0       0        0
Cleopatra       1          0        0        0       0        0
Mercy           1          0        1        1       1        1
Worser          1          0        1        1       1        0
Each document is represented by a binary vector ∈ {0,1}^|V|
Term Frequency tf
One weighting scheme is term frequency, denoted tf t,d, with the
subscripts denoting the term and the document in order.
The term frequency TF(t, d) of term t in document d = the number of times that t
occurs in d.
Example: term-document count matrix
We would like to give more weight to documents that have a term several times
than to ones that contain it only once. To do this we need term frequency
information, the number of times a term occurs in a document, and we assign a
score to represent the number of occurrences.
Each document is represented by a count vector ∈ N^|V|
Bag of Words Model

The exact ordering of the terms in a document is ignored but the number of
occurrences of each term is important.
Example: two documents with similar bag-of-words representations are similar in
content. "Mary is quicker than John" and "John is quicker than Mary" have
identical bag-of-words representations.

This is called the bag of words model.

Log-Frequency Weighting

The log-frequency weight of term t in document d is calculated as

    w(t,d) = 1 + log10 TF(t,d)   if TF(t,d) > 0
    w(t,d) = 0                   otherwise
Document Frequency & Collection Frequency


Document frequency DF(t): the number of documents in the collection that
contain a term t
Collection frequency CF(t): the total number of occurrences of a term t in the
collection
Example:
TF(do, d1) = 2
TF(do, d2) = 0
TF(do, d3) = 3
TF(do, d4) = 3
CF(do) = 8
DF(do) = 3

Rare terms are more informative than frequent terms; to capture this we will
use document frequency (df).
Example: a document containing the common word THE can be about anything,
so we want a very low weight for common terms like THE.

Inverse Document Frequency (idf Weight)

It estimates the rarity of a term in the whole document collection: idf_t is an
inverse measure of the informativeness of t, and df_t ≤ N.
df_t is the document frequency of t: the number of documents that contain t.
The informativeness idf (inverse document frequency) of t is

    idf_t = log10(N / df_t)

The log of N/df_t is used instead of N/df_t itself to dampen the effect of idf.
N is the total number of documents in the collection; for example, with
N = 1,000,000 documents, idf_t = log10(1,000,000 / df_t).
• IDF(t) is high if t is a rare term
• IDF(t) is low if t is a frequent term

TF-IDF Weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight; it is
the best-known weighting scheme in information retrieval. TF(t, d) measures the
importance of a term t in document d, and IDF(t) measures the importance of a
term t in the whole collection of documents.
TF-IDF weighting puts TF and IDF together:

    TFIDF(t, d) = TF(t, d) × IDF(t)
    w(t,d) = (1 + log10 TF(t,d)) × log10(N / df_t)   (if log tf is used)

 High if t occurs many times in a small number of documents, i.e., highly
discriminative in those documents
 Not high if t appears infrequently in a document, or is frequent in many
documents, i.e., not discriminative
 Low if t occurs in almost all documents, i.e., no discrimination at all
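A small Python sketch of this weighting, assuming base-10 logs and the log-tf variant given above:

import math

def tf_weight(tf):
    # 1 + log10(tf) for tf > 0, else 0 (log-frequency weighting)
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf_weight(df, n_docs):
    # idf_t = log10(N / df_t)
    return math.log10(n_docs / df)

def tfidf(tf, df, n_docs):
    return tf_weight(tf) * idf_weight(df, n_docs)

# A term occurring 10 times in a document, present in 100 of 1,000,000 docs:
print(tfidf(10, 100, 1_000_000))   # (1 + 1) * 4 = 8.0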
Simple Query-Document Score
 Scoring finds whether or not a query term is present in a zone (zones:
document features whose content can be arbitrary free text – examples: title,
abstract) within a document.
 If the query contains more than one term, the score for a document-query pair
is the sum over terms t in both q and d:

    Score(q, d) = Σ_{t ∈ q ∩ d} TFIDF(t, d)

 The score is 0 if none of the query terms is present in the document.


________________________________________________________________________________________

VECTOR SPACE RETRIEVAL MODEL


The representation of a set of documents as vectors in a common vector space is
known as the vector space model and is fundamental to a host of information retrieval
operations ranging from scoring documents on a query, document classification and
document clustering.
Each document is represented as a count vector, which is then converted into a
tf-idf weight vector.
Documents as Vectors
Each document is now represented as a real-valued vector of tf-idf weights ∈ R^|V|.
So we have a |V|-dimensional real-valued vector space. Terms are axes of the
space. Documents are points or vectors in this space.
This space is very high-dimensional: tens of millions of dimensions when applied to
a web search engine. Each vector is very sparse – most entries are zero.

To represent a document as a vector:
 each document  a vector
 each term of the document  one component of the vector
 the weight of each component is given by TFIDF(t, d) = TF(t, d) × IDF(t)

Queries as Vectors
Key idea 1: Represent queries as vectors in same space
Key idea 2: Rank documents according to proximity to query in this space
proximity = similarity of vectors
proximity ≈ inverse of distance
This gets away from the Boolean model: more relevant documents are ranked
higher than less relevant documents.

Formalizing Vector Space Proximity

We start by calculating the Euclidean distance

    |q − d| = sqrt( Σ_i (q_i − d_i)² )

But Euclidean distance is large for vectors of different lengths: the Euclidean
distance between q and d2 can be large even though the distribution of terms in
the query q and the distribution of terms in the document d2 are very similar.

Use Angle Instead of Distance

Take a document d and append it to itself; call the result d′. The angle between
d and d′ is 0, corresponding to maximal similarity, even though their Euclidean
distance is large.
Key idea:
 Length is unimportant
 Rank documents according to their angle from the query

Problems with Angle

Angles are expensive to compute. However, cosine is a monotonically decreasing
function of the angle on the interval [0°, 180°], so ranking by increasing angle is
equivalent to ranking by decreasing cosine.

Length Normalization / How Do We Compute the Cosine

Computing cosine similarity involves length-normalizing the document and query
vectors by the L2 norm:

    ||x||₂ = sqrt( Σ_i x_i² )

Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of
the unit hypersphere). This maps vectors onto the unit sphere.

As a result, longer documents and shorter documents have weights of the same order of
magnitude. Effect on the two documents d and d′ (d appended to itself) from earlier
example : they have identical vectors after length-normalization.

Cosine Similarity
Measure the similarity between the document and the query using the cosine of
their vector representations:

    cos(q, d) = (q · d) / (|q| |d|) = Σ_i q_i d_i / ( sqrt(Σ_i q_i²) · sqrt(Σ_i d_i²) )

where q_i is the tf-idf weight of term i in the query and d_i is the tf-idf weight of
term i in the document; cos(q, d) is the cosine similarity of q and d = the cosine
of the angle between q and d.

In reality:
- Length-normalize each document vector when it is added to the index
- Length-normalize the query vector

For normalized vectors, the cosine is equivalent to the dot product or scalar
product:

    cos(q, d) = q · d = Σ_i q_i d_i   (if q and d are length-normalized)
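A short Python sketch of cosine similarity via L2 normalization (the vectors are illustrative):

import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def cosine(q, d):
    # Length-normalize both vectors, then take the dot product
    q, d = l2_normalize(q), l2_normalize(d)
    return sum(qi * di for qi, di in zip(q, d))

print(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # ≈ 1.0 (same direction)
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)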

Example 1: Similarity between three novels

Three novels are taken for discussion, namely
o SaS: Sense and Sensibility
o PaP: Pride and Prejudice
o WH: Wuthering Heights
To find how similar the novels are, the term frequencies tf_t are:

    term        SaS    PaP    WH
    affection   115    58     20
    jealous     10     7      11
    gossip      2      0      6
    wuthering   0      0      38

Log-frequency weights (1 + log10 tf, or 0 if tf = 0):

    term        SaS    PaP    WH
    affection   3.06   2.76   2.30
    jealous     2.00   1.85   2.04
    gossip      1.30   0      1.78
    wuthering   0      0      2.58

After length normalization:

    term        SaS    PaP    WH
    affection   0.789  0.832  0.524
    jealous     0.515  0.555  0.465
    gossip      0.335  0      0.405
    wuthering   0      0      0.588

cos(SaS,PaP) ≈ 0.789×0.832 + 0.515×0.555 + 0.335×0 + 0×0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
cos(SaS,PaP) > cos(*,WH)
Computing Cosine Scores
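The scoring algorithm figure is not reproduced here; the following Python sketch shows the usual term-at-a-time variant, assuming postings already carry tf-idf weights and document lengths have been precomputed:

import heapq

def cosine_score(query_weights, postings, doc_length, k=10):
    """query_weights: {term: w_tq}; postings: {term: [(docID, w_td), ...]};
    doc_length: {docID: ||d||}. Returns the top-k (score, docID) pairs."""
    scores = {}
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_tq * w_td
    for doc_id in scores:
        scores[doc_id] /= doc_length[doc_id]     # length normalization
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))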

Preprocessing

Format/Language: Complications
 A single index usually contains terms of several languages.
 Sometimes a document or its components contain multiple
languages/formats.
 French email with Spanish pdf attachment
 What is the document unit for indexing?
 A file?
 An email?
 An email with 5 attachments?
 A group of files (e.g., PPT or LaTeX split over HTML pages)?
 Upshot: Answering the question “what is a document?” is not trivial and
requires some design decisions.
Determining the vocabulary of terms

Word – A delimited string of characters as it appears in the text.


Term – A “normalized” word (case, morphology, spelling etc); an equivalence
class of words.
Token – An instance of a word or term occurring in a document.
Type – The same as a term in most cases: an equivalence class of tokens.

1) Tokenization:
Task of splitting the document into pieces called tokens.

 Input: “Friends, Romans, countrymen, lend me your ears;”
 Output: Friends | Romans | countrymen | lend | me | your | ears
Tokenization problems: one word or two? (or several)

Ex:
 Hewlett-Packard
 State-of-the-art
 co-education
 the hold-him-back-and-drag-him-away maneuver

Numbers:
 3/20/91, 20/3/91, Mar 20, 1991 (dates)
 B-52
 100.2.86.144 (IP address)
 (800) 234-2333, 800.234.2333 (phone numbers)

2) Normalization
 Need to “normalize” terms in indexed text as well as query terms
into the same form.
 Example: We want to match U.S.A. and USA
 We most commonly implicitly define equivalence classes of terms.
 Alternatively: do asymmetric expansion
 window → window, windows
 windows → Windows, windows
 Windows (no expansion)
 More powerful, but less efficient

Case Folding-
Reduce all letters to lower case
Possible exceptions: capitalized words in mid-sentence
MIT vs. mit
Fed vs. fed
It's often best to lowercase everything, since users will use lowercase
regardless of correct capitalization.

 Normalization and language detection interact.


 PETER WILL NICHT MIT. → MIT = mit
 He got his PhD from MIT. → MIT ≠ mit

Accents and diacritics

 Accents: résumé vs. resume (simple omission of accent)


 Most important criterion: How are users likely to write their queries for
these words? Even in languages that standardly have accents, users
often do not type them.

3) Stop words

 stop words = extremely common words which would appear to be


of little value in helping select documents matching a user need
 Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of,
on, that, the, to, was, were, will, with
 Stop word elimination used to be standard in older IR systems.
 But you need stop words for phrase queries, e.g. “King of Denmark”
 Most web search engines index stop words

4) Lemmatization & Stemming


o Reduce inflectional/variant forms to base form
o Example: am, are, is → be
o Example: car, cars, car’s, cars’ → car
o Example: the boy’s cars are different colors → the boy car be different color
o Lemmatization implies doing “proper” reduction to dictionary headword form
(the lemma).
o Inflectional morphology (cutting → cut) vs. derivational morphology
(destruction → destroy)

o Definition of stemming: Crude heuristic process that chops off the
ends of words in the hope of achieving what “principled”
lemmatization attempts to do with a lot of linguistic knowledge.
o Language dependent
o Example : automate, automatic, automation all reduce to automat

Porter algorithm
 Most common algorithm for stemming English
 Results suggest that it is at least as good as other stemming options
 Conventions + 5 phases of reductions
 Phases are applied sequentially
 Each phase consists of a set of commands.
 Sample command: Delete final ement if what remains is longer than
1 character
 replacement → replac
 cement → cement
 Sample convention: Of the rules in a compound command, select the one
that applies to the longest suffix.
 Porter stemmer: A few rules
    Rule               Example
    SSES → SS          caresses → caress
    IES → I            ponies → poni
    SS → SS            caress → caress
    S → (delete)       cats → cat
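These rules can be tried out with NLTK's implementation of the Porter stemmer (this assumes the nltk package is installed; expected output shown in comments):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "cats", "replacement", "cement"]:
    print(word, "->", stemmer.stem(word))
# caresses -> caress, ponies -> poni, cats -> cat,
# replacement -> replac, cement -> cement (what remains would be too short)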
Other stemmers include the Lovins stemmer and the Paice stemmer. (A comparison
of the stemmers on a sample text is shown in the original figure.)

DOCUMENT PREPROCESSING (REFER DIAGRAM)


Document pre-processing is the process of incorporating a new document into an
information retrieval system. The goals are to represent the document efficiently in
terms of both space (for storing the document) and time (for processing retrieval
requests) requirements, and to maintain good retrieval performance (precision and
recall). Document pre-processing is a complex process that leads to the representation
of each document by a set of index terms. The logical view of a document is given below.

Document pre-processing includes 5 stages:


1. Lexical analysis
2. Stopword elimination
3. Stemming
4. Index-term selection
5. Construction of thesauri

Lexical analysis
Objective: Determine the words of the document.
Lexical analysis separates the input alphabet into
 Word characters (e.g., the letters a-z)
 Word separators (e.g., space, newline, tab)
The following decisions may have an impact on retrieval:
 Digits: Used to be ignored, but the trend now is to identify numbers
(e.g., telephone numbers) and mixed strings as words.
 Punctuation marks: Usually treated as word separators.
 Hyphens: Should we interpret “pre-processing” as “pre processing” or
as “preprocessing”?
 Letter case: Often ignored, but then a search for “First Bank” (a specific bank)
would retrieve a document with the phrase “Bank of America was the first
bank to offer its customers…”

Stopword Elimination
Objective: Filter out words that occur in most of the documents.
Such words have no value for retrieval purposes; they are referred to as
stopwords.
They include
 Articles (a, an, the, …)
 Prepositions (in, on, of, …)
 Conjunctions (and, or, but, if, …)
 Pronouns (I, you, them, it…)
 Possibly some verbs, nouns, adverbs, adjectives (make, thing, similar, …)
 A typical stopword list may include several hundred words.
The 100 most frequent words add up to about 50% of the words in a document. Hence,
stopword elimination reduces the size of the indexing structures.
Stemming
Objective: Replace all the variants of a word with the single stem of the word.
Variants include plurals, gerund forms (ing-form), third person suffixes, past tense
suffixes, etc.
Example: connect: connects, connected, connecting, connection,…
All have similar semantics and relate to a single concept.
In parallel, stemming must be performed on the user query.

Stemming improves storage and search efficiency: fewer terms are stored.
It also improves recall: without stemming, a query about "connection" matches only
documents that contain "connection"; with stemming, the query is about "connect"
and additionally matches documents that originally had "connects", "connected",
"connecting", etc.
However, stemming may hurt precision, because users can no longer target
just a particular form.
Stemming may be performed using
Stemming may be performed using
o Algorithms that strip off suffixes according to substitution rules.
o Large dictionaries that provide the stem of each word.

Index term selection (indexing)


Objective: Increase efficiency by extracting from the resulting document a
selected set of terms to be used for indexing the document.
If full text representation is adopted then all words are used for indexing.
Indexing is a critical process: User's ability to find documents on a particular
subject is limited by the indexing process having created index terms for this subject.
Indexing can be done manually or automatically.
Historically, manual indexing was performed by professional indexers
associated with library organizations. However, automatic indexing is now
more common.

Relative advantages of manual indexing:
1. Ability to perform abstractions (conclude what the subject is) and determine
additional related terms,
2. Ability to judge the value of concepts.

Relative advantages of automatic indexing:


1. Reduced cost: Once initial hardware cost is amortized, operational cost is
cheaper than wages for human indexers.
2. Reduced processing time
3. Improved consistency.
4. Controlled vocabulary: Index terms must be selected from a predefined set of
terms (the domain of the index). Use of a controlled vocabulary helps
standardize the choice of terms. Searching is improved, because users know
the vocabulary being used. Thesauri can compensate for lack of controlled
vocabularies.
5. Index exhaustivity: the extent to which concepts are indexed. Should we index
only the most important concepts, or also more minor concepts?
6. Index specificity: the preciseness of the index term used. Should we use
general indexing terms or more specific terms? Should we use the term
"computer", "personal computer", or “Gateway E-3400”?
7. Main effect: High exhaustivity improves recall (decreases precision). High
specificity improves precision (decreases recall).
8. Related issues: Index title and abstract only, or the entire document? Should
index terms be weighted?

Reducing the size of the index: Recall that articles, prepositions, conjunctions,
pronouns have already been removed through a stopword list. Recall that the 100
most frequent words account for 50% of all word occurrences. Words that are very
infrequent (occur only a few times in a collection) are often removed, under the
assumption that they would probably not be in the user's vocabulary. Reduction not
based on probabilistic arguments: nouns are often preferred over verbs, adjectives,
or adverbs.

Indexing Types: Indexing may also assign weights to terms.


1. Non-weighted indexing: No attempt to determine the value of the different
terms assigned to a document. Not possible to distinguish between major
topics and casual references. All retrieved documents are equal in value.
Typical of commercial systems through the 1980s.
2. Weighted indexing: Attempt made to place a value on each term as a
description of the document. This value is related to the frequency of
occurrence of the term in the document (higher is better), but also to the
number of collection documents that use this term (lower is better). Query
weights and document weights are combined to a value describing the
likelihood that a document matches a query

Thesauri
Objective: Standardize the index terms that were selected. In its simplest form a
thesaurus is a list of “important” words (concepts) and, for each word, an associated
list of synonyms. A thesaurus may be generic (cover all of English) or concentrate on a
particular domain of knowledge. The role of a thesaurus in information retrieval is to
 Provide a standard vocabulary for indexing.
 Help users locate proper query terms.
 Provide hierarchies for automatic broadening or narrowing of queries.
Here, our interest is to provide a standard vocabulary (a controlled vocabulary). This
is the final stage, where each indexing term is replaced by the concept that defines its
thesaurus class.
_____________________________________________________________________________________________
Language models
A document is a good match to a query if the document's model is likely to generate
the query, which will happen if the document contains the query words often.

A language model is a probability distribution over sequences of words. Given such a
sequence, say of length m, it assigns a probability P(w1, . . . , wm) to the whole
sequence. These distributions can be used to predict the likelihood that the next
token in the sequence is a given word. Such probability distributions are called
language models, and they are useful in many natural language processing
applications.

A language model for IR is composed of the following components:
 A set of document language models, one per document dj of the collection
 A probability distribution function that allows estimating the likelihood that a
document language model Mj generates each of the query terms
 A ranking function that combines these generating probabilities for the query
terms

Traditional language model


The traditional language model uses finite automata and is a generative model.

This diagram shows a simple finite automaton and some of the strings in the
language it generates. The full set of strings that can be generated is called the
language of the automaton.

Definition of language model


A language model is a function that puts a probability measure over strings drawn
from some vocabulary. That is, for a language model M over an alphabet Σ:

    Σ_{s∈Σ*} P(s) = 1
Language model example: unigram language model

A unigram language model is specified by its state emission probabilities (one
probability per term). The probability that some text (e.g., a query) was generated
by the model is the product of the emission probabilities of its terms.

The notion of a language model is inherently probabilistic: it places a
probability distribution over any sequence of words.

Query likelihood Model

The language model most used in information retrieval is the query likelihood
model. Here a separate language model is associated with each document in a
collection, and documents are ranked based on the probability of the query q
under each document's language model:

    term   frog    said   that   toad    likes   that   dog
    M1     0.01    0.03   0.04   0.01    0.02    0.04   0.005
    M2     0.0002  0.03   0.04   0.0001  0.04    0.04   0.01

Query q = frog likes toad

P(q|M1) = 0.01 × 0.02 × 0.01 = 2 × 10^-6
P(q|M2) = 0.0002 × 0.04 × 0.0001 = 8 × 10^-10
P(q|M1) > P(q|M2) ⇒ M1 is more likely to generate query q
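The same comparison as a Python sketch (the emission probabilities are taken from the table above):

from math import prod

M1 = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01,
      "likes": 0.02, "dog": 0.005}
M2 = {"frog": 0.0002, "said": 0.03, "that": 0.04, "toad": 0.0001,
      "likes": 0.04, "dog": 0.01}

def query_likelihood(query, model):
    # Unigram assumption: multiply the per-term emission probabilities
    return prod(model.get(t, 0.0) for t in query)

q = ["frog", "likes", "toad"]
print(query_likelihood(q, M1))   # 0.01 * 0.02 * 0.01 = 2e-06
print(query_likelihood(q, M2))   # 0.0002 * 0.04 * 0.0001 = 8e-10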

One simple language model is equivalent to a probabilistic finite automaton
consisting of just a single node with a single probability distribution over producing
different terms, so that Σ_{t∈V} P(t) = 1.

Types of language models

1) Unigram language model
2) Bigram language model
3) N-gram language model

1) Unigram language model:

The simplest form of language model simply throws away all conditioning
context and estimates each term independently. Such a model is called a
unigram language model:

    P_uni(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)

A unigram model used in information retrieval can be treated as the
combination of several one-state finite automata. It splits the probabilities of
different terms in a context. The probability generated for a specific query is
calculated as the product of the probabilities of the individual query terms.

For example, consider two language models M1 and M2, each given by a partial
specification of its word emission probabilities. To find the probability of a word
sequence under a model, multiply the probabilities of the words (given by the
model) by the probability of continuing or stopping after producing each word.

2) Bigram language models


There are many more complex kinds of language models, such as bigram language
models, in which the probability of each term is conditioned on the previous term:

    P_bi(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)

Ex: In a bigram (n = 2) language model, the probability of the sentence "I saw the
red house" is approximated as (<s> marks the start of the sentence and </s> its end)

    P(I saw the red house) ≈ P(I|<s>) P(saw|I) P(the|saw) P(red|the) P(house|red) P(</s>|house)

Most language-modeling work in IR has used unigram language models.

Three ways of developing the language modeling approach:


(a) query likelihood - Probability of generating the query text from a document
language model
(b) document likelihood - Probability of generating the document text from a query
language model
(c) model comparison - Comparing the language models representing the query and
document topics

1) The query likelihood model in IR
The original and basic method for using language models in IR is the query
likelihood model .
Goal: Construct from each document d in the collection a language model Md; rank
documents by P(d|q), where the probability of a document is interpreted as the
likelihood that it is relevant to the query.

Using Bayes rule , we have:


P(d|q) = P(q|d)P(d)/P(q)

P(q) – same for all documents  ignored


P(d) – often treated as uniform across documents  ignored
Could be non uniform prior based on criteria like authority,
length, type, newness, and number of previous people who have
read the document.
o Rank by P(q|d)  the probability of the query q under the language model
derived from d. The probability that a query would be observed as a random
sample from the respective document model
Algorithm:
1. Infer a language model Md for each document d.
2. Estimate P(q|Md), the probability of producing the query given the language
model Md of document d, using maximum likelihood estimation (MLE) and the
unigram assumption:

    P̂(q|Md) = Π_{t∈q} P̂_mle(t|Md) = Π_{t∈q} tf_{t,d} / L_d

where tf_{t,d} is the (raw) term frequency of term t in document d and L_d is the
number of tokens in document d.
3. Rank the documents according to these probabilities.

E.g., P(q|Md3) > P(q|Md1) > P(q|Md2)  d3 is ranked first, d1 second, d2 third

Sparse Data Problem


 The classic problem with using language models is that terms appear very
sparsely in documents; some words do not appear in a document at all. In
such cases:
o If some of the query terms are missing from the document, P(q|Md) = 0 –
the zero probability problem.
We get conjunctive semantics: documents will only give a query
non-zero probability if all of the query terms appear in the document.
o Words that do occur are poorly estimated.
Solution: smoothing

Smoothing
 Decreasing the estimated probability of seen events and increasing the
probability of unseen events is referred to as smoothing

 The role of smoothing in this model is not only to avoid zero probabilities.
we need to smooth probabilities in our document language models:
 to discount non-zero probabilities
 to give some probability mass to unseen words.

 The probability of a non-occurring term should be close to its probability of
occurring in the collection:

    P(t|Mc) = cf(t) / T

where cf(t) = the number of occurrences of term t in the collection and
T = the length of the collection = the sum of all document lengths.
Smoothing Methods

Linear interpolation (mixer model): mixes the probability from the document with
the general collection frequency of the word:

    P(t|d) = λ P̂_mle(t|Md) + (1 − λ) P̂_mle(t|Mc)

 A high value of λ gives a “conjunctive-like” search that tends to retrieve
documents containing all query words.
 A low value of λ is more disjunctive, suitable for long queries.
 Correctly setting λ is very important for good performance.

Bayesian smoothing:

    P(t|d) = (tf_{t,d} + α P(t|Mc)) / (L_d + α)

Summary, with linear interpolation:

    P(q|d) ∝ Π_{t∈q} ( λ P(t|Md) + (1 − λ) P(t|Mc) )
2) Document likelihood model


There are other ways to use the language modeling idea in IR settings. In the
document likelihood model we compute the probability of a query language
model Mq generating the document.

3) Model comparison
Make a language model from both the query and the document, and measure how
different these LMs are from each other, using KL divergence.
KL divergence (Kullback–Leibler divergence)
An asymmetric divergence measure from information theory, which measures
how bad the probability distribution Md is at modeling Mq:

    R(d; q) = KL(Mq ∥ Md) = Σ_t P(t|Mq) log( P(t|Mq) / P(t|Md) )

Rank by KL divergence: the closer to 0, the higher the rank.

LMs vs. vector space model

 LMs vs. vector space model: commonalities
 Term frequency is directly in the model (but it is not scaled in LMs).
 Probabilities are inherently “length-normalized”; cosine normalization
does something similar for the vector space model.
 Mixing document and collection frequencies has an effect similar to idf:
terms rare in the general collection, but common in some documents, will
have a greater influence on the ranking.
 LMs vs. vector space model: differences
 LMs: based on probability theory.
 Vector space: based on similarity, a geometric/linear algebra notion.
 Collection frequency vs. document frequency.
 Details of term frequency, length normalization, etc.

Probabilistic information retrieval


Idea: Given a user query q, and the ideal answer set R of the relevant
documents, the problem is to specify the properties for this set
– Assumption (probabilistic principle): the probability of relevance depends on
the query and document representations only; ideal answer set R should
maximize the overall probability of relevance
– The probabilistic model tries to estimate the probability that the user
will find the document dj relevant with ratio
P(dj relevant to q)/P(dj nonrelevant to q)

Given a query q, there exists a subset R of the documents which are relevant to q,
but membership of R is uncertain. A probabilistic retrieval model ranks
documents in decreasing order of probability of relevance to the information need:
P(R|q, di)

Probability theory provides a principled foundation for such reasoning under
uncertainty. This model estimates how likely it is that a document is relevant to an
information need.

A document can be relevant or nonrelevant; we can estimate the probability of a
term t appearing in a relevant document, P(t|R=1).

Probabilistic methods are one of the oldest but also one of the currently hottest
topics in IR.

Probabilistic IR Models :

 Classical probabilistic retrieval model
 Probability ranking principle
 Binary Independence Model
 BestMatch25 (Okapi BM25)
 Bayesian networks for text retrieval
 Language model approach to IR

 Bayes' rule for inverting conditional probabilities:

    P(A|B) = P(B|A) P(A) / P(B)

 The odds of an event (the ratio of the probability of an event to the probability of
its complement) provide a kind of multiplier for how probabilities change:

    O(A) = P(A) / P(Ā) = P(A) / (1 − P(A))
The Document Ranking Problem

In the ranked retrieval setup, for a given collection of documents, the user issues a
query, and an ordered list of documents is returned. Assume a binary notion of
relevance: Rd,q is a random dichotomous variable (a categorical variable that
can take on exactly two values), such that

    Rd,q = 1 if document d is relevant w.r.t. query q
    Rd,q = 0 otherwise

Probabilistic ranking orders documents decreasingly by their estimated probability of


relevance w.r.t. query: P(R = 1|d, q)

The Probability Ranking Principle

PRP in brief
If the retrieved documents (w.r.t a query) are ranked decreasingly on their
probability of relevance, then the effectiveness of the system will be the best
that is obtainable
PRP in full
If [the IR] system's response to each [query] is a ranking of the documents [...] in
order of decreasing probability of relevance to the [query], where the probabilities
are estimated as accurately as possible on the basis of whatever data have been
made available to the system for this purpose, the overall effectiveness of the
system to its user will be the best that is obtainable on the basis of those data.

1/0 loss:
 Either returning a nonrelevant document or failing to return a relevant
document is called 1/0 loss.
 The goal is to return the best possible results as the top k documents, for
any value of k the user chooses to examine.
 The PRP then says to simply rank all documents in decreasing order of
P(R=1|d, q). If a set of retrieval results is to be returned, rather than an
ordering, the Bayes Optimal Decision Rule, the decision which minimizes the
risk of loss, is to simply return documents that are more likely relevant than
nonrelevant:

    d is relevant iff P(R=1|d, q) > P(R=0|d, q)

The PRP with retrieval costs

Let C1 be the cost of not retrieving a relevant document, C0 the cost of retrieving a
nonrelevant document, d the next document to be retrieved, and d′ a not-yet-retrieved
document. Then the Probability Ranking Principle says to retrieve d next if, for all
not-yet-retrieved documents d′,

    C0 · P(R=0|d) − C1 · P(R=1|d) ≤ C0 · P(R=0|d′) − C1 · P(R=1|d′)
The Binary Independence Model (BIM)


This model is traditionally used with the PRP. 'Binary' (equivalent to Boolean) means
that documents and queries are represented as binary term incidence vectors: e.g.,
document d is represented by the vector x = (x1, . . . , xM), where
xt = 1 if term t occurs in d
and xt = 0 otherwise.
Different documents may have the same vector representation. 'Independence'
means there is no association between terms.

To make a probabilistic retrieval strategy precise, we need to estimate how terms in
documents contribute to relevance.

 Find measurable statistics (term frequency, document frequency, document


length) that affect judgments about document relevance
 Combine these statistics to estimate the probability P(R|d, q) of document
relevance

P(R|d, q) is modeled using term incidence vectors as

    P(R=1|x, q) = P(x|R=1, q) P(R=1|q) / P(x|q)
    P(R=0|x, q) = P(x|R=0, q) P(R=0|q) / P(x|q)

Here P(x|R=1, q) and P(x|R=0, q) are the probabilities that, if a relevant (or
nonrelevant) document is retrieved, then that document's representation is x;
P(R=1|q) and P(R=0|q) are the prior probabilities of retrieving a relevant (or
nonrelevant) document for a query q, estimated from the percentage of relevant
documents in the collection.

Since a document is either relevant or nonrelevant to a query, we must have that

    P(R=1|x, q) + P(R=0|x, q) = 1

Deriving a ranking function for query terms

Given a query q, ranking documents by P(R=1|d, q) is modeled under BIM as
ranking them by P(R=1|x, q).

It is easier to rank documents by their odds of relevance (which gives the same
ranking):

    O(R|x, q) = P(R=1|x, q) / P(R=0|x, q)
              = [ P(R=1|q) / P(R=0|q) ] · [ P(x|R=1, q) / P(x|R=0, q) ]

The first factor, P(R=1|q)/P(R=0|q), is a constant for a given query and can be
ignored for ranking. Using the (naive Bayes) independence assumption,

    P(x|R=1, q) / P(x|R=0, q) = Π_t P(xt|R=1, q) / P(xt|R=0, q)

Since each xt is either 0 or 1, we can separate the terms to give:

    O(R|x, q) ∝ Π_{t: xt=1} [ P(xt=1|R=1, q) / P(xt=1|R=0, q) ] · Π_{t: xt=0} [ P(xt=0|R=1, q) / P(xt=0|R=0, q) ]
Let pt = P(xt = 1|R = 1, q) be the probability of a term appearing in a document
relevant to the query, and ut = P(xt = 1|R = 0, q) the probability of a term appearing
in a nonrelevant document. These can be displayed as a contingency table.

An additional simplifying assumption: terms not occurring in the query are equally
likely to occur in relevant and nonrelevant documents; if qt = 0, then pt = ut.

Now we need only consider terms in the products that appear in the query:

    O(R|x, q) ∝ Π_{t: xt=qt=1} pt/ut · Π_{t: xt=0, qt=1} (1−pt)/(1−ut)

The left product is over query terms found in the document and the right product is
over query terms not found in the document.

The Retrieval Status Value (RSV) of this model:

    RSV_d = log Π_{t: xt=qt=1} [ pt(1−ut) / (ut(1−pt)) ] = Σ_{t: xt=qt=1} ct

Equivalently, rank documents using the log odds ratios ct for the terms in the query:

    ct = log [ pt/(1−pt) ] − log [ ut/(1−ut) ]

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if the
document is relevant (pt/(1 − pt)), and (ii) the odds of the term appearing if the
document is nonrelevant (ut/(1 − ut)).
 ct = 0: the term has equal odds of appearing in relevant and nonrelevant docs
 ct positive: higher odds of appearing in relevant documents
 ct negative: higher odds of appearing in nonrelevant documents
ct functions as a term weight, and the retrieval status value for document d is
RSV_d = Σ_{t: xt=qt=1} ct.
 So BIM and the vector space model are identical on an operational level, except
that the term weights are different. In particular, we can use the same data
structures (inverted index etc.) for the two models.

How to compute the probability estimates in practice

For each term t in a query, estimate ct in the whole collection using a contingency
table of counts of documents in the collection, where dft is the number of documents
that contain term t, N is the total number of documents, S is the number of relevant
documents, and s is the number of relevant documents that contain t:

    pt = s/S and ut = (dft − s)/(N − S)

Avoiding zeroes:

If any of the counts is zero, then the term weight is not well-defined; maximum
likelihood estimates do not work for rare events.

To avoid zeros, add 0.5 to each count (expected likelihood estimation, ELE).
For example, use S − s + 0.5 in the formula in place of S − s.

Assuming that relevant documents are a very small percentage of the collection, we
can approximate statistics for nonrelevant documents by statistics from the whole
collection. Hence, ut (the probability of term occurrence in nonrelevant documents
for a query) is dft/N, and

    log (1 − ut)/ut = log (N − dft)/dft ≈ log N/dft

which is the familiar idf weight. This approximation cannot easily be extended to pt
(the probability of term occurrence in relevant documents).
How different are vector space and BIM?

They are not that different. In either case you build an information retrieval scheme
in exactly the same way. For probabilistic IR, at the end, you score queries not by
cosine similarity and tf-idf in a vector space, but by a slightly different formula
motivated by probability theory. The next step is to add term frequency and length
normalization to the probabilistic model.

Latent semantic indexing

Latent semantic indexing (LSI) is an indexing and retrieval method that uses a
mathematical technique called singular value decomposition (SVD) to identify
patterns in the relationships between the terms and concepts contained in an
unstructured collection of text. LSI is based on the principle that words that are used
in the same contexts tend to have similar meanings.

Problems with Lexical Semantics

 Ambiguity and association in natural language


 Polysemy: Words often have a multitude of meanings and different
types of usage (more severe in very heterogeneous collections).
 Synonymy: Different terms may have an identical or a similar
meaning (weaker: words indicating the same topic).

 LSI performs a low-rank approximation of the document-term matrix.

In LSI

 Map documents (and terms) to a low-dimensional representation.

 Design a mapping such that the low-dimensional space reflects


semantic associations (latent semantic space).

 Compute document similarity based on the inner product in this


latent semantic space

 We will decompose the term-document matrix into a product of matrices.

 The particular decomposition we'll use: singular value decomposition (SVD).

 SVD: C = UΣV T (where C = term-document matrix)

 We will then use the SVD to compute a new, improved term-


document matrix C′.

 We‟ll get better similarity values out of C′ (compared to C).

 Using SVD for this purpose is called latent semantic indexing or LSI.
Steps:

1. Decompose the term-document matrix into a product of matrices. The
particular decomposition used is the singular value decomposition (SVD):
C = UΣV^T (where C = term-document matrix).
 The term matrix U consists of one (row) vector for each term
 The document matrix V^T consists of one (column) vector for each document
 The singular value matrix Σ is a diagonal matrix whose singular values
reflect the importance of each dimension
2. Use the SVD to compute a new, improved term-document matrix C′ by
reducing the dimensionality of the space, which gives better similarity values
compared to C.
3. Map the query into the reduced space: q2 = Σ2^{-1} U2^T q
4. This follows from: C2 = U2 Σ2 V2^T ⇒ Σ2^{-1} U2^T C2 = V2^T
5. Compute the similarity of q2 with all reduced documents in V2.
6. Output the ranked list of documents.
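A Python sketch of these steps using numpy's SVD (the small term-document matrix and the query are illustrative):

import numpy as np

C = np.array([[1, 0, 1, 0, 0, 0],     # term-document matrix (rows = terms)
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]        # rank-k truncation

# Fold a query q into the latent space: q_k = Sigma_k^{-1} U_k^T q
q = np.array([1, 0, 1, 0, 0], dtype=float)
qk = np.diag(1 / sk) @ Uk.T @ q

# Cosine similarity of the query against each reduced document (columns of Vtk)
docs_k = Vtk.T
sims = docs_k @ qk / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(qk))
print(np.argsort(-sims))                        # documents ranked by similarity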

Why we use LSI in information retrieval

LSI takes documents that are semantically similar (i.e., talk about the same
topics) but are not similar in the vector space (because they use different
words) and re-represents them in a reduced vector space in which they have
higher similarity.
Thus, LSI addresses the problems of synonymy and semantic relatedness.
In the standard vector space, synonyms contribute nothing to document
similarity.
How LSI addresses synonymy and semantic relatedness

The dimensionality reduction forces us to omit a lot of “detail”.


We have to map different words (= different dimensions of the full space) to the
same dimension in the reduced space.
The “cost” of mapping synonyms to the same dimension is much less than the
cost of collapsing unrelated words.
SVD selects the “least costly” mapping; thus, it will map synonyms to the same
dimension, but it will avoid doing that for unrelated words.
LSI: Comparison to other approaches

 Relevance feedback and query expansion are used to increase recall in


information retrieval – if query and documents have (in the extreme
case) no terms in common.
 LSI increases recall but can hurt precision.
 Thus, it addresses the same problems as (pseudo) relevance feedback
and query expansion .
Example:

This is a standard term-document matrix. Actually, we use a non-weighted matrix
here to simplify the example.
The matrix U - consists of one (row) vector for each term

One row per term, one column per min(M, N), where M is the number of terms and N
is the number of documents. This is an orthonormal matrix: (i) row vectors have unit
length; (ii) any two distinct row vectors are orthogonal to each other. Think of the
dimensions as “semantic” dimensions that capture distinct topics like politics, sports,
economics. Each number uij in the matrix indicates how strongly related term i is to
the topic represented by semantic dimension j.


The matrix Σ

This is a square, diagonal matrix of dimensionality min(M, N) × min(M, N). The diagonal
consists of the singular values of C. The magnitude of a singular value measures
the importance of the corresponding semantic dimension. We'll make use of this by
omitting unimportant dimensions.

The matrix V^T
One column per document, one row per min(M, N), where M is the number of terms
and N is the number of documents. Again, this is an orthonormal matrix: (i) column
vectors have unit length; (ii) any two distinct column vectors are orthogonal to each
other. Each number vij in the matrix indicates how strongly related document i is to
the topic represented by semantic dimension j.

Reducing the dimensionality to 2

Actually, we only zero out singular values in Σ. This has the effect of setting the
corresponding dimensions in U and V^T to zero when computing the product C = UΣV^T.

Why the reduced matrix is “better”

Original matrix C vs. reduced C2 = UΣ2V^T

We can view C2 as a two-dimensional representation of the matrix. We have
performed a dimensionality reduction to two dimensions.

Similarity of d2 and d3 in the original space: 0.
Similarity of d2 and d3 in the reduced space:
0.52×0.28 + 0.36×0.16 + 0.72×0.36 + 0.12×0.20 + (−0.39)×(−0.08) ≈ 0.52
_________________________________________________________

Relevance feedback and pseudo relevance feedback

The idea of relevance feedback is to involve the user in the retrieval process so as
to improve the final result set. In particular, the user gives feedback on the relevance
of documents in an initial set of results. The basic procedure is:

 The user issues a (short, simple) query.


 The search engine returns a set of documents.
 User marks some docs as relevant, some as nonrelevant.
 Search engine computes a new representation of the information need.
Hope: better than the initial query.
 Search engine runs new query and returns new results.
 New results have (hopefully) better recall.

We can iterate this: several rounds of relevance feedback. We will use the term
ad hoc retrieval to refer to regular retrieval without relevance feedback.
Key concept for relevance feedback: Centroid

The centroid is the center of mass of a set of points. Recall that we represent
documents as points in a high-dimensional space; thus we can compute centroids
of documents.
Definition:

    μ(D) = (1/|D|) Σ_{d∈D} v(d)

where D is a set of documents and v(d) is the vector that represents document d.

The Rocchio algorithm for relevance feedback


 The Rocchio algorithm implements relevance feedback in the vector space model.
 Rocchio chooses the query q_opt that maximizes

    q_opt = argmax_q [ sim(q, μ(Dr)) − sim(q, μ(Dnr)) ]

 Dr: set of relevant docs; Dnr: set of nonrelevant docs
 Intent: q_opt is the vector that separates relevant and nonrelevant docs maximally.
 Making some additional assumptions, we can rewrite this as:

    q_opt = μ(Dr) + [ μ(Dr) − μ(Dnr) ]

 The optimal query vector is therefore:

    q_opt = (1/|Dr|) Σ_{dj∈Dr} dj + [ (1/|Dr|) Σ_{dj∈Dr} dj − (1/|Dnr|) Σ_{dj∈Dnr} dj ]

 The problem is that we don't know the truly relevant docs. Instead, we move
the centroid of the known relevant documents by the difference between the
two centroids.

The Rocchio optimal query for separating relevant and nonrelevant documents.
Rocchio 1971 algorithm

    q_m = α q0 + β (1/|Dr|) Σ_{dj∈Dr} dj − γ (1/|Dnr|) Σ_{dj∈Dnr} dj

Relevance feedback on initial query

 We can modify the query based on relevance feedback and apply the standard
vector space model, using only the docs that were marked. Relevance feedback
can improve recall and precision; it is most useful for increasing recall in
situations where recall is important, and users can be expected to review
results and take time to iterate.
qm: modified query vector; q0: original query vector; Dr and Dnr : sets of known relevant
and nonrelevant documents respectively; α, β, and γ: weights

 New query moves towards relevant documents and away from nonrelevant
documents.

 Tradeoff α vs. β/γ: If we have a lot of judged documents, we want a higher β/γ.

 Set negative term weights to 0.

 “Negative weight” for a term doesn't make sense in the vector space model.

 Positive feedback is more valuable than negative feedback.

For example, set β = 0.75, γ = 0.25 to give higher weight to positive

feedback. Many systems only allow positive feedback.
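A Python sketch of the Rocchio update with these default weights (the vectors are illustrative; clamping negative weights to 0 follows the rule above):

import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    qm = alpha * q0
    if len(relevant):
        qm = qm + beta * np.mean(relevant, axis=0)      # centroid of relevant docs
    if len(nonrelevant):
        qm = qm - gamma * np.mean(nonrelevant, axis=0)  # centroid of nonrelevant docs
    return np.maximum(qm, 0.0)                          # no negative term weights

q0 = np.array([1.0, 0.0, 0.0])
rel = np.array([[0.8, 0.4, 0.0], [0.6, 0.6, 0.0]])
nonrel = np.array([[0.0, 0.0, 1.0]])
print(rocchio(q0, rel, nonrel))                         # [1.525 0.375 0.   ]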

Computing the Rocchio vector (illustration)

1) Circles: relevant documents; Xs: nonrelevant documents.
2) Compute the centroid of the relevant documents; by itself it does not separate
relevant from nonrelevant documents.
3) Compute the centroid of the nonrelevant documents.
4) Form the difference vector between the two centroids.
5) Add the difference vector to the centroid of the relevant documents.
6) The resulting Rocchio vector separates relevant from nonrelevant documents
perfectly.

Disadvantages of Relevance Feedback

 Relevance feedback is expensive.
 Relevance feedback creates long modified queries, and long queries are
expensive to process.
 Users are reluctant to provide explicit feedback.
 It's often hard to understand why a particular document was retrieved
after applying relevance feedback.
 The search engine Excite had full relevance feedback at one point, but
abandoned it later.

Pseudo relevance feedback / blind relevance feedback

Pseudo-relevance feedback automates the “manual” part of true relevance feedback.

Pseudo-relevance algorithm:

1) Retrieve a ranked list of hits for the user's query
2) Assume that the top k documents are relevant
3) Do relevance feedback (e.g., Rocchio), as sketched below

This works very well on average, but can go horribly wrong for some queries;
several iterations can cause query drift.
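A minimal sketch of pseudo relevance feedback on top of the rocchio() function sketched earlier (rank_fn, a function returning docIDs ranked for a query vector, is a hypothetical placeholder):

def pseudo_relevance_feedback(q0, rank_fn, doc_vectors, k=10):
    top_k = rank_fn(q0)[:k]                       # initial retrieval
    assumed_relevant = [doc_vectors[d] for d in top_k]  # assume top k are relevant
    return rocchio(q0, assumed_relevant, [])      # no explicit nonrelevant docs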

Query Expansion

 Query expansion is another method for increasing recall. In global query
expansion, the query is modified based on some global resource, i.e., a resource
that is not query-dependent.
 There are two broad types of method for refining the query:
 Global methods reformulate query terms independent of the query results:
o using a thesaurus or WordNet
o via automatic thesaurus generation
 Local methods reformulate the query based on initial results for the query:
o using relevance feedback
o using pseudo relevance feedback
o using indirect feedback, e.g., click-through data
Types of user feedback

There are two types of feedback


1) Feedback on documents - More common in relevance feedback
2) Feedback on words or phrases. - More common in query expansion
Types of query expansion

1) Manual thesaurus (maintained by editors, e.g., PubMed)


2) Automatically derived thesaurus (e.g., based on co-occurrence statistics)
3) Query-equivalence based on query log mining (common on the web as in the
“palm” example)

Thesaurus-based query expansion

For each term t in the query, expand the query with words the thesaurus lists as
semantically related with t.

Example : HOSPITAL → MEDICAL

Generally increases recall; may significantly decrease precision, particularly with
ambiguous terms:

INTEREST RATE → INTEREST RATE FASCINATE

It's very expensive to create a manual thesaurus and to maintain it over time. A
manual thesaurus has an effect roughly equivalent to annotation with a controlled
vocabulary.

Automatic thesaurus generation

 Attempt to generate a thesaurus automatically by analyzing the distribution of


words in documents
 Fundamental notion: similarity between two words
 Definition 1: Two words are similar if they co-occur with similar words.
 “car” ≈ “motorcycle” because both occur with “road”, “gas” and
“license”, so they must be similar.
 Definition 2: Two words are similar if they occur in a given grammatical
relation with the same words.
 You can harvest, peel, eat, prepare, etc. apples and pears, so apples and
pears must be similar.
 Quality of associations is usually a problem. Term ambiguity may introduce
irrelevant statistically correlated terms.
 “Apple computer” → “Apple red fruit computer”
 Problems:
 False positives: Words deemed similar that are not
 False negatives: Words deemed dissimilar that are similar
 Since terms are highly correlated anyway, expansion may not retrieve many
additional documents.
 Co-occurrence is more robust, grammatical relations are more accurate.

Co-occurence-based thesaurus: Examples


The simplest way to compute a co-occurrence thesaurus is based on term-term
similarities in

    C = A Aᵀ

where A is the term-document matrix and w_{i,j} is the (normalized) weight for
(t_i, d_j). For each t_i, pick the terms with high values in C as its nearest neighbors.
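A Python sketch of this computation (the tiny weight matrix A and the term list are illustrative):

import numpy as np

terms = ["car", "motorcycle", "road", "apple"]
A = np.array([[0.8, 0.6, 0.0],        # rows: terms, columns: documents
              [0.7, 0.7, 0.1],
              [0.6, 0.5, 0.2],
              [0.0, 0.1, 0.9]])

C = A @ A.T                            # C[i, j] = similarity of term i and term j
np.fill_diagonal(C, -np.inf)           # ignore self-similarity
for i, t in enumerate(terms):
    print(t, "->", terms[int(np.argmax(C[i]))])  # nearest neighbor per term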

Word          Nearest neighbors
absolutely    absurd, whatsoever, totally, exactly, nothing
bottomed      dip, copper, drops, topped, slide, trimmed
captivating   shimmer, stunningly, superbly, plucky, witty
doghouse      dog, porch, crawling, beside, downstairs
makeup        repellent, lotion, glossy, sunscreen, skin, gel
mediating     reconciliation, negotiate, case, conciliation
keeping       hoping, bring, wiping, could, some, would
lithographs   drawings, Picasso, Dali, sculptures, Gauguin
pathogens     toxins, bacteria, organisms, bacterial, parasite
senses        grasp, psyche, truly, clumsy, naive, innate

(WordSpace demo on the web)
