ISE Information Retrieval Mod-V

The document discusses different models of information retrieval including classical, non-classical, and alternative models. Classical models include boolean, vector space, and probabilistic models. Non-classical models are based on principles other than similarity, probability, or boolean operations. Alternative models enhance classical models using techniques from other fields.


NLP - Module V

Information Retrieval(IR)
By:
Savitha N J
Asst. Professor, Dept. of CSE, CMRIT,
Bengaluru-560037
What is IR?
• Information retrieval (IR) deals with the organization, storage,
retrieval and evaluation of information relevant to a user’s query
written in a natural language.

• ‘An information retrieval system does not inform (i.e. change the
knowledge of) the user on the subject of her inquiry. It merely
informs on the existence (or non-existence) and whereabouts of
documents relating to her request.’
Design of a basic IR system
Basic IR process
• Problems:
• Representation of documents and queries.
• Matching the query representation with the document representation.
• Documents are represented using ‘index terms’ or ‘keywords’, which provide a logical view of the document.
Indexing
• Transforming a text document into some logical representation.
• Ex: Inverted indexing
• Reducing the set of representative keywords:
• Stop word elimination
• Stemming
• Zipf’s law
• ‘Term weighting’ to indicate the significance of an index term for a
document.
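The inverted-index idea mentioned above can be sketched as follows (a minimal illustration; the document texts and IDs are invented):

```python
# Minimal inverted index: map each term to the set of documents containing it.
docs = {
    "d1": "information retrieval is concerned with organization of information",
    "d2": "a user formulates a request in the form of a query",
    "d3": "the retrieval system responds by retrieving documents",
}

inverted_index = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index.setdefault(term, set()).add(doc_id)

# Look up which documents contain "retrieval":
print(sorted(inverted_index["retrieval"]))  # ['d1', 'd3']
```

A query term is then answered by one dictionary lookup instead of a scan over all documents.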
Indexing
• Most of the indexing techniques involve identifying good document
descriptors, such as keywords or terms, to describe information content of
the documents.
• A good descriptor is one that helps in describing the content of the document
and in discriminating the document from other documents in the collection.
• A term can be a single word or a multi-word phrase.
Example:
‘Design Features of Information Retrieval systems’
can be represented by single word terms :
Design, Features, Information, Retrieval, systems
or by the set of multi-word terms:
Design, Features, Information Retrieval, Information Retrieval systems
Luhn’s early assumption
• Luhn (1957) assumed that the frequency of word occurrence in an
article gives a meaningful indication of its content.
• The discrimination power of index terms is a function of the rank order
of their frequency of occurrence.
• Middle-frequency terms have the highest discrimination power.
Eliminating stop words
• Stop words are high frequency words, which have little semantic weight and are thus
unlikely to help with retrieval.
• Such words are commonly used in documents, regardless of topics; and have no topical
specificity.
Example :
articles (“a”, “an” “the”) and prepositions (e.g. “in”, “of”, “for”, “at” etc.).
• Advantage
Eliminating these words can result in considerable reduction in number of index terms
without losing any significant information.
• Disadvantage
It can sometimes result in elimination of terms useful for searching, for instance the stop
word A in Vitamin A. Some phrases like “to be or not to be” consist entirely of stop words.
Stemming
• Stemming normalizes morphological variants
• It removes suffixes from words to reduce them to some root form, e.g.
the words compute, computing, computes and computer will all be reduced
to the same stem comput.
• Porter Stemmer (1980).
• Example:
The stemmed representation of
Design Features of Information Retrieval systems
will be
{design, featur, inform, retriev, system}
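The suffix-stripping idea can be illustrated with a toy stemmer. This is a crude stand-in, not the Porter algorithm (which applies staged rewrite rules); the suffix list is chosen only to reproduce the examples above:

```python
# Toy suffix-stripping stemmer (crude approximation; NOT the Porter algorithm).
SUFFIXES = ["ation", "ing", "es", "er", "al", "s", "e"]  # longest first

def stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least 3 characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["Design", "Features", "of", "Information", "Retrieval", "systems"]
print([stem(w) for w in words])
# ['design', 'featur', 'of', 'inform', 'retriev', 'system']
```

The same rules also map compute, computing, computes and computer to the common stem comput.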
Zipf’s law
“The frequency of a word multiplied by its rank in a large corpus is
approximately constant”, i.e.

frequency × rank ≈ constant

• High frequency words: common words with little discriminatory power.
• Low frequency words: less likely to be included in a query.
• Medium frequency words: content-bearing words that can be used for
indexing, extracted using high and low frequency thresholds.
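The law and the thresholding idea can be demonstrated on synthetic rank-frequency data (the word counts are invented to follow frequency ≈ C / rank):

```python
# Invented counts roughly following Zipf's law: frequency ~ C / rank, C = 1200.
counts = {"the": 1200, "of": 600, "and": 400, "retrieval": 300,
          "index": 240, "query": 200, "corpus": 171, "zipf": 150}

ranked = sorted(counts.items(), key=lambda kv: -kv[1])
for rank, (word, freq) in enumerate(ranked, start=1):
    print(word, rank * freq)        # rank * frequency stays roughly constant

# Keep medium-frequency words as index terms, using two cut-off thresholds.
upper, lower = 350, 180
index_terms = [w for w, f in counts.items() if lower <= f <= upper]
print(index_terms)  # ['retrieval', 'index', 'query']
```

The two thresholds discard the common function words at the top of the ranking and the rare words at the bottom, keeping the content-bearing middle band.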
Information Retrieval Models
• An IR model is a pattern that defines several aspects of retrieval procedure, for example,
• how documents and user’s queries are represented
• how a system retrieves relevant documents according to users’ queries
• how retrieved documents are ranked.
• An IR system consists of
• a model of documents
• a model of queries
• a matching function which compares queries to documents.
• The central objective of the model is to retrieve all documents relevant to the query.
• IR models can be classified as:
• Classical models of IR
• Non-Classical models of IR
• Alternative models of IR
• Classical IR models: Based on mathematical knowledge that is easily
recognized and well understood.
• Simple, efficient and easy to implement.
Ex: Boolean, vector and probabilistic models.

• Non-classical IR models: Based on principles other than similarity,
probability, Boolean operations etc., on which the classical retrieval
models are based.
Ex: Information logic model, situation theory model and interaction model.

• Alternative IR models: Enhancements of classical models making use of
specific techniques from other fields.
Ex: Cluster model, fuzzy model and latent semantic indexing (LSI) model.
Boolean Model
• The oldest of the three classical models. It is based on Boolean
logic and classical set theory. Represents documents as a set of
keywords, usually stored in an inverted file.
• Users are required to express their queries as a boolean expression
consisting of keywords connected with boolean logical operators
(AND, OR, NOT).
• Retrieval is performed based on whether or not a document contains
the query terms.
Boolean model
Given a finite set
T = {t1, t2, ..., ti, ..., tm}
of index terms, a finite set
D = {d1, d2, ..., dj, ..., dn}
of documents, and a Boolean expression in normal form representing a query Q:

1. For each term ti, the set Ri of documents that contain (or do not contain) ti is obtained:
Ri = {dj | ti ∈ dj}, or its complement {dj | ti ∉ dj} for a negated term.
2. Set operations (union, intersection, complement) on the sets Ri are used to retrieve documents in response to Q.
Example:
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
Document collection: A collection of Shakespeare's work.

• Each term is thus represented by a 0/1 vector over the plays: 1 if the play contains the word, 0 otherwise.
• To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented), and bitwise AND them:
• 110100 AND 110111 AND 101111 = 100100.
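The bitwise-AND computation above can be reproduced directly, packing each 0/1 incidence vector into an integer (one bit per play):

```python
# 0/1 incidence vectors over six plays, packed into integers.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Query: Brutus AND Caesar AND NOT Calpurnia.
mask = 0b111111                      # restrict the complement to six bits
answer = brutus & caesar & (~calpurnia & mask)
print(format(answer, "06b"))         # 100100 -> the 1st and 4th plays match
```

Bit positions with a 1 in the result are exactly the plays satisfying the Boolean expression.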
Advantages:
Simple, efficient, easy to implement, performs well in terms of recall
and precision if the query is well formulated.
Drawbacks:
1. Cannot retrieve documents that are only partly relevant to the user
query.
2. Cannot rank the retrieved documents.
3. Users need to formulate the query as a pure Boolean expression.
Probabilistic model
• Ranks the document based on the probability of their relevance to a given
query.
• Retrieval depends on whether the probability of relevance of a document is
higher than that of non-relevance, or exceeds a threshold value.
• Given a set of documents D, a query q and a cut-off value α, this model
calculates the probability of relevance and irrelevance of each document to
the query, and then ranks the documents.
• P(R | d) is the probability of relevance of a document d for the query q, and
P(I | d) is the probability of irrelevance. The set of documents retrieved is:
S = {dj | P(R | dj) ≥ P(I | dj) and P(R | dj) > α}
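The selection rule can be sketched with invented probability estimates; actually estimating P(R | d), e.g. with the binary independence model, is the real modelling work and is not shown here:

```python
# Hypothetical relevance/irrelevance probability estimates per document.
probs = {            # doc: (P(R|d), P(I|d))
    "d1": (0.80, 0.20),
    "d2": (0.40, 0.60),
    "d3": (0.55, 0.45),
}
alpha = 0.5          # cut-off value

# S = {d | P(R|d) >= P(I|d) and P(R|d) > alpha}, ranked by P(R|d).
S = [d for d, (pr, pi) in probs.items() if pr >= pi and pr > alpha]
S.sort(key=lambda d: -probs[d][0])
print(S)  # ['d1', 'd3']
```

Unlike the Boolean model, the output is an ordered list: d1 ranks above d3 because its relevance probability is higher.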
Vector space model
• Most well studied retrieval model.
• Represents documents and queries as vectors of features
representing terms that occur within them. Each document is
characterized by a numerical vector.
• Vector represented in multidimensional space , in which each
dimension corresponds to a distinct term in the corpus of
documents.
• Vector space = the set of all keywords encountered:
T = <t1, t2, t3, …, tm>
• Document collection:
D = <d1, d2, d3, …, dn>
• Each document dj is represented by a column vector of weights
(w1j, w2j, w3j, …, wij, …, wmj)^t
where wij is the weight of term ti in document dj.
The term–document matrix (rows correspond to the term space, columns to the document space):

w11 w12 w13 ……… w1n
w21 w22 w23 ……… w2n
w31 w32 w33 ……… w3n
⋮
wm1 wm2 wm3 ……… wmn
Example:
• D1= Information retrieval is concerned with the organization, storage,
retrieval and evaluation of information relevant to user’s query.
• D2=A user having an information needs to formulate a request in the form
of query written in natural languages.
• D3= The retrieval system responds by retrieving the document that seems
relevant to the query.
• T={information, retrieval, query}
• Based on term frequencies, the document vectors are:

      t1   t2   t3              t1     t2     t3
d1    2    2    1        d1   0.67   0.67   0.33
d2    1    0    1    →   d2   0.71   0      0.71
d3    0    1    1        d3   0      0.71   0.71
    (raw counts)          (after normalization)
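The normalization step in the table can be reproduced, and a query can then be matched by cosine similarity (the dot product of unit-length vectors); the query below is an assumed example:

```python
import math

# Raw term frequencies for T = (information, retrieval, query).
tf = {"d1": [2, 2, 1], "d2": [1, 0, 1], "d3": [0, 1, 1]}

def normalize(v):
    # Divide by the Euclidean length so the vector has unit norm.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

unit = {d: normalize(v) for d, v in tf.items()}
print([round(x, 2) for x in unit["d1"]])   # [0.67, 0.67, 0.33]

# Rank documents against the query "information retrieval" -> (1, 1, 0).
q = normalize([1, 1, 0])
scores = {d: sum(a * b for a, b in zip(q, v)) for d, v in unit.items()}
print(max(scores, key=scores.get))         # d1 scores highest
```

Because all vectors are unit length, the dot product equals the cosine of the angle between query and document.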
Term weighting
• A weight wij is assigned to term ti in document dj to indicate its importance. A common scheme is tf-idf:

wij = tfij × log(N / dfi)

where tfij is the frequency of ti in dj, N is the total number of documents in the collection, and dfi is the number of documents containing ti.
Weighting schemes
A. Term frequency
B. Inverse document frequency
C. Document length
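A minimal tf-idf computation over the three example documents from the vector space section (the logarithm base is a convention; base 10 is assumed here):

```python
import math

# Term frequencies per document for T = (information, retrieval, query).
tf = {"d1": {"information": 2, "retrieval": 2, "query": 1},
      "d2": {"information": 1, "query": 1},
      "d3": {"retrieval": 1, "query": 1}}
N = len(tf)

# Document frequency: in how many documents does each term occur?
df = {}
for terms in tf.values():
    for t in terms:
        df[t] = df.get(t, 0) + 1

def tfidf(term, doc):
    # w_ij = tf_ij * log10(N / df_i)
    return tf[doc].get(term, 0) * math.log10(N / df[term])

print(round(tfidf("information", "d1"), 3))  # 2 * log10(3/2)
print(tfidf("query", "d1"))                  # 0.0 -- "query" occurs everywhere
```

Note how "query", which appears in every document, gets zero weight: it cannot discriminate between documents.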
Simple automatic method for indexing
• Step 1: Tokenization:
This extracts individual terms from the document, converts all letters to lower case, and removes punctuation marks.
• Step 2: Stop word elimination:
This removes words that appear frequently across documents regardless of topic.
• Step 3: Stemming:
This reduces the remaining terms to their linguistic root, to obtain the index terms.
• Step 4: Term weighting:
This assigns a weight to each term according to its importance in the document or collection of documents.
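The four steps above can be composed into one small pipeline. The stop word list and the one-suffix "stemmer" are deliberately minimal stand-ins, and raw term frequency is used as the weight:

```python
import string
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "in", "for", "at", "is", "to"}

def index_terms(text):
    # Step 1: tokenize -- lower-case and strip punctuation.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    # Step 2: eliminate stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Step 3: crude stemming stand-in -- strip a trailing "s".
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    # Step 4: weight terms by raw frequency (tf); tf-idf would also use df.
    return Counter(tokens)

print(index_terms("Design Features of Information Retrieval systems"))
```

A real indexer would substitute a curated stop list, a proper stemmer such as Porter's, and tf-idf weights, but the control flow is the same.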
Non-classical models of IR
• Non-classical models are based on principles other than similarity, probability
or Boolean operations; they include the information logic model, the situation
theory model and the interaction model.
1. Information Logic Model: Based on logical imaging. A document d is taken to
be relevant to a query q if d logically implies q, and a measure of uncertainty
is associated with this inference, given by:
“Given any two sentences x and y, a measure of the uncertainty of y → x
relative to a given data set is determined by the minimal extent to which one
has to add information to the data set in order to establish the truth of y → x.”
Non-classical models of IR
2. Situation Theory Model: Retrieval is considered as a flow of information
from document to query. A structure called an infon (τ) is used to
describe the situation and to model the information flow. The polarity of
an infon is either 1 or 0, indicating whether the infon carries positive or
negative information.
Ex: ‘Adil is serving a dish’
τ = <<serving, Adil, dish; 1>>
Whether the polarity holds depends on a supporting situation s, written s |= τ.
s can be “I saw Adil serving a dish.”
A document d is relevant to a query q if d |= q.
Non-classical models of IR
• Interaction model:
Documents are interconnected, and the query interacts with this network
of connected documents. Retrieval is considered a result of this
interaction. The model can be implemented using artificial neural
networks, where both the documents and the query act as neurons.
Alternative models of IR
• Cluster Model:
Attempts to reduce the number of matches during retrieval, based on the
cluster hypothesis:
“Closely associated documents tend to be relevant to the same
requests.”
Hence, instead of matching the query against individual documents, it
is matched against representatives of the clusters (classes), and only
documents from a class whose representative is close to the query are
considered for individual matching.
Clustering can also be applied to terms, based on co-occurrence.
Cluster generation based on similarity
matrix
• Let D = {d1, d2, d3, ..., dm} be the set of documents.
• Let E = (eij)m×m be the similarity matrix, where element eij denotes the
similarity between documents di and dj. If the similarity measure exceeds
a threshold value, the documents are grouped to form a cluster.
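Cluster generation from a similarity matrix can be sketched as finding connected components over the above-threshold pairs; the similarity values below are invented:

```python
# Hypothetical symmetric document-document similarity matrix.
docs = ["d1", "d2", "d3", "d4"]
sim = [[1.0, 0.8, 0.1, 0.0],
       [0.8, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.9],
       [0.0, 0.1, 0.9, 1.0]]
threshold = 0.5

# Group documents whose pairwise similarity exceeds the threshold
# (transitively): connected components of the thresholded graph.
clusters = []
seen = set()
for i in range(len(docs)):
    if i in seen:
        continue
    stack, component = [i], set()
    while stack:
        j = stack.pop()
        if j in component:
            continue
        component.add(j)
        stack.extend(k for k in range(len(docs))
                     if k not in component and sim[j][k] > threshold)
    seen |= component
    clusters.append(sorted(docs[j] for j in component))
print(clusters)  # [['d1', 'd2'], ['d3', 'd4']]
```

At query time, the query is compared against one representative per cluster instead of against every document.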
Fuzzy model
• A document is represented as a fuzzy set of terms [ti, μ(ti)],
• where μ is a membership function that assigns to each term of the document a numeric
membership degree.
• Ex: d1={information, retrieval, query}
d2={retrieval, query, model}
d3={information, retrieval}
T={information, model, query, retrieval}
Fuzzy set for terms
f1={(d1,1/3),(d2,0),(d3,1/2)}
f2={(d1,0),(d2,1/3),(d3,0)}
f3={(d1,1/3),(d2,1/3),(d3,0)}
f4={(d1,1/3),(d2,1/3),(d3,1/2)}
If the query is t2 ∧ t4, then document d2 is returned.
https://www.youtube.com/watch?v=Ls2TgGFfZnU
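The fuzzy conjunction in the example (a document's degree for t2 ∧ t4 is the minimum of its membership degrees in f2 and f4) can be checked directly:

```python
# Fuzzy sets for the terms T = (information, model, query, retrieval):
# each maps a document to its membership degree.
f = {
    "t1": {"d1": 1/3, "d2": 0,   "d3": 1/2},   # information
    "t2": {"d1": 0,   "d2": 1/3, "d3": 0},     # model
    "t3": {"d1": 1/3, "d2": 1/3, "d3": 0},     # query
    "t4": {"d1": 1/3, "d2": 1/3, "d3": 1/2},   # retrieval
}

# Query t2 AND t4: conjunction is the minimum of the membership degrees.
score = {d: min(f["t2"][d], f["t4"][d]) for d in ("d1", "d2", "d3")}
print(score)                        # only d2 has a non-zero degree
print(max(score, key=score.get))    # d2
```

For a disjunction (t2 ∨ t4), the maximum of the membership degrees would be taken instead.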
Evaluation of the IR systems
• Six criteria
1. Coverage of the collection
2. Time lag
3. Presentation format
4. User effort
5. Precision
6. Recall
Precision and recall are the most frequently applied measures in evaluating IR
systems. They relate to the effectiveness of an IR system, i.e. the system's
ability to retrieve relevant documents in response to a query.
Relevance
• It is not possible to measure true relevance, as it is subjective in nature.
• Relevance tests are based on known relevance judgments: assessments
made by experts of the discipline.
• Degree of relevance: a binary or continuous function.
• Relevance frameworks: system, communication, psychological,
situational.
Effectiveness
A measure of the ability of the system to satisfy the user in terms of the
relevance of the documents retrieved.
• Relevance of the document retrieved
• Order of relevance
• Number of relevant documents returned
• Two most important measures
1. Precision
2. Recall
Precision and recall
• Let A be the set of relevant documents and B the set of retrieved documents:

Precision = |A ∩ B| / |B|
Recall = |A ∩ B| / |A|

                Relevant    Non-relevant    Total
Retrieved       A ∩ B       A′ ∩ B          B
Not retrieved   A ∩ B′      A′ ∩ B′         B′
Total           A           A′

where A′ and B′ denote the complements of A and B.
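Precision and recall as set operations (the document IDs and judgments are invented):

```python
relevant  = {"d1", "d3", "d5", "d7"}   # A: documents judged relevant
retrieved = {"d1", "d2", "d3", "d4"}   # B: documents returned by the system

hits = relevant & retrieved            # A ∩ B
precision = len(hits) / len(retrieved)
recall = len(hits) / len(relevant)
print(precision, recall)  # 0.5 0.5
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the four relevant documents were retrieved (recall 0.5).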
Trade off between precision and
recall
• High value of both at the same time is desirable.
• But precision is high when recall is low and as recall increases,
precision decreases.
• The ideal case of perfect retrieval requires that all relevant
documents be retrieved before the first non relevant document is
retrieved.
Non-interpolated average precision
• The average of the precision values at observed recall points. An observed
recall point corresponds to each position in the ranking at which a relevant
document is retrieved.
Ex (illustrative): if relevant documents are retrieved at ranks 1, 3 and 5, the
precisions at those points are 1/1, 2/3 and 3/5, and the average precision is
(1.0 + 0.67 + 0.6) / 3 ≈ 0.76.
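Non-interpolated average precision can be computed from a ranked result list and the set of relevant documents (the IDs below are invented):

```python
ranking = ["d3", "d9", "d1", "d8", "d5"]    # system output, best first
relevant = {"d3", "d1", "d5"}

# Precision at each observed recall point, i.e. at each rank where a
# relevant document appears.
precisions = []
hits = 0
for rank, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
        precisions.append(hits / rank)      # precision at this cut-off
avg_precision = sum(precisions) / len(relevant)
print(precisions, round(avg_precision, 3))
```

Dividing by the total number of relevant documents (rather than by the number retrieved) penalizes a system that misses relevant documents entirely.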
Interpolated precision
• Precision values are interpolated for a standard set of recall points.
• Recall levels are 0.0, 0.1, 0.2, ..., 1.0. Precision is calculated at each of
these levels and then averaged to get a single value.
• The interpolated precision at a given recall level is the greatest known
precision at any recall level greater than or equal to the given level.
Ex: precision observed at recall points:

Recall   Precision
0.25     1.0
0.4      0.67
0.55     0.8
0.8      0.6
1.0      0.5
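The interpolation rule ("greatest known precision at any recall greater than or equal to this level") applied to the observed points from the example:

```python
# Observed (recall, precision) points from the example.
observed = [(0.25, 1.0), (0.4, 0.67), (0.55, 0.8), (0.8, 0.6), (1.0, 0.5)]

def interpolated_precision(level):
    # Greatest known precision at any recall >= the given level (0 if none).
    candidates = [p for r, p in observed if r >= level]
    return max(candidates, default=0.0)

levels = [round(0.1 * i, 1) for i in range(11)]   # 0.0, 0.1, ..., 1.0
interp = [interpolated_precision(l) for l in levels]
print(interp)
# At recall level 0.5 the value is 0.8, not 0.67, because a higher
# precision (0.8) is observed at the larger recall point 0.55.
eleven_point_avg = sum(interp) / len(interp)
print(round(eleven_point_avg, 3))
```

Averaging the eleven interpolated values gives the single-number "11-point average precision" for this ranking.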
Lexical resources
• “Knowing where the information is, is half the information.”
• Tools and lexical resources for working with NLP:
• WordNet
• FrameNet
• Stemmers
• Taggers
• Parsers
• Text Corpus
WORDNET
• A large lexical database for the English language, developed at Princeton
University.
• Three databases: one for nouns, one for verbs, and one for adjectives
and adverbs.
• Information is organized into sets of synonymous words called synsets.
• Synsets are linked to each other by means of lexical and semantic
relations.
• Relations include synonymy, hyponymy, antonymy,
meronymy/holonymy and troponymy.
• WordNets for other languages: Hindi WordNet, EuroWordNet.
WordNet applications
• Concept identifications in Natural language
• Word sense disambiguation
• Automatic Query Expansion
• Document structuring and categorization
• Document summarization
FRAMENET
• FrameNet is a large database of semantically annotated English
sentences, drawn from the British National Corpus.
• Each word evokes a particular situation with particular participants.
• FrameNet captures these situations through frames and frame elements.
Stemmers
• Reducing an inflected word to its
root form.
• Most commonly used stemmers:
• Porter stemmer
• Lovins stemmer
• Paice/Husk stemmer
• Snowball stemmers for European
languages
POS Taggers
• Stanford Log-linear POS tagger (97.24% accuracy)
• Postagger
• TnT tagger (Trigrams’n’ Tag)
• Brill tagger
• CLAWS (constituent likelihood automatic word-tagging system)
• Tree-tagger
• ACOPOST (A collection of POS taggers)
• Maximum entropy tagger
• Trigram tagger
• Error-driven transformation-based tagger
• Example-based tagger
