ISE Information Retrieval Mod-V
Information Retrieval(IR)
By:
Savitha N J
Asst. professor, Dept. of CSE, CMRIT,
Bengaluru-560037
What is IR?
• Information retrieval (IR) deals with the organization, storage,
retrieval and evaluation of information relevant to a user’s query
written in a natural language.
• ‘An information retrieval system does not inform (i.e. change the
knowledge of) the user on the subject of her inquiry. It merely
informs on the existence (or non-existence) and whereabouts of
documents relating to her request.’
Design of a basic IR system
Basic IR process
• Problems:
• Problem of representation of documents and queries.
• Matching the query representation with document representation.
• Documents are represented using ‘index terms’ or ‘keywords’, which
provide a logical view of the document.
Indexing
• Transforming text document to some logical representation.
• Ex: Inverted indexing
• Reducing set of representative keywords:
• Stop word elimination
• Stemming
• Zipf’s law
• ‘Term-weighting’ to indicate the significance of an index term to a
document.
Indexing
• Most of the indexing techniques involve identifying good document
descriptors, such as keywords or terms, to describe information content of
the documents.
• A good descriptor is one that helps in describing the content of the document
and in discriminating the document from other documents in the collection.
• A term can be a single word or a multi-word phrase.
Example:
‘Design Features of Information Retrieval systems’
can be represented by single word terms :
Design, Features, Information, Retrieval, systems
or by the set of multi-word terms:
Design, Features, Information Retrieval, Information Retrieval systems
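The two representations above can be sketched in a few lines; the code simply splits the title and forms word bigrams as candidate multi-word terms, which is only an illustration, not a real phrase extractor.

```python
# Sketch: single-word vs. multi-word candidate terms for a title.
title = "Design Features of Information Retrieval systems"
words = title.split()

# Single-word terms: every token except the stop word "of".
single_terms = [w for w in words if w.lower() != "of"]

# Multi-word candidates: contiguous bigrams; a real system would
# filter these (e.g. keep "Information Retrieval", drop "of Information").
bigrams = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]

print(single_terms)
print(bigrams)
```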
Luhn’s early assumption
• Luhn (1957) assumed that the frequency of word occurrence in an
article gives a meaningful indication of its content.
• Discrimination power for index terms is a function of the rank order
of their frequency of occurrence.
• Middle frequency terms have the highest discrimination power.
Eliminating stop words
• Stop words are high frequency words, which have little semantic weight and are thus
unlikely to help with retrieval.
• Such words are commonly used in documents, regardless of topics; and have no topical
specificity.
Example :
articles (“a”, “an” “the”) and prepositions (e.g. “in”, “of”, “for”, “at” etc.).
• Advantage
Eliminating these words can result in a considerable reduction in the number of
index terms without losing any significant information.
• Disadvantage
It can sometimes result in elimination of terms useful for searching, for instance the stop
word A in Vitamin A. Some phrases like “to be or not to be” consist entirely of stop words.
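Stop word elimination can be sketched as a simple filter; the stop list below is a tiny illustrative sample, not a standard list.

```python
# Minimal sketch of stop word elimination; the stop list is a small
# illustrative sample, not a standard stop word list.
STOP_WORDS = {"a", "an", "the", "in", "of", "for", "at", "to", "be",
              "or", "not"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# "to be or not to be" consists entirely of stop words:
print(remove_stop_words("to be or not to be".split()))  # -> []
```

As the disadvantage above warns, the same filter reduces “Vitamin A” to just “Vitamin”.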
Stemming
• Stemming normalizes morphological variants
• It removes suffixes from the words to reduce them to some root form e.g.
the words compute, computing, computes and computer will all be reduced
to same word stem comput.
• Porter Stemmer(1980).
• Example:
The stemmed representation of
Design Features of Information Retrieval systems
will be
{design, featur, inform, retriev, system}
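The idea behind suffix stripping can be sketched with a deliberately simplified rule list; the real Porter algorithm applies a much larger, ordered rule set with conditions on the remaining stem, so this is only an illustration.

```python
# Crude suffix-stripping stemmer -- an illustration only, NOT the
# Porter algorithm. Suffixes are tried in order; a suffix is stripped
# only if at least 3 characters of stem would remain.
SUFFIXES = ["ation", "ing", "es", "ed", "er", "al", "s", "e"]

def crude_stem(word):
    word = word.lower()
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

for w in ["compute", "computing", "computes", "computer"]:
    print(w, "->", crude_stem(w))  # each reduces to "comput"
```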
Zipf’s law
“Frequency of a word multiplied by its rank in a large corpus is
approximately constant”, i.e. f × r ≈ C.
High frequency words are common words with less discriminatory power.
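The law can be stated as a one-line idealized model; the constant C below is arbitrary, chosen only for illustration.

```python
# Zipf's law: the frequency f(r) of the r-th most frequent word times
# its rank r is approximately constant, f(r) * r ~ C. This sketch uses
# an idealized distribution with an arbitrary illustrative constant.
C = 60_000

def zipf_freq(rank):
    """Idealized Zipfian frequency for the word at the given rank."""
    return C / rank

for rank in (1, 2, 3, 10, 100):
    print(rank, zipf_freq(rank), rank * zipf_freq(rank))  # product stays at C
```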
Boolean model
1. The set Ri of documents containing term ti is obtained:
Ri = { dj | ti ∈ dj }
2. Set operations (AND, OR, NOT) are used to retrieve documents in response to a query Q.
Example:
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
Document collection: the collected works of Shakespeare.
• Each term gets a 0/1 incidence vector: 1 if the play contains the
word, 0 otherwise.
• To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented)
and bitwise AND them.
• 110100 AND 110111 AND 101111 = 100100.
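The bitwise AND above can be reproduced directly with integer bitmasks (one bit per play; which play each bit stands for is not specified in the example).

```python
# The incidence vectors from the example as Python integers.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b101111  # already complemented (NOT Calpurnia)

# Boolean AND of the query terms = bitwise AND of the vectors.
answer = brutus & caesar & calpurnia
print(format(answer, "06b"))  # -> 100100
```

The two set bits in the result identify the plays that satisfy the whole Boolean query.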
Advantages:
Simple, efficient, easy to implement, performs well in terms of recall
and precision if the query is well formulated.
Drawbacks:
1. Cannot retrieve documents that are only partly relevant to user
query.
2. Boolean system cannot rank the retrieved documents.
3. The user needs to formulate the query as a pure Boolean expression.
Probabilistic model
• Ranks the document based on the probability of their relevance to a given
query.
• Retrieval depends on whether the probability of relevance of a document is
higher than that of non-relevance and exceeds a threshold value.
• Given a set of documents D, a query q and a cut-off value α, this model
calculates the probability of relevance and of irrelevance of each document to
the query, and then ranks the documents.
• P( R | d ) is the probability of relevance of a document d for the query q, and
P( I | d ) is the probability of irrelevance. The set of documents retrieved is:
• S = { dj | P( R | dj ) ≥ P( I | dj ) and P( R | dj ) > α }
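A minimal sketch of this retrieval rule, assuming the probabilities P(R|d) and P(I|d) have already been estimated (the estimation itself is the hard part and is omitted here); the numbers are invented for illustration.

```python
# Probabilistic retrieval rule: keep documents with P(R|d) >= P(I|d)
# and P(R|d) > alpha, then rank by P(R|d). Probabilities are made up.
alpha = 0.5
probs = {  # doc -> (P(R|d), P(I|d))
    "d1": (0.9, 0.1),
    "d2": (0.4, 0.6),
    "d3": (0.7, 0.3),
}

retrieved = sorted(
    (d for d, (pr, pi) in probs.items() if pr >= pi and pr > alpha),
    key=lambda d: probs[d][0],
    reverse=True,
)
print(retrieved)  # documents ranked by decreasing P(R|d)
```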
Vector space model
• The most well-studied retrieval model.
• Represents documents and queries as vectors of features
representing terms that occur within them. Each document is
characterized by a numerical vector.
• Vector represented in multidimensional space , in which each
dimension corresponds to a distinct term in the corpus of
documents.
• Vector space = all the keywords encountered
T= <t1, t2, t3, …, tm>
• Document collection
D = < d1, d2, d3, …, dn>
• Each document dj is represented by a column vector of weights
(w1j, w2j, w3j, …, wmj)T
where wij is the weight of the term ti in document dj.
(Figure: documents represented as vectors in the term space.)
Weighting schemes
A. Term frequency
B. Inverse document frequency
C. Document length
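The first two factors are classically combined as tf-idf, sketched below; the tiny document collection is invented for illustration, and document-length normalization is omitted.

```python
import math

# tf-idf weighting sketch: w_ij = tf_ij * log(N / df_i), where tf is
# the term's count in the document and df the number of documents
# containing it. The tiny collection is invented for illustration.
docs = [
    "information retrieval system design",
    "database system design",
    "information system",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf(term, doc_tokens):
    return doc_tokens.count(term)

def idf(term):
    df = sum(1 for d in tokenized if term in d)
    return math.log(N / df)

def tfidf(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)

# "retrieval" occurs in only one document, so it outweighs the
# ubiquitous "system", whose idf is log(3/3) = 0.
print(tfidf("retrieval", tokenized[0]), tfidf("system", tokenized[0]))
```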
Simple automatic method for indexing
• Step 1: Tokenization:
This extracts individual terms from the document, converts all letters to lower case, and removes
punctuation marks.
• Step 2: Stop word elimination:
This removes the high-frequency function words that carry little topical content.
• Step 3: Stemming:
This reduces the remaining terms to their linguistic root, to obtain the index terms.
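The three steps can be chained into one pipeline. The stop list and suffix rules below are tiny illustrative stand-ins for real resources, tuned so that the stemming slide's earlier example comes out as shown there.

```python
import string

# End-to-end indexing sketch: tokenize -> remove stop words -> stem.
# Stop list and suffix rules are illustrative samples only.
STOPS = {"a", "an", "the", "of", "in", "for", "at"}
SUFFIXES = ["ation", "ing", "es", "ed", "er", "al", "s", "e"]

def index_terms(text):
    # Step 1: tokenize -- lower-case and strip punctuation.
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    # Step 2: stop word elimination.
    tokens = [t for t in tokens if t and t not in STOPS]
    # Step 3: crude suffix stripping (a stand-in for a real stemmer).
    stems = []
    for t in tokens:
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) - len(suf) >= 3:
                t = t[: -len(suf)]
                break
        stems.append(t)
    return stems

print(index_terms("Design Features of Information Retrieval systems."))
# -> ['design', 'featur', 'inform', 'retriev', 'system']
```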
Interpolated precision
• Precision values are interpolated for a set of recall points.
• Recall levels are 0.0, 0.1, 0.2, …, 1.0. Precision is calculated at each of
these levels and then averaged to get a single value.
• Interpolated precision at a given recall level is the greatest known precision
at any recall level greater than or equal to the given level.
Ex: precision observed at recall points:
Recall    Precision
0.25      1.0
0.4       0.67
0.55      0.8
0.8       0.6
1.0       0.5
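The interpolation rule can be computed directly from the observed points above.

```python
# Interpolated precision at recall level r = the maximum observed
# precision at any recall >= r, using the example's observations.
observed = [(0.25, 1.0), (0.4, 0.67), (0.55, 0.8), (0.8, 0.6), (1.0, 0.5)]

def interpolated_precision(r):
    return max(p for rec, p in observed if rec >= r)

for i in range(11):  # the 11 standard recall levels 0.0 .. 1.0
    r = i / 10
    print(f"recall {r:.1f}: interpolated precision {interpolated_precision(r):.2f}")
```

Note that the observed precision at recall 0.4 is only 0.67, yet the interpolated value there is 0.8, because a higher precision (0.8) is observed at the greater recall level 0.55.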
Lexical resources
• ‘Knowing where the information is, is half the information.’
• Tools and lexical resources for working with NLP:
• WordNet
• FrameNet
• Stemmers
• Taggers
• Parsers
• Text Corpus
WORDNET
• A large lexical database of English, developed at Princeton University.
• Three databases: one for nouns, one for verbs, and one for adjectives
and adverbs.
• Information organized into sets of synonymous words called
synsets.
• synsets are linked to each other by means of lexical and semantic
relations.
• Relation includes synonymy, hyponymy, antonymy,
meronymy/holonymy, troponymy
• WordNet for other languages : Hindi WordNet, EuroWordNet
WordNet applications
• Concept identification in natural language
• Word sense disambiguation
• Automatic Query Expansion
• Document structuring and categorization
• Document summarization
FRAMENET
• FrameNet is a large database of semantically annotated English
sentences.
• Sentences are drawn from the British National Corpus.
• Each word evokes a particular situation with particular participants.
• FrameNet captures situation through frames, and frame elements.
Stemmers
• Reduces an inflected word to its
root form.
• Most commonly used stemmers:
• Porter stemmer (1980)
• Lovins stemmer
• Paice/Husk stemmer
• Snowball stemmers for European
languages
POS Taggers
• Stanford Log-linear POS tagger (97.24% accuracy)
• Postagger
• TnT tagger (Trigrams’n’Tags)
• Brill tagger
• CLAWS (constituent likelihood automatic word-tagging system)
• Tree-tagger
• ACOPOST (A collection of POS taggers)
• Maximum entropy tagger
• Trigram tagger
• Error-driven transformation-based tagger
• Example-based tagger