ISE Information Retrieval Mod-V
Information Retrieval(IR)
By:
Savitha N J
Asst. professor, Dept. of CSE, CMRIT,
Bengaluru-560037
What is IR?
• Information retrieval (IR) deals with the organization, storage,
retrieval and evaluation of information relevant to a user’s query
written in a natural language.
• ‘An information retrieval system does not inform (i.e. change the
knowledge of) the user on the subject of her inquiry. It merely
informs on the existence (or non-existence) and whereabouts of
documents relating to her request.’
Design of a basic IR system
Basic IR process
• Problems:
• Problem of representation of documents and queries.
• Matching the query representation with document representation.
• Documents are represented using ‘index terms’ or ‘keywords’, which
provide a logical view of the document.
Indexing
• Transforming text document to some logical representation.
• Ex: Inverted indexing
• Reducing set of representative keywords:
• Stop word elimination
• Stemming
• Zipf’s law
• ‘Term-weighting’ to indicate the significance of an index term to a
document.
Indexing
• Most of the indexing techniques involve identifying good document
descriptors, such as keywords or terms, to describe information content of
the documents.
• A good descriptor is one that helps in describing the content of the document
and in discriminating the document from other documents in the collection.
• A term can be a single word or a multi-word phrase.
Example:
‘Design Features of Information Retrieval systems’
can be represented by single word terms :
Design, Features, Information, Retrieval, systems
or by the set of multi-word terms:
Design, Features, Information Retrieval, Information Retrieval systems
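The two representations above can be sketched in a few lines; the code simply splits the title and forms word bigrams as candidate multi-word terms, which is only an illustration, not a real phrase extractor.

```python
# Sketch: single-word vs. multi-word candidate terms for a title.
title = "Design Features of Information Retrieval systems"
words = title.split()

# Single-word terms: every token except the stop word "of".
single_terms = [w for w in words if w.lower() != "of"]

# Multi-word candidates: contiguous bigrams; a real system would
# filter these (e.g. keep "Information Retrieval", drop "of Information").
bigrams = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]

print(single_terms)
print(bigrams)
```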
Luhn’s early assumption
• Luhn (1957) assumed that the frequency of word occurrence in an
article gives a meaningful indication of its content.
• Discrimination power for index terms is a function of the rank order
of their frequency of occurrence.
• Middle frequency terms have the highest discrimination power.
Eliminating stop words
• Stop words are high frequency words, which have little semantic weight and are thus
unlikely to help with retrieval.
• Such words are commonly used in documents, regardless of topics; and have no topical
specificity.
Example :
articles (“a”, “an” “the”) and prepositions (e.g. “in”, “of”, “for”, “at” etc.).
• Advantage
Eliminating these words can result in a considerable reduction in the number of
index terms without losing any significant information.
• Disadvantage
It can sometimes result in elimination of terms useful for searching, for instance the stop
word A in Vitamin A. Some phrases like “to be or not to be” consist entirely of stop words.
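Stop word elimination can be sketched as a simple filter; the stop list below is a tiny illustrative sample, not a standard list.

```python
# Minimal sketch of stop word elimination; the stop list is a small
# illustrative sample, not a standard stop word list.
STOP_WORDS = {"a", "an", "the", "in", "of", "for", "at", "to", "be",
              "or", "not"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# "to be or not to be" consists entirely of stop words:
print(remove_stop_words("to be or not to be".split()))  # -> []
```

As the disadvantage above warns, the same filter reduces “Vitamin A” to just “Vitamin”.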
Stemming
• Stemming normalizes morphological variants
• It removes suffixes from the words to reduce them to some root form e.g.
the words compute, computing, computes and computer will all be reduced
to same word stem comput.
• Porter Stemmer(1980).
• Example:
The stemmed representation of
Design Features of Information Retrieval systems
will be
{design, featur, inform, retriev, system}
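The idea behind suffix stripping can be sketched with a deliberately simplified rule list; the real Porter algorithm applies a much larger, ordered rule set with conditions on the remaining stem, so this is only an illustration.

```python
# Crude suffix-stripping stemmer -- an illustration only, NOT the
# Porter algorithm. Suffixes are tried in order; a suffix is stripped
# only if at least 3 characters of stem would remain.
SUFFIXES = ["ation", "ing", "es", "ed", "er", "al", "s", "e"]

def crude_stem(word):
    word = word.lower()
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

for w in ["compute", "computing", "computes", "computer"]:
    print(w, "->", crude_stem(w))  # each reduces to "comput"
```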
Zipf’s law
“Frequency of a word multiplied by its rank in a large corpus is
approximately constant”, i.e. f × r ≈ C.
High frequency words are common words with less discriminatory power.
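The law can be stated as a one-line idealized model; the constant C below is arbitrary, chosen only for illustration.

```python
# Zipf's law: the frequency f(r) of the r-th most frequent word times
# its rank r is approximately constant, f(r) * r ~ C. This sketch uses
# an idealized distribution with an arbitrary illustrative constant.
C = 60_000

def zipf_freq(rank):
    """Idealized Zipfian frequency for the word at the given rank."""
    return C / rank

for rank in (1, 2, 3, 10, 100):
    print(rank, zipf_freq(rank), rank * zipf_freq(rank))  # product stays at C
```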
Boolean model
1. The set Ri of documents containing term ti is obtained:
Ri = { dj | ti ∈ dj }
2. Set operations (AND, OR, NOT) are used to retrieve documents in response to a query Q.
Example:
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
Document collection: the collected works of Shakespeare.
• Each term gets a 0/1 incidence vector: 1 if the play contains the
word, 0 otherwise.
• To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented)
and bitwise AND them.
• 110100 AND 110111 AND 101111 = 100100.
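The bitwise AND above can be reproduced directly with integer bitmasks (one bit per play; which play each bit stands for is not specified in the example).

```python
# The incidence vectors from the example as Python integers.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b101111  # already complemented (NOT Calpurnia)

# Boolean AND of the query terms = bitwise AND of the vectors.
answer = brutus & caesar & calpurnia
print(format(answer, "06b"))  # -> 100100
```

The two set bits in the result identify the plays that satisfy the whole Boolean query.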
Advantages:
Simple, efficient, easy to implement, performs well in terms of recall
and precision if the query is well formulated.
Drawbacks:
1. Cannot retrieve documents that are only partly relevant to user
query.
2. Boolean system cannot rank the retrieved documents.
3. The user needs to formulate the query as a pure Boolean expression.
Probabilistic model
• Ranks the document based on the probability of their relevance to a given
query.
• Retrieval depends on whether the probability of relevance of a document is
higher than that of non-relevance and exceeds a threshold value.
• Given a set of documents D, a query q and a cut-off value α, this model
calculates the probability of relevance and of irrelevance of each document to
the query, and then ranks the documents.
• P( R | d ) is the probability of relevance of a document d for the query q, and
P( I | d ) is the probability of irrelevance. The set of documents retrieved is:
• S = { dj | P( R | dj ) ≥ P( I | dj ) and P( R | dj ) > α }
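A minimal sketch of this retrieval rule, assuming the probabilities P(R|d) and P(I|d) have already been estimated (the estimation itself is the hard part and is omitted here); the numbers are invented for illustration.

```python
# Probabilistic retrieval rule: keep documents with P(R|d) >= P(I|d)
# and P(R|d) > alpha, then rank by P(R|d). Probabilities are made up.
alpha = 0.5
probs = {  # doc -> (P(R|d), P(I|d))
    "d1": (0.9, 0.1),
    "d2": (0.4, 0.6),
    "d3": (0.7, 0.3),
}

retrieved = sorted(
    (d for d, (pr, pi) in probs.items() if pr >= pi and pr > alpha),
    key=lambda d: probs[d][0],
    reverse=True,
)
print(retrieved)  # documents ranked by decreasing P(R|d)
```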
Vector space model
• The most well-studied retrieval model.
• Represents documents and queries as vectors of features
representing terms that occur within them. Each document is
characterized by a numerical vector.
• Vector represented in multidimensional space , in which each
dimension corresponds to a distinct term in the corpus of
documents.
• Vector space = all the keywords encountered
T= <t1, t2, t3, …, tm>
• Document collection
D = < d1, d2, d3, …, dn>
• Each document dj is represented by a column vector of weights
(w1j, w2j, w3j, …, wmj)T
where wij is the weight of the term ti in document dj.
(Figure: documents represented as vectors in the term space.)
Weighting schemes
A. Term frequency
B. Inverse document frequency
C. Document length
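The first two factors are classically combined as tf-idf, sketched below; the tiny document collection is invented for illustration, and document-length normalization is omitted.

```python
import math

# tf-idf weighting sketch: w_ij = tf_ij * log(N / df_i), where tf is
# the term's count in the document and df the number of documents
# containing it. The tiny collection is invented for illustration.
docs = [
    "information retrieval system design",
    "database system design",
    "information system",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf(term, doc_tokens):
    return doc_tokens.count(term)

def idf(term):
    df = sum(1 for d in tokenized if term in d)
    return math.log(N / df)

def tfidf(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)

# "retrieval" occurs in only one document, so it outweighs the
# ubiquitous "system", whose idf is log(3/3) = 0.
print(tfidf("retrieval", tokenized[0]), tfidf("system", tokenized[0]))
```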
Simple automatic method for indexing
• Step 1: Tokenization:
This extracts individual terms from the document, converts all letters to lower case, and removes
punctuation marks.
• Step 2: Stop word elimination:
This removes the high-frequency function words that carry little topical content.
• Step 3: Stemming:
This reduces the remaining terms to their linguistic root, to obtain the index terms.
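The three steps can be chained into one pipeline. The stop list and suffix rules below are tiny illustrative stand-ins for real resources, tuned so that the stemming slide's earlier example comes out as shown there.

```python
import string

# End-to-end indexing sketch: tokenize -> remove stop words -> stem.
# Stop list and suffix rules are illustrative samples only.
STOPS = {"a", "an", "the", "of", "in", "for", "at"}
SUFFIXES = ["ation", "ing", "es", "ed", "er", "al", "s", "e"]

def index_terms(text):
    # Step 1: tokenize -- lower-case and strip punctuation.
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    # Step 2: stop word elimination.
    tokens = [t for t in tokens if t and t not in STOPS]
    # Step 3: crude suffix stripping (a stand-in for a real stemmer).
    stems = []
    for t in tokens:
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) - len(suf) >= 3:
                t = t[: -len(suf)]
                break
        stems.append(t)
    return stems

print(index_terms("Design Features of Information Retrieval systems."))
# -> ['design', 'featur', 'inform', 'retriev', 'system']
```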
Interpolated precision
• Precision values are interpolated for a set of recall points.
• Recall levels are 0.0, 0.1, 0.2, …, 1.0. Precision is calculated at each of
these levels and then averaged to get a single value.
• Interpolated precision at a given recall level is the greatest known precision
at any recall level greater than or equal to the given level.
Ex: precision observed at recall points:
Recall    Precision
0.25      1.0
0.4       0.67
0.55      0.8
0.8       0.6
1.0       0.5
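The interpolation rule can be computed directly from the observed points above.

```python
# Interpolated precision at recall level r = the maximum observed
# precision at any recall >= r, using the example's observations.
observed = [(0.25, 1.0), (0.4, 0.67), (0.55, 0.8), (0.8, 0.6), (1.0, 0.5)]

def interpolated_precision(r):
    return max(p for rec, p in observed if rec >= r)

for i in range(11):  # the 11 standard recall levels 0.0 .. 1.0
    r = i / 10
    print(f"recall {r:.1f}: interpolated precision {interpolated_precision(r):.2f}")
```

Note that the observed precision at recall 0.4 is only 0.67, yet the interpolated value there is 0.8, because a higher precision (0.8) is observed at the greater recall level 0.55.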
Lexical resources
• ‘Knowing where the information is, is half the information.’
• Tools and lexical resources for working with NLP:
• WordNet
• FrameNet
• Stemmers
• Taggers
• Parsers
• Text Corpus
WORDNET
• A large lexical database of English, developed at Princeton University.
• Three databases: one for nouns, one for verbs, and one for adjectives
and adverbs.
• Information organized into sets of synonymous words called
synsets.
• synsets are linked to each other by means of lexical and semantic
relations.
• Relation includes synonymy, hyponymy, antonymy,
meronymy/holonymy, troponymy
• WordNet for other languages : Hindi WordNet, EuroWordNet
WordNet applications
• Concept identification in natural language
• Word sense disambiguation
• Automatic Query Expansion
• Document structuring and categorization
• Document summarization
FRAMENET
• FrameNet is a large database of semantically annotated English
sentences.
• Sentences are drawn from the British National Corpus.
• Each word evokes a particular situation with particular participants.
• FrameNet captures situation through frames, and frame elements.
Stemmers
• Reduces an inflected word to its
root form.
• Most commonly used stemmers:
• Porter stemmer (1980)
• Lovins stemmer
• Paice/Husk stemmer
• Snowball stemmers for European
languages
POS Taggers
• Stanford Log-linear POS tagger (97.24% accuracy)
• Postagger
• TnT tagger (Trigrams’n’Tags)
• Brill tagger
• CLAWS (constituent likelihood automatic word-tagging system)
• Tree-tagger
• ACOPOST (A collection of POS taggers)
• Maximum entropy tagger
• Trigram tagger
• Error-driven transformation-based tagger
• Example-based tagger