Module 4 Notes
MODULE-4
CHAPTER-9
INFORMATION RETRIEVAL
9.1 INTRODUCTION
• Information retrieval (IR) deals with the organization, storage, retrieval, and evaluation of information
relevant to a user’s query.
• A user in need of information formulates a request in the form of a query written in a natural language.
• The retrieval system responds by retrieving documents that seem relevant to the query.
The actual text of the document is not used in the retrieval process. Instead, documents in a collection are
frequently represented through a set of index terms or keywords. A keyword can be a single word or a
multi-word phrase, and keywords may be extracted automatically or manually (i.e., specified by a human).
Such a representation provides a logical view of the document. The process of transforming document text
into some representation of it is known as indexing. There are different types of index structures; one data
structure commonly used by IR systems is the inverted index.
An inverted index is simply a list of keywords, with each keyword carrying pointers to the documents
containing that keyword.
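As a rough illustration, the following Python sketch builds such an inverted index over a small toy collection; the document texts are hypothetical, paraphrasing the sentences above.

```python
# A minimal inverted index: each keyword maps to the set of document
# identifiers that contain it. Document texts here are hypothetical.
from collections import defaultdict

documents = {
    "d1": "information retrieval deals with the organization and storage of information",
    "d2": "a user formulates a request in the form of a query",
    "d3": "the retrieval system responds by retrieving relevant documents",
}

inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for keyword in text.lower().split():
        inverted_index[keyword].add(doc_id)

print(sorted(inverted_index["retrieval"]))   # ['d1', 'd3']
print(sorted(inverted_index["query"]))       # ['d2']
```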
The computational cost involved in adopting a full-text logical view (i.e., using the full set of words to
represent a document) is high.
Hence, some text operations are usually performed to reduce the set of representative keywords. The two
most commonly used text operations are stop word elimination and stemming.
Zipf's law can be applied to further reduce the size of the index set.
Not all the terms in a document are equally relevant.
Some might be more important in conveying a document’s content.
Attempts have been made to quantify the significance of index terms to a document by assigning
them numerical values, called weights.
9.2.1 INDEXING
In a small collection of documents, an IR system can access a document to decide its relevance to
a query.
However, in a large collection of documents, this technique poses practical problems.
Hence, a collection of raw documents is usually transformed into an easily accessible
representation. This process is known as indexing.
Most indexing techniques involve identifying good document descriptors, such as keywords or
terms, which describe the information content of documents.
A good descriptor is one that helps describe the content of the document and discriminate it from
other documents in the collection.
It attempts to interpret the structure and meaning of larger units, e.g., at the paragraph and
document level, in terms of words, phrases, clusters, and sentences, and deals with how the meaning
of a sentence is determined by preceding sentences.
Thus, indexing is simply the representation of text (query and document) as a set of terms whose
meaning is equivalent to some content of the original text.
A term can be a single word or a multi-word phrase.
For example, after eliminating the stop word 'of', the sentence Design features of information retrieval
systems can be represented by the set of terms:
(design, features, information, retrieval, systems)
9.2.3 STEMMERS
Stemming normalizes morphological variants, though in a crude manner, by removing affixes
from words to reduce them to their stem. For example, the words compute, computing, computes, and
computer are all reduced to the same word stem, comput.
Thus, the keywords or terms used to represent text are stems, not the actual words.
One of the most widely used stemming algorithms has been developed by Porter (1980). The
stemmed representation of the text, Design features of information retrieval systems, is
(design, featur, inform, retriev, system)
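As a quick check, a minimal sketch using NLTK's implementation of the Porter stemmer (assuming the nltk package is installed) reproduces this stemmed representation:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
text = "Design features of information retrieval systems"
stop_words = {"of"}   # a tiny illustrative stop-word list

stems = [stemmer.stem(w) for w in text.lower().split() if w not in stop_words]
print(stems)   # ['design', 'featur', 'inform', 'retriev', 'system']
```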
One of the problems associated with stemming is that it may throw away useful distinctions. In
some cases, it may be useful to help conflate similar terms, resulting in increased recall.
In others, it may be harmful, resulting in reduced precision (e.g., when documents containing the
term computation are returned in response to the query phrase personal computer). Recall and
precision are the two most commonly used measures of the effectiveness of an information
retrieval system.
Empirical investigations of Zipf's law on large corpora suggest that human languages contain a
small number of words that occur with high frequency and a large number of words that occur
with low frequency.
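As a rough illustration, the sketch below counts word frequencies in a corpus file (the file name is hypothetical) and prints rank × frequency, which Zipf's law predicts to stay roughly constant:

```python
from collections import Counter

# corpus.txt is a placeholder for any large plain-text corpus.
with open("corpus.txt", encoding="utf-8") as f:
    counts = Counter(f.read().lower().split())

for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(rank, word, freq, rank * freq)   # rank * freq stays roughly constant
```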
• An IR model is a pattern that defines several aspects of the retrieval procedure, for example,
how documents and user's queries are represented,
how a system retrieves relevant documents according to users' queries, and
how retrieved documents are ranked.
• The IR system consists of a model for documents, a model for queries, and a matching function
which compares queries to documents.
• The central objective of the model is to retrieve all documents relevant to a query. This defines the
central task of an IR system.
Several different IR models have been developed.
These models differ in the way documents and queries are represented and retrieval is performed.
• Some of them consider documents as sets of terms and perform retrieval based merely on the
presence or absence of one or more query terms in the document.
• Others represent a document as a vector of term weights and perform retrieval based on the
numeric score assigned to each document, representing similarity between the query and the
document.
• These models can be classified as follows:
1. Classical models of IR
2. Non-classical models of IR
3. Alternative models of IR
• The three classical IR models — Boolean, vector, and probabilistic — are based on mathematical
knowledge that is easily recognized and well understood. These models are simple, efficient, and
easy to implement.
An inverted file is a list of keywords and identifiers of the documents in which they occur.
Users are required to express their queries as a Boolean expression consisting of keywords
connected with Boolean logical operators (AND, OR, NOT).
Retrieval is performed based on whether or not a document contains the query terms.
Document Descriptions
D₁: "Information retrieval is concerned with the organization, storage, retrieval, and
evaluation of information relevant to user’s query."
D₂: "A user having an information need formulates a request in natural language."
D₃: "The retrieval system responds by retrieving documents that seem relevant to the
query."
Set of Terms (T):
T = {information, retrieval, query}
Query (q): information AND retrieval
Retrieve R₁:
Documents that contain "information":
R₁ = {d₁, d₂}
Retrieve R₂:
Documents that contain "retrieval":
R₂ = {d₁, d₃}
Intersection (R₁ ∩ R₂):
Documents that contain both terms:
R₁ ∩ R₂ = {d₁}
Final Result
Only document d₁ satisfies the query, so d₁ is retrieved in response to the query.
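A minimal sketch of this Boolean retrieval using Python sets over the three example documents (texts lightly simplified):

```python
documents = {
    "d1": "information retrieval is concerned with the organization storage "
          "retrieval and evaluation of information relevant to a user's query",
    "d2": "a user having an information need formulates a request in natural language",
    "d3": "the retrieval system responds by retrieving documents that seem relevant to the query",
}

def docs_containing(term):
    """Return the set of document ids whose text contains the given term."""
    return {doc_id for doc_id, text in documents.items() if term in text.split()}

r1 = docs_containing("information")   # {'d1', 'd2'}
r2 = docs_containing("retrieval")     # {'d1', 'd3'}
print(r1 & r2)                        # {'d1'} -- the AND query result
```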
No Partial Matching
The model retrieves only documents that fully satisfy the Boolean query.
It cannot retrieve documents that are partially relevant.
All results are binary: a document either satisfies the query completely or not at all.
Example phrase: "all information is ‘to be or not to be’."
No Ranking Mechanism
All documents that satisfy the query are treated as equally relevant; the model provides no way to rank the retrieved set.
• The probabilistic model ranks documents having a probability of relevance at least that of irrelevance in
decreasing order of their relevance.
• Documents are retrieved if their probability of relevance in the ranked list exceeds the cut-off value.
• More formally, if P(R|dⱼ) is the probability of relevance of a document dⱼ for the query q, and P(I|dⱼ) is the
probability of irrelevance, then the set of documents retrieved in response to the query q is:
S = {dⱼ | P(R|dⱼ) ≥ P(I|dⱼ)}
• The probabilistic model, like the vector space model, can return documents that partly
match the user’s query — a key advantage over Boolean models.
• However, a major drawback of the probabilistic model is the determination of a threshold value for the
initially retrieved set.
• The model requires setting a probability threshold to decide which documents are likely relevant, but in
practice the number of relevant documents retrieved by a query is usually too small for these probabilities
to be estimated accurately.
• The vector space model is one of the most well-studied retrieval models.
• The vector space model represents documents and queries as vectors of features representing
terms that occur within them.
• Each document is characterized by a Boolean or numerical vector.
• These vectors are represented in a multi-dimensional space, in which each dimension
corresponds to a distinct term in the corpus of documents.
• In its simplest form, each feature takes a value of either zero or one, indicating the absence
or presence of that term in a document or query.
• More generally, features are assigned numerical values that are usually a function of the
frequency of terms.
• Ranking algorithms compute the similarity between document and query vectors to assign a
retrieval score to each document.
• This score is used to produce a ranked list of retrieved documents.
The term-document matrix for this example (rows correspond to terms t₁, t₂, t₃; columns to documents d₁, d₂, d₃) is:

       d₁   d₂   d₃
t₁      2    1    0
t₂      2    0    1
t₃      1    1    1

Equivalently, as document vectors: d₁ → (2, 2, 1), d₂ → (1, 0, 1), d₃ → (0, 1, 1).
To reduce the importance of the length of document vectors, we normalize document vectors.
Normalization converts all vectors to a standard length. We convert document vectors to unit length
by dividing each element by the overall length of the vector. The elements of each column are divided
by the length of that column vector, given by √(Σᵢ wᵢⱼ²). Normalizing the term-document matrix shown
in this example, we get the following matrix (values rounded to two decimal digits):

       d₁     d₂     d₃
t₁    0.67   0.71   0.00
t₂    0.67   0.00   0.71
t₃    0.33   0.71   0.71
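A short sketch of this length normalization for the example document vectors:

```python
import math

doc_vectors = {"d1": [2, 2, 1], "d2": [1, 0, 1], "d3": [0, 1, 1]}

for doc_id, vec in doc_vectors.items():
    length = math.sqrt(sum(w * w for w in vec))   # Euclidean length of the column vector
    unit = [round(w / length, 2) for w in vec]    # divide each weight by that length
    print(doc_id, unit)

# d1 [0.67, 0.67, 0.33]
# d2 [0.71, 0.0, 0.71]
# d3 [0.0, 0.71, 0.71]
```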
• Term weighting is a technique used in information retrieval and text mining to assign
importance to terms (usually words) in documents. The goal is to reflect how relevant a
term is within a specific document and across a collection of documents. Two factors are
usually considered: how frequently a term occurs within a document, and how the term is
distributed across the collection.
• The first factor simply means that terms occurring more frequently represent the document's
meaning more strongly than those occurring less frequently, and hence should be given higher
weights. In the simplest form, this weight is the raw frequency of the term in the document,
as discussed earlier.
• The second factor considers term distribution across the document collection.
Terms occurring in a few documents are useful for distinguishing those documents from the
rest of the collection. Conversely, terms that occur frequently across the entire collection
are less helpful in discriminating among documents.
• This requires a measure that varies inversely with the number of documents in which a term
appears, i.e., the fraction n/nᵢ, where n is the total number of documents in the collection and
nᵢ is the number of documents containing term i.
• As the number of documents in any collection is usually large, the log of this measure is
usually taken. This results in the inverse document frequency (idf) term weight:
idfᵢ = log(n / nᵢ)
• Inverse document frequency (idf) attaches more importance to more specific terms. If a term
occurs in all documents in a collection, its idf is 0.
Sparck-Jones showed experimentally that such a weight, termed inverse document frequency, leads
to more effective retrieval. Later researchers attempted to combine term frequency (tf) and idf
weights, resulting in a family of tf × idf weighting schemes having the following general form:

wᵢⱼ = tfᵢⱼ × idfᵢ = tfᵢⱼ × log(n / nᵢ)
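The sketch below computes such a tf × idf weight (raw tf, idf = log(n/nᵢ)) over a toy collection; the documents are reduced to keyword lists for brevity:

```python
import math
from collections import Counter

documents = {
    "d1": "information retrieval organization storage retrieval evaluation information query",
    "d2": "user information need request natural language",
    "d3": "retrieval system responds retrieving documents relevant query",
}

n = len(documents)
term_freqs = {doc_id: Counter(text.split()) for doc_id, text in documents.items()}
doc_freq = Counter()
for tf in term_freqs.values():
    doc_freq.update(tf.keys())   # number of documents each term occurs in

def tf_idf(term, doc_id):
    """Raw term frequency multiplied by log(n / document frequency)."""
    return term_freqs[doc_id][term] * math.log(n / doc_freq[term])

print(round(tf_idf("information", "d1"), 3))   # tf = 2, term occurs in 2 of 3 documents
print(round(tf_idf("retrieval", "d2"), 3))     # 0.0 -- term absent from d2
```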
• A third factor that may affect weighting function is the document length.
• A term appearing the same number of times in a short document and in a long document
will be more valuable to the former.
• Most weighting schemes can thus be characterized by the following three factors:
1. Within-document frequency or term frequency (tf)
2. Collection frequency or inverse document frequency (idf)
3. Document length
Any term weighting scheme can be represented by a triple ABC. The letter A in this triple represents
the way the tf component is handled, B indicates the way the idf component is incorporated, and C
represents the length normalization component.
• Different combinations of options can be used to represent document and query vectors. The
retrieval models themselves can be represented by a pair of triples like nnn.nnn (doc =
'nnn', query = 'nnn'), where the first triple corresponds to the weighting strategy used for
the document terms and the second triple to the weighting strategy used for the query terms.
Different choices for A, B, and C for query and document vectors yield different weighting
schemes, for example, ntc-ntc, lnc-ltc, etc.
The choices for tf (term frequency) are:
n: use the raw term frequency
b: binary (i.e., neglect term frequency; term frequency will be 1 if the term is present in the
document, otherwise 0)
a: augmented normalized frequency
l: logarithmic term frequency
L: logarithmic frequency normalized by average term frequency
The options for idf are:
n: use 1.0 (ignore idf factor)
t: use idf
The possible options listed in Table 9.3 for document length normalization are:
n: no normalization
c: cosine normalization
To achieve cosine normalization, every element of the term weight vector is divided by the
Euclidean length of the vector. This is called cosine normalization because the length of the
normalized vector is 1, and its projection on any axis in document space gives the cosine of the angle
between the vector and the axis under consideration.
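As an illustration, the sketch below implements one common reading of the 'ltc' combination (logarithmic tf, idf, cosine normalization); the counts and collection statistics are hypothetical:

```python
import math

def ltc_weights(tf_counts, doc_freq, n_docs):
    """tf_counts: term -> raw frequency in one document.
    doc_freq:  term -> number of documents containing the term.
    n_docs:    total number of documents in the collection."""
    raw = {
        term: (1 + math.log(tf)) * math.log(n_docs / doc_freq[term])   # l and t components
        for term, tf in tf_counts.items() if tf > 0
    }
    length = math.sqrt(sum(w * w for w in raw.values()))               # c: cosine normalization
    return {term: w / length for term, w in raw.items()} if length else raw

# Hypothetical counts for one document in a 1000-document collection.
print(ltc_weights({"retrieval": 3, "query": 1}, {"retrieval": 50, "query": 400}, 1000))
```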
9.4.5 SIMILARITY MEASURES
The vector space model represents documents and queries as vectors in a multi-dimensional space.
Retrieval is performed by measuring the 'closeness' of the query vector to the document vectors.
Documents can then be ranked according to the numeric similarity between the query and the
document.
In the vector space model, the documents selected are those that are geometrically closest to the query
according to some measure.
Figure 9.4 gives an example of document and query representation in two-dimensional vector space.
These dimensions correspond to the two index terms tᵢ and tⱼ. Document d₁ has two occurrences of tᵢ,
document d₂ has one occurrence of tᵢ, and document d₃ has one occurrence each of tᵢ and tⱼ.
Documents d₁, d₂, and d₃ are represented in this space using term weights (raw term frequency being
used here) as coordinates. The angles between the documents and the query are represented as θ₁, θ₂,
and θ₃ respectively.
The simplest way of comparing document and query is by counting the number of terms they
have in common. One frequently used similarity measure is the cosine similarity, which measures the
cosine of the angle between the query and document vectors:

cos(d, q) = (d · q) / (|d| |q|)
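A minimal sketch ranking the earlier example documents against a hypothetical query vector by cosine similarity:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc_vectors = {"d1": [2, 2, 1], "d2": [1, 0, 1], "d3": [0, 1, 1]}
query = [1, 1, 0]   # hypothetical query containing the first two terms once each

ranked = sorted(doc_vectors.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
for doc_id, vec in ranked:
    print(doc_id, round(cosine(query, vec), 2))
# d1 ranks highest (0.94); d2 and d3 tie at 0.5
```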
• Examples include information logic model, situation theory model, and interaction model.
• The information logic model is based on a special logic technique called logical imaging.
Retrieval is performed by making inferences from document to query.
• This is unlike classical models, where a search process is used. Unlike the usual implication,
which is true in all cases except when the antecedent is true and the consequent is false, this
inference is uncertain.
• Hence, a measure of uncertainty is associated with this inference. The principle put forward
by van Rijsbergen is used to measure this uncertainty.
• This principle says: Given any two sentences x and y, a measure of the uncertainty of y → x
relative to a given data set is determined by the minimal extent to which one has to add
information to the data set in order to establish the truth of y → x.
• In the fuzzy model, the document is represented as a fuzzy set of terms, i.e., a set of pairs
[tᵢ, μ(tᵢ)], where μ is the membership function.
• The membership function assigns to each term of the document a numeric membership degree.
• The membership degree expresses the significance of the term to the information contained in the
document.
• Usually, the significance values (weights) are assigned based on the number of occurrences of
the term in the document and in the entire document collection, as discussed earlier.
A document dⱼ can thus be represented as a vector of term weights, as in the vector space model:
dⱼ = (w₁ⱼ, w₂ⱼ, w₃ⱼ, ..., wᵢⱼ, ..., wₘⱼ)ᵗ
Latent Semantic Indexing (LSI) is an information retrieval method based on the mathematical
technique of Singular Value Decomposition (SVD). The central idea is that there is an underlying
"latent" or hidden semantic structure in the usage of words across a collection of documents.
Traditional retrieval methods face issues with synonymy (different words sharing the same
meaning) and polysemy (one word having multiple meanings).
LSI helps overcome these issues by identifying deeper semantic associations between words
and documents, beyond simple term matching.
Documents and terms form a matrix with rows representing terms and columns representing
documents.
Entries usually represent term frequency (TF), TF-IDF, or some weighted value indicating
importance.
The document collection is first processed to get an m×n term-by-document matrix, W, where m is
the number of index terms and n is the total number of documents in the collection. Columns in this
matrix represent document vectors, whereas the rows denote term vectors. The matrix element Wᵢⱼ
represents the weight of term i in document j. The weight may be assigned based on term
frequency or some combination of local and global weighting, as in the case of the vector space model.
Singular value decomposition (SVD) of the term-by-document matrix is then computed. Using
SVD, the matrix is represented as the product of three matrices:

W = T S Dᵗ

T and D are orthogonal matrices containing the left and right singular vectors of W, and S is a diagonal
matrix containing the singular values stored in decreasing order. We eliminate small singular values and
approximate the original term-by-document matrix using a truncated SVD. For example, by retaining
only the k largest singular values, along with their corresponding columns in T and D, we get the
following approximation of the original term-by-document matrix in a space of k orthogonal
dimensions, where k is sufficiently smaller than n:

W ≈ Wₖ = Tₖ Sₖ Dₖᵗ
Retrieval is performed by computing the similarity between query vector and document vector. For
example, we can use the cosine similarity measure to rank documents to perform retrieval. In a
keyword-based retrieval, relevant documents that do not share any term with the query are not
retrieved. The LSI-based approach is capable of retrieving such documents, as similarity is computed
based on the overall pattern of term usage across the document collection rather than on term overlap.
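A rough sketch of LSI with NumPy's SVD; the 5-term × 4-document matrix W and the query are hypothetical, and the query is folded into the latent space via the usual q·Tₖ·Sₖ⁻¹ projection:

```python
import numpy as np

# Hypothetical 5-term x 4-document weight matrix W.
W = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

T, s, Dt = np.linalg.svd(W, full_matrices=False)    # W = T @ diag(s) @ Dt
k = 2
W_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]          # rank-k approximation of W

q = np.array([1, 0, 0, 0, 1], dtype=float)           # query containing terms 1 and 5
q_k = q @ T[:, :k] @ np.linalg.inv(np.diag(s[:k]))   # fold the query into k dimensions
docs_k = Dt[:k, :].T                                  # documents in the latent space
scores = docs_k @ q_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k))
print(np.argsort(-scores))                            # document indices, best match first
```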
Relevance Ranking: Determining which documents are most relevant to a query remains challenging
due to variations in language and meaning.
Query Understanding: Users often input vague or ambiguous queries, making it difficult for the
system to interpret intent accurately.
Scalability: Handling and searching through massive volumes of data efficiently requires high
computational resources and optimized indexing.
Synonymy & Polysemy: Words with multiple meanings (polysemy) and different words with the same
meaning (synonymy) create confusion in matching queries with documents.
User Personalization: Tailoring results based on user preferences and history without breaching
privacy is complex.
Multilingual Retrieval: Supporting queries and documents in multiple languages adds another layer
of complexity.
***CHAPTER ENDS***
CHAPTER-12
LEXICAL RESOURCES
12.1 INTRODUCTION
A whole range of tools and lexical resources have been developed to ease the task of researchers
working with natural language processing (NLP).
Many of these are open source, i.e., readers can download them from the Internet.
This chapter introduces some of the freely available resources.
The motivation behind including this chapter comes from the belief that knowing where the
information is, is half of the information.
12.2 WORDNET
WordNet¹ (Miller 1990, 1995) is a large lexical database for the English language.
Inspired by psycholinguistic theories, it was developed and is being maintained at the Cognitive
Science Laboratory, Princeton University, under the direction of George A. Miller.
WordNet consists of three databases — one for nouns, one for verbs, and one for both adjectives
and adverbs.
Information is organized into sets of synonymous words, called synsets, each representing one
base concept.
The synsets are linked to each other by means of lexical and semantic relations.
Lexical relations occur between word-forms (i.e., senses) and semantic relations between word
meanings.
These relations include synonymy, hypernymy/hyponymy, antonymy, meronymy/holonymy,
troponymy, etc.
A word may appear in more than one synset and in more than one part-of-speech.
The meaning of a word is called a sense. WordNet lists all senses of a word, each sense
belonging to a different synset.
WordNet’s sense-entries consist of a set of synonyms and a gloss.
A gloss consists of a dictionary-style definition and examples demonstrating the use of a
synset in a sentence.
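A short sketch using NLTK's WordNet interface (assuming nltk is installed and the wordnet data has been downloaded) to list synsets, their glosses, and a semantic relation:

```python
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

for synset in wn.synsets("bank"):
    print(synset.name(), "|", synset.definition())                 # gloss (definition)
    print("   lemmas:", [lemma.name() for lemma in synset.lemmas()])

# A semantic relation: hypernyms of the first noun sense of 'dog'.
print(wn.synsets("dog", pos=wn.NOUN)[0].hypernyms())
```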
WordNets for other languages have also been developed, e.g., EuroWordNet and Hindi
WordNet. EuroWordNet covers European languages, including English, Dutch, Spanish,
Italian, German, French, Czech, and Estonian. Other than language internal relations, it also
contains multilingual relations from each WordNet to English meanings.
Hindi WordNet has been developed by CFILT (Resource Center for Indian Language
Technology Solutions), IIT Bombay. Its database consists of more than 26,208 synsets and
56,928 Hindi words. It is organized using the same principles as English WordNet but includes
some Hindi-specific relations (e.g., causative relations). A total of 16 relations have been used
in Hindi WordNet. Each entry consists of synset, gloss, and position of synset in ontology.
Figure 12.7 shows the Hindi WordNet entry for the word "आकांक्षा" (aakanksha).
APPLICATIONS:
Document Summarization
WordNet has been used for creating lexical chains to aid in automatic text
summarization.
12.3 FRAMENET
FrameNet is a large database of semantically annotated English sentences. It is based on
principles of frame semantics.
FrameNet defines a set of semantic roles called frame elements.
Sentences from the British National Corpus are tagged with these frame elements.
The core idea: each word evokes a particular situation involving certain participants.
FrameNet captures these situations using case-frame representations of words (verbs, adjectives,
nouns).
The word that evokes a frame is called the target word or predicate.
Participants (entities) are called frame elements (like roles).
The FrameNet ontology is a semantic-level model of predicate-argument structure.
Each frame includes:
A main lexical item (predicate)
Frame-specific semantic roles (example: in the "ARREST" frame, roles are AUTHORITIES,
TIME, and SUSPECT).
For example:
In sentence (12.1), the word 'nab' is the target word (a verb) in the ARREST frame, linked with
roles like AUTHORITIES and SUSPECT.
APPLICATIONS:
Gildea and Jurafsky (2002) and Kwon et al. (2004) used FrameNet data for automatic
semantic parsing.
The shallow semantic roles from FrameNet help in information extraction.
Example: Even if syntactic roles differ, semantic roles can identify that the theme
remains the same.
Sentences:
"The umpire stopped the match." (12.4)
"The match stopped due to bad weather." (12.5)
In (12.4), ‘match’ is the object, while in (12.5), it is the subject — but the semantic role
(theme) is the same.
12.4 STEMMERS
Stemming, often called conflation, is the process of reducing inflected (or sometimes derived) words
to their base or root form. The stem need not be identical to the morphological base of the word; it is
usually sufficient that related words map to the same stem, even if this stem is not in itself a valid
root.
Stemming is useful in search engines for query expansion or indexing and other NLP problems.
Stemming programs are commonly referred to as stemmers. The most common algorithm for
stemming English is Porter's algorithm (Porter 1980).
Figure 12.10 shows a sample text and output produced using these stemmers.
There are many stemmers available for English and other languages. Snowball presents
stemmers for English, Russian, and a number of other European languages, including French,
Spanish, Portuguese, Hungarian, Italian, German, Dutch, Swedish, Norwegian, Danish, and
Finnish. The links for stemming algorithms for these languages can be found at:
http://snowball.tartarus.org/texts/stemmersoverview.html
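A brief sketch using NLTK's port of the Snowball stemmers for two of the languages listed above (the English output shown in the comment is indicative):

```python
from nltk.stem.snowball import SnowballStemmer

print(SnowballStemmer.languages)         # languages supported by NLTK's Snowball port

english = SnowballStemmer("english")
french = SnowballStemmer("french")
print(english.stem("computational"))     # comput
print(french.stem("continuellement"))    # French stem of 'continuellement'
```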
Standard stemmers are not yet available for Hindi and other Indian languages. The major
research on Hindi stemming has been accomplished by Ramanathan and Rao (2003) and
Majumder et al. (2007). Ramanathan and Rao based their work on handcrafted suffix lists.
Majumder et al. used a cluster-based approach to find root words and their morphological
variants.
They evaluated their approach and found that task-based stemming improves recall for
Indian languages, as shown in their observation on Bengali data. The Resource Centre of
Indian Language Technology (CFILT), IIT Bombay has also developed stemmers for
Indian languages, available at:
http://www.cfilt.iitb.ac.in
Stemmers are common in search and retrieval systems like web search engines. Stemming
reduces word variants to a common base form, minimizing index size and helping retrieve
related documents.
Example: A search for "astronauts" also retrieves "astronaut".
This boosts coverage, but sometimes reduces precision, especially in English queries.
Stemming is also used in text summarization and categorization, where it helps in
identifying features by reducing various morphological forms of words to their stems.
Part-of-speech tagging is used at an early stage of text processing in many NLP applications such as
speech synthesis, machine translation, IR (Information Retrieval), and information extraction.
The Stanford POS tagger (Toutanova et al. 2003) has the following features:
I. It makes explicit use of both the preceding and following tag contexts via a dependency
network representation.
II. It uses a broad range of lexical features.
III. It utilizes priors in conditional log-linear models.
The reported accuracy of this tagger on the Penn Treebank WSJ is 97.24%, with an error
reduction of 4.4% over the best previous result (Toutanova et al. 2003).
Details: http://nlp.stanford.edu/software/tagger.shtml
Brill (1992) introduced a trainable rule-based tagger that performs comparably to stochastic
taggers.
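As a generic illustration of part-of-speech tagging (using NLTK's default tagger, not the Stanford or Brill taggers discussed above):

```python
import nltk   # requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("Information retrieval systems index large document collections.")
print(nltk.pos_tag(tokens))   # list of (token, Penn Treebank tag) pairs
```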
12.5.6 Tree-Tagger
Research corpora have been developed for a number of NLP-related tasks. In the following section,
we point out a few of the available standard document collections for a variety of NLP-related tasks,
along with their Internet links.
We have already provided a list of IR test document collections in Chapter 9. Glasgow University,
UK, maintains a list of freely available IR test collections. Table 12.2 lists the sources of those and a
few more IR test collections.
LETOR (learning to rank) is a package of benchmark data sets released by Microsoft Research Asia.
It consists of two datasets: OHSUMED and TREC (TD2003 and TD2004). LETOR is packaged with
extracted features for each query-document pair in the collection, baseline results of several state-of-
the-art learning-to-rank algorithms on the data, and evaluation tools. The data set is aimed at
supporting future research in the area of learning ranking functions for information retrieval.
Evaluating a text summarization system requires the existence of 'gold' summaries. DUC provides
document collections with known extracts and abstracts, which are used for evaluating the performance
of summarization systems submitted at TREC conferences. Figure 12.11 shows a sample document
and its extract from the DUC 2002 summarization data.
The multilingual EMILLE corpus is the result of the Enabling Minority Language Engineering
(EMILLE) project at Lancaster University, UK.
Key points:
Focuses on data generation, software resources, and basic NLP tools for South Asian
languages.
CIIL (Central Institute of Indian Languages), the Indian partner in the project, extended the
language coverage to include more Indian languages.
Covers monolingual and parallel/annotated corpora for a variety of genres.
EMILLE/CIIL corpus is freely available for research use at:
http://www.elda.org/catalogue/en/text/W0037.html
Manual: http://www.emille.lancs.ac.uk/manual.pdf
Corpus Details:
Monolingual corpus: Written text for 14 South Asian languages and spoken text for 5
languages (Hindi, Bengali, Gujarati, Punjabi, Urdu).
Spoken data: Derived from BBC Asia Network broadcasts.
Parallel corpus: English text + translations in 5 languages; includes UK government
leaflets.
Annotated corpus includes:
o POS tagging
o Urdu translation
o Hindi corpus annotated for demonstrative use
******CHAPTER ENDS*****