
NATURAL LANGUAGE PROCESSING BAI601/BAD613B

MODULE-4
CHAPTER-9
INFORMATION RETRIEVAL
9.1 INTRODUCTION
• Information retrieval (IR) deals with the organization, storage, retrieval, and evaluation of information
relevant to a user’s query.
• A user in need of information formulates a request in the form of a query written in a natural language.
• The retrieval system responds by retrieving documents that seem relevant to the query.

9.2 DESIGN FEATURES OF INFORMATION RETRIEVAL SYSTEM


• It begins with the user’s information need. Based on this need, he/she formulates a query.
• The IR system returns documents that seem relevant to the query. This is an engineering account of
the IR system.
• The basic question involved is: what constitutes the information in the documents and the queries?
• This, in turn, is related to the problem of representation of documents and queries.
• The retrieval is performed by matching the query representation with document representation.

The actual text of the document is not used in the retrieval process. Instead, documents in a collection are
frequently represented through a set of index terms or keywords. Keywords can be single word or multi-
word phrases. They might be extracted automatically or manually (i.e., specified by a human). Such a
representation provides a logical view of the document. The process of transforming document text to
some representation of it is known as indexing. There are different types of index structures. One data
structure commonly used by IR systems is the inverted index.
An inverted index is simply a list of keywords, with each keyword carrying pointers to the documents
containing that keyword.
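As a rough illustration (not from the notes), an inverted index can be sketched in Python as a mapping from each keyword to the set of identifiers of the documents containing it; the toy documents below are assumptions made for the example.

    from collections import defaultdict

    def build_inverted_index(docs):
        """Map each keyword to the set of ids of documents containing it."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for word in text.lower().split():
                index[word].add(doc_id)
        return index

    # hypothetical toy collection
    docs = {
        "d1": "information retrieval deals with storage and retrieval of information",
        "d2": "a user formulates a request in natural language",
    }
    index = build_inverted_index(docs)
    print(sorted(index["retrieval"]))   # ['d1']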

The computational cost involved in adopting a full text logical view (i.e., using a full set of words to
represent a document) is high.

Hence, some text operations are usually performed to reduce the set of representative keywords. The two
most commonly used text operations are

 Stop word elimination
 Stemming
Stop word elimination removes grammatical or functional words, while stemming reduces words to their
common grammatical roots.

 Zipf’s law can be applied to further reduce the size of the index set.
 Not all the terms in a document are equally relevant.
 Some might be more important in conveying a document’s content.
 Attempts have been made to quantify the significance of index terms to a document by assigning
them numerical values, called weights.

9.2.1 INDEXING
 In a small collection of documents, an IR system can access a document to decide its relevance to
a query.
 However, in a large collection of documents, this technique poses practical problems.
 Hence, a collection of raw documents is usually transformed into an easily accessible
representation. This process is known as indexing.
 Most indexing techniques involve identifying good document descriptors, such as keywords or
terms, which describe the information content of documents.
 A good descriptor is one that helps describe the content of the document and discriminate it from
other documents in the collection.
 It attempts to interpret the structure and meaning of larger units, e.g., at the paragraph and
document level, in terms of words, phrases, clusters, and sentences. It deals with how the meaning
of a sentence is determined by the preceding sentences.
 Thus, indexing is simply the representation of text (query and document) as a set of terms whose
meaning is equivalent to some content of the original text.
 The word term can be a single word or multi-word phrases.
For example, the sentence, Design features of information retrieval systems, can be represented as
follows:

Design, features, information, retrieval, systems


It can also be represented by the set of terms:

Design, features, information retrieval, information retrieval systems.


 These multi-word terms can be obtained by looking at frequently appearing sequences of words, n-
grams, part-of-speech tags, or by applying NLP to identify meaningful phrases or handcrafting.
 POS tagging helps extract meaningful sequences of words; it handles sense ambiguity, as words are
assigned POS based on their local (sentential) context.
 Though statistical approaches to phrase extraction are more efficient, they fail to handle word order
changes and structural variations, which are better handled by syntactic approaches.
 In text retrieval conference (TREC), the method used for phrase extraction is as follows:

1. Any pair of adjacent non-stop words is regarded as a potential phrase.


2. The final list of phrases is composed of those pairs of words that occur in, say, 25 or more
documents in the document collection.
 The NLP is also used in the recognition of proper nouns and the normalization of noun phrases.
Ideally, all names in the text need to be recognized and represented as a single entity, e.g. President
Kalam, President of India, and variants of the same name are recognized as such. Phrase
normalization captures structural variations in phrases.
 For example, the three phrases text categorization, categorization of text, and categorize text, are
normalized to give text categorize.
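A minimal sketch of the TREC-style phrase extraction described above: pairs of adjacent non-stop words are collected, and only pairs occurring in enough documents are kept. The stop-word list, toy documents, and the threshold (2 rather than 25, to suit the toy data) are assumptions.

    from collections import defaultdict

    STOP_WORDS = {"a", "an", "the", "of", "in", "to", "is", "and"}   # assumed small list

    def candidate_phrases(text):
        """Return the pairs of adjacent non-stop words in one document."""
        words = text.lower().split()
        return {(w1, w2) for w1, w2 in zip(words, words[1:])
                if w1 not in STOP_WORDS and w2 not in STOP_WORDS}

    def extract_phrases(docs, min_doc_freq=2):
        """Keep pairs that occur in at least min_doc_freq documents."""
        doc_freq = defaultdict(int)
        for text in docs:
            for pair in candidate_phrases(text):
                doc_freq[pair] += 1
        return sorted(pair for pair, df in doc_freq.items() if df >= min_doc_freq)

    docs = ["design features of information retrieval systems",
            "information retrieval systems respond to a user query"]
    print(extract_phrases(docs))   # [('information', 'retrieval'), ('retrieval', 'systems')]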

9.2.2 ELIMINATING STOP WORDS


 The lexical processing of index terms involves elimination of stop words. Stop words are high
frequency words which have little semantic weight and are thus, unlikely to help in retrieval.
 These words play important grammatical roles in language, such as in the formation of phrases, but
do not contribute to the semantic content of a document in a keyword-based representation. Such
words are commonly used in documents, regardless of topics, and thus, have no topical specificity.
 Typical examples of stop words are articles and prepositions.
 Eliminating them considerably reduces the number of index terms. The drawback of eliminating
stop words is that it can sometimes result in the elimination of useful index terms, for instance the
stop word A in Vitamin A.
 Some phrases, like to be or not to be, consist entirely of stop words.
 Eliminating stop words in such cases makes it impossible to correctly search for such a phrase.
 Table 9.1 shows some stop words in English.
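A one-function sketch of stop word elimination; the stop-word list here is an assumed fragment, not Table 9.1.

    STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "is", "and", "with"}  # assumed fragment

    def remove_stop_words(text):
        """Drop high-frequency function words before indexing."""
        return [w for w in text.lower().split() if w not in STOP_WORDS]

    print(remove_stop_words("Design features of information retrieval systems"))
    # ['design', 'features', 'information', 'retrieval', 'systems']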

9.2.3 STEMMERS
 Stemming normalizes morphological variants, though in a crude manner, by removing affixes
from words to reduce them to their stem, e.g., the words compute, computing, computes, and
computer are all reduced to the same word stem, comput.
 Thus, the keywords or terms used to represent text are stems, not the actual words.
 One of the most widely used stemming algorithms has been developed by Porter (1980). The
stemmed representation of the text, Design features of information retrieval systems, is
(design, featur, inform, retriev, system)
 One of the problems associated with stemming is that it may throw away useful distinctions. In
some cases, it may be useful to help conflate similar terms, resulting in increased recall.
 In others, it may be harmful, resulting in reduced precision (e.g., when documents containing the
term computation are returned in response to the query phrase personal computer). Recall and
precision are the two most commonly used measures of the effectiveness of an information
retrieval system.
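A minimal sketch using the Porter stemmer implementation that ships with NLTK (assumed to be installed); on the example phrase it should give stems along the lines of design, featur, inform, retriev, system.

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    words = "Design features of information retrieval systems".lower().split()
    print([stemmer.stem(w) for w in words])
    # roughly: ['design', 'featur', 'of', 'inform', 'retriev', 'system']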

9.2.4 ZIPF’S LAW


 Zipf made an important observation on the distribution of words in natural languages.
 This observation has been named Zipf’s law. Simply stated, Zipf’s law says that the frequency of
words multiplied by their ranks in a large corpus is more or less constant.
 More formally, Frequency × rank ≈ constant.
 This means that if we compute the frequencies of the words in a corpus, and arrange them in
decreasing order of frequency, then the product of the frequency of a word and its rank is
approximately equal to the product of the frequency and rank of another word.
 This indicates that the frequency of a word is inversely proportional to its rank. Plotted as frequency
against rank, this relationship appears as a hyperbolic (inverse) curve.

 Empirical investigation of Zipf’s law on large corpora suggests that human languages contain a
small number of words that occur with high frequency and a large number of words that occur
with low frequency.

 In between is a middling number of medium-frequency terms. This distribution has important
significance in IR.
 The high frequency words, being common, have less discriminating power, and thus, are not useful
for indexing. Low frequency words are less likely to be included in the query, and are also not
useful for indexing.
 As there are a large number of rare (low frequency) words, dropping them considerably reduces
the size of a list of index terms.
 The remaining medium frequency words are content-bearing terms and can be used for indexing.
 This can be implemented by defining thresholds for high and low frequency, and dropping words
that have frequencies above or below these thresholds. Stop word elimination can be thought of
as an implementation of Zipf’s law, where high frequency terms are dropped from a set of index
terms.
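A small sketch for checking Zipf's observation on any plain-text corpus: rank the words by frequency and print frequency × rank, which should stay roughly constant. The corpus file name is a placeholder.

    from collections import Counter

    def zipf_table(text, top=10):
        """Print frequency, rank, and their product for the top-ranked words."""
        ranked = Counter(text.lower().split()).most_common()
        for rank, (word, freq) in enumerate(ranked[:top], start=1):
            print(f"{rank:3d}  {word:15s} freq={freq:7d}  freq*rank={freq * rank}")

    with open("corpus.txt", encoding="utf-8") as f:   # placeholder corpus file
        zipf_table(f.read())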

9.3 INFORMATION RETRIEVAL MODELS

• An IR model is a pattern that defines several aspects of the retrieval procedure, for example,
 how documents and user's queries are represented,
 how a system retrieves relevant documents according to users' queries, and
 how retrieved documents are ranked.
• The IR system consists of a model for documents, a model for queries, and a matching function
which compares queries to documents.
• The central objective of the model is to retrieve all documents relevant to a query. This defines the
central task of an IR system.
Several different IR models have been developed.
These models differ in the way documents and queries are represented and retrieval is performed.

• Some of them consider documents as sets of terms and perform retrieval based merely on the
presence or absence of one or more query terms in the document.
• Others represent a document as a vector of term weights and perform retrieval based on the
numeric score assigned to each document, representing similarity between the query and the
document.
• These models can be classified as follows:
1. Classical models of IR
2. Non-classical models of IR
3. Alternative models of IR
• The three classical IR models — Boolean, vector, and probabilistic — are based on mathematical
knowledge that is easily recognized and well understood. These models are simple, efficient, and
easy to implement.

9.4 CLASSICAL INFORMATION RETRIEVAL MODELS

9.4.1 BOOLEAN MODEL


 The Boolean model is the oldest of the three classical models.
 It is based on Boolean logic and classical set theory. In this model, documents are represented as
a set of keywords, usually stored in an inverted file.

 An inverted file is a list of keywords and identifiers of the documents in which they occur.
 Users are required to express their queries as a Boolean expression consisting of keywords
connected with Boolean logical operators (AND, OR, NOT).
 Retrieval is performed based on whether or not a document contains the query terms.

• Example 9.1: Let the set of original documents be D = {D₁, D₂, D₃}.

Document Descriptions
D₁: "Information retrieval is concerned with the organization, storage, retrieval, and
evaluation of information relevant to user’s query."

D₂: "A user having an information need formulates a request in natural language."
D₃: "The retrieval system responds by retrieving documents that seem relevant to the
query."
Set of Terms (T):
T = {information, retrieval, query}

Set of Documents (D):


D = {d₁, d₂, d₃}
Where:
d₁ = {information, retrieval, query}
d₂ = {information, query}
d₃ = {retrieval, query}
Query
Q = information ∧ retrieval
(User wants documents containing both “information” and “retrieval”)
Step-by-Step Retrieval Process
Retrieve R₁:
Documents that contain "information":

R₁ = {d₁, d₂}

Retrieve R₂:
Documents that contain "retrieval":
R₂ = {d₁, d₃}
Intersection (R₁ ∩ R₂):
Documents that contain both terms:
R₁ ∩ R₂ = {d₁}

Final Result

 The system retrieves d₁ in response to the query Q = information ∧ retrieval.


 If more than one document has the same representation, every such document is retrieved.
 Boolean information retrieval does not differentiate between these documents.
 With an inverted index, this simply means taking an intersection of the list of the documents
associated with the keywords information and retrieval.
 Boolean retrieval models have been used in IR systems for a long time. They are simple,
efficient, and easy to implement and perform well in terms of recall and precision if the query
is well formulated. However, the model suffers from certain drawbacks:

No Partial Matching

 The model retrieves only documents that fully satisfy the Boolean query.
 It cannot retrieve documents that are partially relevant.
 All results are binary: a document either satisfies the query completely or not at all.
 Example phrase: "all information is ‘to be or not to be’."

No Ranking Mechanism

 The model does not rank documents based on relevance.


 It only checks for presence or absence of keywords.
 It does not account for keyword frequency or importance in a document.

Unnatural Query Formation

 Users rarely write queries in strict Boolean logic.


 Boolean expressions (using AND, OR, NOT) are not intuitive for most users.
 This creates a gap between user intent and system expectation.
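Boolean AND retrieval with an inverted index reduces to a set intersection, as in Example 9.1; a minimal sketch (postings taken from the example) is given below. OR and NOT would correspond to set union and set difference over the same posting sets.

    def boolean_and(index, *terms):
        """Intersect the posting sets of all query terms (Boolean AND)."""
        postings = [index.get(t, set()) for t in terms]
        return set.intersection(*postings) if postings else set()

    index = {
        "information": {"d1", "d2"},
        "retrieval":   {"d1", "d3"},
        "query":       {"d1", "d2", "d3"},
    }
    print(boolean_and(index, "information", "retrieval"))   # {'d1'}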

9.4.2 PROBABILISTIC MODEL


• The probabilistic model applies a probabilistic framework to IR.
• It ranks documents based on the probability of their relevance to a given query.
• Retrieval depends on whether the probability of relevance (relative to a query) of a document is higher
than that of non-relevance, and on whether it exceeds a threshold value.
• Given a set of documents D, a query q, and a cut-off value α, this model first calculates the
probability of relevance and irrelevance of a document to the query.

• It then ranks documents having probabilities of relevance at least that of irrelevance in decreasing order
of their relevance.
• Documents are retrieved if the probability of relevance in the ranked list exceeds the cut off value.
• More formally, if P(R|dⱼ) is the probability of relevance of a document dⱼ for query q, and P(I|dⱼ) is the
probability of irrelevance, then the set of documents retrieved in response to the query q is:

    S = { dⱼ ∈ D | P(R|dⱼ) ≥ P(I|dⱼ) and P(R|dⱼ) > α }

• The probabilistic model, like the vector space model, can return documents that partly match the
user’s query, which is a key advantage over the Boolean model.
• However, a major drawback is the determination of a threshold value for the initially retrieved set.
• The model requires setting a probability threshold to decide which documents are likely relevant.
• In practice, the number of relevant documents retrieved for a query is usually too small for these
probabilities to be estimated accurately.
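The decision rule above can be sketched as follows; the relevance and irrelevance probabilities are hypothetical placeholders, since estimating them reliably is exactly the difficulty noted in the text.

    def probabilistic_retrieve(p_rel, p_irr, alpha=0.5):
        """Keep documents with P(R|d) >= P(I|d) and P(R|d) > alpha, ranked by P(R|d)."""
        selected = [d for d in p_rel if p_rel[d] >= p_irr[d] and p_rel[d] > alpha]
        return sorted(selected, key=lambda d: p_rel[d], reverse=True)

    # hypothetical probability estimates for three documents
    p_rel = {"d1": 0.9, "d2": 0.4, "d3": 0.7}
    p_irr = {"d1": 0.1, "d2": 0.6, "d3": 0.3}
    print(probabilistic_retrieve(p_rel, p_irr))   # ['d1', 'd3']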

9.4.3 VECTOR SPACE MODEL

• The vector space model is one of the most well-studied retrieval models.
• The vector space model represents documents and queries as vectors of features representing
terms that occur within them.
• Each document is characterized by a Boolean or numerical vector.
• These vectors are represented in a multi-dimensional space, in which each dimension
corresponds to a distinct term in the corpus of documents.
• In its simplest form, each feature takes a value of either zero or one, indicating the absence
or presence of that term in a document or query.
• More generally, features are assigned numerical values that are usually a function of the
frequency of terms.
• Ranking algorithms compute the similarity between document and query vectors to assign
a retrieval score to each document.
• This score is used to produce a ranked list of retrieved documents.

Example 9.2 – Vector Representation

Step 1: Document-Term Vectors

The documents are represented using term frequency vectors:

 Document d₁ → (2, 2, 1)
 Document d₂ → (1, 0, 1)
 Document d₃ → (0, 1, 1)

Each number represents the frequency of a term in that document.

Step 2: Euclidean Space Representation

 Each document vector becomes a point in 3D Euclidean space.


 The coordinates of the point are the term frequencies.
 For example:
d1=P1=(2,2,1)

Step 3: Term-Document Matrix

The same information is expressed as a term-document matrix, with rows representing terms and
columns representing documents:

         d₁   d₂   d₃
    t₁    2    1    0
    t₂    2    0    1
    t₃    1    1    1

To reduce the importance of the length of document vectors, we normalize document vectors.
Normalization changes all vectors to a standard length. We convert document vectors to unit length
by dividing each dimension by the overall length of the vector. Normalizing the term-document
matrix shown in this example, we get the following matrix:

Elements of each column are divided by the length of the column vector, given by

    |dⱼ| = √( Σᵢ wᵢⱼ² )

so that |d₁| = 3, |d₂| = √2, and |d₃| = √2. The normalized matrix is

            d₁     d₂     d₃
    t₁     0.67   0.71   0.00
    t₂     0.67   0.00   0.71
    t₃     0.33   0.71   0.71

The values shown in this matrix have been rounded to two decimal digits.
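A short numpy sketch that reproduces the normalization step of Example 9.2 (unit-length normalization of each document column):

    import numpy as np

    # term-by-document matrix of Example 9.2 (rows = terms, columns = documents)
    W = np.array([[2, 1, 0],
                  [2, 0, 1],
                  [1, 1, 1]], dtype=float)

    lengths = np.sqrt((W ** 2).sum(axis=0))   # Euclidean length of each column
    print(np.round(W / lengths, 2))           # each column divided by its length
    # [[0.67 0.71 0.  ]
    #  [0.67 0.   0.71]
    #  [0.33 0.71 0.71]]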

9.4.4 TERM WEIGHTING


• Each term used as an indexing feature in a document helps discriminate that document from
others.

• Term weighting is a technique used in information retrieval and text mining to assign
importance to terms (usually words) in documents. The goal is to reflect how relevant a
term is within a specific document and across a collection of documents.

• Two factors are commonly considered: how frequently a term occurs within a document, and how the
term is distributed across the document collection.
• The first factor simply means that terms that occur more frequently represent the document’s
meaning more strongly than those occurring less frequently, and hence should be given high

weights. In the simplest form, this weight is the raw frequency of the term in the document,
as discussed earlier.
• The second factor actually considers term distribution across the document collection.
Terms occurring in a few documents are useful for distinguishing those documents from the
rest of the collection. Similarly, terms that occur more frequently across the entire collection
are less helpful while discriminating among documents.
• This requires a measure that varies inversely with the number of documents in which a term occurs,
such as the fraction n / nᵢ, where

• n = total number of documents in the collection
• nᵢ = number of documents in which the term i occurs
• This measure assigns:

• Lowest weight (1) → term appears in all documents

• Highest weight (n) → term appears in only one document

• As the number of documents in any collection is usually large, the log of this measure is
usually taken. This results in the inverse document frequency (idf) term weight:

    idfᵢ = log( n / nᵢ )

 Inverse document frequency (idf) attaches more importance to more specific terms. If a term
occurs in all documents in a collection, its idf is 0.
 Sparck-Jones showed experimentally that such a weight, termed inverse document frequency, leads to
more effective retrieval. Later researchers attempted to combine term frequency (tf) and idf weights,
resulting in a family of tf × idf weighting schemes having the following general form:

    wᵢⱼ = tfᵢⱼ × idfᵢ = tfᵢⱼ × log( n / nᵢ )
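A minimal sketch of the tf × idf weighting just described, computed from raw term frequencies; the toy counts are assumptions.

    import math

    def tf_idf(term_freqs):
        """term_freqs: one dict per document, mapping term -> raw frequency."""
        n = len(term_freqs)
        doc_freq = {}
        for doc in term_freqs:
            for term in doc:
                doc_freq[term] = doc_freq.get(term, 0) + 1
        return [{t: f * math.log(n / doc_freq[t]) for t, f in doc.items()}
                for doc in term_freqs]

    docs = [{"information": 2, "retrieval": 2, "query": 1},
            {"information": 1, "query": 1},
            {"retrieval": 1, "query": 1}]
    print(tf_idf(docs))   # 'query' occurs in all documents, so its weight is 0 everywhere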

• A third factor that may affect weighting function is the document length.

• A term appearing the same number of times in a short document and in a long document will be
more significant for the short document.

• Most weighting schemes can thus be characterized by the following three factors:
1. Within-document frequency or term frequency (tf)
2. Collection frequency or inverse document frequency (idf)
3. Document length

Any term weighting scheme can be represented by a triple ABC. The letter A in this triple represents
the way the tf component is handled, B indicates the way the idf component is incorporated, and C
represents the length normalization component.

• Different combinations of options can be used to represent document and query vectors. The
retrieval models themselves can be represented by a pair of triples like nnn.nnn (doc =
‘nnn’, query = ‘nnn’), where the first triple corresponds to the weighting strategy used for
the documents and the second triple to the weighting strategy used for the query term.

Different choices of A, B, and C for query and document vectors yield different weighting
schemes, for example, ntc-ntc, lnc-ltc, etc.
The choices for tf (term frequency) are:
 n: use the raw term frequency

 b: binary (i.e., neglect term frequency; term frequency will be 1 if the term is present in the
document, otherwise 0)
 a: augmented normalized frequency
 l: logarithmic term frequency
 L: logarithmic frequency normalized by average term frequency
The options for idf are:
 n: use 1.0 (ignore idf factor)
 t: use idf
The possible options listed in Table 9.3 for document length normalization are:

 n: no normalization
 c: cosine normalization

To achieve cosine normalization, every element of the term weight vector is divided by the
Euclidean length of the vector. This is called cosine normalization because the length of the
normalized vector is 1, and its projection on any axis in document space gives the cosine of the angle
between the vector and the axis under consideration.
9.4.5 SIMILARITY MEASURES
Vector space model represents documents and queries as vectors in a multi-dimensional space.
Retrieval is performed by measuring the ‘closeness’ of the query vector to document vector.

Documents can then be ranked according to the numeric similarity between the query and the
document.

In the vector space model, the documents selected are those that are geometrically closest to the query
according to some measure.

Figure 9.4 gives an example of document and query representation in two-dimensional vector space.
These dimensions correspond to the two index terms tᵢ and tⱼ. Document d₁ has two occurrences of tᵢ,
document d₂ has one occurrence of tᵢ, and document d₃ has one occurrence each of tᵢ and tⱼ.
Documents d₁, d₂, and d₃ are represented in this space using term weights (raw term frequency here)
as coordinates. The angles between the documents and the query are represented as θ₁, θ₂,
and θ₃ respectively.

The simplest way of comparing a document and a query is by counting the number of terms they
have in common. One frequently used similarity measure is the cosine of the angle between the
query and document vectors:

    sim(dⱼ, q) = ( Σᵢ wᵢⱼ · wᵢq ) / ( √(Σᵢ wᵢⱼ²) · √(Σᵢ wᵢq²) )
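A sketch of cosine-similarity ranking over the document vectors of Example 9.2; the query vector is an assumption made for illustration.

    import numpy as np

    def cosine(u, v):
        """Cosine of the angle between two term-weight vectors."""
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    docs = {"d1": np.array([2.0, 2.0, 1.0]),
            "d2": np.array([1.0, 0.0, 1.0]),
            "d3": np.array([0.0, 1.0, 1.0])}
    query = np.array([1.0, 1.0, 0.0])          # hypothetical query vector

    for d in sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True):
        print(d, round(cosine(docs[d], query), 3))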

9.5 NON-CLASSICAL MODELS OF IR


• Non-classical IR models are based on principles other than similarity, probability, Boolean
operations, etc., on which classical retrieval models are based.

• Examples include information logic model, situation theory model, and interaction model.
• The information logic model is based on a special logic technique called logical imaging.
Retrieval is performed by making inferences from document to query.

• This is unlike classical models, where a search process is used. Unlike usual implication,
which is true in all cases except when the antecedent is true and the consequent is false, this
inference is uncertain.
• Hence, a measure of uncertainty is associated with this inference. The principle put forward
by van Rijsbergen is used to measure this uncertainty.

• This principle says: Given any two sentences x and y, a measure of the uncertainty of y → x
relative to a given data set is determined by the minimal extent to which one has to add
information to the data set in order to establish the truth of y → x.

• The situation theory model is also based on van Rijsbergen’s principle.


• Retrieval is considered as a flow of information from document to query.
• A structure called infon, denoted by ι, is used to describe the situation and to model
information flow. An infon represents an n-ary relation and its polarity.
• The polarity of an infon can be either 1 or 0, indicating that the infon carries either positive
or negative information.
• For example, the information in the sentence, Adil is serving a dish, is conveyed by an infon of the
form ⟨⟨ serving, Adil, dish; 1 ⟩⟩, where 1 is the polarity.

• A document d is considered relevant to a query q if it supports or entails it, written as:


• d ⊨ q
• But if d does not support q, it does not necessarily mean the document is irrelevant! Because:
It may use different words (e.g., synonyms, hyponyms).
• For example, "car" vs "automobile", "serve" vs "offer".
• This transformation (d → d′) is considered a flow of information between situations.
• The interaction IR model was first introduced in Dominich (1992, 1993) and van Rijsbergen (1996).
In this model, the documents are not isolated; instead, they are interconnected.
• The query interacts with the interconnected documents. Retrieval is conceived as a result of
this interaction. This view of interaction is taken from the concept of interaction as realized in
the Copenhagen interpretation of quantum mechanics.
• Artificial neural networks can be used to implement this model.
• Each document is modelled as a neuron, the document set as a whole forms a neural
network.
• The query is also modelled as a neuron and integrated into the network.
• This enables:
 Formation of new connections
 Modification of existing connections
 Interactive restructuring of relationships during retrieval
• Retrieval is based on the measure of interaction between the query and documents.
• The interaction score is used to rank or retrieve relevant documents.

9.6 ALTERNATIVE MODELS OF IR

9.6.1 CLUSTER MODEL


 The cluster model is an attempt to reduce the number of matches performed during retrieval.
 It is based on the cluster hypothesis: closely associated documents tend to be relevant to the same requests.
 This hypothesis suggests that closely associated documents are likely to be retrieved together.
 This means that by forming groups (classes or clusters) of related documents, the search time is
reduced considerably.
 Instead of matching the query with every document in the collection, it is matched with
representatives of each class, and only documents from a class whose representative is close to the
query are considered for individual matching.
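A rough sketch of cluster-based retrieval: documents are grouped in advance, the query is first compared with each cluster representative (here the centroid), and only the documents of the closest cluster are matched individually. The clusters, vectors, and query are all assumptions.

    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # hypothetical pre-computed clusters of document vectors
    clusters = {
        "c1": {"d1": np.array([2.0, 2.0, 1.0]), "d2": np.array([1.0, 2.0, 0.0])},
        "c2": {"d3": np.array([0.0, 1.0, 5.0]), "d4": np.array([0.0, 0.0, 4.0])},
    }
    centroids = {c: np.mean(list(ds.values()), axis=0) for c, ds in clusters.items()}

    query = np.array([1.0, 1.0, 0.0])          # hypothetical query vector
    best = max(centroids, key=lambda c: cosine(centroids[c], query))
    # match the query only against the documents of the closest cluster
    ranking = sorted(clusters[best], key=lambda d: cosine(clusters[best][d], query),
                     reverse=True)
    print(best, ranking)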

9.6.2 FUZZY MODEL

• In the fuzzy model, the document is represented as a fuzzy set of terms, i.e., a set of pairs
[tᵢ, μ(tᵢ)], where μ is the membership function.
• The membership function assigns to each term of the document a numeric membership degree.

• The membership degree expresses the significance of a term to the information contained in the
document.

• Usually, the significance values (weights) are assigned based on the number of occurrences of
the term in the document and in the entire document collection, as discussed earlier.

• Each document in the collection


D={d1,d2,...,dj,...,dn}

can thus be represented as a vector of term weights, as in the vector space model:
(w₁ⱼ, w₂ⱼ, w₃ⱼ, ..., wᵢⱼ, ..., wₘⱼ)ᵗ

where wij is the degree to which term ti belongs to document dj.


 Each term in the document is considered a representative of a subject area and wij is the
membership function of document dj to the subject area represented by term ti.
 Each term tᵢ is itself represented by a fuzzy set fᵢ in the domain of documents, given by
fᵢ = {(dⱼ, wᵢⱼ) | j = 1, ..., n}, for i = 1, ..., m.
 This weighted representation makes it possible to rank the retrieved documents in decreasing
order of their relevance to the user’s query.
 Queries are Boolean queries. For each term that appears in the query, a set of documents is
retrieved. Fuzzy set operators are then applied to obtain the desired result.
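A small sketch of fuzzy retrieval: each term is a fuzzy set over the documents with membership degrees wᵢⱼ, and the Boolean operators of the query are evaluated with the standard fuzzy operators (min for AND, max for OR). The membership degrees are assumed values.

    # membership degree of each document in the fuzzy set of each term (assumed values)
    term_sets = {
        "information": {"d1": 0.8, "d2": 0.6, "d3": 0.0},
        "retrieval":   {"d1": 0.7, "d2": 0.0, "d3": 0.9},
    }
    docs = ["d1", "d2", "d3"]

    def fuzzy_and(t1, t2):
        """Fuzzy intersection: membership is the minimum of the two degrees."""
        return {d: min(term_sets[t1].get(d, 0.0), term_sets[t2].get(d, 0.0)) for d in docs}

    result = fuzzy_and("information", "retrieval")
    print(sorted(result.items(), key=lambda kv: kv[1], reverse=True))
    # d1 is ranked first with degree 0.7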

9.6.3 LATENT SEMANTIC INDEXING MODEL

Latent Semantic Indexing (LSI) is an information retrieval method based on the mathematical
technique of Singular Value Decomposition (SVD). The central idea is that there is an underlying
"latent" or hidden semantic structure in the usage of words across a collection of documents.

 Traditional retrieval methods face issues with synonyms and polysemy (one word having
multiple meanings).
 LSI helps overcome these issues by identifying deeper semantic associations between words
and documents, beyond simple term matching.

LSI involves several key steps:


Step 1: Construct the Term-Document Matrix

Documents and terms form a matrix with rows representing terms and columns representing
documents.

Entries usually represent term frequency (TF), TF-IDF, or some weighted value indicating
importance.

The document collection is first processed to get an m×n term-by-document matrix, W, where m is
the number of index terms and n is the total number of documents in the collection. Columns in this
matrix represent document vectors, whereas the rows denote term vectors. The matrix element Wij
represents the weight of the term i in document j. The weight may be assigned based on term
frequency or some combination of local and global weighting, as in the case of vector space model.

Singular value decomposition (SVD) of the term-by-document matrix is then computed. Using
SVD, the matrix is represented as a product of three matrices:

    W = T S Dᵗ

T and D are orthogonal matrices containing the left and right singular vectors of W. S is a diagonal
matrix, containing singular values stored in decreasing order. We eliminate small singular values and
approximate the original term-by-document matrix using truncated SVD. For example, by
considering only the first k largest singular values, along with their corresponding
columns in T and D, we get the following approximation of the original term-by-document matrix in
a space of k orthogonal dimensions, where k is sufficiently less than n:

    W ≈ Wₖ = Tₖ Sₖ Dₖᵗ

Retrieval is performed by computing the similarity between query vector and document vector. For
example, we can use the cosine similarity measure to rank documents to perform retrieval. In a
keyword-based retrieval, relevant documents that do not share any term with the query are not
retrieved. The LSI-based approach is capable of retrieving such documents, as similarity is computed
based on the overall pattern of term usage across the document collection rather than on term overlap.
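A compact numpy sketch of LSI using the term-by-document matrix of Example 9.2: compute the SVD, truncate it to k dimensions, and rank documents by cosine similarity to a query folded into the reduced space. The value of k and the query vector are assumptions.

    import numpy as np

    W = np.array([[2.0, 1.0, 0.0],     # rows: terms, columns: documents
                  [2.0, 0.0, 1.0],
                  [1.0, 1.0, 1.0]])

    T, s, Dt = np.linalg.svd(W, full_matrices=False)   # W = T S D^t
    k = 2                                              # keep the k largest singular values
    T_k, S_k, D_k = T[:, :k], np.diag(s[:k]), Dt[:k, :].T

    doc_vecs = D_k @ S_k                               # documents in the reduced space
    q = np.array([1.0, 1.0, 0.0])                      # hypothetical query vector
    q_k = q @ T_k @ np.linalg.inv(S_k)                 # fold the query into the same space

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    scores = [cosine(dv, q_k) for dv in doc_vecs]
    print(np.argsort(scores)[::-1])                    # document indices, best first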

9.7 MAJOR ISSUES IN INFORMATION RETRIEVAL


Several major issues are:

Relevance Ranking: Determining which documents are most relevant to a query remains challenging
due to variations in language and meaning.

Query Understanding: Users often input vague or ambiguous queries, making it difficult for the
system to interpret intent accurately.

Scalability: Handling and searching through massive volumes of data efficiently requires high
computational resources and optimized indexing.

Synonymy & Polysemy: Words with multiple meanings (polysemy) and different words with the same
meaning (synonymy) create confusion in matching queries with documents.

User Personalization: Tailoring results based on user preferences and history without breaching
privacy is complex.

Multilingual Retrieval: Supporting queries and documents in multiple languages adds another layer
of complexity.
***CHAPTER ENDS**

CHAPTER-12
LEXICAL RESOURCES
12.1 INTRODUCTION
 A whole range of tools and lexical resources have been developed to ease the task of researchers
working with natural language processing (NLP).
 Many of these are open source, i.e., readers can download them from the Internet.
 This chapter introduces some of the freely available resources.
 The motivation behind including this chapter comes from the belief that knowing where the
information is, is half of the information.

12.2 WORDNET

 WordNet¹ (Miller 1990, 1995) is a large lexical database for the English language.

 Inspired by psycholinguistic theories, it was developed and is being maintained at the Cognitive
Science Laboratory, Princeton University, under the direction of George A. Miller.

 WordNet consists of three databases — one for nouns, one for verbs, and one for both adjectives
and adverbs.
 Information is organized into sets of synonymous words, called synsets, each representing one
base concept.
 The synsets are linked to each other by means of lexical and semantic relations.
 Lexical relations occur between word-forms (i.e., senses) and semantic relations between word
meanings.
 These relations include synonymy, hypernymy/hyponymy, antonymy, meronymy/holonymy,
troponymy, etc.
 A word may appear in more than one synset and in more than one part-of-speech.
 The meaning of a word is called sense. WordNet lists all senses of a word, each sense
belonging to a different synset.
 WordNet’s sense-entries consist of a set of synonyms and a gloss.
 A gloss consists of a dictionary-style definition and examples demonstrating the use of a
synset in a sentence.
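A minimal sketch of querying WordNet through NLTK (the wordnet corpus is assumed to have been downloaded, e.g. with nltk.download('wordnet')); it prints each synset of a word together with its gloss, synonyms, and hypernyms.

    from nltk.corpus import wordnet as wn

    for synset in wn.synsets("bank"):
        print(synset.name(), "-", synset.definition())
        print("  synonyms :", synset.lemma_names())
        print("  hypernyms:", [h.name() for h in synset.hypernyms()])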

 WordNet is freely and publicly available for download from


http://wordnet.princeton.edu/obtain.

 WordNets for other languages have also been developed, e.g., EuroWordNet and Hindi
WordNet. EuroWordNet covers European languages, including English, Dutch, Spanish,
Italian, German, French, Czech, and Estonian. Other than language internal relations, it also
contains multilingual relations from each WordNet to English meanings.

 Hindi WordNet has been developed by CFILT (Resource Center for Indian Language
Technology Solutions), IIT Bombay. Its database consists of more than 26,208 synsets and
56,928 Hindi words. It is organized using the same principles as English WordNet but includes
some Hindi-specific relations (e.g., causative relations). A total of 16 relations have been used
in Hindi WordNet. Each entry consists of synset, gloss, and position of synset in ontology.
Figure 12.7 shows the Hindi WordNet entry for the word "आकांक्षा" (aakanksha).

 APPLICATIONS:

Concept Identification in Natural Language


• WordNet can be used to identify the concepts pertaining to a term, so as to capture the full
semantic richness and complexity of a given information need.
Word Sense Disambiguation
• WordNet combines features of a number of the other resources commonly used in
disambiguation work.
• It offers sense definitions of words, identifies synsets of synonyms, defines a number
of semantic relations and is freely available.
• This makes it the (currently) best known and most utilized resource for word sense
disambiguation.

Automatic Query Expansion


• WordNet’s semantic relations (synonyms, hypernyms, hyponyms) allow expanding
search queries beyond exact word matches, improving document retrieval
effectiveness.
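As an illustration of WordNet-based automatic query expansion, the sketch below adds the WordNet synonyms of every query term to the query (NLTK's WordNet interface again assumed).

    from nltk.corpus import wordnet as wn

    def expand_query(terms):
        """Add the WordNet synonyms of every query term to the query."""
        expanded = set(terms)
        for term in terms:
            for synset in wn.synsets(term):
                expanded.update(l.replace("_", " ") for l in synset.lemma_names())
        return expanded

    print(expand_query(["car", "retrieval"]))   # includes e.g. 'automobile'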

Document Structuring and Categorization

 WordNet’s semantic knowledge helps in building conceptual representations for


text categorization

Document Summarization

 WordNet has been used for creating lexical chains to aid in automatic text
summarization.

12.3 FRAMENET
 FrameNet is a large database of semantically annotated English sentences. It is based on
principles of frame semantics.
 FrameNet defines a set of semantic roles called frame elements.
 Sentences from the British National Corpus are tagged with these frame elements.
 The core idea: each word evokes a particular situation involving certain participants.
 FrameNet captures these situations using case-frame representations of words (verbs, adjectives,
nouns).
 The word that evokes a frame is called the target word or predicate.
 Participants (entities) are called frame elements (like roles).
 The FrameNet ontology is a semantic-level model of predicate-argument structure.
 Each frame includes:
 A main lexical item (predicate)
 Frame-specific semantic roles (example: in the "ARREST" frame, roles are AUTHORITIES,
TIME, and SUSPECT).
 For example:
 In sentence (12.1), the word 'nab' is the target word (a verb) in the ARREST frame, linked with
roles like AUTHORITIES and SUSPECT.
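NLTK also ships a FrameNet reader (corpus framenet_v17, downloaded separately); a minimal sketch of looking up the Arrest frame and its frame elements:

    from nltk.corpus import framenet as fn

    frame = fn.frame("Arrest")                 # frame evoked by verbs such as 'nab'
    print(frame.name)
    print(sorted(frame.FE.keys()))             # frame elements (semantic roles)
    print(sorted(frame.lexUnit.keys())[:10])   # some lexical units that evoke the frame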

APPLICATIONS:

 Gildea and Jurafsky (2002) and Kwon et al. (2004) used FrameNet data for automatic
semantic parsing.
 The shallow semantic roles from FrameNet help in information extraction.
 Example: Even if syntactic roles differ, semantic roles can identify that the theme
remains the same.
 Sentences:
 "The umpire stopped the match." (12.4)
 "The match stopped due to bad weather." (12.5)
 In (12.4), ‘match’ is the object, while in (12.5), it is the subject — but the semantic role
(theme) is the same.

 Semantic roles also help in Question Answering (QA) systems.


 Example: verbs like 'send' and 'receive' share semantic roles like SENDER,
RECIPIENT, GOODS, under a TRANSFER frame (Gildea and Jurafsky 2002).
 This enables QA systems to answer questions like:
 “Who sent the packet to Khushbu?” based on sentences like:
 → "Khushbu received a packet from the examination cell." (12.6)

12.4 STEMMERS

Stemming, often called conflation, is the process of reducing inflected (or sometimes derived) words
to their base or root form. The stem need not be identical to the morphological base of the word; it is
usually sufficient that related words map to the same stem, even if this stem is not in itself a valid
root.

Stemming is useful in search engines for query expansion or indexing and other NLP problems.
Stemming programs are commonly referred to as stemmers. The most common algorithm for
stemming English is Porter's algorithm (Porter 1980).

Other existing stemmers include:



 Lovins stemmer (Lovins 1968)


 A more recent one called the Paice/Husk stemmer (Paice 1990).

Figure 12.10 shows a sample text and output produced using these stemmers.

12.4.1 Stemmers for European Languages

 There are many stemmers available for English and other languages. Snowball presents
stemmers for English, Russian, and a number of other European languages, including French,
Spanish, Portuguese, Hungarian, Italian, German, Dutch, Swedish, Norwegian, Danish, and
Finnish. The links for stemming algorithms for these languages can be found at:
http://snowball.tartarus.org/texts/stemmersoverview.html
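The Snowball stemmers are also wrapped by NLTK; a brief sketch for a few of the languages listed above (outputs not asserted here):

    from nltk.stem.snowball import SnowballStemmer

    print(SnowballStemmer.languages)           # languages supported by the wrapper
    for lang, word in [("english", "retrieval"), ("french", "continuation"),
                       ("german", "informationen")]:
        print(lang, "->", SnowballStemmer(lang).stem(word))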

12.4.2 Stemmers for Indian Languages

 Standard stemmers are not yet available for Hindi and other Indian languages. The major
research on Hindi stemming has been accomplished by Ramanathan and Rao (2003) and
Majumder et al. (2007). Ramanathan and Rao based their work on handcrafted suffix lists.
Majumder et al. used a cluster-based approach to find root words and their morphological
variants.
 They evaluated their approach and found that task-based stemming improves recall for
Indian languages, as shown in their observation on Bengali data. The Resource Centre of
Indian Language Technology (CFILT), IIT Bombay has also developed stemmers for
Indian languages, available at:
http://www.cfilt.iitb.ac.in

12.4.3 Stemming Applications

 Stemmers are common in search and retrieval systems like web search engines. Stemming
reduces word variants to a common base form, minimizing index size and helping retrieve
related documents.
 Example: A search for "astronauts" also retrieves "astronaut".
This boosts coverage, but sometimes reduces precision, especially in English queries.
 Stemming is also used in text summarization and categorization, where it helps in
identifying features by reducing various morphological forms of words to their stems.

12.5 PART OF SPEECH TAGGER

Part-of-speech tagging is used at an early stage of text processing in many NLP applications such as
speech synthesis, machine translation, IR (Information Retrieval), and information extraction.

In IR, POS tagging helps in:

 Indexing (e.g., identifying useful tokens like nouns),


 Extracting phrases,
 Disambiguating word senses.

This section presents various POS taggers.
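Before looking at the individual taggers, here is a quick sketch of off-the-shelf tagging with NLTK's default tagger (the punkt tokenizer and the averaged perceptron tagger model are assumed to have been downloaded); the tools described in the following subsections are separate, stand-alone taggers.

    import nltk

    sentence = "The retrieval system responds by retrieving relevant documents."
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
    # e.g. [('The', 'DT'), ('retrieval', 'NN'), ('system', 'NN'), ('responds', 'VBZ'), ...]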

12.5.1 Stanford Log-linear Part-of-Speech (POS) Tagger

This POS Tagger is based on maximum entropy Markov models.


Key feature:

I. It makes explicit use of both the preceding and following tag contexts via a dependency
network representation.
II. It uses a broad range of lexical features.
III. It utilizes priors in conditional log-linear models.
The reported accuracy of this tagger on the Penn Treebank WSJ is 97.24%, with an error
reduction of 4.4% over the best previous result (Toutanova et al. 2003).
Details: http://nlp.stanford.edu/software/tagger.shtml

12.5.2 A Part-of-Speech Tagger for English

 Uses a bi-directional inference algorithm for POS tagging.


 Based on Maximum Entropy Markov Models (MEMM).
 Enumerates all decomposition structures and selects the one with highest probability.
 Bi-directional MEMMs outperform unidirectional models and are competitive with
advanced methods like SVMs (Tsuruoka and Tsujii, 2005).


12.5.3 TnT Tagger

 TnT (Trigrams’n’Tags) by Brants (2000)


 Based on Hidden Markov Models (HMM)
 Efficient and statistical POS tagger
 Uses smoothing and optimization techniques for unknown words
 Performs as well as maximum entropy models

Table 12.1 shows an example:

12.5.4 Brill Tagger

Brill (1992) introduced a trainable rule-based tagger that performs comparably to stochastic
taggers.

 It uses transformation-based learning to induce rules.


 Can handle unknown words well using rules.
 Can be extended to k-best tagging, assigning multiple tags in uncertain cases.
 Download link: http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z

12.5.5 CLAWS Part-of-Speech Tagger for English

CLAWS (Constituent Likelihood Automatic Word-tagging System)

 One of the earliest probabilistic taggers, developed at the University of Lancaster


(http://ucrel.lancs.ac.uk/claws)
 CLAWS4: hybrid of probabilistic and rule-based elements
 Can adapt to varied input formats and text types
 Achieved 96–97% accuracy
 Accuracy depends on the type of input text
 Major references: Garside, Leech, Bryant (1994), Garside & Smith (1997)

12.5.6 Tree-Tagger

Tree-Tagger (Schmid 1994) is a probabilistic POS tagger

 Uses decision tree instead of Markov models to estimate probabilities.


 Automatically chooses the best context size.
 Achieved over 96% accuracy on Penn Treebank WSJ corpus.
 Download link: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

12.5.7 ACOPOST: A Collection of POS Taggers

 ACOPOST is a collection of freely available POS taggers


 Written in C and based on different frameworks
 Includes four different taggers in one package

Maximum Entropy Tagger (MET)


Suggested by Ratnaparkhi (1997), this tagger:
 Uses an iterative method to improve parameters
 Employs contextual features to distinguish relevant tags

Trigram Tagger (T3)


 Based on Hidden Markov Models (HMM)
 States are tag pairs that emit words
 Technique by Rabiner (1990); implementation influenced by Brants (2000)

Error-driven Transformation-based Tagger (TBT)


 Proposed by Brill (1993)
 Learns transformation rules from annotated corpora
 Uses rules to change tags using contextual clues

Example-based Tagger (ET)


 Based on example-based or memory-based models
 Applies past experiences instead of learned rules
 Suggested for NLP by Daelemans et al. (1996)

12.5.8 POS Taggers for Indian Languages


Challenges:
 Lack of tools and large annotated corpora
Research & Development Sites:
 CDAC, IIT Bombay, IIIT Hyderabad, University of Hyderabad, CIIL Mysore,
University of Lancaster
IIT Bombay's Work:
 Morphology analyzers & POS taggers for Hindi and Marathi
 Based on bootstrapping using a small corpus, with rule-based + statistical methods

12.6 RESEARCH CORPORA



Research corpora have been developed for a number of NLP-related tasks. In the following sections,
we point out a few of the available standard document collections for a variety of NLP-related tasks,
along with their Internet links.

12.6.1 IR Test Collection

We have already provided a list of IR test document collections in Chapter 9. Glasgow University,
UK, maintains a list of freely available IR test collections. Table 12.2 lists the sources of those and a
few more IR test collections.

LETOR (learning to rank) is a package of benchmark data sets released by Microsoft Research Asia.
It consists of two datasets OHSUMED and TREC (TD2003 and TD2004). LETOR is packaged with
extracted features for each query-document pair in the collection, baseline results of several state-of-
the-art learning-to-rank algorithms on the data and evaluation tools. The data set is aimed at
supporting future research in the area of learning ranking function for information retrieval.

12.6.2 Summarization Data

Evaluating a text summarization system requires the existence of 'gold' summaries. DUC provides
document collections with known extracts and abstracts, which are used for evaluating performance
of summarization systems submitted at TREC conferences. Figure 12.11 shows a sample document
and its extract from DUC 2002 summarization data.

12.6.3 Word Sense Disambiguation

SEMCOR is a sense-tagged corpus used in disambiguation.

 It is a subset of the Brown corpus, sense-tagged with WordNet synsets.


 Open Mind Word Expert attempts to create a very large sense-tagged corpus by collecting
word sense tagging from the general public over the Web.

12.6.4 Asian Language Corpora

The multilingual EMILLE corpus is the result of the Enabling Minority Language Engineering
(EMILLE) project at Lancaster University, UK.

Key points:

 Focuses on data generation, software resources, and basic NLP tools for South Asian
languages.
 CIIL (Central Institute for Indian Languages), Indian partner in the project, extended
languages to include more Indian languages.
 Covers monolingual and parallel/annotated corpora for a variety of genres.
 EMILLE/CIIL corpus is freely available for research use at:
http://www.elda.org/catalogue/en/text/W0037.html
Manual: http://www.emille.lancs.ac.uk/manual.pdf

Corpus Details:

 Monolingual corpus: Written text for 14 South Asian languages and spoken text for 5
languages (Hindi, Bengali, Gujarati, Punjabi, Urdu).
 Spoken data: Derived from BBC Asia Network broadcasts.
 Parallel corpus: English text + translations in 5 languages; includes UK government
leaflets.
 Annotated corpus includes:
o POS tagging
o Urdu translation
o Hindi corpus annotated for demonstrative use

******CHAPTER ENDS*****

*****END OF MODULE-4 *****
