Unit 5
Prepared By
Dr. Shikha Pandey
UNIT – V (CO5)
Information retrieval and lexical resources: Information Retrieval: Design features of
Information Retrieval Systems, Classical, Non-classical, Alternative Models of
Information Retrieval, Evaluation
Lexical Resources: WordNet, FrameNet, Stemmers, POS Tagger
Information retrieval (IR) systems are integral to various applications, such as search engines, recommendation
systems, document management systems, and chatbots. The primary goal of an IR system is to
bridge the gap between the user’s information needs and the available data by providing timely
and accurate results.
Indexing: It creates an organized structure that maps terms (words or phrases) to the documents
in which they appear. This structure allows for efficient lookup and retrieval of records based on
specific terms.
Query Processing: The system analyses and processes user queries to identify the most relevant
terms and concepts. This often involves techniques to handle synonymy (different words with the
same meaning) and polysemy (a word with multiple meanings).
Relevance Ranking: Documents retrieved from the index are ranked based on their perceived
relevance to the user’s query. Various ranking algorithms, such as TF-IDF (Term Frequency-
Inverse Document Frequency) and BM25, are used to determine the order in which documents
are presented to the user.
User Interaction and Feedback: Some IR systems learn from user interactions to improve their
performance over time. For instance, if a user clicks on a particular search result, the system
might infer that similar results are likely to be relevant.
Information Presentation: The retrieved documents are typically presented to the user with
additional information, such as document snippets, titles, and links, to help users quickly assess
the relevance of each result.
Query Expansion: This technique automatically enhances user queries with additional terms
related to the original query. By accounting for different ways of expressing the same idea, it can
help retrieve more relevant results.
Relevance
The foremost objective of an IR system is to retrieve information directly relevant to the user’s query. This
means the system should consider exact keyword matches, understand the user’s intent, and provide
documents that address the user’s information needs.
Relevance ensures that users receive the most pertinent information, which enhances their overall
satisfaction. By focusing on relevance, IR systems can significantly improve the quality of the search
results, making it easier for users to find the information they need quickly and efficiently.
Efficiency
IR systems aim to retrieve relevant documents quickly, even from large datasets. Speed and efficiency are
critical to providing a satisfactory user experience, especially when users expect rapid responses to their
queries.
An efficient IR system processes vast amounts of data in real-time, ensuring users do not experience delays.
This efficiency is achieved through advanced algorithms and optimised data structures that enable the
system to search and retrieve information rapidly, enhancing the overall user experience.
Ranking
Once relevant documents are retrieved, the IR system ranks them in order of perceived relevance. This
ranking helps users prioritise their focus on the most relevant documents and saves them time by not having
to sift through irrelevant results.
By presenting the most pertinent information first, the system lets users quickly find what they are looking for. Ranking
involves algorithms that consider factors such as keyword frequency, document popularity, and user
preferences, ensuring that the most helpful information appears at the top of the search results.
Accuracy
IR systems strive to minimise false positives (irrelevant documents retrieved) and false negatives (relevant
documents not retrieved). Accurate retrieval ensures that users receive trustworthy and appropriate
information.
An accurate IR system meticulously evaluates the relevance of documents, reducing the chances of
irrelevant details appearing in the search results. This accuracy is crucial for maintaining the credibility and
reliability of the IR system, as users depend on it to provide precise and valuable information.
Contextual Understanding
Beyond literal keyword matching, IR systems aim to comprehend the context and semantics of both user
queries and document content. This allows the system to provide results that align with the user’s intended
meaning.
Contextual understanding involves analysing the relationships between words and phrases within the query
and documents, ensuring that the search results are relevant and contextually appropriate. This deep
understanding of language nuances significantly enhances the accuracy and relevance of the information
retrieved.
User Interaction
Many modern IR systems incorporate user interactions and feedback to improve future retrieval results. By
learning from user behaviour and preferences, the system becomes better at refining its results over time.
User interaction allows the IR system to adapt to individual user needs, making the search process more
personalised and effective. Feedback mechanisms such as clicks, ratings, and comments provide valuable
insights into user preferences, enabling the system to improve continuously and deliver more accurate and
relevant search results.
Personalisation
In some cases, IR systems personalise results based on user profiles, preferences, and historical interactions.
This ensures that users receive information most relevant to their needs. Personalisation involves tailoring
the search results to match each user’s unique interests and requirements.
By considering factors such as search history, demographic information, and individual preferences, the IR
system can deliver a more customised and satisfying search experience, increasing user engagement and
satisfaction.
Diversity of Results
While relevance is crucial, IR systems also aim to provide diverse results. This prevents the system from
returning multiple highly similar documents and instead offers a well-rounded view of the topic.
Diversity ensures that users are exposed to various perspectives and information sources, enriching their
understanding of the subject matter. By incorporating diverse results, the IR system can cater to user needs
and preferences, providing a more comprehensive and balanced search experience.
Adaptability
IR systems need to adapt to changes in data and user behaviour. As new documents are added and user
preferences evolve, the system should continue to provide accurate and relevant results.
Adaptability involves continuously updating the system’s algorithms and data structures to accommodate
new information and changing user behaviours. This ensures that the IR system remains effective and
reliable over time, consistently delivering high-quality search results regardless of the dynamic nature of
the data and user expectations.
Handling Complex Queries
The system should handle complex queries involving multiple concepts, logical operators, and facets. It
should understand and interpret these queries accurately to provide meaningful results. Supporting complex
queries requires sophisticated algorithms capable of parsing and processing intricate search expressions.
By accurately interpreting and addressing complex queries, the IR system can meet users’ diverse and
specific information needs, ensuring that even the most detailed and nuanced queries yield accurate and
relevant results. This capability enhances the system’s utility and versatility, making it a valuable tool for
users with varied and complex search requirements.
The Information Retrieval (IR) process involves a series of steps that collectively aim to retrieve relevant
information from a collection of data or documents based on user queries. This process goes beyond simple
keyword matching and employs various techniques to understand user intent, index documents, and rank
their relevance.
First, we gather the documents or data on which the IR system will operate. This initial step involves collecting
vast amounts of raw data from various sources, such as databases, web pages, or text documents.
After gathering the data, we preprocess it by cleaning and tokenising it, breaking it into individual words
or phrases. This step also involves removing unnecessary elements like stopwords (common words like
“the” or “and”) and punctuation.
Optionally, we apply techniques like stemming or lemmatisation to reduce words to their root forms,
ensuring consistency and improving search accuracy.
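The preprocessing steps above can be sketched as follows. This is a minimal illustration, not a production tokeniser: the stop list `STOP_WORDS` and the function names are assumptions made for the example, and stemming is left out here.

```python
import re

# Hypothetical, deliberately tiny stop list; real systems use longer lists.
STOP_WORDS = {"the", "and", "a", "an", "of", "in", "on", "for", "at", "is", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (drops punctuation)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def preprocess(text, stop_words=STOP_WORDS):
    """Tokenise the text and remove stop words."""
    return [tok for tok in tokenize(text) if tok not in stop_words]

tokens = preprocess("The cat sat on the mat, and the dog barked.")
print(tokens)  # ['cat', 'sat', 'mat', 'dog', 'barked']
```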
Indexing
In the indexing phase, we create a data structure that maps terms (words or phrases) to the documents in which they
appear. This index allows for efficient lookup and retrieval of documents containing specific terms. We can
facilitate fast and accurate retrieval by using data structures like inverted indexes.
The inverted index is particularly effective because it stores a list of documents for each term, making it
quick to find all documents containing a particular word or phrase. This step ensures the IR system can
quickly and accurately respond to user queries.
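As a rough sketch of the indexing phase, the snippet below builds a small inverted index from a toy collection. The documents, IDs, and the function name `build_inverted_index` are made up for illustration, and tokenisation is simplified to whitespace splitting.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Toy collection (hypothetical documents).
docs = {
    1: "budget smartphones with good cameras",
    2: "premium smartphones and tablets",
    3: "budget laptops for students",
}
index = build_inverted_index(docs)
print(index["budget"])       # [1, 3]
print(index["smartphones"])  # [1, 2]
```

Looking up a term is a single dictionary access, which is why the lookup cost does not grow with the number of documents that do not contain the term.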
Query Processing
When a user submits a query, the system processes it to identify relevant terms and concepts. This step
involves analysing the query to understand the user’s intent and determine the most important words or
phrases. We also handle query expansion, adding additional terms related to the user’s query to enhance
retrieval accuracy.
For example, if a user searches for “cars,” we might also consider related terms like “automobiles” or
“vehicles.” Additionally, we address synonymy (different words with similar meanings) and polysemy
(words with multiple meanings) to ensure we capture the user’s intended meaning.
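A minimal sketch of query expansion is shown below. The synonym table `SYNONYMS` is a hand-written assumption for the example; a real system might draw related terms from a lexical resource or from query logs instead.

```python
# Hypothetical hand-written synonym table used only for this example.
SYNONYMS = {
    "cars": ["automobiles", "vehicles"],
    "phones": ["smartphones", "mobiles"],
}

def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms followed by related terms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query("cheap cars"))  # ['cheap', 'cars', 'automobiles', 'vehicles']
```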
Relevance Ranking
After identifying the relevant documents, we calculate a relevance score for each document using ranking
algorithms such as TF-IDF (Term Frequency-Inverse Document Frequency) or BM25. These algorithms
consider various factors, such as the frequency of query terms in the document and the overall importance
of the terms.
Documents with higher relevance scores are ranked higher and presented to the user first. This ranking
process ensures that the most pertinent and useful documents appear at the top of the search results,
enhancing the user’s search experience.
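To make the scoring step concrete, here is a small sketch of BM25 ranking. It uses one common variant of the formula (with a non-negative IDF); the parameter defaults `k1=1.5` and `b=0.75` and the toy documents are assumptions for the example, not values taken from these notes.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised document against the query with a common BM25 variant."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates and is normalised by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "best budget smartphones compared".split(),
    "budget travel tips".split(),
    "gardening for beginners".split(),
]
scores = bm25_scores(["budget", "smartphones"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 1, 2]
```

The first document matches both query terms and is ranked first; the third matches neither and scores zero.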
Presentation of Results
We then display the retrieved documents to the user in a user-friendly format. This presentation typically
includes document titles, text snippets matching the query, and links to the full documents.
Additional information, such as publication dates, authors, and metadata, helps users assess the relevance
of each result. By providing a clear and informative presentation, we help users quickly determine which
documents will most likely meet their needs and encourage further exploration of the search results.
User Interaction and Feedback
User interaction with the presented results provides valuable feedback for improving the IR system. By
observing user actions, such as clicks and the amount of time spent on each document, we gather insights
into the relevance of the retrieved documents.
This feedback loop allows us to refine the ranking algorithms and improve future retrieval results.
Incorporating user feedback is essential for adapting to users’ changing needs and preferences, ensuring the
IR system remains practical and relevant over time.
Iterative Querying
Users often refine their queries based on the initial results. They may modify keywords, add filters, or
change their search terms to narrow their search and improve the relevance of the retrieved documents.
Each iteration helps the user get closer to finding the information they need. This iterative querying process
is a critical component of the IR system, as it allows users to explore different aspects of their search topic
and progressively improve their search results.
Finally, the IR system must continuously learn and adapt to remain effective. We update the index and
ranking algorithms as new documents are added to the collection. We also adapt to changes in user behaviour and
preferences, ensuring the system remains accurate and relevant.
By continuously learning from user interactions and updating the system accordingly, we can maintain
high-quality search results and provide a better user experience.
The Information Retrieval process is dynamic and multifaceted. It aims to efficiently provide users with the
most relevant information. By following these steps, IR systems can effectively meet user needs and adapt
to the ever-changing information landscape.
For a query about budget smartphones, for example, the IR system evaluates the relevance of the documents, ensuring that the articles it retrieves discuss
affordable smartphones with good features. This means it looks for content where the term “budget” is
associated with “smartphones” and where these devices are rated highly for their value.
Additionally, the IR system considers the user’s intent behind the search. It understands the user wants to
find the best options within a specific price range. As a result, it prioritises articles that compare different
budget smartphones, reviews that highlight their features, and lists that recommend top choices.
The IR system ensures a more satisfying and accurate search experience by aligning the search results with
the user’s intent. This example demonstrates how IR systems go beyond simple keyword matching,
employing sophisticated algorithms to deliver relevant and helpful information tailored to the user’s needs.
Information Retrieval and Information Extraction in AI
Information Retrieval (IR) and Information Extraction (IE) are two fundamental pillars of AI’s language
understanding capabilities. IR focuses on fetching relevant information from vast datasets. When users
enter a query, IR systems scan large data collections, such as documents, databases, and websites, to find
the most pertinent information.
This process involves indexing, ranking, and retrieving documents based on their relevance to the query.
Effective IR systems, like search engines, ensure users receive accurate and helpful information quickly,
enhancing their ability to find what they need from extensive data sources.
In contrast, Information Extraction identifies and extracts structured information from unstructured text. IE
systems analyse text to identify specific pieces of information, such as names, dates, locations, and
relationships. This structured data can then be organised into databases or knowledge graphs, significantly
contributing to AI’s knowledge base.
For instance, an IE system might process news articles to extract data about events, people involved, and
their connections, transforming raw text into actionable insights. This capability is crucial for tasks like
automated summarization, question answering, and content recommendation.
Together, IR and IE enable AI systems to understand and utilize human language more effectively, driving
advancements in natural language processing and contributing to the development of intelligent
applications.
Inverted Index
An inverted index is an index data structure storing a mapping from content (content can be words or
numbers) to its locations in a document or a set of documents.
We can also say the inverted index is a hashmap-like data structure that directs the user from a word to a
document or a web page. The inverted index is also the primary data structure of most information retrieval
systems.
For every word, an inverted index lists all the documents that contain it and the frequency of its
occurrences in each document, which makes it easy to find the hits for a query word.
Types of inverted indexes: There are two main types, the record-level inverted index and the word-level inverted index.
A record-level inverted index contains a list of references to documents for each word.
A word-level inverted index additionally contains the positions of each word within a document. This form
offers more functionality, such as phrase search, but needs more processing power and space to be
created.
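The difference between the two types can be sketched as below. The example builds a word-level (positional) index and derives a record-level view from it; the documents and the function name `build_positional_index` are illustrative assumptions.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}; len(positions) is the term frequency."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)

docs = {1: "to be or not to be", 2: "to do is to be"}
word_level = build_positional_index(docs)

# A record-level index keeps only the document IDs for each term.
record_level = {term: sorted(postings) for term, postings in word_level.items()}

print(word_level["be"])    # {1: [1, 5], 2: [4]}
print(record_level["be"])  # [1, 2]
```

The word-level index stores strictly more information per term, which is where the extra space and build cost come from.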
Advantages of inverted index: The main utility of inverted index is that it allows fast full-text searches at
the cost of increased processing when a document is added to the database.
The inverted index is also the most popular data structure in large-scale document retrieval systems, for
example, search engines.
Disadvantage of inverted index: An inverted index has a large storage overhead and high maintenance costs
for update, delete, and insert operations.
Stop words are high-frequency words that are deemed unlikely to be useful for searching inside the
documents of the information retrieval system.
Words in the corpus that carry little semantic weight are kept in a list called a stop list.
Example: Articles like a, an, the, and prepositions like in, of, for, at, etc. are examples of stop words.
Size reduction of the inverted index using stop list: One main advantage of eliminating stop words is that the size
of the inverted index can be significantly reduced.
Under Zipf’s law, a small number of very frequent words accounts for a large share of all word occurrences, so a
stop list covering a few dozen words can reduce the size of the inverted index by almost half.
One disadvantage of stop word elimination is that it may sometimes remove a term that
is useful for searching.
Example: If we eliminate the letter A from “Vitamin A”, the term loses its significance.
Stemming
Stemming is the heuristic process of extracting the base form of words by chopping off the ends of words.
It reduces the morphological variants of a word to a common root or base form. Stemming programs are
commonly referred to as stemming algorithms or stemmers. Stemming is one of the important steps in
information retrieval systems like search engines.
For example, the words laughing, laughs, and laughed would all be stemmed to the root word laugh.
Usage of stemming in information retrieval system with an example: If we want to search for the word
chocolate in a collection of documents, we want to see all the documents that have information about
chocolate.
Many documents may contain the words chocolates, chocolatey, or choco
instead of chocolate.
To relate these variants, we can reduce them to a common stem so that we can
retrieve all the documents containing the base word no matter how it is written across documents.
The stem need not be a dictionary word, and a clipped form such as choco may not be conflated by suffix stripping alone.
There are many standard tools for performing this reduction, such as the Porter
stemmer, the Snowball stemmer, and the Lancaster stemmer.
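The idea of chopping off word endings can be illustrated with a toy suffix stripper. This sketch is not the Porter, Snowball, or Lancaster algorithm; the suffix list and the minimum stem length are assumptions chosen only to keep the example short.

```python
# Toy suffix list for illustration only; real stemmers apply ordered rule sets.
SUFFIXES = ("ing", "ed", "es", "s")

def simple_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

print([simple_stem(w) for w in ["laughing", "laughs", "laughed", "laugh"]])
# ['laugh', 'laugh', 'laugh', 'laugh']
```

Such crude rules also show why stemming is heuristic: this sketch turns "chocolates" into "chocolat" but leaves "chocolate" unchanged, so the two forms are not conflated. Established stemmers use more careful rules to reduce errors of this kind.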
Classical models focus primarily on the mathematical representation of documents and queries,
matching them to retrieve relevant results.
a. Boolean Model
Key Features:
o Based on set theory.
o Queries are expressed as Boolean expressions using AND, OR, and NOT operators.
o Documents are either relevant or not relevant (binary retrieval).
Advantages:
o Simple to implement and understand.
o Gives precise control over the query through the logical operators.
Limitations:
o Does not handle partial matching or ranking of results.
o The query formulation can be complex for users.
o Cannot retrieve documents based on term frequency or document relevance scores.
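Because the Boolean model is based on set theory, its operators map directly onto set operations over an inverted index. The following is a minimal sketch with a made-up three-document collection; the helper name `build_index` is an assumption.

```python
def build_index(docs):
    """Map each term to the set of document IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {
    1: "budget smartphones review",
    2: "premium smartphones review",
    3: "budget laptops review",
}
index = build_index(docs)

# budget AND smartphones: set intersection
both = index["budget"] & index["smartphones"]
# budget OR smartphones: set union
either = index["budget"] | index["smartphones"]
# review AND NOT budget: set difference
not_budget = index["review"] - index["budget"]

print(sorted(both), sorted(either), sorted(not_budget))  # [1] [1, 2, 3] [2]
```

Each result is an unordered set, which reflects the limitation noted above: a document either matches or it does not, and there is no ranking among the matches.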
b. Vector Space Model (VSM)
Key Features:
o Represents documents and queries as vectors in a high-dimensional space.
o Terms in documents are weighted, often using TF-IDF (Term Frequency-Inverse
Document Frequency).
o Relevance is determined by computing the cosine similarity between the query vector
and document vectors.
Advantages:
o Supports partial matching and can rank documents by relevance.
o Intuitive representation of document-query similarity.
Limitations:
o Assumes terms are independent (does not capture term correlations).
o High computational cost due to vector manipulation.
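The vector space model can be sketched with sparse dictionaries standing in for vectors. The snippet below uses raw term frequency times a natural-log IDF as the weighting, which is one of several common TF-IDF variants; the documents and function names are assumptions for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per tokenised document, plus the IDF table."""
    n_docs = len(docs)
    df = Counter(term for d in docs for term in set(d))
    idf = {term: math.log(n_docs / df[term]) for term in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(d).items()} for d in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "budget smartphones review".split(),
    "budget laptops review".split(),
    "garden tools guide".split(),
]
doc_vecs, idf = tfidf_vectors(docs)
query = "budget smartphones".split()
query_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}

sims = [cosine(query_vec, dv) for dv in doc_vecs]
print([round(s, 2) for s in sims])
```

The second document shares only one query term and so receives a lower but non-zero similarity, which is the partial matching that the Boolean model lacks.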
c. Probabilistic Model
Key Features:
o Assumes there is a probability that a document is relevant to a given query.
o Models include Binary Independence Model (BIM), which estimates the probability that
a document is relevant given its terms and the query.
o Each term contributes to the probability of relevance.
Advantages:
o The probabilistic model inherently ranks documents.
o It can be updated with user feedback to improve retrieval over time.
Limitations:
o Requires prior knowledge of relevant documents to estimate the model’s parameters, which is
often not available.
o Limited by the independence assumption between terms.
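A hedged sketch of Binary Independence Model ranking is given below. It assumes the common simplification used when no relevance judgments are available: each query term is taken to be equally likely to be present or absent in a relevant document, and the non-relevant statistics are approximated by the whole collection with 0.5 smoothing. The documents and the function name `bim_rank` are made up for the example.

```python
import math

def bim_rank(query, docs):
    """Return a retrieval status value (RSV) per document under a simplified BIM."""
    n_docs = len(docs)
    doc_sets = [set(d) for d in docs]  # binary model: only presence or absence matters
    rsv = []
    for terms in doc_sets:
        score = 0.0
        for term in set(query):
            if term in terms:
                df = sum(1 for s in doc_sets if term in s)
                # Smoothed log-odds weight; each matching term adds its own contribution.
                score += math.log((n_docs - df + 0.5) / (df + 0.5))
        rsv.append(score)
    return rsv

docs = [
    "budget smartphones review".split(),
    "budget laptops".split(),
    "garden tools".split(),
    "travel guide".split(),
    "cooking recipes".split(),
]
rsv = bim_rank(["budget", "smartphones"], docs)
print([round(s, 3) for s in rsv])
```

With this weighting, a term that appears in more than half of the collection receives a negative weight, which is one reason practical systems often use adjusted variants.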
Fuzzy Model
Key Features:
o Represents relevance in degrees rather than binary terms (as in Boolean models).
o Each document has a degree of membership in the relevant set.
o Supports fuzzy logic for query processing, allowing partial matches and linguistic
uncertainty.
Advantages:
o Flexible and can handle ambiguous or imprecise queries.
o Provides a continuous spectrum of relevance instead of rigid binary retrieval.
Limitations:
o More complex to implement.
o Requires appropriate fuzzy membership functions, which may be difficult to define.
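The notion of degrees of membership can be sketched as follows. The membership values are invented for illustration, and min, max, and complement are used as the AND, OR, and NOT operators, which is one common choice among several.

```python
# Hypothetical degrees (0 to 1) to which each document is "about" a term.
membership = {
    "d1": {"budget": 0.9, "smartphones": 0.8},
    "d2": {"budget": 0.2, "smartphones": 0.7},
    "d3": {"budget": 0.6},
}

def mu(doc, term):
    """Membership degree of a document in the fuzzy set of a term (0.0 if absent)."""
    return membership[doc].get(term, 0.0)

def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

def fuzzy_not(a):
    return 1.0 - a

results = {}
for doc in membership:
    b, s = mu(doc, "budget"), mu(doc, "smartphones")
    results[doc] = (fuzzy_and(b, s), fuzzy_or(b, s))

print(results)
# {'d1': (0.8, 0.9), 'd2': (0.2, 0.7), 'd3': (0.0, 0.6)}
```

Unlike the Boolean model, the AND query here still distinguishes d1 (0.8) from d2 (0.2) instead of treating both as plain matches.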
Extended Boolean Model
Key Features:
o A hybrid of the Boolean model and Vector Space Model.
o Allows weighted terms in Boolean queries, providing a mechanism for partial matching.
o Uses proximity-based retrieval by calculating the distance between documents and
query terms.
Advantages:
o Balances the strictness of Boolean logic with the flexibility of vector space ranking.
Limitations:
o Still requires users to formulate complex queries.
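One way to realise the distance-based scoring described above is the p-norm formulation, sketched below under the assumption that term weights lie between 0 and 1. An OR query is scored by distance from the all-zeros point and an AND query by closeness to the all-ones point; the function names and the default `p=2.0` are choices made for the example.

```python
import math

def or_score(weights, p=2.0):
    """p-norm OR: normalised distance from the point where every term weight is 0."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)

def and_score(weights, p=2.0):
    """p-norm AND: one minus the normalised distance from the point where every weight is 1."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)

# A document that fully matches one of two query terms and misses the other.
partial = [1.0, 0.0]
print(round(or_score(partial), 3), round(and_score(partial), 3))  # 0.707 0.293
```

A strict Boolean AND would score this document 0 and a strict OR would score it 1; the p-norm scores fall in between, which gives the partial matching mentioned above.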
Bayesian Network Model
Key Features:
o Uses a Bayesian network to model the relationships between terms, documents, and
queries.
o Documents and queries are represented as nodes in a network, and relevance is
computed using probabilistic inference.
Advantages:
o Highly expressive model, capturing term relationships and uncertainty.
o Adaptable to various retrieval tasks and user preferences.
Limitations:
o Computationally expensive due to the complexity of the network.
o Difficult to implement and scale for large datasets.
Alternative models explore approaches that go beyond traditional models, often incorporating
more advanced concepts from machine learning, linguistics, and cognitive science.
Language Models
Key Features:
o Treats each document as a language model and computes the probability of generating
a query from that document.
o One popular method is the query likelihood model, where the goal is to rank
documents by how likely they would generate the query.
Advantages:
o Provides a strong theoretical framework and allows for a natural incorporation of
statistical techniques.
o Can model term dependencies and capture more complex patterns of term usage.
Limitations:
o Can suffer from data sparsity, where the model doesn’t have enough information to
build reliable language models for certain documents.
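The query likelihood idea can be sketched with unigram language models. To soften the data sparsity problem noted above, the example mixes each document model with a collection-wide model (Jelinek-Mercer smoothing); the mixing weight `lam=0.5`, the documents, and the function name are assumptions for illustration.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, coll_tf, coll_len, lam=0.5):
    """Log-probability of generating the query from a smoothed unigram document model."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        p_doc = tf[term] / len(doc)          # maximum-likelihood estimate from the document
        p_coll = coll_tf[term] / coll_len    # background estimate from the whole collection
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            return float("-inf")             # term never seen anywhere in the collection
        score += math.log(p)
    return score

docs = [
    "budget smartphones review".split(),
    "budget laptops review".split(),
    "garden tools guide".split(),
]
coll_tf = Counter(term for d in docs for term in d)
coll_len = sum(len(d) for d in docs)

query = ["budget", "smartphones"]
scores = [query_log_likelihood(query, d, coll_tf, coll_len) for d in docs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 1, 2]
```

Without the collection term, any document missing a single query word would get probability zero; the smoothing keeps such documents rankable.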
Latent Semantic Indexing (LSI)
Key Features:
o Uses singular value decomposition (SVD) to reduce the dimensionality of the term-
document matrix, capturing the latent structure in the data.
o Documents and queries are mapped to a latent semantic space, where synonyms and
related terms are grouped together.
Advantages:
o Improves recall by retrieving documents with similar meanings even if different words
are used (synonym handling).
o Reduces noise by focusing on the key latent concepts.
Limitations:
o High computational cost due to SVD.
o It may struggle with new terms not captured in the original matrix.
Neural IR Models
Key Features:
o Leverages deep learning techniques (e.g., word embeddings, neural networks) to model
semantic relationships between terms, documents, and queries.
o Techniques like BERT (Bidirectional Encoder Representations from Transformers) are
used to understand context and relationships between words.
Advantages:
o Excellent at handling complex, natural language queries and understanding context.
o Can learn from large datasets and improve with more data.
Limitations:
o Requires significant computational resources.
o Needs large amounts of data for training, making it less practical for smaller datasets.
Evaluation of IR Models
Evaluation of IR models is crucial to assess their effectiveness and usability in real-world
scenarios.
Precision: The proportion of relevant documents retrieved out of the total retrieved documents.
Recall: The proportion of relevant documents retrieved out of the total relevant documents
available.
F1-Score: Harmonic mean of precision and recall, providing a balance between the two.
Mean Average Precision (MAP): The mean, over a set of queries, of each query’s average precision, providing a
single overall assessment of ranking quality.
Discounted Cumulative Gain (DCG): Measures the usefulness of a ranked result list by discounting each
document’s gain according to its position, emphasizing the importance of ranking relevant documents higher.
User Satisfaction: Subjective but vital metric, based on user feedback regarding the relevance of
results, speed, and ease of use.
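The quantitative metrics above can be computed as in the following sketch. The ranked list, the relevance judgments, and the function names are invented for the example; DCG is shown in its common form with a log2(rank + 1) discount and binary gains.

```python
import math

def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant document appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

ranked = ["d3", "d1", "d7", "d2"]   # system output, best first
relevant = {"d1", "d2", "d5"}       # hypothetical relevance judgments

p, r, f1 = precision_recall_f1(ranked, relevant)
ap = average_precision(ranked, relevant)
gains = [1 if doc in relevant else 0 for doc in ranked]

print(round(p, 3), round(r, 3), round(f1, 3))  # 0.5 0.667 0.571
print(round(ap, 3), round(dcg(gains), 3))      # 0.333 1.062
```

MAP is then the mean of `average_precision` over all queries in the test set.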