What Is Information Retrieval (IR)
In Information Retrieval (IR), similarity functions are used to compare queries with
documents or to compare documents with each other. These functions help rank documents
by how relevant they are to a given query. Choosing the right similarity function depends
on how queries and documents are represented.
1. Cosine Similarity
● In Vector Space Model (VSM), both queries and documents are represented as
vectors in a high-dimensional space.
● Each dimension represents a unique term from the vocabulary.
● The Cosine Similarity is used to measure how similar a document vector is to a query
vector.
● It is calculated using the cosine of the angle between the two vectors: cos(θ) = (Q · D) / (‖Q‖ ‖D‖).
Example:
After converting two documents into vectors based on word frequency, cosine similarity
measures the angle between them. The smaller the angle, the more similar the documents are.
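As a minimal sketch in plain Python (the two example texts are made up), cosine similarity over word-frequency vectors can be computed like this:

```python
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between the word-frequency vectors of two texts."""
    vec_a, vec_b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    # Dot product over the shared vocabulary only (other terms contribute 0).
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("information retrieval systems",
                        "information retrieval models"))  # ≈ 0.67
```

Identical texts score 1.0, texts with no shared words score 0.0.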
2. Jaccard Similarity
● Measures similarity between two sets (e.g., sets of words in two documents).
● Formula: J(A, B) = |A ∩ B| / |A ∪ B|
● where A and B are the sets of words in the two documents.
Example:
If A = {"information", "retrieval", "systems"} and B = {"information", "retrieval",
"models"}, then J(A, B) = 2/4 = 0.5.
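A minimal sketch of set-based Jaccard similarity over the word sets of two texts:

```python
def jaccard_similarity(doc_a: str, doc_b: str) -> float:
    """|A intersection B| / |A union B| over the word sets of two texts."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

print(jaccard_similarity("information retrieval systems",
                         "information retrieval models"))  # 0.5
```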
3. TF-IDF Similarity
● TF-IDF helps determine how important a word is in a document relative to the entire
collection.
● Formula: TF-IDF(t, d) = TF(t, d) × IDF(t), where IDF(t) = log(N / DF(t)) and N is the
total number of documents.
● TF (Term Frequency) → How often a term appears in a document.
● IDF (Inverse Document Frequency) → How unique a term is across all documents.
Example:
If the word "retrieval" appears 10 times in one document but occurs in only 2 out of 1,000
documents, it will have a high TF-IDF score, making it important for ranking.
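A minimal TF-IDF sketch using one common IDF variant, log(N / DF); the toy corpus is illustrative:

```python
import math
from collections import Counter

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF of `term` in `doc`, with IDF = log(N / document frequency)."""
    tf = Counter(doc)[term]                   # raw term frequency in the document
    df = sum(1 for d in corpus if term in d)  # number of documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

docs = [["ai", "retrieval", "ai"], ["search", "engines"], ["retrieval", "models"]]
# "ai" is frequent in docs[0] and rare in the corpus, so it scores highly there.
print(tf_idf("ai", docs[0], docs))
```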
4. Word Embedding Similarity
● Unlike simple word matching, word embeddings represent words as dense vectors in
a continuous space.
● Similarity is computed by averaging word embeddings in a document and comparing
them.
● Used in deep learning-based retrieval models.
Example:
"Car" and "Vehicle" might have similar embeddings because they often appear in similar
contexts.
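The idea can be sketched with tiny hand-made embeddings (real systems learn these with models such as Word2Vec or GloVe; the 3-d vectors below are purely illustrative):

```python
import math

# Toy, hand-made 3-d embeddings; real embeddings are learned, not hand-written.
EMBEDDINGS = {
    "car":     [0.9, 0.1, 0.0],
    "vehicle": [0.8, 0.2, 0.0],
    "banana":  [0.0, 0.1, 0.9],
}

def doc_vector(words: list[str]) -> list[float]:
    """Average the embeddings of the words in a document."""
    vecs = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

print(cosine(doc_vector(["car"]), doc_vector(["vehicle"])))  # close to 1
print(cosine(doc_vector(["car"]), doc_vector(["banana"])))   # much lower
```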
Query Expansion
Query expansion improves search performance by enhancing the user's original query with
additional relevant terms, so that more relevant results are retrieved.
Local Analysis
Local analysis examines the terms within a set of retrieved documents (rather than relying
on external knowledge bases) to better understand the user's intent, which improves search
relevance.
1. Tokenization
○ Splitting text into individual words or phrases.
○ Example:
■ Query: "Best programming languages for AI"
■ Tokens: ["Best", "programming", "languages", "for",
"AI"]
2. Stopword Removal
○ Removing common words (e.g., the, is, and, or) that do not impact search
results.
○ Example:
■ Before: ["Best", "programming", "languages", "for",
"AI"]
■ After: ["Best", "programming", "languages", "AI"]
3. Stemming & Lemmatization
○ Stemming: Reduces words to their root form (e.g., running → run).
○ Lemmatization: Converts words to their base dictionary form (e.g., better →
good).
4. Term Frequency (TF) Calculation
○ Determines how often a term appears in a document.
○ Example:
■ Document 1: "AI is transforming industries. AI is important."
■ TF of “AI” = 2 (appears twice).
5. Term Importance Weighting
○ Words that appear frequently in many documents are less important.
○ Example: TF-IDF assigns higher importance to unique words.
6. Semantic Analysis
○ Finds contextually similar words.
○ Example: "car" and "vehicle" are semantically related.
7. Query Expansion Using Local Analysis
○ Based on these steps, related terms (synonyms, similar words) are added to
improve search relevance.
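The steps above can be sketched in plain Python (the stopword list and documents are illustrative; stemming, lemmatization, and semantic analysis are omitted for brevity):

```python
from collections import Counter

STOPWORDS = {"the", "is", "and", "or", "for", "a", "an"}

def analyze(text: str) -> list[str]:
    """Steps 1-2: tokenize, lowercase, and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

def term_frequencies(retrieved_docs: list[str]) -> Counter:
    """Step 4: count terms across the retrieved documents."""
    counts = Counter()
    for doc in retrieved_docs:
        counts.update(analyze(doc))
    return counts

def expand(query: str, retrieved_docs: list[str], k: int = 2) -> list[str]:
    """Step 7: add the k most frequent new terms from the retrieved set."""
    tokens = analyze(query)
    frequent = [t for t, _ in term_frequencies(retrieved_docs).most_common()
                if t not in tokens]
    return tokens + frequent[:k]

docs = ["python is a programming language",
        "ai uses programming and machine learning"]
print(expand("best programming languages for ai", docs))
```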
Global Analysis
Unlike local analysis, this method uses external knowledge sources such as thesauri
(e.g., WordNet), dictionaries, and ontologies.
Document Normalization
Document normalization is the process of standardizing text data to improve the efficiency,
accuracy, and reliability of information retrieval (IR) and text processing.
When dealing with large document collections, inconsistencies can arise from differences in
capitalization, punctuation, spelling variants, whitespace, and formatting. Common
normalization steps include:
1. Lowercasing
○ Converts all text to lowercase for uniformity.
○ Example: "Machine Learning" → "machine learning"
○ Benefit: Ensures that searches are case-insensitive.
2. Stopword Removal
○ Removes common words that do not add much meaning to the search (e.g.,
"the", "is", "and").
○ Example: "the big brown fox" → "big brown fox"
○ Benefit: Reduces storage and processing time.
3. Stemming and Lemmatization
○ Stemming reduces words to their base form by chopping off suffixes.
■ Example: "running", "runs", "ran" → "run"
○ Lemmatization converts words into their dictionary form.
■ Example: "better" → "good"
○ Benefit: Treats different word variations as the same term.
4. Punctuation & Special Character Removal
○ Eliminates unnecessary symbols.
○ Example: "Hello, world!" → "Hello world"
○ Benefit: Prevents indexing of irrelevant characters.
5. Deduplication
○ Removes duplicate documents from the database.
○ Example: Two identical news articles are merged into one.
6. Whitespace and Formatting Standardization
○ Ensures consistent spacing and formatting.
○ Example:
■ "Machine   Learning" → "Machine Learning" (extra spaces collapsed)
■ "Hello\nWorld" → "Hello World"
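Several of the steps above can be combined in one small normalization function (this sketch also lowercases, so the outputs differ in case from the examples above; the stopword list is illustrative):

```python
import re

STOPWORDS = {"the", "is", "and", "a"}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, drop stopwords, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # punctuation & special characters
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)              # also standardizes whitespace

print(normalize("The  big, brown fox!\nHello, World."))  # "big brown fox hello world"
```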
Multi-field Retrieval
Multi-field retrieval is an advanced search approach where documents are indexed and
retrieved based on multiple attributes (or fields) rather than just plain text content.
Documents often contain rich metadata that helps improve search precision. Instead of
searching only in the main text, multi-field retrieval considers other relevant fields like:
● Title
● Author
● Date
● Categories
● Keywords
● Abstract
● References
For example, a research paper search engine can rank results differently depending on
whether the query terms appear in the title, the abstract, or the body text.
Example:
Imagine you are searching for research papers on "deep learning in healthcare". A paper
with that phrase in its title can be ranked above a paper that only mentions it in the body.
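A minimal sketch of field-weighted scoring; the documents, weights, and simple substring matching are all assumptions made for illustration:

```python
# Title matches count more than abstract or body matches (weights are arbitrary).
FIELD_WEIGHTS = {"title": 3.0, "abstract": 2.0, "body": 1.0}

def field_score(doc: dict[str, str], query: str) -> float:
    """Weighted count of query terms found in each field (substring match)."""
    terms = query.lower().split()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        text = doc.get(field, "").lower()
        score += weight * sum(1 for t in terms if t in text)
    return score

papers = [
    {"title": "Deep learning in healthcare", "abstract": "A survey.", "body": "..."},
    {"title": "Graph databases", "abstract": "Mentions deep learning.",
     "body": "healthcare data"},
]
ranked = sorted(papers, key=lambda d: field_score(d, "deep learning healthcare"),
                reverse=True)
print(ranked[0]["title"])  # the title match ranks first
```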
Retrieval evaluation assesses how well an Information Retrieval (IR) system performs when
fetching relevant documents based on user queries.
A good retrieval system should retrieve relevant documents, rank them appropriately, and
respond quickly.
Without proper evaluation, search engines may:
❌ Rank irrelevant results higher
❌ Miss out on important documents
❌ Take too long to return results
❌ Fail to understand user intent properly
Thus, evaluating retrieval performance helps identify weaknesses and improve the system.
● Formula: Recall = (relevant documents retrieved) / (total relevant documents)
● Example:
○ If the total relevant documents in the dataset are {D1, D2, D3, D4, D5, D6,
D7} and the system retrieves 3 of them,
○ Recall = 3/7 ≈ 0.43 (43%)
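The recall example (and the closely related precision measure) can be checked with a few lines of Python; the retrieved set below is made up:

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision = hits / retrieved; Recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"D1", "D2", "D3", "D4", "D5", "D6", "D7"}
retrieved = {"D1", "D3", "D7", "D9"}  # 3 relevant hits out of 4 returned
p, r = precision_recall(retrieved, relevant)
print(round(p, 2), round(r, 2))  # 0.75 0.43
```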
● Formula: NDCG = DCG / IDCG, where DCG = Σ relᵢ / log₂(i + 1), summed over rank
positions i
● DCG discounts results based on their rank, and IDCG is the best possible DCG score.
● Example:
○ If D1 is highly relevant but ranked at position 5, its impact is lower than if it
were ranked at 1.
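A minimal NDCG sketch using the log₂(i + 1) discount (the relevance lists below are made up):

```python
import math

def dcg(relevances: list[float]) -> float:
    """DCG = sum of rel_i / log2(i + 1), with ranks i starting at 1."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances: list[float]) -> float:
    """Normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

print(ndcg([3, 2, 0, 1]))  # near-ideal ranking, so close to 1
print(ndcg([0, 1, 2, 3]))  # the best results ranked last, so noticeably lower
```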
While IR systems improve access to vast amounts of information, they also come with
challenges and limitations.
1. Information Overload
● Issue: The volume of available documents keeps growing, making it hard to surface
the few that are truly relevant to a query.
2. Precision-Recall Trade-off
● Issue:
○ High Precision (accurate results) → Misses some relevant documents.
○ High Recall (retrieves more documents) → Includes irrelevant results.
● Example:
○ A medical search engine retrieving only the top 5 articles (high precision)
might miss some relevant studies (low recall).
● Solution: Use hybrid ranking models (e.g., TF-IDF + Neural Networks) to balance
precision and recall.
3. Vocabulary Mismatch
● Issue: Users and documents may use different words for the same concept, so
relevant documents are missed.
● Solution:
✅ Query expansion (e.g., "COVID treatment" → "coronavirus therapy")
✅ Semantic analysis (e.g., using Word2Vec for similar words)
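The query-expansion fix can be sketched with a hand-made synonym table (an assumption for illustration; real systems use thesauri or embedding neighbors such as Word2Vec's most-similar words):

```python
# Hand-made synonym table; real systems derive these automatically.
SYNONYMS = {
    "covid": ["coronavirus"],
    "treatment": ["therapy"],
}

def expand_query(query: str) -> list[str]:
    """Add known synonyms of each query term to bridge vocabulary mismatch."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query("COVID treatment"))
# ['covid', 'coronavirus', 'treatment', 'therapy']
```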
4. Context Sensitivity
● Issue: The same query can mean different things in different contexts (e.g., "Apple"
the company vs. the fruit).
● Solution:
✅ User intent detection using machine learning models
✅ Personalized search (if a user recently searched for iPhones, prioritize Apple Inc.)
5. Bias in Search Results
● Issue: Ranking models trained on skewed data can systematically favor some
sources or viewpoints.
● Solution:
✅ Diverse training datasets
✅ Bias-aware algorithms
6. Scalability and Speed
● Issue: Searching very large collections quickly requires heavy indexing and
computation.
● Solution:
✅ Efficient indexing (e.g., Inverted Index, Elasticsearch)
✅ Parallel computing (e.g., Hadoop, Spark)
7. Difficulty in Evaluation
● Issue: Relevance is subjective, so it is hard to measure how well a system truly
serves its users.
● Solution:
✅ User-based relevance feedback
✅ Customizable ranking models
8. Privacy Concerns
● Issue: Search systems collect query and click data that can expose sensitive user
information.
● Solution:
✅ End-to-end encryption
✅ User consent for data storage
● Example:
○ Amazon shows:
■ "Only 5 left in stock" → Encourages quick buying.
■ ❌ "Out of stock" → Prevents false expectations.
✅ Benefit: Reduces cart abandonment and improves customer experience.
1.3. Personalized Product Recommendations
● Optimized search algorithms ensure customers find the right products quickly.
● Why? Poor search = Frustrated customers = Fewer purchases.
● Techniques used:
○ Autocomplete: Predicts what the user is typing.
○ Spell correction: "iphne" → "iPhone"
○ Synonym matching: "sofa" = "couch"
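These three techniques can be sketched with the standard library (the product catalog and synonym table are made up; real engines use far richer models):

```python
import difflib

PRODUCTS = ["iphone", "ipad", "sofa", "speaker"]
SYNONYMS = {"couch": "sofa"}

def autocomplete(prefix: str, catalog: list[str]) -> list[str]:
    """Suggest catalog entries that start with what the user has typed."""
    prefix = prefix.lower()
    return [p for p in catalog if p.startswith(prefix)]

def correct_spelling(word: str, catalog: list[str]) -> str:
    """Map a misspelled word to the closest catalog entry (stdlib difflib)."""
    matches = difflib.get_close_matches(word.lower(), catalog, n=1)
    return matches[0] if matches else word

def match(query: str, catalog: list[str]) -> str:
    """Resolve synonyms first, then fall back to spell correction."""
    q = SYNONYMS.get(query.lower(), query.lower())
    return q if q in catalog else correct_spelling(q, catalog)

print(autocomplete("ip", PRODUCTS))  # ['iphone', 'ipad']
print(match("iphne", PRODUCTS))      # 'iphone'
print(match("couch", PRODUCTS))      # 'sofa'
```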
● Faster internet speeds, better mobile devices, and secure payments have boosted
online shopping.
● Example:
○ 5G networks enable faster mobile shopping.
○ Google Pay & PayPal ensure safe transactions.