
What Is Information Retrieval (IR)

The document discusses similarity functions in Information Retrieval (IR), detailing various methods such as Cosine Similarity, Jaccard Similarity, TF-IDF, and Word Embedding Similarity, which are used to compare queries and documents for relevance. It also covers query expansion techniques, document normalization, multi-field retrieval, and the evaluation of retrieval systems, highlighting the importance of these processes in improving search efficiency and accuracy. Additionally, it addresses the challenges faced in IR, including information overload and the precision-recall trade-off.


Similarity Function in Information Retrieval (IR)

In Information Retrieval (IR), similarity functions are used to compare queries and
documents or compare documents with each other. These functions help in ranking documents
based on how relevant they are to a given query. Choosing the right similarity function depends
on:

● The nature of the data
● The specific retrieval task
● The characteristics of the document collection

Now, let's go over different types of similarity functions one by one.

1. Vector Space Model (Cosine Similarity)

●​ In Vector Space Model (VSM), both queries and documents are represented as
vectors in a high-dimensional space.
●​ Each dimension represents a unique term from the vocabulary.
●​ The Cosine Similarity is used to measure how similar a document vector is to a query
vector.
●​ It is calculated using the cosine of the angle between the two vectors.

Example:

Let's say we have two documents:

Document 1: "Information retrieval uses similarity functions."​


Document 2: "Similarity functions are used in search engines."

After converting them into vectors based on word frequency, cosine similarity will measure the
angle between them. The smaller the angle, the more similar the documents are.
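The computation above can be sketched in Python (a minimal sketch using raw word counts; the function name is illustrative):

```python
from collections import Counter
import math

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two word-count vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

d1 = "Information retrieval uses similarity functions"
d2 = "Similarity functions are used in search engines"
# The two shared terms ("similarity", "functions") give a moderate score:
print(round(cosine_similarity(d1, d2), 2))  # 0.34
```

Identical documents would score 1.0 (angle 0°); documents with no shared terms score 0.0.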

2. Jaccard Similarity

●​ Measures similarity between two sets (e.g., sets of words in two documents).

● Formula: J(A, B) = |A ∩ B| / |A ∪ B|
● where A and B are sets of words in two documents.

Example:

Document 1: {information, retrieval, similarity, functions}​


Document 2: {similarity, functions, search, engines}
Intersection: {similarity, functions} → 2 words​
Union: {information, retrieval, similarity, functions, search, engines}
→ 6 words​
Jaccard similarity = 2/6 ≈ 0.33

A higher Jaccard similarity means two documents share more words.
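The set arithmetic above is a one-liner in Python (a minimal sketch):

```python
def jaccard(text_a, text_b):
    """Jaccard similarity: |A intersect B| / |A union B| over word sets."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b)

# Intersection {similarity, functions} = 2 words, union = 6 words:
print(round(jaccard("information retrieval similarity functions",
                    "similarity functions search engines"), 2))  # 0.33
```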

3. TF-IDF (Term Frequency-Inverse Document Frequency)

●​ TF-IDF helps determine how important a word is in a document relative to the entire
collection.

● Formula: TF-IDF(t, d) = TF(t, d) × IDF(t)
● TF (Term Frequency) → How often a term appears in a document.
● IDF (Inverse Document Frequency) → How rare a term is across all documents; typically IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing t.

Example:

If the word "retrieval" appears 10 times in one document but only in 2 out of 1000 documents,
it will have a higher TF-IDF score, making it more important for ranking.
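One common TF-IDF variant (relative term frequency times log inverse document frequency; exact weightings differ between systems) can be sketched as:

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF score of `term` in `doc` relative to `corpus` (lists of tokens)."""
    tf = doc.count(term) / len(doc)                  # term frequency in the document
    df = sum(1 for d in corpus if term in d)         # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0  # rare terms get a larger IDF
    return tf * idf

corpus = [
    ["retrieval", "models", "rank", "documents"],
    ["cats", "like", "documents"],
    ["dogs", "chase", "cats"],
    ["documents", "about", "dogs"],
]
# "retrieval" (in 1 of 4 docs) outweighs the common "documents" (in 3 of 4):
print(tf_idf("retrieval", corpus[0], corpus) > tf_idf("documents", corpus[0], corpus))  # True
```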

4. Word Embedding Similarity (Word2Vec, GloVe, fastText)

●​ Unlike simple word matching, word embeddings represent words as dense vectors in
a continuous space.
●​ Similarity is computed by averaging word embeddings in a document and comparing
them.
●​ Used in deep learning-based retrieval models.

Example:

"Car" and "Vehicle" might have similar embeddings because they often appear in similar
contexts.
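The idea can be sketched with hand-made toy vectors (the 3-dimensional values below are invented for illustration; real systems load pretrained Word2Vec, GloVe, or fastText embeddings with hundreds of dimensions):

```python
import math

# Toy embeddings, invented for illustration only.
EMB = {
    "car":     [0.9, 0.1, 0.0],
    "vehicle": [0.8, 0.2, 0.1],
    "banana":  [0.0, 0.9, 0.4],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def doc_vector(tokens):
    """Represent a document as the average of its known word embeddings."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

# "car" is closer to "vehicle" than to "banana" in embedding space:
print(cosine(EMB["car"], EMB["vehicle"]) > cosine(EMB["car"], EMB["banana"]))  # True
```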

Query Expansion

Query expansion improves search performance by modifying the original query to retrieve
more relevant results.

Steps in Query Expansion

1.​ Identifying Query Terms:


○​ The search engine analyzes the user’s query and extracts keywords.
2.​ Expanding Query Terms:
○​ New related words are added to improve search results.
3.​ Methods of Query Expansion:
○​ Synonym Expansion: Adds words with similar meaning.
■​ Example: "Big" → "Large"
○​ Related Term Expansion: Adds words that are semantically related.
■​ Example: "Laptop" → "Computer"
○​ Concept Expansion: Uses knowledge bases like Wikipedia or WordNet to find
related concepts.
■​ Example: "COVID-19" → "Pandemic"
○​ Feedback-based Expansion: Uses top-ranked documents to find additional
relevant terms.
○​ Automatic Expansion: Some search engines use predefined rules for
expansion.
4.​ Re-Ranking of Results:
○​ Once the query is expanded, documents are ranked using TF-IDF, BM25, or
neural ranking models.
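Steps 1–3 can be sketched with a hypothetical synonym table (a real system might query WordNet or a thesaurus instead):

```python
# Hypothetical synonym table standing in for WordNet or a thesaurus.
SYNONYMS = {"big": ["large"], "laptop": ["computer", "notebook"]}

def expand_query(query):
    """Keep the original keywords and append any known synonyms."""
    tokens = query.lower().split()
    expanded = list(tokens)
    for t in tokens:
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

print(expand_query("big laptop"))  # ['big', 'laptop', 'large', 'computer', 'notebook']
```

The expanded term list would then be fed to the ranking stage (TF-IDF, BM25, or a neural model).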

Query Expansion with Local Analysis

Query expansion enhances a user’s search query by adding relevant terms, improving search
engine performance.

How Local Analysis Helps in Query Expansion

Local analysis analyzes the terms within a set of retrieved documents (rather than relying
on external knowledge bases) to understand the user's intent better. This improves search
relevance.

1.​ Tokenization
○​ Splitting text into individual words or phrases.
○​ Example:
■​ Query: "Best programming languages for AI"
■​ Tokens: ["Best", "programming", "languages", "for",
"AI"]
2.​ Stopword Removal
○​ Removing common words (e.g., the, is, and, or) that do not impact search
results.
○​ Example:
■​ Before: ["Best", "programming", "languages", "for",
"AI"]
■​ After: ["Best", "programming", "languages", "AI"]
3.​ Stemming & Lemmatization
○​ Stemming: Reduces words to their root form (e.g., running → run).
○​ Lemmatization: Converts words to their base dictionary form (e.g., better →
good).
4.​ Term Frequency (TF) Calculation
○​ Determines how often a term appears in a document.
○​ Example:
■​ Document 1: "AI is transforming industries. AI is important."
■​ TF of “AI” = 2 (appears twice).
5.​ Term Importance Weighting
○​ Words that appear frequently in many documents are less important.
○​ Example: TF-IDF assigns higher importance to unique words.
6.​ Semantic Analysis
○​ Finds contextually similar words.
○​ Example: "car" and "vehicle" are semantically related.
7.​ Query Expansion Using Local Analysis
○​ Based on these steps, related terms (synonyms, similar words) are added to
improve search relevance.
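Steps 1, 2, and 4 (tokenization, stopword removal, and term-frequency counting) can be sketched as:

```python
from collections import Counter

STOPWORDS = {"the", "is", "and", "or", "for", "a", "an", "in"}

def tokenize(text):
    """Split on whitespace, strip punctuation, and lowercase each token."""
    return [t.strip(".,!?").lower() for t in text.split()]

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

doc = "AI is transforming industries. AI is important."
tokens = remove_stopwords(tokenize(doc))
tf = Counter(tokens)
print(tf["ai"])  # 2
```

Stemming and lemmatization (steps 3) would normally use a library such as NLTK and are omitted from this sketch.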

Query Expansion with External Resources

Unlike local analysis, this method uses external knowledge sources like:

●​ Lexical databases (e.g., WordNet)


●​ Ontologies (structured concepts)
●​ Knowledge graphs (e.g., Google Knowledge Graph)
●​ Domain-specific dictionaries (e.g., medical or legal glossaries)

Steps for Query Expansion with External Resources

1.​ Identifying External Resources


○​ Examples:
■​ WordNet → Provides synonyms, hyponyms, hypernyms.
■​ Wikipedia → Retrieves conceptually related terms.
2.​ Extracting Additional Terms
○​ The search engine queries the external source and extracts related words.
○​ Example:
■​ User Query: "Renewable energy sources"
■​ External resource (Wikipedia) suggests: "solar energy, wind power,
hydropower"
3.​ Expanding the Query
○​ The new terms are merged with the original query.
○​ Example:
■​ Original: "Renewable energy"
■​ Expanded: "Renewable energy, solar power, wind energy,
hydropower"
4.​ Weighting & Integration
○​ More relevant terms (from trusted sources) get higher priority.
○​ Example: Terms from an expert database (e.g., IEEE research papers) get more
weight.
5.​ Retrieval & Ranking
○​ The expanded query is used for document retrieval.
○​ Ranking methods like:
■​ TF-IDF
■​ BM25 (improved TF-IDF)
■​ Neural ranking models (AI-based)
6.​ Evaluation & Feedback
○​ The effectiveness of query expansion is measured using:
■​ Precision (Are results correct?)
■​ Recall (Are all relevant results retrieved?)
■​ User Satisfaction (Do users find useful information?)
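Steps 2–4 can be sketched with a hypothetical lookup table standing in for the external resource (a production system would call the resource's real API, such as WordNet or a knowledge graph):

```python
# Hypothetical related-term table, invented for illustration.
RELATED_TERMS = {"renewable energy": ["solar power", "wind energy", "hydropower"]}

def expand_with_weights(query, original_weight=1.0, expansion_weight=0.5):
    """Merge external suggestions into the query, down-weighting added terms."""
    weighted = {query: original_weight}
    for term in RELATED_TERMS.get(query.lower(), []):
        weighted[term] = expansion_weight
    return weighted

print(expand_with_weights("Renewable energy"))
# {'Renewable energy': 1.0, 'solar power': 0.5, 'wind energy': 0.5, 'hydropower': 0.5}
```

Giving expansion terms a lower weight keeps the original query dominant during ranking.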

1. Document Normalization

Document normalization is the process of standardizing text data to improve the efficiency,
accuracy, and reliability of information retrieval (IR) and text processing.

Why is Document Normalization Important?

When dealing with large document collections, there may be inconsistencies due to:

●​ Different word forms (e.g., "running" vs. "ran")


●​ Noise (extra spaces, special characters, inconsistent formatting)
●​ Redundant content (duplicate data)
●​ Irregular capitalization (e.g., "USA" vs. "usa")
Normalization removes these inconsistencies, making searches more accurate and
computational processes faster.

Key Techniques in Document Normalization

Here are some common techniques used in document normalization:

1.​ Lowercasing
○​ Converts all text to lowercase for uniformity.
○​ Example: "Machine Learning" → "machine learning"
○​ Benefit: Ensures that searches are case-insensitive.
2.​ Stopword Removal
○​ Removes common words that do not add much meaning to the search (e.g.,
"the", "is", "and").
○​ Example: "the big brown fox" → "big brown fox"
○​ Benefit: Reduces storage and processing time.
3.​ Stemming and Lemmatization
○​ Stemming reduces words to their base form by chopping off suffixes.
■​ Example: "running", "runs", "ran" → "run"
○​ Lemmatization converts words into their dictionary form.
■​ Example: "better" → "good"
○​ Benefit: Treats different word variations as the same term.
4.​ Punctuation & Special Character Removal
○​ Eliminates unnecessary symbols.
○​ Example: "Hello, world!" → "Hello world"
○​ Benefit: Prevents indexing of irrelevant characters.
5.​ Deduplication
○​ Removes duplicate documents from the database.
○​ Example: Two identical news articles are merged into one.
6.​ Whitespace and Formatting Standardization
○​ Ensures consistent spacing and formatting.
○​ Example:
■​ "Machine  Learning" → "Machine Learning"
■​ "Hello\nWorld" → "Hello World"
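Techniques 1, 2, 4, and 6 above can be combined into a small pipeline (a minimal sketch; stemming/lemmatization would need a library such as NLTK and is omitted here):

```python
import re

STOPWORDS = {"the", "is", "a", "an", "and"}

def normalize(text):
    text = text.lower()                       # 1. lowercasing
    text = re.sub(r"[^\w\s]", "", text)       # 4. strip punctuation/special chars
    text = re.sub(r"\s+", " ", text).strip()  # 6. standardize whitespace
    return [t for t in text.split() if t not in STOPWORDS]  # 2. stopword removal

print(normalize("The  Big, Brown Fox!"))  # ['big', 'brown', 'fox']
```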

Impact of Document Normalization

Normalization significantly enhances information retrieval in the following ways:

1.​ Indexing Efficiency


○​ Why? Standardized documents allow faster indexing and use less storage
space.
○​ Example: Instead of storing different forms of "run" ("running", "ran"), the system
stores just "run".
2.​ Improved Search Relevance
○​ Why? Normalization ensures different word variations are treated the same.
○​ Example:
■​ Query: "color"
■​ Document contains "colours" → Without normalization, it might be
missed.
3.​ Reduced Redundancy
○​ Why? Eliminates duplicate content.
○​ Example: Wikipedia pages with identical summaries can be merged into one.
4.​ Consistency Across Documents
○​ Why? Helps in structured comparisons.
○​ Example: "U.S.A.", "USA", and "United States" all become "USA".
5.​ Enhanced Computational Efficiency
○​ Why? Reducing unnecessary variations decreases processing time.
○​ Example: If query processing time reduces from 0.8s to 0.3s, searches become
faster.
6.​ Interoperability and Integration
○​ Why? Standardized documents can be used across multiple platforms.

Multi-field Retrieval

Multi-field retrieval is an advanced search approach where documents are indexed and
retrieved based on multiple attributes (or fields) rather than just plain text content.

Why is Multi-field Retrieval Important?

Documents often contain rich metadata that helps improve search precision. Instead of
searching only in the main text, multi-field retrieval considers other relevant fields like:

●​ Title
●​ Author
●​ Date
●​ Categories
●​ Keywords
●​ Abstract
●​ References

For example, a research paper search engine can rank results differently based on:

●​ Title relevance (most important)


●​ Abstract match (secondary importance)
●​ Keywords and references (lower importance)

How Multi-field Retrieval Works


Each field gets a score based on its relevance, and these scores are combined to produce a
final ranking score.

Example:​
Imagine you are searching for research papers on "deep learning in healthcare". Suppose a paper's title match contributes 0.5, its abstract match 0.3, and its keyword match 0.2 (mirroring the field importance above).

Final Relevance Score = 0.5 + 0.3 + 0.2 = 1.0​

This paper would rank higher than another paper with a lower score.
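The weighted combination can be sketched as follows (the 0.5 / 0.3 / 0.2 weights are illustrative):

```python
# Illustrative field weights: title counts most, keywords least.
FIELD_WEIGHTS = {"title": 0.5, "abstract": 0.3, "keywords": 0.2}

def combined_score(field_scores):
    """Weighted sum of per-field relevance scores (each score in [0, 1])."""
    return sum(FIELD_WEIGHTS[field] * score for field, score in field_scores.items())

# A paper matching perfectly in every field reaches the maximum score:
print(round(combined_score({"title": 1.0, "abstract": 1.0, "keywords": 1.0}), 2))  # 1.0
```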

Techniques in Multi-field Retrieval

1.​ Field Weighting


○​ Assigns different importance levels to different fields.
○​ Example:
■​ In a job search engine, "Job Title" might have higher priority than
"Job Description".
2.​ Field-specific Ranking Algorithms
○​ TF-IDF for abstract
○​ BM25 for main text
○​ Neural ranking models for author relevance
3.​ Query Expansion Based on Fields
○​ If a query matches keywords, related fields are searched too.
○​ Example: Searching "Quantum Computing" might also search "Physics" or
"AI" fields.

Applications of Multi-field Retrieval

1.​ Search Engines


○​ Google ranks pages based on title, meta description, headings, and body.
2.​ Academic Research
○​ ResearchGate and Google Scholar use title, author, abstract, and references
for ranking.
3.​ E-commerce
○​ Amazon searches across product title, description, brand, and category.
4.​ Job Portals
○​ Title, Company, Location, and Job Type are all considered.

Evaluation of Retrieval

Retrieval evaluation assesses how well an Information Retrieval (IR) system performs when
fetching relevant documents based on user queries.​
A good retrieval system should:

●​ Retrieve relevant documents.


●​ Display the most useful results at the top.
●​ Ensure fast and efficient search operations.
●​ Improve user satisfaction by reducing irrelevant results.

Why is Retrieval Evaluation Important?


Without proper evaluation, search engines may:

● Rank irrelevant results higher
● Miss out on important documents
● Take too long to return results
● Fail to understand user intent properly

Thus, evaluating retrieval performance helps identify weaknesses and improve the system.

Key Retrieval Evaluation Metrics

Different evaluation metrics assess different aspects of search effectiveness.

1. Precision, Recall, and F1-score

●​ These are standard IR metrics used to measure retrieval accuracy.


●​ Precision: Measures how many of the retrieved documents are actually relevant. Precision = (relevant documents retrieved) / (total documents retrieved)
●​ Example:
○​ Query: "Machine Learning"
○​ Retrieved Documents: {D1, D2, D3, D4, D5}
○​ Relevant Documents: {D1, D2, D4}
○​ Precision = 3/5 = 0.6 (60%)

Recall: Measures how many of the relevant documents were actually retrieved. Recall = (relevant documents retrieved) / (total relevant documents)

●​ Example:
○​ If the dataset contains 7 relevant documents in total (the 3 retrieved above plus 4 that were missed)
○​ Recall = 3/7 ≈ 0.43 (43%)

F1-score: Balances Precision and Recall. F1 = 2 × Precision × Recall / (Precision + Recall)

○​ Example: If Precision = 60% and Recall ≈ 43%,
■​ F1-score ≈ 50%
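The three metrics can be computed together (a minimal sketch; the relevant set below extends the retrieval example into a self-consistent dataset, so D6–D9 are hypothetical missed documents):

```python
def precision_recall_f1(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# 5 retrieved, 3 of them relevant, 7 relevant documents overall:
p, r, f = precision_recall_f1(
    retrieved={"D1", "D2", "D3", "D4", "D5"},
    relevant={"D1", "D2", "D4", "D6", "D7", "D8", "D9"},
)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.6 0.43 0.5
```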

2. Mean Average Precision (MAP)

●​ Evaluates how well relevant documents are ranked in a set of queries.


●​ Computes the average precision across multiple queries.
●​ Higher MAP → Better ranking performance.
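Average precision for a single ranked result list, and MAP across several queries, can be sketched as:

```python
def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """MAP: mean of average precision over several (ranking, relevant-set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Relevant docs ranked 1st and 3rd: AP = (1/1 + 2/3) / 2 ≈ 0.83
print(round(average_precision(["D1", "D3", "D2"], {"D1", "D2"}), 2))  # 0.83
```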

3. Normalized Discounted Cumulative Gain (NDCG)

●​ Measures the ranking quality of retrieved documents.


●​ Key idea:
○​ Highly relevant documents should appear at the top.
○​ Lower-ranked documents contribute less to the score.

●​ Formula: NDCG = DCG / IDCG
●​ where DCG discounts each result's relevance based on its rank, and IDCG is the best possible DCG score (the ideal ordering).
●​ Example:
○​ If D1 is highly relevant but ranked at position 5, its impact is lower than if it
were ranked at 1.
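Using graded relevance labels (e.g. 3 = highly relevant, 0 = irrelevant) and the standard log2 rank discount, NDCG can be sketched as:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance at rank i is divided by log2(i + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the ideal (best possible) ordering's DCG."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Putting the highly relevant document (3) first yields a higher score:
print(ndcg([3, 1, 0]) > ndcg([0, 1, 3]))  # True
```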

Other Evaluation Techniques

✅ User Feedback: Direct responses from users on search quality.​


✅ Click-through Rate (CTR): How often users click retrieved results.​
✅ Dwell Time: How long users stay on a document before returning to search results.
2. Disadvantages of Information Retrieval (IR)

While IR systems improve access to vast amounts of information, they also come with
challenges and limitations.

1. Information Overload

●​ Issue: Too many results, making it hard to find relevant information.


●​ Example: A Google search for "AI" returns millions of results, overwhelming users.
●​ Solution: Improve ranking algorithms to filter out low-quality results.

2. Precision-Recall Trade-off

●​ Issue:
○​ High Precision (accurate results) → Misses some relevant documents.
○​ High Recall (retrieves more documents) → Includes irrelevant results.
●​ Example:
○​ A medical search engine retrieving only the top 5 articles (high precision)
might miss some relevant studies (low recall).
●​ Solution: Use hybrid ranking models (e.g., TF-IDF + Neural Networks) to balance
precision and recall.

3. Vocabulary Mismatch

●​ Issue: Users might use different words than those in documents.


●​ Example:
○​ Query: "COVID treatment"
○​ Document uses: "SARS-CoV-2 therapy"
○​ Standard search fails because the words don’t match.


●​ Solution:
✅ Query expansion (e.g., "COVID treatment" → "coronavirus therapy")
✅ Semantic analysis (e.g., using Word2Vec for similar words)

4. Context Sensitivity

●​ Issue: Same words have different meanings.


●​ Example: "Apple"
○​ Fruit 🍏
○​ Tech Company 💻


●​ Solution:
✅ User intent detection using machine learning models.
✅ Personalized search (if a user recently searched for iPhones, prioritize Apple Inc.).

5. Bias and Fairness Issues

●​ Issue: IR systems can reflect algorithmic and data biases.


●​ Example:
○​ A job search engine might prioritize male-dominated professions if trained on
biased data.


●​ Solution:
✅ Diverse training datasets
✅ Bias-aware algorithms

6. Scalability & Efficiency

●​ Issue: IR systems struggle to process large-scale data.


●​ Example:
○​ Google handles 3.5 billion searches daily → Requires highly optimized
infrastructure.


●​ Solution:
✅ Efficient indexing (e.g., Inverted Index, Elasticsearch)
✅ Parallel computing (e.g., Hadoop, Spark)

7. Difficulty in Evaluation

●​ Issue: Determining relevance is subjective.


●​ Example:
○​ A news article on AI may be relevant for a tech researcher but not for a high
school student.


●​ Solution:
✅ User-based relevance feedback
✅ Customizable ranking models

8. Privacy & Security Concerns

●​ Issue: IR systems handle sensitive user data.


●​ Example:
○​ A health search engine storing personal medical queries risks data
breaches.

●​ Solution:
✅ End-to-end encryption
✅ User consent for data storage

1. Incentives of Engaging with E-commerce

E-commerce platforms encourage customer engagement using various strategies, making online shopping more attractive and seamless.

1.1. User-Generated Content (UGC)

●​ E-commerce platforms encourage customers to leave reviews and ratings.


●​ Why? Customers trust peer reviews more than advertisements.
●​ Example:
○​ On Amazon, products with thousands of reviews and high ratings sell better
than those with few or no reviews.
○​ TripAdvisor ranks hotels based on customer feedback, helping travelers make
better choices.

✅ Benefit: Builds social proof → More trust → More purchases.


1.2. Real-Time Inventory Updates

●​ E-commerce platforms track product availability in real-time.


●​ Why? Prevents users from ordering out-of-stock items.
●​ Example:
○​ Nike’s online store updates stock instantly after a purchase.


○​ Amazon shows:
■​ "Only 5 left in stock" → Encourages quick buying.
■​ ❌ "Out of stock" → Prevents false expectations.
✅ Benefit: Reduces cart abandonment and improves customer experience.
1.3. Personalized Product Recommendations

●​ Platforms use AI-based algorithms to recommend relevant products.


●​ Why? Increases chances of purchase by showing products the customer is already
interested in.
●​ Example:
○​ Netflix recommends movies based on viewing history.
○​ Amazon suggests:
■​ "Customers who bought this also bought..."
■​ "You may also like..."
✅ Benefit: Increases sales by suggesting relevant products.
1.4. Dynamic Pricing

●​ Prices change based on demand, competition, and user behavior.


●​ Why? Maximizes profits and attracts price-sensitive customers.
●​ Example:
○​ Uber’s surge pricing: Higher demand = Higher fares.
○​ Amazon price changes: Prices of items like electronics fluctuate daily.

✅ Benefit: Ensures competitive pricing and higher revenue.


1.5. Search Relevance

●​ Optimized search algorithms ensure customers find the right products quickly.
●​ Why? Poor search = Frustrated customers = Fewer purchases.
●​ Techniques used:
○​ Autocomplete: Predicts what the user is typing.
○​ Spell correction: "iphne" → "iPhone"
○​ Synonym matching: "sofa" = "couch"

✅ Benefit: Improves user experience and boosts engagement.


1.6. Targeted Marketing Campaigns

●​ E-commerce companies use customer data to send personalized promotions.


●​ Why? More relevant ads = Higher engagement.
●​ Example:
○​ Amazon sends personalized discount emails: "20% off on items in your
wishlist!"
○​ Facebook ads show products you recently viewed.

✅ Benefit: Higher conversion rates and customer retention.


1.7. Cross-Selling & Upselling

●​ Cross-selling: Suggests related products.


●​ Upselling: Encourages customers to buy a better version.
●​ Example:
○​ Amazon: "People also bought..."
○​ Apple Store: "Upgrade to iPhone Pro for just $200 more!"

✅ Benefit: Increases order value.


1.8. Interactive Shopping Experiences

●​ Technologies like Augmented Reality (AR) improve online shopping.


●​ Why? Enhances the customer experience.
●​ Example:
○​ Lenskart’s virtual try-on lets users see how glasses fit before buying.
○​ IKEA’s AR app lets customers see furniture in their home before purchase.

✅ Benefit: Reduces returns and increases engagement.


2. Forces Behind E-commerce

E-commerce growth is driven by several factors, from technology to consumer behavior.

2.1. Technological Innovation

●​ Faster internet speeds, better mobile devices, and secure payments have boosted
online shopping.
●​ Example:
○​ 5G networks enable faster mobile shopping.
○​ Google Pay & PayPal ensure safe transactions.

✅ Impact: Faster transactions & better accessibility.


2.2. Globalization

●​ Why? E-commerce removes borders—customers can buy from anywhere.


●​ Example:
○​ Alibaba & Amazon allow international shipping.
○​ Etsy helps local artists sell globally.

✅ Impact: Increased cross-border trade.


2.3. Consumer Behavior Shifts

●​ Consumers prefer convenience over physical shopping.


●​ Why? Online shopping offers variety, better prices, and home delivery.
●​ Example:
○​ COVID-19 pandemic boosted online grocery shopping.
○​ Flipkart’s Big Billion Sale attracts millions.

✅ Impact: Growth in e-commerce adoption.


2.4. Market Competition
●​ Why? More competition forces companies to innovate.
●​ Example:
○​ Amazon, Flipkart, and Reliance JioMart compete in India’s e-commerce
market.
○​ Amazon Prime’s free delivery forces rivals to offer better perks.

✅ Impact: Lower prices & better services for customers.


2.5. Digital Transformation

●​ Why? Even traditional retailers are moving online.


●​ Example:
○​ Reliance, Walmart, and H&M now have online stores.

✅ Impact: Brick-and-mortar stores are evolving into e-commerce businesses.


2.6. Mobile Commerce (M-commerce)

●​ Why? Mobile shopping is fast and easy.


●​ Example:
○​ Amazon, Myntra, and Flipkart apps drive huge sales via smartphones.
○​ UPI & PayTM enable instant mobile payments.

✅ Impact: Mobile shopping is growing rapidly.


2.7. Data-driven Insights

●​ Why? AI analyzes customer behavior & trends.


●​ Example:
○​ Amazon uses AI to recommend products based on past purchases.
○​ Netflix personalizes movie recommendations.

✅ Impact: Higher sales through personalization.


2.8. Logistics & Fulfillment Innovations

●​ Why? Fast shipping & order tracking improve satisfaction.


●​ Example:
○​ Amazon Prime delivers in 1 day.
○​ DHL & FedEx use AI for route optimization.

✅ Impact: Faster, reliable deliveries.


2.9. Regulatory Environment
●​ Why? Governments regulate consumer protection, taxation, and privacy.
●​ Example:
○​ GDPR protects customer data privacy in Europe.
○​ India’s new e-commerce rules ensure fair pricing.

✅ Impact: Trust in online shopping increases.


Summary
●​ E-commerce engagement is driven by personalized recommendations, interactive
shopping, and targeted marketing.
●​ E-commerce growth depends on technology, globalization, consumer behavior, and
regulatory policies.
●​ Emerging trends like AI, AR, and data analytics are shaping the future of e-commerce.
