
UNIT 3

Web Page Crawling Techniques


• Web page crawling techniques are methods used to systematically browse and
retrieve data from websites. These techniques are fundamental to search engines,
web scrapers, and data mining applications. Here are some common techniques:
• 1. Breadth-First Search (BFS)
• Approach: Starts from a given URL and explores all neighboring links (depth level
1) before moving to the next level.
• Use case: Good for crawling websites that have a lot of interlinked pages where
deep pages should be discovered after the immediate ones.
• 2. Depth-First Search (DFS)
• Approach: Starts from a given URL and follows each link to its depth before
backtracking.
• Use case: Suitable for crawling websites where all content should be retrieved,
even if it means a deep exploration of each link.
Breadth-First Search
BreadthFirst(StartingUrls)
{
    for (i = 0; i < StartingUrls.length; i++)
        Enqueue(Boundary, StartingUrls[i]);
    do {
        url = Dequeue(Boundary);
        Page = Fetch(url);
        Visited = Visited + 1;
        Enqueue(Boundary, ExtractLinks(Page));
    } while (Visited < MaxPages && Boundary != Empty);
}
• Start the Crawling Process
• The function starts with a set of seed URLs (StartingUrls).
• It initializes a loop to process the list of URLs.
• Adding URLs to the Boundary (Queue)
• A for loop runs over the starting URLs, adding each one to a queue called
Boundary.
• Start Crawling the Queued URLs
• They will be processed one by one.
• Fetch and Process the Page
• The crawler picks (Dequeue) the first URL from the Boundary (queue).
• It fetches the webpage from the internet.
• Track the Number of Visited Pages
• The number of visited pages increases (Visited = Visited + 1).
• Extract and Store New Links
• The crawler extracts new URLs
• These new links are added to the queue
• Continue Until Conditions Are Met
• The loop repeats until the number of visited pages reaches maxpages or the
Boundary becomes empty.
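A minimal Python sketch of the breadth-first crawling loop above. It uses the third-party requests and BeautifulSoup libraries for fetching and link extraction (an assumption; any HTTP client and HTML parser would do), and the seed URL in the usage note is hypothetical.

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def breadth_first_crawl(starting_urls, max_pages=100):
    boundary = deque(starting_urls)   # the frontier queue ("Boundary")
    seen = set(starting_urls)         # avoid enqueuing the same URL twice
    visited = 0
    while boundary and visited < max_pages:
        url = boundary.popleft()                 # Dequeue(Boundary)
        try:
            page = requests.get(url, timeout=5)  # Fetch(url)
        except requests.RequestException:
            continue
        visited += 1
        # ExtractLinks(Page): collect absolute URLs from anchor tags
        soup = BeautifulSoup(page.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in seen:
                seen.add(link)
                boundary.append(link)            # Enqueue(Boundary, link)
    return visited

# Usage (hypothetical seed URL):
# breadth_first_crawl(["https://example.com"], max_pages=10)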
Depth-First Search
Algorithm DFS(Graph G, Vertex v)
// Recursive algorithm
// (Example graph on the slide: A has children B and C; B has children D and E; C has child F.)
for all edges e in G.incidentEdges(v) do
    if e is unexplored then          // an "unexplored" edge is an edge that has not been visited yet
        w = G.opposite(v, e)         // the vertex at the other end of edge e
        label e as a discovery edge
        recursively call DFS(G, w)   // move forward to the next vertex (w)
    else
        label e as a back edge
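A minimal Python sketch of the recursive DFS labelling above, assuming the graph is given as an adjacency list (a simplified representation; the slide's pseudocode uses an edge-based graph ADT).

def dfs(graph, v, visited=None, discovery=None, back=None):
    # graph: dict mapping each vertex to the list of its neighbours
    if visited is None:
        visited, discovery, back = set(), [], []
    visited.add(v)
    for w in graph[v]:
        if w not in visited:
            discovery.append((v, w))   # label (v, w) as a discovery edge
            dfs(graph, w, visited, discovery, back)
        else:
            back.append((v, w))        # edge leads to an already-visited vertex
    return discovery, back

# Example graph from the slide: A-B, A-C, B-D, B-E, C-F
graph = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": [], "E": [], "F": []}
print(dfs(graph, "A"))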
Focused Crawling
• A focused crawler is a type of web crawler designed to collect only web
pages that are relevant to a specific topic while ignoring irrelevant pages.
• Focused crawler is used to collect those web pages that are relevant to a
particular topic while filtering out the irrelevant.
• Thus, focused crawling can be used to generate data for an individual
user.
• Focused crawlers rely on the fact that pages about a topic tend to have
links to other pages on the same topic.
• It would be possible to start a crawl at one on-topic page, then crawl all
pages on that topic just by following links from a single root page.
• Text classifiers are tools that can make this kind of distinction.
• The crawler uses the classifier to decide whether the page is on topic.
• There are three major challenges for focused crawling:
1. It needs to determine the relevance of a retrieved web page.
2. Predict and identify potential URLs that can lead to relevant pages.
3. Rank and order the relevant URLs so the crawler knows exactly what to
follow next.
This kind of web crawler builds its index more effectively, directly helping to
achieve the basic requirement of quicker and more relevant retrieval of data
from the huge repository of the World Wide Web.
Architecture of Focused crawling
• Seed URLs (Starting Pages)
• The process begins with a set of seed URLs (starting web pages) that are
related to the topic of interest.
• The URL queue stores the URLs that need to be crawled.
• These URLs are stored in the URL Queue, waiting to be processed.

Index (HashSet)
•The index (HashSet) keeps track of already visited URLs to avoid revisiting
the same pages and improve efficiency.
• Page Downloader
• The Page Downloader fetches web pages from the WWW (World Wide
Web) using the URLs from the URL queue.
• The downloaded pages are stored in the Crawling History & Page
Repository for future reference.
• Crawling History & Page Repository
• This stores previously crawled pages to avoid duplicate work and maintain a
record of visited pages.
• Training Set (Used for Learning)
• A Training Set of previously classified relevant and irrelevant pages helps the
system learn which pages to keep and which to discard.
• Classifier (Deciding If a Page Is Relevant)
• The classifier consists of three main parts:
• Page Analysis: Extracts text and content from the downloaded page.
• Characteristic Extraction: Identifies important features (keywords, topics, metadata) that
determine relevance.
• Relevance Analysis: Compares the extracted features with the Topic Taxonomy to decide if the
page is useful.
• Topic Taxonomy (Defining What’s Relevant)
• This is a database of topic-related terms that helps determine whether a page is
relevant.
• Example: If the topic is "Artificial Intelligence," the taxonomy might include words
like "Machine Learning," "Deep Learning," "Neural Networks," etc.
• Continue the process until enough relevant pages are collected
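A minimal Python sketch of the focused-crawling loop described above. The classifier is reduced to a placeholder is_relevant() function (a hypothetical stand-in for page analysis, characteristic extraction, and relevance analysis against the topic taxonomy), and the seed URLs and topic terms in the usage note are assumptions.

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def is_relevant(text, topic_terms):
    # Placeholder classifier: a page counts as "relevant" if it mentions enough topic terms.
    # A real focused crawler would use a trained text classifier here.
    hits = sum(1 for term in topic_terms if term.lower() in text.lower())
    return hits >= 2

def focused_crawl(seed_urls, topic_terms, max_relevant=50):
    url_queue = deque(seed_urls)
    visited = set(seed_urls)       # the index (HashSet) of already-seen URLs
    relevant_pages = []            # crawling history / page repository
    while url_queue and len(relevant_pages) < max_relevant:
        url = url_queue.popleft()
        try:
            page = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        if not is_relevant(page, topic_terms):
            continue               # irrelevant page: do not follow its links
        relevant_pages.append(url)
        soup = BeautifulSoup(page, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in visited:
                visited.add(link)
                url_queue.append(link)
    return relevant_pages

# Usage (hypothetical seeds and taxonomy terms):
# focused_crawl(["https://example.com/ai"], ["machine learning", "deep learning", "neural networks"])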
Near Duplicate Detection Algorithm
• The Near Duplicate Detection Algorithm is used to find content that is very
similar to another but not exactly the same.
• These duplicates could have slight changes, like rewording or small modifications,
but still convey the same idea.
How it Works:
The algorithm compares the text, images, or data to find out if they are close enough to
be considered duplicates.
Example in Text:
• Imagine two versions of an article:
• Article 1: "The quick brown fox jumped over the lazy dog."
• Article 2: "A quick fox jumped over a lazy dog."
Both sentences are similar but not identical. The near duplicate algorithm will
recognize that these are "close enough" in meaning, even if some words are different.
Example in Images:
Image 1: A picture of a red apple on a white background.
Image 2: A red apple with a small watermark in the corner.
The algorithm might detect that both images are near duplicates, despite the
watermark in Image 2, because the core content is the same (a red apple on a
white background).
Real-Life Applications:
1.Search Engines: If you search for a topic, the search engine might find many
similar pages or articles, some of which may be near duplicates (just rephrased
versions). The algorithm helps in filtering these out to show unique results.
2. Plagiarism Detection: In academic writing, if someone paraphrases or
slightly modifies an existing article, near duplicate detection can identify this
and flag it as plagiarism.
Continue……
• Cosine Similarity
• Approach:
• Cosine similarity measures the cosine of the angle between two vectors in a multi-
dimensional space. It is often used to compare documents represented as term-
frequency vectors.
• Documents that have a high cosine similarity are considered near duplicates.
• Use Case: Text-based applications like document comparison, plagiarism
detection, and content retrieval.
• Algorithm Steps:
• Convert the text into a vector representation (e.g., TF-IDF).
• Calculate the cosine of the angle between the vectors.
• A cosine similarity close to 1 indicates high similarity.
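A minimal sketch of the three steps above using scikit-learn's TF-IDF vectorizer (the library choice is an assumption; any TF-IDF implementation works), applied to the two near-duplicate sentences from the earlier example.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc1 = "The quick brown fox jumped over the lazy dog."
doc2 = "A quick fox jumped over a lazy dog."

# Step 1: convert the texts into TF-IDF vectors
vectors = TfidfVectorizer().fit_transform([doc1, doc2])

# Step 2: cosine of the angle between the two vectors
score = cosine_similarity(vectors[0], vectors[1])[0][0]

# Step 3: a score close to 1 indicates near-duplicate documents
print(f"cosine similarity = {score:.2f}")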
• Shingling (or k-shingles) Approach:
• Shingling is the process of breaking down a document into
overlapping substrings (shingles) of length k. This technique is widely
used in detecting near duplicates by comparing the shingles of
different documents.The similarity between documents can be
measured by comparing the sets of shingles they contain.
• Use Case: Text and document comparison for plagiarism detection
and deduplication.
• Algorithm Steps:
• Break the document into k-length shingles (substrings).
• Represent the document as a set of shingles.
• Compare the sets of shingles using set-based similarity measures
(e.g., Jaccard index).
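A minimal sketch of character-level k-shingling with the Jaccard index for the algorithm steps above; k = 5 is an arbitrary assumption (word-level shingles are also common).

def shingles(text, k=5):
    # Break the document into overlapping k-length character shingles
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    # Jaccard index: size of the intersection divided by size of the union
    return len(a & b) / len(a | b)

s1 = shingles("The quick brown fox jumped over the lazy dog.")
s2 = shingles("A quick fox jumped over a lazy dog.")
print(f"Jaccard similarity = {jaccard(s1, s2):.2f}")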
Fingerprinting
• Approach:
• Fingerprinting algorithms generate a "fingerprint" (a hash or signature) of a
document or webpage. Near duplicates have similar fingerprints.
• Popular fingerprinting techniques include Rabin-Karp hashing and Simhash.
• Use Case: Detecting near duplicates in large-scale web crawling,
deduplication tasks, and for avoiding storing similar files in systems.
• Algorithm Steps:
1. Break the document into features (e.g., words, sentences, or blocks).
2. Compute a hash or fingerprint for the document.
3. Compare fingerprints using similarity measures (e.g., Hamming distance for
binary fingerprints).
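A minimal sketch of a Simhash-style fingerprint for the steps above. It is deliberately simplified: real implementations weight features and use fast, robust hash functions, while this version hashes word features with Python's built-in md5, an assumption made purely for illustration.

import hashlib

def simhash(text, bits=64):
    # Build a 64-bit fingerprint: each word feature votes on every bit position.
    counts = [0] * bits
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    # Number of differing bits between two fingerprints
    return bin(a ^ b).count("1")

f1 = simhash("The quick brown fox jumped over the lazy dog.")
f2 = simhash("A quick fox jumped over a lazy dog.")
print("Hamming distance =", hamming_distance(f1, f2))   # small distance -> near duplicates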
Handling Dynamic Web Content During Crawling
• Dynamic URLs:
• URLs that change based on variable parameters (e.g., user input, database
queries).
• Often contain characters like ?, &, %, +, =, $, .cgi.
• Difference from Static URLs:
• Static URLs remain unchanged unless modified in the HTML.
• Dynamic URLs fetch data from a database and change based on queries.
• Common Dynamic Parameters:
• Session ID (sessionid) – Tracks user activity during a session.
• Source Tracker Parameter – Used for logging traffic sources but does not modify
page content.
• Challenges in Crawling Dynamic URLs:
• Some parameters do not change content but create different URLs.
• Large amounts of resources required to process and compare pages.
• Risk of accessing duplicate content due to session IDs and tracking parameters.
• Techniques for Handling Dynamic URLs in Crawling:
• Intelligent Analysis: Compare webpages to determine uniqueness.
• Strict URL Rules: Restrict crawling of "similar-looking" URLs.
• Character Limit: Avoid URLs exceeding a certain length to bypass session IDs.
• Webmaster Modifications: Rewrite dynamic URLs to a static format.
• Optimize website structures for better crawling.
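A minimal sketch of one of the techniques above: stripping session IDs and tracking parameters from dynamic URLs before enqueueing them, so the crawler does not fetch the same content twice. The list of parameter names to strip and the example URL are assumptions for illustration.

from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse

# Parameters that change the URL but not the page content (assumed list)
IGNORED_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign"}

def normalize_url(url):
    parts = urlparse(url)
    # Keep only the query parameters that actually affect the page content
    query = [(k, v) for k, v in parse_qsl(parts.query) if k.lower() not in IGNORED_PARAMS]
    return urlunparse(parts._replace(query=urlencode(sorted(query))))

print(normalize_url("http://shop.example.com/item?id=42&sessionid=abc123&utm_source=ad"))
# -> http://shop.example.com/item?id=42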
• JavaScript-Generated Content Challenge:
• Many modern websites use JavaScript-heavy frameworks like Angular, React.js,
Vue.js.
• These frameworks load content asynchronously after the page is loaded, making
traditional crawling difficult.
• Need for Efficient Crawling Methods:
• Reduce unnecessary duplicate accesses.
• Enable better indexing of dynamic content-heavy websites.
• Improve web crawling efficiency while maintaining content coverage.
Cross-Lingual and Multilingual Retrieval
• Cross-Lingual Information Retrieval (CLIR) is a subfield of Information
Retrieval (IR) where a user submits a query in one language, but the system
retrieves relevant documents written in another language.
• This technique is useful when:
• Users want to search in their native language but retrieve documents from a
global database.
• A language barrier exists between search queries and available content.
• Multilingual content needs to be accessed efficiently without requiring
translation beforehand.
• Key Components of CLIR
• Query Translation: Translating user queries into the target language
before searching.
• Document Translation: Translating retrieved documents into the user’s
language.
Example of CLIR in Action:
A Hindi-speaking user searches for "सौर ऊर्जा के लाभ" (Benefits of Solar
Energy).
The CLIR system translates it into English as "Benefits of Solar Energy".
The system retrieves English documents related to solar energy benefits.
If needed, the retrieved documents are translated back into Hindi for the
user.
Query Translation in Cross-Language
Information Retrieval (CLIR)
• Definition:
Query Translation is a key component of CLIR, where user queries are
translated from one language (source language) into the target language
before performing a search in a multilingual database.
• Process:
1. Dictionary-Based Translation: Uses bilingual dictionaries or glossaries to
translate keywords.
2. Statistical Machine Translation (SMT): Translates queries using probabilistic
models based on large bilingual corpora.
3. Neural Machine Translation (NMT): Uses deep learning models to provide
more accurate and context-aware translations.
4. Transliteration: Converts words phonetically when an exact translation is
unavailable (e.g., names, technical terms).
5. Disambiguation: Resolves multiple meanings of words using context to
ensure accurate translation.
Challenges:
• Ambiguity in translations.
• Loss of meaning due to literal translations.
• Handling domain-specific terminology.
Multilingual Information Retrieval
• Multilingual Information Retrieval (MLIR) is a process where users can
search for information in multiple languages and retrieve results from
documents written in those different languages.
• MLIR supports multiple languages simultaneously without requiring
translation.
Example of MLIR :
A user searches for "climate change effects" in English.
The system retrieves documents in English, Spanish, French, and German that
contain relevant content.
The user can then read the documents in their original languages or use a
translation option if needed.
Techniques for Cross-Lingual Retrieval
• Cross-Lingual Retrieval (CLR) techniques help retrieve information across
different languages.
1. Dictionary-Based Approach
2. Parallel Corpora-Based Approach
3. Comparable Corpora-Based Approach
4. Machine Translator-Based Approach
1. Dictionary-Based Approach
• Uses a bilingual dictionary to translate words from one language to another.
• Example: If you search for "computer" in English, the system looks up a
dictionary and translates it to "ordenador" (Spanish) or "ordinateur"
(French) to find relevant documents.
• Limitations:
1. Can struggle with phrases or context-specific meanings.
• Not every form of a word used in the query is found in the dictionary.
• Problems sometimes occur when translating compound words.
2. Lexical Ambiguity :
Lexical ambiguity occurs when a word has multiple meanings
Example:
• "Bank"
• I deposited money in the bank (financial institution).
• He sat on the bank of the river (riverbank).
These are called homonymous words:
"Homo" means "same,"
"nym" means "name."
So, homonyms have the same "name".
Polysemous Words :
Definition: Words that have multiple related meanings.
"Poly" means "many," and "sem" relates to "meaning.
So, polysemous words have "many meanings" that are connected in some way.
• Examples:
"Head"
• Leader → She is the head of the department.
• Body part → He shook his head in disagreement.
• Top part of something → The head of the bed is against the wall.
"Light"
• Not heavy → This bag is very light to carry.
• Brightness → Turn on the light in the room.
• Not serious → He made a light joke about the situation.
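A minimal sketch of the dictionary-based query translation approach described above. The bilingual dictionary here is a tiny hypothetical English-to-Spanish lookup table; ambiguous terms expand to all candidate translations, which is one simple way of handling the lexical ambiguity just discussed.

# Tiny hypothetical English -> Spanish bilingual dictionary
BILINGUAL_DICT = {
    "computer": ["ordenador", "computadora"],
    "artificial": ["artificial"],
    "intelligence": ["inteligencia"],
}

def translate_query(query):
    # Translate each query term; ambiguous terms expand to all candidate translations,
    # and out-of-vocabulary terms are kept as-is (a common fallback for names).
    translated = []
    for term in query.lower().split():
        translated.extend(BILINGUAL_DICT.get(term, [term]))
    return " ".join(translated)

print(translate_query("artificial intelligence computer"))
# -> "artificial inteligencia ordenador computadora"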
Parallel Corpora-Based Approach
• Uses large collections of documents that exist in multiple languages with
direct translations.
• Example: Websites like the United Nations or European Parliament have
the same content in different languages.
• The system learns translations from these texts to improve search accuracy.
• For example, if there is a resolution in English, there is an identical version
in French, Spanish, and other official UN languages.
• The system (like Google Translate) would use these parallel texts to learn
how certain phrases and terms are translated directly in different
languages. This helps ensure accurate translation, especially in formal
contexts.
Comparable Corpora-Based Approach :
Uses similar (but not directly translated) texts in different languages
to find matching terms.
Example: News articles from CNN (English) and Le Monde
(French) may report the same event but with different wording.
These two publications might report the same event but use different
wording, sentence structure, and vocabulary, as they are written in
different languages. While the content is similar, it is not a word-for-
word translation.
Machine Translator-Based Approach
• Uses tools like Google Translate or DeepL to translate search queries and
documents.
• Example: You search for "Artificial Intelligence" in English, the system
translates it to "Inteligencia Artificial" in Spanish and fetches Spanish
documents.
• It uses different strategies depending on how the translation is handled.
1. Word-for-Word Approach : Translates each word individually without
considering the sentence structure or meaning.
2. Syntactic Transfer : Focuses on translating sentence structures, adjusting
word order and grammar rules from the source language to the target
language.
3. Semantic Approach:
Explanation: Looks at the meaning of the entire sentence rather than just the
words, aiming for more accurate and natural translations.
Example: English: "He is on top of the world." (Meaning "He is very happy.")
Machine Translation for IR
• Machine Translation (MT) is a branch of computational linguistics that
automatically translates text from one language to another using software.
• In the context of Information Retrieval (IR), MT is used to translate
documents or search queries, making it easier to retrieve relevant
information across different languages.
Approaches of Machine Translation (MT) for IR :
There are three major approaches to MT:
1.Rule-Based Machine Translation (RBMT)
2.Corpus-Based Machine Translation (CBMT)
3.Statistical Machine Translation (SMT)
1. Rule-Based Machine Translation (RBMT) :
RBMT is a method of translating text from one language to another using a set of
rules and dictionaries created by human experts.
It’s one of the oldest ways to do machine translation.
While it can be very accurate, it takes a lot of time and effort to set up because
experts need to define all the rules.
Types of RBMT:
Direct Translation MT:
This method translates words one by one with only a little bit of grammar adjustment.
Example:
French: "Je mange une pomme"
English: "I eat an apple"
Here, the system looks up each word in a bilingual dictionary and makes a small
change to the grammar.
Transfer-Based MT:
• This method first changes the text into a different form (an intermediate
form) and then adapts it to the target language.
• Example:
• Source Language (Spanish): "La casa blanca"
• Intermediate Form: The system might convert this into a representation like
• Concept: [House, Color: White]
• Target Language (English): The system then translates the intermediate form
into "The white house," applying English grammatical rules.
Interlingua MT:
• This method translates the text into a common, language-independent
format before converting it to the target language.
Example: If translating from Hindi to French, the system first converts Hindi to
an abstract representation and then translates that into French.
Dictionary-Based MT: Uses a bilingual dictionary to replace words but often
results in inaccurate translations.
• Example: Translating "bank" (financial institution) might be mistakenly
translated as "river bank."
2. Corpus-Based Machine Translation (CBMT) :
CBMT relies on previously translated text (parallel corpora) to learn translation
patterns.
It automatically analyzes and extracts knowledge from large translation
examples.
Example: Google Translate learns from millions of translated texts. If many
documents show "Bonjour" → "Hello," the system identifies it as a correct
translation.
3. Statistical Machine Translation (SMT):
SMT uses probability-based models to determine the most likely translation
based on training data.
Instead of following rules, it predicts translations statistically.
Example: If "good morning" is frequently translated as "buenos días" in
training data, the system will prioritize that translation.
Hybrid Machine Translation Approach:
To improve translation accuracy, modern MT systems combine rule-based,
corpus-based, and statistical methods.
Multilingual Document Representations and
query translation
• In Multilingual Information Retrieval (MLIR), documents in different languages need
to be indexed and retrieved effectively. A key challenge is handling homographs (words
that look the same in different languages but have different meanings, e.g., "but" in
English and "but" in French).
• Solution: Using Language Tags for Each Indexing Term
• To address this, each indexing term is tagged with a language identifier. When a user
submits a query and specifies the languages of interest, the query is translated into all
relevant languages. The translated queries, along with the original query, form a large
multilingual query expression for retrieval.
• This approach has two key advantages:
1.More Comparable Index Term Weights – Since indexing is done uniformly across
languages, term weights remain balanced.
2.Eliminates the Need for Merging Translations – The retrieval results naturally contain
documents from different languages without requiring a complex merging step.
Steps for Multilingual Document Representation :
• 1. Language Identification
• Determines the language of each document.
• Uses statistical language models to classify documents accurately.
• Ensures that documents are processed with language-specific rules.
• 2. Language-Dependent Preprocessing
• Stopword Removal – Eliminates common words in each language separately.
• Stemming/Lemmatization – Reduces words to their root forms using
language-specific algorithms.
• Language Tagging – Associates each processed word with its respective
language label.
• Results in a structured document collection where words are clearly
distinguished by language.
• 3. Indexing of the Mixed Document Collection
• Performed similarly to monolingual indexing.
• Indexing terms from different languages are weighted using the same
schema (e.g., TF-IDF).
• A single index file is created for all languages to facilitate efficient retrieval.
4. Query Translation
• The query (originally in English) is translated into multiple languages (e.g.,
French, Italian, German, Spanish).
• Translated words undergo stemming and language tagging, similar to
document indexing.
• A single multilingual query is formed, including:
• The original query in the source language.
• Translations in all specified target languages.
5. Retrieval :
Retrieval is performed in exactly the same way as in monolingual retrieval.
The output is a list of documents in different languages.
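A minimal sketch of the language-tagging idea above: each index term is stored together with a language identifier, and the multilingual query combines the original query with its translations. The terms, tags, and translations are hypothetical stemmed forms used only for illustration.

# Each indexing term is tagged with a language identifier, e.g. "solar#en"
def tag_terms(terms, lang):
    return [f"{term}#{lang}" for term in terms]

# Indexed document terms (assumed to be already stopword-filtered and stemmed)
doc_index = {
    "doc1": tag_terms(["solar", "energy", "benefit"], "en"),
    "doc2": tag_terms(["energie", "solaire", "avantage"], "fr"),
}

# The original English query plus its French translation form one multilingual query
query = tag_terms(["solar", "energy"], "en") + tag_terms(["energie", "solaire"], "fr")

# Retrieval is done exactly as in the monolingual case: simple term overlap here
for doc, terms in doc_index.items():
    score = len(set(query) & set(terms))
    print(doc, score)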
Evaluation Techniques For IR Systems
• Evaluating an Information Retrieval (IR) system means measuring how well
it finds and ranks documents that satisfy a user's query.
• The main goal is to ensure that users get the most relevant results quickly
and efficiently.
Types of Evaluation Measures
IR evaluation measures are divided into two categories:
1.Online Measures (based on real user interactions)
2.Offline Measures (based on pre-judged relevance of results)
1. Online Measures (Based on User Behavior):
These metrics analyze real user interactions with the search engine to
measure effectiveness.
a) Session Abandonment Rate
Measures the percentage of searches where the user does not click on any
result.
High abandonment may indicate poor relevance of results.
Example:
If 100 users perform searches and 30 do not click on any results, the
abandonment rate is 30%.
b) Click-Through Rate (CTR)
• The percentage of users who click on a search result out of the total users who viewed
it.
• A higher CTR means more users find the search results useful.
Formula:
CTR = (Number of Clicks / Number of Impressions) × 100

Example: Online Ads (Google Ads or Facebook Ads CTR)

• You run a Facebook ad for a new clothing collection.
• The ad is shown 20,000 times (impressions).
• 1,000 people click on the ad to visit your website.
CTR Calculation:
CTR = (1,000 / 20,000) × 100 = 5%
A higher CTR means your ad is attractive and relevant to users.


c) Session Success Rate :
Measures how often a search session results in a successful interaction (e.g.,
the user spends time on the page, copies text, or clicks another link).
High success rate means users are finding useful information.
If 200 users search for something and 150 find relevant results and interact
with them, the session success rate is (150/200) × 100 = 75%.
d) Zero Result Rate (ZRR):
Zero Result Rate (ZRR) refers to the percentage of searches that return no
results for a user query.
A high ZRR indicates that users are searching for terms that your system or
database cannot provide relevant results for.
2. Offline Measures (Based on Relevance Judgment)
• These metrics evaluate an IR system based on human-judged relevance
scores.
a) Precision
• Precision measures how many of the retrieved documents are actually
relevant.
• Higher precision means fewer irrelevant results are shown to users.
b. Recall :
• Definition:
• Recall measures how many relevant documents are successfully retrieved by
the search engine out of all relevant documents available in the system.
• Formula:
Recall = (Number of relevant documents retrieved) / (Total number of relevant documents in the system)
C. Fall-out :
Definition:
• Fall-out measures how many non-relevant documents are retrieved out of all
non-relevant documents in the system.
Formula:
Fall-out = (Number of non-relevant documents retrieved) / (Total number of non-relevant documents in the system)
D. F-score (F-measure) :
The F-score combines precision and recall into a single measure using the
harmonic mean. It balances between retrieving more relevant documents (high
recall) and avoiding irrelevant ones (high precision).
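A minimal sketch computing the offline measures above for a single query, where the F-score is the harmonic mean F = 2 × Precision × Recall / (Precision + Recall). The retrieved and relevant document IDs and the collection size are hypothetical.

def evaluate(retrieved, relevant, collection_size):
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = retrieved & relevant
    precision = len(true_positives) / len(retrieved)
    recall = len(true_positives) / len(relevant)
    # Fall-out: non-relevant documents retrieved out of all non-relevant documents
    non_relevant_total = collection_size - len(relevant)
    fallout = len(retrieved - relevant) / non_relevant_total
    # F-score: harmonic mean of precision and recall
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, fallout, f_score

# Hypothetical example: 100 documents in the collection
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = ["d1", "d3", "d5", "d7"]
print(evaluate(retrieved, relevant, collection_size=100))
# precision = 3/5, recall = 3/4, fall-out = 2/96, F-score ≈ 0.67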
E. Average Precision:
Average Precision is used to evaluate how well a retrieval system ranks
relevant documents.
In simpler terms, it tells you how precise the system is at finding relevant
results, taking the order of the results into account.
• Formula: The formula for Average Precision (AveP) is:

AveP = Σ (from k = 1 to n) P(k) × Δr(k)

Where:
• n is the total number of retrieved documents (for example, 10 documents).
• P(k) is the precision at rank k.
• Δr(k) is the change in recall between rank k-1 and k.
Example:
You perform a search, and the system returns 5 documents. Out of the 5, 3 are
relevant to your query, and 2 are not.
Document 1: Relevant
Document 2: Not Relevant
Document 3: Relevant
Document 4: Not Relevant
Document 5: Relevant
Precision at rank 1 (P(1)):
At rank 1, only 1 relevant document (Document 1) has been retrieved out of 1
result. So, the precision is P(1) = 1/1 = 1.0.
• Precision at rank 2 (P(2)):
Document 1: Relevant
Document 2: Not Relevant
Document 3: Relevant
Document 4: Not Relevant
Document 5: Relevant
At rank 2, 1 relevant document (Document 1) and 1 irrelevant document
(Document 2) have been retrieved. So, the precision is P(2) = 1/2 = 0.5.
• Now, we calculate the Average Precision using the formula. Relevant documents
appear at ranks 1, 3, and 5, so P(1) = 1/1 = 1.0, P(3) = 2/3 ≈ 0.67, P(5) = 3/5 = 0.6,
and Δr(k) = 1/3 at each of those ranks (0 elsewhere):
AveP = (1.0 + 0.67 + 0.6) / 3 ≈ 0.76
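A minimal sketch that computes Average Precision for the 5-document example above, with relevance given as a list of 1/0 flags in rank order.

def average_precision(relevance):
    # relevance: list of 1/0 flags for each retrieved document, in rank order
    total_relevant = sum(relevance)
    score, seen_relevant = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            seen_relevant += 1
            precision_at_k = seen_relevant / k               # P(k)
            score += precision_at_k * (1 / total_relevant)   # P(k) * Δr(k)
    return score

# Documents 1, 3 and 5 are relevant; documents 2 and 4 are not
print(round(average_precision([1, 0, 1, 0, 1]), 2))   # -> 0.76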
Discounted Cumulative Gain (DCG) :
DCG is a way to measure how well search results are ranked by their
relevance.
In other words, it evaluates how useful the order of results is, based on the
relevance of the documents shown.
Relevance: How important or useful a result is to the searcher (usually rated 0
to 3, where 0 means irrelevant and 3 means highly relevant).
Rank: The position of the result in the list (1st, 2nd, 3rd, etc.).
DCG(p) = Σ (from i = 1 to p) rel(i) / log2(i + 1)
Where:
• p is the number of results you want to consider (e.g., top 5 results).
• rel(i) is the relevance of the document at rank i.
• i is the rank position (e.g., 1st, 2nd, 3rd).
Example to Understand DCG:
Let's say you perform a search and the search engine returns 5 results
with the following relevance scores:
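The relevance scores from the slide's table are not reproduced here, so the minimal sketch below computes DCG for a hypothetical set of 5 scores on the 0-3 scale described above.

import math

def dcg(relevances, p=None):
    # relevances: relevance score of the document at each rank (rank 1 first)
    p = p or len(relevances)
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:p], start=1))

# Hypothetical relevance scores for the top 5 results (0 = irrelevant, 3 = highly relevant)
scores = [3, 2, 3, 0, 1]
print(round(dcg(scores), 2))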
User Based Evaluation
• Definition:
User-Based Evaluation in Information Retrieval (IR) assesses how well a
search system meets user needs by involving actual users in the
evaluation process. It focuses on relevance, usability, efficiency, and
satisfaction rather than just system performance metrics.
• Key Aspects of User-Based Evaluation
• Measures real user experiences instead of only algorithmic precision.
• Helps improve search algorithms, interface design, and overall
effectiveness of IR systems.
• Evaluates relevance, speed, reliability, and user satisfaction.
• Methods of User-Based Evaluation
• 1. User Studies
• Focuses on user behavior and search experiences during specific tasks.
• Analyzes aspects like query formulation, result examination, and relevance
judgment.
• Examines how user behavior changes based on different search conditions.
• Helps in understanding how users interact with search systems.
• Also known as user research or user testing.
• 2. Surveys
• Directly collects user feedback through questionnaires and interviews.
• Evaluates user preferences, behavior, values, and attributes.
• Can be qualitative (open-ended responses) or quantitative (rating scales, statistics).
• One of the most commonly used methods in user evaluation.
TEST COLLECTIONS AND BENCHMARKING :
• A test collection is a structured dataset used to evaluate the performance of an IR
system. It typically consists of:
1.Document Collection – A fixed set of documents (e.g., digital library texts, web
pages, images, videos).
2.User Queries (Topics) – A set of queries representing real-world information needs.
3.Relevance Judgments – Labels indicating which documents are relevant for each
query.
Importance of Test Collections :
Simulate real-world retrieval problems (e.g., ad hoc retrieval, question answering,
filtering, summarization).
Ensure consistent evaluation by providing a standardized dataset.
Allow comparative analysis across different IR models and techniques.
Support reproducible research by maintaining a static collection.
Benchmarking in IR :
Benchmarking involves evaluating IR systems using standard test collections
and performance metrics. It enables:
• Comparison of different retrieval strategies.
• Consistency in evaluation using shared datasets and measures.
• Identification of strengths and weaknesses in search algorithms.
Challenges in Test Collection-Based Evaluation
• Creating relevance judgments is time-consuming and subjective.
• User-oriented evaluations, though beneficial, are expensive and difficult to
replicate.
• Dynamic web content makes maintaining static collections challenging.
Online Evaluation Methods in Information
Retrieval (IR)
• Online evaluation methods assess the performance of an IR system in a real-
world setting, where actual users interact with the system. Two common
approaches are Absolute Quality (A/B Testing) and Relative Evaluation
(Interleaving).
• 1. Absolute Quality: A/B Testing
• A/B testing is a controlled experiment where two or more system versions
are compared by exposing users to different versions and analyzing their
behavior.
• Process:
• Users are randomly split into two groups:
• Group A uses the current (baseline) system.
• Group B uses the new (experimental) system.
• Performance metrics (e.g., click-through rate, dwell time, conversion rate) are
tracked.
• The system that performs better according to predefined metrics is selected.
• Advantages:
• ✅ Provides direct, real-world feedback from users.
✅ Statistically sound results when large user traffic is available.
✅ Useful for testing new features or ranking models.
2. Relative Evaluation: Interleaving
• Interleaving is a method for directly comparing two ranking algorithms by
merging their results and measuring user preference.
• Process:
• Two ranking algorithms (A and B) generate separate result lists.
• The lists are interleaved, mixing results from both systems.
• Users interact with the interleaved list (click on results).
• Clicks are attributed back to the system that originally ranked the document.
• The system that gets more clicks is preferred.
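A minimal sketch of interleaving two ranked lists in a simplified team-draft style: a coin toss decides which system places its next result, each placed document remembers the system that contributed it, and clicks are attributed back to that system. The rankings, clicks, and the team-draft variant used here are assumptions for illustration.

import random

def team_draft_interleave(ranking_a, ranking_b):
    # Merge two rankings into one list; remember which system contributed each document
    interleaved, team_of = [], {}
    a, b = list(ranking_a), list(ranking_b)
    while a or b:
        # Coin toss decides which system gets to place its next result first
        order = [("A", a), ("B", b)] if random.random() < 0.5 else [("B", b), ("A", a)]
        for team, ranking in order:
            # Skip documents the other system has already placed
            while ranking and ranking[0] in team_of:
                ranking.pop(0)
            if ranking:
                doc = ranking.pop(0)
                team_of[doc] = team
                interleaved.append(doc)
    return interleaved, team_of

def score_clicks(clicked_docs, team_of):
    # Attribute each click back to the system that originally ranked the document
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        wins[team_of[doc]] += 1
    return wins

ranking_a = ["d1", "d2", "d3", "d4"]
ranking_b = ["d2", "d5", "d1", "d6"]
interleaved, team_of = team_draft_interleave(ranking_a, ranking_b)
print(interleaved)                          # merged result list shown to the user
print(score_clicks(["d2", "d5"], team_of))  # clicks attributed to systems A and B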
