Unit-5 Adt
IR CONCEPTS
Information Retrieval (IR) can be defined as the software process that deals with the
organization, storage, retrieval, and evaluation of information from document repositories,
particularly textual information. Information retrieval is the activity of obtaining material,
usually documents of an unstructured nature (i.e., text), that satisfies an information need
from within large collections stored on computers. A typical example of information
retrieval is a user entering a query into a search system.
What is an IR Model?
An Information Retrieval (IR) model selects and ranks the documents that are required
by the user, who expresses this requirement in the form of a query. The documents and the queries
are represented in a similar manner, so that document selection and ranking can be
formalized by a matching function that returns a retrieval status value (RSV) for each
document in the collection. Many Information Retrieval systems represent document
contents by a set of descriptors, called terms, belonging to a vocabulary V. An IR model
determines the query-document matching function according to one of four main approaches,
which are described under Retrieval Models below.
Components of Information Retrieval/ IR Model
Acquisition: In this step, documents and other objects are selected from
various web resources consisting of text-based documents. The
required data is collected by web crawlers and stored in a database.
Representation: This consists of indexing, which covers free-text terms and controlled
vocabulary, using both manual and automatic techniques. Example: abstracting
involves summarizing, and a bibliographic description contains the author, title,
source, date, and metadata.
File Organization: There are two basic file organization methods: sequential, which
stores the collection document by document, and inverted, which stores it term by term
with a list of records under each term; a combination of both can also be used. A small
sketch of an inverted file appears after this list.
Query: An IR process starts when a user enters a query into the system. Queries
are formal statements of information needs, for example, search strings in web
search engines. In information retrieval, a query does not uniquely identify a
single object in the collection. Instead, several objects may match the query,
perhaps with different degrees of relevancy.
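To make the inverted file organization above concrete, the following Python sketch builds a small term-to-document index. The toy documents and their IDs are illustrative only, not part of any particular IR system.

from collections import defaultdict

# Toy collection; document IDs and contents are illustrative only.
docs = {
    1: "information retrieval deals with storage and retrieval of documents",
    2: "web search engines retrieve documents that match a user query",
    3: "an inverted index maps each term to the documents containing it",
}

# Inverted index: term -> sorted list of document IDs (a postings list).
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

postings = {term: sorted(ids) for term, ids in inverted_index.items()}
print(postings["documents"])  # [1, 2, 3]
print(postings["query"])      # [2]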
RETRIEVAL MODELS
The classical IR models are the simplest and easiest to implement, and they are based on
mathematical foundations that are easily recognized and understood. Boolean, vector space,
and probabilistic are the three classical (statistical) IR models; the semantic model forms
a fourth, non-statistical family.
Types of retrieval model:
● Classical IR Model - It is the simplest and easiest to implement IR model (Boolean, vector space, probabilistic).
● Non-Classical IR Model - It is based on principles other than those of the classical models.
● Alternative IR Model
Concepts used in building and querying these models include:
● Inverted Index.
● Stop Word Elimination.
● Stemming.
● Term Weighting.
● Term Frequency (tfij)
1. Boolean Model
In this model, documents are represented as a set of terms. Queries are formulated as a
combination of terms using the standard Boolean logic set-theoretic operators such as AND,
OR and NOT. Retrieval and relevance are considered as binary concepts in this model, so the
retrieved elements are an "exact match" retrieval of relevant documents.
Boolean retrieval models lack sophisticated ranking algorithms and are among the
earliest and simplest information retrieval models. These models make it easy to associate
metadata information and write queries that match the contents of the documents as well as
other properties of documents, such as date of creation, author, and type of document.
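The following minimal Python sketch illustrates exact-match Boolean retrieval; the document collection and the query are made up for illustration, and real systems would evaluate such queries over an inverted index rather than raw term sets.

# Boolean retrieval sketch: documents as term sets, queries as set operations.
docs = {
    1: {"information", "retrieval", "boolean", "model"},
    2: {"vector", "space", "model", "ranking"},
    3: {"probabilistic", "model", "relevance"},
}
all_ids = set(docs)

def term_set(term):
    """Return the set of document IDs whose term set contains the term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

# Query: model AND (boolean OR vector) AND NOT probabilistic
result = (term_set("model")
          & (term_set("boolean") | term_set("vector"))
          & (all_ids - term_set("probabilistic")))
print(sorted(result))  # [1, 2] - an exact match, with no ranking among the hits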
2. Vector Space Model
The vector space model provides a framework in which term weighting, ranking of retrieved
documents, and relevance feedback are possible. Documents are represented as features and
weights of term features in an n dimensional vector space of terms. Features are a subset of the
terms in a set of documents that are deemed most relevant to an IR search for this particular set
of documents.
The process of selecting these important terms (features) and their properties as a sparse
(limited) list out of the very large number of available terms (the vocabulary can contain
hundreds of thousands of terms) is independent of the model specification. The query is also
specified as a terms vector (vector of features), and this is compared to the document vectors
for similarity/relevance assessment.
In the vector model, the document term weight wij (for term i in document j) is represented
based on some variation of the TF (term frequency) or TF-IDF (term frequency- inverse
document frequency) scheme (as we will describe below). TF-IDF is a statistical weight
measure that is used to evaluate the importance of a document word in a collection of
documents. The following formula is typically used:
wij = tfij × idfi = tfij × log(N / dfi)
where tfij is the number of occurrences of term i in document j, dfi is the number of
documents in the collection that contain term i, and N is the total number of documents
in the collection.
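A minimal Python sketch of the weighting and ranking described above: documents and the query are turned into TF-IDF vectors and compared by cosine similarity. The toy texts are illustrative, and practical systems use refined variants of these weights (for example, length normalization and smoothing).

import math
from collections import Counter

# Toy collection; contents are illustrative only.
docs = {
    1: "information retrieval ranks documents for a user query",
    2: "the vector space model weights terms with tf idf",
    3: "tf idf weights reflect term frequency and rarity",
}
tokenized = {d: text.lower().split() for d, text in docs.items()}
N = len(tokenized)

# Document frequency dfi: number of documents containing term i.
df = Counter()
for tokens in tokenized.values():
    for term in set(tokens):
        df[term] += 1

def tfidf_vector(tokens):
    """wij = tfij * log(N / dfi); terms unseen in the collection are ignored."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf if t in df}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

doc_vectors = {d: tfidf_vector(toks) for d, toks in tokenized.items()}
query_vector = tfidf_vector("tf idf weights".split())
ranking = sorted(doc_vectors, key=lambda d: cosine(query_vector, doc_vectors[d]), reverse=True)
print(ranking)  # document IDs ordered by decreasing similarity to the query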
3. Probabilistic Model
In the probabilistic framework, the IR system has to decide whether the documents belong
to the relevant set or the nonrelevant set for a query. To make this decision, it is assumed that
a predefined relevant set and nonrelevant set exist for the query, and the task is to calculate the
probability that the document belongs to the relevant set and compare that with the probability
that the document belongs to the nonrelevant set.
Given the document representation D of a document, estimating the relevance R and
nonrelevance NR of that document involves computation of conditional probability P(R|D) and
P(NR|D). These conditional probabilities can be calculated using Bayes' rule:
P(R|D) = P(D|R) × P(R)/P(D)
P(NR|D) = P(D|NR) × P(NR)/P(D)
A document D is classified as relevant if P(R|D) > P(NR|D). Discarding the constant P(D),
this is equivalent to saying that a document is relevant if:
P(D|R) × P(R) > P(D|NR) × P(NR)
The likelihood ratio P(D|R)/P(D|NR) is used as a score to determine the likelihood of the
document with representation D belonging to the relevant set.
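The short numeric sketch below walks through this decision rule with made-up probability values; they are purely illustrative, not estimates from a real collection.

# Illustrative (made-up) probabilities for one document representation D.
p_r = 0.3           # prior P(R): a document is relevant to the query
p_nr = 0.7          # prior P(NR): a document is nonrelevant
p_d_given_r = 0.08  # likelihood P(D|R)
p_d_given_nr = 0.02 # likelihood P(D|NR)

# Bayes' rule numerators; the common denominator P(D) cancels out.
score_relevant = p_d_given_r * p_r         # proportional to P(R|D)  -> 0.024
score_nonrelevant = p_d_given_nr * p_nr    # proportional to P(NR|D) -> 0.014

print(score_relevant > score_nonrelevant)  # True: classify D as relevant
print(p_d_given_r / p_d_given_nr)          # 4.0: likelihood ratio used as a ranking score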
4. Semantic Model
Semantic approaches include different levels of analysis, such as morphological, syntactic,
and semantic analysis, to retrieve documents more effectively. In morphological analysis,
roots and affixes are analyzed to determine the parts of speech (nouns, verbs, adjectives, and
so on) of the words. The development of a sophisticated semantic system requires complex
knowledge bases of semantic information as well as retrieval heuristics. These systems often
require techniques from artificial intelligence and expert systems. Knowledge bases like
Cyc and WordNet have been developed for use in knowledge-based IR systems based on
semantic models.
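As a small illustration of using such a knowledge base, the sketch below looks up WordNet synonyms for a query term with the NLTK library; it assumes NLTK and its WordNet corpus are installed (e.g. via nltk.download('wordnet')), and the expansion strategy shown is deliberately simplistic.

# Sketch of WordNet-based synonym lookup for simple query expansion.
from nltk.corpus import wordnet as wn

def expand_term(term):
    """Collect the synonyms of a term from all of its WordNet synsets."""
    synonyms = set()
    for synset in wn.synsets(term):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace("_", " "))
    return synonyms

print(expand_term("car"))  # includes, e.g., 'auto', 'automobile', 'machine'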
TYPES OF QUERIES IN IR SYSTEMS:
During the process of indexing, many keywords are associated with each document in the set;
these include words, phrases, date created, author names, and type of document. They
are used by an IR system to build an inverted index which is then consulted during the
search. The queries formulated by users are compared to the set of index keywords. Most
IR systems also allow the use of Boolean and other operators to build a complex query.
The query language with these operators enriches the expressiveness of a user’s
information needs.
1. Keyword Queries:
● Simplest and most common queries.
● The user enters just keyword combinations to retrieve documents.
● These keywords are connected by the logical AND operator.
● All retrieval models provide support for keyword queries.
2. Boolean Queries:
● Some IR systems allow using the +, -, AND, OR, NOT, and ( ) Boolean operators in
combination with keyword formulations.
● No ranking is involved because a document either satisfies such a query or does not
satisfy it.
● A document is retrieved for a Boolean query if the query evaluates to true for that
document, i.e., the document is an exact match.
3. Phrase Queries:
● When documents are represented using an inverted keyword index for searching,
the relative order of terms in the document is lost.
● To perform exact phrase retrieval, these phrases are encoded in an inverted index or
implemented differently.
● This query consists of a sequence of words that make up a phrase. It is generally
enclosed within double quotes. A sketch of phrase matching over a positional index
follows this list.
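The following sketch shows one way to support phrase queries with a positional inverted index; the documents are toy examples, and real systems store positions far more compactly.

from collections import defaultdict

# Toy documents; a positional index records where each term occurs.
docs = {
    1: "information retrieval models rank documents",
    2: "models for information retrieval and web search",
}
index = defaultdict(lambda: defaultdict(list))  # term -> {doc_id: [positions]}
for doc_id, text in docs.items():
    for pos, term in enumerate(text.lower().split()):
        index[term][doc_id].append(pos)

def phrase_match(phrase):
    """Return IDs of documents containing the phrase's words adjacently and in order."""
    terms = phrase.lower().split()
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in candidates:
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(terms))
               for p in index[terms[0]][doc_id]):
            hits.append(doc_id)
    return sorted(hits)

print(phrase_match("information retrieval"))  # [1, 2]
print(phrase_match("retrieval models"))       # [1] - not adjacent in document 2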
4. Proximity Queries:
● Proximity refers to search that accounts for how close within a record multiple items
should be to each other.
● The most commonly used proximity search option is a phrase search that requires the
terms to be in exact order.
● Other proximity operators can specify how close terms should be to each other.
Some will specify the order of search terms.
● Search engines use various operator names such as NEAR, ADJ (adjacent), or
AFTER.
● However, providing support for complex proximity operators becomes expensive
as it requires time-consuming pre-processing of documents, and so it is suitable for
smaller document collections rather than for the web. A sketch of a simple proximity
check follows this list.
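A minimal sketch of a NEAR/k style proximity check follows; the operator semantics (unordered, within k words) and the example sentence are assumptions for illustration, since each search engine defines its proximity operators differently.

def near(text, term_a, term_b, k):
    """True if term_a and term_b occur within k words of each other, in any order."""
    tokens = text.lower().split()
    positions_a = [i for i, t in enumerate(tokens) if t == term_a]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(pa - pb) <= k for pa in positions_a for pb in positions_b)

doc = "search engines rank pages by estimated relevance to the query"
print(near(doc, "rank", "relevance", 4))  # True: four words apart
print(near(doc, "rank", "query", 4))      # False: seven words apart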
5. Wildcard Queries:
● It supports regular expressions and pattern matching-based searching in text.
Retrieval models do not directly support this query type.
● In IR systems, certain kinds of wildcard search support may be implemented.
● Example: words with a given prefix and arbitrary trailing characters, such as data*
matching data, database, and datasets; a sketch of such matching follows this list.
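The sketch below handles the trailing-wildcard case by translating the pattern into a regular expression with Python's standard re module and matching it against an illustrative index vocabulary.

import re

# Illustrative vocabulary of index terms.
vocabulary = ["data", "database", "datasets", "date", "retrieval"]

def wildcard_match(pattern, terms):
    """Match a pattern with an optional trailing * against a list of index terms."""
    if pattern.endswith("*"):
        regex = re.compile(re.escape(pattern[:-1]) + r"\w*")
    else:
        regex = re.compile(re.escape(pattern))
    return [t for t in terms if regex.fullmatch(t)]

print(wildcard_match("data*", vocabulary))  # ['data', 'database', 'datasets']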
TEXT PREPROCESSING
Text preprocessing is an initial phase in text mining. Various preprocessing
techniques are used to prepare text documents for categorization: filtering, sentence
splitting, stemming, stop word removal, and token frequency counting. Filtering applies
a set of rules for removing duplicate strings and irrelevant text. The main text
preprocessing steps are:
1. Tokenization.
2. Lower casing.
3. Stop word removal.
4. Stemming.
5. Lemmatization.
The purpose of tokenization in text processing is to split raw text into smaller units
called tokens, typically words or sentences, so that each unit can be counted, indexed,
and processed individually.
Stemming and Lemmatization are Text Normalization (or sometimes called Word
Normalization) techniques in the field of Natural Language Processing that are used
to prepare text, words, and documents for further processing.
Preprocessing of the text data is an essential step, as it prepares the text for mining.
If preprocessing is not applied, the data will be inconsistent and will not produce
good analytical results.
Text pre-processing is used to clean up text data: convert words to their roots (in other
words, lemmatize) and filter out unwanted digits, punctuation, and stop words. Some of
the common text preprocessing / cleaning steps are listed below, followed by a sketch of
a small preprocessing pipeline:
● Lower casing.
● Removal of Punctuations.
● Removal of Stop words.
● Removal of Frequent words.
● Removal of Rare words.
● Stemming.
● Lemmatization.
● Removal of emojis.
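A minimal sketch of such a pipeline is shown below using the NLTK library; it assumes the NLTK stop word and WordNet data have been downloaded (nltk.download('stopwords'), nltk.download('wordnet')), and a real pipeline would apply only the steps appropriate to its task.

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                                                # lower casing
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = text.split()                                              # simple tokenization
    tokens = [t for t in tokens if t not in stop_words]                # stop word removal
    stems = [stemmer.stem(t) for t in tokens]                          # stemming
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]                 # lemmatization
    return stems, lemmas

stems, lemmas = preprocess("The studies on Information Retrieval systems were evaluated.")
print(stems)   # e.g. ['studi', 'inform', 'retriev', 'system', 'evalu']
print(lemmas)  # e.g. ['study', 'information', 'retrieval', 'system', 'evaluated']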
Evaluation measure
Evaluation measures for an information retrieval system are used to assess how well the
search results satisfied the user's query intent. The field of information retrieval has used
various types of quantitative metrics for this purpose, based on either observed user behavior
or on scores from prepared benchmark test sets. Besides benchmarking by using this type of
measure, an evaluation for an information retrieval system should also include a validation of
the measures used, i.e. an assessment of how well the measures capture what they are intended to
measure and how well the system fits its intended use case.
Metrics are often split into two types: online metrics look at users' interactions with the
search system, while offline metrics measure theoretical relevance, in other words how likely
each result, or the search engine results page (SERP) as a whole, is to meet the information
needs of the user.
Online metrics
Online metrics are generally created from search logs. The metrics are often used to
determine the success of an A/B test.
Session abandonment rate
Session abandonment rate is a ratio of search sessions which do not result in a click.
Click-through rate
Click-through rate (CTR) is the ratio of users who click on a specific link to the number of
total users who view a page, email, or advertisement. It is commonly used to measure the
success of an online advertising campaign for a particular website as well as the effectiveness
of email campaigns.
Session success rate
Session success rate measures the ratio of user sessions that lead to a success. Defining
"success" is often dependent on context, but for search a successful result is often measured
using dwell time as a primary factor along with secondary user interaction, for instance, the
user copying the result URL is considered a successful result, as is copy/pasting from the
snippet.
Zero result rate
Zero result rate (ZRR) is the ratio of Search Engine Results Pages (SERPs) which returned
with zero results. The metric either indicates a recall issue, or that the information being
searched for is not in the index.
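The sketch below computes these online metrics from a toy search log; the log format (one record per search session with result and click counts) is an assumption for illustration, and note that click-through rate is computed here per session rather than per individual link impression.

# Toy search log; the field names and values are illustrative only.
sessions = [
    {"results": 8, "clicks": 2},
    {"results": 5, "clicks": 0},
    {"results": 0, "clicks": 0},
    {"results": 12, "clicks": 1},
]
total = len(sessions)

session_abandonment_rate = sum(1 for s in sessions if s["clicks"] == 0) / total
zero_result_rate = sum(1 for s in sessions if s["results"] == 0) / total
click_through_rate = sum(1 for s in sessions if s["clicks"] > 0) / total

print(session_abandonment_rate)  # 0.5  - sessions with no click
print(zero_result_rate)          # 0.25 - SERPs that returned nothing
print(click_through_rate)        # 0.5  - sessions with at least one click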
Offline metrics
Offline metrics are generally created from relevance judgment sessions where the judges
score the quality of the search results. Both binary (relevant/non-relevant) and multi-level (e.g.,
relevance from 0 to 5) scales can be used to score each document returned in response to a query.
In practice, queries may be ill-posed, and there may be different shades of relevance.
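As a small illustration of how binary judgments can be aggregated into an offline score, the sketch below computes the fraction of the top-k returned documents that were judged relevant (precision at k); the judgments are made up for illustration.

# Made-up binary relevance judgments (1 = relevant, 0 = non-relevant) for the
# documents returned for one query, listed in ranked order.
judgments = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]

def precision_at_k(judged, k):
    """Fraction of the top-k returned documents judged relevant."""
    top_k = judged[:k]
    return sum(top_k) / len(top_k)

print(precision_at_k(judgments, 5))   # 0.6
print(precision_at_k(judgments, 10))  # 0.4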
WEB SEARCH
A web search engine is a specialized computer server that searches for data on the Web.
The search results for a user query are returned as a list (known as hits). The hits can include
web pages, images, and different types of files. There are various search engines that also
search and return data available in public databases or open directories. Search engines differ
from web directories in that web directories are supported by human editors whereas search
engines work algorithmically or by a combination of algorithmic and human input.
Web search engines are large data mining applications. Several data mining techniques
are used in all components of a search engine, ranging from crawling (e.g., deciding
which pages should be crawled and at what frequency) and indexing (e.g., selecting the pages to
be indexed and determining to what extent the index should be constructed) to searching
(e.g., determining how pages should be ranked, which advertisements should be added, and how
the search results can be personalized or made “context aware”).
ANALYTICS
Analytics is the systematic computational analysis of data or statistics.[1] It is used for
the discovery, interpretation, and communication of meaningful patterns in data. It also entails
applying data patterns toward effective decision-making. It can be valuable in areas rich with
recorded information; analytics relies on the simultaneous application of statistics, computer
programming, and operations research to quantify performance.
Organizations may apply analytics to business data to describe, predict, and improve
business performance. Specifically, areas within analytics include descriptive analytics,
diagnostic analytics, predictive analytics, prescriptive analytics, and cognitive analytics.[2]
Analytics may apply to a variety of fields such as marketing, management, finance, online
systems, information security, and software services. Since analytics can require extensive
computation (see big data), the algorithms and software used for analytics harness the most
current methods in computer science, statistics, and mathematics.
CURRENT TRENDS IN WEB SEARCH