Unit-5 Adt
IR CONCEPTS
Information Retrieval (IR) can be defined as the software process that deals with the
organization, storage, retrieval, and evaluation of information from document repositories,
particularly textual information. Information retrieval is the activity of obtaining material,
usually documents of an unstructured nature (i.e., text), that satisfies an information need
from within large collections stored on computers. A typical example of information
retrieval is a user entering a query into a search system.
What is an IR Model?
An Information Retrieval (IR) model selects and ranks the documents that are required
by the user, who expresses this requirement in the form of a query. The documents and the queries
are represented in a similar manner, so that document selection and ranking can be
formalized by a matching function that returns a retrieval status value (RSV) for each
document in the collection. Many Information Retrieval systems represent document
contents by a set of descriptors, called terms, belonging to a vocabulary V. An IR model
determines the query-document matching function according to one of four main approaches,
which are described under Retrieval Models below.
Components of Information Retrieval/ IR Model
Acquisition: In this step, documents and other objects are selected from
various web resources consisting of text-based documents. The
required data is collected by web crawlers and stored in a database.
Representation: This consists of indexing, which covers free-text terms and controlled
vocabulary, using both manual and automatic techniques. Example: abstracting
involves summarizing, and a bibliographic description contains the author, title,
source, date, and metadata.
File Organization: There are two basic file organization methods: sequential, which
stores the collection document by document, and inverted, which stores it term by term
with a list of records under each term; a combination of both can also be used. A small
sketch of an inverted file appears after this list.
Query: An IR process starts when a user enters a query into the system. Queries
are formal statements of information needs, for example, search strings in web
search engines. In information retrieval, a query does not uniquely identify a
single object in the collection. Instead, several objects may match the query,
perhaps with different degrees of relevancy.
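To make the inverted file organization above concrete, the following Python sketch builds a small term-to-document index. The toy documents and their IDs are illustrative only, not part of any particular IR system.

from collections import defaultdict

# Toy collection; document IDs and contents are illustrative only.
docs = {
    1: "information retrieval deals with storage and retrieval of documents",
    2: "web search engines retrieve documents that match a user query",
    3: "an inverted index maps each term to the documents containing it",
}

# Inverted index: term -> sorted list of document IDs (a postings list).
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

postings = {term: sorted(ids) for term, ids in inverted_index.items()}
print(postings["documents"])  # [1, 2, 3]
print(postings["query"])      # [2]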
RETRIEVAL MODELS
The classical IR models are the simplest and easiest to implement, and they are based on
mathematical foundations that are easily recognized and understood. Boolean, vector space,
and probabilistic are the three classical (statistical) IR models; the semantic model forms
a fourth, non-statistical family.
Types of retrieval model:
● Classical IR Model - It is the simplest and easiest to implement IR model (Boolean, vector space, probabilistic).
● Non-Classical IR Model - It is based on principles other than those of the classical models.
● Alternative IR Model
Concepts used in building and querying these models include:
● Inverted Index.
● Stop Word Elimination.
● Stemming.
● Term Weighting.
● Term Frequency (tfij)
1. Boolean Model
In this model, documents are represented as a set of terms. Queries are formulated as a
combination of terms using the standard Boolean logic set-theoretic operators such as AND,
OR and NOT. Retrieval and relevance are considered as binary concepts in this model, so the
retrieved elements are an "exact match" retrieval of relevant documents.
Boolean retrieval models lack sophisticated ranking algorithms and are among the
earliest and simplest information retrieval models. These models make it easy to associate
metadata information and write queries that match the contents of the documents as well as
other properties of documents, such as date of creation, author, and type of document.
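The following minimal Python sketch illustrates exact-match Boolean retrieval; the document collection and the query are made up for illustration, and real systems would evaluate such queries over an inverted index rather than raw term sets.

# Boolean retrieval sketch: documents as term sets, queries as set operations.
docs = {
    1: {"information", "retrieval", "boolean", "model"},
    2: {"vector", "space", "model", "ranking"},
    3: {"probabilistic", "model", "relevance"},
}
all_ids = set(docs)

def term_set(term):
    """Return the set of document IDs whose term set contains the term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

# Query: model AND (boolean OR vector) AND NOT probabilistic
result = (term_set("model")
          & (term_set("boolean") | term_set("vector"))
          & (all_ids - term_set("probabilistic")))
print(sorted(result))  # [1, 2] - an exact match, with no ranking among the hits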
2. Vector Space Model
The vector space model provides a framework in which term weighting, ranking of retrieved
documents, and relevance feedback are possible. Documents are represented as features and
weights of term features in an n dimensional vector space of terms. Features are a subset of the
terms in a set of documents that are deemed most relevant to an IR search for this particular set
of documents.
The process of selecting these important terms (features) and their properties as a sparse
(limited) list out of the very large number of available terms (the vocabulary can contain
hundreds of thousands of terms) is independent of the model specification. The query is also
specified as a terms vector (vector of features), and this is compared to the document vectors
for similarity/relevance assessment.
In the vector model, the document term weight wij (for term i in document j) is represented
based on some variation of the TF (term frequency) or TF-IDF (term frequency- inverse
document frequency) scheme (as we will describe below). TF-IDF is a statistical weight
measure that is used to evaluate the importance of a document word in a collection of
documents. The following formula is typically used:
wij = tfij × idfi = tfij × log(N / dfi)
where tfij is the number of occurrences of term i in document j, dfi is the number of
documents in the collection that contain term i, and N is the total number of documents
in the collection.
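A minimal Python sketch of the weighting and ranking described above: documents and the query are turned into TF-IDF vectors and compared by cosine similarity. The toy texts are illustrative, and practical systems use refined variants of these weights (for example, length normalization and smoothing).

import math
from collections import Counter

# Toy collection; contents are illustrative only.
docs = {
    1: "information retrieval ranks documents for a user query",
    2: "the vector space model weights terms with tf idf",
    3: "tf idf weights reflect term frequency and rarity",
}
tokenized = {d: text.lower().split() for d, text in docs.items()}
N = len(tokenized)

# Document frequency dfi: number of documents containing term i.
df = Counter()
for tokens in tokenized.values():
    for term in set(tokens):
        df[term] += 1

def tfidf_vector(tokens):
    """wij = tfij * log(N / dfi); terms unseen in the collection are ignored."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf if t in df}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

doc_vectors = {d: tfidf_vector(toks) for d, toks in tokenized.items()}
query_vector = tfidf_vector("tf idf weights".split())
ranking = sorted(doc_vectors, key=lambda d: cosine(query_vector, doc_vectors[d]), reverse=True)
print(ranking)  # document IDs ordered by decreasing similarity to the query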
3. Probabilistic Model
In the probabilistic framework, the IR system has to decide whether the documents belong
to the relevant set or the nonrelevant set for a query. To make this decision, it is assumed that
a predefined relevant set and nonrelevant set exist for the query, and the task is to calculate the
probability that the document belongs to the relevant set and compare that with the probability
that the document belongs to the nonrelevant set.
Given the document representation D of a document, estimating the relevance R and
nonrelevance NR of that document involves computation of conditional probability P(R|D) and
P(NR|D). These conditional probabilities can be calculated using Bayes' rule:
P(R|D) = P(D|R) × P(R)/P(D)
P(NR|D) = P(D|NR) × P(NR)/P(D)
A document D is classified as relevant if P(R|D) > P(NR|D). Discarding the constant P(D),
this is equivalent to saying that a document is relevant if:
P(D|R) × P(R) > P(D|NR) × P(NR)
The likelihood ratio P(D|R)/P(D|NR) is used as a score to determine the likelihood of the
document with representation D belonging to the relevant set.
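The short numeric sketch below walks through this decision rule with made-up probability values; they are purely illustrative, not estimates from a real collection.

# Illustrative (made-up) probabilities for one document representation D.
p_r = 0.3           # prior P(R): a document is relevant to the query
p_nr = 0.7          # prior P(NR): a document is nonrelevant
p_d_given_r = 0.08  # likelihood P(D|R)
p_d_given_nr = 0.02 # likelihood P(D|NR)

# Bayes' rule numerators; the common denominator P(D) cancels out.
score_relevant = p_d_given_r * p_r         # proportional to P(R|D)  -> 0.024
score_nonrelevant = p_d_given_nr * p_nr    # proportional to P(NR|D) -> 0.014

print(score_relevant > score_nonrelevant)  # True: classify D as relevant
print(p_d_given_r / p_d_given_nr)          # 4.0: likelihood ratio used as a ranking score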
4. Semantic Model
Semantic approaches include different levels of analysis, such as morphological, syntactic,
and semantic analysis, to retrieve documents more effectively. In morphological analysis,
roots and affixes are analyzed to determine the parts of speech (nouns, verbs, adjectives, and
so on) of the words. The development of a sophisticated semantic system requires complex
knowledge bases of semantic information as well as retrieval heuristics. These systems often
require techniques from artificial intelligence and expert systems. Knowledge bases like
Cyc and WordNet have been developed for use in knowledge-based IR systems based on
semantic models.
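As a small illustration of using such a knowledge base, the sketch below looks up WordNet synonyms for a query term with the NLTK library; it assumes NLTK and its WordNet corpus are installed (e.g. via nltk.download('wordnet')), and the expansion strategy shown is deliberately simplistic.

# Sketch of WordNet-based synonym lookup for simple query expansion.
from nltk.corpus import wordnet as wn

def expand_term(term):
    """Collect the synonyms of a term from all of its WordNet synsets."""
    synonyms = set()
    for synset in wn.synsets(term):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace("_", " "))
    return synonyms

print(expand_term("car"))  # includes, e.g., 'auto', 'automobile', 'machine'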
TYPES OF QUERIES IN IR SYSTEMS:
During the process of indexing, many keywords are associated with each document in the set;
these include words, phrases, date created, author names, and type of document. They
are used by an IR system to build an inverted index which is then consulted during the
search. The queries formulated by users are compared to the set of index keywords. Most
IR systems also allow the use of Boolean and other operators to build a complex query.
The query language with these operators enriches the expressiveness of a user’s
information needs.
1. Keyword Queries:
● Simplest and most common queries.
● The user enters just keyword combinations to retrieve documents.
● These keywords are connected by the logical AND operator.
● All retrieval models provide support for keyword queries.
2. Boolean Queries:
● Some IR systems allow using the +, -, AND, OR, NOT, and ( ) Boolean operators in
combination with keyword formulations.
● No ranking is involved because a document either satisfies such a query or does not
satisfy it.
● A document is retrieved for a Boolean query if the query evaluates to true for that
document, i.e., the document is an exact match.
3. Phrase Queries:
● When documents are represented using an inverted keyword index for searching,
the relative order of terms in the document is lost.
● To perform exact phrase retrieval, these phrases are encoded in an inverted index or
implemented differently.
● This query consists of a sequence of words that make up a phrase. It is generally
enclosed within double quotes. A sketch of phrase matching over a positional index
follows this list.
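The following sketch shows one way to support phrase queries with a positional inverted index; the documents are toy examples, and real systems store positions far more compactly.

from collections import defaultdict

# Toy documents; a positional index records where each term occurs.
docs = {
    1: "information retrieval models rank documents",
    2: "models for information retrieval and web search",
}
index = defaultdict(lambda: defaultdict(list))  # term -> {doc_id: [positions]}
for doc_id, text in docs.items():
    for pos, term in enumerate(text.lower().split()):
        index[term][doc_id].append(pos)

def phrase_match(phrase):
    """Return IDs of documents containing the phrase's words adjacently and in order."""
    terms = phrase.lower().split()
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in candidates:
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(terms))
               for p in index[terms[0]][doc_id]):
            hits.append(doc_id)
    return sorted(hits)

print(phrase_match("information retrieval"))  # [1, 2]
print(phrase_match("retrieval models"))       # [1] - not adjacent in document 2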
4. Proximity Queries:
● Proximity refers to search that accounts for how close within a record multiple items
should be to each other.
● The most commonly used proximity search option is a phrase search that requires the
terms to be in exact order.
● Other proximity operators can specify how close terms should be to each other.
Some will specify the order of search terms.
● Search engines use various operator names such as NEAR, ADJ (adjacent), or
AFTER.
● However, providing support for complex proximity operators becomes expensive
as it requires time-consuming pre-processing of documents, and so it is suitable for
smaller document collections rather than for the web. A sketch of a simple proximity
check follows this list.
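A minimal sketch of a NEAR/k style proximity check follows; the operator semantics (unordered, within k words) and the example sentence are assumptions for illustration, since each search engine defines its proximity operators differently.

def near(text, term_a, term_b, k):
    """True if term_a and term_b occur within k words of each other, in any order."""
    tokens = text.lower().split()
    positions_a = [i for i, t in enumerate(tokens) if t == term_a]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(pa - pb) <= k for pa in positions_a for pb in positions_b)

doc = "search engines rank pages by estimated relevance to the query"
print(near(doc, "rank", "relevance", 4))  # True: four words apart
print(near(doc, "rank", "query", 4))      # False: seven words apart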
5. Wildcard Queries:
● It supports regular expressions and pattern matching-based searching in text.
Retrieval models do not directly support this query type.
● In IR systems, certain kinds of wildcard search support may be implemented.
● Example: words with a given prefix and arbitrary trailing characters, such as data*
matching data, database, and datasets; a sketch of such matching follows this list.
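The sketch below handles the trailing-wildcard case by translating the pattern into a regular expression with Python's standard re module and matching it against an illustrative index vocabulary.

import re

# Illustrative vocabulary of index terms.
vocabulary = ["data", "database", "datasets", "date", "retrieval"]

def wildcard_match(pattern, terms):
    """Match a pattern with an optional trailing * against a list of index terms."""
    if pattern.endswith("*"):
        regex = re.compile(re.escape(pattern[:-1]) + r"\w*")
    else:
        regex = re.compile(re.escape(pattern))
    return [t for t in terms if regex.fullmatch(t)]

print(wildcard_match("data*", vocabulary))  # ['data', 'database', 'datasets']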
TEXT PREPROCESSING
Text preprocessing is an initial phase in text mining. Various preprocessing
techniques are used to prepare text documents for categorization: filtering, sentence
splitting, stemming, stop word removal, and token frequency counting. Filtering applies
a set of rules for removing duplicate strings and irrelevant text. The main text
preprocessing steps are:
1. Tokenization.
2. Lower casing.
3. Stop word removal.
4. Stemming.
5. Lemmatization.
The purpose of tokenization in text processing is to split raw text into smaller units
called tokens, typically words or sentences, so that each unit can be counted, indexed,
and processed individually.
Stemming and Lemmatization are Text Normalization (or sometimes called Word
Normalization) techniques in the field of Natural Language Processing that are used
to prepare text, words, and documents for further processing.
Preprocessing of the text data is an essential step, as it prepares the text for mining.
If preprocessing is not applied, the data will be inconsistent and will not produce
good analytical results.
Text pre-processing is used to clean up text data: convert words to their roots (in other
words, lemmatize) and filter out unwanted digits, punctuation, and stop words. Some of
the common text preprocessing / cleaning steps are listed below, followed by a sketch of
a small preprocessing pipeline:
● Lower casing.
● Removal of Punctuations.
● Removal of Stop words.
● Removal of Frequent words.
● Removal of Rare words.
● Stemming.
● Lemmatization.
● Removal of emojis.
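A minimal sketch of such a pipeline is shown below using the NLTK library; it assumes the NLTK stop word and WordNet data have been downloaded (nltk.download('stopwords'), nltk.download('wordnet')), and a real pipeline would apply only the steps appropriate to its task.

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                                                # lower casing
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = text.split()                                              # simple tokenization
    tokens = [t for t in tokens if t not in stop_words]                # stop word removal
    stems = [stemmer.stem(t) for t in tokens]                          # stemming
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]                 # lemmatization
    return stems, lemmas

stems, lemmas = preprocess("The studies on Information Retrieval systems were evaluated.")
print(stems)   # e.g. ['studi', 'inform', 'retriev', 'system', 'evalu']
print(lemmas)  # e.g. ['study', 'information', 'retrieval', 'system', 'evaluated']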
Evaluation measure
Evaluation measures for an information retrieval system are used to assess how well the
search results satisfied the user's query intent. The field of information retrieval has used
various types of quantitative metrics for this purpose, based on either observed user behavior
or on scores from prepared benchmark test sets. Besides benchmarking by using this type of
measure, an evaluation for an information retrieval system should also include a validation of
the measures used, i.e. an assessment of how well the measures capture what they are intended to
measure and how well the system fits its intended use case.
Metrics are often split into two types: online metrics look at users' interactions with the
search system, while offline metrics measure theoretical relevance, in other words how likely
each result, or the search engine results page (SERP) as a whole, is to meet the information
needs of the user.
Online metrics
Online metrics are generally created from search logs. The metrics are often used to
determine the success of an A/B test.
Session abandonment rate
Session abandonment rate is a ratio of search sessions which do not result in a click.
Click-through rate
Click-through rate (CTR) is the ratio of users who click on a specific link to the number of
total users who view a page, email, or advertisement. It is commonly used to measure the
success of an online advertising campaign for a particular website as well as the effectiveness
of email campaigns.
Session success rate
Session success rate measures the ratio of user sessions that lead to a success. Defining
"success" is often dependent on context, but for search a successful result is often measured
using dwell time as a primary factor along with secondary user interaction, for instance, the
user copying the result URL is considered a successful result, as is copy/pasting from the
snippet.
Zero result rate
Zero result rate (ZRR) is the ratio of Search Engine Results Pages (SERPs) which returned
with zero results. The metric either indicates a recall issue, or that the information being
searched for is not in the index.
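The sketch below computes these online metrics from a toy search log; the log format (one record per search session with result and click counts) is an assumption for illustration, and note that click-through rate is computed here per session rather than per individual link impression.

# Toy search log; the field names and values are illustrative only.
sessions = [
    {"results": 8, "clicks": 2},
    {"results": 5, "clicks": 0},
    {"results": 0, "clicks": 0},
    {"results": 12, "clicks": 1},
]
total = len(sessions)

session_abandonment_rate = sum(1 for s in sessions if s["clicks"] == 0) / total
zero_result_rate = sum(1 for s in sessions if s["results"] == 0) / total
click_through_rate = sum(1 for s in sessions if s["clicks"] > 0) / total

print(session_abandonment_rate)  # 0.5  - sessions with no click
print(zero_result_rate)          # 0.25 - SERPs that returned nothing
print(click_through_rate)        # 0.5  - sessions with at least one click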
Offline metrics
Offline metrics are generally created from relevance judgment sessions where the judges
score the quality of the search results. Both binary (relevant/non-relevant) and multi-level (e.g.,
relevance from 0 to 5) scales can be used to score each document returned in response to a query.
In practice, queries may be ill-posed, and there may be different shades of relevance.
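As a small illustration of how binary judgments can be aggregated into an offline score, the sketch below computes the fraction of the top-k returned documents that were judged relevant (precision at k); the judgments are made up for illustration.

# Made-up binary relevance judgments (1 = relevant, 0 = non-relevant) for the
# documents returned for one query, listed in ranked order.
judgments = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]

def precision_at_k(judged, k):
    """Fraction of the top-k returned documents judged relevant."""
    top_k = judged[:k]
    return sum(top_k) / len(top_k)

print(precision_at_k(judgments, 5))   # 0.6
print(precision_at_k(judgments, 10))  # 0.4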
WEB SEARCH
A web search engine is a specialized computer server that searches for data on the Web.
The search results for a user query are returned as a list (known as hits). The hits can include
web pages, images, and different types of files. There are various search engines that also
search and return data available in public databases or open directories. Search engines differ
from web directories in that web directories are supported by human editors whereas search
engines work algorithmically or by a combination of algorithmic and human input.
Web search engines are large data mining applications. Several data mining techniques
are used in all components of a search engine, ranging from crawling (e.g., deciding
which pages should be crawled and at what frequency) and indexing (e.g., selecting the pages to
be indexed and determining to what extent the index should be constructed) to searching
(e.g., determining how pages should be ranked, which advertisements should be added, and how
the search results can be personalized or made “context aware”).
ANALYTICS
Analytics is the systematic computational analysis of data or statistics.[1] It is used for
the discovery, interpretation, and communication of meaningful patterns in data. It also entails
applying data patterns toward effective decision-making. It can be valuable in areas rich with
recorded information; analytics relies on the simultaneous application of statistics, computer
programming, and operations research to quantify performance.
Organizations may apply analytics to business data to describe, predict, and improve
business performance. Specifically, areas within analytics include descriptive analytics,
diagnostic analytics, predictive analytics, prescriptive analytics, and cognitive analytics.[2]
Analytics may apply to a variety of fields such as marketing, management, finance, online
systems, information security, and software services. Since analytics can require extensive
computation (see big data), the algorithms and software used for analytics harness the most
current methods in computer science, statistics, and mathematics.
CURRENT TRENDS IN WEB SEARCH