Information Retrieval

The document outlines the syllabus for a course on Information Retrieval Systems, detailing various chapters covering topics such as document indexing, retrieval models, spelling correction, performance evaluation, text categorization, web information retrieval, and advanced topics like cross-lingual retrieval. It includes definitions, goals, components of IR systems, and applications in various domains. The content is structured into units and chapters, providing a comprehensive overview of information retrieval concepts and techniques.

MU T.Y. B.Sc. Computer Science, SEM 6

Information Retrieval

(Choice Based Credit System (CBCS), with effect from Academic Year 2023-24)

Er. Hasan Phudinawala, Pramitha Santhumayor

TECH-NEO PUBLICATIONS
Table of Contents

Unit 1

CHAPTER 1: Introduction to Information Retrieval System
  1.1 Definition and Goals of Information Retrieval
    1.1.1 Information Retrieval Involves a Range of Tasks and Applications
  1.2 Components of an IR System
  1.3 Challenges and Applications of IR

CHAPTER 2: Document Indexing, Storage, and Compression
  2.1 Inverted Index
  2.2 Inverted Index Construction and Compression Techniques
    2.2.1 Inverted Index Construction
      2.2.1.1 Simple Index Construction
      2.2.1.2 Merging
      2.2.1.3 Data Placement
      2.2.1.4 MapReduce
    2.2.2 Compression Techniques
      2.2.2.1 Dictionary Compression
      2.2.2.2 Bit-Aligned Codes
      2.2.2.3 Variable-Byte Code
  2.3 Document Representation and Term Weighting
    2.3.1 Document Representation
    2.3.2 Term Weighting

CHAPTER 3: Retrieval Models
  3.1 Boolean Model
  3.2 Boolean Operators
  3.3 Query Processing
    3.3.1 Document-at-a-Time Query Processing
    3.3.2 Efficient Query Processing with Heaps
    3.3.3 Term-at-a-Time Query Processing
  3.4 Vector Space Model
  3.5 Probabilistic Model

CHAPTER 4: Spelling Correction in IR Systems
  4.1 Spelling Correction
  4.2 Challenges of Spelling Errors in Queries and Documents
  4.3 Edit Distance and String Similarity Measures
  4.4 Techniques for Spelling Correction in IR Systems
    4.4.1 k-gram Indexes for Spelling Correction
    4.4.2 Context-Sensitive Spelling Correction
    4.4.3 Phonetic Correction

CHAPTER 5: Performance Evaluation
  5.1 Evaluation Metrics
    5.1.1 Recall and Precision
    5.1.2 F Measure
    5.1.3 Average Precision
  5.2 Test Collections and Relevance Judgments

Unit 2

CHAPTER 6: Text Categorization and Filtering
  6.1 Text Classification/Categorization Algorithms
    6.1.1 Naïve Bayes
    6.1.2 Support Vector Machine (SVM)
  6.2 Feature Selection
  6.3 Dimensionality Reduction
  6.4 Applications of Text Categorization and Filtering

CHAPTER 7: Text Clustering for Information Retrieval
  7.1 Clustering Techniques
    7.1.1 K-means Clustering
    7.1.2 Hierarchical Clustering
  7.2 Evaluation of Clustering Results
  7.3 Clustering for Query Expansion and Result Grouping

CHAPTER 8: Web Information Retrieval
  8.1 Web Search Architecture and Challenges
    8.1.1 Web Search and Search Engine
    8.1.2 Web Structure
    8.1.3 Challenges of Web Search
    8.1.4 Web Search Architecture
  8.2 Crawling and Indexing Web Pages
    8.2.1 Web Crawling
    8.2.2 Indexing the Web Pages or Web Indexes
  8.3 Link Analysis and PageRank Algorithm
    8.3.1 Link Analysis
    8.3.2 PageRank Algorithm

CHAPTER 9: Learning to Rank
  9.1 Learning to Rank (LTR): Algorithms and Techniques
  9.2 Pairwise and Listwise Learning to Rank Approaches
    9.2.1 Pairwise Learning to Rank Approaches
    9.2.2 Listwise Learning to Rank Approaches
  9.3 Supervised Learning for Ranking: RankSVM, RankBoost
    9.3.1 RankSVM
    9.3.2 RankBoost
  9.4 Evaluation Metrics for Learning to Rank

CHAPTER 10: Link Analysis and its Role in IR Systems
  10.1 Web Graph Representation and Link Analysis
    10.1.1 Web Graph
    10.1.2 Link Analysis
    10.1.3 Link Analysis Algorithms
  10.2 HITS and PageRank Algorithms
    10.2.1 HITS (Hyperlink-Induced Topic Search) Algorithm
    10.2.2 PageRank Algorithm
  10.3 Applications of Link Analysis in IR Systems

Unit 3

CHAPTER 11: Crawling and Near-Duplicate Page Detection
  11.1 Web Page Crawling Techniques: Breadth-First, Depth-First
    11.1.1 Breadth-First
    11.1.2 Depth-First
  11.2 Focused Crawling
  11.3 Near-Duplicate Detection Algorithm
  11.4 Handling Dynamic Web Content During Crawling

CHAPTER 12: Advanced Topics in IR
  12.1 Text Summarization
    12.1.1 Extractive Approach
    12.1.2 Abstractive Approach
  12.2 Question Answering: Approaches for Finding Precise Answers
  12.3 Recommender Systems
    12.3.1 Collaborative Filtering
    12.3.2 Content-based Filtering

CHAPTER 13: Cross-Lingual and Multilingual Retrieval
  13.1 Cross-Lingual Retrieval or Cross-Lingual Information Retrieval and Multilingual Retrieval or Multilingual Information Retrieval (Cross-Lingual Search and Multilingual Search)
    13.1.1 Cross-Lingual Retrieval or Cross-Lingual Information Retrieval
    13.1.2 Multilingual Retrieval or Multilingual Information Retrieval
  13.2 Challenges and Techniques for Cross-Lingual Retrieval
    13.2.1 Techniques for Cross-Lingual Retrieval
    13.2.2 Challenges for Cross-Lingual Retrieval
  13.3 Machine Translation (MT) for IR
  13.4 Multilingual Document Representations and Query Translation
  13.5 Evaluation Techniques for IR Systems

CHAPTER 14: User-based Evaluation
  14.1 User-based Evaluation
    14.1.1 User Studies
    14.1.2 Surveys
  14.2 Test Collections and Benchmarking
  14.3 Online Evaluation Methods: A/B Testing, Interleaving Experiments
    14.3.1 A/B Testing
    14.3.2 Interleaving Experiments

(New Syllabus w.e.f Academic Year 23-24) (BC-12) Tech-Neo Publications
UNIT 1

CHAPTER 1: Introduction to Information Retrieval System

Syllabus: Introduction to Information Retrieval (IR) systems: Definition and goals of information retrieval, Components of an IR system, Challenges and applications of IR.

1.1 DEFINITION AND GOALS OF INFORMATION RETRIEVAL

GQ. Define Information Retrieval. Also state its goals. (5 Marks)

Information retrieval is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections.
Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.
Information retrieval is concerned with representing, searching, and manipulating large collections of electronic text and other human language data.
IR systems and services are now widespread, with millions of people depending on them daily to facilitate business, education, and entertainment.
Web search engines (Google, Bing, and others) are by far the most popular and heavily used IR services, providing access to up-to-date technical information, locating people and organizations, summarizing news and events, and simplifying comparison shopping.
The term "unstructured data" refers to data which does not have clear, semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the canonical example of which is a relational database, of the sort companies usually use to maintain product inventories and personnel records. In reality, almost no data are truly "unstructured". This is definitely true of all text data if you count the latent linguistic structure of human languages.
IR is also used to facilitate "semi-structured" search, such as finding a document where the title contains Java and the body contains threading.
The field of information retrieval also covers supporting users in browsing or filtering document collections or further processing a set of retrieved documents.

1.1.1 Information Retrieval Involves a Range of Tasks and Applications

The usual search scenario involves someone typing in a query to a search engine and receiving answers in the form of a list of documents in ranked order.
While the World Wide Web (web search) is by far the most common application involving information retrieval, search is also a crucial part of applications in corporations, government, and many other domains.
Vertical search is a specialized form of web search where the domain of the search is restricted to a particular topic.
Enterprise search involves finding the required information in the huge variety of computer files scattered across a corporate intranet. Web pages are certainly a part of that distributed information store, but most information will be found in sources such as email, reports, presentations, spreadsheets, and structured data in corporate databases.
Desktop search is the personal version of enterprise search, where the information sources are the files stored on an individual computer, including email messages and web pages that have recently been browsed.
Peer-to-peer search involves finding information in networks of nodes or computers without any centralized control. This type of search began as a file sharing tool for music but can be used in any community based on shared interests, or even shared locality in the case of mobile devices.
Search and related information retrieval techniques are used for advertising, for intelligence analysis, for scientific discovery, for health care, for customer support, for real estate, and so on.
Search based on a user query (sometimes called ad hoc search, because the range of possible queries is huge and not prespecified) is not the only text-based task that is studied in information retrieval. Other tasks include filtering, classification, and question answering.
Filtering or tracking involves detecting stories of interest based on a person's interests and providing an alert using email or some other mechanism.

1.2 COMPONENTS OF AN IR SYSTEM

GQ. Draw and explain the components of an IR system. (5 Marks)

[Figure: a User with an Information Need issues a Query to the Search Engine; the engine consults the Index built from the Documents (updated by Additions and Deletions) and returns a Result.]

Fig. 1.2.1: Components of an IR system.

Before conducting a search, a user has an information need, which underlies and drives the search process. We sometimes refer to this information need as a topic, particularly when it is presented in written form as part of a test collection for IR evaluation. As a result of her information need, the user constructs and issues a query to the IR system.
Typically, this query consists of a small number of terms, with two to three terms being typical for a Web search. We use "term" instead of "word", because a query term may in fact not be a word at all.
Depending on the information need, a query term may be a date, a number, a musical note, or a phrase. Wildcard operators and other partial-match operators may also be permitted in query terms. For example, the term "inform*" might match any word starting with that prefix ("inform", "informs", "informal", "informant", "informative", etc.).
The user's query is processed by a search engine, which may be running on the user's local machine, on a large cluster of machines in a remote geographic location, or anywhere in between. A major task of a search engine is to maintain and manipulate an inverted index for a document collection. This index forms the principal data structure used by the engine for searching and relevance ranking.
To support relevance ranking algorithms, the search engine maintains collection statistics associated with the index, such as the number of documents containing each term and the length of each document. In addition, the search engine usually has access to the original content of the documents, in order to report meaningful results back to the user.
Using the inverted index, collection statistics, and other data, the search engine accepts queries from its users, processes these queries, and returns ranked lists of results.
To perform relevance ranking, the search engine computes a score, sometimes called a retrieval status value (RSV), for each document. After sorting documents according to their scores, the result list may be subjected to further processing, such as the removal of duplicate or redundant results.


For example, a Web search engine might report only one or two results from a single host or domain, eliminating the others in favor of pages from different sources. The problem of scoring documents with respect to a user's query is one of the most fundamental in the field.

1.3 CHALLENGES AND APPLICATIONS OF IR

Document routing, filtering, and selective dissemination reverse the typical IR process. Whereas a typical search application evaluates incoming queries against a given document collection, a routing, filtering, or dissemination system compares newly created or discovered documents to a fixed set of queries supplied in advance by users, identifying those that match a given query closely enough to be of possible interest to the users. A news aggregator, for example, might use a routing system to separate the day's news into sections such as "business," "politics," and "lifestyle," or to send headlines of interest to particular subscribers.
Text clustering and categorization systems group documents according
to shared properties. The difference between clustering and
categorization stems from the information provided to the system.
Categorization systems are provided with training data illustrating the various classes. Examples of "business," "politics," and "lifestyle" articles might be provided to a categorization system, which would then sort unlabelled articles into the same categories. A clustering system, in
contrast, is not provided with training examples. Instead, it sorts
documents into groups based on patterns it discovers itself.
Summarization systems reduce documents to a few key paragraphs,
sentences, or phrases describing their content. The snippets of text
displayed with Web search results represent one example.
Information extraction systems identify named entities, such as places
and dates, and combine this information into structured records that
describe relationships between these entities - for example, creating lists
of books and their authors from Web data.
Topic detection and tracking systems identify events in streams of news
articles and similar information sources, tracking these events as they
evolve.
Expert search systems identify members of organizations who are experts in a specified area.
Question answering systems integrate information from multiple sources to provide concise answers to specific questions. They often incorporate and extend other IR technologies, including search, summarization, and information extraction.
Multimedia information retrieval systems extend relevance ranking and
other IR techniques to images, video, music, and speech.
Chapter Ends...
CHAPTER 2: Document Indexing, Storage, and Compression

Syllabus: Document Indexing, Storage, and Compression: Inverted index construction and compression techniques, Document representation and term weighting, Storage and retrieval of indexed documents.

2.1 INVERTED INDEX

Text search is very different from traditional computing tasks, so it calls for its own kind of data structure, the inverted index.
The name "inverted index" is really an umbrella term for many different kinds of structures that share the same general philosophy.
The inverted index (sometimes called an inverted file) is the central data structure in virtually every information retrieval system.
At its simplest, an inverted index provides a mapping between terms and their locations of occurrence in a text collection C.
An inverted index is organized by index term. The index is inverted because usually we think of words being a part of documents, but if we invert this idea, documents are associated with words.
Index terms are often alphabetized like a traditional book index, but they need not be, since they are often found directly using a hash table.
Each index term has its own inverted list that holds the relevant data for that term.
Fig. 2.1.1: A schema-independent inverted index for Shakespeare's plays. The dictionary provides a mapping from terms to their positions of occurrence.


The fundamental components of an inverted index are illustrated in Fig. 2.1.1 above.
The dictionary lists the terms contained in the vocabulary V of the collection. Each term has associated with it a postings list of the positions in which it appears, consistent with the positional numbering.

The index shown above contains not document identifiers but "flat" word positions of the individual term occurrences. This type of index is called a schema-independent index because it makes no assumptions about the structure (usually referred to as schema in the database community) of the underlying text. We chose the schema-independent variant for most of the examples in this chapter because it is the simplest.
We define an inverted index as an abstract data type (ADT) with four methods:
first(t) returns the first position at which the term t occurs in the collection;
last(t) returns the last position at which t occurs in the collection;
next(t, current) returns the position of t's first occurrence after the current position;
prev(t, current) returns the position of t's last occurrence before the current position.
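These four ADT methods can be sketched in a few lines over sorted postings lists using binary search. The sketch below is a minimal illustration under that assumption; the sample terms and positions are invented, not taken from the actual Shakespeare index.

```python
import bisect

# Toy schema-independent index: each term maps to a sorted list of
# word positions in the collection (positions are illustrative).
postings = {
    "witching": [745406],
    "hurlyburly": [316669, 745434],
}

END = float("inf")     # sentinel: no occurrence after `current`
BEGIN = float("-inf")  # sentinel: no occurrence before `current`

def first(t):
    """First position at which term t occurs in the collection."""
    p = postings.get(t)
    return p[0] if p else END

def last(t):
    """Last position at which term t occurs in the collection."""
    p = postings.get(t)
    return p[-1] if p else BEGIN

def next_(t, current):
    """Position of t's first occurrence after `current` (trailing
    underscore avoids shadowing Python's built-in next)."""
    p = postings.get(t, [])
    i = bisect.bisect_right(p, current)
    return p[i] if i < len(p) else END

def prev_(t, current):
    """Position of t's last occurrence before `current`."""
    p = postings.get(t, [])
    i = bisect.bisect_left(p, current)
    return p[i - 1] if i > 0 else BEGIN
```

The infinity sentinels play the role of the "no such position" values so that callers can iterate with next_/prev_ without special cases.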

2.2 INVERTED INDEX CONSTRUCTION AND COMPRESSION TECHNIQUES

2.2.1 Inverted Index Construction


Before an index can be used for query processing, it has to be created from the text collection. Building a small index is not particularly difficult, but as input sizes grow, some index construction tricks can be useful.

2.2.1.1 Simple Index Construction


The process involves only a few steps. A list of documents is passed to the BuildIndex function, and the function parses each document into tokens.
These tokens are words, perhaps with some additional processing, such as downcasing or stemming. The function removes duplicate tokens, using, for example, a hash table.
Then, for each token, the function determines whether a new inverted list needs to be created in I, and creates one if necessary.
Finally, the current document number, n, is added to the inverted list. The result is a hash table of tokens and inverted lists.
The inverted lists are just lists of integer document numbers and contain no special information. This is enough to do very simple kinds of retrieval; this indexer can be used for many small tasks, for example, indexing less than a few thousand documents.
However, it is limited in two ways. First, it requires that all of the inverted lists be stored in memory, which may not be practical for larger collections.
Second, this algorithm is sequential, with no obvious way to parallelize it. The primary barrier to parallelizing this algorithm is the hash table, which is accessed constantly in the inner loop.
Adding locks to the hash table would allow parallelism for parsing, but that improvement alone will not be enough to make use of more than a handful of CPU cores. Handling large collections will require less reliance on memory and improved parallelism.
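The steps above can be sketched as follows; the function name follows the text's BuildIndex, while the tokenization details (whitespace split plus downcasing) are simplifying assumptions of this illustration.

```python
def build_index(documents):
    """Simple in-memory index construction: map each token to the
    list of document numbers it appears in."""
    index = {}  # hash table of tokens -> inverted lists
    for n, doc in enumerate(documents):
        # Parse into tokens, downcase, and remove duplicates.
        tokens = set(doc.lower().split())
        for token in tokens:
            # Create a new inverted list if necessary, then
            # add the current document number n to it.
            index.setdefault(token, []).append(n)
    return index

docs = ["The quick brown fox", "the lazy dog", "quick dog"]
I = build_index(docs)
```

Because documents are processed in order, each inverted list ends up sorted by document number, which is exactly what the merging step in the next section relies on.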

2.2.1.2 Merging

The classic way to solve the memory problem in the previous example is by merging.
We can build the inverted list structure I until memory runs out. When that happens, we write the partial index I to disk, then start making a new one. At the end of this process, the disk is filled with many partial indexes, I1, I2, I3, ..., In.
The system then merges these files into a single result. By definition, it is not possible to hold even two of the partial index files in memory at one time, so the input files need to be carefully designed so that they can be merged in small pieces.
One way to do this is to store the partial indexes in alphabetical order. It is then possible for a merge algorithm to merge the partial indexes using very little memory.

Index A:        aardvark 2 3 4 5      apple 2 4
Index B:        aardvark 6 9          actor 15 42 68
Combined index: aardvark 2 3 4 5 6 9  actor 15 42 68  apple 2 4

Fig. 2.2.1: An example of index merging. The first and second indexes are merged together to produce the combined index.
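A two-way merge of alphabetically ordered partial indexes can be sketched as below. This is a simplification for illustration: a real system streams many partial files from disk in small pieces rather than holding lists in memory.

```python
def merge_indexes(a, b):
    """Merge two partial indexes, each an alphabetically sorted list
    of (term, postings) pairs, into one combined index."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            # Same term in both: concatenate the postings
            # (index A holds the earlier document numbers).
            result.append((a[i][0], a[i][1] + b[j][1]))
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    # One input is exhausted; copy whatever remains of the other.
    result.extend(a[i:])
    result.extend(b[j:])
    return result

index_a = [("aardvark", [2, 3, 4, 5]), ("apple", [2, 4])]
index_b = [("aardvark", [6, 9]), ("actor", [15, 42, 68])]
combined = merge_indexes(index_a, index_b)
```

Because both inputs are sorted by term, the merge reads each input once and needs memory only for the pair of entries currently being compared.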

2.2.1.3 Data Placement

Before diving into the mechanics of distributed processing, consider the


problems of handling huge amounts of data on a single computer.
Distributed processing and large-scale data processing have one major
aspect in common, which is that not all of the input data is available at
once.
In distributed processing, the data might be scattered among many
machines.
In large-scale data processing, most of the data is on the disk. In both
cases, the key to efficient data processing is placing the data correctly.

2.2.1.4 MapReduce

GQ. Write a short note on MapReduce.


MapReduce is a distributed programming framework that focuses on data placement and distribution. As we saw in the last few examples, proper data placement can make some problems very simple to compute.
By focusing on data placement, MapReduce can unlock the parallelism in some common tasks and make it easier to process large amounts of data.
MapReduce gets its name from the two pieces of code that a user needs to write in order to use the framework: the Mapper and the Reducer.
The MapReduce library automatically launches many Mapper and Reducer tasks on a cluster of machines.
The interesting part about MapReduce, though, is the path the data takes between the Mapper and the Reducer.
Before we look at how the Mapper and Reducer work, let's look at the foundations of the MapReduce idea. The functions map and reduce are commonly found in functional languages.
In very simple terms, the map function transforms a list of items into another list of items of the same length. The reduce function transforms a list of items into a single item.
The MapReduce framework isn't quite so strict with its definitions: both Mappers and Reducers can return an arbitrary number of items. However, the general idea is the same.

Fig. 2.2.2: MapReduce (Input, Map, Shuffle, Reduce, Output).

The MapReduce steps are summarized in the figure given above. We assume that the data comes in a set of records.
The records are sent to the Mapper, which transforms these records into pairs, each with a key and a value.
The next step is the shuffle, which the library performs by itself. This operation uses a hash function so that all pairs with the same key end up next to each other and on the same machine.
The final step is the reduce stage, where the records are processed again, but this time in batches, meaning all pairs with the same key are processed at once.
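These steps can be imitated on a single machine. The toy sketch below is not the real MapReduce library API; it counts term occurrences, with a sort standing in for the shuffle so that pairs sharing a key are grouped for the Reducer.

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    """Transform one input record (a line of text) into (key, value) pairs."""
    for word in record.lower().split():
        yield (word, 1)

def reducer(key, values):
    """Process all pairs sharing one key in a single batch."""
    return (key, sum(values))

def map_reduce(records):
    # Map step: every record becomes a list of (key, value) pairs.
    pairs = [kv for r in records for kv in mapper(r)]
    # Shuffle step: sorting brings pairs with the same key together
    # (the real library uses hashing to do this across machines).
    pairs.sort(key=itemgetter(0))
    # Reduce step: each key's batch of values is processed at once.
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

out = map_reduce(["to be or not to be", "to do"])
```

Swapping the Mapper to emit (term, document) pairs and the Reducer to collect postings turns the same skeleton into a distributed index builder.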


2.2.2 Compression Techniques

GQ. Discuss various index compression techniques.

Compression techniques are the most powerful tool for managing the memory hierarchy. The inverted lists for a large collection are themselves very large.
In fact, when it includes information about word position and document extents, the index can be comparable in size to the document collection.
Compression allows the same inverted list data to be stored in less space. The obvious benefit is that this could reduce disk or memory requirements, which would save money.
More importantly, compression allows data to move up the memory hierarchy. If index data is compressed by a factor of four, we can store four times more useful data in the processor cache, and we can feed data to the processor four times faster.
On disk, compression also squeezes data closer together, which reduces seek times.
Unfortunately, nothing is free. The space savings of compression comes at a cost: the processor must decompress the data in order to use it. Therefore, it isn't enough to pick the compression technique that can store the most data in the smallest amount of space.
In order to increase overall performance, we need to choose a compression technique that reduces space and is easy to decompress. We consider only lossless compression techniques. Lossless techniques store data in less space, but without losing information.
There are also lossy data compression techniques, which are often used for video, images, and audio. These techniques achieve very high compression ratios, but do this by throwing away the least important data.
Inverted list pruning techniques, which we discuss later, could be considered a lossy compression technique, but typically when we talk about compression, we mean only lossless methods.

2.2.2.1 Dictionary Compression

This section presents a series of dictionary data structures that achieve increasingly higher compression ratios. The dictionary is small compared with the postings file.
So why compress it if it is responsible for only a small percentage of the overall space requirements of the IR system?
One of the primary factors in determining the response time of an IR system is the number of disk seeks necessary to process a query. If parts of the dictionary are on disk, then many more disk seeks are necessary in query evaluation.
Thus, the main goal of compressing the dictionary is to fit it in main memory, or at least a large portion of it, to support high query throughput.
Although dictionaries of very large collections fit into the memory of a standard desktop machine, this is not true of many other application scenarios. For example, an enterprise search server for a large corporation may have to index a multi-terabyte collection with a comparatively large vocabulary because of the presence of documents in many different languages.
We also want to be able to design search systems for limited hardware such as mobile phones and onboard computers. Other reasons for wanting to conserve memory are fast startup time and having to share resources with other applications.

2.2.2.2 Bit-Aligned Codes

In bit-aligned codes, code words are not restricted to end on byte boundaries. In all of the techniques we'll discuss, we are looking at ways to store small numbers in inverted lists (such as word counts, word positions, and delta-encoded document numbers) in as little space as possible.
One of the simplest codes is the unary code. You are probably familiar with binary, which encodes numbers with two symbols, typically 0 and 1.
A unary number system is a base-1 encoding, which means it uses a single symbol to encode numbers. Here are some examples:

Number  Code
1       10
2       110
3       1110
4       11110
5       111110

In general, to encode a number k in unary, we output k 1s, followed by a 0. We need the 0 at the end to make the code unambiguous.
This code is very efficient for small numbers such as 0 and 1, but quickly becomes very expensive. For instance, the number 1023 can be represented in 10 binary bits, but requires 1024 bits to represent in unary code.
Now we know about two kinds of numeric encodings. Unary is convenient because it is compact for small numbers and is inherently unambiguous. Binary is a better choice for large numbers, but it is not inherently unambiguous.
A reasonable compression scheme needs to encode frequent numbers with fewer bits than infrequent numbers, which means binary encoding is not useful on its own for compression.
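A sketch of unary encoding and decoding, using strings of "0"/"1" characters for readability rather than packed bits (a real index would pack these into machine words):

```python
def unary_encode(k):
    """Encode k as k ones followed by a terminating zero."""
    return "1" * k + "0"

def unary_decode(bits):
    """Decode a stream of concatenated unary code words."""
    numbers, count = [], 0
    for bit in bits:
        if bit == "1":
            count += 1
        else:
            # The 0 terminates one code word, making the
            # stream unambiguous without any separators.
            numbers.append(count)
            count = 0
    return numbers

# "1110" + "0" + "111110" concatenate into one stream that still
# decodes unambiguously back to the numbers 3, 0, 5.
codes = unary_encode(3) + unary_encode(0) + unary_encode(5)
```

The expense the text mentions is visible directly: unary_encode(1023) produces a 1024-character string where 10 binary bits would do.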

A 2.2.2.3 Variable-Byte Code

The codes described in the previous sections are bit-aligned as they do


not represent an integer using a multiple of a fixed number of bits, e.g., a
byte.
But reading a stream
of bits in chunks where each chunk is a byte of
memory (or a multiple of a byte, c.g., a memory word - 4 or 8 bytes), is
simpler and faster bccause the data itself is written in memory in this
way.
IR(MU-TY, B.SC.-Comp-SEM 6) (Doc. Indexing,Storage
&
Compre))..Page no.
(2-10)
Therefore, it could be preferable to use byte-aligned or word-aligned
codes when decoding speed is the main concern rather than compression
effectiveness.
Variable byte (VB) encoding uses an integral number of bytes to encode
a gap.

The last 7 bits of a byte are "payload" and encode part of the gap.
The first bit of the byte is a continuation bit. It is set to 1 for the last byte
of the encoded gap and to 0 otherwise.
To decode a variable byte code, we read a sequence of bytes with
continuation bit 0 terminated by a byte with continuation bit 1.
We then extract and concatenate the 7-bit parts.
The main advantage of Variable-Byte codes is decoding speed: we just
need to read one byte at a time until we find a byte whose continuation
bit is set (i.e., a value of at least 128).

Conversely, the number of bits used to encode an integer cannot be less
than 8; thus Variable-Byte is only suitable for large numbers, and its
compression ratio may not be competitive with that of bit-aligned
codes for small integers.
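The scheme just described can be sketched as follows, using the convention above (7-bit payloads; continuation bit set to 1 only on the last byte of a gap). This is a minimal illustration, not a production codec:

```python
def vb_encode(n):
    """Variable-byte encode one gap: 7-bit payloads, high bit = 1 on last byte."""
    out = []
    while True:
        out.insert(0, n % 128)   # prepend the next 7-bit payload
        if n < 128:
            break
        n //= 128
    out[-1] += 128               # set the continuation bit on the final byte
    return bytes(out)

def vb_decode(stream):
    """Decode a concatenation of variable-byte codes back into gaps."""
    numbers, n = [], 0
    for b in stream:
        if b < 128:              # continuation bit 0: more bytes follow
            n = n * 128 + b
        else:                    # continuation bit 1: last byte of this gap
            n = n * 128 + (b - 128)
            numbers.append(n)
            n = 0
    return numbers

gaps = [5, 214577]
stream = b"".join(vb_encode(g) for g in gaps)
print(list(stream))       # [133, 13, 12, 177]
print(vb_decode(stream))  # [5, 214577]
```

Note that the small gap 5 costs one byte while 214577 costs three, and the decoder needs no separators between the two codes.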

2.3 DOCUMENT REPRESENTATION AND TERM WEIGHTING

A 2.3.1 Document Representation

It is concerned with how textual documents should be represented in
various tasks, e.g. text processing, retrieval, and knowledge discovery
and mining.

Its prevailing approach is the vector space model, i.e. a document di is
represented as a vector of term weights over the set of terms that occur
at least once in the document collection D.

A 2.3.2 Term Weighting

GQ. Explain the Term Weighting w.r.t document Indexing.

Term weighting is a procedure that takes place during the text indexing
process in order to assess the value of each term to the document.
Term weighting is the assignment of numerical values to terms that
represent their importance in a document in order to improve retrieval
effectiveness.
Essentially it considers the relative importance of individual words in an
information retrieval system, which can improve system effectiveness,
since not all the terms in a given document collection are of equal
importance.
Index term weights reflect the relative importance of words in
documents, and are used in computing scores for ranking.
The specific form of a weight is determined by the retrieval model. The
weighting component calculates weights using the document statistics
and stores them in lookup tables.
Weighing the terms is the means that enables the retrieval system to
determine the importance of a given term in a certain document or a
query.
It is a crucial component of any information retrieval system, a
component that has shown great potential for improving the retrieval
effectiveness of an information retrieval system.
Each term in a document is assigned a weight that depends on the
number of occurrences of the term in the document. The simplest scheme
assigns the weight to be equal to the number of occurrences of term t in
document d.

This weighting scheme is referred to as Term Frequency and is denoted
tf_{t,d}, with the subscripts denoting the term (t) and the document (d) in
order.

Document frequency : The document frequency df_t is defined to be the
number of documents in the collection that contain a term t.


Denoting as usual the total number of documents in a collection by N,
we define the inverse document frequency (idf) of a term t as follows:

idf_t = log (N / df_t)

Tf-idf weighting

We now combine the definitions of term frequency and inverse
document frequency, to produce a composite weight for each term in
each document.
The tf-idf weighting scheme assigns to term t a weight in document d
given by

tf-idf_{t,d} = tf_{t,d} × idf_t
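Putting the two definitions together, here is a small Python sketch over a toy three-document collection; the documents and the choice of base-10 logarithm are our own illustrative assumptions:

```python
import math

# Toy collection: docid -> tokenized text
docs = {
    1: ["do", "you", "quarrel", "sir"],
    2: ["quarrel", "sir", "no", "sir"],
    3: ["well", "sir"],
}
N = len(docs)

def tf(t, d):
    """Term frequency tf_{t,d}: occurrences of t in document d."""
    return docs[d].count(t)

def idf(t):
    """Inverse document frequency: log(N / df_t)."""
    df_t = sum(1 for words in docs.values() if t in words)
    return math.log10(N / df_t)

def tf_idf(t, d):
    return tf(t, d) * idf(t)

print(round(tf_idf("quarrel", 2), 3))  # 0.176 -> tf = 1, idf = log10(3/2)
print(tf_idf("sir", 2))                # 0.0   -> "sir" occurs in every document
```

Note how the weight of "sir" collapses to zero: a term that occurs in every document carries no discriminating power, which is exactly the intuition behind idf.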
Chapter Ends...
UNIT 1

CHAPTER 3 : Retrieval Models

Syllabus

Retrieval Models : Boolean model: Boolean operators, query
processing, Vector space model: TF-IDF, cosine similarity, query
document matching, Probabilistic model: Bayesian retrieval, relevance
feedback.

3.1 BOOLEAN MODEL

GQ. Explain Boolean Model & explain its operators.

Apart from the implicit Boolean filters applied by Web search engines,
explicit support for Boolean queries is important in specific application
areas such as digital libraries and the legal domain.
In contrast to ranked retrieval, Boolean retrieval returns sets of
documents rather than ranked lists. Under the Boolean retrieval model, a
term t is considered to specify the set of documents containing it.
The standard Boolean operators (AND, OR, and NOT) are used to
construct Boolean queries, which are interpreted as operations over
these sets, as follows:

3.2 BOOLEAN OPERATORS

GQ. Explain Boolean Operators with examples.

A AND B : intersection of A and B (A ∩ B)
A OR B : union of A and B (A ∪ B)
NOT A : complement of A with respect to the document collection (Ā)

where A and B are terms or Boolean queries.
oT

Table 3.2.1 : Text fragment from Shakespeare's Romeo and Juliet.

Document | Content
1 | Do you quarrel, sir?
2 | Quarrel, sir! no, sir!
3 | If you do, sir, I am for you: I serve as good a man as you.
4 | No better.
5 | Well, sir.

For example, over the collection in the above given Table, the query
("quarrel" OR "sir") AND "you" specifies the set {1, 3},
whereas the query ("quarrel" OR "sir") AND NOT "you" specifies the
set {2, 5}.
Our algorithm for solving Boolean queries is another variant of the
phrase searching algorithm.
The algorithm locates candidate solutions to a Boolean query, where
each candidate solution represents a range of documents that together
satisfy the Boolean query, such that no smaller range of documents
contained within it also satisfies the query.
When the range represented by a candidate solution has a length of 1,
this single document satisfies the query and should be included in the
result set.
To simplify our definition of our Boolean search algorithm, we define
two functions that operate over Boolean queries, extending the nextDoc
and prevDoc methods of schema-dependent inverted indices.

docRight(Q, u) - end point of the first candidate solution to Q starting after
document u
docLeft(Q, v) - start point of the last candidate solution to Q ending before
document v


For terms we define:

docRight(t, u) = nextDoc(t, u)
docLeft(t, v) = prevDoc(t, v)

and for the AND and OR operators we define:

docRight(A AND B, u) = max(docRight(A, u), docRight(B, u))
docLeft(A AND B, v) = min(docLeft(A, v), docLeft(B, v))
docRight(A OR B, u) = min(docRight(A, u), docRight(B, u))
docLeft(A OR B, v) = max(docLeft(A, v), docLeft(B, v))

To determine the result for a given query, these definitions are applied
recursively. For example:

docRight(("quarrel" OR "sir") AND "you", 1)
= max(docRight("quarrel" OR "sir", 1), docRight("you", 1))
= max(min(docRight("quarrel", 1), docRight("sir", 1)), nextDoc("you", 1))
= max(min(nextDoc("quarrel", 1), nextDoc("sir", 1)), 3)
= max(min(2, 2), 3)
= 3

docLeft(("quarrel" OR "sir") AND "you", 4)
= min(docLeft("quarrel" OR "sir", 4), docLeft("you", 4))
= min(max(docLeft("quarrel", 4), docLeft("sir", 4)), prevDoc("you", 4))
= min(max(prevDoc("quarrel", 4), prevDoc("sir", 4)), 3)
= min(max(2, 3), 3)
= 3
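These recursive definitions translate almost line for line into code. Here is a sketch in Python, with toy postings lists consistent with the worked example above; nextDoc/prevDoc are implemented by binary search, whereas a real engine would consult the inverted index:

```python
import bisect

# Toy postings lists (sorted document numbers), consistent with the
# worked example: nextDoc("quarrel", 1) = 2, nextDoc("you", 1) = 3, etc.
postings = {
    "quarrel": [1, 2],
    "sir":     [1, 2, 3, 5],
    "you":     [1, 3],
}

INF, NEG_INF = float("inf"), float("-inf")

def next_doc(t, u):
    """Smallest document number > u that contains term t (nextDoc)."""
    p = postings[t]
    i = bisect.bisect_right(p, u)
    return p[i] if i < len(p) else INF

def prev_doc(t, v):
    """Largest document number < v that contains term t (prevDoc)."""
    p = postings[t]
    i = bisect.bisect_left(p, v)
    return p[i - 1] if i > 0 else NEG_INF

# A query is either a bare term or a tuple ("AND", A, B) / ("OR", A, B).
def doc_right(q, u):
    if isinstance(q, str):
        return next_doc(q, u)
    op, a, b = q
    combine = max if op == "AND" else min   # AND -> max, OR -> min
    return combine(doc_right(a, u), doc_right(b, u))

def doc_left(q, v):
    if isinstance(q, str):
        return prev_doc(q, v)
    op, a, b = q
    combine = min if op == "AND" else max   # AND -> min, OR -> max
    return combine(doc_left(a, v), doc_left(b, v))

q = ("AND", ("OR", "quarrel", "sir"), "you")
print(doc_right(q, 1))  # 3
print(doc_left(q, 4))   # 3
```

Both calls reproduce the hand computation above, returning document 3 in each case.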

3.3 QUERY PROCESSING

GQ. Explain Query Processing in IR.

Efficient query processing is a particularly important problem in web


search, as it has reached a scale that would have been hard to imagine
just 10 years ago.

People all over the world type in over half a billion queries every day,
searching indexes containing billions of web pages.
Inverted indexes are at the core of all modern web search engines. The
query processing algorithm depends on the retrieval model, and dictates
the contents of the index.
This works in reverse, too, since we are unlikely to choose a retrieval
model that has no efficient query processing algorithm.
Traditional information retrieval systems usually follow the disjunctive
approach, while Web search engines often employ conjunctive query
semantics.
The conjunctive retrieval model leads to faster query processing than the
disjunctive model, because fewer documents have to be scored and
ranked.

However, this performance advantage comes at the cost of a lower


recall: If a relevant document contains only two of the three query terms.
it will never be returned to the user.
This limitation is quite obvious for the query Q shown above. Of the
half-million documents in the TREC collection, 7,834 match the
disjunctive interpretation of the query, whereas only a single document
matches the conjunctive version. Incidentally, that document is not even
relevant.

a 3.3.1 Document-at-a-Time Query Processing

The most common form of query processing for ranked retrieval is


called the document-at-a-time approach.
In this method all matching documents are enumerated, one after the
other, and a score is computed for each of them.
At the end all documents are sorted according to their score, and the top
k results (where k is chosen by the user or the application) are returned
to the user.

rankBM25_DocumentAtATime ((t1, ..., tn), k) ≡
1   m ← 0    // m is the total number of matching documents
2   d ← min(1≤i≤n) { nextDoc(ti, −∞) }
3   while d < ∞ do
4       results[m].docid ← d
5       results[m].score ← Σ(i=1..n) log(N / N_ti) · TF_BM25(ti, d)
6       m ← m + 1
7       d ← min(1≤i≤n) { nextDoc(ti, d) }
8   sort results[0..(m − 1)] in decreasing order of score
9   return results[0..(k − 1)]

Fig. 3.3.1 : Document-at-a-time query processing with BM25.

The overall time complexity of the algorithm is

O(m · n + m · log(m))

where n is the number of query terms and m is the number of matching
documents (containing at least one query term).
The term m · n corresponds to the loop starting in line 3 of the
algorithm. The term m · log(m) corresponds to the sorting of the search
results in line 8.

A 3.3.2 Efficient Query Processing with Heaps


We can use heaps to overcome the limitations of the previous
algorithm.
In the revised version of the algorithm, we employ two heaps: one to
manage the query terms and, for each term t, keep track of the next
document that contains t; the other one to maintain the set of the top k
search results seen so far.
The terms heap contains the set of query terms, ordered by the next
document in which the respective term appears (nextDoc). It allows us
to perform an efficient multiway merge operation on the n postings lists.
The results heap contains the top k documents encountered so far,
ordered by their scores. It is important to note that the results heap's
root node does not contain the best document seen so far, but the
kth-best document seen so far.
This allows us to maintain and continually update the top k search
results by replacing the lowest-scoring document in the top k (and
restoring the heap property) whenever we find a new document that
scores better than the old one.

The worst-case time complexity of the revised version of the document-
at-a-time algorithm is

O(Nq · log(n) + Nq · log(k))

where Nq denotes the total number of postings for all query terms.
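The role of the results heap can be seen in a few lines of Python, using the standard heapq module; the (docid, score) pairs below are made-up stand-ins for what the postings merge would produce:

```python
import heapq

matches = [(1, 2.3), (2, 0.7), (3, 4.1), (4, 1.9), (5, 3.2)]  # (docid, score)
k = 2

heap = []  # min-heap of (score, docid); its root is the kth-best result so far
for docid, score in matches:
    if len(heap) < k:
        heapq.heappush(heap, (score, docid))
    elif score > heap[0][0]:
        # new document beats the current kth-best: evict it and restore the heap
        heapq.heapreplace(heap, (score, docid))

top_k = sorted(heap, reverse=True)
print(top_k)  # [(4.1, 3), (3.2, 5)]
```

Each matching document costs at most O(log k) heap work, which is where the log(k) factor in the complexity bound comes from.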

a 3.3.3 Term-at-a-Time Query Processing

As an alternative to the document-at-a-time approach, some search
engines process queries in a term-at-a-time fashion.
Instead of merging the query terms' postings lists by using a heap, the
search engine examines, in turn, all (or some) of the postings for each
query term. It maintains a set of document score accumulators.
For each posting inspected, it identifies the corresponding accumulator
and updates its value according to the posting's score contribution to the
respective document.
the
When all query terms have been prOcessed, the accumulators contain
a heap may be used to
final scores of all matching documents, and
collect the top k search results.
One of the motivations behind the term-at-a-time approach is that the
index is stored on disk and that the query terms' postings lists may be
too large to be loaded into memory in their entirety.
In that situation a document-at-a-time implementation would need to
jump back and forth between the query terms' postings lists, reading a
small number of postings into memory after each such jump, and
incurring the cost of a nonsequential disk access (disk seek).
For short queries, containing two or three terms, this may not be a
problem, as we can keep the number of disk seeks low by allocating an
appropriately sized read-ahead buffer for each postings list.

However, for queries containing more than a dozen terms (e.g., after
applying pseudo-relevance feedback; see Section 8.6), disk seeks may
become a problem.
A term-at-a-time implementation does not exhibit any nonsequential
disk access pattern. The search engine processes each term's postings
list in a linear fashion, moving on to the next term when it is done with
the current one.
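The accumulator pattern described above can be sketched as follows; the postings and partial score contributions are made-up numbers, purely for illustration:

```python
from collections import defaultdict

# term -> list of (docid, partial score contribution); values are illustrative
postings = {
    "william":     [(1, 1.2), (3, 0.8)],
    "shakespeare": [(1, 2.0), (2, 0.5)],
}

accumulators = defaultdict(float)
for term, plist in postings.items():    # one term at a time
    for docid, partial in plist:        # strictly sequential scan of the list
        accumulators[docid] += partial  # update this document's accumulator

# once every term is processed, the accumulators hold the final scores
ranked = sorted(accumulators.items(), key=lambda x: -x[1])
print([doc for doc, _ in ranked])  # [1, 3, 2]
```

Each postings list is read exactly once, front to back, which is what makes the disk access pattern purely sequential.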

3.4 VECTOR SPACE MODEL

GQ. Write a brief note on Vector Space Model.

The vector space model is one of the oldest and best known of the
information retrieval models.
The vector space model is intimately associated with the field as a whole
and has been adapted to many IR problems beyond ranked retrieval,
including document clustering and classification, in which it continues
to play an important role.

In recent years, the vector space model has been largely overshadowed
by probabilistic models, language models, and machine learning
approaches.
Naturally, for a collection of even modest size, this vector space model
produces vectors with millions of dimensions.
This high-dimensionality might appear inefficient at first glance, but in
many circumstances the query vector is sparse, with all but a few
components being zero.
For example, the vector corresponding to the query "william",
"shakespeare", "marriage" has only three nonzero components.
To compute the length of this vector, or its dot product with a document
vector, we need only consider the components corresponding to these
three terms.
On the other hand, a document vector typically has a nonzero
component for each unique term contained in the document, which may
consist of thousands of terms. However, the length of a document vector
is independent of the query.

It may be precomputed and stored in a frequency or positional index
along with other document-specific information, or it may be applied to
normalize the document vector in advance, with the components of the
normalized vector taking the place of term frequencies in the postings
lists.
As a ranking method the cosine similarity measure has intuitive appeal
and natural simplicity. If we can appropriately represent queries and
documents as vectors, cosine similarity may be used to rank the
documents with respect to the queries.
In representing a document or query as a vector, a weight must be
assigned to each term that represents the value of the corresponding
component of the vector.
Throughout the long history of the vector space model, many formulae
for assigning these weights have been proposed and evaluated.
With few exceptions, these formulae may be characterized as belonging
to a general family known as TF-IDF weights.

When assigning a weight in a document vector, the TF-IDF weights are
computed by taking the product of a function of the term frequency
f_{t,d} and a function of the inverse document frequency.

When assigning a weight to a query vector, the within-query term
frequency q_t may be substituted for f_{t,d}, in essence treating the query as
a tiny document. It is also possible (and not at all unusual) to use
different TF and IDF functions to determine weights for document
vectors and query vectors.
TF-IDF = (function of term frequency) × (function of inverse document frequency)

We emphasize that a TF-IDF weight is a product of functions of term
frequency and inverse document frequency.
A common error is to use the raw f_{t,d} value for the term frequency
component, which may lead to poor performance.
Over the years a number of variants for both the TF and the IDF
functions have been proposed and evaluated.


The IDF functions typically relate the document frequency to the total
number of documents in the collection (N).
The basic intuition behind the IDF functions is that a term appearing in
many documents should be assigned a lower weight than a term
appearing in few documents. Of the two functions, IDF comes closer to
having a "standard form":

IDF = log (N / N_t)
The first one, ranked retrieval, allows the search engine to rank search
results according to their predicted relevance to the query. The second
one, lightweight structure, is a natural extension of the Boolean model to
the sub-document level.
Instead of restricting the search process to entire documents, it allows
the user to search for arbitrary text passages satisfying Boolean-like
constraints (e.g., "show me all passages that contain 'apothecary' and
'drugs' within 10 words").
Cosine similarity

GQ. What do you mean by Cosine Similarity in Vector space Model ?

Cosine similarity is a metric that measures the similarity between two


vectors in a multi-dimensional space, such as the vectors representing
documents in the VSM.
In the context of VSM, it quantifies how alike two documents are based
on their vector representations.
The key idea behind cosine similarity is to calculate the cosine of the
angle between two vectors.
If the vectors are very similar, their angle will be small, and the cosine
value will be close to 1. Conversely, if the vectors are dissimilar, the
angle will be large, and the cosine value will approach 0.
How is Cosine Similarity Calculated ?

The formula for calculating cosine similarity between two vectors A and
B is as follows:

Cosine Similarity (A, B) = (A · B) / (||A|| · ||B||)

Where :

A · B represents the dot product of vectors A and B.

||A|| and ||B|| represent the Euclidean norms (magnitudes) of vectors A
and B, respectively.

The cosine similarity value ranges from -1 (completely dissimilar) to 1
(completely similar). A higher cosine similarity score indicates greater
similarity between the two vectors.
Cosine Similarity in a Vector Space Model

In a VSM, cosine similarity is crucial for information retrieval and
document ranking. Here's how it works in practice:


Vector Representation :
We represent documents and queries as
vectors using techniques like TF-IDF. Each document in the corpus and
the query are converted into vectors in the same high-dimensional space.
Cosine Similarity Calculation : To determine the relevance of a
document to a query, we calculate the cosine similarity between the
query vector and the vectors representing each document in the corpus.
Ranking: Documents with higher cosine similarity scores to the query
are considered more relevant and are ranked higher. Those with lower
Scores are ranked lower.
:
Cosine similarity has several advantages when applied to text data
Scale Invariance : Cosine similarity is scale-invariant, meaning it's not
affected by the magnitude of the vectors. This makes it suitable for
documents of different lengths.
Angle Measure : It focuses on the direction of vectors rather than their
absolute values, which is crucial for text similarity, where document
length can vary.
Efficiency : Calculating cosine similarity is computationally efficient,
making it suitable for large-scale text datasets.
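The formula above is direct to implement. A small sketch follows; the two vectors are made-up tf-idf weights over three hypothetical terms:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc   = [0.5, 1.2, 0.0]  # e.g. weights for ("william", "shakespeare", "marriage")
query = [1.0, 1.0, 1.0]
print(round(cosine_similarity(doc, query), 3))  # 0.755
```

Scaling either vector by a constant leaves the result unchanged, which is the scale invariance noted above.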
Query-document matching
In the vector space model, there is an implicit assumption that relevance
is related to the similarity of query and document vectors.

In other words, documents "closer" to the query are more likely to be
relevant. This is primarily a model of topical relevance, although
features related to user relevance could be incorporated into the vector
representation.
Relevance feedback is a technique for query modification based on
user-identified relevant documents.
This technique was first introduced using the vector space model. The
well-known Rocchio algorithm was based on the concept of an optimal
query, which maximizes the difference between the average vector
representing the relevant documents and the average vector representing
the non-relevant documents.

3.5 PROBABILISTIC MODEL

GQ. Describe the types of Probabilistic Model in Information Retrieval.
One of the features that a retrieval model should provide is a clear
statement about the assumptions upon which it is based. The Boolean
and vector space approaches make implicit assumptions about relevance
and text representation that impact the design and effectiveness of
ranking algorithms.
The ideal situation would be to show that, given the assumptions, a
ranking algorithm based on the retrieval model will achieve better
effectiveness than any other approach.
One early theoretical statement about effectiveness, known as the
Probability Ranking Principle (Robertson, 1977/1997), encouraged the
development of probabilistic retrieval models, which are the dominant
paradigm today.
These models have achieved this status because probability theory is a
strong foundation for representing and manipulating the uncertainty that
is an inherent part of the information retrieval process.

Bayesian retrieval
In any retrieval model that assumes relevance is binary, there will be
two sets of documents, the relevant documents and the non-relevant
documents, for each query.


Given a new document, the task of a search engine could be described as
deciding whether the document belongs in the relevant set or the non-
relevant set.
That is, the system should classify the document as relevant or non-
relevant, and retrieve it if it is relevant.
Given some way of calculating the probability that the document is
relevant and the probability that it is non-relevant, it would seem
reasonable to classify the document into the set that has the highest
probability.
In other words, we would decide that a document D is relevant
if P(R|D) > P(NR|D), where P(R|D) is a conditional probability
representing the probability of relevance given the representation of the
document, and P(NR|D) is the conditional probability of non-relevance.
This is known as the Bayes Decision Rule, and a system that classifies
documents this way is called a Bayes classifier.

[Figure: a document is classified into the relevant documents with probability P(R|D), or into the non-relevant documents with probability P(NR|D).]

Fig. 3.5.1 : Classifying a document as relevant or non-relevant
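The decision rule itself is a one-line comparison. Here is a toy sketch with invented probabilities; estimating P(R|D) is the hard part and is not shown:

```python
# Hypothetical, precomputed probabilities of relevance P(R|D) per document
p_relevant = {"d1": 0.7, "d2": 0.4, "d3": 0.55}

def classify(doc):
    """Bayes decision rule: relevant iff P(R|D) > P(NR|D) = 1 - P(R|D)."""
    p_r = p_relevant[doc]
    return "relevant" if p_r > 1 - p_r else "non-relevant"

for d in sorted(p_relevant):
    print(d, classify(d))  # d1 relevant, d2 non-relevant, d3 relevant
```

With binary relevance, the rule reduces to retrieving every document whose probability of relevance exceeds one half.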

Relevance feedback

It is possible to represent the topic of a query as a language model.
Instead of calling this the query language model, we use the name
relevance model, since it represents the topic covered by relevant
documents.
The query can be viewed as a very small sample of text generated from
the relevance model, and relevant documents are much larger samples of
text from the same model.


Given some examples of relevant documents for a query, we could
estimate the probabilities in the relevance model and then use this model
to predict the relevance of new documents.
In fact, this is a version of the classification model where we interpret
P(D|R) as the probability of generating the text in a document given a
relevance model.
This is also called the document likelihood model. Although this model,
unlike the binary independence model, directly incorporates term
frequency, it turns out that P(D|R) is difficult to calculate and compare
across documents.
This is because documents contain a large and extremely variable
number of words compared to a query.
Consider two documents Da and Db, for example, containing 5 and 500
words respectively. Because of the large difference in the number of
words involved, the comparison of P(Da|R) and P(Db|R) for ranking will
be more difficult than comparing P(Q|Da) and P(Q|Db), which use the
same query and smoothed representations for the documents.
In addition, we still have the problem of obtaining examples of relevant
documents.
Ranking based on relevance models actually requires two passes.
The first pass ranks documents using query likelihood to obtain the
weights that are needed for relevance model estimation.
In the second pass, we use KL-divergence to rank documents by
comparing the relevance model and the document model.
Note also that we are in effect adding words to the query by smoothing
the relevance model using documents that are similar to the query.
Many words that had zero probabilities in the relevance model based on
query frequency estimates will now have non-zero values.
What we are describing here is exactly the pseudo-relevance feedback
process.
In other words, relevance models provide a formal retrieval model for
pseudo-relevance feedback and query expansion.
Chapter Ends...

UNIT 1

CHAPTER 4 : Spelling Correction in IR Systems

Syllabus

Spelling Correction in IR Systems : Challenges of spelling errors in
queries and documents, Edit distance and string similarity measures,
Techniques for spelling correction in IR systems.
4.1 SPELLING CORRECTION

GQ. What do you mean by Spelling Correction in IR? Discuss its challenges.

We look at the problem of correcting spelling errors in queries. For
instance, we may wish to retrieve documents containing the term carrot
when the user types the query carot.
Google reports that the following are all treated as misspellings of the
query britney spears: britian spears, britney's spears, brandy spears and
prittany spears.
We look at two steps to solving this problem: the first based on edit
distance and the second based on k-gram overlap.

Before getting into the algorithmic details of these methods, we first


review how search engines provide spell-correction as part of a user
experience.

4.2 CHALLENGES OF SPELLING ERRORS IN QUERIES AND DOCUMENTS

Spell checking is an extremely important part of query processing.
Approximately 10-15% of queries submitted to web search engines
contain spelling errors, and people have come to rely on the "Did you
mean: ..." feature to correct these errors.
These errors are similar to those that may be found in a word processing
document.
In addition, however, there will be many queries containing words
related to websites, products, companies, and people that are unlikely to
be found in any standard spelling dictionary.
Some examples from the same query log are :

1. realstateisting.bc.com
2. akia 1080i manunal
3. ultimatwarcade
4. mainscourcebank
5. dellottitouche
The wide variety in the type and severity of possible spelling errors in
queries presents a significant challenge.
In order to discuss which spelling correction techniques are the most
effective for search engine queries, we first have to review how spelling
correction is done for general text.
The basic approach used in many spelling checkers is to suggest
corrections for words that are not found in the spelling dictionary.
Suggestions are found by comparing the word that was not found in the
dictionary towords that are in the dictionary using a similarity measure.
A given spelling error may have many possible corrections. For
example, the spelling error "lawers" has the following possible
corrections (among others) at edit distance 1: lowers, lawyers,
layers, lasers, lagers.
The spelling corrector has to decide whether to present all of these to the
user, and in what order to present them.

The noisy channel model for spelling correction is a general framework
that can address the issues of ranking, context, and run-on errors.
The model is called a "noisy channel" because it is based on Shannon's
theory of communication. The intuition is that a person chooses a word
w to output (i.e., write), based on a probability distribution P(w).
The person then tries to write the word w, but the noisy channel
(presumably the person's brain) causes the person to write the word e
instead, with probability P(e|w).
The probabilities P(w), called the language model, capture information
about the frequency of occurrence of a word in text (e.g., what is the
probability of the word "lawyer" occurring in a document or query?)
and contextual information such as the probability of observing a word
given that another word has just been observed (e.g., what is the
probability of "lawyer" following the word "trial"?).
The probabilities P(e|w), called the error model, represent information
about the frequency of different types of spelling errors.
The probabilities for words (or strings) that are edit distance 1 away
from the word w will be quite high, for example. Words with higher edit
distances will generally have lower probabilities, although homophones
will have high probabilities.
Note that the error model will have probabilities for writing the correct
word (P(w|w)) as well as probabilities for spelling errors.
This enables the spelling corrector to suggest a correction for all words,
even if the original word was correctly spelled. If the highest-probability
correction is the same word, then no correction is suggested to the user.
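Ranking candidate corrections then amounts to maximizing P(e|w) · P(w) over candidate words w. A toy sketch for the "lawers" example follows; every probability below is invented for illustration, not estimated from data:

```python
# P(w): language model over candidate corrections (illustrative values)
language_model = {"lawyers": 0.004, "lowers": 0.001, "lasers": 0.0005}

# P(e|w): error model, probability of typing "lawers" given intended word w
# (the error string is fixed in this toy example, so the table is already
# conditioned on it)
error_model = {"lawyers": 0.02, "lowers": 0.03, "lasers": 0.005}

def correct(error):
    """Return the candidate w maximizing P(e|w) * P(w)."""
    return max(language_model, key=lambda w: error_model[w] * language_model[w])

print(correct("lawers"))  # 'lawyers'
```

Even though "lowers" has the higher error-model probability here, the language model tips the product in favor of "lawyers", which is how context and frequency enter the ranking.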

4.3 EDIT DISTANCE AND STRING SIMILARITY MEASURES

GQ. Explain Edit distance algorithm in detail.
Given two character strings s1 and s2, the edit distance between them is
the minimum number of edit operations required to transform s1 into s2.

Most commonly, the edit operations allowed for this purpose are :

(i) insert a character into a string
(ii) delete a character from a string
(iii) replace a character of a string by another character

For these operations, edit distance is sometimes known as Levenshtein
distance.
For example, the edit distance between cat and dog is 3.
In fact, the notion of edit distance can be generalized to allow different
weights for different kinds of edit operations; for instance, a higher
weight may be placed on replacing the character s by the character p
than on replacing it by the character a (the latter being closer to s on the
keyboard).
Setting weights in this way, depending on the likelihood of letters
substituting for each other, is very effective in practice. However, the
remainder of our treatment here will focus on the case in which all edit
operations have the same weight.

Example 1

Input : str1 = "cat", str2 = "cut"
Output : 1
Explanation : We can convert str1 into str2 by replacing 'a' with 'u'.

Example 2

Input : str1 = "sunday", str2 = "saturday"
Output : 3
Explanation : The last three characters and the first character are the
same. We basically need to convert "un" into "atur". This can be done
using the three operations below: replace 'n' with 'r', insert 't', insert 'a'.
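The two examples above can be computed with the standard dynamic-programming algorithm for Levenshtein distance, sketched here with all operations weighted equally:

```python
def edit_distance(s1, s2):
    """Levenshtein distance: minimum insertions, deletions, replacements."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i              # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j              # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1    # replace if different
            dp[i][j] = min(dp[i - 1][j] + 1,             # delete from s1
                           dp[i][j - 1] + 1,             # insert into s1
                           dp[i - 1][j - 1] + cost)      # match / replace
    return dp[m][n]

print(edit_distance("cat", "cut"))          # 1
print(edit_distance("sunday", "saturday"))  # 3
```

The weighted generalization mentioned above only changes the constants added in the min: each operation would contribute its own weight instead of 1.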

4.4 TECHNIQUES FOR SPELLING CORRECTION IN IR SYSTEMS

GQ. Discuss any two Techniques for Spelling correction in detail.

4.4.1 k-gram Indexes for Spelling Correction


To further limit the set of vocabulary terms for which we compute edit distances to the query term, we now show how to invoke the k-gram index to assist with retrieving vocabulary terms with low edit distance from the query term q.

Once we retrieve such terms, we can then find the ones of least edit distance from q.

In fact, we will use the k-gram index to retrieve vocabulary terms that have many k-grams in common with the query.

We will argue that for reasonable definitions of "many k-grams in common," the retrieval process is essentially that of a single scan through the postings for the k-grams in the query string q.

bo → aboard → about → boardroom → border
or → border → lord → morbid → sordid
rd → aboard → ardent → boardroom → border

Fig. 4.4.1: Matching at least two of the three 2-grams in the query bord

K-Grams

K-grams are k-length subsequences of a string. Here, k can be 1, 2, 3 and so on. For k = 1, each resulting subsequence is called a "unigram"; for k = 2, a "bigram"; and for k = 3, a "trigram". These are the most widely used k-grams for spelling correction, but the value of k really depends on the situation and context.

As an example, consider the string "catastrophic". In this case,

Unigrams: ["c", "a", "t", "a", "s", "t", "r", "o", "p", "h", "i", "c"]
Bigrams: ["ca", "at", "ta", "as", "st", "tr", "ro", "op", "ph", "hi", "ic"]
Trigrams: ["cat", "ata", "tas", "ast", "str", "tro", "rop", "oph", "phi", "hic"]

The 2-gram (or bigram) index shown in the figure above gives (a portion of) the postings for the three bigrams in the query bord.

Suppose we wanted to retrieve vocabulary terms that contained at least two of these three bigrams.

A single scan of the postings would let us enumerate all such terms; in the example of the figure given above, we would enumerate aboard, boardroom and border.
The steps involved for spelling correction are:

Find the k-grams of the misspelled word.
For each k-gram, linearly scan through the postings list in the k-gram index.
Find k-gram overlaps after having linearly scanned the lists (no extra time complexity because we are finding the Jaccard coefficient).
Return the terms with the maximum k-gram overlaps.
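The steps above can be sketched in a few lines of Python (the tiny vocabulary and the Jaccard threshold are illustrative assumptions, not part of the original text):

```python
from collections import defaultdict

def k_grams(term, k=2):
    """The set of k-grams of a term."""
    return {term[i:i + k] for i in range(len(term) - k + 1)}

# Build a 2-gram index over a tiny illustrative vocabulary.
vocabulary = ["aboard", "about", "boardroom", "border",
              "lord", "morbid", "sordid"]
index = defaultdict(set)
for term in vocabulary:
    for gram in k_grams(term):
        index[gram].add(term)

def correct(query, threshold=0.3):
    """Candidate corrections ranked by Jaccard overlap of 2-gram sets."""
    q_grams = k_grams(query)
    candidates = set()
    for gram in q_grams:          # one scan through the relevant postings
        candidates |= index[gram]
    scored = []
    for term in candidates:
        t_grams = k_grams(term)
        jaccard = len(q_grams & t_grams) / len(q_grams | t_grams)
        if jaccard >= threshold:
            scored.append((jaccard, term))
    return [t for _, t in sorted(scored, reverse=True)]

print(correct("bord"))   # 'border' has the largest overlap
```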

4.4.2 Context Sensitive Spelling Correction

Isolated-term correction would fail to correct typographical errors such as flew form Heathrow, where all three query terms are correctly spelled.


When a phrase such as this retrieves few documents, a search engine may like to offer the corrected query flew from Heathrow.

The simplest way to do this is to enumerate corrections of each of the three query terms, even though each query term is correctly spelled, and then try substitutions of each correction in the phrase.
For the example flew form Heathrow, we enumerate such phrases as
fled form Heathrow and flew fore Heathrow.
For each such substitute phrase, the search engine runs the query and
determines the number of matching results.
This enumeration can be expensive if we find many corrections of the
individual terms, since we could encounter a large number of
combinations of alternatives. Several heuristics are used to trim this
Space.
In the example above, as we expand the alternatives for flew and form.
we retain only the most frequent combinations in the collection or in the
query logs, which contain previous queries by users.


4.4.3 Phonetic Correction


Our final technique for tolerant retrieval has to do with phonetic correction: misspellings that arise because the user types a query that sounds like the target term. Such algorithms are especially applicable to searches on the names of people.

The main idea here is to generate, for each term, a "phonetic hash" so that similar-sounding terms hash to the same value.

The idea owes its origins to work in international police departments from the early 20th century, seeking to match names for wanted criminals despite the names being spelled differently in different countries.

It is mainly used to correct phonetic misspellings in proper nouns.

Algorithms for such phonetic hashing are commonly collectively known as soundex algorithms.

However, there is an original soundex algorithm, with various variants, built on the following scheme:

Turn every term to be indexed into a 4-character reduced form.
Build an inverted index from these reduced forms to the original terms; call this the soundex index.
Do the same with query terms.
When the query calls for a soundex match, search this soundex index.

The variations in different soundex algorithms have to do with the conversion of terms to 4-character forms.

A commonly used conversion results in a 4-character code, with the first character being a letter of the alphabet and the other three being digits between 0 and 9.
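One common variant of the conversion described above can be sketched as follows (the letter-to-digit table is the widely used American Soundex mapping; the treatment of edge cases differs between variants):

```python
# Map letters to soundex digits; vowels and H, W, Y map to '0'.
SOUNDEX_MAP = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"),
                       ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        SOUNDEX_MAP[ch] = digit

def soundex(term: str) -> str:
    """4-character phonetic hash: first letter plus three digits."""
    term = term.upper()
    first, rest = term[0], term[1:]
    collapsed = []
    prev = SOUNDEX_MAP.get(first, "0")
    for ch in rest:
        d = SOUNDEX_MAP.get(ch, "0")
        if d != prev:               # collapse runs of identical digits
            collapsed.append(d)
        prev = d
    code = "".join(d for d in collapsed if d != "0")  # drop zeros
    return (first + code + "000")[:4]                 # pad / truncate

print(soundex("Herman"))    # H655
print(soundex("Hermann"))   # H655 -- similar-sounding names collide
```

Similar-sounding names such as Herman and Hermann hash to the same 4-character value, which is exactly the behaviour the soundex index exploits.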
Chapter Ends..
UNIT 1
CHAPTER 5: Performance Evaluation

Syllabus:

Performance Evaluation: Evaluation metrics: precision, recall, F-measure, average precision; Test collections and relevance judgments; Experimental design and significance testing.

5.1 EVALUATION METRICS

5.1.1 Recall and Precision

GQ. What are the various performance evaluation metrics?

GQ. Explain Recall and Precision as evaluation metrics.


The two most common effectiveness measures, recall and precision, were introduced in the Cranfield studies to summarize and compare search results.

Intuitively, recall measures how well the search engine is doing at finding all the relevant documents for a query, and precision measures how well it is doing at rejecting non-relevant documents.

The definition of these measures assumes that, for a given query, there is a set of documents that is retrieved and a set that is not retrieved (the rest of the documents).

This obviously applies to the results of a Boolean search, but the same definition can also be used with a ranked search.
Recall is the proportion of relevant documents that are retrieved. OR Recall (R) is the fraction of relevant documents that are retrieved.

Recall = #(relevant items retrieved) / #(relevant items) = P(retrieved | relevant)

Recall = Number of relevant documents retrieved / Total number of relevant documents

Precision is the proportion of retrieved documents that are relevant. OR Precision (P) is the fraction of retrieved documents that are relevant.

Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant | retrieved)

Precision = Number of relevant documents retrieved / Total number of documents retrieved

There is an implicit assumption in using these measures that the task involves retrieving as many of the relevant documents as possible while minimizing the number of non-relevant documents retrieved.

In other words, even if there are 500 relevant documents for a query, the user is interested in finding them all.

In a result set, recall indicates the fraction of relevant documents that appear in the result set, whereas precision indicates the fraction of the result set that is relevant.

Precision is also known as the positive predictive value, and is often used in medical diagnostic tests, where the probability that a positive test is correct is particularly important.
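The two formulas above can be computed directly from a retrieved set and a relevant set (a minimal sketch; names and document IDs are ours):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)       # relevant items retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved documents are relevant; 3 of the 6 relevant found.
p, r = precision_recall({"d1", "d2", "d3", "d9"},
                        {"d1", "d2", "d3", "d4", "d5", "d6"})
print(p, r)   # 0.75 0.5
```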

5.1.2 F-measure

The F measure is an effectiveness measure based on recall and precision that is used for evaluating classification performance and also in some search applications. It has the advantage of summarizing effectiveness in a single number.

A single measure that trades off precision versus recall is the F measure, which is the weighted harmonic mean of precision and recall.



It is defined as the harmonic mean of recall and precision, which is:

F = 2RP / (R + P)

The harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is affected more by values that are unusually large (outliers).

A search result that returned nearly the entire document collection, for example, would have a recall of 1.0 and a precision near 0.

The arithmetic mean of these values is 0.5, but the harmonic mean will be close to 0.

The harmonic mean is clearly a better summary of the effectiveness of this retrieved set.
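The contrast between the arithmetic and harmonic means can be checked numerically (a small sketch; the precision value 0.001 is an illustrative stand-in for "near 0"):

```python
def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (the balanced F measure)."""
    return 2 * r * p / (r + p) if (r + p) > 0 else 0.0

# Returning nearly the whole collection: recall 1.0, precision near 0.
p, r = 0.001, 1.0
print((p + r) / 2)      # arithmetic mean: ~0.5 (misleadingly high)
print(f_measure(p, r))  # harmonic mean: ~0.002 (close to 0)
```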

5.1.3 Average Precision


For a single information need, Average Precision is the average of the precision values obtained for the set of top k documents existing after each relevant document is retrieved, and this value is then averaged over information needs.

Average precision has a number of advantages. It is a single number that is based on the ranking of all the relevant documents, but the value depends heavily on the highly ranked relevant documents.

Average Precision is calculated as the weighted mean of precisions at each threshold; the weight is the increase in recall from the prior threshold.

Mean Average Precision is the average of the AP of each class. However, the interpretation of AP and mAP varies in different contexts.

The mAP is calculated by finding the Average Precision (AP) for each class and then averaging over the number of classes N:

mAP = (1/N) * Σ (i = 1 to N) APi

Mean Average Precision Formula

The mAP incorporates the trade-off between precision and recall and considers both false positives (FP) and false negatives (FN).

This property makes mAP a suitable metric for most detection applications.
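Average Precision for a single ranking can be sketched as follows (a minimal illustration; the document IDs are invented, and mAP would then average this value over queries or classes):

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant
    document; relevant documents never retrieved contribute 0."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at this rank
    return sum(precisions) / len(relevant) if relevant else 0.0

# Relevant documents appear at ranks 1, 3 and 5 of the ranking:
ap = average_precision(["d1", "d8", "d2", "d9", "d3"],
                       {"d1", "d2", "d3"})
print(ap)   # (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
```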

5.2 TEST COLLECTIONS AND RELEVANCE JUDGMENTS

GQ. Explain the significance of Test collections & Relevance judgments in Performance evaluation.
The topics and judgments, together with the document collection, form a test collection.
A central goal of TREC is to create test collections that may be re-used for later experiments. For instance, if a new IR technique or ranking formula is proposed, its inventor may use an established test collection to compare it against standard methods.
Reusable test collections may also be employed to tune retrieval
formulae, adjusting parameters to optimize performance.
If a test collection
is to be reusable, it is traditionally assumed that the
judgments should be as exhaustive as possible. Ideally, all relevant
documents would be located. Thus, many evaluation experiments
actively encourage manual runs (involving human intervention) in order
to increase the number of known relevant documents.
Here is a list of the most standard test collections and evaluation series.
We focus particularly on test collections for ad hoc information retrieval
system evaluation, but also mention a couple of similar test collections
for text classification.
The Cranfield collection. This was the pioneering test collection in
allowing precise quantitative measures of information retrieval
effectiveness, but is nowadays too small for anything but the most
elementary pilot experiments. Collected in the United Kingdom starting
in the late 1950s, it contains 1398 abstracts of aerodynamics journal
articles, a set of 225 queries, and exhaustive relevance judgments of all
(query, document) pairs.


Text Retrieval Conference (TREC). The U.S. National Institute of Standards and Technology (NIST) has run a large IR test bed evaluation series since 1992. Within this framework, there have been many tracks over a range of different test collections.
In total, these test collections comprise 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles) and relevance judgments for 450 information needs, which are called topics and are specified in detailed text passages.
Individual test collections are defined over different subsets of this data.
The early TRECs each consisted of 50 information needs, evaluated
over different but overlapping sets of documents.
TRECs 6-8 provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information Service articles. This is probably the best subcollection to use in future work, because it is the largest, and the topics are more consistent. Because the test document collections are so large, there are no exhaustive relevance judgments.
GOV2. In more recent years, NIST has done evaluations on larger document collections, including the 25-million-page GOV2 web page collection. From the beginning, the NIST test document collections were orders of magnitude larger than anything available to researchers previously, and GOV2 is now the largest Web collection easily available for research purposes.
NII Test Collections for IR Systems (NTCIR). The NTCIR project has
built various test collections of similar sizes to the TREC collections,
focusing on East Asian language and cross-language information
retrieval, where queries are made in one language
over a document
collection containing documents in one or more other languages.
Cross Language Evaluation Forum (CLEF). This evaluation series
has concentrated on European languages and cross-language information
retrieval.
Relevance judgments
A set of relevance judgments, standardly a binary assessment of either
relevant or nonrelevant for each query-document pair.


The standard approach to information retrieval system evaluation revolves around the notion of relevant and nonrelevant documents. With respect to a user information need, a document in the test collection is given a binary classification as either relevant or nonrelevant.

This decision is referred to as the gold standard or ground truth judgment of relevance.

Evaluation of retrieval models and search engines is a very active area, with much of the current focus on using large volumes of log data from user interactions, such as clickthrough data, which records the documents that were clicked on during a search session.

Clickthrough and other log data is strongly correlated with relevance and it can be used to evaluate search, but search engine companies still use relevance judgments in addition to log data to ensure the validity of their results.
Chapter Ends..
UNIT 2
CHAPTER 6: Text Categorization and Filtering

Syllabus:

Text Categorization and Filtering: Text classification algorithms: Naive Bayes, Support Vector Machines; Feature selection and dimensionality reduction; Applications of text categorization and filtering.

6.1 TEXT CLASSIFICATION/CATEGORIZATION ALGORITHMS

GQ. Explain the term text categorization or text classification.

Text categorization, also termed text classification, is the task of automatically sorting a set of documents into categories (classes) from a predefined set. We consider classification and categorization the same process.

A related problem is to partition documents into subsets with no labels. Since each subset has no label, it is not a class; instead, each subset is called a cluster, and the partitioning process is called clustering. We consider clustering as a simpler variant of text classification.

Example

We can classify emails into spam or non-spam, news articles into different categories like Politics, Stock Market, Sports, etc.; academic papers are often classified by technical domains and sub-domains.
The Text Classification Problem

A classifier can be formally defined as follows:
D: a collection of documents
C = {c1, c2, ..., cL}: a set of L classes with their respective labels

A text classifier is a binary function F: D × C → {0, 1}, which assigns to each pair [dj, cp], with dj ∈ D and cp ∈ C, a value of
1, if dj is a member of class cp
0, if dj is not a member of class cp

This is a broad definition, which admits supervised and unsupervised algorithms. For high accuracy, use a supervised algorithm.
multi-label: one or more labels are assigned to each document
single-label: a single class is assigned to each document

Classification function F
defined as a binary function of the document-class pair [dj, cp]
can be modified to compute the degree of membership of dj in cp
documents are then candidates for membership in class cp
candidates are sorted by decreasing values of F(dj, cp)
Text Classification Algorithms

GQ. Discuss Text Classification algorithms.

Text categorization is an effective activity that can be accomplished


using a variety of classification algorithms.
Text classification algorithms are categorized into two groups.
Supervised algorithms
Unsupervised algorithms
The below diagram shows the text classification algorithms.

[Figure: Text Classification Algorithms — a taxonomy dividing the algorithms into Unsupervised Algorithms (e.g. clustering) and Supervised Algorithms (e.g. decision trees, Rocchio); the original diagram is only partially legible in this copy]
Supervised Algorithms

Depend on a training set. The training set is used to learn a classification function. The larger the number of training examples, the better is the fine tuning of the classifier.

Overfitting: the classifier becomes too specific to the training examples. To evaluate the classifier, use a set of unseen objects, commonly referred to as the test set.

Unsupervised Algorithms: Clustering

Input data: a set of documents to classify; not even class labels are provided.

Task of the classifier: separate documents into subsets (clusters) automatically; the separating procedure is called clustering.

6.1.1 Naive Bayes

GQ. Explain the Naive Bayes Algorithm.

GQ. Explain Bayes' Theorem.

Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem and used for solving classification problems.



It is one of the simplest and most effective classification algorithms, and it helps in building fast machine learning models that can make quick predictions.

It is mainly used in text classification, which involves a high-dimensional training dataset.

It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.

Some popular examples of the Naive Bayes Algorithm are spam filtration, sentiment analysis, and classifying articles.

The Naive Bayes algorithm is comprised of two words, Naive and Bayes, which can be described as:

Naive: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying that it is an apple, without depending on the others.
Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem
Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is
used to determine the probability of a hypothesis with prior knowledge.
It depends on the conditional probability.

The formula for Bayes' theorem is given as:


P(A | B) = P(B | A) P(A) / P(B)
Where.
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.

P(A) is Prior Probability: Probability of the hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of the evidence.
Consider the given dataset. Apply the Naive Bayes Algorithm and predict, if a fruit has the following properties, which type of fruit it is:
Fruit = {Yellow, Sweet, Long}

Frequency Table:

Fruit | Yellow | Sweet | Long | Total
Mango | 350 | 450 | 0 | 650
Banana | 400 | 300 | 350 | 400
Others | 50 | 100 | 50 | 150
Total | 800 | 850 | 400 | 1200

Table name: Samples for classification for the Naive Bayes Theorem

1. Mango:
P(X | Mango) = P(Yellow | Mango) * P(Sweet | Mango) * P(Long | Mango)

(a) P(Yellow | Mango) = (P(Mango | Yellow) * P(Yellow)) / P(Mango)
= ((350/800) * (800/1200)) / (650/1200)
P(Yellow | Mango) = 0.53 ...(1)

(b) P(Sweet | Mango) = (P(Mango | Sweet) * P(Sweet)) / P(Mango)
= ((450/850) * (850/1200)) / (650/1200)
P(Sweet | Mango) = 0.69 ...(2)

(c) P(Long | Mango) = (P(Mango | Long) * P(Long)) / P(Mango)
= ((0/400) * (400/1200)) / (650/1200)
P(Long | Mango) = 0 ...(3)

On multiplying eq (1), (2), (3) ==> P(X | Mango) = 0.53 * 0.69 * 0
P(X | Mango) = 0

2. Banana:
P(X | Banana) = P(Yellow | Banana) * P(Sweet | Banana) * P(Long | Banana)

(a) P(Yellow | Banana) = (P(Banana | Yellow) * P(Yellow)) / P(Banana)
= ((400/800) * (800/1200)) / (400/1200)
P(Yellow | Banana) = 1 ...(4)

(b) P(Sweet | Banana) = (P(Banana | Sweet) * P(Sweet)) / P(Banana)
= ((300/850) * (850/1200)) / (400/1200)
P(Sweet | Banana) = 0.75 ...(5)

(c) P(Long | Banana) = (P(Banana | Long) * P(Long)) / P(Banana)
= ((350/400) * (400/1200)) / (400/1200)
P(Long | Banana) = 0.875 ...(6)

On multiplying eq (4), (5), (6) ==> P(X | Banana) = 1 * 0.75 * 0.875
P(X | Banana) = 0.6562

3. Others:
P(X | Others) = P(Yellow | Others) * P(Sweet | Others) * P(Long | Others)

(a) P(Yellow | Others) = (P(Others | Yellow) * P(Yellow)) / P(Others)
= ((50/800) * (800/1200)) / (150/1200)
P(Yellow | Others) = 0.34 ...(7)

(b) P(Sweet | Others) = (P(Others | Sweet) * P(Sweet)) / P(Others)
= ((100/850) * (850/1200)) / (150/1200)
P(Sweet | Others) = 0.67 ...(8)

(c) P(Long | Others) = (P(Others | Long) * P(Long)) / P(Others)
= ((50/400) * (400/1200)) / (150/1200)
P(Long | Others) = 0.34 ...(9)

On multiplying eq (7), (8), (9) ==> P(X | Others) = 0.34 * 0.67 * 0.34
P(X | Others) = 0.0774

So finally, from P(X | Mango) = 0, P(X | Banana) = 0.6562 and P(X | Others) = 0.0774, we can conclude that Fruit {Yellow, Sweet, Long} is a Banana.
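The worked example can be verified programmatically (a small sketch that reproduces the table's counts and, as the example does, compares only the class-conditional likelihood products):

```python
# Counts (Yellow, Sweet, Long, class total) from the frequency table.
counts = {"Mango":  (350, 450,   0, 650),
          "Banana": (400, 300, 350, 400),
          "Others": ( 50, 100,  50, 150)}

def likelihood(cls):
    """P(Yellow | cls) * P(Sweet | cls) * P(Long | cls)."""
    yellow, sweet, long_, total = counts[cls]
    return (yellow / total) * (sweet / total) * (long_ / total)

scores = {cls: likelihood(cls) for cls in counts}
print(max(scores, key=scores.get))   # Banana
```

Mango scores exactly 0 because it never appears with the Long feature, so Banana wins with a score of about 0.656.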


Training Algorithm

Let V be the vocabulary of all words in D
For each category ci ∈ C:
    Let Di be the subset of documents in category ci
    P(ci) = |Di| / |D|
    Let Ti be the concatenation of all documents in Di
    Let ni be the total number of word occurrences in Ti
    For each word wj ∈ V:
        Let nij be the number of occurrences of wj in Ti
        Let P(wj | ci) = (nij + 1) / (ni + |V|)
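The training algorithm above, with its add-one smoothed estimates, can be sketched directly in Python (a minimal illustration; the toy spam/ham documents are our own):

```python
from collections import Counter

def train_naive_bayes(docs):
    """docs: list of (tokens, label) pairs. Returns priors P(c) and
    add-one smoothed conditionals P(w | c), as in the algorithm above."""
    vocab = {w for tokens, _ in docs for w in tokens}
    labels = {label for _, label in docs}
    priors, cond = {}, {}
    for c in labels:
        class_docs = [tokens for tokens, label in docs if label == c]
        priors[c] = len(class_docs) / len(docs)          # P(c) = |Dc|/|D|
        word_counts = Counter(w for tokens in class_docs for w in tokens)
        n_c = sum(word_counts.values())   # total word occurrences in c
        cond[c] = {w: (word_counts[w] + 1) / (n_c + len(vocab))
                   for w in vocab}
    return priors, cond

docs = [(["cheap", "pills"], "spam"), (["meeting", "agenda"], "ham")]
priors, cond = train_naive_bayes(docs)
print(priors["spam"])          # 0.5
print(cond["spam"]["cheap"])   # (1 + 1) / (2 + 4) = 1/3
```

Note that, for each class, the smoothed conditionals sum to 1 over the vocabulary, as a probability distribution should.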

6.1.2 Support Vector Machine (SVM)

GQ. Explain in brief the concept of Support Vector Machine (SVM).

Unlike the Naïve Bayes classifier, which is based purely on probabilistic


principles, the next classifier we describe is based on geometric
principles.
Support Vector Machines, often called SVMs, treat inputs such as
documents as points in some geometric space.
For simplicity, we first describe how SVMs are applied to classification
problems with binary class labels, which we will refer to as the
"positive" and "negative" classes.
In this setting, the goal of SVMs is to find a hyperplane that separates the positive examples from the negative examples.
Support Vector Machine (SVM) is a very popular model. SVM applies a
geometric interpretation of the data. By default, it is a binary classifier.
It maps the data points in space to maximize the distance between the
two categories.
For SVM, data points are N-dimensional vectors, and the method looks for an (N-1)-dimensional hyperplane to separate them. This is called a linear classifier.
Many hyperplanes could satisfy this condition. The best hyperplane is the one that gives the largest margin, or distance, between the two categories; it is therefore called the maximum margin hyperplane:


Fig. 6.1.1: The maximum margin hyperplane separating two categories of points
We can see a set of points corresponding to two categories, blue and green. The red line indicates the maximum margin hyperplane that separates both groups of points. Those points over the dashed line are called the support vectors.

It frequently happens that the sets are not linearly separable in the original space. Therefore, the original space is mapped into a higher-dimensional space where the separation can be obtained. SVMs can efficiently perform a non-linear classification using the so-called kernel trick.

The kernel trick consists of using specific kernel functions, which simplify the mapping between the original space and a higher-dimensional space.
GQ. Explain the difference between a Naïve Bayes Classifier and SVM.

Naive Bayes comes under the class of generative models for classification. It models the posterior probability from the class conditional densities. So, the output is a probability of belonging to a class.

SVM, on the other hand, is based on a discriminant function given by y = w·x + b. Here the weights w and the bias parameter b are estimated from the training data. It tries to find a hyperplane that maximises the margin, and there is an optimization function in this regard.


are more
Performance wise SVMs using the radial basis function kernel
likely to perform better as they can handle non-linearities in the data.

ech-Neo Publications
(Nevw Syllabus w.e.f Acadernic Year 23-24) (BC-12)
B.Sc.-Comp-SEM 6) (Text Categorization and Filtering) ..Page no.
(MU-TY. (6-9)
IR
performs best when the features are independent of each
Naive Bayes
which often does not happen in real. Having said that it still
other
performs good even when the features are not independent.

6.2 FEATURE SELECTION

GQ. Explain in detail how feature selection can be achieved in IR.

Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification.
Feature selection serves two main purposes. First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary. This is of particular importance for classifiers that, unlike NB, are expensive to train. Second, feature selection often increases classification accuracy by eliminating noise features.
A noise feature is one that, when added to the document representation, increases the classification error on new data. Suppose a rare term, say arachnocentric, has no information about a class, say China, but all instances of arachnocentric happen to occur in China documents in our training set.

Then the learning method might produce a classifier that misassigns test documents containing arachnocentric to China. Such an incorrect generalization from an accidental property of the training set is called overfitting.
generalization from an accidental property of the training set is called
overfiting.
SELECTFEATURES(D, c, k)
1 V ← EXTRACTVOCABULARY(D)
2 L ← []
3 for each t ∈ V
4 do A(t, c) ← COMPUTEFEATUREUTILITY(D, t, c)
5    APPEND(L, ⟨A(t, c), t⟩)
6 return FEATURESWITHLARGESTVALUES(L, k)

Fig. 6.2.1: Basic feature selection algorithm for selecting the k best features

ech-Neo Publications
(New Syllabus w.e.f Academic Year 23-24) (BC-12)
IR (MU-T.Y, B.Sc. -Comp-SEM 6) (Toxt Catogorization and Filtering) ..Page
no.(6-10)
We can view feature selection as a method for replacing a complex classifier (using all features) with a simpler one (using a subset of the features).

The basic feature selection algorithm is shown in the figure above. For a given class c, we compute a utility measure A(t, c) for each term of the vocabulary and select the k terms that have the highest values of A(t, c). All other terms are discarded and not used in classification. We will introduce three different utility measures in this section: mutual information, A(t, c) = I(Ut; Cc); the χ² test, A(t, c) = X²(t, c); and frequency, A(t, c) = N(t, c).
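The algorithm of Fig. 6.2.1 with the frequency utility A(t, c) = N(t, c) can be sketched as follows (the tiny tokenized document collection is an illustrative assumption):

```python
from collections import Counter

def select_features(docs_in_class, k):
    """Select the k terms with the highest frequency utility
    A(t, c) = N(t, c), the number of occurrences of t in class c."""
    counts = Counter(t for doc in docs_in_class for t in doc)
    ranked = sorted(counts, key=lambda t: counts[t], reverse=True)
    return ranked[:k]

# Tokenized documents belonging to one class c:
china_docs = [["china", "beijing", "trade"],
              ["china", "exports", "trade"],
              ["china", "beijing", "olympics"]]
print(select_features(china_docs, 2))   # 'china' ranks first
```

Swapping in mutual information or the χ² statistic only changes the utility used for ranking; the selection loop stays the same.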

6.3 DIMENSIONALITY REDUCTION

GQ. Explain the term Dimensionality Reduction.

Dimensionality reduction refers to techniques that transform a high-dimensional dataset into a lower-dimensional representation while preserving its essential structure and characteristics.

The aim is to reduce the computational complexity, improve visualization, and eliminate redundant or noisy features.
Advantages of Dimensionality Reduction

Dimensionality reduction offers several advantages:

1. Improved Computational Efficiency: Reducing the number of dimensions simplifies the data representation and accelerates the training and inference process.

2. Enhanced Visualization: By reducing the dataset to two or three dimensions, we can visualize and explore the data more effectively.

3. Noise and Outlier Removal: Dimensionality reduction techniques can help filter out noisy features or outliers that may negatively impact the model's performance.
GQ. Differentiate between Feature Selection and Dimensionality Reduction.
Reduction.

While both feature selection and dimensionality reduction aim to reduce the number of features, they differ in their approach:

Feature Selection: Selects a subset of relevant features while keeping the original feature space intact. The focus is on identifying the most informative features for modelling.

Dimensionality Reduction: Projects the data onto a lower-dimensional space, transforming the feature space. The objective is to create a compressed representation that captures the essence of the original data.

6.4 APPLICATIONS OF TEXT CATEGORIZATION AND FILTERING

GQ. Explain in brief applications of text categorization and filtering.

GQ. Discuss the various applications of Text categorization in IR.

Text categorization

Text categorization is a machine learning technique that assigns a set of predefined categories to open-ended text.

Text classifiers can be used to organize, structure, and categorize pretty much any kind of text - from documents, medical studies and files, and all over the web.

For example, news articles can be organized by topics; support tickets can be organized by urgency; chat conversations can be organized by language; brand mentions can be organized by sentiment; and so on.

The most common applications are spam detection, sentiment classification, and online advertisement classification.
1. Spam detection

Classification techniques can be used to help detect and eliminate various types of spam. Spam is broadly defined to be any content that is generated for malevolent purposes, such as unsolicited advertisements, deceptively increasing the ranking of a web page, or spreading a virus.

One important characteristic of spam is that it tends to have little, if any, useful content. This definition of spam is very subjective, because what may be useful to one person may not be useful to another. For this reason, it is often difficult to come up with an objective definition of spam.

There are many types of spam, including email spam, web page spam, advertisement spam, and blog spam, and spammers target each of these differently. Therefore, a classification technique that works for one type of spam is unlikely to work for all types. Instead, specialized spam classifiers are built that decide, for a given piece of content, whether or not it is spam.

2. Sentiment Analysis

3. Classifying advertisements
Advertisers must pay the search engine only if a user clicks on the advertisement. A user may click on an advertisement for a number of reasons. Clearly, if the advertisement is "topically relevant", then the user may click on it. However, this is not the only reason why a user may click.
Customers often use socil media t0 express their opinions about
and experiences of products or servicas. Text ciassifñcation is ofen
732i 10ideniiy thetweeis that brands must esDODd to.
Text classification is also used in language identification, like
identifying the language of new tweets or posts. For example,
Google Translate has an automatic language identification feature.
Authorship attribution, or identifying the unknown authors of texts
from a pool of authors, is another popular use case of text
classification, and it's used in a range of fields from forensic
analysis to literary studies. Text classification has also been used to
segregate fake news from real news.
Language detection is another relevant example of text classification,
that is, the process of classifying incoming text according to its
language. These classifiers are often used for routing purposes
(e.g., to the appropriate team).

Text Filtering
Filtering is the process of evaluating documents on an ongoing basis.
It removes irrelevant or unwanted information from an information
stream using automated or computerized methods.
A filtering system consists of several tools that help people find the most
valuable information, so that the limited time one can allocate to
reading/listening/viewing is spent on correct and valuable documents.
It also eliminates the harmful information.


Types of Text Filtering


There are three types of text filtering.
1. Content-Based Filtering
2. Collaborative Filtering
3. Hybrid Filtering

1. Content-Based filtering
Objects to be filtered: generally texts; filter engine based on content analysis.
These filtering methods are based on the description of an item and a
profile of the user's preferred choices.
In a content-based recommendation system, keywords are used to
describe the items; besides, a user profile is built to state the type of item
this user likes.
The algorithms try to recommend products which are similar to the ones
that a user has liked in the past. The idea of content-based filtering is
that if you like an item, you will also like a 'similar' item.
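A minimal sketch of content-based filtering, assuming items are described by hand-picked keyword sets; the book titles and keyword profiles below are invented for illustration, and similarity is measured by simple keyword overlap (Jaccard):

```python
def jaccard(a, b):
    """Overlap of two keyword sets: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)

# Invented item profiles: keywords describing each book
items = {
    "Dune":       {"sci-fi", "desert", "politics"},
    "Foundation": {"sci-fi", "empire", "politics"},
    "Emma":       {"romance", "classic"},
}

def recommend(liked, items, k=1):
    """Rank the unseen items by similarity to the item the user liked."""
    scores = {name: jaccard(items[liked], kw)
              for name, kw in items.items() if name != liked}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("Dune", items))
```

"Foundation" shares the sci-fi/politics keywords with "Dune", so it ranks first - the "similar item" idea from the paragraph above.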
2. Collaborative filtering
Objects to be filtered: products/goods, filter engine based on usage analysis.
This filtering method is usually based on collecting and analyzing
information on users' behaviors, their activities or preferences, and
predicting what they will like based on their similarity with other users.
A key advantage of the collaborative filtering approach is that it does
not rely on machine analyzable content and thus it is capable of
accurately recommending complex items such as movies without
requiring an "understanding" of the item itself.
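The same idea can be sketched as a small user-based collaborative filter. The users, items and ratings below are invented, and the similarity function is a simple inverse-distance stand-in for the cosine or Pearson measures commonly used:

```python
# Invented user -> {item: rating} data
ratings = {
    "alice": {"Matrix": 5, "Titanic": 1, "Inception": 5},
    "bob":   {"Matrix": 4, "Titanic": 1},
    "carol": {"Matrix": 1, "Titanic": 5, "Notebook": 5},
}

def similarity(u, v):
    """Agreement on co-rated items (inverse Euclidean distance)."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dist = sum((ratings[u][i] - ratings[v][i]) ** 2 for i in common) ** 0.5
    return 1.0 / (1.0 + dist)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    peers = [(similarity(user, v), ratings[v][item])
             for v in ratings if v != user and item in ratings[v]]
    num = sum(s * r for s, r in peers)
    den = sum(s for s, _ in peers)
    return num / den if den else None

print(round(predict("bob", "Inception"), 2))
```

Note that nothing here inspects the movie itself - only other users' ratings - which is exactly the advantage described above.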
3. Hybrid Filtering
Combination of the two previous approaches.
Recent research shows that combining collaborative and content-based
recommendation can be more effective.
Hybrid approaches can be implemented by making content-based and
collaborative-based predictions separately and then combining them.
Further, by adding content-based capabilities to a collaborative-based
approach and vice versa; or by unifying the approaches into one model.

Applications of Text filtering
1. If the user is trying to search for a particular book, the search engine will
recommend some similar titles based on their past likes. This
technology is used by some of the major companies like Netflix and
Pandora. Such systems are mostly used with text documents.
2. If a person wants to watch a movie, he/she might ask friends or other
users for their opinions about the particular movie, because different
people have different opinions. In this case, recommendations come
from people who have similar interests.
3. The website makes recommendations by comparing the watching and
searching habits of similar users (i.e., collaborative filtering) as well as
by offering movies that share characteristics with films that a user has
rated highly (content-based filtering).
4. Even job searching uses a hybrid filtering system, which is the
combination of the content-based filtering and collaborative filtering
approaches. The main motto is to make job search easy for users. This
recommendation depends on the user's past experiences, as it makes it
easy for users to get recommendations of various job profiles on the
basis of their past experiences, projects, internships, skills, etc.
5. Searching friends online in Facebook, whom to be friends with, is also
part of collaborative filtering. Even song listings based on previous
history or choice in Spotify is another example of collaborative filtering.

GQ. Differentiate between Information Filtering and Information Retrieval.

Information Filtering | Information Retrieval
Information Filtering is about processing a stream of information to match your static set of likes, tastes and preferences. | Information Retrieval is about fulfilling immediate queries from a library of available information.
Example: a clipper service which reads all the news articles published today and serves you content that is relevant to you based on your likes and interests. | Example: you have a deal store containing 100 deals and a query comes from a user. You show the deals that are relevant to that query.
Information filtering is concerned with repeated uses of the system, by a person or persons with long-term goals or interests. | IR is typically concerned with single uses of the system, by a person with a one-time goal and one-time query.
Filtering assumes that profiles can be correct specifications of information interests. | IR recognizes inherent problems in the adequacy of queries as representations of information needs.
Filtering is mainly concerned with the distribution of texts to groups or individuals. | IR is concerned with the collection and organization of texts.
Filtering is concerned with long-term changes over a series of information-seeking episodes. | IR is concerned with responding to the user's interaction with texts within a single information-seeking episode.
Models: Probabilistic model | Models: Boolean IR model, Vector space IR model, Probabilistic IR model, Language model
GQ. List the differences between Classification and Clustering.

Classification | Clustering
Classification is a supervised learning approach where a specific label is provided to the machine to classify new observations. Here the machine needs proper testing and training for the label verification. | Clustering is an unsupervised learning approach where grouping is done on a similarity basis.
Supervised learning approach. | Unsupervised learning approach.
It uses a training dataset. | It does not use a training dataset.
It uses algorithms to categorize the new data as per the observations of the training set. | It uses statistical concepts in which the data set is divided into subsets with the same features.
In classification, there are labels for training data. | In clustering, there are no labels for training data.
Its objective is to find which class a new object belongs to from the set of predefined classes. | Its objective is to group a set of objects to find whether there is any relationship between them.
It is more complex as compared to clustering. | It is less complex as compared to classification.

Chapter Ends...
UNIT 2
CHAPTER 7: Text Clustering for Information Retrieval

Syllabus

Text Clustering for Information Retrieval: Clustering techniques:
K-means, hierarchical clustering, Evaluation of clustering results,
Clustering for query expansion and result grouping.

7.1 CLUSTERING TECHNIQUES

GQ. Explain the concept of clustering in IR.

Clustering is the process of grouping a set of documents into clusters of


similar documents.
Documents within a cluster should be similar. Documents from different
clusters should be dissimilar.
Clustering is the most common form of unsupervised learning.
After clustering, each cluster is assigned a number called a cluster ID.
Two of the most popular clustering algorithms are discussed in detail -
K-Means and Hierarchical.
Types of Clustering Methods
The clustering methods are broadly divided into Hard clustering
(datapoint belongs to only one group) and Soft Clustering (data points can
belong to another group also).
But there also exist various other approaches to clustering. Below are
the main clustering methods.
1. Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical
groups. It is also known as the centroid-based method.
The most common example of partitioning clustering is the
K-Means Clustering algorithm.
2. Density-Based Clustering
The density-based clustering method connects the highly-dense
areas into clusters, and the arbitrarily shaped distributions are
formed as long as the dense region can be connected.
This algorithm does it by identifying different clusters in the
dataset and connects the areas of high densities into clusters.
The dense areas in data space are divided from each other by
sparser areas.
3. Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is
divided based on the probability of how a dataset belongs to a
particular distribution.
The grouping is done by assuming some distribution, commonly
the Gaussian distribution.
4. Connectivity-based Clustering
As the name suggests, these models are based on the notion that
the data points closer in data space exhibit more similarity to each
other than the data points lying farther away. These models can
follow two approaches.
In the first approach, they start by classifying all data points into
separate clusters & then aggregating them as the distance
decreases.
In the second approach, all data points
are classified as a single

cluster and then partitioned as the distance increases. Also, the


choice of distance function is subjective.

These models are easy to interpret but lack scalability for clustering
large datasets.

7.1.1 K-Means Clustering

GQ. Explain the K-Means clustering algorithm.

K-Means is a partition-based clustering technique that uses the
Euclidean distance between the data points as a criterion for cluster
formation.
Assuming there are 'k' clusters of data objects, K-Means groups them
accordingly. Each cluster has a cluster center allocated, and the centers
are placed at farther distances from each other.
Every incoming data point gets placed in the cluster with the closest
cluster center.
This process is repeated until all the data points get assigned to a
cluster. Once all the data points are covered, the cluster centers or
centroids are recalculated.
For document representations, clustering uses the vector space model.
As in vector space classification, we measure relatedness between
vectors by Euclidean distance, which is almost equivalent to cosine
similarity.
Each cluster in K-means is defined by a centroid.

Objective/partitioning criterion: minimize the average squared
difference from the centroid.
Recall the definition of centroid:

    μ(ω) = (1/|ω|) Σ_{x ∈ ω} x

where we use ω to denote a cluster.
We try to find the minimum average squared difference by iterating
two steps:
reassignment: assign each vector to its closest centroid.
recomputation: recompute each centroid as the average of the
vectors that were assigned to it in reassignment.
[Flowchart: K-Means iteration - choose the number of clusters K; place
centroids; compute the distance of objects to the centroids; group objects
based on minimum distance; repeat until no object moves to another group.]
Algorithm
Input: K: number of clusters
D: data set containing n objects

Output
a set of K clusters

Steps
1. Arbitrarily choose k objects from D as the initial cluster centers
2. Repeat

3. Reassign each object to the cluster to which the object is the most
similar, based on the distance measure.
4. Recompute the centroid for each newly formed cluster, until no change.
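The steps above can be sketched in a few lines of Python. This is an illustration on invented 2-D points; a real IR system would run the same loop on high-dimensional document vectors:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-Means on 2-D points using squared Euclidean distance."""
    random.seed(seed)
    centroids = random.sample(points, k)            # step 1: arbitrary initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                            # step 3: assign to nearest centroid
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        new = [tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]     # step 4: recompute centroids
        if new == centroids:                        # until no change
            break
        centroids = new
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
cents, cls = kmeans(points, k=2)
print(sorted(len(c) for c in cls))
```

On this toy data the loop converges to the two obvious groups of three points each, regardless of which initial centers are sampled.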

7.1.2 Hierarchical Clustering

GQ. Explain Hierarchical Clustering in detail.

Hierarchical clustering involves creating clusters that have a
predetermined ordering from top to bottom.
For example, all files and folders on the hard disk are organized in a
hierarchy.
A tree-based hierarchical taxonomy built from a set of documents is
called a dendrogram.
There are two types of hierarchical clustering: Divisive
and
Agglomerative.
Divisive clustering
Itbegins with a single cluster that consists of all of the instances.
During each iteration it chooses an existing cluster and divides it into
two (or possibly more) clusters.
Thisprocess is repeated until there is a total of K clusters.

The output of the algorithm largely depends on how clusters are chosen
and split.
Divisive clustering is a top-down approach.
Agglomerative clustering
The other general type of hierarchical clustering algorithm is called
agglomerative clustering, which is a bottom-up approach.
An agglomerative algorithm starts with each input as a separate cluster.
That is, it begins with N clusters, each of which contains a single input.
The algorithm then proceeds by joining two (or possibly more) existing
clusters to form a new cluster. Therefore, the number of clusters
decreases after each iteration.
The algorithm terminates when there are K clusters.

As with divisive clustering, the output of the algorithm is largely
dependent on how clusters are chosen and joined.
The hierarchy generated by an agglomerative or divisive clustering
algorithm can be conveniently visualized using a dendrogram.
A dendrogram graphically represents how a hierarchical clustering
algorithm progresses.
Agglomerative Clustering Algorithm

GQ. Explain the Agglomerative Clustering Algorithm.

The algorithm forms clusters in a bottom-up manner as follows:
Initially put each article in its own cluster.
Among all current clusters, pick the two clusters with smallest distance.
Replace these two clusters with a new cluster formed by merging the
two original ones.
Repeat the above two steps until there is only one remaining cluster.

1. Input
a set of N documents to be clustered
an N x N similarity (distance) matrix
2. Assign each document to its own cluster; N clusters are produced,
containing one document each
3. Find the two closest clusters and merge them into a single cluster;
the number of clusters is reduced to N-1
4. Recompute distances between the new cluster and each old cluster
5. Repeat steps 3 and 4 until one single cluster of size N is produced
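Steps 1-5 can be sketched as follows. This illustration clusters invented 1-D points with single linkage and stops at k clusters instead of 1, computing distances on the fly rather than storing the full N x N matrix:

```python
def agglomerative(points, k):
    """Bottom-up clustering of 1-D points: repeatedly merge the two closest
    clusters (single linkage) until only k clusters remain."""
    clusters = [[p] for p in points]                  # step 2: one point per cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):                # step 3: find two closest clusters
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)                # merge; cluster count drops by one
    return clusters

print(agglomerative([1, 2, 9, 10, 25], k=2))
```

The two tight groups {1, 2} and {9, 10} are merged first, then joined with each other, leaving the outlier 25 in its own cluster.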

Example

[Figure: agglomerative merging of five documents a-e - Step 1 merges a and
b into ab; Step 2 merges d and e into de; Step 3 merges c with de into cde;
Step 4 merges ab with cde into abcde.]
Methods to find the closest pair of clusters
GQ. Explain various methods to find clusters in a clustering algorithm.

The method used for computing cluster distances defines three variants
of the algorithm:
1. Single-Linkage
2. Complete-Linkage
3. Average-Linkage

1. Single Linkage
In single linkage hierarchical clustering, the distance between two
clusters is defined as the shortest distance between two points in each
cluster. For example, the distance between clusters "r" and "s" to the left
is equal to the length of the arrow between their two closest points.

    L(r, s) = min D(x_ri, x_sj)


2. Complete Linkage
In complete linkage hierarchical clustering, the distance between two
clusters is defined as the longest distance between two points in each
cluster.
For example, the distance between clusters "r" and "s" to the left is
equal to the length of the arrow between their two furthest points.

    L(r, s) = max D(x_ri, x_sj)

3. Average Linkage
In average linkage hierarchical clustering, the distance between two
clusters is defined as the average distance between each point in one
cluster and every point in the other cluster.
For example, the distance between clusters "r" and "s" to the left is
equal to the average length of the arrows connecting the points of one
cluster to the other.

    L(r, s) = (1 / (n_r * n_s)) * sum over i, j of D(x_ri, x_sj)
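The three linkage measures can be compared directly on two small clusters. This is an illustration on invented 1-D points, where D is plain absolute difference:

```python
def single(r, s):
    """Single linkage: shortest distance between any point of r and any of s."""
    return min(abs(a - b) for a in r for b in s)

def complete(r, s):
    """Complete linkage: longest distance between any point of r and any of s."""
    return max(abs(a - b) for a in r for b in s)

def average(r, s):
    """Average linkage: mean distance over all n_r * n_s cross-cluster pairs."""
    return sum(abs(a - b) for a in r for b in s) / (len(r) * len(s))

r, s = [1, 2], [5, 9]
print(single(r, s), complete(r, s), average(r, s))  # 3 8 5.5
```

Single linkage sees only the closest pair (2 and 5), complete linkage only the furthest pair (1 and 9), and average linkage smooths over all four pairs.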

GQ. Differentiate between K-Means and Hierarchical clustering. [G.Q.]

K-Means Clustering | Hierarchical Clustering
K-Means, using a pre-specified number of clusters, assigns records to each cluster to find mutually exclusive clusters of spherical shape based on distance. | Hierarchical methods can be either divisive or agglomerative.
K-Means clustering needs advance knowledge of K, i.e. the number of clusters one wants to divide the data into. | In hierarchical clustering one can stop at any number of clusters, finding an appropriate one by interpreting the dendrogram.
One can use median or mean as a cluster center to represent each cluster. | Agglomerative methods begin with 'n' clusters and sequentially combine similar clusters until only one cluster is obtained.
The methods used are normally less computationally intensive and are suited to very large datasets. | Divisive methods work in the opposite direction, beginning with one cluster that includes all the records; hierarchical methods are especially useful when the target is to arrange the clusters into a natural hierarchy.
In K-Means clustering, since one starts with a random choice of clusters, the results produced by running the algorithm many times may differ. | In hierarchical clustering, results are reproducible.
K-Means clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. | A hierarchical clustering is a set of nested clusters that are arranged as a tree.
K-Means clustering is found to work well when the structure of the clusters is hyper-spherical (like a circle in 2D, a sphere in 3D). | Hierarchical clustering doesn't work as well as K-means when the shape of the clusters is hyper-spherical.

7.2 EVALUATION OF CLUSTERING RESULTS

GQ. Explain how the clusters are evaluated in IR. [G.Q.]
Three important factors by which clustering can be evaluated are:
(a) Clustering tendency
(b) Number of clusters, k
(c) Clustering quality

1. Clustering tendency
Before evaluating the clustering performance, making sure that the data
set we are working on has clustering tendency and does not contain
uniformly distributed points is very important.
If the data does not contain clustering tendency, then clusters identified
by any state-of-the-art clustering algorithm may be irrelevant.
A non-uniform distribution of points in the data set becomes important
in clustering.
clustering.

2. Number of Optimal Clusters, k


Some of the clustering algorithms, like K-means, require the number of
clusters, k, as a clustering parameter. Getting the optimal number of
clusters is very significant in the analysis.
If k is too high, each point will broadly start representing a cluster, and
if k is too low, then data points are incorrectly clustered.
Finding the optimal number of clusters leads to granularity in clustering.
There is no definitive answer for finding the right number of clusters, as
it depends upon
i. Distribution shape
ii. Scale in the data set
iii. Clustering resolution required by the user.
There are two major approaches to find the optimal number of clusters:
Domain knowledge
Data driven approach

i. Domain knowledge - Domain knowledge might give some prior
knowledge on finding the number of clusters. For example, in case of
clustering the iris data set, if we have the prior knowledge of 3 species
(setosa, virginica, versicolor), then k = 3. A domain-knowledge-driven
k value gives more relevant insights.
ii. Data driven approach - If the domain knowledge is not available,
mathematical methods help in finding out the right number of clusters.
These methods are discussed below:

Empirical Method
A simple empirical method of finding the number of clusters is
k = square root of (N/2), where N is the total number of data points, so
that each cluster contains approximately square root of (2N) points.

Elbow Method
Within-cluster variance is a measure of compactness of the cluster.
The lower the value of within-cluster variance, the higher the
compactness of the cluster formed.
The sum of within-cluster variance, W, is calculated for clustering
analyses done with different values of k.
W is a cumulative measure of how well the points are clustered in the
analysis. Plotting the k values and their corresponding sums of
within-cluster variance helps in finding the number of clusters.

Statistical Approach
Gap statistic is a powerful statistical method to find the optimal
number of clusters, k. Similar to the Elbow method, the sum of
within-cluster (intra-cluster) variance is calculated for different
values of k.
Then random data points from a reference null distribution are
generated, and the sum of within-cluster variance is calculated for
the clustering done for different values of k.
The cluster number with the maximum Gap statistic value corresponds
to the optimal number of clusters.

3. Clustering quality
Once clustering is done, how well the clustering has performed can be
quantified by a number of metrics. Ideal clustering is characterised by
minimal intra-cluster distance and maximal inter-cluster distance.
There are two major types of measures to assess the clustering
performance.
Extrinsic measures, which require ground truth labels. Examples are
Adjusted Rand index, Fowlkes-Mallows scores, Mutual information
based scores, Homogeneity, Completeness and V-measure.
Intrinsic measures, which do not require ground truth labels. Some of
the clustering performance measures are Silhouette Coefficient,
Calinski-Harabasz Index, Davies-Bouldin Index etc.
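As a sketch of one intrinsic measure, the silhouette of a single point can be computed by hand. The 1-D clusters below are invented toy data; library implementations compute this for every point and average the values:

```python
def silhouette(point, own, other):
    """Silhouette of one 1-D point: (b - a) / max(a, b), where a is its mean
    distance to the rest of its own cluster and b its mean distance to the
    nearest other cluster (assumes no duplicate values in `own`)."""
    a = sum(abs(point - q) for q in own if q != point) / (len(own) - 1)
    b = sum(abs(point - q) for q in other) / len(other)
    return (b - a) / max(a, b)

c1, c2 = [1, 2, 3], [10, 11, 12]
print(round(silhouette(2, c1, c2), 3))  # near +1: the point is well clustered
```

Values near +1 indicate a point sits deep inside a compact, well-separated cluster; values near 0 or below suggest it lies between clusters or was misassigned.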

7.3 CLUSTERING FOR QUERY EXPANSION AND RESULT GROUPING

GQ. Explain in brief query expansion and result grouping.


It aims to group a set of data objects into clusters that are coherent
internally but basically different from each other.
In this work, we involve clustering algorithms in Information Retrieval
(IR) to strengthen the user's original query with appropriate additional
terms and return more relevant information.
Used in early search engines as a tool for indexing and query
formulation.
Specified preferred terms and relationships between them also called
controlled vocabulary particularly useful for query expansion by adding
synonyms or more specific terms using query operators based on
thesaurus.
It improves search effectiveness.
A variety of automatic or semi-automatic query expansion techniques
have been developed; the goal is to improve effectiveness by matching
related terms. Semi-automatic techniques require user interaction to
select the best expansion terms.

ech-NeoPublications
(New Syllabus w.e.f Academic Year 23-24) (BC-12)
IR (MU-TY, BSC-Comp SEM6) (Text Clustering for IR)..F
Page
no.(7-13)
Approaches are usually based on an analysis of term co-occurrence:

the entire document collection


a large collection of queries

the top-ranked documents in a result list


The following are the term association measures

Dice's Coefficient
(Here n_a and n_b are the numbers of documents or windows containing
terms a and b, n_ab the number containing both, and N the total number.)
Two functions are rank equivalent if they produce the same ordering:

    2·n_ab / (n_a + n_b)  =rank  n_ab / (n_a + n_b)

Mutual Information

    log( P(a,b) / (P(a)·P(b)) )  =rank  log( N·n_ab / (n_a·n_b) )

Expected Mutual Information Measure (EMIM)

    P(a,b)·log( P(a,b) / (P(a)·P(b)) )  =rank  (n_ab/N)·log( N·n_ab / (n_a·n_b) )

Pearson's Chi-squared (χ²) measure

    ( n_ab - (1/N)·n_a·n_b )² / (n_a·n_b)

where (1/N)·n_a·n_b is the expected number of co-occurrences if the two
terms occur independently.
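The rank-equivalent forms above can be computed directly from co-occurrence counts; the counts below (N documents, term a in n_a of them, term b in n_b, both in n_ab) are invented for illustration:

```python
import math

def dice(n_a, n_b, n_ab):
    """Dice's coefficient: 2*n_ab / (n_a + n_b)."""
    return 2 * n_ab / (n_a + n_b)

def mim(N, n_a, n_b, n_ab):
    """Mutual information measure, rank-equivalent form log(N*n_ab / (n_a*n_b))."""
    return math.log(N * n_ab / (n_a * n_b))

def emim(N, n_a, n_b, n_ab):
    """Expected mutual information, rank-equivalent form (n_ab/N)*log(N*n_ab/(n_a*n_b))."""
    return (n_ab / N) * math.log(N * n_ab / (n_a * n_b))

# Invented counts: N = 1000 documents, term a in 40, term b in 50, both in 30
N, n_a, n_b, n_ab = 1000, 40, 50, 30
print(round(dice(n_a, n_b, n_ab), 3), round(mim(N, n_a, n_b, n_ab), 3))
```

Since only the ordering of candidate expansion terms matters, any of these rank-equivalent forms can be used to pick the top-scoring associated terms for a query term.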

Chapter Ends...
UNIT 2
CHAPTER 8: Web Information Retrieval

Syllabus

Web Information Retrieval : Web search architecture and challenges,
Crawling and indexing web pages, Link analysis and PageRank algorithm.

8.1 WEB SEARCH ARCHITECTURE AND CHALLENGES
8.1.1 Web Search and Search Engine

GQ. Explain the terms web search and search engine.

The World Wide Web allows people to share information globally. The
amount of information grows without bound. Web documents are called
"Web Pages", each of which can be addressed by an identifier known as
a Uniform Resource Locator (URL). Web pages are grouped into "Web
Sites". For example, https://mu.ac.in/. A user can search for any
published information by passing a query in the form of keywords or a
phrase. The system then searches for relevant information in its database
and returns it to the user.
Web search simply means searching for information on the Web.
search simply
Web
The term may be used to differentiate Web searching from searching
local users' PCs or servers in the company datacenter.
In order to extract information, the user needs a tool to search the Web.
The tool is called a search engine. A Web search engine is a specialized
computer server that searches for information on the Web.
Examples of search engines are Google, Yahoo!, MSN Search, and Bing.
Web search engines discover pages by crawling the web, finding new
pages by following hyperlinks. Access to particular web pages may be
restricted in various ways.
The set of pages which cannot be included in search engine indexes is
often called the hidden web, or 'deep web', or 'web dark matter'.
The search engine looks for the keyword in the index of a predefined
database instead of going directly to the web to search for the keyword.
It uses software to search for the information in the database. This
software component is known as a web crawler.
Once the web crawler finds the pages, the search engine then shows the
relevant web pages as a result. These retrieved web pages generally
include the title of the page, the size of the text portion, the first several
sentences, etc.
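The crawl-and-follow-hyperlinks process described above can be sketched as a breadth-first traversal. Here an in-memory dictionary stands in for fetching pages over HTTP and parsing out their links; all URLs are invented for illustration:

```python
from collections import deque

# Invented link graph standing in for fetched-and-parsed pages;
# a real crawler would download each URL and extract its hyperlinks.
links = {
    "a.com": ["a.com/about", "b.com"],
    "a.com/about": [],
    "b.com": ["c.com", "a.com"],
    "c.com": [],
}

def crawl(seed):
    """Breadth-first traversal of hyperlinks, visiting each page once."""
    seen, frontier = {seed}, deque([seed])
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)                 # a real engine would fetch and index here
        for out in links.get(url, []):
            if out not in seen:           # skip already-discovered pages
                seen.add(out)
                frontier.append(out)
    return order

print(crawl("a.com"))
```

The `seen` set is what keeps the crawler from looping on cycles such as b.com linking back to a.com; production crawlers add politeness delays, robots.txt handling, and URL normalization on top of this skeleton.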

t Advantages and Disadvantages of Search engines

iGQ. List the advantages and disadvantages of web search engines.

Advantages of Search engines


1. Time Saving : Search engines help us to save time in the following two
ways: they eliminate the need to find information manually, and they
perform search operations at a very high speed.
2. Variety of information : The search engine offers a wide variety of
resources to obtain relevant and valuable information from the Internet.
By using a search engine, we can get information in various fields such
as education, entertainment, games, etc. The information which we get
from the search engine is in the form of blogs, pdf, ppt, text, images,
videos, and audio.
3. Precision : All search engines have the ability to provide more precise
results.



4. Free Access : Most search engines allow end-users to search their
content for free. In search engines, there is no restriction related to the
number of searches, so all end users can spend a lot of time searching
for valuable content to fulfill their requirements.
5. Advanced Search : Search engines allow us to use advanced search
options to get relevant, valuable, and informative results. Advanced
search results make our searches more flexible as well as sophisticated.
For example, when you want to search for a specific site, you can type
"site:" without quotes followed by the site's web address.
Disadvantages of Search engines
1. Sometimes the search engine takes too much time to display relevant,
valuable, and informative content.
2. Search engines, especially Google, frequently update their algorithms,
and it is very difficult to know which algorithm Google currently runs.
3. It makes end-users dependent, as they use search engines all the time,
even to solve their small queries.
4. Spam and Irrelevant Results : Some search results may not be relevant
to the query, leading users to waste time sifting through irrelevant
information.
5. Privacy Issues : Some search engines track users' search and browsing
history, which raises privacy concerns.
Types of search engines

GQ. Explain search engine types.

Crawler based
Crawler-based search engines use automated software programs to
survey and categorize web pages. The programs used by the search
engines to access web pages are called spiders, crawlers, robots or bots.
Examples of crawler-based search engines are:
a) Google (www.google.com) b) Ask Jeeves (www.ask.com)


Directories
Directories depend on human editors, who decide what category the
websites belong to and place them within specific categories in the
site's database.
The human editors comprehensively check the website and rank the
information they find, using a pre-defined set of rules.
Examples of directories are:
a) Yahoo Directory (www.yahoo.com)

Hybrid Search Engines
Hybrid search engines use a combination of both crawler-based results
and directory results. More and more search engines these days are
moving to a hybrid-based model.
Examples of hybrid search engines are:
a) Yahoo (www.yahoo.com)
b) Google (www.google.com)

Meta Search Engines
Meta search engines take the results from all the other search engines,
and combine them into one large listing.
Examples of Meta search engines include:
a) Meta crawler (www.metacrawler.com)
b) Dogpile (www.dogpile.com)

8.1.2 Web Structure

GQ. Explain the structure of the web.
GQ. Explain the bow-tie structure of the web.

Almost every website on the Internet has a distinct design and
organization structure.
Website designers usually create distinguishable layout templates for
pages of different functions.

then organize the website by linking various pages with
They
hyperlinks, each of which is represented by a URL string following
some pre-defined syntactic patterns.
a
A web crawl is task performed by special purpose software that surfs
starting from a multitude of web pages and then continuously
the Web,
following the hyperlinks it encountered until the end of the crawl. One
the intriguing findings of this crawl wvas that the Web has a bow-tie
of
belowv Fig. 8.1.1.
structire as shown in the

[Fig. 8.1.1 : Bow-Tie Structure of the Web - the SCC forms the central
knot, with IN on the left, OUT on the right, and Tendrils, Tubes and
Disconnected components around them.]

The central core of the Web (the knot of the bow-tie) is the strongly
connected component (SCC), which means that for any two pages in the
SCC, a user can navigate from one of them to the other and back by
clicking on links embedded in the pages encountered.
a
The left bow, called IN, contains pages that have directed path of links
leading to the SCC and its relative size was 21.5% of the crawled pages.
Pages in the left bow might be either new pages that have not yet been
linked to, or older web pages that have not become popular enough to
become part of the SCC.
The right bow, called OUT, contains pages that can be reached from the
SCC by following a directed path of links and its relative size
was also
21.5% of the crawled pages.
a
Pages in the right bow might be pages in e-commerce sites that have
policy not to link to other sites.


The other components are the tendrils together with the tubes, which comprised 21.5% of the crawled portion of the Web, and the disconnected components, whose total size was about 8% of the crawl.
A web page in Tubes has a directed path from IN to OUT bypassing the SCC, and a page in Tendrils can either be reached from IN or leads into OUT.
A 8.1.3 Challenges of Web Search

GQ. What are the challenges posed by Web search?

1. Data-centric : related to the data itself
- Distributed data : Data spans over many computers and platforms. Available bandwidth and reliability of the network interconnections vary widely.
- High percentage of volatile data : New computers/sites/pages can be added and removed easily. We also have dangling links when domain or file names change or disappear.
- Large volume of data : Scaling issues are difficult to cope with.
- Unstructured and redundant data : No conceptual structure/organization. HTML pages are only semi-structured. Much data is repeated (copies/mirrors).
- Quality of data : There is no editorial process; data can be false, invalid, poorly written, with many typos.
- Heterogeneous data : Multiple media types, multiple formats, different languages, different alphabets.

2. Interaction-centric : related to the users and their interactions - expressing a query and interpreting results.

3. User key challenge
The users do not exactly understand how to provide a sequence of words for the search.


The users have problems understanding Boolean logic; therefore, the user cannot perform advanced searching to conceive a good query.
Around 85% of users only look at the first page of the result, so relevant answers might be skipped.
In order to solve the problems above, the search engine must be easy to use and provide relevant answers to the query.

4. System key challenge
To do a fast search and return relevant answers, even to poorly formulated queries, is challenging.
So here are some guidelines helping users to search. Specify the words clearly, stating which words should be in the page and which words should not be in the page. If looking for a company, institution, or organization, try to guess the URL by using the www prefix followed by the name, and then .com, .edu, .org, .gov, or a country code.
The user should notice that anyone can publish data on the Web, so information that they get from search engines might not be accurate.
GQ. Explain in brief how Search Engines work?

Search engines crawl the Web, looking at particular site items to get an idea what a site is about.
Search engines perform several activities in order to deliver search results - crawling, indexing, processing, calculating relevancy, and retrieving.
Crawling
First, search engines crawl the Web to see what is there.
This task is performed by a piece of software, called a crawler, a spider or a bot.
Spiders follow links from one page to another and index everything they
find on their way.


Indexing
After a page is crawled, its content is stored in a giant database, from where it can later again be retrieved.

Processing
When a search request comes, the search engine processes it, i.e. it compares the search string in the request with the indexed pages in the database.
Simplified Search Engine Architecture

[Figure] Fig. 8.1.2 : Simplified search engine architecture, showing the Crawler, the Indexer, the Query Engine and the Search Interface.

A web crawler is a software program that traverses web pages, downloads them for indexing, and follows the hyperlinks that are referenced in the downloaded pages. A web crawler is also known as a crawler, a spider, a wanderer or a software robot.
The second component is the indexer, which is responsible for creating the search index from the web pages it receives from the crawler.
The search index records, for each word, the crawled web pages in which it appears (listed in its posting list).
The search index will also store information pertaining to hyperlinks in a separate link database, which allows the search engine to perform hyperlink analysis, used as part of the ranking process of web pages.
The link database can also be organized as an inverted file, in such a way that its index file is populated by URLs, and the posting list for each URL entry (called the source URL) contains all the destination URLs, forming links between these source and destination URLs.
Query Engine
The query engine is a well-guarded secret, since search engines are rightly paranoid, fearing web sites who wish to increase their ranking by unscrupulously taking advantage of the algorithms the search engine uses to rank result pages.
Search engines view such manipulation as spam, since it has a direct effect on the quality of the results presented to the user.
Search Interface
Once the query is processed, the query engine sends the results list to the search interface, which displays the results on the user's screen.
The user interface provides the look and feel of the search engine, allowing the user to submit queries, browse the results list, and click on chosen web pages for further browsing.

8.2 CRAWLING AND INDEXING WEB PAGES

a 8.2.1 Web Crawling


GQ. Define the term web crawling and explain the features of web crawling.

Crawling is the process of gathering pages from the internet, in order to index them. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them.

Features of a crawler
Robustness : Ability to handle spider traps. The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps.
Politeness : Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them. These politeness policies must be respected.
Distributed : The crawler should have the ability to execute in a distributed fashion across multiple machines.
Scalable : The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.
Performance and efficiency : The crawl system should make efficient use of various system resources including processor, storage and network bandwidth.
Quality : The crawler should be biased towards fetching useful pages first.
Freshness : In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of previously fetched pages.
Extensible : Crawlers should be designed to be extensible in many ways - to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular.

Web Crawler Architecture


GQ. Explain web crawler architecture.

Web Crawler Architecture refers to the design and structure of a program that automatically browses the web for information.

The architecture of a web crawler is responsible for defining how the


crawler functions, what it does, and how it interacts with the websites it
visits.
The simple scheme outlined above for crawling demands several modules that fit together as shown in Fig. 8.2.1 below.


[Figure] Fig. 8.2.1 : Basic crawler architecture - the URL frontier feeds a fetch module, followed by parsing, a "content seen?" check (using document fingerprints), a URL filter (using robots templates) and duplicate URL elimination (using the URL set), after which new URLs return to the frontier.
1. The URL frontier, containing URLs yet to be fetched in the current crawl (in the case of continuous crawling, a URL may have been fetched previously but is back in the frontier for re-fetching).
2. A DNS resolution module that determines the web server from
which to fetch the page specified by a URL.
3. A fetch module that uses the http protocol to retrieve the web page
at a URL.
4. A parsing module that extracts the text and set of links from a
fetched web page.
5. A duplicate elimination module that determines whether an
extracted link is already in the URL frontier or has recently been
fetched.
We beginby assuming that the URL frontier is in place and non-empty.
We follow the progress of a single URL through the cycle of being
fetched, passing through various checks and filters, then finally (for
continuous crawling) being returned to the URL frontier.
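The cycle just described can be sketched in Python. This is a toy, single-threaded illustration only: there is no DNS resolution, robots handling or politeness delay, and the in-memory `TOY_WEB` dictionary (an assumption, not from the text) stands in for the fetch and parse modules.

```python
from collections import deque

# A toy in-memory "web": each URL maps to the list of URLs it links to.
# A real crawler's fetch module would issue HTTP requests instead.
TOY_WEB = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html", "d.html"],
    "d.html": [],
}

def crawl(seed_urls, fetch=TOY_WEB.get):
    """Single-threaded crawl loop mirroring the modules above:
    URL frontier -> fetch -> parse (here fetch returns links directly)
    -> duplicate elimination -> back to the frontier."""
    frontier = deque(seed_urls)      # the URL frontier
    seen = set(seed_urls)            # duplicate-elimination store
    fetched_order = []
    while frontier:
        url = frontier.popleft()
        links = fetch(url)           # fetch + parse combined for brevity
        if links is None:            # unreachable page
            continue
        fetched_order.append(url)
        for link in links:           # URL filter / dup elimination
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched_order

order = crawl(["a.html"])  # -> ['a.html', 'b.html', 'c.html', 'd.html']
```

Using a `deque` gives breadth-first crawling; swapping it for a priority queue would let link analysis guide which URL is fetched next.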

a 8.2.2 Indexing the Web Pages or Web Indexes

GQ. Explain how to do indexing of the web pages?

Web indexing means creating indexes for individual Web sites,


intranets, collections of HTML documents, or even collections of Web
sites.
Indexes are systematically arranged items, such as topics or names, that
serve as entry points to go directly to desired information within a larger
document or set of documents.
Indexes are traditionally alphabetically arranged. But they may also
make use of hierarchical arrangements also.
Indexing is an analytic process of determining which concepts are worth
indexing, what entry labels to use, and how to arrange the entries.

A Web index is often a browsable list of entries from which the user makes selections, but it may be non-displayed and searched by the user typing into a search box.
An A-Z index is a kind of Web index that resembles an alphabetical back-of-the-book style index, where the index entries are hyperlinked directly to the appropriate Web page or page section, rather than using page numbers. Such Web indexes work particularly well in sites that have a flat structure with only one or two levels of hierarchy.
Indexes complement search engines on larger web sites, and for smaller sites they provide a cost-effective alternative. Whether to use a back-of-the-book style index or a hierarchy of categories will depend on the size of the site and how rapidly the content is changing.
Site indexes are best done by individuals skilled in indexing who also have basic skills in HTML or in using HTML indexing tools.
Electronic indexing includes embedded indexing of Word, FrameMaker, PDF and InDesign electronic documents, of publications, online help and Content Management System tagging. When the pages are edited or changed, the index is regenerated with new page numbers, anchors or URLs, with a hyperlink from the index to the relevant page or paragraph.
8.3 LINK ANALYSIS AND PAGERANK ALGORITHM

8.3.1 Link Analysis

GQ. Define the term link analysis.

The analysis of hyperlinks and the graph structure of the web have been instrumental in the development of web search.
Link analysis uses a network of interconnected links and nodes to identify and analyze relationships that are not easily seen in raw data.
The links between pages on the web are a large knowledge source which is exploited by link analysis algorithms for many ends.
Many algorithms similar to PageRank and HITS determine a quality or authority score based on the number of in-coming links of a page.
Link analysis is applied to identify thematically similar pages, web communities and other social structures.
Link analysis for web search has intellectual antecedents in the field of citation analysis.
Link analysis also proves to be a useful indicator of what page(s) to crawl next while crawling the web; this is done by using link analysis to guide the priority assignment in the front queues.


& 8.3.2 PageRank Algorithm

GQ. Explain PageRank algorithm.


ot
It is a scoring measure based only on the link structure of web pages. A web page is important if it is pointed to by other important web pages.
Our technique for link analysis assigns to every node in the web graph a numerical score between 0 and 1, known as its PageRank.
Given a query, a web search engine computes a composite score for each web page that combines hundreds of features, such as cosine similarity and term proximity, together with the PageRank score.

Fig. 8.3.1

The above figure shows the random surfer at node A proceeding with probability 1/3 to each of B, C and D. Consider a random surfer who begins at a web page (a node of the web graph) and executes a random walk on the Web as follows. At each time step, the surfer proceeds from his current page A to a randomly chosen web page that A hyperlinks to. The figure shows the surfer at a node A, out of which there are three hyperlinks to nodes B, C and D; the surfer proceeds at the next time step to one of these three nodes, with equal probabilities 1/3. As the surfer proceeds in this random walk from node to node, he visits some nodes more often than others; intuitively, these are nodes with many links coming in from other frequently visited nodes. The idea behind PageRank is that pages visited more often in this walk are more important.
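The random-surfer intuition can be simulated with power iteration. The sketch below is a minimal illustration: the damping factor `d = 0.85` and the back-links from B, C and D to A are assumptions added to make the toy graph complete, not part of the figure's example.

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank. `links` maps each node to the nodes
    it links to; d is the damping factor (probability of following a
    link rather than teleporting to a random page)."""
    nodes = list(links)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}           # start uniform
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}  # teleportation share
        for u, outs in links.items():
            if outs:                           # spread rank over out-links
                share = d * pr[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:                              # dangling node: spread evenly
                for v in nodes:
                    new[v] += d * pr[u] / n
        pr = new
    return pr

# Node A links to B, C and D (as in the random-surfer example);
# the return links B, C, D -> A are invented to close the walk.
graph = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}
ranks = pagerank(graph)
```

A collects the rank of three in-links, so it ends up as the most "important" node, matching the intuition that frequently visited pages score highest.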
Chapter Ends...
UNIT 2
Learning to Rank
CHAPTER 9

Syllabus
Learning to Rank : Algorithms and Techniques, Supervised learning for ranking: RankSVM, RankBoost, Pairwise and listwise learning to rank approaches, Evaluation metrics for learning to rank.

9.1 LEARNING TO RANK (LTR) : ALGORITHMS AND TECHNIQUES

GQ. Discuss the term learning to rank.

Learning to Rank (LTR) belongs to supervised machine learning, where we need a historical dataset to train the model, aiming to sort a list of items in terms of their relevance to a query.
In classical machine learning, in problems like classification and regression, the goal is to predict a single value based on a feature vector.
LTR algorithms operate on a set of feature vectors and predict the optimal order of items.
LTR is a class of algorithmic techniques that helps you serve query results that are not only relevant but are ranked by that relevancy.
It can be represented as a function of (Query, relevant documents) which returns ranked documents in relevant order: f(Q, D) = list of ranked documents.
LTR has many different applications. Here are some of them:
1. Search engines. A user types a query into a browser search bar. The search engine should rank the web pages in such a way that the most relevant results appear in top positions.
2. Recommender systems. A movie recommender system choosing which film should be recommended to a user based on an input query.
LTR algorithms use three major approaches or types: pointwise, pairwise, and listwise.
[Figure] Diagram of the three LTR approaches (pointwise, pairwise and listwise) and their associated algorithms.
We will be discussing RankSVM and RankBoost in next sections.


Learning to Rank (LTR) : Techniques/Approaches

GQ. Explain the different ranking techniques or approaches.

From the high level, the majority of LTR algorithms use stochastic gradient descent to find the most optimal ranking.

Depending on how an algorithm chooses and compares the ranks of items at each iteration, there exist three principal methods:
Pointwise ranking.
1.
Pairwise ranking.
2.

3. Listwise ranking.

1. Pointwise ranking
Pointwise approaches look at a single document at a time using
classification or regression to discover the best ranking for
individual results.
We give each document points for how well it fits during these
processes. We add those up and sort the result list. Note here that
each document score in the result list for the query is independent
of any other document score, i.e. each document is considered a
"point" for the ranking, independent of the other "points".
For pointwise approaches, the score for each document is
independent of the other documents that are in the result list for the
query.
Pointwise ranking optimizes document scores independently and does not take into account relative scores between different documents. Therefore, it does not directly optimize the ranking quality.
In the pointwise approach, scores are predicted individually for each feature vector. Ultimately, the predicted scores are sorted. It does not matter which type of model (decision tree, neural network, etc.) is used for prediction.
The advantage of pointwise ranking is simplicity.
The disadvantages are : each instance is treated as an isolated point, and explicit pointwise labels are required to create the training dataset.
To overcome these challenges, we can use the Pairwise Ranking method.


[Figure] Fig. 9.1.1 : Pointwise model architecture. As the input, the model accepts a query q and a document feature vector d, and outputs a relevance score or probability f(q, d) = s_i.
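A minimal sketch of the pointwise idea: score each document's feature vector in isolation, then sort the scores. The linear weights below are invented purely for illustration - in practice a trained regression, tree or neural model would supply the scores.

```python
# Hypothetical hand-set weights stand in for a trained pointwise model;
# any model type (decision tree, neural net, ...) could produce scores.
WEIGHTS = [0.7, 0.3]

def score(features):
    """Predict a relevance score for one (query, document) feature
    vector, independently of every other document."""
    return sum(w * x for w, x in zip(WEIGHTS, features))

def rank_pointwise(docs):
    """docs: {doc_id: feature_vector}. Score each doc in isolation,
    then sort by predicted score."""
    scored = {doc_id: score(f) for doc_id, f in docs.items()}
    return sorted(scored, key=scored.get, reverse=True)

docs = {"d1": [0.2, 0.9], "d2": [0.8, 0.1], "d3": [0.5, 0.5]}
ranking = rank_pointwise(docs)  # -> ['d2', 'd3', 'd1']
```

Note that no score depends on any other document - exactly the property that prevents pointwise methods from directly optimizing ranking quality.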

2. Pairwise ranking

3. Listwise ranking

9.2 PAIRWISE AND LISTWISE LEARNING TO RANK


APPROACHES

A9.2.1 Pairwise Learning to Rank Approaches

Pairwise approaches look at two documents together.


They also use classification or regression to decide which pair ranks
higher.
We compare this higher-lower pair against the ground truth (the gold
Standard of hand-ranked data we discussed earlier) and
adjust the
ranking if it doesn't match.
The goal is to minimize the number of cases where the pair of results are
in the wrong order relative to the ground truth (also called inversions).
Pairwise models work with a pair of documents at each iteration.
Depending on the input format there are two types of pairwise models -

Pair-input models and Single-input models.


In Pair-input models input to the model is two feature vectors. The
model output is the probability that the first document is ranked higher
than the second.

(NewSylabus wief Acadernic Year 23-24) (BC-12) Tech-Neo Publications



During training, these probabilities are calculated for different pairs of feature vectors. The weights of the model are adjusted through gradient descent based on ground truth ranks.

[Figure] Fig. 9.2.1 : Pair-input model architecture. As an input, the model accepts a query and two concatenated feature vectors d_i and d_j, and outputs f(q, d_i, d_j) = P(d_i > d_j), the probability that d_i is ranked higher than d_j.

Single-input models accepts a single feature vector as an input. During


training, each document in a pair is independently fed into the model to
receive its own score. Then both scores are compared and the model is
adjusted through gradient descent based on ground truth ranks.
During inference, each document receives a score by being passed to the
model. The scores are then sorted to obtain the final ranking.
Pair-input models are rarely used in practice and single-input models are
preferred over them.
RankNet, LambdaRank and LambdaMART are pairwise approaches.
[Figure] Fig. 9.2.2 : Single-input model architecture. The model scores each document independently, f(q, d_i) = s_i and f(q, d_j) = s_j, and a comparison function g(s_i, s_j) = P(d_i > d_j) gives the probability that d_i is ranked higher than d_j.


a query and a
single feature
As an input the model takes vector
representing a docuneat.
The ranking prediction is caiculated after the model independently
assigned scores to two feature vectors.

A 9.2.2 Listwise Learning to Rank Approaches

Listwise algorithms optimise ranking metrics explicitly.


In this approach, instead of considering pairs of documents, the list of
ranked documents is taken into account, along with their relevance
labels.
Listwise approaches directly look at the entire list of documents and
try
to come up with the optimal ordering for it.
Listwise approaches decide on the optimal ordering of an entire list of

documents.
Truth lists are identified, and the machine uses that data to rank its list.
Listwise approaches use probability models to minimize the ordering error.
There are two main sub-techniques for doing listwise Learning to Rank:
1. Direct optimization of IR measures such as NDCG. E.g., SoftRank.
AdaRank.
2. Minimize a loss function that is defined based on understanding the
unique properties of the kind of ranking you are trying to achieve.
E.g., ListNet, ListMLE.
Listwise approaches can get fairly complex compared to pointwise or

pairwise approaches.
Unlike pointwise or pairwise ranking, listwise methods take as an input
a whole Jist of documents at a single time.

This leads to big computations but also gives more robustness since the
algorithm is provided with more information at each iteration.

(New Syllabus w.e.f Academic Year 23-24) (BC-12) ech-Neo Publications



[Figure] Fig. 9.2.3 : Listwise model architecture. As an input, the model takes a query and the feature vectors of all documents d_1, ..., d_n, and outputs scores or rankings for all documents.
9.3 SUPERVISED LEARNING FOR RANKING : RANKSVM, RANKBOOST

Learning to Rank (LTR) belongs to supervised machine learning, where we need a historical dataset to train the model.
During each training iteration, the model predicts scores for a pair of documents. Therefore, the loss function should be pairwise and consider the scores of both documents.
In general, pairwise loss takes as its argument z the difference between the two scores s_i - s_j multiplied by a constant σ.
Depending on the algorithm, the loss function can have one of the following forms:

Loss function                 Algorithm
L(z) = max(0, 1 - z)          RankSVM
L(z) = e^(-z)                 RankBoost
L(z) = log(1 + e^(-z))        RankNet

where z = σ (s_i - s_j)
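The three pairwise loss forms above (hinge for RankSVM, exponential for RankBoost, logistic for RankNet) can be written directly; `sigma = 1.0` is an illustrative default, not a value prescribed by the text.

```python
import math

def pairwise_losses(s_i, s_j, sigma=1.0):
    """Compute the three pairwise losses for a document pair where
    s_i should be ranked above s_j; z = sigma * (s_i - s_j)."""
    z = sigma * (s_i - s_j)
    return {
        "RankSVM":   max(0.0, 1.0 - z),             # hinge loss
        "RankBoost": math.exp(-z),                  # exponential loss
        "RankNet":   math.log(1.0 + math.exp(-z)),  # logistic loss
    }

# A correctly ordered pair with a wide margin gives small loss...
good = pairwise_losses(s_i=3.0, s_j=0.0)
# ...while a misordered pair is penalised by all three losses.
bad = pairwise_losses(s_i=0.0, s_j=3.0)
```

All three losses decrease monotonically in z, so minimising any of them pushes the preferred document's score above the other's; they differ in how harshly misordered pairs are punished (exponential being the steepest).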

Tech-Neo Publications
(New Syllabus w.e.f Academic Year 23-24) (BC-12)
IR (MU-TY, B.SC.-Comp-SEM 6) (Learning to Rank))..Page
no. (9-8)

a 9.3.1 RankSVM

GQ. Explain RankSVM method.

The ranking SVM algorithm is a learning retrieval function that employs pairwise ranking methods to adaptively sort results based on how 'relevant' they are for a specific query.
The ranking SVM function uses a mapping function to describe the match between a search query and the features of each of the possible results.
This mapping function projects each data pair (such as a search query and a clicked web-page, for example) onto a feature space. These features are combined with the corresponding click-through data (which can act as a proxy for how relevant a page is for a specific query) and can then be used as the training data for the ranking SVM algorithm.
Generally, ranking SVM includes three steps in the training period:
1
It maps the similarities between queries and the clicked pages onto a certain feature space.

2. It calculates the distances between any two of the vectors obtained


in step 1.
3. It forms an optimization problem which is similar to a standard
SVM classification and solves this problem with the regular SVM
solver.
Given n training queries {q_i} (i = 1, ..., n), their associated document pairs (x_u^(i), x_v^(i)), and the corresponding ground truth labels y_u,v^(i), the mathematical formulation of Ranking SVM is as shown below, where a linear scoring function is used, i.e., f(x) = w^T x :

    min  (1/2) ||w||^2 + C Σ_(i=1..n) Σ_(u,v : y_u,v^(i) = 1) ξ_u,v^(i)

    s.t. w^T (x_u^(i) - x_v^(i)) ≥ 1 - ξ_u,v^(i),  if y_u,v^(i) = 1,

         ξ_u,v^(i) ≥ 0,  i = 1, ..., n.


As we can see, the objective function in Ranking SVM is exactly the same as in SVM, where the margin term (1/2)||w||^2 controls the complexity of model w. The difference between Ranking SVM and SVM lies in the constraints, which are constructed from document pairs.
The loss function in Ranking SVM is a hinge loss defined on document pairs. For example, for a training query q, if document x_u is labeled as being more relevant than document x_v (mathematically, y_u,v = +1), then if w^T x_u is larger than w^T x_v by a margin of 1, there is no loss. Otherwise, the loss will be ξ_u,v.
Since Ranking SVM is well rooted in the framework of SVM, it inherits nice properties of SVM. For example, with the help of margin maximization, Ranking SVM can have good generalization ability. Kernel tricks can also be applied to Ranking SVM, so as to handle complex non-linear problems.
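One common way to sketch Ranking SVM in code is the pairwise transform: each constraint w·(x_u − x_v) ≥ 1 becomes a hinge-loss term on the difference vector. The subgradient-descent loop below is a simplified stand-in for a real SVM solver, and the toy preference pairs are invented for illustration.

```python
def train_rank_svm(pairs, dim, C=1.0, lr=0.01, epochs=200):
    """Pairwise-transform sketch of Ranking SVM: each pair (x_u, x_v)
    with x_u preferred contributes the constraint w.(x_u - x_v) >= 1.
    We minimise hinge loss plus L2 regularisation by plain
    subgradient descent (a stand-in for a proper SVM solver)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x_u, x_v in pairs:
            diff = [a - b for a, b in zip(x_u, x_v)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            for k in range(dim):
                # subgradient of (1/2)||w||^2 + C * max(0, 1 - margin)
                grad = w[k] / (C * len(pairs))   # shrink toward 0
                if margin < 1:                   # violated constraint
                    grad -= diff[k]
                w[k] -= lr * grad
    return w

def rank(w, docs):
    """Sort documents by the learned linear score w.x."""
    scores = {d: sum(wi * xi for wi, xi in zip(w, x))
              for d, x in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: feature 0 correlates with relevance, feature 1 is noise.
pairs = [([1.0, 0.2], [0.1, 0.8]), ([0.9, 0.5], [0.2, 0.4])]
w = train_rank_svm(pairs, dim=2)
order = rank(w, {"good": [0.95, 0.3], "bad": [0.15, 0.6]})
```

After training, w points in a direction that scores the preferred side of each pair higher, so unseen documents strong in feature 0 are ranked first.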

A 9.3.2 RankBoost

GQ. Explain RankBoost method.
The method of RankBoost adopts AdaBoost for the classification over document pairs.
The only difference between RankBoost and AdaBoost is that the distribution in RankBoost is defined on document pairs while that in AdaBoost is defined on individual documents.
The algorithm flow of RankBoost is given in Algorithm 1, where D_t is the distribution on document pairs, f_t is the weak ranker selected at the t-th iteration, and α_t is the weight for linearly combining the weak rankers.
RankBoost actually minimizes the exponential loss defined below:

L(f; x_u, x_v, y_u,v) = exp(-y_u,v (f(x_u) - f(x_v)))


Algorithm 1 : Learning Algorithm for RankBoost
Input : document pairs
Given : initial distribution D_1 on input document pairs.
For t = 1, ..., T :
    Train weak ranker f_t based on distribution D_t.
    Choose α_t.
    Update D_(t+1)(x_u^(i), x_v^(i)) = (1/Z_t) D_t(x_u^(i), x_v^(i)) exp(-α_t (f_t(x_u^(i)) - f_t(x_v^(i))))
    where Z_t = Σ_(i=1..n) Σ_(u,v : y_u,v^(i) = 1) D_t(x_u^(i), x_v^(i)) exp(-α_t (f_t(x_u^(i)) - f_t(x_v^(i))))
Output : f(x) = Σ_t α_t f_t(x)
From Algorithm 1, one can see that RankBoost learns the optimal weak ranker f_t and its coefficient α_t based on the current distribution of the document pairs (D_t). Three ways of choosing α_t are discussed.
First, most generally, for any given weak ranker f_t, it can be shown that Z_t, viewed as a function of α_t, has a unique minimum, which can be found numerically via a simple binary search.
The second method is applicable in the special case that f_t takes a value from {0, 1}. In this case, one can minimize Z_t analytically as follows. For b ∈ {-1, 0, 1}, let

    W_b^(t) = Σ_(i=1..n) Σ_(u,v : y_u,v^(i) = 1) D_t(x_u^(i), x_v^(i)) · 1{ f_t(x_u^(i)) - f_t(x_v^(i)) = b }

Then

    α_t = (1/2) ln ( W_1^(t) / W_-1^(t) )
The third way is based on an approximation of Z_t, which is applicable when f_t takes a real value from [0, 1]. In this case, if we define

    r_t = Σ_(i=1..n) Σ_(u,v : y_u,v^(i) = 1) D_t(x_u^(i), x_v^(i)) (f_t(x_u^(i)) - f_t(x_v^(i)))

then

    α_t = (1/2) ln ( (1 + r_t) / (1 - r_t) )
Because of the analogy to AdaBoost, RankBoost inherits many nice


properties from AdaBoost, such as the ability in feature selection,
convergence in training, and certain generalization abilities.
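Algorithm 1 can be sketched with {0,1}-valued threshold weak rankers f(x) = 1 if x[k] > θ else 0, choosing α_t analytically as in the second method above. The candidate thresholds and the toy preference pairs are illustrative assumptions, not from the text.

```python
import math

def weak_rankers(dim, thresholds=(0.25, 0.5, 0.75)):
    """Candidate weak rankers: f(x) = 1 if x[k] > theta else 0."""
    return [(k, t) for k in range(dim) for t in thresholds]

def rank_boost(pairs, dim, T=10, eps=1e-6):
    """Minimal RankBoost over preference pairs (x_u preferred over
    x_v): keep a distribution D over pairs, pick the weak ranker
    minimising Z_t, set alpha analytically, reweight misranked pairs
    upward, and return the linear combination f(x)."""
    D = [1.0 / len(pairs)] * len(pairs)
    ensemble = []                           # list of (alpha, k, theta)
    for _ in range(T):
        best = None
        for k, t in weak_rankers(dim):
            # W_b = total weight of pairs with f(x_u) - f(x_v) = b
            w_pos = sum(d for d, (xu, xv) in zip(D, pairs)
                        if (xu[k] > t) and not (xv[k] > t))
            w_neg = sum(d for d, (xu, xv) in zip(D, pairs)
                        if (xv[k] > t) and not (xu[k] > t))
            alpha = 0.5 * math.log((w_pos + eps) / (w_neg + eps))
            # Z_t shrinks when the ranker orders most weight correctly
            z = sum(d * math.exp(-alpha * ((xu[k] > t) - (xv[k] > t)))
                    for d, (xu, xv) in zip(D, pairs))
            if best is None or z < best[0]:
                best = (z, alpha, k, t)
        z, alpha, k, t = best
        ensemble.append((alpha, k, t))
        D = [d * math.exp(-alpha * ((xu[k] > t) - (xv[k] > t))) / z
             for d, (xu, xv) in zip(D, pairs)]   # renormalised by Z_t
    def f(x):
        return sum(a * (x[k] > t) for a, k, t in ensemble)
    return f

# Two toy pairs in which the first document should outrank the second.
pairs = [([0.9, 0.1], [0.2, 0.7]), ([0.6, 0.4], [0.3, 0.9])]
f = rank_boost(pairs, dim=2)
```

The `eps` smoothing avoids division by zero when a weak ranker makes no mistakes on the current distribution; otherwise the loop mirrors Algorithm 1 step for step.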

9.4 EVALUATION METRICS FOR LEARNING TO RANK

GQ. Discuss in detail various ranking evaluation metrics.

The question which arises naturally is how to estimate the quality of a ranking algorithm.
There are several types of information retrieval metrics - Unranked, Ranked and User-oriented.

Metric Type      Examples
Unranked         MSE, RMSE, MAE, precision, recall
Ranked           Kendall Tau distance, precision@k, recall@k, AP@k, MAP@k, RR, MRR
User-oriented    nDCG, RBP, ERR

Unranked metrics
Unranked evaluation metrics consider that the set of relevant documents for a query is independent of the user and users' feedback.
They are mainly used for evaluating the performance of machine learning classification problems and are not a good measure for information retrieval systems.
Even though not suitable, some unranked metrics like error rate, fallout, and miss rate are used in the field of speech recognition and information retrieval.
MAE, MSE, RMSE, Precision and Recall are some unranked metrics.
average of the absolute
MAE (Mean Absolute Error) represents the
difference between the actual and predicted values in the
dataset. It
measures the average of the residuals in the dataset.
average of the squared
MSE (Mean Squared Error) represents the
set. It
difference between the original and predicted values in the data
measures the variance of the residuals.
square root of Mean
RMSE (Root Mean Squared Error) is the
Squared error. It measures the standard deviation of residuals.


Precision and Recall : In short, precision is the fraction of relevant items among all recommended items - it answers how many recommendations are correct. And recall is the fraction of relevant items captured among all relevant items - it answers the question of coverage: among all relevant items, how many are captured in the recommendations.
Ranked metrics
If the system is returning a ranked ordering of items, and items further down in the list are less likely to be used or seen, then the following metrics should be considered.
Kendall Tau distance, Precision@k, Recall@k, Average Precision@k, Mean Average Precision (MAP)@k, Reciprocal Rank (RR) and MRR (Mean Reciprocal Rank) are the most common ranked metrics.
A.
KendalI Tau distance is based on the number of rank inversions
a
invertion is a pair of documents (i. j) such as document having
i

greater relevance than document j. appears after on the search result than
„Kendali Tau distance calculates all the number of inversions in the
ranking. The lower the number of inversions, the better the search result
is.
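Counting inversions directly gives a small sketch of the Kendall Tau distance; `ideal` and `predicted` are assumed to be lists of document ids ordered from most to least relevant.

```python
def kendall_tau_distance(ideal, predicted):
    """Count inversions: pairs of documents whose relative order in
    `predicted` disagrees with the order given by `ideal`."""
    pos = {doc: i for i, doc in enumerate(predicted)}
    inversions = 0
    for i in range(len(ideal)):
        for j in range(i + 1, len(ideal)):
            # ideal says ideal[i] should rank above ideal[j]
            if pos[ideal[i]] > pos[ideal[j]]:
                inversions += 1
    return inversions
```

A perfect ranking scores 0; a fully reversed list of n documents scores n(n-1)/2, the maximum possible number of inversions.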
Precision@k would be the fraction of relevant items in the top k recommendations, and recall@k would be the coverage of relevant items in the top k.
Average Precision @K or AP@K is a metric that tells you how a single sorted prediction compares with the ground truth. E.g. AP would tell you how correct a single ranking of documents is, with respect to a single query. It is the sum of precision@k, taken where the item at the k-th rank is relevant (rel(k)), divided by the total number of relevant items (r) in the top K recommendations.

MAP is the Mean Average Precision. It simply implies the mean of AP@k for all the users. In order to do this, we divide the sum of all APs by m, where m is min(k, a), a being the number of actual relevant recommendations while our algorithm is supposed to recommend k.
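The definitions above can be written out directly; the document lists and relevance sets below are invented toy data.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision_at_k(ranked, relevant, k):
    """AP@K: sum precision@i at every rank i holding a relevant item,
    then divide by the number of relevant items possible in the top K."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            hits += 1
            total += hits / i        # precision at this relevant rank
    denom = min(k, len(relevant))
    return total / denom if denom else 0.0

def mean_average_precision(rankings, relevants, k):
    """MAP@K: mean of AP@K over all queries (or users)."""
    aps = [average_precision_at_k(r, rel, k)
           for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps)

# Relevant items d1 and d3 appear at ranks 1 and 3:
ap = average_precision_at_k(["d1", "d2", "d3", "d4"], {"d1", "d3"}, k=4)
# ap = (1/1 + 2/3) / 2 = 5/6
```

Because each precision term is only added at relevant ranks, AP rewards placing relevant documents as early in the list as possible.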



Reciprocal Rank (RR) is a number between 0 and 1. If the first relevant result is located at position k, the value of RR is 1/k.
MRR (Mean Reciprocal Rank) measures the reciprocal rank to evaluate systems that return a single answer for a query: the reciprocal rank is 1/rank, where rank is the position of the highest-ranked relevant answer for the query. If no correct answer was returned for the query, the reciprocal rank is 0.
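RR and MRR take only a few lines; the two example queries and their relevant sets are illustrative.

```python
def reciprocal_rank(ranked, relevant):
    """1/k where k is the position of the first relevant result;
    0 if no relevant result appears at all."""
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / k
    return 0.0

def mean_reciprocal_rank(rankings, relevants):
    """Average the reciprocal rank over many queries."""
    rrs = [reciprocal_rank(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(rrs) / len(rrs)

mrr = mean_reciprocal_rank(
    [["a", "b", "c"], ["x", "y", "z"]],  # two queries' result lists
    [{"b"}, {"q"}],                      # their relevant answers
)
# query 1: first hit at rank 2 -> 1/2; query 2: no hit -> 0; mrr = 0.25
```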
User-oriented metrics
Though ranked metrics consider ranking positions of items, thus being a preferable choice over the unranked ones, they still have a significant downside: the information about user behaviour is not taken into account.
User-oriented approaches make certain assumptions about user behaviour and, based on them, produce metrics that suit ranking problems better.
User-oriented metrics are NDCG (Normalized Discounted Cumulative Gain), RBP (Rank-Biased Precision) and ERR (Expected Reciprocal Rank).
Cumulative Gain. will be
NDCG stands for Normnalized Discounted
broken down into 3 parts.
scores for the
1. Cumulative Gain : Summation of all relevance
is, it will not take
recommended items in the list. Problem with it
into account the ordering of the items.
comes to
2. DCG
:
This is where the Discounted Cumulative Gain
score.
rescue by accounting for position along with relevance
3. NDCG: DCG would like
a
good measure but it will vary
recommendations. That's
significantly depending on the top k
comes into
where the Normalized Discounted Cumulative Gain
place which divides the DCG byIDCG(ideal DCG).
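The three parts can be sketched as follows; the log2(rank + 1) discount is the conventional choice (an assumption here, since the text does not spell out the formula), and the relevance lists are toy data.

```python
import math

def dcg(relevances):
    """Discounted Cumulative Gain: each relevance score is discounted
    by log2(rank + 1), so late positions contribute less."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalised DCG: DCG of the given ordering divided by the DCG
    of the ideal (descending) ordering of the same scores (IDCG)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

perfect = ndcg([3, 2, 1])   # already ideally ordered -> 1.0
worse = ndcg([1, 2, 3])     # best item placed last -> below 1.0
```

Dividing by IDCG bounds the metric to [0, 1], which makes scores comparable across queries with different numbers of relevant documents.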

RBP (Rank-Biased Precision)

RBP models a user who examines the results sequentially, progressing from one document to the next: at every step the user moves on to the next document with probability p, and terminates the search at the current document with the inverse probability (1 - p). The RBP formulation makes sure the score lies between 0 and 1.

ERR (Expected Reciprocal Rank)

As the name suggests, ERR measures the average reciprocal rank across many queries. It is similar to RBP but with a little difference: if the current item is relevant (R_i) for the user, then the search procedure ends. Otherwise, if the item is not relevant (1 - R_i), then with some probability the user decides whether he or she wants to continue the search process. If so, the search proceeds to the next item; otherwise, the user ends the search procedure.
Chapter Ends...

UNIT 2

CHAPTER 10 : Link Analysis and its Role in IR Systems

Syllabus

Link Analysis and its Role in IR Systems : Web graph representation and link analysis algorithms, HITS and PageRank algorithms, Applications of link analysis in IR systems.

10.1 WEB GRAPH REPRESENTATION AND LINK ANALYSIS

10.1.1 Web Graph

GQ. Explain the term web graph.
We can view the static Web, consisting of static HTML pages together with the hyperlinks between them, as a directed graph in which each web page is a node and each hyperlink a directed edge.
In this example we have six pages labeled A-F. Page B has in-degree 3 and out-degree 1. This example graph is not strongly connected: there is no path from any of pages B-F to page A.
A directed graph that is not strongly connected has pairs of pages such that one cannot proceed from one page of the pair to the other by following hyperlinks.
IR (MU-T.Y. B.Sc.-Comp-SEM 6) (Link Analysis & its Role in IRS)...Page no. (10-2)

Fig. 10.1.1 : A sample web graph.

Types of Links

1. Inbound links or Inlinks

Inbound links are links into the site from the outside. Inlinks are one way to increase a site's total PageRank. Sites are not penalized for inlinks.

2. Outbound links or Outlinks

Outbound links are links from a page to other pages in a site or other sites.

3. Dangling links

Dangling links are simply links that point to a page with no outgoing links.

There is ample evidence that these links are not randomly distributed; this distribution is widely reported to be a power law, in which the total number of web pages with in-degree i is proportional to 1/i^α; the value of α typically reported by studies is 2.1.
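The web graph can be represented as a plain adjacency list, from which in-degree and out-degree fall out directly. The link structure below is an illustrative toy graph chosen so that page B has in-degree 3 and out-degree 1, mirroring the sample graph above:

```python
# Directed web graph as an adjacency list: page -> pages it links to.
web_graph = {
    "A": ["B"],
    "B": ["C"],
    "C": ["B", "D"],
    "D": ["B"],
}

def out_degree(graph, page):
    """Number of hyperlinks leaving the page."""
    return len(graph.get(page, []))

def in_degree(graph, page):
    """Number of hyperlinks pointing at the page."""
    return sum(targets.count(page) for targets in graph.values())
```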



10.1.2 Link Analysis

GQ. Define the term link analysis.

The analysis of hyperlinks and the graph structure of the web has been instrumental in the development of web search.
Link analysis uses a network of interconnected links and nodes to identify and analyze relationships that are not easily seen in raw data.
The links between pages on the web are a large knowledge source which is exploited by link analysis algorithms for many ends.
Many algorithms similar to PageRank and HITS determine a quality or authority score based on the number of in-coming links of a page.
Link analysis is applied to identify thematically similar pages, web communities and other social structures.
Link analysis for web search has intellectual antecedents in the field of citation analysis.
Link analysis also proves to be a useful indicator of what page(s) to crawl next while crawling the web; this is done by using link analysis to guide the priority assignment in the crawl queues.

10.1.3 Link Analysis Algorithms

GQ. Explain link analysis algorithms.

Link analysis uses a network of interconnected links and nodes to identify and analyze relationships that are not easily seen in raw data.
The links between pages on the web are a large knowledge source which is exploited by link analysis algorithms for many ends.
They view the web as a directed graph where the web pages form the nodes and the hyperlinks between the web pages form the directed edges between these nodes; the link-based ranking algorithms then propagate page importance through links.
These algorithms are :
1. HITS (Hyperlink Induced Topic Search)
2. PageRank algorithm
Both algorithms are related to social networks. They exploit the hyperlinks of the Web to rank pages according to their levels of "prestige" or "authority".
Many algorithms similar to PageRank and HITS determine a quality or authority score based on the number of in-coming links of a page.
We will be discussing the HITS and PageRank algorithms in the next section.
10.2 HITS AND PAGERANK ALGORITHMS

10.2.1 HITS (Hyperlink-Induced Topic Search) Algorithm

GQ. Explain the HITS algorithm.

There are two types of important pages on the Web:
1. Authority : has authoritative content on a topic.
2. Hub : pages which link to many authoritative pages, e.g., a directory or catalog.
A good hub is one which links to many good authorities. A good authority is one which is linked to by many good hubs.
Given a query, every web page is assigned two scores. One is called its hub score and the other its authority score.
For any query, we compute two ranked lists of results rather than one. The ranking of one list is induced by the hub scores and that of the other by the authority scores.

Fig. 10.2.1 : An authority and a hub

A good hub page is one that points to many good authorities; a good authority page is one that is pointed to by many good hub pages.

(New Syllabus w.e.f Academic Year 23-24) (BC-12) Gech-Neo Publications


For a web page v in our subset of the web, we use h(v) to denote its hub score and a(v) its authority score. Initially, we set h(v) = a(v) = 1 for all nodes v. We also denote by v ↦ y the existence of a hyperlink from v to y.
The core of the iterative algorithm is a pair of updates to the hub and authority scores of all pages, given by the equations below, which capture the intuitive notions that good hubs point to good authorities and that good authorities are pointed to by good hubs:

h(v) ← Σ_{v ↦ y} a(y)
a(v) ← Σ_{y ↦ v} h(y)

The first equation sets the hub score of page v to the sum of the authority scores of the pages it links to. In other words, if v links to pages with high authority scores, its hub score increases. The second line plays the reverse role: if page v is linked to by good hubs, its authority score increases.
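The pair of update rules above can be sketched as follows. This is a minimal illustration (with L2 normalization added to keep the scores bounded, a standard detail the text leaves implicit), not the full HITS pipeline with root/base set construction:

```python
def hits(graph, iterations=20):
    """Iterate the HITS updates on a directed adjacency list."""
    nodes = list(graph)
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # a(v) <- sum of h(y) over all pages y that link to v
        auth = {v: sum(hub[y] for y in nodes if v in graph[y]) for v in nodes}
        # h(v) <- sum of a(y) over all pages y that v links to
        hub = {v: sum(auth[y] for y in graph[v]) for v in nodes}
        # normalize so the scores stay bounded across iterations
        a_norm = sum(x * x for x in auth.values()) ** 0.5 or 1.0
        h_norm = sum(x * x for x in hub.values()) ** 0.5 or 1.0
        auth = {v: x / a_norm for v, x in auth.items()}
        hub = {v: x / h_norm for v, x in hub.items()}
    return hub, auth
```

On a toy graph where h1 links to both a1 and a2 while h2 links only to a1, the page a1 ends up with the highest authority score and h1 with the highest hub score, as the intuition predicts.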

Advantages of the HITS algorithm

1. HITS scores pages according to the query string, resulting in relevant authority and hub pages.
2. The ranking may also be combined with other information-retrieval-based rankings.
3. HITS is sensitive to the user query (as compared to PageRank).
4. Important pages are obtained on the basis of the calculated authority and hub values.
5. HITS is a general algorithm for calculating authorities and hubs in order to rank the retrieved data.
6. HITS induces a web graph by finding a set of pages relevant to a given query string.
7. Results demonstrate that HITS calculates authority nodes and hubness correctly.


Drawbacks of the HITS algorithm

1. Since HITS is a query-dependent algorithm, the evaluation at query time is expensive.
2. The ratings or scores of authorities and hubs could rise due to flaws introduced by the web page designer. HITS assumes that when a user creates a web page, he adds a hyperlink from his page to another authority page because he honestly believes that the authority page is in some way related to his page (the hub). A situation may occur where a page that contains links to a large number of separate topics receives a high hub rank which is not relevant to the given query. Though this page is not the most relevant source for any information, it still has a very high hub rank if it points to highly ranked authorities.
3. HITS emphasizes mutual reinforcement between authority and hub pages. A good hub is a page that points to many good authorities and a good authority is a page that is pointed to by many good hubs.
4. Topic drift occurs when there are irrelevant pages in the root set and they are strongly connected. Since the root set itself contains non-relevant pages, this will reflect on to the pages in the base set. Also, the web graph constructed from the pages in the base set will not have the most relevant nodes, and as a result the algorithm will not be able to find the highest ranked authorities and hubs for a given query.
5. HITS invokes a traditional search engine to obtain a set of pages relevant to the query, expands this set with its inlinks and outlinks, and then attempts to find two types of pages, hubs (pages that point to many pages of high quality) and authorities (pages of high quality).

10.2.2 PageRank Algorithm

GQ. Explain the PageRank algorithm.
GQ. How is PageRank calculated?
GQ. Discuss the advantages and disadvantages of the PageRank algorithm.

Web page ranking algorithms rank the search results depending upon their relevance to the search query: they rank the search results in descending order of relevance to the query string being searched.



A web page's ranking for a specific query depends on factors like its relevance to the words and concepts in the query, its overall link popularity, etc.
PageRank is a scoring measure based only on the link structure of web pages. A web page is important if it is pointed to by other important web pages.
Our first technique for link analysis assigns to every node in the web graph a numerical score between 0 and 1, known as its PageRank.
Given a query, a web search engine computes a composite score for each web page that combines hundreds of features, such as cosine similarity and term proximity, together with the PageRank score.

Fig. 10.2.2

The figure above shows the random surfer at node A proceeding with probability 1/3 to each of B, C and D. Consider a random surfer who begins at a web page (a node of the web graph) and executes a random walk on the Web as follows. At each time step, the surfer proceeds from his current page A to a randomly chosen web page that A hyperlinks to. The figure shows the surfer at node A, out of which there are three hyperlinks to nodes B, C and D; the surfer proceeds at the next time step to one of these three nodes, with equal probabilities 1/3. As the surfer proceeds in this random walk from node to node, he visits some nodes more often than others: intuitively, these are nodes with many links coming in from other frequently visited nodes.
The idea behind PageRank is that pages visited more often in this walk are more important.


The PageRank computation

The left eigenvectors of the transition probability matrix P are N-vectors π such that

πP = λπ

The N entries in the principal eigenvector π are the steady-state probabilities of the random walk with teleporting, and thus the PageRank values for the corresponding web pages.
If π is the probability distribution of the surfer across the web pages in the steady state, it remains π after one step of the walk. Given that π is the steady-state distribution, we have that πP = 1π, so 1 is an eigenvalue of P. Thus, if we were to compute the principal left eigenvector of the matrix P (the one with eigenvalue 1), we would have computed the PageRank values.
We give here a rather elementary method, sometimes known as power iteration. If x is the initial distribution over the states, then the distribution at time t is xP^t. As t grows large, we would expect that the distribution xP^t is very similar to the distribution xP^(t+1), since for large t we would expect the Markov chain to attain its steady state. This is independent of the initial distribution x.
The power iteration method simulates the surfer's walk: begin at a state and run the walk for a large number of steps t, keeping track of the visit frequencies for each of the states.
After a large number of steps t, these frequencies "settle down" so that the variation in the computed frequencies is below some predetermined threshold. We declare these tabulated frequencies to be the PageRank values.
We consider the web graph in Exercise 21.6 with α = 0.5. The transition probability matrix of the surfer's walk with teleportation is then

P = ( 1/6   2/3   1/6
      5/12  1/6   5/12
      1/6   2/3   1/6 )

Imagine that the surfer starts in state 1, corresponding to the initial probability distribution vector x₀ = (1 0 0). Then, after one step the distribution is

x₀P = (1/6 2/3 1/6) = x₁
Advantages of
pages pointing to il are
spam, A page is important if thc
Fighting
Since it is not casy lor Web pagc oWner to add in-links into
important. casy to influence
other imporlant pages, it is thus not
page from
his/her
Runk,
Page mcasure and is query indepcndent. Page Rank
a global
Rank is at
Page pagcs are computcd and saved ofT-linc rathcr than the
valucs of all the
2

qucry time. more


make the wcbpage link analytic become
PageRank algorithm
3.
robust. qucry
a global Scale measurement PageRank is
PageRank is
4.
independent.

pisadvantages of
PageRank
rank. IL is because a ncw page cven has
Older pages may have higher
1. may not have many links in the carly
some very good contents but it
State.
by the "link-farms"
2.
PageRank can be easily increased
and HITS.
GQ. Differentiate between PageRank and HITS.

PageRank : Link analysis algorithm based on the random surfer model.
HITS : Link analysis algorithm.

PageRank : Web Structure Mining.
HITS : Web Structure Mining, Web Content Mining.

PageRank : Computes a single measure of quality for a page at crawl time. This measure is then combined with a traditional information retrieval score at query time. The advantage is greater efficiency.
HITS : For a given query, HITS invokes a traditional search engine to retrieve a set of pages relevant to it and then attempts to find hubs and authorities. Since this computation is carried out at query time, it is not feasible for today's search engines, which need to handle millions of queries per day.

PageRank : Does not attempt to capture the distinction between hubs and authorities. It ranks pages just by authority.
HITS : Emphasizes mutual reinforcement between authority and hub webpages.

PageRank : Can be unstable: changing a few links can lead to quite different rankings.
HITS : Can be unstable: changing a few links can lead to quite different rankings.

PageRank : Uses back links.
HITS : Uses content, back and forward links.

PageRank : Relevancy is less, since this algorithm ranks the pages at indexing time.
HITS : Relevancy is more, since this algorithm uses the hyperlinks to give good results and also considers the content of the page.

PageRank : Computed for all web pages stored prior to the query.
HITS : Performed on the subset generated by each query.

PageRank : Computes authorities only.
HITS : Computes authorities and hubs.

PageRank : Fast to compute.
HITS : Easy to compute, but execution time is costly.

10.3 APPLICATIONS OF LINK ANALYSIS IN IR SYSTEMS

GQ. Explain the various applications of link analysis in IR systems.

Apart from ranking, link analysis can also be used for deciding which web pages to add to the collection of web pages, i.e., which pages to crawl.

A crawler (or robot or spider) performs a traversal of the web graph with the goal of fetching high-quality pages. After fetching a page, it needs to decide which page out of the set of uncrawled pages to fetch next.
One approach is to crawl the pages with the highest number of links from the crawled pages first, in the order of PageRank. Link analysis was also used for a search-by-example approach to searching: given one relevant page, find pages related to it. We can use the HITS algorithm for this problem.
While link analysis is widely used in intelligence, it also has applications in many other domains. These include citation analysis, law enforcement, IT network security, fraud detection and investigation, and Anti-Money Laundering (AML).

Citation Analysis

Citation analysis is the study of citations among scientific papers and journals.
A standard measure in this field is the impact factor for a scientific journal, defined to be the average number of citations received by a paper in the given journal over the past two years. This type of voting by in-links can thus serve as a proxy for the collective attention that the scientific community pays to papers published in the journal.
Law Enforcement

Technology helps in increasing the productivity and efficiency of law enforcement agencies. A strong partnership between police and technology would facilitate quick criminal investigations, greatly reduce crime, and help to uphold law and order.
Big Data can be quite useful in identifying crime trends and hotspots. Smartphone apps connected to centralized databases give the investigating officer real-time access to data on missing individuals, vehicles, bodies, and criminal histories very quickly.
Technology can ensure integrated data gathering from the five pillars of the criminal justice system (police, court, prosecution, jails and forensics), which helps the police with their investigation.
A technologically advanced law enforcement agency is a valuable addition to the national security of a country.


IT Network Security

Network communication has become an important tool to ensure the efficient operation of modern society. The retrieval of network communication information is mainly based on the annotation and extraction of information characteristics in network communication, so as to carry out retrieval matching. The key to its application is to organize the information organically.
Network communication information retrieval is not only a fast, effective way to obtain the required information, but also has important research value and a role in the security of network communication.
Fraud Detection and Investigation

Fraud detection is of paramount importance for banks and other companies that deal with a significant number of financial transactions and are therefore at higher risk of suffering from financial fraud. However, other sectors such as e-commerce companies, credit card companies, electronic payment platforms, and B2C fintech companies also need to employ fraud detection to prevent or limit financial fraud.
The most common applications of fraud detection include account-related fraud and payment and transaction fraud. Account fraud is further divided into new account fraud and account takeover fraud. In new account fraud, new accounts are created by using fake identities. Such frauds can be identified by using the patterns of various devices and session indicators for detecting fake identities.
Anti-Money Laundering (AML)

The purpose of the AML rules is to help detect and report suspicious activity, including the predicate offenses to money laundering and terrorist financing, such as securities fraud and market manipulation.
The importance of AML in banking and other industries that use it comes down to protecting business operations and the economy, as well as upholding your moral responsibility. Specifically, compliance with AML allows institutions to avoid sanctions and fines, save money, and prevent criminal activity.
Chapter Ends...

UNIT 3

CHAPTER 11 : Crawling and Near-Duplicate Page Detection

Syllabus

Crawling and Near-Duplicate Page Detection : Web page crawling techniques: breadth-first, depth-first, focused crawling; Near-duplicate page detection algorithms; Handling dynamic web content during crawling.

11.1 WEB PAGE CRAWLING TECHNIQUES : BREADTH-FIRST, DEPTH-FIRST

A web crawler is a computer program that browses the World Wide Web in a sequenced and automated manner.
A crawler, also referred to as a spider, can be used for accessing web pages from the web server as per user queries, commonly for a search engine.
A web crawler also uses the sitemap protocol for crawling web pages.
Crawling the web is not just a programming task, but an algorithm design and system design challenge, because the web content is very large.
The web crawling process starts from a URL, but the starting URL will not reach all the web pages.
Web crawler techniques or strategies are called web crawler algorithms.
There are two such algorithms : Breadth-First and Depth-First.
We will be discussing Breadth-First and Depth-First in the next sections.

11.1.1 Breadth-First

GQ. Explain the breadth-first web crawling technique.

Breadth First Search

Breadth First Search is an algorithm for traversing or searching tree or graph data structures. It works level by level, i.e., the algorithm starts at the root URL and searches all the neighbour URLs at the same level.
If the desired URL is found, then the search terminates. If it is not, then the search proceeds down to the next level, and the process repeats until the goal is reached.
It uses the boundary (frontier) as a FIFO queue, crawling links in the order in which they are encountered. The Breadth First Search algorithm is generally used where the objective lies in the shallower parts of a deep tree.
The time complexity of breadth first search can be expressed as O(|V| + |E|), since every vertex and every edge will be explored in the worst case, where |V| is the number of vertices and |E| is the number of edges in the graph.

BreadthFirst(StartingUrls)
{
    for each Url in StartingUrls
        Enqueue(Boundary, Url);
    do
    {
        Url = Dequeue(Boundary);
        Page = Fetch(Url);
        Visited = Visited + 1;
        Enqueue(Boundary, ExtractLinks(Page));
    }
    while (Visited < MaxPages && Boundary != Null);
}
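The pseudocode above can be sketched as runnable Python. Here fetch_links is a stand-in for downloading a page and extracting its outgoing links (any callable works, so an in-memory link table can simulate the web); the seen set, a detail the pseudocode omits, prevents re-enqueuing the same URL:

```python
from collections import deque

def breadth_first_crawl(seed_urls, fetch_links, max_pages):
    """BFS crawl sketch with a FIFO frontier and a MaxPages limit."""
    frontier = deque(seed_urls)          # FIFO queue of unvisited URLs
    seen = set(seed_urls)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()         # dequeue in discovery order
        visited.append(url)
        for link in fetch_links(url):    # enqueue newly found links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

On the toy link table {"root": ["a", "b"], "a": ["c"]}, the crawl visits pages level by level: root, then a and b, then c.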



11.1.2 Depth-First

GQ. Explain the depth-first web crawling technique.

Depth First Search is an algorithm for traversing or searching tree or graph data structures.
It is a powerful technique to systematically traverse through the search space, starting at the root node and traversing deeper through the child nodes.
If there is more than one child, then priority is given to the left-most child, and we traverse deep until no more children are available.
The algorithm then backtracks to the next unvisited node and continues in a similar manner. This algorithm makes sure that all the edges are visited once.
It is well suited for search problems, but when the branches are large this algorithm might end up in an infinite loop.

Algorithm DFS(graph G, Vertex v)
    // Recursive algorithm
    for all edges e in G.incidentEdges(v) do
        if edge e is unexplored then
            w = G.opposite(v, e)
            if vertex w is unexplored then
                label e as a discovery edge
                recursively call DFS(G, w)
            else
                label e as a back edge
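A runnable sketch of this recursive scheme follows. The pseudocode above is phrased for undirected graphs (G.opposite); this adaptation uses a directed adjacency list instead, but keeps the same discovery-edge / back-edge labelling:

```python
def dfs(graph, v, explored=None, labels=None):
    """Recursive DFS that labels each edge as a discovery edge or a
    back edge (an edge leading to an already explored node)."""
    if explored is None:
        explored, labels = set(), {}
    explored.add(v)
    for w in graph[v]:                    # edges leaving v
        if (v, w) not in labels:          # edge not yet examined
            if w not in explored:
                labels[(v, w)] = "discovery"
                dfs(graph, w, explored, labels)
            else:
                labels[(v, w)] = "back"
    return labels
```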

GQ. What are the differences between DFS and BFS web crawlers?

DFS goes to the deepest levels of the graph first and then covers the breadth, while BFS covers the breadth while gradually going deeper.
Depth-first crawling explores a website by starting at the home page and then going down through the links to find new pages. This style of crawling finds new content quickly.

11.2 FOCUSED CRAWLING

A focused crawler selectively seeks out pages that are relevant to a predefined set of topics. Also, some pages have more on-topic content than others, and topical evidence can be combined in order to determine which pages to crawl.
There are three major challenges for focused crawling :
1. How to determine the relevance of a retrieved web page.
2. How to identify potential URLs that can lead to relevant pages.
3. How to order the relevant URLs so the crawler knows exactly which page to fetch next.

Architecture of Focused Crawling

GQ. Explain in detail the architecture of focused crawling.

The URL queue contains a list of unvisited URLs maintained by the crawler.
The web page downloader fetches URLs from the URL queue and downloads the corresponding pages from the internet.
The parser and extractor extracts information such as the terms and the hyperlink URLs from a downloaded page.
The relevance calculator calculates the relevance of a page with respect to the topic, and assigns a score to the URLs extracted from the page.
The topic filter analyzes whether the content of parsed pages is related to the topic or not. If the page is relevant, the URLs extracted from it will be added to the URL queue; otherwise they are added to the irrelevant table.
A focused crawling algorithm loads a page and extracts the links. By rating the links based on keywords, the crawler decides which page to retrieve next. The Web is traversed link by link, and the existing work is extended in the area of focused document crawling.
Fig. 11.2.1 : Architecture of a focused crawler (seed URLs, URL queue, web page downloader, parser & extractor, relevance calculator, topic filter, relevant and irrelevant tables, topic-specific weight table).




There are various categories of focused crawlers :
1. Classic focused crawler
2. Semantic crawler
3. Learning crawler

1. Classic focused crawlers

These guide the search towards interesting pages by taking the user query, which describes the topic, as input.
They assign priorities to the links based on the topic of the query, and the pages with high priority are downloaded first.
These priorities are computed on the basis of similarity between the topic and the page containing the links.
Text similarity is computed using an information similarity model such as the Boolean or the Vector Space Model.
2. Semantic crawlers

This is a variation of the classic focused crawlers.
To compute topic-to-page relevance, download priorities are assigned to pages by applying semantic similarity criteria: the sharing of conceptually similar terms defines the relevance of a page to the topic. An ontology is used to define the conceptual similarity between the terms.

3. Learning crawlers

These use a training process to guide the crawling process and to assign visit priorities to web pages.
A learning crawler is supplied a training set, which consists of relevant and non-relevant web pages, in order to train the crawler.
Links are extracted from web pages by assigning higher visit priorities to those classified as relevant to the topic.
Methods based on context graphs and Hidden Markov Models take into account not only the page content but also the link structure of the web page, and the probability that a given page will lead to a relevant page.

11.3 NEAR-DUPLICATE DETECTION ALGORITHMS

GQ. Discuss the concept of near-duplicate detection algorithms.
The Web contains multiple copies of the same content. By some estimates, as many as 40% of the pages on the Web are duplicates of other pages. Search engines try to avoid indexing multiple copies of the same content, to keep down storage and processing overheads.
The simplest approach to detecting duplicates is to compute, for each web page, a fingerprint that is a succinct (say 64-bit) digest of the characters on that page. Then, whenever the fingerprints of two web pages are equal, we test whether the pages themselves are equal and, if so, declare one of them to be a duplicate copy of the other.
This simplistic approach fails to capture a crucial and widespread phenomenon on the web: near duplication. In many cases, the contents of one web page are identical to those of another except for a few characters, say, a notation showing the date and time at which the page was last modified. Even in such cases, we want to be able to declare the two pages to be close enough that we only index one copy. Short of exhaustively comparing all pairs of web pages, an infeasible task at the scale of billions of pages, how can we detect and filter out such near duplicates?
We now describe a solution to the problem of detecting near-duplicate web pages. The answer lies in a technique known as shingling. Given a positive integer k and a sequence of terms in a document d, define the k-shingles of d to be the set of all consecutive sequences of k terms in d.
As an example, consider the following text: a rose is a rose is a rose. The 4-shingles for this text (k = 4 is a typical value used in the detection of near-duplicate web pages) are: a rose is a, rose is a rose, and is a rose is. The first two of these shingles each occur twice in the text. Intuitively, two documents are near duplicates if the sets of shingles generated from them are nearly the same. We now make this intuition precise, then develop a method for efficiently computing and comparing the sets of shingles for all web pages.
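The shingle construction and the overlap test can be sketched directly; simple whitespace tokenization is assumed here:

```python
def k_shingles(text, k):
    """Set of all consecutive k-term sequences in the text."""
    terms = text.split()
    return {tuple(terms[i:i + k]) for i in range(len(terms) - k + 1)}

def jaccard(s1, s2):
    """Jaccard coefficient |S1 intersect S2| / |S1 union S2|."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 1.0
```

Applied to "a rose is a rose is a rose" with k = 4, this yields exactly the three distinct shingles listed above.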

Let S(d_j) denote the set of shingles of document d_j. Recall the Jaccard coefficient, which measures the degree of overlap between the sets S(d1) and S(d2) as |S(d1) ∩ S(d2)| / |S(d1) ∪ S(d2)|; denote this by J(S(d1), S(d2)).
Our test for near duplication between d1 and d2 is to compute this Jaccard coefficient; if it exceeds a preset threshold (say, 0.9), we declare them near duplicates and eliminate one from indexing. However, this does not appear to have simplified matters: we still have to compute Jaccard coefficients pairwise.
To avoid this, we use a form of hashing. First, we map every shingle into a hash value over a large space, say 64 bits. For j = 1, 2, let H(d_j) be the corresponding set of 64-bit hash values derived from S(d_j). We now invoke the following trick to detect document pairs whose sets H() have large Jaccard overlaps. Let π be a random permutation from the 64-bit integers to the 64-bit integers. Denote by Π(d_j) the set of permuted hash values in H(d_j); thus for each h ∈ H(d_j), there is a corresponding value π(h) ∈ Π(d_j). Let x_j be the smallest integer in Π(d_j). Then

J(S(d1), S(d2)) = P(x_1 = x_2)
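The estimator implied by this equation can be sketched as follows. True random permutations of the 64-bit integers are impractical, so this illustration uses the standard approximation of universal linear hash functions modulo a large prime in place of random permutations:

```python
import random

def minhash_estimate(h1, h2, num_permutations=200, seed=0):
    """Estimate J(S1, S2) as the fraction of random linear maps under
    which the minimum mapped values of the two hash sets coincide."""
    rng = random.Random(seed)
    prime = (1 << 61) - 1                 # modulus for the linear maps
    agree = 0
    for _ in range(num_permutations):
        a = rng.randrange(1, prime)       # random "permutation" pi(h) = (a*h + b) mod prime
        b = rng.randrange(prime)
        minimum = lambda s: min((a * h + b) % prime for h in s)
        if minimum(h1) == minimum(h2):
            agree += 1
    return agree / num_permutations
```

With enough repetitions the agreement rate converges to the Jaccard coefficient, so a handful of stored minimum values per document suffices to compare all pairs cheaply.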

[Figure : Computing the minimum hash values. In the first step (top row), we apply a random permutation π to permute H(d1) and H(d2), obtaining Π(d1) and Π(d2); the minimum value in each permuted set is then compared.]

Proof. We give the proof in a slightly more general setting: consider a family of sets whose elements are drawn from a common universe. View the sets as columns of a matrix A, with one row for each element in the universe. The element a_ij = 1 if element i is present in the set S_j that the jth column represents. Let Π be a random permutation of the rows of A; denote by Π(S_j) the column that results from applying Π to the jth column. Finally, let x_j be the index of the first row in which the column Π(S_j) has a 1. We then prove that for any two columns j1, j2,

P(x_j1 = x_j2) = J(S_j1, S_j2)

If we can prove this, the theorem follows. Consider two columns j1, j2 as shown in the figure below. The ordered pairs of entries of S_j1 and S_j2 partition the rows into four types: those with 0's in both of these columns, those with a 0 in S_j1 and a 1 in S_j2, those with a 1 in S_j1 and a 0 in S_j2, and finally those with 1's in both of these columns. Indeed, the first four rows of Figure 19.9 exemplify all of these four types of rows.

[Figure : Two sets S_j1 and S_j2; their Jaccard coefficient is 2/5.]

Denote by C_00 the number of rows with 0's in both columns, C_01 the second type, C_10 the third and C_11 the fourth. Then,

J(S_j1, S_j2) = C_11 / (C_01 + C_10 + C_11)

To complete the proof by showing that the right-hand side of this equation equals P(x_j1 = x_j2), consider scanning the columns j1, j2 in increasing row index until the first non-zero entry is found in either column. Because Π is a random permutation, the probability that this smallest row has a 1 in both columns is exactly the right-hand side of the equation.

New Syllabus w.e.f Academic Year 23-24) (BC-12)


arechNeo Publications
Near-Duplicate PD)...Page
IR (MU-T.Y. B.Sc.-Comp-SEM16) (Crawling
E
&
no.(11-10)

W 11.4 HANDLING DYNAMIC WEB CONTENT DURING


CRAWLING

GQ. How the dynamicweb content can be handled in crawling process 2

A dynamic URIL is a URL of Web page with content that denende


a

variable parameters that are provided to the server that delivers the
content. The parameters may be already present in the URL itself or they
may be the result of user input.
a
A
dynamic URL typically results from search of database-driven
website or the URL of a website that runs a script.
In contrast to static URLs, in which the contents of the webpage do not
change unless the changes are coded into the HTML, dynamic URLS are
typically generated from specific queries to a website's database.
The webpage has some fixed content and some part of the webpage is a
template to display the results of the query, where the content comes
from the database that is associated with the website. This results in the
page changing based on the data retrieved from the database per the
dynamic parameter. Dynamic URLS often contain the following
characters: ?, &, %, +, =, $, cgi.
However, sometimes a parameter in a dynamic URL may not result in
modifying the page content in any way.
One of the parameters of the example dynamic URL above is sessionid
followed by a corresponding value that is unique to a user. The
<sessionid'" parameter is used by the website to track the user during a
particular session in order to tailor the user's experience based on
knowledge obtained about what actions the user has made during the
session. The <sessionid" may be inserted into the URL as a result from
a user registering and logging into the website.

Another parameter similar to sessionid parameter is the source tracker


parameter. Like the sessionid parameter, the source tracker parameter
has no effect on the content of webpage; it is only used for logging
traffic sources to the webpage.
One approach for a web crawler is to intelligently analyse a particular
webpage and compare the particular webpage against other webpages to

(New Syllabus w.e.f Academic Year 23-24) (BC-12) ech-Neo Publications


B.Sc.-Comp-SEM6) (Crawling &
Near-Duplicate PD)..Page no. (11-11)

content of the particular webpage is truly unique.


whether the
deterimine prone to error (i.e., not all duplicates
an approach is still
However, such
duplicates).
identiffed as
are resources are consumed by simply
a significant amount of.
Furthermore, less performing the comparisons. By
webpages, much
accessing the multiple webpages of a website, that time may
accessing
wasting time other valid, non-duplicate webpages.
be used accessing to
not a web crawler is to implement strict rules
approach for
Another to avoid accessing duplicates. For
dynamic URLS in order
handle may only access a small number of webpages
web crawler
example, a
looking" URLS.
with 'similar access URLS that are
a web crawler may not
another example,
URLS with
Ás number of characters in order to avoid
a certain
greater than measures prevent web crawlers from
identifiers. However, such
session
amount of unique content.
accessing a significant
URLs is for webmasters to
approach for handling dynamic
Another or to rewrite
respective websites to avoid dynamic URLS
modify their
to make them appear static so that web crawlers will
dunamic URLS websites
their respective websites. Webmasters of
crawl the entirety of order to
user traffic to their respective websites in
typically desire lots of
revenue.
generate advertisement
webmasters want web crawlers to crawl all relevant
Accordingly, web
on their respective websites. However, because of
webpages
URLS, Webmasters must spend
crawler difficulties in handling dynamic
a considerable amount of time modifying
their respective websites.
more efficiently handle dynamic URLs in
Therefore, there is a need to
order to avoid unnecessarily accessing
duplicate webpages.
and generates content by
Many of the websites are JavaScript heavy
doing asynchronous JavaScript calls
after page is loaded. The different
JS frameworks like Angular, React.js,
Vue.js are popular. The use of
and provides many
these frameworks makes developer life simpler
benefits for creating dynamic sites.
Chapter Ends...
UNIT 3
CHAPTER 11 : Crawling and Near-Duplicate Page Detection

Syllabus

Web Crawlers : Breadth-first and depth-first crawling, Focused crawling, Near-duplicate page detection, Handling dynamic web content during crawling.

11.1.2 Depth-First

Algorithm DFS(Graph G, Vertex v)
    // Recursive algorithm
    for all edges e in G.incidentEdges(v) do
        if edge e is unexplored then
            w = G.opposite(v, e)
            if vertex w is unexplored then
                label e as a discovery edge
                recursively call DFS(G, w)
            else
                label e as a back edge
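The pseudocode can be exercised as a small runnable sketch. The adjacency-list dictionary and the `dfs` helper below are illustrative stand-ins for the book's `G.incidentEdges` / `G.opposite` graph interface, not part of the original algorithm.

```python
# Recursive DFS that labels each edge "discovery" or "back",
# mirroring the DFS(G, v) pseudocode above. Undirected edges are
# represented as frozensets so {v, w} and {w, v} are the same edge.
def dfs(graph, v, visited=None, labels=None):
    if visited is None:
        visited, labels = set(), {}
    visited.add(v)
    for w in graph[v]:                      # edges incident to v
        edge = frozenset((v, w))
        if edge not in labels:              # edge e is unexplored
            if w not in visited:            # opposite vertex unexplored
                labels[edge] = "discovery"
                dfs(graph, w, visited, labels)
            else:
                labels[edge] = "back"
    return labels

# A small undirected graph with one cycle: A-B, B-C, C-A, and C-D.
graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["B", "A", "D"], "D": ["C"]}
print(dfs(graph, "A"))
```

Starting from A, three edges become discovery edges (forming a DFS tree) and the cycle-closing edge A-C becomes a back edge.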
GQ. What are the differences between DFS and BFS web crawlers?

DFS goes to the deepest levels of the graph and then covers the breadth, while BFS covers the breadth while gradually going deep.
Depth-first crawling explores a website by starting at the home page and then going down through the links to find new pages. This crawling finds new content quickly.

(New Syllabus w.e.f Academic Year 23-24) (BC-12) Tech-Neo Publications


IR (MU-T.Y. B.Sc.-Comp-SEM 6) (Crawling & Near-Duplicate PD)...Page no. (11-4)

On the other hand, breadth-first crawling starts at the home page and attempts to explore all of the links before moving on to the next level. This type of crawling is often used for websites with many pages, ensuring that all pages get crawled eventually.
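The two strategies differ only in the frontier data structure: a FIFO queue yields breadth-first order, a LIFO stack yields depth-first. A minimal sketch over an invented toy link graph (not a real website):

```python
from collections import deque

# Toy link graph standing in for a website: the home page links to
# two sections, and each section links to one deeper page.
links = {
    "home": ["sec1", "sec2"],
    "sec1": ["deep1"],
    "sec2": ["deep2"],
    "deep1": [], "deep2": [],
}

def crawl(seed, depth_first=False):
    frontier, seen, order = deque([seed]), {seed}, []
    while frontier:
        # LIFO (stack) -> depth-first; FIFO (queue) -> breadth-first.
        url = frontier.pop() if depth_first else frontier.popleft()
        order.append(url)
        for out in links[url]:
            if out not in seen:
                seen.add(out)
                frontier.append(out)
    return order

print(crawl("home"))                    # breadth-first visit order
print(crawl("home", depth_first=True))  # depth-first visit order
```

Breadth-first visits both sections before any deep page; depth-first dives to a deep page before returning for the other section.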
11.2 FOCUSED CRAWLING

GQ. Explain in detail focused crawling.

Focused crawler is used to collect those web pages that are relevant to a particular topic while filtering out the irrelevant. Thus, focused crawling is used to generate data for an individual user.
A focused crawler attempts to download only those pages that are about a particular topic.
Focused crawlers rely on the fact that pages about a topic tend to have links to other pages on the same topic. If this were perfectly true, it would be possible to start a crawl at one on-topic page, then crawl all pages on that topic just by following links from a single root page. In practice, a number of popular pages for a specific topic are typically used as seeds.
Focused crawlers require some automatic means for determining whether a page is about a particular topic.
Text classifiers are tools that can make this kind of distinction. Once a page is downloaded, the crawler uses the classifier to decide whether the page is on topic. If it is, the page is kept, and links from the page are used to find other related sites. The anchor text in the outgoing links is an important clue of topicality. Also, some pages have more on-topic links than others.


As links from a particular web page are visited, the crawler can keep track of the topicality of the downloaded pages and use this to determine whether to download other similar pages. Anchor text data and page link topicality data can be combined together in order to determine which pages should be crawled next.
There are three major challenges for focused crawling :
1. It needs to determine the relevance of a retrieved web page.


2. Predict and identify potential URLs that can lead to relevant pages.
3. Rank and order the relevant URLs so the crawler knows exactly what to follow next.
Architecture of Focused crawling

GQ. Explain in detail the architecture of Focused crawling.

URL queue contains a list of unvisited URLS maintained by the crawler


and is initialized with seed URLs.
Web page downloader fetches URLs from URL queue and downloads
corresponding pages from the internet.
The parser and extractor extracts information such as the terms and the
hyperlink URLs from a downloaded page.
Relevance calculator calculates the relevance of a page with respect to the topic, and assigns scores to the URLs extracted from the page.
Topic filter analyzes whether the content of parsed pages is related to
topic or not. If the page is relevant, the URLs extracted from it will be
added to the URL queue, otherwise added to the Irrelevant table.
A focused crawling algorithm loads a page and extracts the links. By
rating the links based on keywords the crawler decides which page to
retrieve next. The Web is traversed link by link and the existing work is
extended in the area of focused document crawling.
Fig. 11.2.1 : Architecture of a focused crawler. Seed URLs initialize the URL Queue; the Web Page Downloader fetches pages from the Internet; the Parser & Extractor passes extracted terms and links to the Relevance Calculator (which consults the Topic Specific Weight Table) and the Topic Filter, which routes relevant pages to the Relevant Page DB and irrelevant URLs to the Irrelevant Table.
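The crawl loop sketched in Fig. 11.2.1 can be approximated with a priority queue keyed by a relevance score. The keyword-overlap scorer and the toy pages below are invented for illustration; for simplicity the sketch scores a link by peeking at the target page's text, whereas a real crawler would rely on anchor text and the topicality of the source page.

```python
import heapq

# Toy web: each page maps to (text, outgoing links); invented data.
pages = {
    "seed": ("python tutorials index", ["a", "b"]),
    "a":    ("python parsing guide", ["c"]),
    "b":    ("cooking recipes", []),
    "c":    ("python indexing internals", []),
}
topic = {"python", "parsing", "indexing"}

def relevance(text):
    # Fraction of topic keywords present in the page text.
    return len(set(text.split()) & topic) / len(topic)

def focused_crawl(seed, threshold=0.3):
    # Max-heap via negated scores: the most relevant URL is fetched first.
    frontier, seen, relevant = [(-1.0, seed)], {seed}, []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, outlinks = pages[url]            # "download" the page
        if relevance(text) >= threshold:       # topic filter
            relevant.append(url)
            for out in outlinks:               # expand only on-topic pages
                if out not in seen:
                    seen.add(out)
                    heapq.heappush(frontier, (-relevance(pages[out][0]), out))
    return relevant

print(focused_crawl("seed"))
```

The off-topic page "b" is downloaded but filtered out and never expanded, while the on-topic pages are visited in priority order.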

(New Sylabus w.e.f Academic Year 23-24) (BC-12) Sech-Neo Publications


There are various categories in focused crawlers :
1. Classic focused crawler
2. Semantic crawler
3. Learning crawler

1. Classic focused crawlers
Guides the search towards interested pages by taking the user query, which describes the topic, as input.
They assign priorities to the links based on the topic of the query, and the pages with high priority are downloaded first.
These priorities are computed on the basis of similarity between the topic and the page containing the links.
Text similarity is computed using an information similarity model such as the Boolean or the Vector Space Model.

2. Semantic crawlers
It is a variation of classic focused crawlers.
To compute topic-to-page relevance, download priorities are assigned to pages by applying semantic similarity criteria: the sharing of conceptually similar terms defines the relevance of a page and the topic.
Ontology is used to define the conceptual similarity between the terms.

3. Learning crawlers
Uses a training process to guide the crawling process and to assign visit
priorities to web pages.
A learning crawler is supplied with a training set which consists of relevant and non-relevant Web pages in order to train it.
Links are extracted from web pages, and higher visit priorities are assigned to links classified as relevant to the topic.
Methods based on context graphs and Hidden Markov Models take into account not only the page content but also the link structure of the Web and the probability that a given page will lead to a relevant page.


11.3 NEAR DUPLICATE DETECTION ALGORITHM

GQ. Discuss the concept of the near-duplicate detection algorithm.
The Web contains multiple copies of the same content. By some estimates, as many as 40% of the pages on the Web are duplicates of other pages. Search engines try to avoid indexing multiple copies of the same content, to keep down storage and processing overheads.
The simplest approach to detecting duplicates is to compute, for each web page, a fingerprint that is a succinct (say 64-bit) digest of the characters on that page. Then, whenever the fingerprints of two web pages are equal, we test whether the pages themselves are equal and if so declare one of them to be a duplicate copy of the other.
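A fingerprint of this kind can be sketched with any strong hash truncated to 64 bits; `blake2b` with an 8-byte digest is used below purely as a convenient illustration.

```python
import hashlib

def fingerprint(page_text: str) -> int:
    # Succinct 64-bit digest of the characters on the page.
    digest = hashlib.blake2b(page_text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

a = "a rose is a rose is a rose"
b = "a rose is a rose is a rose"
c = "a rose is a rose is a daisy"
print(fingerprint(a) == fingerprint(b))  # exact copies share a fingerprint
print(fingerprint(a) == fingerprint(c))  # one changed word: fingerprints differ
```

Equal fingerprints only nominate a candidate pair; the pages are then compared directly, since distinct pages can in principle collide.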
This simplistic approach fails to capture a crucial and widespread phenomenon on the web : near duplication.
In many cases, the contents of one web page are identical to those of another except for a few characters, say, a notation showing the date and time at which the page was last modified. Even in such cases, we want to be able to declare the two pages to be close enough that we only index one copy. Short of exhaustively comparing all pairs of web pages, an infeasible task at the scale of billions of pages, how can we detect and filter out such near duplicates?
We now describe a solution to the problem of detecting near-duplicate web pages. The answer lies in a technique known as shingling.
Given a positive integer k and a sequence of terms in a document d, define the k-shingles of d to be the set of all consecutive sequences of k terms in d.
As an example, consider the following text: a rose is a rose is a rose. The 4-shingles for this text (k = 4 is a typical value used in the detection of near-duplicate web pages) are: a rose is a, rose is a rose, and is a rose is.
The first two of these shingles each occur twice in the text. Intuitively, two documents are near duplicates if the sets of shingles generated from them are nearly the same. We now make this intuition precise, then develop a method for efficiently computing and comparing the sets of shingles for all web pages.

Let S(d1) denote the set of shingles of document d1. Recall the Jaccard coefficient, which measures the degree of overlap between the sets S(d1) and S(d2) as |S(d1) ∩ S(d2)| / |S(d1) ∪ S(d2)|; denote this by J(S(d1), S(d2)).
Our test for near duplication between d1 and d2 is to compute this Jaccard coefficient; if it exceeds a preset threshold (say, 0.9), we declare them near duplicates and eliminate one from indexing. However, this does not appear to have simplified matters: we still have to compute Jaccard coefficients pairwise.
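The k-shingle sets and the Jaccard test follow directly from the definitions above; the 0.9 threshold is the preset value mentioned in the text.

```python
def k_shingles(text, k=4):
    # Set of all consecutive sequences of k terms in the text.
    terms = text.split()
    return {tuple(terms[i:i + k]) for i in range(len(terms) - k + 1)}

def jaccard(s1, s2):
    # Degree of overlap between two shingle sets.
    return len(s1 & s2) / len(s1 | s2)

d1 = "a rose is a rose is a rose"
d2 = "a rose is a rose is a flower"
s1, s2 = k_shingles(d1), k_shingles(d2)
print(sorted(s1))              # the three 4-shingles of the running example
print(jaccard(s1, s2))         # 0.75 for this pair
print(jaccard(s1, s2) >= 0.9)  # near-duplicate test at threshold 0.9
```

Changing a single word of the running example already drops the Jaccard coefficient to 0.75, below the 0.9 threshold.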
To avoid this, we use a form of hashing. First, we map every shingle into a hash value over a large space, say 64 bits.
For j = 1, 2, let H(dj) be the corresponding set of 64-bit hash values derived from S(dj). We now invoke the following trick to detect document pairs whose sets H(·) have large Jaccard overlaps. Let π be a random permutation from the 64-bit integers to the 64-bit integers. Denote by Π(dj) the set of permuted hash values in H(dj); thus for each h ∈ H(dj), there is a corresponding value π(h) ∈ Π(dj). Let x_j^π be the smallest integer in Π(dj). Then
J(S(d1), S(d2)) = P(x_1^π = x_2^π)

Fig. : Illustration of shingle sketches. Two documents go through four stages of shingle sketch computation. In the first step (top row), we apply a 64-bit hash to each shingle of each document to obtain H(d1) and H(d2). Next, we apply a random permutation Π to permute H(d1) and H(d2), obtaining Π(d1) and Π(d2). The third row shows only Π(d1) and Π(d2), and the bottom row shows the minimum values x_1^π and x_2^π for each document.

Proof. We give the proof in a slightly more general setting: consider a family of sets whose elements are drawn from a common universe. View the sets as columns of a matrix A, with one row for each element in the universe. The element a_ij = 1 if element i is present in the set S_j that the jth column represents. Let Π be a random permutation of the

rows of A; denote by Π(S_j) the column that results from applying Π to the jth column. Finally, let x_j be the index of the first row in which the column Π(S_j) has a 1. We then prove that for any two columns j1, j2,
P(x_j1 = x_j2) = J(S_j1, S_j2)
If we can prove this, the theorem follows. Consider two columns j1, j2 as shown in the Figure below.
The ordered pairs of entries of S_j1 and S_j2 partition the rows into four types: those with 0's in both of these columns, those with a 0 in S_j1 and a 1 in S_j2, those with a 1 in S_j1 and a 0 in S_j2, and finally those with 1's in both of these columns. Indeed, the first four rows of the Figure exemplify all of these four types of rows.
Fig. : Two sets S_j1 and S_j2, shown as 0/1 columns of the matrix A; their Jaccard coefficient is 2/5.


Denote by C00 the number of rows with 0's in both columns, C01 the second type, C10 the third and C11 the fourth. Then,
J(S_j1, S_j2) = C11 / (C01 + C10 + C11)
To complete the proof, we show that the right-hand side of this equation equals P(x_j1 = x_j2): consider scanning columns j1, j2 in increasing row index until the first non-zero entry is found in either column. Because Π is a random permutation, the probability that this smallest row has a 1 in both columns is exactly the right-hand side of the equation.


11.4 HANDLING DYNAMIC WEB CONTENT DURING CRAWLING

GQ. How can dynamic web content be handled in the crawling process ?

A dynamic URL is a URL of a Web page with content that depends on variable parameters that are provided to the server that delivers the content. The parameters may be already present in the URL itself or they may be the result of user input.
A dynamic URL typically results from a search of a database-driven website or the URL of a website that runs a script.
In contrast to static URLs, in which the contents of the webpage do not change unless the changes are coded into the HTML, dynamic URLs are typically generated from specific queries to a website's database.
The webpage has some fixed content and some part of the webpage is a template to display the results of the query, where the content comes from the database that is associated with the website. This results in the page changing based on the data retrieved from the database per the dynamic parameter. Dynamic URLs often contain the following characters: ?, &, %, +, =, $, cgi.
However, sometimes a parameter in a dynamic URL may not result in modifying the page content in any way.
One of the parameters of the example dynamic URL above is sessionid, followed by a corresponding value that is unique to a user. The "sessionid" parameter is used by the website to track the user during a particular session in order to tailor the user's experience based on knowledge obtained about what actions the user has made during the session. The "sessionid" may be inserted into the URL as a result of a user registering and logging into the website.
Another parameter similar to the sessionid parameter is the source tracker parameter. Like the sessionid parameter, the source tracker parameter has no effect on the content of the webpage; it is only used for logging traffic sources to the webpage.
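A common defensive step that follows from this observation is URL canonicalization: stripping query parameters known not to affect page content (session IDs, traffic trackers) before checking whether a URL has already been seen. The parameter names below are typical examples, not an exhaustive or authoritative list.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters that identify a session or a traffic source but do not
# change the page content (illustrative list).
IGNORED = {"sessionid", "sid", "utm_source", "utm_medium", "ref"}

def canonicalize(url):
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in IGNORED]
    # Sort the surviving parameters so ordering differences vanish too.
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(sorted(kept)), ""))

u1 = "http://example.com/item?id=7&sessionid=abc123"
u2 = "http://example.com/item?sessionid=xyz999&id=7"
print(canonicalize(u1))                      # http://example.com/item?id=7
print(canonicalize(u1) == canonicalize(u2))  # same page despite session IDs
```

Both URLs collapse to the same canonical form, so the crawler fetches the page only once.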
One approach for a web crawler is to intelligently analyse a particular webpage and compare the particular webpage against other webpages to

determine whether the content of the particular webpage is truly unique. However, such an approach is still prone to error (i.e., not all duplicates are identified as duplicates).

Furthermore, a significant amount of resources is consumed by simply accessing the webpages, much less performing the comparisons. By wasting time accessing multiple webpages of a website, that time may not be used accessing other valid, non-duplicate webpages.
Another approach for a web crawler is to implement strict rules to handle dynamic URLs in order to avoid accessing duplicates. For example, a web crawler may only access a small number of webpages with "similar looking" URLs.
As another example, a web crawler may not access URLs that are greater than a certain number of characters in order to avoid URLs with session identifiers. However, such measures prevent web crawlers from accessing a significant amount of unique content.
Another approach for handling dynamic URLs is for webmasters to modify their respective websites to avoid dynamic URLs, or to rewrite dynamic URLs to make them appear static so that web crawlers will crawl the entirety of their respective websites. Webmasters of websites typically desire lots of user traffic to their respective websites in order to generate advertisement revenue.
Accordingly, webmasters want web crawlers to crawl all relevant webpages on their respective websites. However, because of web crawler difficulties in handling dynamic URLs, webmasters must spend a considerable amount of time modifying their respective websites.
Therefore, there is a need to more efficiently handle dynamic URLs in order to avoid unnecessarily accessing duplicate webpages.
Many of the websites are JavaScript-heavy and generate content by doing asynchronous JavaScript calls after the page is loaded. Different JS frameworks like Angular, React.js, Vue.js are popular. The use of these frameworks makes developer life simpler and provides many benefits for creating dynamic sites.

Chapter Ends...
UNIT 3
CHAPTER 13 : Cross-Lingual and Multilingual Retrieval

Syllabus

Cross-Lingual and Multilingual Retrieval : Challenges and techniques for cross-lingual retrieval, Machine translation for IR, Multilingual representations and query translation, Evaluation techniques for Multilingual IR Systems.

13.1 CROSS-LINGUAL
RETRIEVAL OR CROSS-LINGUAL
INFORMATION RETRIEVAL AND MULTILINGUAL
RETRIEVAL OR MULTILINGUAL INFORMATION
RETRIEVAL (CROSS-LINGUAL SEARCH AND
MULTILINGUAL SEARCH)

13.1.1 Cross-Lingual Retrieval or Cross-Lingual Information Retrieval

GQ. Explain the term Cross-Lingual Retrieval.

Cross-Lingual Information Retrieval (CLIR) is a retrieval task in which search queries and candidate documents are written in different languages. CLIR can be very useful in some scenarios.
Cross-language information retrieval refers more specifically to the use case where users formulate their information need in one language and the system retrieves relevant documents in another.
IR (MU-T.Y. B.Sc.-Comp-SEM 6) (Cross-Lingual & Multi Retrieval)...Page no. (13-2)

For example, a reporter may want to search foreign language news to


obtain different perspectives for her story; an inventor may explore the
patents in another country to understand prior art.
While CLIR is concerned with retrieval for given language pairs (i.e. all the documents are given in a specific language and need to be retrieved to queries in another language).
Cross-lingual embeddings attempt to ensure that words that mean the same thing in different languages map to almost the same vector.
Various approaches can be adopted to create a cross-lingual search system. They are as follows :
1. Query Translation
2. Document Translation
1. Query Translation approach
GQ. Explain the term query translation.

In this approach, the query is translated into the language of the


document.
Many translation schemes could be possible like dictionary-based
translation or more sophisticated machine translations.
The dictionary-based approach uses a lexical resource like a bi-lingual dictionary to translate words from the source language to the target document language. This translation can be done at word level or phrase level. The main assumption in this approach is that the user can read and understand documents in the target language. In case the user is not conversant with
the target language, he/she can use some external tools to translate the
document in foreign language to his/her native language. Such tools
need not be available for all language pairs.
Translation of the query has the advantage that the computational effort, i.e., time and space, is less as compared with other methods.
Query translation has the following disadvantages :
A query does not provide enough context to automatically find the intended meaning of each term in the query.
Translation errors affect retrieval performance noticeably.



B.SC.-Comp-SEM6) (Cross-Lingual & Multi
(MU-T.Y, Retrieval)...Page no.
(13-3)
a multilingual
case of searching database, query must be translated
In
intoeach one of the languages of database.

Document translation approach


2.
Explain the term document translation.
!GQ.
most desirable translation if the purpose is
This is the to allow users to
a query in documents different
search for than their own language and
retrieve result back in their language. It does not require the user to learn
foreign Janguage to understand the documents retrieved. Hence it is a
better approach in this scenario.
There are twotypes in this approach. First, post translation or "as-and
when-needed" or "on-the-fly translation", where documents of any
other language being searched by user are translated into user language
at query time.
IR process mostly uses indexing technique to speed up the searching
process of documents. But indexing is not possible in post translation, so
this approach is infeasible because it requires more time for translation.
This approach has scalability issues. There are too many documents to
be translated and each document is quite large as compared to a query.
This makes the approach practicaly unsuitable.

13.1.2 Multilingual Retrieval or Multilingual Information Retrieval

GQ. Explain the term Multilingual Retrieval.


The term Multilingual Information Retrieval (MLIR) involves the study
of systems that accept queries for information in various languages and
return objects (text, and other media) of various languages, translated
into the user's language.

MLIR is concerned with retrieval from a document collection where


documents in multiple languages co-exist and need to be retrieved to a
query in any
language.
Multilingual embeddings are considered adequate if the embeddings work well in language A and work well in language B separately, without any guarantees about interaction between different languages.
MLIR is thus inherently more difficult than CLIR.

13.2 CHALLENGES AND TECHNIQUES FOR CROSS-LINGUAL RETRIEVAL

Cross-Lingual Information Retrieval (CLIR)


It is a subfield of information retrieval dealing with retrieving
information written in a language different from the language of the
user's query.
For example, a user may pose their query in English but retrieve relevant
documents written in French. To do so, most of CLIR Systems use
translation techniques.

13.2.1 Techniques for Cross-Lingual Retrieval


GQ Explain the techniques of Cross-Lingual Information Retrieval (CLIR).

CLIR techniques can be classified into four different categories based


on different resources:
1. Dictionary-based CLIR techniques
2. Parallel corpora based CLIR techniques
3. Comparable corpora based CLIR techniques
4. Machine translator based CLIR techniques

1. Dictionary-based
In dictionary-based query translation, the query will be processed linguistically, and only keywords are translated using Machine Readable Dictionaries (MRDs). MRDs are electronic versions of printed dictionaries, either in general domain or specific domain. Translating the query using the dictionaries is much faster and simpler than translating the documents.
Some common problems associated with dictionary-based translation:
1. Untranslatable words (like new compound words, proper names, spelling variants, and special terms) : Not every form of word used in a query is always found in the dictionary. Sometimes a problem occurs in translating different compound words (formed by combination of new words) due to the unavailability of their proper translation in the dictionary.


2. Processing inflected words : Inflected word forms are usually not found in dictionaries.
3. Lexical ambiguity in source and target languages : Relevant forms of lexical meaning for information retrieval are : 1) homonymous and 2) polysemous words.
Two words are homonymous if they have at least two different meanings and the senses of the words are unrelated. E.g. : She will park the car so we can walk in the park.
Park : action of moving a vehicle to a place, usually a car park.
Park : a public area close to nature.
4. This is the simplest technique, which literally uses a dictionary to retrieve information in other language(s) than the one used for the query.
5. Unfortunately, it has a few but very serious drawbacks, most notably the issue of words having different meanings, which raises the question of accuracy.

2. Parallel corpora based


It is a collection of texts, each of which is translated into one or more languages other than the original language.


Parallel corpora are also used to decide the relationships, such as co-occurrences, between terms of different languages.

The texts in each language are not translations of each other, but cover the same topic area, and hence contain an equivalent vocabulary.
They often contain many sentence pairs that are good translations of each other. E.g. : news feeds of CNN, BBC etc.
This is a very effective and reliable technique as all the information is retrieved from the so-called parallel corpora, which are made up of the same text that has previously been translated into two or multiple languages.
Since the translation has already been done, the technique using parallel corpora eliminates the risk of mistranslations and other problems associated with the dictionary-based technique. However, information that is retrievable from parallel corpora is obviously limited.


3. Comparable corpora based


Another commonly used technique is very similar to that using parallel
corpora.
The only difference is that the basis is comparable corpora; these contain text in multiple languages which, however, is not a translation but rather deals with the same subject. As a result, the vocabulary is more or less
the same.
4. Machine translation based
Cross-lingual IR with query translation using machine translation seems
to be an obvious choice compared to the other two above. The
advantages of using the machine translation is that it saves time while
translating large texts.
There are four different approaches to deal with machine translation:
i. Word-for-word approach,
ii. Syntactic transfer approach,
iii. Semantic transfer approach,
iv. Interlingual approach.
The goal of CLIR machine translation (MT) systems is to translate
queries from one language to another by using a context.
Many researchers criticize MT-based CLIR approach. The reasons
behind their criticisms mostly stem from the fact that the current
translation quality of MT is poor.
Another reason is that MT systems are expensive to develop, and their
application degrades the retrieval efficiency (run time performance) due
to the lengthy processing times associated with linguistic analysis.
Although it has its drawbacks, machine translation is actually very useful and quite reliable too, under the condition that it is used properly. In recent years, machine translation programmes got much more accurate than they used to be just a decade ago but, unfortunately, they are still not accurate enough to eliminate the need for human translation.


13.2.2 Challenges for Cross-Lingual Retrieval

GQ. Discuss the challenges in Cross-Lingual Information Retrieval (CLIR).

We face the following challenges in creating a CLIR system :
Translation ambiguity : While translating from the source language to the target language, more than one translation may be possible. Selecting the appropriate translation is a challenge. For example, the word maan (respect/neck) has two meanings : neck and respect.
Phrase identification and translation : Identifying phrases in limited context and translating them as a whole entity rather than as individual word translations is difficult.
Translate/transliterate a term: There are ambiguous names which need to
be transliterated instead of translated. For example, "Bhaskar"
(Sun) in Marathi refers to a person's name as well as the sun.
Detecting these cases based on the available context is a challenge.
Transliteration errors: Errors during transliteration might end up fetching
the wrong word in the target language.
the
Dictionary coverage: For translations using a bilingual dictionary, the
exhaustiveness of the dictionary is an important criterion for the performance
of the system.
Font: Many documents on the web are not in Unicode format. These
documents need to be converted to Unicode format for further
processing and storage.
1. Morphological analysis (different for different languages).
2. Out-of-Vocabulary (OOV) problems: New words get added to a
language which may not be recognized by the system.


13.3 MACHINE TRANSLATION (MT) FOR IR

GQ. Briefly explain the concept of Machine Translation (MT) for IR.
GQ. Explain the different approaches of Machine Translation in IR.

Machine Translation is one of the parts of language processing within
Computational Linguistics.
The machine-translation method translates either the document or the query
by using a machine translation system.
The main disadvantage of Machine Translation is that it is computationally
expensive. In situations where there is a large collection of documents, or
when searching for documents on the web, machine translation is
impractical.
MT systems can be classified according to their core methodology.
Under this classification, three main paradigms can be found:
1. Rule-based approach,
2. Corpus-based approach,
3. Statistical-based approach.
In the rule-based approach, human experts specify a set of rules to
describe the translation process, so an enormous amount of input
from human experts is required.
On the other hand, under the corpus-based approach the knowledge is
automatically extracted by analysing translation examples from a
parallel corpus built by human experts.
Combining the features of the two major classifications of MT systems
gave birth to the Hybrid Machine Translation Approach.

1. Rule-Based Machine Translation (RBMT)

In the field of MT, Rule-Based Machine Translation (RBMT) is the
first strategy developed. It has much to do with the morphological,
syntactic and semantic information about the source and target
language. Linguistic rules are built over this information. Also,
bilingual dictionaries with millions of entries for the language pair are used.



RBMT is able to deal with the needs of a wide variety of linguistic
phenomena and is extensible and maintainable. However,
exceptions in grammar add difficulty to the system.
The research process requires high investment.
Rule-Based Machine Translation (RBMT) is also known as
Knowledge-Based Machine Translation or the Classical Approach of MT.
Knowledge-Based Machine Translation (KBMT): KBMT does
not require total understanding, but assumes that an interpretation
engine can achieve successful translation into several languages.
KBMT must be supported by world knowledge and by linguistic
semantic knowledge about the meanings of words and their
combinations. Thus, a specific language is needed to represent the
meaning of languages. It is the knowledge base that converts the
source representation into an appropriate target representation
before synthesizing it into the target sentence.
KBMT systems provide high quality translations. Nevertheless,
they are quite expensive to produce due to the large amount of
knowledge needed to accurately represent sentences in different
languages.
Approaches of Rule-Based Machine Translation (RBMT)
Direct Translation MT, Transfer-based MT, Interlingua MT, and
Dictionary-based MT are the four different approaches that come under
the RBMT category.
Direct Translation MT

In the direct translation method, the Source Language (SL) text is
analysed structurally up to the morphological level, and the method is
designed for a specific source and target language pair.
This approach is capable of translating a language, called the Source
Language (SL), directly to another language, called the Target Language
(TL).
The performance of a direct MT system depends on the quality and
quantity of the source-target language dictionaries, morphological
analysis, text processing software, and word-by-word translation with
minor grammatical adjustments on word order and morphology.


Transfer-based MT

The transfer model belongs to the second generation of machine translation.
In this approach, the source language is transformed into an abstract, less
language-specific representation.
An equivalent representation (with the same level of abstraction) is then
generated for the target language using bilingual dictionaries and
grammar rules.
These systems have three major components:
Analysis
Analysis of the source text is done based on linguistic information such
as morphology, part-of-speech, syntax, semantics, etc. Heuristics as well
as algorithms are applied to parse the source language and derive the
syntactic structure of the text to be translated (for a language pair of the
same family; for example, Tamil and Telugu are siblings of the same
family, i.e., Dravidian languages), or the semantic structure (for a
language pair of different families, e.g., Hindi from the Indo-Aryan
family and Telugu from the Dravidian family).
Transfer
The syntactic/semantic structure of the source language is then transferred
into the syntactic/semantic structure of the target language.
Synthesis (also known as Generation)
This module replaces the constituents in the source language with their
target language equivalents. This approach, however, has a dependency on
the language pair involved.
Interlingua based MT

This is considered to belong to third generation of machine translation.


In is an inherent part of a branch called Interlinguistic.

Interlingua aims to create linguistic homogeneity across the globe.


Interlingua is a combination of two Latin words Inter and Lingua which
means between/lintermediary and language respectively.
In Inierlingua, SOurce language is transformed into an
auxiliary/intermediary Janguage (representation) which is independent

(New Syllabus w.e.f Academic Year 23-24) (BC-12) ech-Neo Publications


(MU-T.Y. B.Sc.-Comp-SEMI6) (Cross-Lingual& Multi Retrieval)...Page no. (13-11)
IR
of any of
the languages involved in the translation. The translated verse
Eor the target language IS then derived through this auxiliary
representation.
Tance only two modules 1.e., analysis and synthesis are required in
this
of system. Also, because of its independency on the language pair
for translation, this system has much relevance in multilingual machine
translation.
Dictionary-based Machine Translation
This method of translation is based on the entries of a language dictionary.
To develop the translated verse, the word's equivalent is used.
Machine-readable or electronic dictionaries are the base of the first
generation of machine translation.
To some extent this method can still translate phrases, but not
full sentences.

2. Corpus-Based Machine Translation

A corpus-based approach analyzes large document collections
(comparable or parallel corpora) to construct a statistical translation model.
To overcome the knowledge acquisition problem of rule-based
machine translation, corpus-based machine translation, also referred to
as data-driven machine translation, is an alternative approach for
machine translation.
Corpus-based MT uses, as its name indicates, a bilingual parallel corpus to
obtain knowledge for new incoming translations. A large amount of raw
data in the form of parallel corpora is used in CBMT. This raw data
contains texts and their translations.
These corpora are used for acquiring translation knowledge. The Example-
Based Machine Translation approach is one kind of corpus-based
approach.
Example-Based Machine Translation (EBMT): It is achieved by the
use of a bilingual corpus with parallel texts as its main knowledge source, in
which translation by analogy is the main idea. In an EBMT system, point-
to-point mapping is done. It takes a group of sentences in the source
language and the corresponding translation of each sentence in the
target language. These examples are used to translate similar types of
sentences from the source language to the target language.
In EBMT, there are four tasks: example acquisition, example base
management, example application and synthesis.

3. Statistical-Based Machine Translation (SMT)

The statistical approach comes under Empirical Machine Translation
(EMT) systems, which rely on large parallel aligned corpora.
Statistical machine translation is a data-oriented statistical framework
for translating text from one natural language to another based on the
knowledge and statistical models extracted from bilingual corpora.
In statistical-based MT, bilingual or multilingual textual corpora of the
source and target language or languages are required. A supervised or
unsupervised statistical machine learning algorithm is used to build
statistical tables from the corpora, and this process is called learning
or training.
The statistical tables consist of statistical information, such as the
characteristics of well-formed sentences, and the correlation between the
languages. During translation, the collected statistical information is
used to find the best translation for the input sentences, and this
translation step is called the decoding process.
The three different statistical approaches in MT are Word-based
Translation, Phrase-based Translation, and the Hierarchical phrase-
based model.
Word-based Machine Translation
As the name suggests, the words in an input sentence are translated word
by word individually, and these words are finally arranged in a specific
way to get the target sentence.
The alignment between the words in the input and output sentences
normally follows certain patterns in word-based translation. This
approach is the very first attempt at a statistical-based MT system and
is comparatively simple and efficient.
The main disadvantage of this system is the oversimplified word-by-
word translation of sentences, which may reduce the performance of the
translation system.

Phrase-based Machine Translation
A more accurate SMT approach, called phrase-based translation, was
introduced, where each source and target sentence is divided into
separate phrases instead of words before translation.
The alignment between the phrases in the input and output sentences
normally follows certain patterns, which is very similar to word-based
translation. Even though the phrase-based models result in better
performance than word-based translation, they did not improve the
modelling of sentence order patterns.
The alignment model is based on flat reordering patterns, and
experiments show that this reordering technique may perform well with
local phrase orders but not as well with long sentences and complex
orders.
Translation
Hierarchical phrase-based Machine Translation
Considering the drawbacks of the previous two methods, a
more sophisticated SMT approach was developed, called the hierarchical
phrase-based model.
The advantage of this approach is that hierarchical phrases have
recursive structures instead of simple phrases. This higher level of
abstraction further improved the accuracy of the SMT system.
Hybrid Machine Translation Approach
By taking advantage of both statistical and rule-based translation
methodologies, a new approach was developed, called the hybrid-based
approach, which has proven to have better efficiency in the area of MT
systems. At present, several governmental and private-sector MT systems
use this hybrid-based approach to develop translation from the source to
the target language, based on both rules and statistics.
The hybrid approach can be used in a number of different ways. In some
cases, translations are performed in the first stage using a rule-based
approach, followed by adjusting or correcting the output using statistical
information. In the other way, rules are used to pre-process the input
data as well as post-process the output of a statistical-based
translation system. This technique is better than the previous ones and
has more power, flexibility, and control in translation.

13.4 MULTILINGUAL DOCUMENT REPRESENTATIONS AND QUERY TRANSLATION

GQ. Explain how multilingual document representation can be done in IR.

If we put all the documents into a mixed collection, the first question is
how to distinguish words in different languages, especially for
homographs such as "but" in English and "but" in French.
We propose the following solution: to associate a language tag with every
indexing term. When a query is submitted to the system and the user
indicates the languages of interest, the original query is translated
separately into all these languages. All the translations, and the original
query, will be grouped into a large query expression covering every
language of interest.
One possible advantage of this approach is that the weights of index
terms in different languages may be more comparable, because they are
determined in the same way. Although the weights may still be
unbalanced because of the unbalanced occurrences of index terms in the
document collection, the problem is much less severe than if document
collections are processed separately.
Another advantage results from the removal of the problematic merging
step. The retrieval result naturally contains answers in different
languages. One may expect a higher effectiveness.
This approach contains the following five main steps:
1. Language identification
This step aims to identify the language of each document, so that
the document can be submitted to the appropriate language-
dependent pre-processing.
Nowadays, automatic language identification is no longer a
difficult problem. There are systems that are able to determine the
language accurately using statistical language models.
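A minimal sketch of such a statistical language identifier, using character trigram profiles; the two training samples below are invented for illustration, whereas a real system would train on large corpora.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts, padded so word boundaries are captured."""
    text = "  " + text.lower() + "  "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Tiny invented training samples; a real system would use large corpora.
SAMPLES = {
    "english": "the house is red and the garden is green",
    "french": "la maison est rouge et le jardin est vert",
}

PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}

def identify_language(text):
    """Pick the language whose trigram profile overlaps most with the text."""
    grams = char_ngrams(text)
    def overlap(lang):
        profile = PROFILES[lang]
        # Counter returns 0 for missing trigrams, so this is a safe lookup.
        return sum(min(count, profile[g]) for g, count in grams.items())
    return max(PROFILES, key=overlap)

print(identify_language("le jardin est rouge"))   # -> "french"
print(identify_language("the garden is red"))     # -> "english"
```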
2. Language-dependent preprocessing
Each document is then submitted to a language-dependent pre-
processing. This includes the following steps:
Stop words in each language are removed separately.
Each word is stemmed/lemmatized using the appropriate
stemmer/lemmatizer of the language.
Stems/lemmas are associated with the appropriate language tags,
such as _f, _e, _i, _g, and _s.
All the pre-processed documents form a new document collection,
with the words in different languages clearly distinguished by
language tags.
3. Indexing of the mixed document collection
Indexing is performed as for a monolingual document collection.
Indexing terms in different languages are weighted according to
the same weighting schema. A unique index file is created for all
the languages.
In our case, we use the SMART system for indexing and retrieval.
Terms are weighted using the following tf*idf schema:
tf(t, d) = log(freq(t, d) + 1),
idf(t) = log(N / n(t)),
where freq(t, d) is the frequency of occurrences of term t in
document d, N is the total number of documents in the mixed
document collection, and n(t) is the number of documents
containing t.
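The tf*idf schema above can be sketched directly in Python; the tiny tagged document collection here is invented for illustration, with tags _e and _f as in the text.

```python
import math

# Toy mixed collection with language-tagged terms (_e = English,
# _f = French); the documents are invented for illustration.
DOCS = {
    "d1": ["house_e", "red_e", "house_e"],
    "d2": ["maison_f", "rouge_f"],
    "d3": ["house_e", "garden_e"],
}

N = len(DOCS)  # total number of documents in the mixed collection

def tf(term, doc_id):
    # tf(t, d) = log(freq(t, d) + 1)
    return math.log(DOCS[doc_id].count(term) + 1)

def idf(term):
    # idf(t) = log(N / n(t)), where n(t) = number of docs containing t
    n_t = sum(1 for terms in DOCS.values() if term in terms)
    return math.log(N / n_t)

def tfidf(term, doc_id):
    return tf(term, doc_id) * idf(term)

print(tfidf("house_e", "d1"))  # log(3) * log(3/2)
```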
4. Query Translation
On the query side, similar processes are performed. In our case, the
original queries are in English, and the documents to be retrieved
are in English, French, Italian, German and Spanish.
An original query is translated separately into French, Italian,
German and Spanish. The translation words are stemmed and then
associated with the appropriate language tag, as for document
indexes.
All the translation words are then put together to form a unique
multilingual query, including a part corresponding to the original
query.
There is a problem of term weighting in the mixed query
expression. As the translations are made independently, the
resulting probabilities for different languages may not be
comparable.
For query translation from English to the other languages, in our
case, we use a set of statistical translation models. These models
have been trained on a set of parallel texts.
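A simplified sketch of building the unique multilingual query described above. For illustration, the statistical translation models are replaced here by small hypothetical bilingual dictionaries, and the language tags follow the _e/_f/_i/_g/_s convention from the preprocessing step.

```python
# Hypothetical English -> target-language dictionaries; a real system
# would use statistical translation models trained on parallel texts.
DICTIONARIES = {
    "_f": {"red": ["rouge"], "house": ["maison"]},   # French
    "_i": {"red": ["rosso"], "house": ["casa"]},     # Italian
    "_g": {"red": ["rot"],   "house": ["haus"]},     # German
    "_s": {"red": ["rojo"],  "house": ["casa"]},     # Spanish
}

def build_multilingual_query(english_query):
    """Tag the original English terms, then append every translation
    with its own language tag, forming one large query expression."""
    terms = english_query.lower().split()
    query = [t + "_e" for t in terms]          # original query, tagged
    for tag, dictionary in DICTIONARIES.items():
        for t in terms:
            for translation in dictionary.get(t, []):
                query.append(translation + tag)
    return query

q = build_multilingual_query("red house")
print(q)
```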
5. Retrieval
The retrieval is performed exactly in the same way as in monolingual
retrieval. The output is a list of documents in different languages.

13.5 EVALUATION TECHNIQUES FOR IR SYSTEMS

GQ. Discuss in detail evaluation techniques for IR systems.

Evaluation measures for an Information Retrieval (IR) system assess
how well an index, search engine or database returns results from a
collection of resources that satisfy a user's query.
The success of an IR system may be judged by a range of criteria
including relevance, speed, user satisfaction, usability, efficiency and
reliability.
However, the most important factor in determining a system's
effectiveness for users is the overall relevance of results retrieved in
response to a query.
Evaluation measures may be categorised as either offline or online.
Online measures
Online metrics are generally created from search logs. The metrics are
often used to determine the success of an A/B test.
Session abandonment rate
Session abandonment rate is the ratio of search sessions which do not
result in a click.

Click-through rate
Click-through rate (CTR) is the ratio of users who click on a specific
link to the number of total users who view a page, email, or advertisement. It
is commonly used to measure the success of an online advertising campaign
for a particular website, as well as the effectiveness of email campaigns.
Session success rate
Session success rate measures the ratio of user sessions that lead to a
success. Defining "success" is often dependent on context, but for search a
successful result is often measured using dwell time as a primary factor along
with secondary user interaction; for instance, the user copying the result URL
is considered a successful result, as is copy/pasting from the snippet.
Zero result rate
Zero result rate (ZRR) is the ratio of Search Engine Results Pages
(SERPs) which returned with zero results. The metric either indicates a recall
issue, or that the information being searched for is not in the index.

Offline metrics
Offline metrics are generally created from relevance judgment sessions
where the judges score the quality of the search results. Both binary
(relevant/non-relevant) and multi-level (e.g., relevance from 0 to 5) scales
can be used to score each document returned in response to a query. In
practice, queries may be ill-posed, and there may be different shades of
relevance.
Precision
Precision is the fraction of the documents retrieved that are relevant to
the user's information need.

precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|

In binary classification, precision is analogous to positive predictive
value. Precision takes all retrieved documents into account. It can also
be evaluated considering only the topmost results returned by the system
using Precision@k.
Recall
Recall is the fraction of the documents relevant to the query that are
successfully retrieved.

recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|

In binary classification, recall is often called sensitivity. So, it can be
looked at as the probability that a relevant document is retrieved by the
query. It is trivial to achieve recall of 100% by returning all documents
in response to any query. Therefore, recall alone is not enough; one
also needs to measure the number of non-relevant documents, for
example by computing the precision.
Fall-out
Fall-out is the proportion of non-relevant documents that are retrieved, out of all
non-relevant documents available:

fall-out = |{non-relevant documents} ∩ {retrieved documents}| / |{non-relevant documents}|

In binary classification, fall-out is closely related to specificity and is
equal to (1 − specificity). It can be looked at as the probability that a non-
relevant document is retrieved by the query.
It is trivial to achieve fall-out of 0% by returning zero documents in
response to any query.
F-score / F-measure
The weighted harmonic mean of precision and recall, the traditional F-
measure or balanced F-score, is:

F = 2 · (precision · recall) / (precision + recall)

This is also known as the F1 measure, because recall and precision are
evenly weighted. The general formula for a non-negative real β is:

Fβ = (1 + β²) · (precision · recall) / (β² · precision + recall)

Two other commonly used F measures are the F2 measure, which
weights recall twice as much as precision, and the F0.5 measure, which
weights precision twice as much as recall.
The F-measure was derived so that Fβ measures the effectiveness of
retrieval with respect to a user who attaches β times as much importance
to recall as to precision. It is based on van Rijsbergen's effectiveness
measure:

E = 1 − 1 / (α/P + (1 − α)/R)

Their relationship is Fβ = 1 − E, where α = 1 / (1 + β²).
F-measure can be a better single metric when compared to precision and
recall: both precision and recall give different information that can
complement each other when combined. If one of them excels more than
the other, F-measure will reflect it.
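A small sketch of the general Fβ formula above; the precision/recall values are invented.

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.4, 0.5
print(f_measure(p, r))             # balanced F1
print(f_measure(p, r, beta=2))     # F2: weights recall more
print(f_measure(p, r, beta=0.5))   # F0.5: weights precision more
```

Since recall exceeds precision in this example, the recall-weighted F2 comes out highest and the precision-weighted F0.5 lowest, illustrating how the choice of β shifts the single score.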

Average precision
Precision and recall are single-value metrics based on the whole list of
documents returned by the system. For systems that return a ranked
sequence of documents, it is desirable to also consider the order in
which the returned documents are presented. By computing a precision
and recall at every position in the ranked sequence of documents, one
can plot a precision-recall curve, plotting precision p(r) as a function of
recall r. Average precision computes the average value of p(r) over the
interval from r = 0 to r = 1:

AveP = ∫ p(r) dr, integrated over r from 0 to 1

That is the area under the precision-recall curve. This integral is in
practice replaced with a finite sum over every position in the ranked
sequence of documents:

AveP = Σ P(k) · Δr(k), summed over k = 1 to n

where k is the rank in the sequence of retrieved documents, n is the
number of retrieved documents, P(k) is the precision at cut-off k in the
list, and Δr(k) is the change in recall from item k − 1 to k.
This finite sum is equivalent to:

AveP = (Σ P(k) × rel(k), summed over k = 1 to n) / (total number of relevant documents)

where rel(k) is an indicator function equalling 1 if the item at rank k is a
relevant document, and zero otherwise.
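The finite-sum form of average precision translates directly into code; the ranking and relevance set below are invented.

```python
def average_precision(ranking, relevant):
    """AveP = sum over k of P(k) * rel(k), divided by |relevant|."""
    hits = 0
    precision_sum = 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k   # P(k) at each relevant rank
    return precision_sum / len(relevant)

ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d5"}
print(average_precision(ranking, relevant))  # (1 + 2/3 + 3/5) / 3
```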
Precision at k
For modern information retrieval, recall is no longer a meaningful
metric, as many queries have thousands of relevant documents and few
users will be interested in reading all of them.
Precision at k documents (P@k) is still a useful metric (e.g., P@10 or
"Precision at 10" corresponds to the number of relevant results among
the top 10 retrieved documents), but it fails to take into account the
positions of the relevant documents among the top k.
Another shortcoming is that, on a query with fewer relevant results than
k, even a perfect system will have a score less than 1. It is easier to score
manually, since only the top k results need to be examined to determine
if they are relevant or not.
R-precision

R-precision requires knowing all documents that are relevant to a query.
The number of relevant documents, R, is used as the cutoff for
calculation, and this varies from query to query. For example, if there
are 15 documents relevant to "red" in a corpus (R = 15), R-precision for
"red" looks at the top 15 documents returned, counts the number that are
relevant, and turns that into a relevancy fraction: r/R = r/15.
Note that R-precision is equivalent to both the precision at the R-th
position (P@R) and the recall at the R-th position. Empirically, this
measure is often highly correlated with mean average precision.
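P@k and R-precision can be sketched as follows; the example data are invented.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranking[:k]
    return sum(1 for d in top_k if d in relevant) / k

def r_precision(ranking, relevant):
    """Precision at rank R, where R = number of relevant documents."""
    return precision_at_k(ranking, relevant, len(relevant))

ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]
relevant = {"d1", "d3", "d6", "d9"}   # R = 4 (d9 was never retrieved)
print(precision_at_k(ranking, relevant, 3))  # 2/3
print(r_precision(ranking, relevant))        # P@4 = 2/4 = 0.5
```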

Mean average precision

Mean average precision (MAP) for a set of queries is the mean of the
average precision scores for each query:

MAP = (Σ AveP(q), summed over q = 1 to Q) / Q

where Q is the number of queries.

Discounted Cumulative Gain (DCG)

DCG uses a graded relevance scale of documents from the result set to
evaluate the usefulness, or gain, of a document based on its position in
the result list. The premise of DCG is that highly relevant documents
appearing lower in a search result list should be penalized, as the graded
relevance value is reduced logarithmically proportional to the position of
the result.
The DCG accumulated at a particular rank position p is defined as:

DCGp = Σ rel_i / log₂(i + 1), summed over i = 1 to p

Since result sets may vary in size among different queries or systems, to
compare performances the normalised version of DCG uses an ideal
DCG. To this end, it sorts the documents of the result list by relevance,
producing an ideal DCG at position p (IDCGp), which normalizes the
score:

nDCGp = DCGp / IDCGp

The nDCG values for all queries can be averaged to obtain a measure of
the average performance of a ranking algorithm. Note that for a perfect
ranking algorithm, the DCGp will be the same as the IDCGp, producing
an nDCG of 1.0. All nDCG calculations are then relative values on the
interval 0.0 to 1.0, and so are cross-query comparable.
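The DCG and nDCG formulas above can be sketched as follows; the graded judgments are invented.

```python
import math

def dcg(relevances):
    """DCG_p = sum over i of rel_i / log2(i + 1), with i starting at 1."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """Normalise by the ideal DCG (the list sorted by relevance)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

graded = [3, 2, 3, 0, 1]   # invented graded judgments for one ranking
print(ndcg(graded))                        # between 0.0 and 1.0
print(ndcg(sorted(graded, reverse=True)))  # a perfect ranking gives 1.0
```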
Chapter Ends...
UNIT 3

CHAPTER 14 : User-based Evaluation

Syllabus

User-based evaluation : user studies, surveys, Test collections and
benchmarking, Online evaluation methods : A/B testing, interleaving
experiments.

14.1 USER-BASED EVALUATION

GQ. Explain the concept of user-based evaluation in IR.
Evaluatíon is highly impotant for designing, developing and
as
maíntaining effective information retrieval or search systems it allows
the measurement of how successfully an ínformation retrieval systern
meets its goal of helping users fulfil their informatíon needs.
The success of an IR system may be judged by a range of criteria
including relevance, speed, user satisfaction, usability, efficiency and
reliability. However, the most important factor in determining a system's
effectiveness for users is the overall relevance of results retrieved in
response to a query.
User-based evaluation is evaluation through user participation, that is,
evaluation that involves the people for whom the system is intended: the
uSers.
User-based evaluation techniques include: experimental methods,
observational methods, questionnaires, interviews, and physiological
monitoring methods.
The most common user-based evaluation methods are user studies and
surveys.
Information needs are essentially useful in evaluating the kind of
services in existence.
Any information system would definitely require identification of user
requirements.
14.1.1 User Studies

User studies that are primarily designed to explicate user behaviour
and search experiences usually focus on a slice or segment of the search
process (e.g., query formulation, search result examination, judgment of
document relevance) and seek to control other contextual elements of
search as much as possible (e.g., document relevance, task complexity).
The user studies which fall into this category are generally more
interested in explaining or predicting some of the elements associated
with user behaviour and interaction experience, rather than
demonstrating the goodness of a particular system or interface
component.
User studies under this category often examine the variations in some of
the search contextual features (e.g., task types, stage of search, topical
and domain knowledge, number of relevant results on the ranked
result lists) and study how users' behavioural signals change when these
contextual features vary.
User studies are sometimes referred to as user research or user testing.

14.1.2 Surveys

This involves questioning users and obtaining answers directly from
users about their behaviour, attributes, values, conditions and/or
preferences.
This is by far the most frequently used method in user studies. It at times
also leads to somewhat biased results.
Surveys can be both qualitative and quantitative, based on the format of
the questions used.

14.2 TEST COLLECTIONS AND BENCHMARKING

A test collection usually consists of a document collection, a set of


topics that describe a user's information need and a set of relevance
judgments indicating which documents in the collection are relevant to
each topic.
When constructing a test collection there are typically a number of
practical issues that must be addressed. By modifying the components of
a test
collection and evaluation measures used, different retrieval
problems and domains can be simulated.
The original and most common problem modelled is ad hoc retrieval:
the situation in which an information retrieval system is presented witha
previously unseen query.
Test collection-based evaluations have also been carried out on tasks
including question answering, information filtering, text summarization,
topic detection and tracking, and image and video retrieval.
Test collection-based evaluation is highly popular as a method for
developing retrieval strategies.
Benchmarks can be used by multiple researchers to evaluate in a
standardised manner and with the same experimental set up, thereby
enabling the comparison of results.
User-oriented evaluation, although highly beneficial, is costly and
complex and often difficult to replicate.
IR systems index documents that are retrieved in response to users'
queries. A test collection must contain a static set of documents that
should reflect the kinds of documents likely to be found in the
operational setting or domain. This might involve digital library
collections or sets of Web pages; texts or multimedia items (e.g., images
and videos). The notion of a static document collection is important as it
ensures that results can be reproduced upon re-use of the test collection.
For each topic in the test collection, a set of relevance judgments must
be created indicating which documents in the collection are relevant to
each topic.

Relevance judgments can be binary (relevant or not relevant) or use
graded relevance, e.g., highly relevant, partially relevant or
non-relevant.

14.3 ONLINE EVALUATION METHODS : A/B TESTING, INTERLEAVING EXPERIMENTS

Online evaluation is based on implicit measurement of real users'
experience of an IR system. Implicit measurement is a by-product of
users' natural interaction, such as clicks or dwell time.
Online evaluation uses a specific set of tools and methods
complementary to other evaluation approaches utilized in academic and
industry research settings.
Compared to offline evaluation (which uses human relevance judgments),
online evaluation is more realistic, as it addresses questions about
actual users' experience with an IR system.
Online evaluations are carried out in controlled experiments on user
metrics. These experiments can be categorized depending on how we
define the quality (effectiveness) and at what granularity level we
measure it.
In terms of quality, experimental approaches are divided into two types:
absolute and relative.
1. In an absolute quality experiment, one is interested in measuring
the performance of a single IR system, while in a relative quality
experiment two IR systems are compared, which makes it more
challenging to draw general conclusions over time.
2. A relative evaluation could be when we compare the heights of two
trees using one metric. Relative evaluation can be challenging, as
transitivity of the performance comparisons is sometimes not
straightforward.
Absolute online evaluation is usually carried out with A/B testing, a
user experience research methodology that runs a randomized
experiment with two variants, 'A' and 'B', of the same application. It
determines which variant drives more user conversions. Different
segments of users are selected for the experimentation.


Relative online evaluation uses interleaving comparison, which is a
popular technique for evaluating IR systems based on implicit user
feedback. The basic idea of the different variations of the interleaving
approach is to do paired online comparisons of two rankings. This is
done by merging the two rankings into one interleaved ranking and
presenting it to the user in an interactive way. The goal of this technique
is to be fair and unbiased in interpreting user click data, as well as
comparison judgments. Interleaving also aims to eliminate the post-hoc
interpretation of observational data.
Challenges with online evaluation
Relevance : The user feedback is implicit and not fully representative of
user behaviour. For instance, user clicks are not relevance scores
(although they are correlated). Therefore, it is challenging to link online
metrics to user satisfaction or relevance.
Biases : Factors such as the position of documents on the result page can
affect user behaviour, leading to biased user feedback, such as clicked
documents.
Experiment effects : How to balance experimenting with a single
ranker versus exploring other rankers?
Reusability : Unlike labelled data used for offline evaluation, collected
online data cannot be confidently re-used for evaluating other rankers.

14.3.1 A/B Testing

GQ. Explain in detail about A/B testing in IR.



A/B testing (also known as split testing or bucket testing) is a
methodology for comparing two versions of a webpage or app against
each other to determine which one performs better.
A/B testing is essentially an experiment where two or more variants of a
page are shown to users at random, and statistical analysis is used to
determine which variation performs better for a given conversion goal.
In A/B testing, we show most users the normal system (system A) but
show a small randomly-selected group of users a test system (system B).
This is commonly used to test interface changes, ranking changes, etc.

A/B testing has been the predominant method for online evaluation. It
shows candidate IR systems to randomized groups of users and
compares user feedback metrics such as clicks or views.
Despite being straightforward, A/B testing can slow down innovation,
as tests require weeks and high user traffic to reach statistically
significant conclusions. This happens when user metrics are noisy
(e.g., clicks, views) and sparse (e.g., streams, purchases), and when IR
systems grow mature with smaller-effect innovations.

Working of A/B testing

In an A/B test, you take a webpage or app screen and modify it to create
a second version of the same page. This change can be as simple as a
single headline or button, or be a complete redesign of the page.
Then, half of your traffic is shown the original version of the page
(known as the control, or A) and half are shown the modified version of the
page (the variation, or B).


As visitors are served either the control or the variation, their engagement
with each experience is measured, collected in a dashboard and
analysed through a statistical engine. You can then determine whether
changing the experience (variation, or B) had a positive, negative or
neutral effect against the baseline (control, or A).
Advantages of A/B testing
1. With A/B testing, you run alternative indices or searches in parallel,
capturing click and conversion events to compare effectiveness.
2. You make small incremental changes to your main index or search and
have those changes tested live and transparently by your users - before
making them official.
3. A/B testing goes directly to an essential source of information - your
users - by including them in the decision-making process, in the most
reliable and least burdensome way.
4. These tests are widely used in the industry to measure the usability and
effectiveness of a website.
5. Website A/B testing provides a great way to quantitatively determine the
tactics that work best with visitors to your website.
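The statistical analysis step described above can be sketched as a two-proportion z-test on the conversion counts collected for the control and the variation. This is a minimal illustration, not the book's prescribed procedure; the traffic and conversion numbers are invented:

```python
import math

def ab_test_result(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Two-proportion z-test on conversion counts from the control (A)
    and the variation (B); z_crit = 1.96 gives a ~95% confidence level."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    if z > z_crit:
        return z, "variation B performs better"
    if z < -z_crit:
        return z, "control A performs better"
    return z, "no statistically significant difference"

# Hypothetical 50/50 split: 10,000 users per bucket.
z, verdict = ab_test_result(conv_a=500, n_a=10_000, conv_b=570, n_b=10_000)
print(round(z, 2), "->", verdict)   # 2.2 -> variation B performs better
```

A dashboard's "statistical engine" typically wraps exactly this kind of test, possibly with corrections for repeated looks at the data.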


A/B testing allows individuals, teams and companies to make careful
changes to their user experiences while collecting data on the impact it
makes. This allows them to construct hypotheses and to learn which
elements and optimizations of their experiences impact user behaviour
the most.
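To see why A/B tests can require weeks of high user traffic, a standard power calculation for a two-proportion test gives a rough estimate of the users needed per variant. The baseline rate and lift values below are illustrative assumptions, not figures from the text:

```python
import math

def samples_per_variant(p_base, min_effect, z_alpha=1.96, z_beta=0.84):
    """Approximate users needed per variant to detect an absolute lift of
    `min_effect` over a baseline conversion rate `p_base` with a two-sided
    two-proportion z-test (defaults: alpha = 0.05, power = 0.80)."""
    p_new = p_base + min_effect
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / min_effect ** 2)

# Detecting a small lift on a 5% click-through rate takes far more
# traffic than detecting a large one:
print(samples_per_variant(0.05, 0.005))  # ~31,000 users per bucket
print(samples_per_variant(0.05, 0.02))   # ~2,200 users per bucket
```

The required traffic grows roughly with the inverse square of the effect size, which is why mature systems with small-effect innovations need such long experiments.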

14.3.2 Interleaving Experiments

GQ. Explain in detail about Interleaving experiments in IR.

Interleaving is a paired test that evaluates user preference between two
IR systems. First, zip the ranking results from the two compared systems
into one combined list to present to all users. Then, attribute user
engagement credit on the interleaved list back to the compared systems,
and decide the winner as the one that receives more credit (through a
statistical test).
Interleaving emerges as a more sensitive online testing method to free
up experimentation bandwidth and expedite innovations.
Instead of presenting separate users with control and treatment results,
interleaving merges their results into a single interleaved result and
presents it to all users.
User actions on the interleaved result are attributed back to the two IR
systems being compared, and the better one is whichever received
(statistically significantly) more attributed actions.
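The merge-and-credit mechanism can be sketched with a simplified team-draft variant of interleaving. This is only an illustrative sketch of one interleaving scheme; the rankings, document IDs and clicks are invented:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, seed=42):
    """Simplified team-draft interleaving: the two systems take turns
    'picking' their highest-ranked document not yet in the merged list;
    a coin flip breaks ties in who picks next. Returns the interleaved
    ranking and a map recording which system contributed each document."""
    rng = random.Random(seed)
    interleaved, team_of = [], {}
    picks = {"A": 0, "B": 0}
    pool = {"A": ranking_a, "B": ranking_b}
    while True:
        # The team with fewer picks goes next; coin flip on a tie.
        team = "A" if (picks["A"] < picks["B"] or
                       (picks["A"] == picks["B"] and rng.random() < 0.5)) else "B"
        doc = next((d for d in pool[team] if d not in team_of), None)
        if doc is None:  # that system is exhausted; try the other one
            team = "B" if team == "A" else "A"
            doc = next((d for d in pool[team] if d not in team_of), None)
            if doc is None:
                return interleaved, team_of
        team_of[doc] = team
        interleaved.append(doc)
        picks[team] += 1

def credit_clicks(clicked_docs, team_of):
    """Attribute each click back to the system whose pick was clicked."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        wins[team_of[doc]] += 1
    return wins

ranking_a = ["d1", "d2", "d3"]   # hypothetical output of system A
ranking_b = ["d3", "d4", "d5"]   # hypothetical output of system B
merged, team_of = team_draft_interleave(ranking_a, ranking_b)
print(merged)                    # all five documents, interleaved
print(credit_clicks(["d3", "d4"], team_of))
```

Whichever system accumulates significantly more click credit across many users and queries is declared the winner.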
However, practical challenges have limited the applicability of interleaving
in IR and search systems. Since it is a paired test that directly evaluates
user preference between two candidate systems, interleaving measures
user feedback metrics in the presence of both systems, which is not the same
as A/B testing, where absolute metrics are measured on each individual
system. This means the raw results cannot directly order multiple
(more than two) systems.
Chapter Ends.