Information Retrieval
T.Y. B.Sc. Computer Science (Semester 6), University of Mumbai
Choice Based Credit System (CBCS), with effect from Academic Year 2023-24
Tech-Neo Publications
Table of Contents
Unit 1
Unit 2
CHAPTER 7 : Text Clustering for Information Retrieval 7-1 to 7-13
7.1 Clustering Techniques ... 7-1
7.1.1 K-means Clustering ... 7-3
7.1.2 Hierarchical Clustering ... 7-5
7.2 Evaluation of Clustering Results ... 7-10
7.3 Clustering for Query Expansion and Result Grouping ... 7-12
Chapter Ends ... 7-13
Unit 3
CHAPTER 11 : Crawling and Near-Duplicate Page Detection 11-1 to 11-11
11.1 Web Page Crawling Techniques : Breadth-First, Depth-First ... 11-1
11.1.1 Breadth-First ... 11-2
11.1.2 Depth-First ... 11-3
11.2 Focused Crawling ... 11-4
11.3 Near Duplicate Detection Algorithm ... 11-7
11.4 Handling Dynamic Web Content During Crawling ... 11-10
Chapter Ends ... 11-11
document where the title contains Java and the body contains threading.
The field of information retrieval also covers supporting users in
browsing or filtering document collections or further processing a set of
retrieved documents.
GQ. (5 Marks)
(Figure : components of an information retrieval system - the user, the information need, the query, the index, the documents, and the results.)
To express his or her information need, the user constructs and issues a query to the IR system.
Typically, this query consists of a small number of terms, with two to three terms being typical for a Web search. We use "term" instead of "word", because a query term may in fact not be a word at all.
Depending on the information need, a query term may be a date, a number, a musical note, or a phrase. Wildcard operators and other partial-match operators may also be permitted in query terms. For example, the term "inform*" might match any word starting with that prefix ("inform", "informs", "informal", "informant", "informative", etc.).
The user's query is processed by a search engine, which may be
running on the user's local machine, on a large cluster of machines in a
remote geographic location, or anywhere in between. A major task of a
search engine is to maintain and manipulate an inverted index for a
document collection. This index forms the principal data structure used
by the engine for searching and relevance ranking.
To support relevance ranking algorithms, the search engine maintains
collection statistics associated with the index, such as the number of
documents containing each term and the length of each document. In
addition, the search engine usually has access to the original content of
the documents, in order to report meaningful results back to the user.
Using the inverted index, collection statistics, and other data, the search
engine accepts queries from its users, processes these queries, and
returns ranked lists of results.
To perform relevance ranking, the search engine computes a score,
sometimes called a retrieval status value (RSV), for each document.
After sorting documents according to their scores, the result list may be
subjected to further processing, such as the removal of duplicate or
redundant results.
For example, a Web search engine might report only one or two results from a single host or domain, eliminating the others in favor of pages from different sources. The problem of scoring documents with respect to a user's query is one of the most fundamental in the field.
Syllabus
Index terms are often alphabetized like a traditional book index, but they need not be, since they are often found directly using a hash table.
Each index term has its own inverted list that holds the relevant data for that term.
2.2.1.2 Merging
The classic way to solve the memory problem in the previous example is by merging.
We can build the inverted list structure I until memory runs out. When that happens, we write the partial index I to disk, then start making a new one. At the end of this process, the disk is filled with many partial indexes, I1, I2, I3, ..., In.
The system then merges these files into a single result. By definition, it is not possible to hold even two of the partial index files in memory at one time, so the input files need to be carefully designed so that they can be merged in small pieces.
One way to do this is to store the partial indexes in alphabetical order. It is then possible for a merge algorithm to merge the partial indexes using very little memory.
Fig. 2.2.1 : An example of index merging. The first and second indexes are merged together to produce the combined index.
2.2.1.4 MapReduce
The MapReduce framework isn't quite so strict with its definitions : both Mappers and Reducers can return an arbitrary number of items. However, the general idea is the same.
Input → Map → Shuffle → Reduce → Output
The next step is the shuffle, which the library performs by itself. This operation uses a hash function so that all pairs with the same key end up next to each other and on the same machine.
The final step is the reduce stage, where the records are processed again, but this time in batches, meaning all pairs with the same key are processed at once.
Compression techniques are the most powerful tool for managing the
memory hierarchy. The inverted lists for a large collection are
themselves very large.
In fact, when it includes information about word position and document extents, the index can be comparable in size to the document collection.
Compression allows the same inverted list data to be stored in less
space.
The obvious benefit is that this could reduce disk or memory
requirements, which would save money.
More importantly, compression allows data to move up the memory
hierarchy. If index data is compressed by a factor of four, we can store
four times more useful data in the processor cache, and we can feed data
to the processor four times faster.
If parts of the dictionary are on disk, then many more disk seeks are necessary in query evaluation.
Thus, the main goal of compressing the dictionary is to fit it in main memory, or at least a large portion of it, to support high query throughput.
Although the dictionaries of very large collections fit into the memory of a standard desktop machine, this is not true of many other application scenarios.
For example, an enterprise search server for a large corporation may have to index a multi-terabyte collection with a comparatively large vocabulary because of the presence of documents in many different languages.
We also want to be able to design search systems for limited hardware such as mobile phones and onboard computers.
Other reasons for wanting to conserve memory are fast startup time and having to share resources with other applications.
One of the simplest codes is the unary code. You are probably familiar with binary, which encodes numbers with two symbols, typically 0 and 1.

Number   Unary code
0        0
1        10
2        110
3        1110
4        11110
5        111110
This code is very efficient for small numbers such as 0 and 1, but
quickly becomes very expensive.
For instance, the number 1023 can be represented in 10 binary bits, but requires 1024 bits to represent in unary code. Now we know about two kinds of numeric encodings.
Unary is convenient because it is compact for small numbers and is
inherently unambiguous.
Binary is a better choice for large numbers, but it is not inherently
unambiguous.
A reasonable compression scheme needs to encode frequent numbers
with fewer bits than infrequent numbers, which means binary encoding
is not useful on its own for compression.
Variable byte (VB) encoding uses an integral number of bytes to encode a gap.
The last 7 bits of a byte are "payload" and encode part of the gap.
The first bit of the byte is a continuation bit. It is set to 1 for the last byte of the encoded gap and to 0 otherwise.
To decode a variable byte code, we read a sequence of bytes with continuation bit 0 terminated by a byte with continuation bit 1.
We then extract and concatenate the 7-bit parts.
The main advantage of variable byte codes is decoding speed : we just need to read one byte at a time until we find a byte whose continuation bit is set to 1.
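A small Python sketch of the scheme just described (7 payload bits per byte, continuation bit set to 1 on the last byte of each gap) is given below; it is an illustrative implementation, not code from the text.

def vb_encode_number(n):
    # Encode one gap: 7 payload bits per byte, high bit marks the last byte.
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                    # set continuation bit on the last byte
    return bytes(out)

def vb_encode(gaps):
    return b"".join(vb_encode_number(g) for g in gaps)

def vb_decode(data):
    # Decode a byte string back into the list of gaps.
    gaps, n = [], 0
    for byte in data:
        if byte < 128:                # continuation bit 0: more bytes follow
            n = 128 * n + byte
        else:                         # continuation bit 1: last byte of this gap
            n = 128 * n + (byte - 128)
            gaps.append(n)
            n = 0
    return gaps

print(vb_decode(vb_encode([824, 5, 214577])))   # [824, 5, 214577]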
2.3.2 Term Weighting
GQ. Explain term weighting w.r.t. document indexing.
Term weighting is a procedure that takes place during the text indexing
process in order to assess the value of each term to the document.
Term weighting is the assignment of numerical values to terms that
represent their importance in a document in order to improve retrieval
effectiveness.
Essentially it considers the relative importance of individual words in an
information retrieval system, which can improve system effectiveness,
since not all the terms in a given document collection are of equal
importance.
Index term weights reflect the relative importance of words in documents, and are used in computing scores for ranking.
The specific form of a weight is determined by the retrieval model. The
weighting component calculates weights using the document statistics
and stores them in lookup tables.
Weighting the terms is the means that enables the retrieval system to determine the importance of a given term in a certain document or a query.
It is a crucial component of any information retrieval system, a
component that has shown great potential for improving the retrieval
effectiveness of an information retrieval system.
Each term in a document is assigned a weight that depends on the number of occurrences of the term in the document. The simplest scheme assigns the weight to be equal to the number of occurrences of term t in document d.
idf_t = log(N / df_t), where N is the total number of documents in the collection and df_t is the number of documents containing term t.
Tf-idf weighting
We now combine the definitions of term frequency and inverse document frequency, to produce a composite weight for each term in each document.
The tf-idf weighting scheme assigns to term t a weight in document d given by
tf-idf_{t,d} = tf_{t,d} x idf_t
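The following Python sketch computes these weights for a tiny invented corpus, using raw term counts for tf and idf_t = log(N/df_t) as above; it is only a toy illustration.

import math
from collections import Counter

docs = {
    "d1": "information retrieval ranks documents",
    "d2": "retrieval of relevant documents",
    "d3": "term weighting improves retrieval",
}

# Term frequencies per document and document frequencies per term.
tf = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(term for counts in tf.values() for term in counts)
N = len(docs)

# tf-idf_{t,d} = tf_{t,d} * idf_t, with idf_t = log(N / df_t)
tfidf = {
    d: {t: freq * math.log(N / df[t]) for t, freq in counts.items()}
    for d, counts in tf.items()
}

print(round(tfidf["d1"]["information"], 3))   # only d1 contains "information"
print(tfidf["d1"]["retrieval"])               # 0.0: the term appears in every document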
Chapter Ends...
UNIT 1
Retrieval Models
CHAPTER 3
Syllabus
Retrieval Models : Boolean model : Boolean operators, query processing; Vector space model : TF-IDF, cosine similarity, query document matching; Probabilistic model : Bayesian retrieval, relevance feedback.
Apart from the implicit Boolean filters applied by Web search engines,
explicit support for Boolean queries is important in specific application
areas such as digital libraries and the legal dornain.
In contrast to ranked retrieval, Boolean retrieval returns sets of documents rather than ranked lists. Under the Boolean retrieval model, a term t is considered to specify the set of documents containing it.
The standard Boolean operators (AND, OR, and NOT) are used to
construct Boolean queries, which are interpreted as operations over
these sets, as follows:
A AND B : intersection of A and B (A ∩ B)
A OR B : union of A and B (A ∪ B)
NOT A : complement of A with respect to the document collection
GQ. Explain query processing in IR.
rankBM25_DocumentAtATime (⟨t1, ..., tn⟩, k) :
    d ← min over 1 ≤ i ≤ n of nextDoc(ti, d)   (advance to the next candidate document)
    ... (compute the score of each candidate document and store it in results) ...
    sort results[0..(m - 1)] in decreasing order of score
    return results[0..(k - 1)]
The matching documents and their scores are accumulated in the results array.
3.3.3 Term-at-a-Time Query Processing
As an alternative to the document-at-a-time approach, some search engines process queries in a term-at-a-time fashion.
Instead of merging the query terms' postings lists by using a heap, the search engine examines, in turn, all (or some) of the postings for each query term. It maintains a set of document score accumulators.
For each posting inspected, it identifies the corresponding accumulator and updates its value according to the posting's score contribution to the respective document.
When all query terms have been processed, the accumulators contain the final scores of all matching documents, and a heap may be used to collect the top k search results.
One of the motivations behind the term-at-a-time approach is that the
index is stored on disk and that the query terms' postings lists may be
too large to be loaded into memory in their entirety.
In that situation a document-at-a-time implementation would need to
jump back and forth between the query terms' postings lists, reading a
small number of postings into memory after each such jump, and
incurring the cost of a nonsequential disk access (disk seek).
For short queries, containing two or three terms, this may not be a problem, as we can keep the number of disk seeks low by allocating an appropriately sized read-ahead buffer for each postings list.
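The Python sketch below illustrates term-at-a-time scoring with accumulators; the postings and their per-document score contributions are invented placeholders (a real engine would compute them with a model such as BM25).

import heapq
from collections import defaultdict

def term_at_a_time(query_terms, postings, k):
    # postings: dict mapping a term to a list of (doc_id, score_contribution) pairs.
    accumulators = defaultdict(float)            # doc_id -> partial score
    for term in query_terms:
        for doc_id, contribution in postings.get(term, []):
            accumulators[doc_id] += contribution
    # A heap collects the top-k results once all terms are processed.
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])

postings = {
    "william":     [(1, 1.2), (4, 0.7)],
    "shakespeare": [(1, 2.0), (2, 1.1)],
    "marriage":    [(2, 0.9), (4, 1.5)],
}
print(term_at_a_time(["william", "shakespeare", "marriage"], postings, k=2))
# e.g. [(1, 3.2), (4, 2.2)]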
The vector space model is one of the oldest and best known of the
information retrieval models.
The vector space model is intimately associated with the field as a whole
and has been adapted to many IR problems beyond ranked retrieval,
including document clustering and classification, in which it continues
to play an important role.
In recent years, the vector space model has been largely overshadowed
by probabilistic models, language models, and machine learning
approaches.
Naturally, for a collection of even modest size, this vector space model
produces vectors with millions of dimensions.
This high-dimensionality might appear inefficient at first glance, but in
many circumstances the query vector is sparse, with all but a few
components being zero.
For example, the vector corresponding to the query "william",
"shakespeare", "marriage" has only three nonzero components.
To compute the length of this vector, or its dot product with a document
vector, we need only consider the components corresponding to these
three terms.
On the other hand, a document vector typically has a nonzero
component for each unique term contained in the document, which may
consist of thousands of terms. However, the length of a document vector
is independent of the query.
The IDF functions typically relate the document frequency to the total
number of documents in the collection (N).
The basic intuition behind the IDF functions is that a term appearing in many documents should be assigned a lower weight than a term appearing in few documents. Of the two functions, IDF comes closer to having a "standard form" :
IDF_t = log(N / N_t)
where N_t is the number of documents containing term t.
The first one, ranked retrieval, allows the search engine to rank search
results according to their predicted relevance to the query. The second
one, lightweight structure, is a natural extension of the Boolean model to
the sub-document level.
Instead of restricting the search process to entire documents, it allows
the user to search for arbitrary text passages satisfying Boolean-like
constraints (e.g., "show me all passages that contain 'apothecary' and
'drugs' within 10 words").
Cosine similarity
GQ. What do you mean by Cosine Similarity in the Vector Space Model ?
Cosine similarity measures the similarity between two vectors as the cosine of the angle between the two vectors.
Cosine Similarity in a Vector Space Model
In a VSM, cosine similarity is crucial for information retrieval and ranking.
In other words, documents "closer" to the query are more likely to be
relevant. This is primarily a model of topical relevance, although
features related to user relevance could be incorporated into the vector
representation.
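A minimal Python sketch of cosine similarity over sparse term-weight vectors is given below; the example vectors are invented for illustration.

import math

def cosine_similarity(query_vec, doc_vec):
    # Vectors are dicts mapping term -> weight; only shared terms contribute
    # to the dot product.
    dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
    q_len = math.sqrt(sum(w * w for w in query_vec.values()))
    d_len = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_len == 0 or d_len == 0:
        return 0.0
    return dot / (q_len * d_len)

query = {"william": 1.0, "shakespeare": 1.0, "marriage": 1.0}
doc   = {"william": 0.4, "shakespeare": 0.9, "play": 0.7, "marriage": 0.2}
print(round(cosine_similarity(query, doc), 3))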
Relevance feedback is a technique for query modification based on user-identified relevant documents.
This technique was first introduced using the vector space model. The
well-known Rocchio algorithm was based on the concept of an optimal
query, which maximizes the difference between the average vector
representing the relevant documents and the average vector representing
the non-relevant documents.
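The Rocchio update is usually written as follows (standard formulation, stated here for reference; the symbols are not defined elsewhere in this text): q_0 is the original query vector, D_r and D_nr are the sets of known relevant and non-relevant document vectors, and alpha, beta, gamma are tuning weights.

\vec{q}_m \;=\; \alpha\,\vec{q}_0 \;+\; \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j \;-\; \frac{\gamma}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j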
3.5 PROBABILISTIC MODEL
Bayesian retrieval
In any retrieval model that assumes relevance is binary, there will be
two sets of documents, the relevant documents and the non-relevant
documents, for each query.
Fig. 3.5.1 : Classifying a document as relevant or non-relevant - a document D is assigned to the set of relevant documents with probability P(R|D) and to the set of non-relevant documents with probability P(NR|D).
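In this setting a document D is classified as relevant when P(R|D) > P(NR|D); by Bayes' rule the posterior can be computed from the likelihood and the prior (standard formulation, stated for reference):

P(R \mid D) = \frac{P(D \mid R)\,P(R)}{P(D)}, \qquad \text{classify } D \text{ as relevant iff } P(D \mid R)\,P(R) > P(D \mid NR)\,P(NR)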
Relevance feedback
It is possible to represent the topic of a query as a language model.
Instead of calling this the query language model, we use the name relevance model since it represents the topic covered by relevant documents.
The query can be viewed as a very small sample of text generated from the relevance model, and relevant documents are much larger samples of text from the same model.
Syllabus
GQ. What do you mean by spelling correction in IR ? Discuss its challenges.
1. realstateisting.bc.com
2. akia 1080i manunal
3. ultimatwarcade
4. mainscourcebank
5. dellottitouche
The wide variety in the type and severity of possible spelling errors in
queries presents a significant challenge.
In order to discuss which spelling correction techniques are the most
effective for search engine queries, we first have to review how spelling
correction is done for general text.
The basic approach used in many spelling checkers is to suggest
corrections for words that are not found in the spelling dictionary.
Suggestions are found by comparing the word that was not found in the dictionary to words that are in the dictionary, using a similarity measure.
A given spelling error may have many possible corrections. For example, the spelling error "lawers" has the following possible corrections (among others) at edit distance 1 : lowers, lawyers, layers, lasers, lagers.
The spelling corrector has to decide whether to present all of these to the user, and in what order to present them.
The noisy channel model for spelling correction is a general framework that can address the issues of ranking, context, and run-on errors.
The model is called a "noisy channel" because it is based on Shannon's theory of communication. The intuition is that a person chooses a word w to output (i.e., write), based on a probability distribution P(w). The person then tries to write the word w, but the noisy channel (presumably the person's brain) causes the person to write the word e instead, with probability P(e|w).
The probabilities P(w), called the language model, capture information about the frequency of occurrence of a word in text (e.g., what is the probability of the word "lawyer" occurring in a document or query?) and contextual information such as the probability of observing a word given that another word has just been observed (e.g., what is the probability of "lawyer" following the word "trial"?).
The probabilities P(e|w), called the error model, represent information about the frequency of different types of spelling errors.
The probabilities for words (or strings) that are edit distance 1 away from the word w will be quite high, for example. Words with higher edit distances will generally have lower probabilities, although homophones will have high probabilities.
Note that the error model will have probabilities for writing the correct word (P(w|w)) as well as probabilities for spelling errors.
This enables the spelling corrector to suggest a correction for all words, even if the original word was correctly spelled. If the highest-probability correction is the same word, then no correction is suggested to the user.
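Combining the two models, the correction chosen for an observed word e is the word w that maximizes the product of the error model and the language model (standard noisy-channel formulation, stated for reference):

\hat{w} \;=\; \arg\max_{w} \; P(e \mid w)\, P(w)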
Given two character strings s1 and s2, the edit distance between them is the minimum number of edit operations required to transform s1 into s2. Most commonly, the edit operations allowed for this purpose are : insertion of a character, deletion of a character, and replacement of a character with another character.
Example 1
Input : str1 = "cat", str2 = "cut"
Output : 1
Explanation : We can convert str1 into str2 by replacing 'a' with 'u'.
Example 2
Input : str1 = "sunday", str2 = "saturday"
Output : 3
Explanation : The last three characters and the first character are the same. We basically need to convert "un" to "atur". This can be done using the below three operations : replace 'n' with 'r', insert 't', insert 'a'.
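A minimal dynamic-programming implementation of this edit distance (insert, delete, replace) is sketched below in Python; it reproduces the two examples above.

def edit_distance(s1, s2):
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],      # delete
                                   dp[i][j - 1],      # insert
                                   dp[i - 1][j - 1])  # replace
    return dp[m][n]

print(edit_distance("cat", "cut"))          # 1
print(edit_distance("sunday", "saturday"))  # 3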
bo → aboard, about, boardroom, border
or → border, lord, morbid, sordid
rd → aboard, ardent, boardroom, border
Fig. 4.4.1 : Matching at least two of the three 2-grams in the query "bord"
K-Grams
K-grams are k-length subsequences of a string. Here, k can be 1, 2, 3
and so on. For k=l, each resulting subsequence is called a "unigram";
for k=2, a bigram": and for k=3, a "trigram'". These are the most widely
used k-grams for spelling correction, but the value of k really depends
on the situation and context.
As an example, consider the string "catastrophic". In this case,
Unigrams : ["c", "a", "t", "a", "s", "t", "r", "o", "p", "h", "i", "c"]
Bigrams : ["ca", "at", "ta", "as", "st", "tr", "ro", "op", "ph", "hi", "ic"]
The 2-gram (or bigram) index shown in the above figure gives (a portion of) the postings for the three bigrams in the query "bord".
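As a small illustration, the Python sketch below generates k-grams and counts how many of the query's bigrams a candidate word shares, in the spirit of the figure above; the helper names are invented for the example.

def k_grams(word, k):
    # All k-length subsequences of consecutive characters in word.
    return [word[i:i + k] for i in range(len(word) - k + 1)]

def shared_bigrams(query, candidate):
    # Number of distinct query bigrams that also occur in the candidate word.
    return len(set(k_grams(query, 2)) & set(k_grams(candidate, 2)))

print(k_grams("catastrophic", 2))       # ['ca', 'at', 'ta', ...]
print(shared_bigrams("bord", "aboard")) # matches 'bo' and 'rd' -> 2
print(shared_bigrams("bord", "border")) # matches 'bo', 'or', 'rd' -> 3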
Syllabus :
5.1.2 F-measure
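For reference, the standard definition of the F-measure combines precision P and recall R as their harmonic mean:

F = \frac{2PR}{P + R}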
Mean Average Precision Formula
The mAP incorporates the trade-off between precision and recall and considers both false positives (FP) and false negatives (FN).
This property makes mAP a suitable metric for most detection applications.
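For a set of queries Q, the mean average precision is commonly written as the mean of the per-query average precision values (standard formulation, stated for reference):

\mathrm{mAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q)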
with much of the current focus on using large volumes of log data from user interactions, such as clickthrough data, which records the documents that were clicked on during a search session.
Clickthrough and other log data is strongly correlated with relevance, so it can be used to evaluate search, but search engine companies still use relevance judgments in addition to log data to ensure the validity of their results.
Chapter Ends..
UNIT 2
Text Categorization and Filtering
CHAPTER 6
Syllabus
Example
We can classify emails into spam or non-spam, news articles into different categories like Politics, Stock Market, Sports, etc., and academic papers are often classified by technical domains and sub-domains.
The Text Classification Problem
A classifier can be formally defined as follows. Let D be a collection of documents and C = {c1, c2, ..., cL} a set of L classes with their respective labels. A text classifier is a binary function F : D x C → {0, 1}, which assigns to each pair (dj, cp), dj ∈ D and cp ∈ C, a value of
1, if dj is a member of class cp
0, if dj is not a member of class cp
This is a broad definition that admits both supervised and unsupervised algorithms; for high accuracy, supervised algorithms are used.
multi-label : one or more labels are assigned to each document
single-label : a single class is assigned to each document
The classification function F, defined as a binary function of the document-class pair [dj, cp], can be modified to compute the degree of membership of dj in cp.
Supervised Algorithms
Supervised algorithms depend on a training set. The training set is used to learn a classification function. The larger the number of training examples, the better is the fine tuning of the classifier.
Overfitting : the classifier becomes too specific to the training examples.
To evaluate the classifier, we use a set of unseen objects, commonly referred to as the test set.
Unsupervised Algorithms : Clustering
Input data : a set of documents to classify; not even class labels are provided.
Task of the classifier : separate the documents into subsets (clusters).
Frequency Table :
Fruit     Yellow   Sweet   Long   Total
Mango       350      450      0     650
Banana      400      300    350     400
Others       50      100     50     150
Total       800      850    400    1200
Table : Samples for classification for the Naive Bayes Theorem
1. Mango :
P(X | Mango) = P(Yellow | Mango) * P(Sweet | Mango) * P(Long | Mango)
a) P(Yellow | Mango) = (P(Mango | Yellow) * P(Yellow)) / P(Mango)
   = ((350/800) * (800/1200)) / (650/1200)
   P(Yellow | Mango) = 0.53    ...(1)
b) P(Sweet | Mango) = (P(Mango | Sweet) * P(Sweet)) / P(Mango)
   = ((450/850) * (850/1200)) / (650/1200)
   P(Sweet | Mango) = 0.69    ...(2)
c) P(Long | Mango) = (P(Mango | Long) * P(Long)) / P(Mango)
   = ((0/400) * (400/1200)) / (650/1200)
   P(Long | Mango) = 0    ...(3)
On multiplying eq. (1), (2), (3) ==> P(X | Mango) = 0.53 * 0.69 * 0
P(X | Mango) = 0
2. Banana :
P(X | Banana) = P(Yellow | Banana) * P(Sweet | Banana) * P(Long | Banana)
a) P(Yellow | Banana) = (P(Banana | Yellow) * P(Yellow)) / P(Banana)
   = ((400/800) * (800/1200)) / (400/1200)
   P(Yellow | Banana) = 1    ...(4)
b) P(Sweet | Banana) = (P(Banana | Sweet) * P(Sweet)) / P(Banana)
   = ((300/850) * (850/1200)) / (400/1200)
   P(Sweet | Banana) = 0.75    ...(5)
c) P(Long | Banana) = (P(Banana | Long) * P(Long)) / P(Banana)
   = ((350/400) * (400/1200)) / (400/1200)
   P(Long | Banana) = 0.875    ...(6)
On multiplying eq. (4), (5), (6) ==> P(X | Banana) = 1 * 0.75 * 0.875 = 0.65 (approx.)
3. Others :
P(X | Others) = P(Yellow | Others) * P(Sweet | Others) * P(Long | Others)
a) P(Yellow | Others) = (P(Others | Yellow) * P(Yellow)) / P(Others)
   = ((50/800) * (800/1200)) / (150/1200)
   P(Yellow | Others) = 0.34    ...(7)
b) P(Sweet | Others) = (P(Others | Sweet) * P(Sweet)) / P(Others)
   = ((100/850) * (850/1200)) / (150/1200)
   P(Sweet | Others) = 0.67    ...(8)
c) P(Long | Others) = (P(Others | Long) * P(Long)) / P(Others)
   = ((50/400) * (400/1200)) / (150/1200)
   P(Long | Others) = 0.34    ...(9)
On multiplying eq. (7), (8), (9) ==> P(X | Others) = 0.34 * 0.67 * 0.34 = 0.08 (approx.)
Since P(X | Banana) is the highest, the fruit X (Yellow, Sweet and Long) is classified as a Banana.
Training Algorithm
Let V be the vocabulary of all words in the documents in D
For each category c_i in C :
    Let D_i be the subset of documents in D belonging to category c_i
    P(c_i) = |D_i| / |D|
    Let T_i be the concatenation of all documents in D_i
    Let n_i be the total number of word occurrences in T_i
    For each word w_j in V :
        Let n_ij be the number of occurrences of w_j in T_i
        Let P(w_j | c_i) = (n_ij + 1) / (n_i + |V|)
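The Python sketch below mirrors the training algorithm above (Laplace-smoothed multinomial Naive Bayes) and adds a simple classification step; the toy documents and function names are invented for illustration.

import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    # docs: list of (text, class_label) pairs.
    vocab = {w for text, _ in docs for w in text.split()}
    prior, cond = {}, defaultdict(dict)
    for c in {label for _, label in docs}:
        class_docs = [text for text, label in docs if label == c]
        prior[c] = len(class_docs) / len(docs)          # P(c) = |D_c| / |D|
        words = Counter(w for text in class_docs for w in text.split())
        n_c = sum(words.values())                       # word occurrences in T_c
        for w in vocab:                                 # P(w|c) = (n_cw + 1) / (n_c + |V|)
            cond[c][w] = (words[w] + 1) / (n_c + len(vocab))
    return prior, cond

def classify(text, prior, cond):
    scores = {
        c: math.log(prior[c]) + sum(math.log(cond[c][w])
                                    for w in text.split() if w in cond[c])
        for c in prior
    }
    return max(scores, key=scores.get)

prior, cond = train_naive_bayes([("cheap pills buy now", "spam"),
                                 ("meeting agenda attached", "ham"),
                                 ("buy cheap meds", "spam")])
print(classify("cheap meds now", prior, cond))   # spam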
Fig. 6.1.1 : A maximum margin hyperplane separating two groups of points
We can see a set of points corresponding to two categories, blue and green. The red line indicates the maximum margin hyperplane that separates both groups of points. Those points over the dashed line are called the support vectors.
It frequently happens that the sets are not linearly separable in the original space. Therefore, the original space is mapped into a higher-dimensional space where the separation can be obtained. SVMs can efficiently perform a non-linear classification using the so-called kernel trick.
The kernel trick consists of using specific kernel functions, which simplify the mapping between the original space and a higher dimensional space.
GQ. Explain the difference between a Naive Bayes Classifier and SVM.
Naive Bayes comes under the class of generative models for classification. It models the posterior probability from the class conditional densities. So, the output is a probability of belonging to a class.
SVM, on the other hand, is based on a discriminant function given by y = w.x + b. Here the weights w and bias parameter b are estimated from the training data. It tries to find a hyperplane that maximises the margin.
Naive Bayes performs best when the features are independent of each other, which often does not happen in reality. Having said that, it still performs well even when the features are not independent.
6.2 FEATURE SELECTION
Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification.
Feature selection serves two main purposes. First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary. This is of particular importance for classifiers that, unlike NB, are expensive to train. Second, feature selection often increases classification accuracy by eliminating noise features.
A noise feature is one that, when added to the document representation, increases the classification error on new data. Suppose a rare term, say ...
We can view feature selection as a method for replacing a complex classifier (using all features) with a simpler one (using a subset of the features).
The basic feature selection algorithm is shown in the above figure. For a given class c, we compute a utility measure A(t, c) for each term of the vocabulary and select the k terms that have the highest values of A(t, c). All other terms are discarded and not used in classification. We will introduce three different utility measures in this section : mutual information, A(t, c) = I(U_t ; C_c); the chi-square test, A(t, c) = X^2(t, c); and frequency, A(t, c) = N(t, c).
Feature Selection : selects a subset of relevant features while keeping most of the useful information.
GQ. Explain in brief the applications of text categorization and filtering.
GQ. Discuss the various applications of text categorization in IR.
Text categorization
Text categorization is a machine learning technique that assigns a set of predefined categories to open-ended text.
Text classifiers can be used to organize, structure, and categorize pretty much any kind of text - from documents, medical studies and files, and more.
1. Spam detection
There are many types of spam, including email spam, web page spam, advertisement spam, blog spam, and others. Therefore, a classification technique that works for one type of spam may not work for all of them, and specialized spam classifiers are built to decide whether a given item is spam.
2. Sentiment Analysis
3. Classifying advertisements
Advertisers must pay the search engine only if a user clicks on the advertisement. A user may click on an advertisement for a number of reasons. Clearly, if the advertisement is "topically relevant" then the user may click on it. However, this is not the only reason why a user may click.
Customers often use social media to express their opinions about and experiences of products or services. Text classification is often used to identify the tweets that brands must respond to.
Text classification is also used in language identification, like identifying the language of new tweets or posts. For example, Google Translate has an automatic language identification feature.
Authorship attribution, or identifying the unknown authors of texts from a pool of authors, is another popular use case of text classification, and it's used in a range of fields from forensic analysis to literary studies. Text classification has also been used to segregate fake news from real news.
Language detection is another relevant example of text classification, that is, the process of classifying incoming text according to its language. These classifiers are often used for routing purposes (e.g., routing tickets to the appropriate team).
Text Filtering
Filtering is the process of evaluating documents on an ongoing basis.
1. Content-Based filtering
Objects to be filtered : generally texts; the filter engine is based on content analysis.
These filtering methods are based on the description of an item and a profile of the user's preferred choices.
In a content-based recommendation system, keywords are used to describe the items; besides, a user profile is built to state the type of item this user likes.
The algorithms try to recommend products which are similar to the ones that a user has liked in the past. The idea of content-based filtering is that if you like an item, you will also like a 'similar' item.
2. Collaborative filtering
Objects to be filtered: products/goods, filter engine based on usage analysis.
This filtering method is usually based on collecting and analyzing
information on user's behaviors, their activities or preferences and
predicting what they will like based on the similarity with other users.
A key advantage of the collaborative filtering approach is that it does
not rely on machine analyzable content and thus it is capable of
accurately recommending complex items such as movies without
requiring an "understanding" of the item itself.
3. Hybrid Filtering
Combination of the two previous approaches.
Recent research shows that combining collaborative and content-based
recommendation can be more effective.
Hybrid approaches can be implemented by making content-based and
collaborative-based predictions separately and then combining them.
Further, by adding content-based capabilities to a collaborative-based
approach and vice versa; or by unifying the approaches into one model.
4. Even job searching uses a hybrid filtering system which is the combination of the content-based filtering and collaborative filtering approaches. The main motto is to make job search easy for users. This recommendation depends on the user's past experiences, as it makes it easy for the users to get recommendations of various job profiles on the basis of their past experiences, projects, internships, skills, etc.
5. Searching for friends online on Facebook (whom to be friends with) is also part of collaborative filtering. Even song listings based on previous history or choices in Spotify are another example of collaborative filtering.
Classification vs. Clustering : In classification, there are labels for the training data; in clustering, there are no labels for the training data.
Chapter Ends...
UNIT 2
Text Clustering for Information Retrieval
CHAPTER 7
Syllabus
7.1.1 K-means Clustering
K-means is one of the most common clustering techniques. It uses Euclidean distances between points as a criterion for clustering.
Each cluster has a cluster center allocated to it, and each of them is placed at farther distances from the others.
Every incoming data point gets placed in the cluster with the closest cluster center.
This process is repeated until all the data points get assigned to some cluster. Once all the data points are covered, the cluster centers or centroids are recalculated.
Document representations in clustering use the vector space model. As in vector space classification, we measure relatedness between vectors by Euclidean distance, which is almost equivalent to cosine similarity.
Each cluster in K-means is defined by a centroid.
(Flowchart : K-means - decide the number of clusters K, determine the centroids, compute the distance of objects to the centroids, group objects based on minimum distance, and repeat until no object moves to another group.)
Algorithm
Input : K : the number of clusters; D : a data set containing n objects
Output : a set of K clusters
Steps
1. Arbitrarily choose k objects from D as the initial cluster centers
2. Repeat
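A minimal K-means sketch in plain Python on 2-D points, following the steps above (arbitrary initial centers, assignment by Euclidean distance, centroid recalculation), is shown below; the sample points are invented.

import math
import random

def k_means(points, k, iterations=100):
    centroids = random.sample(points, k)          # arbitrary initial centers
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:            # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 7.5)]
centroids, clusters = k_means(pts, k=2)
print(centroids)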
The output of the algorithm largely depends on how clusters are chosen
and split.
Divisive clustering is a top-down approach.
Agglomerative clustering
The other general type of hierarchical clustering algorithm is called
agglomerative clustering, which is a bottom-up approach.
An agglomerative algorithm starts with each input as a separate cluster.
That is, it begins with N clusters, each of which contains a single input.
The algorithm then proceeds by joining two (or possibly more) existing
clusters to form a new cluster. Therefore, the number of clusters
decreases after each iteration.
The algorithm terminates when there are K clusters.
1. Input : a set of N documents to be clustered; an N x N similarity (distance) matrix
2. Assign each document to its own cluster : N clusters are produced, containing one document each
3. Find the two closest clusters and merge them into a single cluster
Example : starting from the individual documents a, b, c, d, e, the clusters ab and de are formed first, then cde, and finally abcde.
Methodsto find closest pair of clusters
GQ. Explain various methods to find clusters in clustering algorithm.
1. Single Linkage
In single linkage hierarchical clustering, the distance between two
clusters is defined as the shortest distance between two points in each
cluster. For example, the distance between clusters "r" and "s" to the left is equal to the length of the arrow between their two closest points.
L(r, s) = min D(x_ri, x_sj), the minimum distance between a point x_ri in cluster r and a point x_sj in cluster s.
2. Complete Linkage
In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between two points in each cluster.
For example, the distance between clusters "r" and "s" to the left is equal to the length of the arrow between their two furthest points.
3. Average Linkage
In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster.
For example, the distance between clusters "r" and "s" to the left is equal to the average length of the arrows connecting the points of one cluster to the other.
L(r, s) = (1 / (n_r * n_s)) * sum over i and j of D(x_ri, x_sj)
GQ. Differentiate between K-Means and Hierarchical clustering.
1. Clustering tendency
Before evaluating the clustering performance, making sure that the data set we are working on has clustering tendency and does not contain uniformly distributed points is very important.
If the data does not contain clustering tendency, then clusters identified by any state-of-the-art clustering algorithm may be irrelevant.
Empirical Method
A simple empirical method of finding the number of clusters is the square root of N/2, where N is the total number of data points, so that each cluster contains approximately the square root of 2*N points.
Elbow Method
Within-cluster variance is a measure of compactness of the cluster.
The lower the value of within-cluster variance, the higher the compactness of the cluster formed.
Sum of within-cluster variance, W, is calculated for clustering
analyses done with different values of k.
W is a cumulative measure of how well the points are clustered in the analysis. Plotting the k values and their corresponding sums of within-cluster variance helps in finding the number of clusters.
Statistical Approach
Gap statistic is a powerful statistical method to find the optimal
number of clusters, k. Similar to Elbow method, sum of within
cluster (intra-cluster) variance is calculated for different values
of k.
Then Random data points from reference null distribution are
Approaches are usually based on an analysis of term co-occurrence. Two functions are rank equivalent if they produce the same ordering of terms.
Dice's Coefficient :
2 * n_ab / (n_a + n_b), which is rank equivalent to n_ab / (n_a + n_b)
where n_a and n_b are the numbers of windows (or documents) containing terms a and b, and n_ab is the number containing both.
Mutual Information :
log( P(a, b) / (P(a) * P(b)) ) = log( N * n_ab / (n_a * n_b) ), which is rank equivalent to n_ab / (n_a * n_b)
Here n_a * n_b / N is the expected number of co-occurrences if the two terms occur independently.
Chapter Ends...
UNIT 2
Web Information Retrieval
CHAPTER 8
Syllabus
Web Information Retrieval : Web search architecture and challenges, Crawling and indexing web pages, Link analysis and PageRank algorithm.
8.1 WEB SEARCH ARCHITECTURE AND CHALLENGES
8.1.1 Web Search and Search Engine
GQ. Explain the terms web search and search engine.
The World Wide Web allows people to share information globally. The amount of information grows without bound.
Information is published in documents known as "Web Pages", each of which can be addressed by an identifier called a Uniform Resource Locator (URL).
Web pages are grouped together into a set of "Web Sites". For example, https://mu.ac.in/.
A user can search for any information by passing a query in the form of keywords or a phrase. It then searches for relevant information in its database and returns the relevant information to the user.
Web search simply means searching for information on the Web.
The term may be used to differentiate Web searching from searching local users' PCs or servers in the company datacenter.
In order to extract information, the user needs a tool to search the Web. The tool is called a search engine. A Web search engine is a specialized computer server that searches for information on the Web.
Examples of search engines are Google, Yahoo!, MSN Search, and Bing.
Web search engines discover pages by crawling the web, discovering new pages by following hyperlinks. Access to particular web pages may be restricted in various ways.
The search engine looks for the keyword in the index of a predefined database instead of going directly to the web to search for the keyword. It then uses software to search for the information in the database. This software component is known as a web crawler.
Once the web crawler finds the pages, the search engine then shows all relevant web pages as a result. These retrieved web pages generally include the title of the page, the size of the text portion, the first several sentences, etc.
4. Irrelevant Results : Search engines may return results that are not relevant to the query, leading users to waste time sifting through irrelevant information.
5. Privacy Issues : Some search engines track users' search and browsing history, which raises privacy concerns.
Types of search engines
GQ. Explain search engine types.
1. Crawler based
Crawler-based search engines use automated software programs to survey and categorize web pages. The programs used by the search engines to access our web pages are called spiders, crawlers, robots or bots.
Examples of crawler-based search engines are :
a) Google (www.google.com)   b) Ask Jeeves (www.ask.com)
2. Directories
Directories list websites within specific categories in the site database, depending on what category they belong to.
The human editors comprehensively check the website and rank it, based on the information they find, using a pre-defined set of rules.
An example of a directory is : a) Yahoo Directory (www.yahoo.com)
Website designers usualy create distinguishable layout templates tor
pages of different functions.
Fig. 8.1.1 : Bow-Tie structure of the Web - IN, SCC (strongly connected component), OUT, Tubes, Tendrils, and Disconnected components.
The central core of the Web (the knot of the bow-tie) is the strongly connected component (SCC), which means that for any two pages in the SCC, a user can navigate from one of them to the other and back by clicking on links embedded in the pages encountered.
The left bow, called IN, contains pages that have a directed path of links leading to the SCC, and its relative size was 21.5% of the crawled pages. Pages in the left bow might be either new pages that have not yet been linked to, or older web pages that have not become popular enough to become part of the SCC.
The right bow, called OUT, contains pages that can be reached from the SCC by following a directed path of links, and its relative size was also 21.5% of the crawled pages.
Pages in the right bow might be pages in e-commerce sites that have a policy not to link to other sites.
8.1.3 Challenges of Web Search
GQ. What are the challenges posed by Web search ?
1. Data-centric : related to the data itself
- Distributed data : Data spans over many computers and platforms. The available bandwidth and reliability of the network interconnections vary widely.
- High percentage of volatile data : New computers/sites/pages can be added and removed easily. We also have dangling links when domain or file names change or disappear.
- Large volume of data : scaling issues that are difficult to cope with.
- Unstructured and redundant data : No conceptual structure/organization. HTML pages are only semi-structured. Much data is repeated (copies/mirrors).
The user should notice that anyone can publish data on the Web, so information that they get from search engines might not be accurate.
GQ. Explain in brief how search engines work.
Simplified Search Engine Architecture
Fig. 8.1.2 : Simplified search engine architecture - crawler, indexer, query engine, and search interface.
Search Index
The search index stores the content of the crawled web pages, listed in an inverted file (posting list).
The search index will also store information pertaining to hyperlinks in a separate link database, which allows the search engine to perform hyperlink analysis, which is used as part of the ranking process of web pages.
The link database can also be organized as an inverted file in such a way that its index file is populated by URLs, and the posting list for each URL entry, called the source URL, contains all the destination URLs forming links between these source and destination URLs.
Query Engine
The query engine is a well-guarded secret, since search engines are rightly paranoid, fearing web sites who wish to increase their ranking by unscrupulously taking advantage of the algorithms the search engine uses to rank result pages.
Search engines view such manipulation as spam, since it
has direct
effects on the quality of the results presented to the user.
Search Interface
Once the query is processed, the query engine sends the results list to the search interface, which displays the results on the user's screen.
The user interface provides the look and feel of the search
engine,
allowing the user to submit queries, browse the results
list, and click on
chosen web pages for further browsing.
Features of a crawler
Robustness : Ability to handle spider traps. The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps.
Fig. 8.2.1 : Basic crawler architecture - URL frontier, fetch, parse, "content seen?", URL filter (with robots templates), duplicate URL elimination, document fingerprints and the URL set.
1. The URL frontier, containing URLs yet to be fetched in the current crawl (in the case of continuous crawling, a URL may have been fetched previously but is back in the frontier for re-fetching).
2. A DNS resolution module that determines the web server from which to fetch the page specified by a URL.
3. A fetch module that uses the http protocol to retrieve the web page
at a URL.
4. A parsing module that extracts the text and set of links from a
fetched web page.
5. A duplicate elimination module that determines whether an
extracted link is already in the URL frontier or has recently been
fetched.
We begin by assuming that the URL frontier is in place and non-empty.
We follow the progress of a single URL through the cycle of being
fetched, passing through various checks and filters, then finally (for
continuous crawling) being returned to the URL frontier.
Indexes complement search engines on larger web sites and use a flat structure with only one or two levels of hierarchy. For smaller sites, they provide a cost-effective alternative. Whether to use a back-of-the-book style index or a hierarchy of categories will depend on the size of the site and how rapidly the content is changing.
Site indexes are best done by individuals skilled in indexing who also have basic skills in HTML or in using HTML indexing tools.
Electronic indexing includes embedded indexing of Word, FrameMaker, PDF and InDesign electronic documents of publications, online help and Content Management System tagging. When the pages are edited or changed, the index is regenerated with new page numbers, anchors or URLs, with a hyperlink from the index to the relevant page or paragraph.
8.3 LINK ANALYSIS AND PAGERANK ALGORITHM
Fig. 8.3.1
The above figure shows the random surfer at node A, who proceeds with probability 1/3 to each of B, C and D. Consider a random surfer who begins at a web page (a node of the web graph) and executes a random walk on the Web as follows. At each time step, the surfer proceeds from his current page A to a randomly chosen web page that A hyperlinks to. The above figure shows the surfer at a node A, out of which there are three hyperlinks to nodes B, C and D; the surfer proceeds at the next time step to one of these three nodes, with equal probabilities 1/3. As the surfer proceeds in this random walk from node to node, he visits some nodes more often than others; intuitively, these are nodes with many links coming in from other frequently visited nodes. The idea behind PageRank is that pages visited more often in this walk are more important.
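The random-surfer idea can be sketched as a simple power iteration with a damping (teleportation) factor, as in the Python code below; the toy graph and the damping value 0.85 are illustrative assumptions, not part of the text.

def pagerank(graph, damping=0.85, iterations=50):
    # graph: dict mapping a node to the list of nodes it links to.
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}            # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, out_links in graph.items():
            if not out_links:                     # dangling node: spread its rank evenly
                for u in nodes:
                    new_rank[u] += damping * rank[v] / n
            else:
                for u in out_links:               # surfer follows a random out-link
                    new_rank[u] += damping * rank[v] / len(out_links)
        rank = new_rank
    return rank

graph = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))   # A receives the largest share of the rank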
Chapter Ends...
UNIT 2
Learning to Rank
CHAPTER 9
Syllabus
Learning to Rank : Algorithms and Techniques, Supervised learning for ranking : RankSVM, RankBoost, Pairwise and listwise learning to rank approaches, Evaluation metrics for learning to rank.
1. Search engines. Given a query typed into the search bar, the search engine should rank the web pages in a way that the most relevant results appear in the top positions.
2. Recommender systems. A movie recommender system choosing which film should be recommended to a user based on an input query.
LTR algorithms use three major types of approaches : pointwise, pairwise, and listwise. These approaches and their associated algorithms are shown in the below diagram.
(Diagram : the pointwise, pairwise and listwise LTR approaches and their associated algorithms.)
Depending on how an algorithm chooses and compares ranks of items at each iteration, there exist three principal methods :
1. Pointwise ranking
2. Pairwise ranking
3. Listwise ranking
1. Pointwise ranking
Pointwise approaches look at a single document at a time, using classification or regression to discover the best ranking for individual results.
We give each document points for how well it fits during these
processes. We add those up and sort the result list. Note here that
each document score in the result list for the query is independent
of any other document score, i.e. each document is considered a
*point" for the ranking, independent of the other "points".
For pointwise approaches, the score for each document is
independent of the other documents that are in the result list for the
query.
Pointwise ranking optimízes document scores independently and
does not take into account relative scores between different
documents. Therefore, it does not directly optimíze the ranking
quality.
In the pointwise approach, scores are predicted individually for each feature vector. Ultimately, the predicted scores are sorted. It does not matter which type of model (decision tree, neural network, etc.) is used for prediction.
The advantage of pointwise ranking is simplicity.
The disadvantages are : each instance is treated as an isolated point, and explicit pointwise labels are required to create the training dataset.
To overcome these challenges, we can use the Pairwise Ranking method.
(Pointwise model : f(q, d) = s, where the model maps a query q and a document d to a score or relevance probability s.)
2. Pairwise ranking
(Pairwise model : the model computes scores s_i = f(q, d_i) and s_j = f(q, d_j) for two documents, and g(s_i, s_j) = P(d_i > d_j) is the probability that d_i is ranked higher than d_j.)
3. Listwise ranking
Truth lists are identified, and the machine uses that data to rank its list of documents.
Listwise approaches use probability models to minimize the ordering error.
There are two main sub-techniques for doing listwise Learning to Rank:
1. Direct optimization of IR measures such as NDCG. E.g., SoftRank.
AdaRank.
2. Minimize a loss function that is defined based on understanding the
unique properties of the kind of ranking you are trying to achieve.
E.g., ListNet, ListMLE.
Listwise approaches can get fairly complex compared to the pointwise or pairwise approaches.
Unlike pointwise or pairwise ranking, listwise methods take as an input
a whole Jist of documents at a single time.
This leads to big computations but also gives more robustness since the
algorithm is provided with more information at each iteration.
Fig. 9.2.3 : Listwise model architecture. As an input, the model takes a query and the feature vectors of all documents.
9.3 SUPERVISED LEARNING FOR RANKING : RANKSVM, RANKBOOST
RankNet : L(z) = log(1 + e^(-z)), where z = sigma * (s_i - s_j)
9.3.1 RankSVM
The loss function of Ranking SVM is a hinge loss defined on document pairs. For example, if document x_u is more relevant than document x_v (mathematically, y_{u,v} = +1), then if w^T x_u is larger than w^T x_v by a margin of 1, there is no loss. Otherwise, the loss will be xi_{u,v}.
Since Ranking SVM is well rooted in the framework of SVM, it inherits nice properties of SVM. For example, with the help of margin maximization, Ranking SVM can have good generalization ability.
Kernel tricks can also be applied to Ranking SVM, so as to handle complex non-linear problems.
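A minimal Python sketch of this pairwise hinge loss for a linear scoring function w.x is given below; the weights and feature vectors are invented for illustration.

def pairwise_hinge_loss(w, preferred, other):
    # Zero loss when the preferred document outscores the other by a margin of at least 1.
    score_u = sum(wi * xi for wi, xi in zip(w, preferred))
    score_v = sum(wi * xi for wi, xi in zip(w, other))
    return max(0.0, 1.0 - (score_u - score_v))

w = [0.5, 1.0]                     # linear model weights
x_u = [2.0, 1.0]                   # feature vector of the more relevant document
x_v = [1.0, 0.5]                   # feature vector of the less relevant document
print(pairwise_hinge_loss(w, x_u, x_v))   # 0.0: the margin is already satisfied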
9.3.2 RankBoost
GQ. Explain the RankBoost method.
The method of RankBoost adopts AdaBoost for the classification over document pairs.
The only difference between RankBoost and AdaBoost is that the distribution in RankBoost is defined on document pairs while that in AdaBoost is defined on individual documents.
The algorithm flow of RankBoost is given in Algorithm 1, where D_t is the distribution on document pairs, f_t is the weak ranker selected at the t-th iteration, and alpha_t is the weight for linearly combining the weak rankers.
RankBoost actually minimizes the exponential loss defined below :
L(f) = sum over i, and over pairs u, v with y^(i)_{u,v} = 1, of exp( -( f(x_u^(i)) - f(x_v^(i)) ) )
Choose alpha_t
Update D_{t+1}(x_u^(i), x_v^(i)) = (1/Z_t) * D_t(x_u^(i), x_v^(i)) * exp( alpha_t ( f_t(x_u^(i)) - f_t(x_v^(i)) ) )
where Z_t = sum over i = 1..n, and over pairs u, v with y^(i)_{u,v} = 1, of D_t(x_u^(i), x_v^(i)) * exp( alpha_t ( f_t(x_u^(i)) - f_t(x_v^(i)) ) )
Output : f(x) = sum over t of alpha_t f_t(x)
From Algorithm 1, one can see that RankBoost learns the optimal weak ranker f_t based on the current distribution of document pairs.
The third way is based on the approximation of Z_t, which is applicable when f_t takes a real value from [0, 1]. In this case, if we define :
r_t = sum over i, and over pairs u, v with y^(i)_{u,v} = 1, of D_t(x_u^(i), x_v^(i)) ( f_t(x_u^(i)) - f_t(x_v^(i)) )
then alpha_t = (1/2) log( (1 + r_t) / (1 - r_t) ).
9.4 EVALUATION METRICS FOR LEARNING TO RANK
GQ. Discuss in detail various ranking evaluation metrics.
A question which arises naturally is how to estimate the quality of the ranking algorithm.
There are several types of information retrieval metrics - Unranked, Ranked and User-oriented.

Metric Type : Examples
Unranked : MSE, RMSE, MAE, precision, recall
Ranked : precision@k, recall@k, AP@k, MAP@k, RR, MRR, Kendall Tau distance
Unranked metrics
Unranked evaluation metrics consider that the set of relevant documents for a query is independent of the user and users' feedback.
They are mainly used for evaluating the performance of machine learning classification problems and are not a good measure for information retrieval systems.
Even though not suitable, some unranked metrics like error rate, fallout, and miss rate are used in the field of speech recognition and information retrieval.
MAE, MSE, RMSE, Precision and Recall are some unranked metrics.
MAE (Mean Absolute Error) represents the average of the absolute difference between the actual and predicted values in the dataset. It measures the average of the residuals in the dataset.
MSE (Mean Squared Error) represents the average of the squared difference between the original and predicted values in the data set. It measures the variance of the residuals.
RMSE (Root Mean Squared Error) is the square root of the Mean Squared Error. It measures the standard deviation of the residuals.
An inversion occurs when a document with greater relevance than document j appears after it on the search result list. Kendall Tau distance calculates the number of all inversions in the ranking. The lower the number of inversions, the better the search result is.
Precision@k would be the fraction of relevant items in the top k recommendations, and recall@k would be the coverage of relevant items in the top k.
Average Precision or AP@K is a metric that tells you how a single sorted prediction compares with the ground truth. E.g., AP@K would tell you how correct a single ranking of documents is, with respect to a single query. It is the sum of precision@k, taken at the ranks k where the item is relevant (rel(k)), divided by the total number of relevant items (r) in the top K recommendations.
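The Python sketch below computes precision@k and AP@K from a ranked list of 0/1 relevance labels, following the definition above; the example labels are invented.

def precision_at_k(rels, k):
    # rels: list of 0/1 relevance labels in ranked order.
    return sum(rels[:k]) / k

def average_precision_at_k(rels, k, total_relevant):
    # Sum of precision@i at the ranks i where the item is relevant,
    # divided by the number of relevant items considered.
    score = sum(precision_at_k(rels, i + 1)
                for i in range(min(k, len(rels))) if rels[i])
    return score / total_relevant if total_relevant else 0.0

rels = [1, 0, 1, 1, 0]          # relevance of the top-5 results for one query
print(precision_at_k(rels, 3))                      # 2/3
print(round(average_precision_at_k(rels, 5, 3), 3)) # (1/1 + 2/3 + 3/4) / 3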
RR (Reciprocal Rank) : for a single query, the reciprocal rank is 1/rank, where rank is the position of the highest-ranked relevant document (the correct answer) returned for the query. If no correct answer was returned for the query, the reciprocal rank is 0.
User-oriented metrics
Though ranked metrics consider ranking positions of items, thus being a preferable choice over the unranked ones, they still have a significant downside : the information about user behaviour is not taken into account.
User-oriented approaches make certain assumptions about user workflow. For example, RBP (Rank-Biased Precision) assumes that the user sequentially progresses from one document to another : at each position the user examines the current document and continues to the next one with probability p, or terminates the search at the current document with the inverse probability (1 - p). The RBP formulation makes sure the resulting score lies between 0 and 1.
ERR (Expeded Reciprycal Benk) As the
1;AsUTeSs the averaye reciprocal rank aeroSS THANy queries,
similar to RPB but with a Jittde difference: if the ursent iten
(R )for the user. then the search predure ends. Oherwise,
p
1O1 relevant
(/ Ri ), hen with probability the use decides
whether he or she wants t9 continue the search process, Sf s9, he search
jproceedsto the next jtem. )therwise, he users end 1he scarch procedure,
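The cascade model described above can be sketched in a few lines of Python. The mapping R_i = (2^g_i − 1) / 2^g_max from a graded relevance label g_i to a satisfaction probability is a common choice but an assumption here, not something stated in the text.

def expected_reciprocal_rank(grades, g_max=4):
    # grades: graded relevance labels of the ranked documents, e.g. 0..g_max
    err, p_continue = 0.0, 1.0
    for rank, g in enumerate(grades, start=1):
        r_i = (2 ** g - 1) / (2 ** g_max)   # probability the user is satisfied here
        err += p_continue * r_i / rank       # user stops at this rank
        p_continue *= (1 - r_i)              # otherwise continues to the next item
    return err

print(expected_reciprocal_rank([4, 0, 2]))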
Chapter Ends...
UNIT 2

CHAPTER 10 : Link Analysis and its Role in IR Systems

Syllabus
Link Analysis and its Role in IR Systems : Web graph representation and link analysis algorithms, HITS and PageRank algorithms, Applications of link analysis in IR systems.

10.1 WEB GRAPH REPRESENTATION AND LINK ANALYSIS
10.1.1 Web Graph

GQ. Explain the term web graph.

We can view the static Web, consisting of static HTML pages together with the hyperlinks between them, as a directed graph in which each web page is a node and each hyperlink a directed edge.
Types of Links
1. Inbound links or Inlinks
Inbound links are links into the site from the outside. Inlinks are one way to increase a site's total PageRank. Sites are not penalized for inlinks.
2. Outbound links or Outlinks
Outbound links are links from a page to other pages in a site or other sites.
3. Dangling links
Dangling links are simply links that point to any page with no outgoing links.
There is ample evidence that these links are not randomly distributed; this distribution is widely reported to be a power law, in which the total number of web pages with in-degree i is proportional to 1/i^α, the value of α typically reported by studies being 2.1.
GQ. Define the term link analysis.

The analysis of hyperlinks and the graph structure of the web has been instrumental in the development of web search.
Link analysis uses a network of interconnected links and nodes to find and analyze relationships that are not easily seen in raw data.
The links between pages on the web are a large knowledge source which is exploited by link analysis algorithms for many ends.
Many algorithms similar to PageRank and HITS determine a quality or authority score based on the number of in-coming links of a page.
Link analysis is applied to identify thematically similar pages, web communities and other social structures.
Link analysis for web search has intellectual antecedents in the field of citation analysis.
Link analysis also proves to be a useful indicator of what page(s) to crawl next while crawling the web; this is done by using link analysis to guide the priority assignment in the front queues of the crawler.
Fig. 10.2.1 : An authority and a hub
A good hub page is one that points to many good authorities; a good authority page is one that is pointed to by many good hub pages.
In each iteration, the algorithm updates the scores as

h(v) ← Σ_{v→y} a(y)
a(v) ← Σ_{y→v} h(y)

that is, it sets the hub score of page v to the sum of the authority scores of the pages v points to, and the authority score of v to the sum of the hub scores of the pages pointing to v.
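A minimal power-iteration sketch of these two update rules, assuming the web graph is given as an adjacency list (a dict from page to the pages it links to, which is just an illustrative input format):

def hits(graph, iterations=50):
    pages = set(graph) | {v for targets in graph.values() for v in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a(v) <- sum of hub scores of the pages linking to v
        new_auth = {p: 0.0 for p in pages}
        for u, targets in graph.items():
            for v in targets:
                new_auth[v] += hub[u]
        # h(v) <- sum of authority scores of the pages v links to
        new_hub = {p: sum(new_auth[v] for v in graph.get(p, [])) for p in pages}
        # normalize so the scores stay bounded
        a_norm = sum(x * x for x in new_auth.values()) ** 0.5 or 1.0
        h_norm = sum(x * x for x in new_hub.values()) ** 0.5 or 1.0
        auth = {p: x / a_norm for p, x in new_auth.items()}
        hub = {p: x / h_norm for p, x in new_hub.items()}
    return auth, hub

print(hits({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))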
Advantages of HITS
1. HITS scores pages due to its ability to rank pages according to the query string, resulting in relevant authority and hub pages.
2. The ranking may also be combined with other information-retrieval based rankings.
3. HITS is sensitive to the user query (as compared to PageRank).
4. Important pages are obtained on the basis of the calculated authority and hub values.
5. HITS is a general algorithm for calculating authority and hubs in order to rank the retrieved data.
6. HITS induces a Web graph by finding a set of pages with a search on a given query string.
7. Results demonstrate that HITS calculates authority nodes and hubness correctly.
Web page ranking algorithms rank the search results depending upon their relevance to the search query. These algorithms rank the search results in descending order of relevance to the query string being searched.
Fig. 10.2.2
The above figure shows the random surfer at node A proceeding with probability 1/3 to each of B, C and D. Consider a random surfer who ...
... such that π P = π.
The entries in the principal left eigenvector π are the steady-state probabilities of the random walk with teleporting, and thus the PageRank values for the corresponding web pages.
If π is the probability distribution of the surfer across the web pages, then after one step the distribution becomes π P, and in the steady state the distribution remains π, i.e. π P = π.
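A hedged power-iteration sketch of PageRank with teleporting follows; the adjacency-list input format and the teleport probability value of 0.15 are assumptions made for illustration, not prescriptions from the text.

def pagerank(graph, teleport=0.15, iterations=100):
    pages = list(set(graph) | {v for out in graph.values() for v in out})
    n = len(pages)
    pi = {p: 1.0 / n for p in pages}                 # start from a uniform distribution
    for _ in range(iterations):
        new_pi = {p: teleport / n for p in pages}    # teleport mass
        for u in pages:
            out = graph.get(u, [])
            if out:                                  # spread rank along out-links
                share = (1 - teleport) * pi[u] / len(out)
                for v in out:
                    new_pi[v] += share
            else:                                    # dangling page: spread uniformly
                for v in pages:
                    new_pi[v] += (1 - teleport) * pi[u] / n
        pi = new_pi
    return pi                                        # approximates the steady state pi P = pi

print(pagerank({"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A", "C"]}))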
Advantages of PageRank
1. Fighting spam : A page is important if the pages pointing to it are important. Since it is not easy for a Web page owner to add in-links to his/her page from other important pages, it is not easy to influence PageRank.
2. PageRank is a global measure and is query independent. The PageRank values of all the pages are computed and saved off-line rather than at query time.
Disadvantages of PageRank
1. Older pages may have a higher rank. This is because a new page, even if it has some very good content, may not have many in-links in its early state.
2. PageRank can be easily increased by "link farms".
GQ. Differentiate between PageRank and HITS.
PageRank | HITS
PageRank is combined with a traditional information retrieval score. It is computed off-line rather than at query time; the advantage is greater efficiency. | HITS attempts to find hubs and authorities. Since this computation is carried out at query time, it is not feasible for today's search engines, which need to handle millions of queries.
PageRank does not attempt to capture the distinction between hubs and authorities. It ranks pages just by authority. | HITS emphasizes mutual reinforcement between authority and hub webpages.
Can be unstable: changing a few links can lead to quite different rankings. | Can be unstable: changing a few links can lead to quite different rankings.
Apart from ranking link analysis can also be used for deciding which
web pages to add to the collection of web pages, i.e., which pages
to
crawl.
A crawler (or robot or spider) performs a traversal of the web graph with the goal of fetching high-quality pages. After fetching a page, it needs to decide which page out of the set of uncrawled pages to fetch next.
One approach is to crawl the pages with the highest number of links from the crawled pages first, in the order of PageRank. Link analysis was also used for a search-by-example approach to searching: given one relevant page, find pages related to it. We can use the HITS algorithm for this problem.
While link analysis is widely used in intelligence, it also has applications in many other domains. These include citation analysis, law enforcement, IT network security, fraud detection and investigation, and Anti-Money Laundering (AML).
Citation Analysis
Citation analysis is the study of citations among scientific papers and journals.
A standard measure in this field is the impact factor of a scientific journal, defined to be the average number of citations received by a paper in the given journal over the past two years. This type of voting by in-links can thus serve as a proxy for the collective attention that the scientific community pays to papers published in the journal.
Law Enforcement
Technology helps in increasing the productivity and efficiency of Law Enforcement Agencies. A strong partnership between police and technology would facilitate quick criminal investigations, greatly reduce crime, and help to uphold law and order.
IT Network Security
Network communication has become an important tool to ensure the efficient operation of modern society. The retrieval of network communication information is mainly based on the annotation and extraction of information characteristics in network communication, so as to carry out retrieval matching.
The key to its application is to organize the information organically.
Network communication information retrieval is not only a fast and effective way to obtain the required information, but also has important research value and a role in the security of network communication.
Fraud Detection and Investigation
Fraud detection is of paramount importance for banks and other companies that deal with a significant number of financial transactions and are therefore at higher risk of suffering from financial fraud. However, other sectors such as e-commerce companies, credit card companies, electronic payment platforms, and B2C fintech companies also need to employ fraud detection to prevent or limit financial fraud.
The most common applications of fraud detection include account-related fraud and payment and transaction fraud. Account fraud is further divided into new account fraud and account takeover fraud. In new account fraud, new accounts are created by using fake identities.
Such frauds can be identified by using the patterns of various devices and session indicators for detecting fake identities.
Anti-Money Laundering (AML)
The purpose of the AML rules is to help detect and report suspicious
activity including the predicate offenses to money laundering and
terrorist financing, such as securities fraud and market manipulation.
The importance of AML in banking and other industries that use it
comes down to protecting business operations and the economy as well
as upholding your moral responsibility. Specifically, compliance with AML allows institutions to avoid sanctions and fines, save money, and prevent criminal activity.
Chapter Ends...
UNIT 3

CHAPTER 11 : Crawling and Near-Duplicate Page Detection

Syllabus
Crawling and Near-Duplicate Page Detection : Web page crawling techniques: breadth-first, depth-first, focused crawling, Near-duplicate page detection algorithms, Handling dynamic web content during crawling.
11.1 WEB PAGE CRAWLING TECHNIQUES : BREADTH-FIRST, DEPTH-FIRST
A web crawler is a computer program that browses the World Wide Web in a sequenced and automated manner.
A crawler, also referred to as a spider, can be used for accessing web pages from the web server as per user-passed queries, commonly for a search engine.
A web crawler also uses the sitemap protocol for crawling web pages.
Crawling the web is not just a programming task, but an algorithm design and system design challenge, because the web content is very large.
If the desired URL is found, then the search terminates. If it is not, the search proceeds down to the next level and the process is repeated until the goal is reached.
It uses the boundary as a FIFO queue, crawling links in the order in which they are encountered. The Breadth First Search algorithm is generally used where the objective lies in the shallower parts of a deep tree.
The time complexity of breadth first search can be expressed as O(|V| + |E|), since every vertex and every edge will be explored in the worst case, where |V| is the number of vertices and |E| is the number of edges in the graph.
BreadthFirst(startingUrls)
    for each url in startingUrls
        Enqueue(Boundary, url)
    do
    {
        url = Dequeue(Boundary)
        page = Fetch(url)
        Visited = Visited + 1
        Enqueue(Boundary, ExtractLinks(page))
    } while not Empty(Boundary)
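The breadth-first pseudocode above can also be expressed as a small runnable Python sketch. The helpers fetch(url) and extract_links(page) are assumed to be supplied by the caller (they are placeholders, not library functions), and the max_pages limit is added only to keep the example bounded.

from collections import deque

def breadth_first_crawl(seed_urls, fetch, extract_links, max_pages=100):
    boundary = deque(seed_urls)          # FIFO queue: links crawled in discovery order
    visited = set()
    while boundary and len(visited) < max_pages:
        url = boundary.popleft()
        if url in visited:
            continue
        page = fetch(url)
        visited.add(url)
        for link in extract_links(page):
            if link not in visited:
                boundary.append(link)    # enqueue newly discovered links
    return visited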
DFS(G, v)
    label v as explored
    for each edge e incident on v do
        if edge e is unexplored then
            w = G.opposite(v, e)
            if vertex w is unexplored then
                label e as discovery edge
                recursively call DFS(G, w)
            else
                label e as back edge
11.2 FOCUSED CRAWLING
Architecture of Focused Crawling

GQ. Explain in detail the architecture of focused crawling.

The URL queue contains a list of unvisited URLs maintained by the crawler.
The web page downloader fetches URLs from the URL queue and downloads the corresponding pages from the internet.
The parser and extractor extracts information such as the terms and the hyperlink URLs from a downloaded page.
The relevance calculator calculates the relevance of a page with respect to the topic and assigns a score to the URLs extracted from the page.
The topic filter analyzes whether the content of a parsed page is related to the topic or not. If the page is relevant, the URLs extracted from it will be added to the URL queue, otherwise they are added to the irrelevant table.
A focused crawling algorithm loads a page and extracts the links.
Fig. 11.2.1 : Architecture of a focused crawler (URL queue, web page downloader, parser and extractor, relevance calculator, topic filter)
The main types of focused crawlers are:
1. Classic focused crawler
2. Semantic crawler
3. Learning crawler
The k-shingles of a document d are defined to be the set of all consecutive sequences of k terms in d.
As an example, consider the following text: "a rose is a rose is a rose". The 4-shingles for this text (k = 4 is a typical value used in the detection of near-duplicate web pages) are: "a rose is a", "rose is a rose" and "is a rose is".
The first two of these shingles each occur twice in the text. Intuitively, two documents are near duplicates if the sets of shingles generated from them are nearly the same. We now make this intuition precise, then develop a method for efficiently computing and comparing the sets of shingles for all web pages.
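A small sketch of shingling and the Jaccard overlap used below; the whitespace tokenisation and the example second document are simplifying assumptions for illustration only.

def shingles(text, k=4):
    # the set of all sequences of k consecutive terms in the text
    terms = text.lower().split()
    return {" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1)}

def jaccard(s1, s2):
    # |intersection| / |union| of the two shingle sets
    return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 1.0

a = shingles("a rose is a rose is a rose")
b = shingles("a rose is a rose is a flower")
print(sorted(a))        # the three 4-shingles listed above
print(jaccard(a, b))    # near-duplicate documents give a value close to 1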
Let S(d_j) denote the set of shingles of document d_j. Recall the Jaccard coefficient, which measures the degree of overlap between the sets S(d_1) and S(d_2) as |S(d_1) ∩ S(d_2)| / |S(d_1) ∪ S(d_2)|; denote this by J(S(d_1), S(d_2)).
Consider the rows of a 0/1 matrix whose columns correspond to the sets S_j1 and S_j2. There are four types of rows: those with 0's in both columns, those with a 1 in S_j1 and a 0 in S_j2, those with a 0 in S_j1 and a 1 in S_j2, and finally those with 1's in both columns. Indeed, the first four rows of Figure 19.9 exemplify all of these four types of rows.
Figure 19.9 : Two sets S_j1 and S_j2; their Jaccard coefficient is 2/5.
variable parameters that are provided to the server that delivers the
content. The parameters may be already present in the URL itself or they
may be the result of user input.
A dynamic URL typically results from a search of a database-driven website or the URL of a website that runs a script.
In contrast to static URLs, in which the contents of the webpage do not change unless the changes are coded into the HTML, dynamic URLs are typically generated from specific queries to a website's database.
The webpage has some fixed content, and some part of the webpage is a template to display the results of the query, where the content comes from the database that is associated with the website. This results in the page changing based on the data retrieved from the database per the dynamic parameter. Dynamic URLs often contain the following characters: ?, &, %, +, =, $, cgi.
However, sometimes a parameter in a dynamic URL may not result in modifying the page content in any way.
One of the parameters of the example dynamic URL above is sessionid, followed by a corresponding value that is unique to a user. The "sessionid" parameter is used by the website to track the user during a particular session in order to tailor the user's experience based on knowledge obtained about what actions the user has made during the session. The "sessionid" may be inserted into the URL as a result of a user registering and logging into the website.
11.1.2 Depth-First
DFS goes to the deepest levels of the graph first and then covers the breadth, while BFS covers the breadth while gradually going deep.
Depth-first crawling explores a website by starting at the home page and then going down through the links to find new pages. This crawling finds new content quickly.
Focused crawlers collect pages about a particular topic. They rely on the fact that pages about a topic tend to have links to other pages on the same topic. If this were perfectly true, it would be possible to start a crawl at one on-topic page and then crawl all pages on that topic just by following links from a single root page. In practice, a number of popular pages for a specific topic are typically used as seeds.
Focused crawlers require some automatic means for determining whether a page is about a particular topic.
Text classifiers are tools that can make this kind of distinction. Once a page is downloaded, the crawler uses the classifier to decide whether the page is on topic. If it is, the page is kept, and links from the page are used to continue the crawl.
1. Classic focused crawlers
Classic focused crawlers guide the search towards pages of interest by taking as input the user query which describes the topic.
They assign priorities to the links based on the topic of the query, and the pages with high priority are downloaded first.
These priorities are computed on the basis of similarity between the query topic and the page.
2. Semantic crawlers
Semantic crawlers are a variation of classic focused crawlers.
To compute topic-to-page relevance, download priorities are assigned to pages by applying semantic similarity criteria: the sharing of conceptually similar terms defines the relevance of a page to the topic.
An ontology is used to define the conceptual similarity between the terms.
3. Learning crawlers
Learning crawlers use a training process to guide the crawling process and to assign visit priorities to web pages.
A learning crawler is supplied with a training set which consists of relevant and not relevant Web pages in order to train the crawler.
Links extracted from web pages classified as relevant to the topic are assigned higher visit priorities.
Methods based on context graphs and Hidden Markov Models take into account not only the page content but also the link structure of the Web page and the probability that a given page will lead to a relevant page.
Figure 19.8 : Illustration of shingle sketches. Two documents go through four stages of shingle sketch computation. In the first step (top row), we apply a 64-bit hash to each shingle from each document to obtain H(d_1) and H(d_2). Next, we apply a random permutation Π to permute H(d_1) and H(d_2), obtaining Π(d_1) and Π(d_2). The third row shows only Π(d_1) and Π(d_2), while the bottom row shows the minimum values x_1^Π and x_2^Π for each document.
Denote by Π(S_j) the column that results from applying Π to the j-th column. Finally, let x_j be the index of the first row in which the column Π(S_j) has a 1. We then prove that, for any two columns j_1 and j_2, the probability that x_j1 = x_j2 equals the Jaccard coefficient of S_j1 and S_j2.
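A hedged sketch of the min-hash estimate described above. Instead of explicit random permutations it uses a family of random linear hash functions (a standard stand-in), and Python's built-in hash() for the 64-bit shingle fingerprints; both are illustrative choices, not the exact construction in the text.

import random

def minhash_signature(shingle_set, num_hashes=200, seed=0):
    rng = random.Random(seed)
    p = (1 << 61) - 1                                   # a large prime modulus
    params = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(num_hashes)]
    values = [hash(s) & 0xFFFFFFFFFFFFFFFF for s in shingle_set]   # shingle fingerprints
    # for each "permutation" keep the minimum permuted value over the document's shingles
    return [min((a * v + b) % p for v in values) for a, b in params]

def estimated_jaccard(sig1, sig2):
    # the fraction of positions where the minima agree estimates J(S1, S2)
    return sum(1 for x, y in zip(sig1, sig2) if x == y) / len(sig1)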
determine whether the content of the particular webpage is truly unique. However, such an approach is still prone to error (i.e., not all duplicates are identified as duplicates).
Chapter Ends...
UNIT 3

CHAPTER 13 : Cross-Lingual and Multilingual Retrieval

Syllabus
Cross-Lingual and Multilingual Retrieval : Challenges and ...
13.1 CROSS-LINGUAL RETRIEVAL OR CROSS-LINGUAL INFORMATION RETRIEVAL AND MULTILINGUAL RETRIEVAL OR MULTILINGUAL INFORMATION RETRIEVAL (CROSS-LINGUAL SEARCH AND MULTILINGUAL SEARCH)

13.1.1 Cross-Lingual Retrieval or Cross-Lingual Information Retrieval
Cross-Lingual Information Retrieval (CLIR) is a retrieval task in which search queries and candidate documents are written in different languages. CLIR can be very useful in some scenarios.
Cross-language information retrieval refers more specifically to the use case where users formulate their information need in one language and the system retrieves relevant documents in another.
1. Dictionary-based
In dictionary-based query translation, the query will be processed linguistically, and only keywords are translated using Machine Readable Dictionaries (MRDs). MRDs are electronic versions of printed dictionaries, either in the general domain or in a specific domain. Translating the query using dictionaries is much faster and simpler than translating the documents.
Some common problems associated with dictionary-based translation:
1. Untranslatable words (like new compound words, proper names, spelling variants, and special terms): Not every form of a word used in a query is always found in the dictionary. Sometimes problems occur in translating different compound words (formed by combination of new words) due to the unavailability of their proper translation in the dictionary.
2. Word forms not found in dictionaries.
3. Lexical ambiguity in source and target languages: Relevant forms of lexical meaning for information retrieval are 1) homonymous and 2) polysemous words.
Two words are homonymous if they have at least two different meanings and the senses of the words are unrelated. E.g. "She will park the car so we can walk in the park."
Park – the action of moving a vehicle to a place, usually a car park.
Park – a public area close to nature.
4. This is the simplest technique, which literally uses a dictionary to retrieve information in language(s) other than the one used for the query.
5. Unfortunately, it has a few but very serious drawbacks, most notably the issue of words having different meanings, which raises the question of accuracy.
GQ. Discuss the challenges in Cross-Lingual Information Retrieval (CLIR).

Phrase identification and translation: Identifying phrases in a limited context and translating them as a whole entity rather than by individual word translation is difficult.
Translating/transliterating a term: There are ambiguous names which need to be transliterated instead of translated. For example, "Bhaskar" (Sun) in Marathi refers to a person's name as well as the sun. Detecting these cases based on the available context is a challenge.
Transliteration errors: Errors during transliteration might end up fetching the wrong word in the target language.
Dictionary coverage: For translations using a bi-lingual dictionary, the exhaustiveness of the dictionary is an important criterion for the performance of the system.
Font: Many documents on the web are not in Unicode format. These documents need to be converted to Unicode for further processing and storage.
1. Morphological analysis (different for different languages).
2. Out-of-Vocabulary (OOV) problems: New words get added to a language which may not be recognized by the system.
GQ. Briefly explain the concept of Machine Translation (MT) for IR.
GQ. Explain the different approaches of Machine Translation.

... the quantity of the source–target language dictionaries, morphological analysis, text processing software, and word-by-word translation with minor grammatical adjustments on word order and morphology.
Transfer-based MT
The source text is first analyzed into a language-specific representation. An equivalent representation (with the same level of abstraction) is then generated for the target language using bilingual dictionaries and grammar rules.
These systems have three major components:
Analysis
Analysis of the source text is done based on linguistic information such as morphology, part-of-speech, syntax, semantics, etc. Heuristics as well as algorithms are applied to parse the source language and derive the syntactic structure of the text to be translated (for a language pair of the same family, for example Tamil and Telugu, which are siblings of the same family, i.e., Dravidian languages), or the semantic structure (for a language pair of different families, e.g., Hindi from the Indo-Aryan family and Telugu from the Dravidian family).
Transfer
The syntactic/semantic structure of the source language is then transferred into the syntactic/semantic structure of the target language.
Synthesis (also known as Generation)
This module replaces the constituents in the source language with their target-language equivalents. This approach, however, depends on the language pair involved.
Interlingua based MT
Phrase-based Machine Translation
A more accurate SMT approach, called phrase-based translation, was introduced, where each source and target sentence is divided into separate phrases instead of words before translation.
The alignment between the phrases in the input and output sentences normally follows certain patterns, which is very similar to word-based translation. Even though the phrase-based models result in better performance than word-based translation, they did not improve the model of sentence order patterns.
The alignment model is based on flat reordering patterns, and experiments show that this reordering technique may perform well with local phrase orders but not as well with long sentences and complex orders.
Hierarchical phrase-based Machine Translation
By considering the drawbacks of the previous two methods, a more sophisticated SMT approach was developed, called the hierarchical phrase-based model.
The advantage of this approach is that hierarchical phrases have recursive structures instead of simple phrases. This higher level of abstraction further improved the accuracy of the SMT system.
Hybrid Machine Translation Approach
By taking advantage of both statistical and rule-based translation methodologies, a new approach was developed, called the hybrid-based approach, which has proven to have better efficiency in the area of MT systems. At present, several governmental and private sector MT systems use this hybrid-based approach to develop translation from the source to the target language, based on both rules and statistics.
The hybrid approach can be used in a number of different ways. In some cases, translations are performed in the first stage using a rule-based approach, followed by adjusting or correcting the output using statistical information. In the other way, rules are used to pre-process the input data as well as post-process the output of a statistical-based translation system. This technique is better than the previous one and has ...
If we put all the documents into a mixed collection, the first question is how to distinguish words in different languages, especially for homographs such as "but" in English and "but" in French.
We propose the following solution: associate a language tag with every indexing term. When a query is submitted to the system and the user indicates the languages of interest, the original query is translated separately into all these languages. All the translations, and the original query, will be grouped into a large query expression for every language of interest.
One possible advantage of this approach is that the weights of index terms in different languages may be more comparable, because they are determined in the same way. Although the weights may still be unbalanced because of the unbalanced occurrences of index terms in the document collection, the problem is much less severe than if document collections are processed separately.
Another advantage results from the removal of the problematic merging step. The retrieval result naturally contains answers in different languages. One may expect a higher effectiveness.
This approach contains the following five main steps:
1. Language identification
This step aims to identify the language of each document, so that the document can be submitted to the appropriate language-dependent pre-processing.
Nowadays, automatic language identification is no longer a difficult problem. There are systems that are able to determine the language accurately using statistical language models.
2. Language-dependent preprocessing
Each document is then submitted to a language-dependent pre-processing. This includes the following steps:
... secondary user interaction, for instance the user copying the result URL, is considered a successful result, as is copy/pasting from the snippet.
Zero result rate
Zero result rate (ZRR) is the ratio of Search Engine Results Pages (SERPs) which returned with zero results. The metric either indicates a recall issue, or that the information being searched for is not in the index.
Offline metrics
Offline metrics are generally created from relevance judgment sessions where the judges score the quality of the search results. Both binary (relevant/non-relevant) and multi-level (e.g., relevance from 0 to 5) scales can be used to score each document returned in response to a query. In practice, queries may be ill-posed, and there may be different shades of relevance.
Precision
Precision is the fraction of the documents retrieved that are relevant to the user's information need:
Precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|
Recall
Recall is the fraction of the documents relevant to the query that are successfully retrieved:
Recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|
In binary classification, recall is often called sensitivity, so it can be looked at as the probability that a relevant document is retrieved by the query.
F-measure
The F-measure is a single measure derived from precision and recall that measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as to precision:
F_β = (1 + β²) P R / (β² P + R)
For β = 1 this is the harmonic mean of precision and recall, F_1 = 2PR / (P + R). The F-measure is based on van Rijsbergen's effectiveness measure E, and their relationship is F_β = 1 − E.
F-measure can be a better single metric when compared to precision and recall: both precision and recall give different information that can complement each other when combined. If one of them excels more than the other, F-measure will reflect it.
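A short sketch of these three set-based measures, using made-up document-id sets purely as an example:

def precision_recall_f1(retrieved, relevant):
    tp = len(retrieved & relevant)                      # relevant AND retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(precision_recall_f1({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d9"}))
# precision 0.5, recall 0.666..., F1 about 0.571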
Average precision
Precision and recall are single-value metrics based on the whole list of documents returned by the system. For systems that return a ranked sequence of documents, it is desirable to also consider the order in which the returned documents are presented. By computing a precision and recall at every position in the ranked sequence of documents, one can plot a precision–recall curve, plotting precision p(r) as a function of recall r. Average precision computes the average value of p(r) over the interval from r = 0 to r = 1:
AveP = ∫₀¹ p(r) dr
In practice this integral is replaced with a finite sum over every position in the ranked sequence of documents:
AveP = Σ_{k=1}^{n} P(k) Δr(k)
where k is the rank in the sequence of retrieved documents, n is the number of retrieved documents, P(k) is the precision at cut-off k in the list, and Δr(k) is the change in recall from items k−1 to k.
Equivalently,
AveP = ( Σ_{k=1}^{n} P(k) × rel(k) ) / (total number of relevant documents)
where rel(k) is an indicator function equalling 1 if the item at rank k is a relevant document, and zero otherwise.
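The finite-sum form of AveP above, and MAP over a set of queries, can be sketched as follows; the (ranked list, relevant set) input format is an assumption made for illustration.

def average_precision(ranked, relevant):
    # sum of P(k) * rel(k) over ranks, divided by the total number of relevant documents
    hits, score = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            score += hits / k
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    # runs: list of (ranked_list, relevant_set) pairs, one per query
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

print(average_precision(["d1", "d5", "d2"], {"d1", "d2"}))   # (1/1 + 2/3) / 2 = 0.833...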
Precision at k
For modern information retrieval, recall is no longer a meaningful metric, as many queries have thousands of relevant documents, and few users will be interested in reading all of them.
Precision at k documents (P@k) is still a useful metric (e.g., P@10 or "Precision at 10" corresponds to the number of relevant results among the top 10 retrieved documents), but fails to take into account the positions of the relevant documents among the top k.
Another shortcoming is that on a query with fewer relevant results than k, even a perfect system will have a score less than 1. It is easier to score manually since only the top k results need to be examined to determine if they are relevant or not.
R-precision
MAP
Mean average precision (MAP) for a set of queries is the mean of the average precision scores for each query:
MAP = ( Σ_{q=1}^{Q} AveP(q) ) / Q
where Q is the number of queries.
DCG
Discounted Cumulative Gain (DCG) uses a graded relevance scale: highly relevant documents appearing lower in a search result list are penalized, as the graded relevance value is reduced logarithmically proportional to the position of the result. The DCG accumulated at a particular rank position p is defined as:
DCG_p = Σ_{i=1}^{p} rel_i / log_2(i + 1)
Since result sets may vary in size among different queries or systems, to compare performances the normalised version of DCG uses an ideal DCG. To this end, it sorts the documents of the result list by relevance, producing an ideal DCG at position p (IDCG_p), which normalizes the score:
nDCG_p = DCG_p / IDCG_p
The nDCG values for all queries can be averaged to obtain a measure of the average performance of a ranking algorithm. Note that for a perfect ranking algorithm, the DCG_p will be the same as the IDCG_p, producing an nDCG of 1.0. All nDCG calculations are then relative values on the interval 0.0 to 1.0 and so are cross-query comparable.
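A direct transcription of the DCG and nDCG formulas above into Python; the graded relevance list used in the example is invented for illustration.

import math

def dcg_at_p(relevances, p):
    # DCG_p = sum over i = 1..p of rel_i / log2(i + 1)
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:p], start=1))

def ndcg_at_p(relevances, p):
    # normalize by the ideal DCG, obtained by sorting the relevances in decreasing order
    ideal = dcg_at_p(sorted(relevances, reverse=True), p)
    return dcg_at_p(relevances, p) / ideal if ideal else 0.0

print(ndcg_at_p([3, 2, 3, 0, 1, 2], 6))   # equals 1.0 only for a perfect ranking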
Chapter Ends...
UNIT 3

CHAPTER 14 : User-based Evaluation

Syllabus

GQ. Explain the concept of user-based evaluation in IR.

Evaluation is highly important for designing, developing and maintaining effective information retrieval or search systems, as it allows the measurement of how successfully an information retrieval system meets its goal of helping users fulfil their information needs.
The success of an IR system may be judged by a range of criteria
including relevance, speed, user satisfaction, usability, efficiency and
reliability. However, the most important factor in determining a system's
effectiveness for users is the overall relevance of results retrieved in
response to a query.
User-based evaluation is evaluation through user participation, that is,
evaluation that involves the people for whom the system is intended: the
users.
User-based evaluation techniques include: experimental methods,
observational methods, questionnaires, interviews, and physiological
monitoring methods.
The most common user-based evaluation methods are user studies and surveys; they are especially useful for understanding users' information needs.
14.1.2 Surveys
This involves questioning users and obtaining answers directly from them about their behaviour, attributes, values, conditions and/or preferences.
This is by far the most frequently used method in user studies. It at times also leads to somewhat biased results.
Surveys can be both qualitative and quantitative, based on the format of the questions used.
Core Subjects
USCS601 : Data Science
USCS602 : Cloud Computing and Web Services
Skill Enhancement Elective I (SEE I)
USCS6031 : Wireless and Sensor Networks
USCS6032 : Information Retrieval
Skill Enhancement Elective II (SEE II)
USCS6041 : Data Mining & Warehousing
Generic Elective
USCS6051 : Customer Relationship Management
USCS6052 : Cyber Laws and IPR