
Unit-IV

User Search Techniques:


Search Statements and Binding:
Search Statements:

 Represent the information need of users, specifying the concepts they wish to locate.

 Can be generated using Boolean logic and/or Natural Language.

 May allow users to assign weights to different concepts based on their importance.

 Binding: the process of transforming the abstract information need into successively more specific forms (e.g., via the user's vocabulary or past experiences).

 The goal is to logically subset the total item space to find relevant information.

 Examples of statistics used for weighting in search are the Document Frequency and Total
Frequency of a specific term.

 Document Frequency (DF): How many documents in the database contain a specific term.

 Total Frequency (TF): How often a specific term appears across all documents in the
database.

 These statistics are dynamic and depend on the current contents of the database being
searched.
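
As an illustration, a minimal Python sketch (the toy documents and variable names are our own, not from the text) that computes both statistics over a small collection:

```python
# Minimal sketch: computing Document Frequency (DF) and Total Frequency (TF)
# for every term in a toy document collection.
from collections import Counter

documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

df = Counter()  # number of documents containing each term
tf = Counter()  # total occurrences of each term across all documents

for doc in documents:
    tokens = doc.split()
    tf.update(tokens)       # every occurrence counts toward TF
    df.update(set(tokens))  # each document counts at most once toward DF

print(df["cat"], tf["the"])  # DF of "cat" is 2; TF of "the" is 4
```

Because both values are recomputed from whatever the collection currently contains, they change as items are added or removed, which is exactly why they are called dynamic.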

Levels of Binding:

1. User's Binding: The initial stage where users define concepts based on their vocabulary and
understanding.

2. Search System Binding: The system translates the query into its own metalanguage (e.g.,
statistical systems, natural language systems, concept systems).

o Statistical Systems: Process tokens based on frequency.

o Natural Language Systems: Parse syntactical and discourse semantics.

o Concept Systems: Map the search statement to specific concepts used in indexing.

3. Database Binding: The final stage where the search is applied to a specific database using
statistics (e.g., Document Frequency, Total Frequency).

o Concept Indexing: Concepts are derived from statistical analysis of the database.

o Natural Language Indexing: Uses corpora-independent algorithms.

Search Statement Length:

 Longer search queries improve the ability of IR systems to find relevant items.

 Selective Dissemination of Information (SDI) systems use long profiles (75-100 terms).

 In large systems, typical ad hoc queries are around 7 terms.

 Internet Queries: Often very short (1-2 words), reducing effectiveness.

 Short search queries highlight the need for automatic search expansion algorithms.
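
A minimal sketch of such expansion (the synonym table is hypothetical, not any particular system's resource):

```python
# Minimal sketch: expanding a short query with related terms before search.
# The SYNONYMS table is an invented stand-in for a thesaurus or semantic net.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "buy": ["purchase"],
}

def expand_query(terms: list[str]) -> list[str]:
    """Return the original terms plus any known related terms."""
    expanded = list(terms)
    for term in terms:
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query(["buy", "car"]))
# ['buy', 'car', 'purchase', 'automobile', 'vehicle']
```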

Similarity Measures and Ranking:


A variety of different similarity measures can be used to calculate the similarity between the item
and the search statement. A characteristic of a similarity formula is that the results of the formula
increase as the items become more similar. The value is zero if the items are totally dissimilar.

The simplest formula uses the summation of the product of corresponding terms of two items, treating the index as a vector:

SIM(Item_i, Item_j) = Σ_k (Term_i,k × Term_j,k)

If Item_j is replaced with the query, the same formula generates the similarity between every item and the query. The problem with this simple measure is the normalization needed to account for variances in the length of items. Additional normalization is also used so that the final results fall between zero and +1.
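
A minimal sketch of this sum-of-products measure (the toy vectors are our own), illustrating the length-variance problem that normalization addresses:

```python
# Minimal sketch: unnormalized sum-of-products similarity between two
# term-weight vectors. Longer items accumulate larger sums, which is the
# length-variance problem described above.
def sim(item_i: list[float], item_j: list[float]) -> float:
    return sum(a * b for a, b in zip(item_i, item_j))

query = [1.0, 1.0, 0.0]      # weights for terms t1, t2, t3
short_doc = [1.0, 1.0, 0.0]
long_doc = [3.0, 3.0, 2.0]   # same terms present, just higher weights

print(sim(query, short_doc))  # 2.0
print(sim(query, long_doc))   # 6.0 -- ranks higher purely through length
```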

Similarity Measures:

1) Cosine Similarity

Vector-Based:

Cosine similarity measures the cosine of the angle between two non-zero vectors in a multi-dimensional space. It outputs a value between -1 and 1, with 1 indicating identical vectors (with the non-negative term weights typical in IR, the value lies between 0 and +1).

Value = 0: orthogonal vectors (no terms in common)

Value = 1: coincident vectors (identical direction)

Efficient Computation:

Can be calculated efficiently using dot products, making it a popular choice for IR systems.
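
A minimal sketch of the computation (pure Python, reusing the toy vectors from the previous sketch):

```python
# Minimal sketch: cosine similarity computed from dot products.
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# The longer document no longer wins purely through length:
print(cosine([1.0, 1.0, 0.0], [1.0, 1.0, 0.0]))  # 1.0 (coincident)
print(cosine([1.0, 1.0, 0.0], [3.0, 3.0, 2.0]))  # ~0.90
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)
```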
2) Jaccard Similarity:

Set-Based:

The Jaccard similarity coefficient measures the similarity between two finite sets as the size of their intersection divided by the size of their union: J(A, B) = |A ∩ B| / |A ∪ B|.

Range of Values:

The value grows as the common elements increase and always lies in the range 0 to +1: 0 when the sets are disjoint and 1 when they are identical.

Applications:

Useful for comparing the overlap between documents, tags, or other categorical data.

3) Dice Method:

 The Dice measure simplifies the denominator of the Jaccard measure (summing the two set sizes rather than taking the union) and introduces a factor of 2 in the numerator: D(A, B) = 2|A ∩ B| / (|A| + |B|). Equivalently, dividing the factor of 2 into the denominator gives the intersection size divided by the average set size. Unlike Jaccard, the Dice normalization is invariant to the number of terms in common. As long as the vector values are the same, independent of their order, the Cosine and Dice normalization factors do not change.
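
A minimal sketch of both set-based coefficients (the toy term sets are our own):

```python
# Minimal sketch: set-based Jaccard and Dice coefficients as defined above.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b))

doc1 = {"information", "retrieval", "search"}
doc2 = {"information", "search", "ranking"}

print(jaccard(doc1, doc2))  # 2 common / 4 distinct overall = 0.5
print(dice(doc1, doc2))     # 2*2 / (3 + 3) = ~0.667
```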

Threshold in Similarity Measures:

 Without a threshold, use of a similarity algorithm returns the complete database as search results: many of the items have a similarity close or equal to zero (or the minimum value the similarity measure produces). For this reason, thresholds are usually associated with the search process. The threshold defines which items are placed in the resultant Hit file for the query. Thresholds are either a value that the similarity measure must equal or exceed, or a number that limits the number of items in the Hit file.
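
A minimal sketch of both kinds of threshold (the item IDs and scores are invented):

```python
# Minimal sketch: building a Hit file from (item, similarity) scores using
# either a similarity cutoff or a limit on the number of items.
scores = [("d1", 0.92), ("d2", 0.40), ("d3", 0.05), ("d4", 0.71), ("d5", 0.0)]

# 1) Value threshold: keep items whose similarity equals or exceeds a cutoff.
hit_file = [(d, s) for d, s in scores if s >= 0.4]

# 2) Count threshold: keep only the N highest-scoring items.
top_n = sorted(scores, key=lambda ds: ds[1], reverse=True)[:3]

print(hit_file)  # [('d1', 0.92), ('d2', 0.4), ('d4', 0.71)]
print(top_n)     # [('d1', 0.92), ('d4', 0.71), ('d2', 0.4)]
```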
Clustering Hierarchy:

 The items are stored in clusters that are represented by the centroid for each cluster. The
hierarchy is used in search by performing a top-down process. The query is compared to the
centroids “A” and “B.” If the results of the similarity measure are above the threshold, the
query is then applied to the nodes’ children. If not, then that part of the tree is pruned and
not searched. This continues until the actual leaf nodes that are not pruned are compared.
 The risk is that the average may not be similar enough to the query for continued search, but
specific items used to calculate the centroid may be close enough to satisfy the search.

In the example hierarchy, each letter at a leaf (bottom node) represents an item (i.e., K, L, M, N, D, E, F, G, H, P, Q, R, J), while the letters at the higher nodes (A, C, B, I) represent the centroids of their immediate children.
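
A minimal sketch of the top-down, pruning search (the tree shape, vectors, and threshold are invented; cosine is used as the similarity measure):

```python
# Minimal sketch: top-down search of a cluster hierarchy, pruning any
# subtree whose centroid falls below the threshold.
import math
from dataclasses import dataclass, field

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

@dataclass
class Node:
    name: str
    centroid: list                      # for a leaf, the item's own vector
    children: list = field(default_factory=list)

def search(node, query, threshold, hits):
    if cosine(node.centroid, query) < threshold:
        return                          # prune: never compare this subtree
    if not node.children:
        hits.append(node.name)          # unpruned leaf reached: an actual item
    for child in node.children:
        search(child, query, threshold, hits)

# Tiny two-level hierarchy: centroid "A" over items K and L.
tree = Node("A", [0.6, 0.4], [Node("K", [1.0, 0.0]), Node("L", [0.2, 0.8])])
hits = []
search(tree, [1.0, 0.1], 0.5, hits)
print(hits)  # ['K'] -- L is compared but falls below the threshold
```

Note the risk described above: had centroid A scored below the threshold, item K would have been pruned away even though K itself scores well above it.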
Hidden Markov Models Techniques:

 In HMMs the documents are considered unknown statistical processes that can generate output equivalent to the set of queries for which the document would be judged relevant.
 Another way to look at it is via the general definition that an HMM is defined by output produced by passing some unknown key via state transitions through a noisy channel. The observed output is the query, and the unknown keys are the relevant documents.

 The development of an HMM approach begins with applying Bayes' rule to the conditional probability of a document D being relevant to a query Q:

P(D is relevant | Q) = P(Q | D is relevant) × P(D is relevant) / P(Q)

 Applying Bayes' rule yields the posterior probability of a document being relevant given the query. This posterior probability is then used to make decisions on document relevance in HMMs. The goal is to find the most likely sequence of hidden states (relevant documents) that generate the observed output (the query).
 A Hidden Markov Model is defined by a set of states, a transition matrix defining the
probability of moving between states, a set of output symbols and the probability of the
output symbols given a particular state. The set of all possible queries is the output symbol
set and the Document file defines the states.
 Thus the HMM process traces itself through the states of a document (e.g., the words in the
document) and at each state transition has an output of query terms associated with the
new state.
 The biggest problem in using this approach is estimating the transition probability matrix and the output distribution (the queries that could cause hits) for every document in the corpus. If there were a large training database of queries and the relevant documents associated with them, with adequate coverage, the problem could be solved using Expectation-Maximization algorithms.
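
As a loose illustration only (the two-state mixture below is one common HMM retrieval formulation, not something specified above, and all counts and the mixture weight are toy choices), documents can be ranked by P(Q | D is relevant):

```python
# Loose sketch: score documents by P(Q | D is relevant) using a two-state
# mixture of a document language model and a general-corpus model.
from collections import Counter

docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat and the dog barked".split(),
}
corpus = [w for words in docs.values() for w in words]
corpus_counts = Counter(corpus)

LAMBDA = 0.7  # assumed probability of emitting from the document state

def p_query_given_doc(query, doc_words):
    doc_counts = Counter(doc_words)
    prob = 1.0
    for q in query:
        p_doc = doc_counts[q] / len(doc_words)         # document state
        p_gen = corpus_counts[q] / len(corpus)         # general-corpus state
        prob *= LAMBDA * p_doc + (1 - LAMBDA) * p_gen  # mixture emission
    return prob

query = ["cat", "mat"]
ranked = sorted(docs, key=lambda d: p_query_given_doc(query, docs[d]),
                reverse=True)
print(ranked)  # ['d1', 'd2'] -- only d1 contains both "cat" and "mat"
```

Here the fixed mixture weight stands in for the transition probabilities, and smoothing with the corpus model sidesteps the estimation problem raised above, at the cost of a much simpler model.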

Ranking Algorithms:

 A by-product of the use of similarity measures for selecting Hit items is a value that can be used in ranking the output. Ranking implies ordering the output from the items most likely to satisfy the query to the least likely. This reduces user overhead by allowing the most likely relevant items to be displayed first.
 The original Boolean systems returned items ordered by date of entry into the system versus
by likelihood of relevance to the user’s search statement. With the inclusion of statistical
similarity techniques into commercial systems and the large number of hits that originate
from searching diverse corpora, such as the Internet, ranking has become a common feature
of modern systems.
 In most of the commercial systems, heuristic rules are used to assist in the ranking of items.
Generally, systems do not want to use factors that require knowledge across the corpus
(e.g., inverse document frequency) as a basis for their similarity or ranking functions because
it is too difficult to maintain current values as the database changes and the added
complexity has not been shown to significantly improve the overall weighting process.
RetrievalWare System:

 RetrievalWare first uses indexes (inversion lists) to identify potential relevant items. It then
applies coarse grain and fine grain ranking. The coarse grain ranking is based on the
presence of query terms within items. In the fine grain ranking, the exact rank of the item is
calculated. The coarse grain ranking is a weighted formula that can be adjusted based on
completeness, contextual evidence or variety, and semantic distance.
 Completeness is the proportion of the number of query terms (or related terms if a query
term is expanded using the RetrievalWare semantic network/thesaurus) found in the item
versus the number in the query. It sets an upper limit on the rank value for the item. If
weights are assigned to query terms, the weights are factored into the value. Contextual
evidence occurs when related words from the semantic network are also in the item.
 Thus if the user has indicated that the query term “charge” has the context of “paying for an
object” then finding words such as “buy,” “purchase,” “debt” suggests that the term
“charge” in the item has the meaning the user desires and that more weight should be
placed in ranking the item. Semantic distance evaluates how close the additional words are
to the query term.
 Synonyms add additional weight; antonyms decrease weight. The coarse grain process
provides an initial rank to the item based upon existence of words within the item. Since
physical proximity is not considered in coarse grain ranking, the ranking value can be easily
calculated.
 Fine grain ranking considers the physical location of query terms and related words, using proximity factors in addition to the other three factors from coarse grain evaluation. If the related terms and query terms occur in close proximity (same sentence or paragraph), the item is judged more relevant. A factor is calculated that is maximal at adjacency and decreases as the physical separation increases (a sketch of both ranking stages follows this list).
 Although ranking produces a numeric score, most systems try to use other ways of indicating the rank value to the user as Hit lists are displayed. The scores have a tendency to be misleading and confusing to the user, since the differences between values may be very small or very large.
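
A loose sketch of the two-stage idea (the actual RetrievalWare formulas are not given here, so every weight and factor below is an assumption for illustration):

```python
# Loose sketch only: coarse grain ranking from term presence, then fine
# grain ranking from proximity. All weights are invented.
def coarse_rank(query_terms, related_terms, item_terms):
    """Presence-based score; completeness caps the achievable rank."""
    found = sum(1 for t in query_terms if t in item_terms)
    completeness = found / len(query_terms)
    context = sum(1 for t in related_terms if t in item_terms)
    evidence = min(0.1 * context, 0.3)      # bounded contextual-evidence bonus
    return completeness * (0.7 + evidence)  # never exceeds completeness

def proximity_factor(pos_a, pos_b):
    """Maximal at adjacency, decaying as word separation grows."""
    return 1.0 / max(1, abs(pos_a - pos_b))

def fine_rank(coarse, query_positions, related_positions):
    """Boost the coarse score by the best query/related-term proximity."""
    best = max((proximity_factor(q, r)
                for q in query_positions for r in related_positions),
               default=0.0)
    return coarse * (1.0 + best)

item = "you can purchase the item and charge it to your account".split()
coarse = coarse_rank(["charge"], ["purchase", "buy"], item)
fine = fine_rank(coarse, [item.index("charge")], [item.index("purchase")])
print(round(coarse, 2), round(fine, 2))  # 0.8 1.0
```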
