
Unit-IV

User Search Techniques:


Search Statements and Binding:
Search Statements:

 Represent the information need of users, specifying the concepts they wish to locate.

 Can be generated using Boolean logic and/or Natural Language.

 May allow users to assign weights to different concepts based on their importance.

 Binding: the process of transforming the abstract information need into successively more specific forms (e.g., via the user's vocabulary or past experiences).

 The goal is to logically subset the total item space to find relevant information.

 Examples of statistics used for weighting in search are the Document Frequency and Total
Frequency of a specific term.

 Document Frequency (DF): How many documents in the database contain a specific term.

 Total Frequency (TF): How often a specific term appears across all documents in the
database.

 These statistics are dynamic and depend on the current contents of the database being
searched.
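
As an illustration, a minimal Python sketch (the toy documents and variable names are our own, not from the text) that computes both statistics over a small collection:

```python
# Minimal sketch: computing Document Frequency (DF) and Total Frequency (TF)
# for every term in a toy document collection.
from collections import Counter

documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

df = Counter()  # number of documents containing each term
tf = Counter()  # total occurrences of each term across all documents

for doc in documents:
    tokens = doc.split()
    tf.update(tokens)       # every occurrence counts toward TF
    df.update(set(tokens))  # each document counts at most once toward DF

print(df["cat"], tf["the"])  # DF of "cat" is 2; TF of "the" is 4
```

Because both values are recomputed from whatever the collection currently contains, they change as items are added or removed, which is exactly why they are called dynamic.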

Levels of Binding:

1. User's Binding: The initial stage where users define concepts based on their vocabulary and
understanding.

2. Search System Binding: The system translates the query into its own metalanguage (e.g.,
statistical systems, natural language systems, concept systems).

o Statistical Systems: Process tokens based on frequency.

o Natural Language Systems: Parse syntactical and discourse semantics.

o Concept Systems: Map the search statement to specific concepts used in indexing.

3. Database Binding: The final stage where the search is applied to a specific database using
statistics (e.g., Document Frequency, Total Frequency).

o Concept Indexing: Concepts are derived from statistical analysis of the database.

o Natural Language Indexing: Uses corpora-independent algorithms.

Search Statement Length:

 Longer search queries improve the ability of IR systems to find relevant items.

 Selective Dissemination of Information (SDI) systems use long profiles (75-100 terms).

 In large systems, typical ad hoc queries are around 7 terms.

 Internet Queries: Often very short (1-2 words), reducing effectiveness.

 Short search queries highlight the need for automatic search expansion algorithms.
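
A minimal sketch of such expansion (the synonym table is hypothetical, not any particular system's resource):

```python
# Minimal sketch: expanding a short query with related terms before search.
# The SYNONYMS table is an invented stand-in for a thesaurus or semantic net.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "buy": ["purchase"],
}

def expand_query(terms: list[str]) -> list[str]:
    """Return the original terms plus any known related terms."""
    expanded = list(terms)
    for term in terms:
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query(["buy", "car"]))
# ['buy', 'car', 'purchase', 'automobile', 'vehicle']
```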

Similarity Measures and Ranking:


A variety of different similarity measures can be used to calculate the similarity between the item
and the search statement. A characteristic of a similarity formula is that the results of the formula
increase as the items become more similar. The value is zero if the items are totally dissimilar.

The simplest formula uses the summation of the product of corresponding terms of two items, treating the index as a vector:

SIM(Item_i, Item_j) = Σ_k (Term_i,k × Term_j,k)

If Item_j is replaced with the query, the same formula generates the similarity between every item and the query. The problem with this simple measure is the normalization needed to account for variances in the length of items. Additional normalization is also used so that the final results fall between zero and +1.
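
A minimal sketch of this sum-of-products measure (the toy vectors are our own), illustrating the length-variance problem that normalization addresses:

```python
# Minimal sketch: unnormalized sum-of-products similarity between two
# term-weight vectors. Longer items accumulate larger sums, which is the
# length-variance problem described above.
def sim(item_i: list[float], item_j: list[float]) -> float:
    return sum(a * b for a, b in zip(item_i, item_j))

query = [1.0, 1.0, 0.0]      # weights for terms t1, t2, t3
short_doc = [1.0, 1.0, 0.0]
long_doc = [3.0, 3.0, 2.0]   # same terms present, just higher weights

print(sim(query, short_doc))  # 2.0
print(sim(query, long_doc))   # 6.0 -- ranks higher purely through length
```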

Similarity Measures:

1) Cosine Similarity

Vector-Based:

Cosine similarity measures the cosine of the angle between two non-zero vectors in a multi-dimensional space. It outputs a value between -1 and 1, with 1 indicating identical vectors (with the non-negative term weights typical in IR, the value lies between 0 and +1).

Value = 0: orthogonal vectors (no terms in common)

Value = 1: coincident vectors (identical direction)

Efficient Computation:

Can be calculated efficiently using dot products, making it a popular choice for IR systems.
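
A minimal sketch of the computation (pure Python, reusing the toy vectors from the previous sketch):

```python
# Minimal sketch: cosine similarity computed from dot products.
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# The longer document no longer wins purely through length:
print(cosine([1.0, 1.0, 0.0], [1.0, 1.0, 0.0]))  # 1.0 (coincident)
print(cosine([1.0, 1.0, 0.0], [3.0, 3.0, 2.0]))  # ~0.90
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)
```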
2) Jaccard Similarity:

Set-Based:

The Jaccard similarity coefficient measures the similarity between two finite sets as the size of their intersection divided by the size of their union: J(A, B) = |A ∩ B| / |A ∪ B|.

Range of Values:

The value grows as the common elements increase and always lies in the range 0 to +1: 0 when the sets are disjoint and 1 when they are identical.

Applications:

Useful for comparing the overlap between documents, tags, or other categorical data.

3) Dice Method:

 The Dice measure simplifies the denominator of the Jaccard measure (summing the two set sizes rather than taking the union) and introduces a factor of 2 in the numerator: D(A, B) = 2|A ∩ B| / (|A| + |B|). Equivalently, dividing the factor of 2 into the denominator gives the intersection size divided by the average set size. Unlike Jaccard, the Dice normalization is invariant to the number of terms in common. As long as the vector values are the same, independent of their order, the Cosine and Dice normalization factors do not change.
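
A minimal sketch of both set-based coefficients (the toy term sets are our own):

```python
# Minimal sketch: set-based Jaccard and Dice coefficients as defined above.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b))

doc1 = {"information", "retrieval", "search"}
doc2 = {"information", "search", "ranking"}

print(jaccard(doc1, doc2))  # 2 common / 4 distinct overall = 0.5
print(dice(doc1, doc2))     # 2*2 / (3 + 3) = ~0.667
```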

Threshold in Similarity Measures:

 Without a threshold, use of a similarity algorithm returns the complete database as search results: many of the items have a similarity close or equal to zero (or the minimum value the similarity measure produces). For this reason, thresholds are usually associated with the search process. The threshold defines which items are placed in the resultant Hit file for the query. Thresholds are either a value that the similarity measure must equal or exceed, or a number that limits the number of items in the Hit file.
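
A minimal sketch of both kinds of threshold (the item IDs and scores are invented):

```python
# Minimal sketch: building a Hit file from (item, similarity) scores using
# either a similarity cutoff or a limit on the number of items.
scores = [("d1", 0.92), ("d2", 0.40), ("d3", 0.05), ("d4", 0.71), ("d5", 0.0)]

# 1) Value threshold: keep items whose similarity equals or exceeds a cutoff.
hit_file = [(d, s) for d, s in scores if s >= 0.4]

# 2) Count threshold: keep only the N highest-scoring items.
top_n = sorted(scores, key=lambda ds: ds[1], reverse=True)[:3]

print(hit_file)  # [('d1', 0.92), ('d2', 0.4), ('d4', 0.71)]
print(top_n)     # [('d1', 0.92), ('d4', 0.71), ('d2', 0.4)]
```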
Clustering Hierarchy:

 The items are stored in clusters that are represented by the centroid for each cluster. The
hierarchy is used in search by performing a top-down process. The query is compared to the
centroids “A” and “B.” If the results of the similarity measure are above the threshold, the
query is then applied to the nodes’ children. If not, then that part of the tree is pruned and
not searched. This continues until the actual leaf nodes that are not pruned are compared.
 The risk is that the average may not be similar enough to the query for continued search, but
specific items used to calculate the centroid may be close enough to satisfy the search.

In the example hierarchy, each letter at a leaf (bottom node) represents an item (i.e., K, L, M, N, D, E, F, G, H, P, Q, R, J), while the letters at the higher nodes (A, C, B, I) represent the centroids of their immediate children.
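
A minimal sketch of the top-down, pruning search (the tree shape, vectors, and threshold are invented; cosine is used as the similarity measure):

```python
# Minimal sketch: top-down search of a cluster hierarchy, pruning any
# subtree whose centroid falls below the threshold.
import math
from dataclasses import dataclass, field

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

@dataclass
class Node:
    name: str
    centroid: list                      # for a leaf, the item's own vector
    children: list = field(default_factory=list)

def search(node, query, threshold, hits):
    if cosine(node.centroid, query) < threshold:
        return                          # prune: never compare this subtree
    if not node.children:
        hits.append(node.name)          # unpruned leaf reached: an actual item
    for child in node.children:
        search(child, query, threshold, hits)

# Tiny two-level hierarchy: centroid "A" over items K and L.
tree = Node("A", [0.6, 0.4], [Node("K", [1.0, 0.0]), Node("L", [0.2, 0.8])])
hits = []
search(tree, [1.0, 0.1], 0.5, hits)
print(hits)  # ['K'] -- L is compared but falls below the threshold
```

Note the risk described above: had centroid A scored below the threshold, item K would have been pruned away even though K itself scores well above it.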
Hidden Markov Models Techniques:

 In HMMs the documents are considered unknown statistical processes that can generate output equivalent to the set of queries for which the document would be judged relevant.
 Another way to look at it is via the general definition that an HMM is defined by output produced by passing some unknown key via state transitions through a noisy channel. The observed output is the query, and the unknown keys are the relevant documents.

 The development of an HMM approach begins with applying Bayes' rule to the conditional probability of a document D being relevant to a query Q:

P(D is relevant | Q) = P(Q | D is relevant) × P(D is relevant) / P(Q)

 Applying Bayes' rule yields the posterior probability of a document being relevant given the query. This posterior probability is then used to make decisions on document relevance in HMMs. The goal is to find the most likely sequence of hidden states (relevant documents) that generate the observed output (the query).
 A Hidden Markov Model is defined by a set of states, a transition matrix defining the
probability of moving between states, a set of output symbols and the probability of the
output symbols given a particular state. The set of all possible queries is the output symbol
set and the Document file defines the states.
 Thus the HMM process traces itself through the states of a document (e.g., the words in the
document) and at each state transition has an output of query terms associated with the
new state.
 The biggest problem in using this approach is estimating the transition probability matrix and the output distribution (the queries that could cause hits) for every document in the corpus. If there were a large training database of queries and the relevant documents associated with them, with adequate coverage, the problem could be solved using Expectation-Maximization algorithms.
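
As a loose illustration only (the two-state mixture below is one common HMM retrieval formulation, not something specified above, and all counts and the mixture weight are toy choices), documents can be ranked by P(Q | D is relevant):

```python
# Loose sketch: score documents by P(Q | D is relevant) using a two-state
# mixture of a document language model and a general-corpus model.
from collections import Counter

docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat and the dog barked".split(),
}
corpus = [w for words in docs.values() for w in words]
corpus_counts = Counter(corpus)

LAMBDA = 0.7  # assumed probability of emitting from the document state

def p_query_given_doc(query, doc_words):
    doc_counts = Counter(doc_words)
    prob = 1.0
    for q in query:
        p_doc = doc_counts[q] / len(doc_words)         # document state
        p_gen = corpus_counts[q] / len(corpus)         # general-corpus state
        prob *= LAMBDA * p_doc + (1 - LAMBDA) * p_gen  # mixture emission
    return prob

query = ["cat", "mat"]
ranked = sorted(docs, key=lambda d: p_query_given_doc(query, docs[d]),
                reverse=True)
print(ranked)  # ['d1', 'd2'] -- only d1 contains both "cat" and "mat"
```

Here the fixed mixture weight stands in for the transition probabilities, and smoothing with the corpus model sidesteps the estimation problem raised above, at the cost of a much simpler model.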

Ranking Algorithms:

 A by-product of the use of similarity measures for selecting Hit items is a value that can be used in ranking the output. Ranking implies ordering the output from the items most likely to satisfy the query to the least likely. This reduces user overhead by allowing the most likely relevant items to be displayed first.
 The original Boolean systems returned items ordered by date of entry into the system versus
by likelihood of relevance to the user’s search statement. With the inclusion of statistical
similarity techniques into commercial systems and the large number of hits that originate
from searching diverse corpora, such as the Internet, ranking has become a common feature
of modern systems.
 In most of the commercial systems, heuristic rules are used to assist in the ranking of items.
Generally, systems do not want to use factors that require knowledge across the corpus
(e.g., inverse document frequency) as a basis for their similarity or ranking functions because
it is too difficult to maintain current values as the database changes and the added
complexity has not been shown to significantly improve the overall weighting process.
RetrievalWare System:

 RetrievalWare first uses indexes (inversion lists) to identify potential relevant items. It then
applies coarse grain and fine grain ranking. The coarse grain ranking is based on the
presence of query terms within items. In the fine grain ranking, the exact rank of the item is
calculated. The coarse grain ranking is a weighted formula that can be adjusted based on
completeness, contextual evidence or variety, and semantic distance.
 Completeness is the proportion of the number of query terms (or related terms if a query
term is expanded using the RetrievalWare semantic network/thesaurus) found in the item
versus the number in the query. It sets an upper limit on the rank value for the item. If
weights are assigned to query terms, the weights are factored into the value. Contextual
evidence occurs when related words from the semantic network are also in the item.
 Thus if the user has indicated that the query term “charge” has the context of “paying for an
object” then finding words such as “buy,” “purchase,” “debt” suggests that the term
“charge” in the item has the meaning the user desires and that more weight should be
placed in ranking the item. Semantic distance evaluates how close the additional words are
to the query term.
 Synonyms add additional weight; antonyms decrease weight. The coarse grain process
provides an initial rank to the item based upon existence of words within the item. Since
physical proximity is not considered in coarse grain ranking, the ranking value can be easily
calculated.
 Fine grain ranking considers the physical location of query terms and related words, using proximity factors in addition to the other three factors from coarse grain evaluation. If the related terms and query terms occur in close proximity (same sentence or paragraph), the item is judged more relevant. A factor is calculated that is maximal at adjacency and decreases as the physical separation increases (a sketch of both ranking stages follows this list).
 Although ranking produces a numeric score, most systems try to use other ways of indicating the rank value to the user as Hit lists are displayed. The scores have a tendency to be misleading and confusing to the user, since the differences between values may be very small or very large.
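
A loose sketch of the two-stage idea (the actual RetrievalWare formulas are not given here, so every weight and factor below is an assumption for illustration):

```python
# Loose sketch only: coarse grain ranking from term presence, then fine
# grain ranking from proximity. All weights are invented.
def coarse_rank(query_terms, related_terms, item_terms):
    """Presence-based score; completeness caps the achievable rank."""
    found = sum(1 for t in query_terms if t in item_terms)
    completeness = found / len(query_terms)
    context = sum(1 for t in related_terms if t in item_terms)
    evidence = min(0.1 * context, 0.3)      # bounded contextual-evidence bonus
    return completeness * (0.7 + evidence)  # never exceeds completeness

def proximity_factor(pos_a, pos_b):
    """Maximal at adjacency, decaying as word separation grows."""
    return 1.0 / max(1, abs(pos_a - pos_b))

def fine_rank(coarse, query_positions, related_positions):
    """Boost the coarse score by the best query/related-term proximity."""
    best = max((proximity_factor(q, r)
                for q in query_positions for r in related_positions),
               default=0.0)
    return coarse * (1.0 + best)

item = "you can purchase the item and charge it to your account".split()
coarse = coarse_rank(["charge"], ["purchase", "buy"], item)
fine = fine_rank(coarse, [item.index("charge")], [item.index("purchase")])
print(round(coarse, 2), round(fine, 2))  # 0.8 1.0
```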
