IRS Unit-3
Automatic Indexing
− Statistical strategies
− use of frequency of occurrence of events is the basis for a statistical approach.
− events are related to occurrences of processing tokens (words/phrases) within
documents and within the database.
− words/phrases are the domain of searchable values.
Concept indexing
− uses the words within an item to correlate to concepts discussed in the item.
− This is a generalization of the specific words to values used to index the item.
− a special class of indexing can be defined by creation of hypertext linkages.
3.2 Statistical Indexing
− Statistical indexing
− O(R):
   ▪ Odds that a document chosen at random from the database is relevant to
     query Qi: O(R) = P(R) / (1 − P(R)).
− logodds formula:
   ▪ log O(R) = log( P(R) / (1 − P(R)) )
− Probability of relevance of a document to a query:
   ▪ obtained by applying the inverse logistic transformation (see the sketch below):
   ▪ P(R) = e^(log O(R)) / (1 + e^(log O(R)))
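A minimal sketch of the inverse logistic transformation above; the function name is illustrative:

```python
import math

def probability_from_logodds(log_odds: float) -> float:
    """Inverse logistic transformation: recover P(R) from log O(R)."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))

# log O(R) = 0 corresponds to even odds, i.e., P(R) = 0.5
print(probability_from_logodds(0.0))   # 0.5
print(probability_from_logodds(2.0))   # ~0.881
```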
− Example. Figure shows how an item that discusses petroleum refineries in Mexico
would be represented.
− binary approach
− The domain contains the value of one or zero. One represents the existence of the
processing token in the item.
− Binary vectors require a decision process to determine whether a particular processing
token represents the semantics of an item well enough to be included in the vector.
− The concepts “Tax” and “Shipment” are below the threshold of importance (e.g.,
assume the threshold is 1.0), so they are not included in the vector.
− weighted approach
− The domain is typically the set of all real positive numbers.
− The value for each processing token represents the relative importance of that
processing token in representing the semantics of the item.
− A weighted vector provides a range of values that captures the variance in the relative
importance of processing tokens in representing the semantics of the item.
− Weights provide a basis for determining the rank of an item.
− The vector approach allows for a mathematical and a physical representation using a
vector space model.
− Each processing token can be considered another dimension in item representation space.
− Figure 5.3 shows a three-dimensional vector representation assuming there were only
three processing tokens Petroleum, Mexico and Oil.
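A minimal sketch contrasting the binary and weighted representations above; only the token names and the 1.0 threshold come from the examples, the weight values are assumptions:

```python
# Illustrative weights for the petroleum-refineries-in-Mexico item; the values
# are assumed, only the 1.0 threshold comes from the notes.
weights = {"Petroleum": 1.8, "Mexico": 1.6, "Oil": 1.2, "Tax": 0.3, "Shipment": 0.4}
THRESHOLD = 1.0

# Weighted vector: keeps the real-valued relative importance of each token.
weighted_vector = dict(weights)

# Binary vector: a token is included only if its weight reaches the threshold,
# so "Tax" and "Shipment" drop out.
binary_vector = {term: 1 for term, w in weights.items() if w >= THRESHOLD}

print(weighted_vector)
print(binary_vector)   # {'Petroleum': 1, 'Mexico': 1, 'Oil': 1}
```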
− In a statistical system, the data that are available for calculating a weight are
− Term Frequency (TF): frequency of occurrence of the processing token
in an existing item.
− Total Frequency (TOTF): frequency of occurrence of the processing
token in the existing database.
− Item frequency (IF) or Document Frequency (DF): number of unique
items in the database that contain the processing token.
− Approaches for different document lengths when determining the value of Term
Frequency are
− maximum term frequency (MTF)
▪ Term frequency for each word is divided by the maximum frequency of
the word in any item.
▪ This normalizes the term frequency values to a value between zero and
one.
− logarithmic term frequency (LTF)
▪ log of the term frequency plus a constant is used to replace the term
frequency.
▪ The log function will perform the normalization when the term
frequencies vary significantly due to size of documents.
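A short sketch of the two normalizations; the log base and the additive constant are not fixed by the notes, so natural log and a constant of 1 are assumptions:

```python
import math

def mtf_normalize(tf: int, max_tf: int) -> float:
    """Maximum term frequency: divide by the largest frequency of the word
    in any item, giving a value between zero and one."""
    return tf / max_tf

def ltf_normalize(tf: int, constant: float = 1.0) -> float:
    """Logarithmic term frequency: log of the frequency plus a constant."""
    return math.log(tf) + constant

# A long document may repeat a word 50 times where a short one repeats it
# 5 times; both normalizations damp that difference.
print(mtf_normalize(5, 50), mtf_normalize(50, 50))   # 0.1, 1.0
print(ltf_normalize(5), ltf_normalize(50))           # ~2.61, ~4.91
```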
− Example:
− Assume that the term “oil” is found in 128 items, “Mexico” is found in 16 items and
“refinery” is found in 1024 items.
− If a new item arrives with all three terms in it, “oil” found 4 times, “Mexico” found 8
times, and “refinery” found 10 times, and there are 2048 items in the total database.
− Using a simple unnormalized term frequency, the item vector is (4, 8, 10).
− Figure 5.4 shows the weight calculations using inverse document frequency.
− Inverse document frequency is based upon the number of items that contain the term in
the existing database.
− It adjusts the weight of a processing token for an item: rare terms are boosted,
common terms are damped.
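A worked sketch of the example above, assuming the common formulation WEIGHTij = TFij × (log2(n/IFj) + 1); Figure 5.4 holds the authoritative calculation, so treat this as an assumption-labeled reconstruction:

```python
import math

n = 2048                                                  # items in the database
item_freq = {"oil": 128, "Mexico": 16, "refinery": 1024}  # IF: items containing the term
term_freq = {"oil": 4, "Mexico": 8, "refinery": 10}       # TF in the new item

def idf_weight(tf: int, if_: int) -> float:
    # Rare terms (small IF) get boosted; common terms are damped.
    return tf * (math.log2(n / if_) + 1)

vector = {t: idf_weight(term_freq[t], item_freq[t]) for t in term_freq}
print(vector)   # {'oil': 20.0, 'Mexico': 64.0, 'refinery': 20.0}
```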
− Example:
− Assume the terms “SAW” and “DRILL” are found in 5 items with the following
frequencies defined in Figure.
− If there are many independently occurring events, then the calculation for the average
information value across the events is:
   ▪ AVE_INFO = − Σk pk log2(pk)
− The value of pk can be defined as TFik/TOTFk, the ratio of the frequency of occurrence of
the term in an item to the total number of occurrences of the term in the database.
− The weighting factor, Signal:
   ▪ Signalk = log2(TOTFk) − AVE_INFO
− producing a final formula of (see the sketch below):
   ▪ Weightik = TFik × Signalk
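A sketch of the signal weighting; the per-item frequencies for “SAW” and “DRILL” are illustrative stand-ins for the values in the referenced figure:

```python
import math

def signal(freqs):
    """Signal_k = log2(TOTF_k) - AVE_INFO, where AVE_INFO is the entropy of
    the term's distribution across the items that contain it."""
    totf = sum(freqs)
    ave_info = -sum((tf / totf) * math.log2(tf / totf) for tf in freqs)
    return math.log2(totf) - ave_info

# Hypothetical frequencies across 5 items (assumed, not the figure's values):
saw   = [10, 10, 10, 10, 10]   # evenly spread across the items
drill = [2, 2, 18, 10, 18]     # concentrated in a few items

print(signal(saw))     # ~3.32: even spread gives the minimum signal for this TOTF
print(signal(drill))   # ~3.75: concentration makes the term a better indicator
```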
Discrimination Value
− Another approach to create a weighting algorithm is based upon the discrimination value
of a term.
− To achieve the objective of finding relevant items, it is important that the index
discriminates among items.
− The discrimination value for each term “i” is:
   ▪ DISCRIMi = AVESIMi − AVESIM
   ▪ AVESIM is the average similarity between every pair of items in the database;
     AVESIMi is the same average computed with term “i” removed from every item.
   ▪ A positive DISCRIMi means the items look more alike once term “i” is removed,
     i.e., the term was a good discriminator.
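A sketch of the discrimination value under cosine similarity (an assumption; any normalized similarity measure works), with an illustrative three-item matrix:

```python
import itertools, math

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def avg_similarity(vectors):
    pairs = list(itertools.combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def discrimination_value(vectors, i):
    # Positive value: without term i the items look more alike, so term i
    # was helping to discriminate among them.
    without = [[w for j, w in enumerate(v) if j != i] for v in vectors]
    return avg_similarity(without) - avg_similarity(vectors)

# Hypothetical 3-item, 3-term weight matrix (illustrative values only):
items = [[3, 0, 1], [0, 4, 1], [2, 1, 1]]
for i in range(3):
    print(i, round(discrimination_value(items, i), 3))
# Term 2 appears in every item, so removing it lowers similarity: a poor
# discriminator (negative value).
```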
− Basis
− based upon conditional probabilities (Probability of Event 1 given Event 2
occurred).
− Bayesian formula
− probability of relevance (REL) to a search statement given a particular document
and query: P(REL | DOCi, Queryj).
− Uses/Application
− Bayesian model is applied to the search function and to creating the index to the
database.
− Bayesian formula can be used in determining the weights associated with a
particular processing token in an item.
− Bayesian network can be used to determine the final set of processing tokens
(called topics) and their weights.
− Issues
− Some topics are related to other topics and some processing tokens related to
other processing tokens.
− For example, the topics of “Politics” and “Economics” are in some instances
related to each other and in many other instances totally unrelated.
− There are two approaches to handling this problem.
− The first approach is to assume that the dependencies have minimal effect and treat
the topics and processing tokens as independent.
− The second approach
− extends the network to additional layers to handle interdependencies.
− additional layer of Independent Topics (ITs) can be placed above the Topic layer.
− A layer of Independent Processing Tokens (IPs) can be placed above the
processing token layer.
3.3 Natural Language
− The goal of indexing is to represent the semantic concepts of an item in the information
system to support finding relevant information.
− Single words have conceptual context, but frequently they are too general to help the user
find the desired information.
− Term phrases allow additional specification and focusing of the concept to provide better
precision and reduce the user’s overhead of retrieving non-relevant items.
− It is assumed that lexical analysis (which determines verb tense, plurality, and part of
speech) has been completed prior to the additional processing.
− Natural language processing:
− produces more accurate term phrases
− provides higher-level semantic information that identifies relationships between concepts.
− is not just determining the topic statement(s) but also assigning semantic attributes to the
topic such as time frame (past, present, future).
− Each sentence is evaluated and assigned weights associated with its possible inclusion in
the different components.
− The next level of natural language processing identifies inter- relationships between the
concepts.
− The relationships are typically visualized as triples, with two concepts and a relationship
between them (e.g., the concepts “refinery” and “Mexico” linked by the relationship
“located in”).
− The weights are based upon a combination of statistical information and values assigned
to the actual words used in establishing the linkages.
− The additional information beyond the indexing is kept in additional data structures
associated with each item.
3.4 Concept Indexing
− goal
− To use concepts instead of terms as the basis for the index
− To produce a reduced dimension vector space.
− Concept indexing
− can start with a number of unlabeled concept classes
− lets the information in the items define the concept classes.
3.5 Hypertext Linkages
− An item’s vector can be extended with the weights of processing tokens in items it is
hyperlinked to:
− Wij is the weight associated with processing token “j” in item “i”, and Wkl with
processing token “l” in item “k”, where the two items are related via a hyperlink.
− Linkik is the weight associated with the strength of the link.
− α, β, γ are weighting/normalization factors.
− The values are stored in an expanded index structure or calculated dynamically.
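Since the original equation is not reproduced in these notes, the following is only a plausible, assumption-labeled sketch of how the factors named above could combine:

```python
# Illustrative only: one plausible combination of the factors named above.
# alpha scales the item's own weight, beta scales the contribution of
# hyperlinked items, and gamma is treated here as an overall normalization
# factor (all three roles are assumptions; the exact formula is in the text).
def link_adjusted_weight(own_weight, linked, alpha=0.7, beta=0.3, gamma=1.0):
    """own_weight: Wij for token j in item i.
    linked: list of (Link_ik, Wkl) pairs for items k hyperlinked to item i."""
    contribution = sum(link * w for link, w in linked)
    return gamma * (alpha * own_weight + beta * contribution)

# Token "refinery" in an item, reinforced by two hyperlinked items that also
# discuss refineries:
print(link_adjusted_weight(2.0, [(0.8, 1.5), (0.4, 3.0)]))   # 2.12
```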
3.6 Document and Term Clustering
− Goal of clustering
− to assist in the location of information.
− to group similar objects (e.g., terms or items) into a “class”.
− To allow linkages between clusters to be specified.
− If a concordance is used, other tools such as KWOC, KWIC or KWAC may help in
determining useful words.
− Key Word Out of Context (KWOC) is another name for a concordance.
− Key Word In Context (KWIC) displays a possible term in its phrase context.
− It is structured to easily identify the location of the term under consideration in the
sentence.
− Key Word And Context (KWAC) displays the keywords followed by their context.
− Figure 6.1 shows the various displays for “computer design contains memory chips”.
− The phrase is assumed to be from doc4; the other frequency and document ids for KWOC
were created for this example.
− In the Figure 6.1 the character “/” is used in KWIC to indicate the end of the phrase.
− The KWIC and KWAC are useful in determining the meaning of homographs.
− The term “chips” could be wood chips or memory chips.
− In both the KWIC and KWAC displays, the editor of the thesaurus can read the sentence
fragment associated with the term and determine its meaning.
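A small sketch that generates the three displays for the example phrase; the exact layout of Figure 6.1 may differ, so the formatting here is only indicative:

```python
phrase = "computer design contains memory chips"
words = phrase.split()

print("KWOC (keyword pulled out of context):")
for w in words:
    print(f"  {w:<10} {phrase}")

print("KWIC (phrase rotated; '/' marks the end of the original phrase):")
for i in range(len(words)):
    print("  " + " ".join(words[i:] + ["/"] + words[:i]))

print("KWAC (keyword followed by its context):")
for i, w in enumerate(words):
    context = " ".join(words[i + 1:] + ["/"] + words[:i])
    print(f"  {w:<10} {context}")
```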
− Once the terms are selected they are clustered based upon the word relationship
guidelines and the interpretation of the strength of the relationship.
− The resultant thesaurus undergoes many quality assurance reviews by additional editors
using some of the guidelines already suggested before it is finalized.
3.7.2 Automatic Term Clustering
Basis
− similarity between every term pair is calculated for determining the clusters.
o The vector model is represented by a matrix where the rows are individual items
and the columns are the unique words (processing tokens) in the items.
o The values in the matrix represent how strongly that particular word represents
concepts in the item.
− There are no values on the diagonal since that represents the autocorrelation of
a word to itself.
− The next step is to select a threshold that determines if two terms are considered similar
enough to each other to be in the same class.
o In this example, two terms are considered similar if the similarity
value between them is the threshold value of 10 or greater.
o This produces a new binary matrix called the Term Relationship
matrix (Figure 6.4) that defines which terms are similar.
o A ‘1’ in the matrix indicates that the terms specified by the column
and the row are similar enough to be in the same class.
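A sketch of the two steps above: term-term similarities computed from an item/term matrix, then thresholded at 10 into a binary Term Relationship matrix. The matrix values are illustrative, not those of Figure 6.2:

```python
# Rows are items, columns are terms (processing tokens); the weights are
# illustrative, not the values in Figure 6.2.
matrix = [
    [0, 4, 0, 8],
    [3, 1, 4, 3],
    [3, 0, 0, 0],
    [0, 1, 0, 2],
]
n_terms = len(matrix[0])
THRESHOLD = 10

# Term-term similarity: sum over items of the product of the two terms' weights.
def term_similarity(a: int, b: int) -> int:
    return sum(row[a] * row[b] for row in matrix)

# Binary Term Relationship matrix; the diagonal stays 0 since a term's
# autocorrelation with itself is not used.
relationship = [
    [1 if a != b and term_similarity(a, b) >= THRESHOLD else 0
     for b in range(n_terms)]
    for a in range(n_terms)
]
for row in relationship:
    print(row)
```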
− The final step in creating clusters is to determine when two objects (words) are in the
same cluster.
− The first assignment of terms to clusters produces centroids that are not related to the
final clustering of terms.
− The similarity between all existing terms and the centroids of the clusters can be
calculated.
− The term is reallocated to the cluster(s) that has the highest similarity.
− This process is iterated until it stabilizes.
− Calculations using this process are of the order O(n).
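A minimal sketch of the centroid iteration just described, using dot-product similarity; the data and the arbitrary first assignment are assumptions (and the sketch assumes no cluster empties out during iteration):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Each term is a vector of its weights across the items (a column of the
# item/term matrix); the values are illustrative.
terms = [[4, 0, 0], [5, 1, 0], [0, 3, 4], [0, 4, 3], [1, 0, 5]]
assignment = [0, 1, 0, 1, 0]        # arbitrary first assignment to 2 classes

for _ in range(10):                  # iterate until the assignment stabilizes
    centroids = [centroid([t for t, c in zip(terms, assignment) if c == k])
                 for k in range(2)]
    new_assignment = [max(range(2), key=lambda k: dot(t, centroids[k]))
                      for t in terms]
    if new_assignment == assignment:
        break
    assignment = new_assignment

print(assignment)   # terms regrouped around the stabilized centroids
```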
− Example:
− A graphical representation of terms and centroids illustrates how classes move after the
initial assignment.
− The solid black box represents the centroid for each of the classes.
− In Figure 6.6b the centroids for the first three arbitrary classes are shown.
− The ovals in Figure 6.6b. show the ideal cluster assignments for each term.
− During the next iteration the similarity between every term and the cluster centroids is
computed, reassigning terms as needed.
− The resulting new centroid for the new clusters is again shown as black squares in Figure
6.6a.
− The new centroids are not yet perfectly associated with the ideal clusters, but they are
much closer.
− The process continues until it stabilizes.
− Clustering of items
o similar to term clustering for the generation of thesauri.
− Manual item clustering
o someone reads the item and determines the category or categories to which it
belongs.
o each item is usually assigned to one category.
− Using Figure 6.2 as the set of items and their terms, and the similarity equation:
   ▪ SIM(Itemi, Itemj) = Σk (Termik × Termjk)
− as k goes from 1 to 8 for the eight terms, an Item-Item matrix is created (Figure 6.9).
− Using a threshold of 10 produces the Item Relationship matrix shown in Figure 6.10.
− Algorithms used are
o Clique algorithm
o Single link technique
o Star technique
o String technique
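Of the four, the single link technique is the simplest to sketch: treat the Item Relationship matrix as a graph and take connected components as clusters (clique, star, and string place different constraints on the same graph). The matrix here is illustrative:

```python
# Illustrative binary Item Relationship matrix (1 = similarity >= threshold).
rel = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

def single_link_clusters(rel):
    """Single link: items are in the same cluster if any chain of
    above-threshold similarities connects them (connected components)."""
    n = len(rel)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        while stack:                      # depth-first walk of the component
            i = stack.pop()
            if i in seen:
                continue
            seen.add(i)
            component.append(i)
            stack.extend(j for j in range(n) if rel[i][j] and j not in seen)
        clusters.append(sorted(component))
    return clusters

print(single_link_clusters(rel))   # [[0, 1, 2], [3]]
```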
3.9 Hierarchy of Clusters
− Dendrograms
− Search overhead is reduced by performing top-down searches of the centroids of the
clusters in the hierarchy and trimming those branches that are not relevant.
− It is difficult to create a visual display of the total item space.
− Use of dendrograms along with visual cues on the size of clusters and strengths of the
linkages between clusters (e.g., dashed lines indicate reduced similarities) allows a user
to determine alternate paths of browsing the database.
− The dendrogram allows the user to determine which of the clusters to be reviewed are
likely to have items of interest.
− Even without the visual display of the hierarchy, a user can use the logical hierarchy of
the Figure 6.12 dendrogram to browse items of interest.
− A user, once having identified an item of interest, can request to see other items in the
cluster.
− Hierarchical agglomerative clustering methods (HACM) can be defined by the
dissimilarity D between any existing cluster Ck and a new cluster Cij created by
combining clusters Ci and Cj; the standard Lance–Williams form is:
   ▪ D(Ck, Cij) = αi·D(Ck, Ci) + αj·D(Ck, Cj) + β·D(Ci, Cj) + γ·|D(Ck, Ci) − D(Ck, Cj)|
− Different choices of the coefficients αi, αj, β, γ yield the single link, complete link,
group average, and Ward variants.
− Voorhees showed that the similarity between a cluster centroid and any item is equal to
the mean similarity between the item and all items in the cluster.
− Ward’s Method chooses the minimum squared Euclidean distance between points (e.g.,
centroids), normalized by the number of objects in each cluster (see the sketch below).
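A sketch of the Ward criterion as stated above, using the standard |Ci||Cj|/(|Ci|+|Cj|) normalization; the points are illustrative:

```python
def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def ward_distance(ci, cj):
    """Squared Euclidean distance between centroids, normalized by the
    number of objects in each cluster: |Ci||Cj| / (|Ci| + |Cj|) * d^2."""
    mi, mj = centroid(ci), centroid(cj)
    d2 = sum((a - b) ** 2 for a, b in zip(mi, mj))
    return (len(ci) * len(cj) / (len(ci) + len(cj))) * d2

# Illustrative clusters of 2-dimensional points:
c1 = [[0.0, 0.0], [1.0, 0.0]]
c2 = [[4.0, 3.0], [5.0, 3.0], [4.0, 4.0]]
print(ward_distance(c1, c2))   # merge cost; HACM merges the cheapest pair
```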
− Automatic clustering techniques can also be used to create a hierarchy of objects (items
or terms).
− Centroids were used to reduce computation required for adjustments in term assignments
to classes.
− For both terms and items, the centroid has the same structure as any of the items or terms
when viewed as a vector from the Item/Term matrix.
− A term is a vector composed of a column whereas an item is a vector composed of a row.
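A closing sketch of the row/column view: an item is a row of the item/term matrix, a term is a column, and a centroid has the same structure as whatever it averages. The values are illustrative:

```python
# Illustrative item/term matrix: rows are items, columns are terms.
matrix = [
    [0, 4, 0, 8],
    [3, 1, 4, 3],
    [3, 0, 0, 0],
]

items = matrix                                # an item is a row vector
terms = [list(col) for col in zip(*matrix)]   # a term is a column vector

def centroid(vectors):
    # The centroid averages the vectors, so it has the same length
    # (structure) as the items or terms it summarizes.
    return [sum(col) / len(vectors) for col in zip(*vectors)]

print(centroid(items))      # a 4-element vector, like any item row
print(centroid(terms[:2]))  # a 3-element vector, like any term column
```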