Unit 3 Indexing
Static and Dynamic Inverted Indices –Index Construction and Index Compression –Searching –
Sequential Searching and Pattern Matching –Query operations –Query Languages –Query
Processing –Relevance Feedback and Query Expansion – Automatic Local and Global analysis –
Measuring Effectiveness and Efficiency.
Next, create an index of the terms, where each term points to a list of documents that contain that term, as
follows.
The word “hello” is in document 1 (“hello everyone”) starting at word 1, so it has an entry (1, 1). The
word “is” is in documents 2 and 3 at the 3rd and 2nd word positions respectively, so it has entries (2, 3)
and (3, 2). Positions here are counted in words.
The index may have weights, frequencies, or other indicators.
Steps to Build an Inverted Index
Fetch the Document: Read the text of each document that is to be indexed.
Removing of Stop Words: Stop words are the most frequently occurring words in documents that carry
little meaning for search, such as “I”, “the”, “we”, “is”, and “an”.
Stemming of Root Word: A search for “cat” should return documents that contain information about cats,
even when the word in the document appears as “cats” or “catty” rather than “cat”. To relate these forms,
each word is reduced to its root by chopping off part of the word. Standard tools such as the Porter
stemmer perform this step.
Record Document IDs: If the word is already present in the index, add a reference to the document to its
entry; otherwise, create a new entry. Additional information such as the frequency of the word and its
location in the document can also be recorded.
Example:
Words Document
ant doc1
demo doc2
world doc1, doc2
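The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production indexer: the two document texts and the stop-word list are hypothetical, chosen so that the result matches the example table, and stemming is left out to keep the sketch short.

```python
from collections import defaultdict

# Hypothetical stop-word list and documents, chosen only for illustration.
STOP_WORDS = {"i", "the", "we", "is", "an", "a"}

def build_inverted_index(docs):
    """Map each non-stop-word term to a sorted list of the document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token not in STOP_WORDS:      # step: remove stop words
                index[token].add(doc_id)     # step: record the document ID
    return {term: sorted(ids) for term, ids in sorted(index.items())}

docs = {"doc1": "the ant world", "doc2": "a demo world"}
index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, postings)
```

A set is used per term while building so that a word occurring several times in one document is recorded only once; frequencies or positions would need a richer posting structure.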
Advantages of Inverted Index
The inverted index allows fast full-text searches, at the cost of increased processing when a document is
added to the database.
It is easy to develop.
It is the most popular data structure used in document retrieval systems, used on a large scale for example
in search engines.
Disadvantages of Inverted Index
Large storage overhead and high maintenance costs on updating, deleting, and inserting.
Instead of retrieving the data in decreasing order of expected usefulness, the records are retrieved in the
order in which they occur in the inverted lists.
Features of Inverted Indexes
Efficient search: Inverted indexes allow for efficient searching of large volumes of text-based data. By
indexing every term in every document, the index can quickly identify all documents that contain a given
search term or phrase, significantly reducing search time.
Fast updates: Inverted indexes can be updated quickly and efficiently as new content is added to the
system. This allows for near-real-time indexing and searching for new content.
Flexibility: Inverted indexes can be customized to suit the needs of different types of information retrieval
systems. For example, they can be configured to handle different types of queries, such as Boolean queries
or proximity queries.
Compression: Inverted indexes can be compressed to reduce storage requirements. Various techniques such
as delta encoding, gamma encoding, variable byte encoding, etc. can be used to compress the posting list
efficiently.
Support for stemming and synonym expansion: Inverted indexes can be configured to support stemming
and synonym expansion, which can improve the accuracy and relevance of search results. Stemming is the
process of reducing words to their base or root form, while synonym expansion involves mapping different
words that have similar meanings to a common term.
Support for multiple languages: Inverted indexes can support multiple languages, allowing users to search
for content in different languages using the same system.
3.2. INDEX CONSTRUCTION AND INDEX COMPRESSION
3.2.1 Index Construction
Index construction is the process of creating an index for a collection of documents. An index is a data
structure that allows for efficient searching of the document collection. The index is typically created by
first parsing the documents to extract the terms, and then storing these terms in a data structure that allows
for fast lookup.
There are a number of different algorithms for index construction. The most common is the sort-based
algorithm, which collects (term, DocID) pairs from the documents and sorts them by term and then by
document identifier (DocID), so that all postings for a term become adjacent. This algorithm is simple to
implement, but it is inefficient for collections whose pairs do not fit in main memory.
Other index construction algorithms include:
1. Blocked sort-based indexing: This algorithm divides the document collection into smaller blocks and
constructs an index for each block. The resulting indexes are then merged to create a single index for the
entire collection.
2. Single-pass indexing: This algorithm constructs the index in a single pass over the document collection.
This algorithm is more efficient than the sort-based algorithm for large collections of documents, but it is
more complex to implement.
3. Distributed indexing: This algorithm distributes the index construction process across multiple machines.
This algorithm is useful for very large collections of documents, where it would be impractical to construct
the index on a single machine.
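As a rough sketch of blocked sort-based indexing, the following Python code sorts the (term, DocID) pairs of each block separately and then merges the sorted runs. It rests on several assumptions: the runs are kept in memory instead of being written to disk, the function names and the block size are hypothetical, and documents 2 and 3 are invented sample texts.

```python
import heapq
from itertools import groupby

def index_block(block):
    """Collect the (term, doc_id) pairs of one block and sort them in memory."""
    pairs = {(term, doc_id) for doc_id, text in block for term in text.lower().split()}
    return sorted(pairs)

def merge_blocks(sorted_runs):
    """Merge the sorted runs into a single term -> postings list mapping."""
    merged = heapq.merge(*sorted_runs)  # k-way merge of already sorted runs
    index = {}
    for term, group in groupby(merged, key=lambda pair: pair[0]):
        index[term] = [doc_id for _, doc_id in group]
    return index

def blocked_sort_based_index(docs, block_size=2):
    """Split the collection into blocks, index each block, then merge the results."""
    runs = [index_block(docs[i:i + block_size]) for i in range(0, len(docs), block_size)]
    return merge_blocks(runs)

# Document 1 follows the earlier example; documents 2 and 3 are made up.
docs = [(1, "hello everyone"), (2, "this article is useful"), (3, "everyone is welcome")]
index = blocked_sort_based_index(docs)
print(index["everyone"], index["is"])
```

In a real system each run would be written to disk and the merge would stream from the run files, which is what lets the approach handle collections larger than main memory.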
3.2.2 Index Compression
Index compression is the process of reducing the size of an index. This is important because large indexes
can take up a lot of disk space. There are a number of different index compression techniques, which can
be broadly classified into two categories:
1. Term-level compression: These techniques compress the terms themselves. For example, Huffman coding
can be used to represent terms with shorter bit sequences.
2. Postings-level compression: These techniques compress the postings lists, which are the lists of
document identifiers for each term. For example, delta encoding stores the differences (gaps) between
successive document identifiers instead of the identifiers themselves.
The choice of index compression technique depends on the specific characteristics of the document
collection and the index structure. In general, a combination of term-level and postings-level compression
techniques is used to achieve the best compression ratio.
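To make postings-level compression concrete, here is a small sketch that combines delta (gap) encoding with variable byte encoding. The function names and the sample postings list are assumptions made for illustration; the byte layout (7 payload bits per byte, with the high bit set on the last byte of each number) is one common convention, not the only one.

```python
def to_gaps(doc_ids):
    """Delta encoding: replace each sorted DocID by its distance from the previous one."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def from_gaps(gaps):
    """Invert to_gaps by accumulating the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

def vb_encode_number(n):
    """Variable byte code for one non-negative integer."""
    out = [n % 128]
    n //= 128
    while n > 0:
        out.insert(0, n % 128)
        n //= 128
    out[-1] += 128  # mark the final byte of this number
    return bytes(out)

def vb_encode(numbers):
    return b"".join(vb_encode_number(n) for n in numbers)

def vb_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = n * 128 + byte
        else:
            numbers.append(n * 128 + (byte - 128))
            n = 0
    return numbers

postings = [824, 829, 215406]        # sample sorted DocIDs
gaps = to_gaps(postings)             # small gaps need fewer bytes than raw DocIDs
encoded = vb_encode(gaps)
decoded = from_gaps(vb_decode(encoded))
print(gaps, len(encoded), decoded)
```

The gap 5 fits in a single byte, whereas the raw DocID 829 would need two, which is where the saving comes from on dense postings lists.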
Benefits of Index Construction and Index Compression
Index construction and index compression provide a number of benefits for information retrieval systems,
including:
Improved search performance: Indexes allow for efficient searching of the document collection, which can
significantly improve search performance.
Reduced storage requirements: Index compression can reduce the size of the index, which can save disk
space.
Improved scalability: Index construction and index compression can be used to create indexes for very
large collections of documents, which makes them scalable to large-scale information retrieval systems.
3.3 SEARCHING – SEQUENTIAL SEARCHING AND PATTERN MATCHING
Searching is the fundamental operation of information retrieval (IR). It involves locating relevant
information within a large collection of data. There are two main types of searching:
Exact matching: This type of searching finds all occurrences of a specific string in the data. For example,
searching for the word "cat" in a document finds every occurrence of the string "cat"; whether differences
in capitalization are ignored depends on how the search is configured.
Pattern matching: This type of searching finds all occurrences of a pattern in the data, even if the pattern
is not exact. For example, searching for the wildcard pattern "ca*" would find all words that start with
"ca", such as "cat", "car", and "captain".
3.3.1 Sequential searching
Sequential searching is the simplest and most straightforward search algorithm. It works by comparing the
search pattern to each item in the data collection until a match is found. Sequential searching is a relatively
slow algorithm, but it is easy to implement and understand.
Example:
Suppose you are looking for a specific book in a library. The library has a collection of books that is
organized alphabetically by author's last name. You can use sequential searching to find the book by
starting at the beginning of the collection and comparing the title of each book to the title of the book you
are looking for until you find a match.
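The library example can be sketched as a simple loop. The function name and the book titles below are hypothetical; the function returns the position of the first match, or -1 if the whole collection was scanned without success.

```python
def sequential_search(items, target):
    """Compare the target with each item in turn; return the first matching position."""
    for position, item in enumerate(items):
        if item == target:
            return position
    return -1  # every item was examined and none matched

# Hypothetical shelf of book titles.
shelf = ["Emma", "Ulysses", "Dracula", "Walden"]
print(sequential_search(shelf, "Dracula"))
print(sequential_search(shelf, "Hamlet"))
```

In the worst case every item is examined, so the running time grows linearly with the size of the collection.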
3.3.2. Pattern matching
Pattern matching is a more complex type of searching that uses algorithms to find occurrences of patterns
in data.
Example: Suppose you are searching for a specific word in a document. The document is a large text file
that contains thousands of words. You can use pattern matching to find the word by using a regular
expression. For example, the regular expression \bcat\b matches every occurrence of "cat" as a whole
word, so it does not match "cats" or "concatenate". Matching is case-sensitive by default; matching "Cat"
or "CAT" as well requires a case-insensitive option.
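A short sketch with Python's standard `re` module shows how the word-boundary pattern behaves. The sample sentence is invented for illustration. Note that in Python the pattern is case-sensitive unless the `re.IGNORECASE` flag is passed.

```python
import re

# Invented sample text containing "cat" as a whole word and inside other words.
text = "The cat sat. Cats scatter; a CAT concatenates."

# \b marks a word boundary, so only the standalone word matches.
whole_word = re.findall(r"\bcat\b", text)

# The same pattern with case-insensitive matching.
any_case = re.findall(r"\bcat\b", text, flags=re.IGNORECASE)

print(whole_word)
print(any_case)
```

"Cats", "scatter", and "concatenates" are skipped in both cases because the characters "cat" are not delimited by word boundaries on both sides.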
Here is a table that summarizes the key differences between sequential searching and pattern matching:
Where n is the size of the data collection and m is the size of the search pattern
There are many different pattern matching algorithms, each with its own strengths and weaknesses. Some
of the most common pattern matching algorithms include:
1. Knuth-Morris-Pratt (KMP) algorithm: The KMP algorithm preprocesses the pattern so that the text is
never re-scanned after a mismatch, giving a worst-case running time of O(n + m).
2. Boyer-Moore (BM) algorithm: The BM algorithm compares the pattern from right to left and skips over
sections of the text after a mismatch. It is known for its good average-case performance in practice.
3. Rabin-Karp algorithm: The Rabin-Karp algorithm is a pattern matching algorithm that uses hashing to
quickly find potential matches.
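As an illustration of the first algorithm in the list, here is a compact sketch of KMP in Python. The function names are my own; the code returns the starting positions of all (possibly overlapping) occurrences of the pattern in the text.

```python
def build_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]          # fall back without moving backwards in the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]          # keep going to find overlapping matches
    return matches

print(kmp_search("abababa", "aba"))
```

The text index only ever moves forward, which is why the search phase is linear in the length of the text.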
Searching is used in many applications, including:
Web search: Search engines use searching to find relevant web pages based on user queries.
Database search: Database management systems use searching to find records that match user-specified
criteria.
Text search: Text editors use searching to find specific words or phrases within a document.
Pattern recognition: Pattern recognition systems use searching to find patterns in images, audio, and other
types of data.
3.4. QUERY OPERATIONS –QUERY LANGUAGES –QUERY PROCESSING
A query operation is a fundamental action that is performed on a collection of documents to retrieve
relevant information. Common query operations include:
3.5 RELEVANCE FEEDBACK AND QUERY EXPANSION
3.5.1 Relevance Feedback
Relevance feedback is a user-centered technique that involves asking the user to evaluate the relevance of
the retrieved documents to their information need. The user's feedback is then used to refine the query and
improve the ranking of the retrieved documents.
There are two main types of relevance feedback:
Directed relevance feedback: The user is asked to specify which documents are relevant and which are
not relevant.
Undirected relevance feedback: The user is asked to rate the relevance of the documents on a scale, such
as 1 (not relevant) to 5 (very relevant).
The feedback can be used to perform the following tasks:
Identifying relevant terms: Relevant terms are the words or phrases that are most closely related to the
user's information need. These terms can be used to refine the query by adding them to the query or by
increasing their weights.
Removing irrelevant terms: Irrelevant terms are the words or phrases that are not related to the user's
information need. These terms can be used to refine the query by removing them from the query.
Improving term weights: Term weights are a measure of the importance of a term in the query. Term
weights can be adjusted based on the relevance of the documents that contain the term.
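One well-known way to adjust term weights from feedback is the Rocchio method, which the text above does not name; it is used here only as an illustrative technique. The sketch assumes queries and documents are represented as term-to-weight dictionaries, and the parameter values and sample vectors are arbitrary.

```python
from collections import defaultdict

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query towards relevant documents and away from non-relevant ones."""
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant)
    for doc in non_relevant:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(non_relevant)
    # Terms whose weight is not positive are dropped from the refined query.
    return {term: weight for term, weight in new_query.items() if weight > 0}

# Hypothetical feedback: one document judged relevant, one judged not relevant.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}]
non_relevant = [{"jaguar": 1.0, "car": 1.0}]
refined = rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.5, gamma=0.25)
print(refined)
```

In this example the term "cat" is added with a positive weight, while "car" ends up negative and is removed, which mirrors the three tasks listed above.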
3.5.2 Query Expansion
Query expansion is a technique that involves adding new terms to the query based on the retrieved
documents. This can be done using a variety of methods, such as:
Leveraging semantic relationships: Related words and phrases can be added to the query to expand the
search.
Using thesauruses and dictionaries: Thesauruses and dictionaries can be used to find synonyms and
other related terms for the query terms.
Using word embedding techniques: Word embedding techniques can be used to identify words that are
semantically similar to the query terms.
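A thesaurus-based expansion can be sketched as a dictionary lookup. The tiny thesaurus, the function name, and the limit on added terms per query term are all assumptions made for this illustration.

```python
def expand_query(query_terms, thesaurus, max_per_term=2):
    """Append up to max_per_term related terms for each original query term."""
    expanded = list(query_terms)
    for term in query_terms:
        for related in thesaurus.get(term, [])[:max_per_term]:
            if related not in expanded:   # avoid adding duplicates
                expanded.append(related)
    return expanded

# Hypothetical hand-made thesaurus.
thesaurus = {
    "car": ["automobile", "vehicle", "auto"],
    "cheap": ["inexpensive"],
}
expanded = expand_query(["cheap", "car"], thesaurus)
print(expanded)
```

Capping the number of added terms is one simple way to limit query drift, since every extra term can also pull in documents unrelated to the original need.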
Query expansion can be used to improve the effectiveness of search results in several ways:
Capturing more relevant information: New terms can be added to the query that capture more of the user's
information need.
Reducing ambiguity: New terms can be added to the query that help to reduce the ambiguity of the query.
Improving recall: New terms can be added to the query that help to identify more relevant documents that
were not originally retrieved.
Relevance feedback and query expansion are often used together to improve the effectiveness of search
results. Relevance feedback is used to identify relevant terms and refine the query, while query expansion
is used to further expand the search and capture more relevant information. By involving the user in the
search process and using the feedback to refine the query, the combination can help to identify more
relevant documents and improve the overall user experience.
3.5.3. Automatic Local analysis
Automatic Local Analysis, Query Expansion Through Local Clustering, and Query Expansion Through
Local Context Analysis are three important techniques used in query expansion for information retrieval
(IR). These techniques aim to improve the effectiveness of search results by expanding the user's original
query with relevant terms.
Automatic Local Analysis (ALA)
ALA is a technique that identifies relevant terms based on their co-occurrence with the query terms in the
retrieved documents. The underlying assumption is that terms that frequently appear together are likely to
be semantically related. ALA involves the following steps:
1. Gather retrieval results: Retrieve a set of documents based on the initial query.
2. Identify local clusters: Divide the retrieved documents into local clusters based on their similarity.
3. Rank terms: Rank the terms within each cluster based on their frequency and similarity to the query terms.
4. Select terms: Select the top-ranked terms from each cluster to expand the query.
Query Expansion Through Local Clustering (QE-LC)
QE-LC is a technique that expands the query by selecting terms from the centroids of local clusters. The
centroids represent the most representative documents in each cluster and are assumed to contain the most
relevant terms. QE-LC involves the following steps:
1. Gather retrieval results: Retrieve a set of documents based on the initial query.
2. Identify local clusters: Divide the retrieved documents into local clusters based on their similarity.
3. Extract cluster centroids: Extract the centroids from each local cluster.
4. Select terms: Select the terms from the cluster centroids to expand the query.
Query Expansion Through Local Context Analysis (QE-LCA)
QE-LCA is a technique that expands the query by selecting terms from the local context of the query terms.
The local context refers to the terms that appear within a certain window size of the query terms in the
retrieved documents. QE-LCA involves the following steps:
1. Gather retrieval results: Retrieve a set of documents based on the initial query.
2. Identify local context: Identify the local context for each query term in the retrieved documents.
3. Rank terms: Rank the terms within each local context based on their frequency and similarity to the query
terms.
4. Select terms: Select the top-ranked terms from each local context to expand the query.
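The local context steps can be sketched as follows. This is a simplified illustration: the retrieved documents are hypothetical, the window is measured in words, and candidate terms are ranked by frequency alone (the similarity component mentioned in step 3 is left out).

```python
from collections import Counter

def local_context_terms(query_terms, retrieved_docs, window=2, top_k=3):
    """Rank terms by how often they occur within `window` words of a query term."""
    query = set(query_terms)
    counts = Counter()
    for text in retrieved_docs:
        tokens = text.lower().split()
        for i, token in enumerate(tokens):
            if token in query:
                low, high = max(0, i - window), min(len(tokens), i + window + 1)
                for neighbour in tokens[low:i] + tokens[i + 1:high]:
                    if neighbour not in query:
                        counts[neighbour] += 1
    return [term for term, _ in counts.most_common(top_k)]

# Hypothetical documents returned for the initial query "jaguar".
retrieved = ["jaguar speed big cat", "jaguar big cat habitat", "fast car engine"]
best = local_context_terms(["jaguar"], retrieved, window=2, top_k=1)
candidates = local_context_terms(["jaguar"], retrieved, window=2, top_k=3)
print(best, candidates)
```

Terms from the third document never enter the ranking because that document contains no query term, so it contributes no local context.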
The three techniques differ in how they identify and select relevant terms for query expansion:
ALA uses co-occurrence analysis to identify relevant terms.
QE-LC uses cluster centroids to identify relevant terms.
QE-LCA uses local context analysis to identify relevant terms.
Automatic Global Analysis (AGA) is a query expansion technique that utilizes global document analysis to
identify relevant terms for query refinement. It employs statistical methods to analyze the relationships
between terms across the entire document collection, rather than relying on local context within individual
documents. This global perspective allows AGA to capture broader semantic connections and uncover
terms that may not be apparent when examining documents in isolation.
AGA typically involves the following steps:
Document indexing: Create a representation of each document in the collection, typically as a vector of
term weights.
Term-term correlation analysis: Compute the correlation between each pair of terms in the collection,
identifying terms that tend to co-occur frequently.
Term selection: Select a subset of the most highly correlated terms to add to the query.
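A minimal sketch of the global steps is given below. As a simplifying assumption it uses raw document-level co-occurrence counts as the correlation measure, where a real system would normally use a normalized score; the collection and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def term_cooccurrence(collection):
    """For each unordered term pair, count the documents that contain both terms."""
    pair_counts = Counter()
    for text in collection:
        terms = sorted(set(text.lower().split()))
        for a, b in combinations(terms, 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def expand_with_global_terms(query_terms, pair_counts, top_k=2):
    """Add the top_k terms that co-occur most often with any query term."""
    query = set(query_terms)
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a in query and b not in query:
            scores[b] += count
        elif b in query and a not in query:
            scores[a] += count
    return list(query_terms) + [term for term, _ in scores.most_common(top_k)]

# Hypothetical document collection.
collection = ["car engine repair", "car engine oil", "cat food", "car oil"]
pair_counts = term_cooccurrence(collection)
expanded = expand_with_global_terms(["car"], pair_counts, top_k=2)
print(expanded)
```

Unlike the local techniques, the co-occurrence table is computed once over the whole collection and can be reused for every query.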
Query Expansion Through Global Clustering
Query Expansion Through Global Clustering (QEGC) employs clustering techniques to group semantically
related terms and identify relevant terms for query expansion. It first clusters the terms in the document
collection based on their distributional similarities, resulting in a set of clusters that represent distinct
semantic concepts. Then, it identifies the most representative terms from each cluster and adds them to the
query.
QEGC typically involves the following steps:
Term clustering: Cluster the terms in the document collection using a clustering algorithm, such as k-
means clustering.
Cluster centroid computation: For each cluster, compute the centroid, which represents the most central
term in the cluster.
Term selection: Select the centroids of clusters that have high relevance to the query and add them to the
query.
Reference
1. https://ryansblog.xyz/post/96b548f9-6de3-455b-be47-937c43639e1e (query processing)
2. https://nlp.stanford.edu/IR-book/pdf/09expand.pdf (Relevance Feedback and Query Expansion)
UNIT 4 CLASSIFICATION AND CLUSTERING
Text Classification and Naïve Bayes –Vector Space Classification –Support vector machines and
machine learning on documents –Flat Clustering –Hierarchical Clustering –Matrix decompositions
and latent semantic indexing –Fusion and Meta learning