Wildcard queries let users search for terms that match a pattern containing a wildcard character such as *. Information retrieval systems support them in several ways. Trailing and leading wildcard queries can be answered efficiently with search trees, while more general queries are rewritten and run against a special index, either a permuterm index or a k-gram index. These indexes store rotations or substrings (k-grams) of terms; querying them returns a superset of the potential matches, which is then filtered to keep only the terms that fully satisfy the wildcard query. Although powerful for users, wildcard queries tend to be computationally expensive for the system.
J. Pei: Information Retrieval and Web Search -- Wildcard Queries 2
Inverted Indexes
- Query: Brutus AND Calpurnia

Vocabulary Lookup
- Given an inverted index and a query, we need to determine whether each query term exists in the vocabulary
- If so, identify the pointer to the corresponding postings
- Hashing or search trees?
  - How many keys (terms)?
  - Is the number of keys static or changing a lot?
  - Operations on the keys: insertions only, or insertions + deletions?
  - Relative frequencies of key accesses?

Hashing
- No easy way to find minor variants of a query term
  - Minor variants could be hashed to very different buckets
- Cannot find all terms with the same prefix
- For web search, the vocabulary size keeps growing
  - A hash function may become insufficient after several years

Search Trees
- Easy to find all terms with the same prefix
- Balancing search trees
  - Logarithmic search time
  - Cost: rebalancing

B-trees
- Every internal node has a number of children in interval [a, b]
- Good for disk-based data storage

When Are Wildcard Queries Useful?
- A user is uncertain about the spelling of a query term
  - S*dney → uncertain about Sydney or Sidney
- A user is aware of multiple variants of spelling a term and (consciously) seeks documents containing any of the variants
  - Color versus colour
- A user searches documents containing variants of a term that would be caught by stemming, but is unsure whether the search engine conducts stemming
  - judicia* → judicial versus judiciary
- A user is uncertain about the correct rendition of a foreign word or phrase
  - Universit* Stuttgart
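
As a rough illustration of the hashing-versus-search-tree trade-off in the vocabulary lookup slides, the sketch below contrasts a hash table with a sorted term list. The sorted list searched with `bisect` is only a stand-in for a balanced search tree or B-tree, and the names `terms_with_prefix` and `postings_pointer` are hypothetical, introduced here for illustration.

```python
from bisect import bisect_left


def terms_with_prefix(sorted_vocab, prefix):
    """Return every term in a sorted vocabulary that starts with `prefix`.

    In sorted order all terms sharing a prefix are contiguous, so one
    binary search finds where the run begins.
    """
    start = bisect_left(sorted_vocab, prefix)
    matches = []
    for i in range(start, len(sorted_vocab)):
        if not sorted_vocab[i].startswith(prefix):
            break  # left the contiguous run of matching terms
        matches.append(sorted_vocab[i])
    return matches


vocab = sorted(["lemon", "monday", "money", "month", "moon", "sermon"])

# Hash-based lookup (assumed layout: term -> pointer to its postings).
postings_pointer = {term: i for i, term in enumerate(vocab)}

print("money" in postings_pointer)     # exact-match lookup works
print("mon" in postings_pointer)       # a prefix is not a key
print(terms_with_prefix(vocab, "mon"))  # ordered structure answers prefix lookups
```

The hash table answers exact lookups only; finding every term with prefix mon would require scanning all keys, whereas the ordered structure needs one search plus a short scan.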
Trailing Wildcard Queries
- A trailing wildcard query has only one * symbol, at the end of the search string
  - Example: mon*
- Trailing wildcard queries can be answered efficiently using a search tree
  - Walk down the tree following the symbols m, o, and n in turn
  - Enumerate the set W of terms in the dictionary with the prefix mon
  - Use |W| lookups on the inverted index to retrieve all documents containing any term in W

Leading Wildcard Queries
- A leading wildcard query has only one * symbol, at the beginning of the query
  - Example: *mon
- A leading wildcard query can be answered efficiently using a reverse search tree
  - Each root-to-leaf path corresponds to a term in the dictionary written backwards
  - The term lemon is represented by the path root-n-o-m-e-l

A Little More General Case
- How do we answer queries containing only one * symbol, which can be in any position?
  - Example: se*mon
- Rewrite the query to se* AND *mon
- Use two search trees
  - A search tree to answer query se*, finding the set W of terms
  - A reverse search tree to answer query *mon, finding the set R of terms
  - W ∩ R is the set of terms satisfying the query

General Wildcard Queries
- A general wildcard query can have any number of * symbols at any position
- Framework
  - Rewrite a given wildcard query q as a Boolean query Q on a specially constructed index, such that the answer to Q is a superset of the set of vocabulary terms matching q
  - Check each term in the answer to Q against q, discarding those vocabulary terms that do not match q
- Two methods: permuterm indexes and k-gram indexes
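
The single-* procedure described above (a search tree for the prefix, a reverse search tree for the suffix, then W ∩ R) can be sketched with two tries. This is a minimal sketch, not the slides' implementation: the dictionary-based trie and the names `build_trie`, `walk`, and `single_star_query` are assumptions made for illustration. One added detail is a length check: for a query such as ab*ba, the term aba has the right prefix and suffix only because they overlap, so it is filtered out.

```python
END = ""  # key marking "a term ends here"; cannot clash with a one-character edge


def build_trie(terms, reverse=False):
    """Build a trie over the terms (written backwards if reverse=True)."""
    root = {}
    for term in terms:
        node = root
        for ch in (term[::-1] if reverse else term):
            node = node.setdefault(ch, {})
        node[END] = term  # always store the original spelling
    return root


def walk(root, path):
    """Follow `path` from the root and return all terms stored below it."""
    node = root
    for ch in path:
        node = node.get(ch)
        if node is None:
            return set()
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == END:
                found.add(child)
            else:
                stack.append(child)
    return found


def single_star_query(query, forward, backward):
    """Answer a query with exactly one * as (prefix set W) AND (suffix set R)."""
    if query.count("*") != 1:
        raise ValueError("expected exactly one * in the query")
    prefix, suffix = query.split("*")
    w = walk(forward, prefix)          # terms starting with the prefix
    r = walk(backward, suffix[::-1])   # terms ending with the suffix
    # Drop terms where prefix and suffix would have to overlap.
    return {t for t in w & r if len(t) >= len(prefix) + len(suffix)}


vocab = ["sermon", "seaman", "salmon", "lemon", "month", "aba", "abba"]
forward = build_trie(vocab)
backward = build_trie(vocab, reverse=True)

print(single_star_query("se*mon", forward, backward))  # {'sermon'}
print(single_star_query("ab*ba", forward, backward))   # {'abba'}
```

Trailing and leading wildcard queries fall out as special cases: for mon* the suffix is empty, and for *mon the prefix is empty, so one of the two walks simply returns the whole vocabulary.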
Permuterm Indexes
- Use a special symbol $ to mark the end of a term
  - Term hello is represented as hello$
- A permuterm index contains the various rotations of each term augmented with $, all linked to the original vocabulary term
- The permuterm vocabulary: the set of rotated terms in the permuterm index

Query Answering: One * Symbol
- Rotate a wildcard query so that the * symbol appears at the end of the string
  - Example: rotate m*n to n$m*
- Look up the string in the permuterm index
  - Find terms n$ma and n$moro → man and moron are the answers

Query Answering: Multiple *s
- Example query: q = fi*mo*er
- Conduct query Q = er$fi*
- Check each term returned from Q against q, and search the inverted index only for those terms satisfying q
- Cost: the permuterm index is quite large since it contains all rotations of each term
  - On average, about 10 times larger for English documents

Discussion
- For query q = f*mo*er, we can run queries Q1 = er$f* and Q2 = mo* and obtain the intersection of the answers
  - Is the method good? Why?
- For query q = b*etro*t
  - Run query Q1 = t$b*
  - Run query Q2 = etro*
  - Which way is better? Why?

K-gram Indexes
- A k-gram is a sequence of k characters
- Use symbol $ to denote the beginning and end of a term
  - 3-grams of castle: $ca, cas, ast, stl, tle, le$
- A k-gram index contains all k-grams that occur in any term in the vocabulary
- Each postings list points from a k-gram to all vocabulary terms containing that k-gram
Query Answering
- Example query re*ve
  - Run the Boolean query $re AND ve$
- False positives may happen
  - Query red*
  - Run Boolean query $re AND red
  - Term retired is returned, although it does not match red*
- Postfiltering: check terms returned from the Boolean query against the original query

More on Wildcard Queries
- Wildcard queries can be quite expensive
  - The added lookups in the special index, plus filtering
- Most commonly, the capability of wildcard queries is hidden behind an advanced query interface
  - Most users never use it
- Do not encourage users to invoke wildcard queries when they do not require them
  - Reduces the processing load on a search engine

Summary
- Vocabulary lookup: hashing versus search trees
- Wildcard queries are powerful in search
- Permuterm indexes
- K-gram indexes
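
To make the k-gram query procedure concrete, here is a small sketch of a 3-gram index with the Boolean AND step and postfiltering. It assumes an in-memory dictionary from k-grams to sets of terms; the function names (`kgrams`, `build_kgram_index`, `query_kgrams`, `wildcard_lookup`) are illustrative and not taken from the slides.

```python
import re
from collections import defaultdict


def kgrams(term, k=3):
    """k-grams of a term, with $ marking its beginning and end."""
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]


def build_kgram_index(vocab, k=3):
    """Map each k-gram to the set of vocabulary terms containing it."""
    index = defaultdict(set)
    for term in vocab:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index


def query_kgrams(query, k=3):
    """k-grams usable for a wildcard query; pieces shorter than k give none."""
    grams = []
    for piece in ("$" + query + "$").split("*"):
        grams.extend(piece[i:i + k] for i in range(len(piece) - k + 1))
    return grams


def wildcard_lookup(index, vocab, query, k=3):
    """Return (candidates of the Boolean AND query, terms surviving the postfilter)."""
    candidates = set(vocab)
    for gram in query_kgrams(query, k):
        candidates &= index.get(gram, set())
    pattern = re.compile(".*".join(re.escape(p) for p in query.split("*")))
    return candidates, {t for t in candidates if pattern.fullmatch(t)}


vocab = ["retired", "red", "redo", "revive", "remove", "castle"]
index = build_kgram_index(vocab)

print(kgrams("castle"))        # ['$ca', 'cas', 'ast', 'stl', 'tle', 'le$']
print(query_kgrams("red*"))    # ['$re', 'red']
candidates, matches = wildcard_lookup(index, vocab, "red*")
print(sorted(candidates))      # ['red', 'redo', 'retired']
print(sorted(matches))         # ['red', 'redo']
```

The candidate set for red* contains retired, the false positive from the slide; the regular-expression check plays the role of postfiltering and removes it.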