Tolerant retrieval
Tree: B-tree
[Figure: B-tree over the dictionary, with branches a-hu, hy-m, n-z leading to terms such as aardvark, huygens, sickle, zygot]
Cons:
Slower: O(log M) [and this requires balanced tree]
Query = hel*o
X=hel, Y=o
Lookup o$hel*
Permuterm query processing
Rotate query wild-card to the right
Now use B-tree lookup as before.
Permuterm problem: ≈ quadruples lexicon size
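The permuterm steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: a sorted Python list stands in for the B-tree, the query is assumed to contain exactly one `*`, and the names `rotations`, `build_permuterm`, `rotate_query` and `permuterm_lookup` are hypothetical.

```python
from bisect import bisect_left

def rotations(term):
    """All rotations of term + '$' (the word boundary symbol)."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(terms):
    """Sorted (rotation, term) pairs; a stand-in for the B-tree."""
    return sorted((rot, t) for t in terms for rot in rotations(t))

def rotate_query(query):
    """Rotate X*Y so the wild-card ends up on the right: Y$X (then prefix match)."""
    x, y = query.split("*")  # assumes exactly one '*'
    return y + "$" + x

def permuterm_lookup(index, query):
    """Return all terms having a rotation that starts with the rotated query."""
    prefix = rotate_query(query)
    i = bisect_left(index, (prefix,))  # first entry with rotation >= prefix
    matches = set()
    while i < len(index) and index[i][0].startswith(prefix):
        matches.add(index[i][1])
        i += 1
    return sorted(matches)

index = build_permuterm(["hello", "help", "hero", "halo", "helo"])
print(rotate_query("hel*o"))             # o$hel
print(permuterm_lookup(index, "hel*o"))  # ['hello', 'helo']
```

Each term of length n contributes n + 1 rotations, which is where the growth in lexicon size comes from.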
Bigrams of the text “April is the cruelest month”:
$a,ap,pr,ri,il,l$, $i,is,s$, $t,th,he,e$, $c,cr,ru,
ue,el,le,es,st,t$, $m,mo,on,nt,th,h$
$ is a special word boundary symbol
Maintain a second inverted index from bigrams to
dictionary terms that match each bigram.
Bigram index example
The k-gram index finds terms based on a query
consisting of k-grams
$m → mace, madden
mo → among, amortize
on → among, around
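A bigram index of this kind could be built roughly as below. This is a sketch under assumptions: the function names `kgrams` and `build_kgram_index` are made up for illustration, and the four-term vocabulary is a toy example.

```python
from collections import defaultdict

def kgrams(term, k=2):
    """k-grams of a term, with '$' marking the word boundaries."""
    s = "$" + term + "$"
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def build_kgram_index(terms, k=2):
    """Map each k-gram to the sorted list of dictionary terms containing it."""
    index = defaultdict(set)
    for term in terms:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return {gram: sorted(ts) for gram, ts in index.items()}

index = build_kgram_index(["mace", "madden", "among", "amortize"])
print(kgrams("month"))  # ['$m', 'mo', 'on', 'nt', 'th', 'h$']
print(index["$m"])      # ['mace', 'madden']
print(index["mo"])      # ['among', 'amortize']
```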
Processing k-gram wild-cards
Query mon* can now be run as
$m AND mo AND on
Gets terms that match AND version of our
wildcard query.
But we’d also enumerate moon, which contains all three bigrams yet does not match mon*.
Must post-filter these terms against query.
Surviving enumerated terms are then looked up
in the term-document inverted index.
Fast, space efficient (compared to permuterm).
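The AND-then-post-filter procedure might look like this in code. It is a sketch only: the helper names are hypothetical, the vocabulary is a toy example, and the post-filter uses a regular expression as one possible way to check candidates against the original query.

```python
import re
from collections import defaultdict

def kgrams(s, k=2):
    """Set of k-grams of a string (no boundary symbols added here)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def build_index(terms, k=2):
    """Bigram index over terms padded with the '$' boundary symbol."""
    index = defaultdict(set)
    for term in terms:
        for gram in kgrams("$" + term + "$", k):
            index[gram].add(term)
    return index

def query_kgrams(query, k=2):
    """k-grams of each literal piece of $query$; no k-gram spans a '*'."""
    grams = set()
    for piece in ("$" + query + "$").split("*"):
        grams |= kgrams(piece, k)
    return grams

def wildcard_lookup(index, query, k=2):
    grams = query_kgrams(query, k)
    if grams:
        # AND of the k-gram postings: a superset of the true matches
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        candidates = set().union(*index.values())
    # Post-filter: keep only terms that really match the wild-card query
    pattern = re.compile("".join(".*" if c == "*" else re.escape(c) for c in query))
    return sorted(t for t in candidates if pattern.fullmatch(t))

index = build_index(["moon", "month", "monday", "lemon", "among"])
print(sorted(query_kgrams("mon*")))    # ['$m', 'mo', 'on']
print(wildcard_lookup(index, "mon*"))  # ['monday', 'month']
```

Here moon survives the AND step because it contains $m, mo and on, and is removed only by the post-filter.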
Processing wild-card queries
As before, we must execute a Boolean query for
each enumerated, filtered term.
Wild-cards can result in expensive query
execution (very large disjunctions…)
pyth* AND prog*
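To see why such queries get expensive, here is a rough sketch of how pyth* AND prog* expands into a conjunction of disjunctions. Everything in it is assumed for illustration: the toy postings, the function names `expand` and `wildcard_and`, and the restriction to trailing-`*` queries with a linear scan of the vocabulary.

```python
def expand(wildcard, vocabulary):
    """Enumerate vocabulary terms matching a trailing-'*' query such as pyth*."""
    prefix = wildcard.rstrip("*")
    return [t for t in vocabulary if t.startswith(prefix)]

def wildcard_and(wildcards, postings):
    """AND together the OR of postings for every term each wild-card expands to."""
    vocabulary = sorted(postings)
    result = None
    for wc in wildcards:
        docs = set()
        for term in expand(wc, vocabulary):  # one disjunct per enumerated term
            docs |= postings[term]
        result = docs if result is None else result & docs
    return sorted(result or [])

# Toy term-document postings (document IDs are made up)
postings = {
    "python": {1, 2},
    "pythagoras": {3},
    "program": {2, 3},
    "progress": {4},
    "java": {1},
}
print(wildcard_and(["pyth*", "prog*"], postings))  # [2, 3]
```

The number of postings lists merged grows with the number of terms each wild-card enumerates, which is the cost the slide warns about.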
If you encourage “laziness” people will respond!
Search
Type your search terms, use ‘*’ if you need to.
E.g., Alex* will match Alexander.
Jaccard coefficient (J.C.) of two sets X and Y: |X ∩ Y| / |X ∪ Y|
Equals 1 when X and Y have the same elements
and zero when they are disjoint
X and Y don’t have to be of the same size
Always assigns a number between 0 and 1
Now threshold to decide if you have a match
E.g., if J.C. > 0.8, declare a match
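A small sketch of the Jaccard test on bigram sets follows. The function names are hypothetical, the 0.8 threshold is the one from the example above, and treating two empty sets as identical is an assumed convention.

```python
def bigrams(term):
    """Set of character bigrams of a term (no boundary symbols)."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def jaccard(x, y):
    """|X ∩ Y| / |X ∪ Y|; two empty sets are treated as identical (assumption)."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)

def is_match(a, b, threshold=0.8):
    """Declare a match when the Jaccard coefficient exceeds the threshold."""
    return jaccard(bigrams(a), bigrams(b)) > threshold

print(jaccard(bigrams("lord"), bigrams("lore")))  # 0.5
print(is_match("lord", "lore"))                   # False
print(is_match("lord", "lord"))                   # True
```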
Matching bigrams
Consider the query lord – we wish to identify
words matching 2 of its 3 bigrams (lo, or, rd)
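One way to find terms sharing at least 2 of the 3 bigrams of lord is to count hits while walking the bigram postings. The sketch below assumes a toy vocabulary and hypothetical function names, and follows the example in using bigrams without boundary symbols.

```python
from collections import Counter, defaultdict

def bigrams(term):
    """Set of character bigrams of a term (no boundary symbols)."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def build_bigram_index(terms):
    """Map each bigram to the set of terms containing it."""
    index = defaultdict(set)
    for term in terms:
        for gram in bigrams(term):
            index[gram].add(term)
    return index

def overlap_candidates(index, query, min_overlap=2):
    """Terms sharing at least min_overlap bigrams with the query."""
    counts = Counter()
    for gram in bigrams(query):
        for term in index.get(gram, ()):
            counts[term] += 1
    return sorted(t for t, c in counts.items() if c >= min_overlap)

# Toy vocabulary, chosen only for illustration
terms = ["alone", "lord", "sloth", "border", "morbid", "ardent", "card"]
index = build_bigram_index(terms)
print(overlap_candidates(index, "lord"))  # ['border', 'lord']
```

Note that border is retrieved even though it is not a close spelling of lord, since it shares or and rd.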
Heathrow”
We’d like to respond