Lecture 5: Spell Correction
Introduction to
Information Retrieval
CS276: Information Retrieval and Web Search
Christopher Manning and Pandu Nayak
WILD-CARD QUERIES
Wild-card queries: *
§ mon*: find all docs containing any word beginning
with “mon”.
§ Easy with binary tree (or B-tree) dictionary: retrieve
all words in range: mon ≤ w < moo
§ *mon: find words ending in “mon”: harder
§ Maintain an additional B-tree for terms backwards.
Can retrieve all words in range: nom ≤ w < non.
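A minimal sketch of both range lookups in Python, with a sorted list and bisect standing in for the B-tree's ordered keys (the toy vocabulary is illustrative):

```python
import bisect

# Sorted term list stands in for the B-tree's key order (toy vocabulary).
vocab = sorted(["moan", "mon", "monday", "money", "month", "moo", "moon", "salmon"])

def prefix_terms(prefix):
    """All terms w with prefix <= w < next(prefix), e.g. mon <= w < moo."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)   # "mon" -> "moo"
    lo = bisect.bisect_left(vocab, prefix)
    hi = bisect.bisect_left(vocab, upper)
    return vocab[lo:hi]

# For *mon, keep a second sorted list of reversed terms: nom <= w < non.
rvocab = sorted(w[::-1] for w in vocab)

def suffix_terms(suffix):
    rev = suffix[::-1]                               # "mon" -> "nom"
    upper = rev[:-1] + chr(ord(rev[-1]) + 1)         # "nom" -> "non"
    lo = bisect.bisect_left(rvocab, rev)
    hi = bisect.bisect_left(rvocab, upper)
    return [w[::-1] for w in rvocab[lo:hi]]

print(prefix_terms("mon"))   # ['mon', 'monday', 'money', 'month']
print(suffix_terms("mon"))   # ['mon', 'salmon']
```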
Query processing
§ At this point, we have an enumeration of all terms in
the dictionary that match the wild-card query.
§ We still have to look up the postings for each
enumerated term.
§ E.g., consider the query:
se*ate AND fil*er
This may result in the execution of many Boolean
AND queries.
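A hedged sketch of that lookup step: union the postings of every term matching each wildcard, then intersect. The postings data is made up, and a linear scan stands in for the dictionary range lookup:

```python
# Toy postings lists: term -> set of doc IDs (illustrative data).
postings = {
    "senate": {1, 4}, "separate": {2},
    "filter": {2, 3}, "filibuster": {4},
}

def wildcard_docs(prefix, suffix):
    """Union the postings of every term matching prefix*suffix."""
    docs = set()
    for term in postings:                 # a linear scan stands in for the B-tree
        if (term.startswith(prefix) and term.endswith(suffix)
                and len(term) >= len(prefix) + len(suffix)):
            docs |= postings[term]        # OR together matching terms' postings
    return docs

# se*ate AND fil*er: one Boolean AND over the two unions.
print(wildcard_docs("se", "ate") & wildcard_docs("fil", "er"))   # {2, 4}
```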
Permuterm index
§ Add a $ to the end of each term
§ Rotate the resulting term and index each rotation in a B-tree
§ For term hello, index under:
§ hello$, ello$h, llo$he, lo$hel, o$hell, $hello
where $ is a special symbol.
Each rotation is a B-tree key pointing back to the original term hello. Empirically, the dictionary quadruples in size.
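A small sketch of permuterm construction and lookup, assuming single-* patterns; a plain dict of rotations stands in for the B-tree:

```python
def rotations(term):
    """All rotations of term + '$': the permuterm keys for this term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

# Permuterm index: rotation -> original term (a dict stands in for the B-tree).
permuterm = {rot: term
             for term in ["hello", "help", "hollow"]
             for rot in rotations(term)}

def lookup(pattern):
    """Answer X*Y by rotating it so the * lands at the end: X*Y -> Y$X*."""
    x, y = pattern.split("*")             # single-* patterns only in this sketch
    key = y + "$" + x
    return {term for rot, term in permuterm.items() if rot.startswith(key)}

print(lookup("h*lo"))    # {'hello'}  (matches via rotation 'lo$hel')
print(lookup("hel*"))    # {'hello', 'help'}  (key '$hel')
```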
Bigram (k-gram) index example: the 2-grams in the text april is the cruelest month, with $ marking word boundaries:

$a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$

In the index, each k-gram points to the dictionary terms containing it:

$m → mace, madden, …
mo → among, amortize, …
on → along, among, …
Processing wild-cards
§ Query mon* can now be run as
§ $m AND mo AND on
§ Gets terms that match the AND version of our wildcard query.
§ But we’d enumerate moon.
§ Must post-filter these terms against query.
§ Surviving enumerated terms are then looked up in
the term-document inverted index.
§ Fast, space efficient (compared to permuterm).
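A sketch of the whole pipeline on the mon* example, over a made-up vocabulary; note how moon survives the bigram intersection and is then removed by the post-filter:

```python
from collections import defaultdict
import re

def bigrams(term):
    t = "$" + term + "$"                       # $ marks the term boundaries
    return {t[i:i + 2] for i in range(len(t) - 1)}

# Bigram index: 2-gram -> terms containing it (toy vocabulary).
index = defaultdict(set)
for term in ["moan", "monday", "month", "moon"]:
    for g in bigrams(term):
        index[g].add(term)

# mon*  ->  $m AND mo AND on ...
candidates = index["$m"] & index["mo"] & index["on"]
print(candidates)                              # {'monday', 'month', 'moon'}

# ... then post-filter against the real pattern to drop 'moon'.
survivors = {t for t in candidates if re.fullmatch("mon.*", t)}
print(survivors)                               # {'monday', 'month'}
```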
Search
Type your search terms, use ‘*’ if you need to.
E.g., Alex* will match Alexander.
SPELLING CORRECTION
Web search
Spelling Tasks
§ Spelling Error Detection
§ Spelling Error Correction:
§ Autocorrect
§ hte → the
§ Suggest a correction
§ Suggestion lists
Terminology
§ We just discussed character bigrams and k-grams:
§ st, pr, an …
§ We can also have word bigrams and n-grams:
§ palo alto, flying from, road repairs
Noisy channel model

ŵ = argmax_{w∈V} P(w | x)
  = argmax_{w∈V} P(x | w) P(w) / P(x)        (Bayes)
  = argmax_{w∈V} P(x | w) P(w)               (channel model × prior; P(x) is constant)
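The decision rule in one line of code, assuming channel and prior functions like those sketched later in the lecture:

```python
def correct(x, candidates, channel, prior):
    """Noisy channel decoder: pick the w maximizing P(x|w) * P(w)."""
    return max(candidates, key=lambda w: channel(x, w) * prior(w))
```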
Spelling error example: acress
Candidate generation
§ Words with similar spelling
§ Small edit distance to error
§ Words with similar pronunciation
§ Small distance of pronunciation to error
Candidate Testing:
Damerau-Levenshtein edit distance
§ Minimal edit distance between two strings, where
edits are:
§ Insertion
§ Deletion
§ Substitution
§ Transposition of two adjacent letters
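A sketch of the distance computation (the common "optimal string alignment" variant of Damerau-Levenshtein):

```python
def damerau_levenshtein(s, t):
    """Edit distance with insertion, deletion, substitution, transposition."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                               # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution (or match)
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(damerau_levenshtein("acress", "actress"))   # 1 (insert 't')
print(damerau_levenshtein("acress", "caress"))    # 1 (transpose 'ac' -> 'ca')
```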
Candidate generation
§ 80% of errors are within edit distance 1
§ Almost all errors within edit distance 2
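Since most errors are one edit away, a cheap candidate generator enumerates every string at distance 1 and keeps the dictionary words, in the style of Peter Norvig's well-known corrector (the dictionary here is illustrative):

```python
import string

def edits1(word):
    """Every string one insert/delete/substitute/transpose away from word."""
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R
                                          for c in string.ascii_lowercase]
    inserts    = [L + c + R               for L, R in splits
                                          for c in string.ascii_lowercase]
    return set(deletes + transposes + replaces + inserts)

dictionary = {"across", "actress", "acres", "caress", "access", "cress"}
# All six candidates for 'acress' are one edit away:
print(edits1("acress") & dictionary)
```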
A paradigm …
§ We want the best spell corrections
§ Instead of finding the very best, we
§ Find a subset of pretty good corrections
§ (say, edit distance at most 2)
§ Find the best amongst them
§ These may not be the actual best
§ This is a recurring paradigm in IR including finding
the best docs for a query, best answers, best ads …
§ Find a good candidate set
§ Find the top K amongst them and return them as the best
ŵ = argmax_{w∈V} P(w | x)
  = argmax_{w∈V} P(x | w) P(w) / P(x)
  = argmax_{w∈V} P(x | w) P(w)               What's P(w)?
Language Model
§ Take a big supply of words (your document collection
with T tokens); let C(w) = # occurrences of w
P(w) = C(w) / T
§ In other applications, you can take the supply to be typed queries (suitably filtered), for when a static dictionary is inadequate
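A minimal sketch of this estimate:

```python
from collections import Counter

def unigram_lm(tokens):
    """P(w) = C(w) / T over a token stream."""
    counts = Counter(tokens)
    T = sum(counts.values())
    return lambda w: counts[w] / T

P = unigram_lm("the cat sat on the mat the end".split())
print(P("the"))    # 3 / 8 = 0.375
```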
Nearby keys
Channel model: P(x | w) from confusion-matrix counts

P(x | w) =
  del[w_{i-1}, w_i] / count[w_{i-1} w_i]         if deletion
  ins[w_{i-1}, x_i] / count[w_{i-1}]             if insertion
  sub[x_i, w_i] / count[w_i]                     if substitution
  trans[w_i, w_{i+1}] / count[w_i w_{i+1}]       if transposition
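A sketch of one branch of this formula; the counts are invented, chosen only so the deletion example reproduces the .000117 that appears in the acress table below:

```python
# Invented counts, chosen so the deletion branch reproduces the .000117
# channel probability for acress -> actress in the table below.
del_counts   = {("c", "t"): 117}        # 't' dropped after 'c' in the error corpus
count_bigram = {("c", "t"): 1_000_000}  # occurrences of 'ct' in the corpus

def p_deletion(prev_char, dropped_char):
    """Deletion branch: del[w_{i-1}, w_i] / count[w_{i-1} w_i]."""
    pair = (prev_char, dropped_char)
    return del_counts.get(pair, 0) / count_bigram[pair]

print(p_deletion("c", "t"))   # 0.000117
```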
Channel model applied to acress:

Candidate   Correct letter   Error letter   x|w    P(x|w)       P(w)          10^9 · P(x|w) P(w)
actress     t                -              c|ct   .000117      .0000231      2.7
cress       -                a              a|#    .00000144    .000000544    .00078
Evaluation
§ Some spelling error test sets
§ Wikipedia’s list of common English misspellings
§ Aspell filtered version of that list
§ Birkbeck spelling error corpus
§ Peter Norvig’s list of errors (includes Wikipedia and
Birkbeck, for training or testing)
ŵ = argmax_{w∈V} P(w | x) = argmax_{w∈V} P(x | w) P(w)
Real-word correction: each typed word gets a candidate set, forming a lattice, e.g.

{to, too, two} × {on, of} × {threw, the, thaw}

and the correction is the highest-probability word sequence through the lattice.
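A brute-force sketch of picking the best path through such a lattice with a bigram language model; all probabilities here are made up for illustration:

```python
from itertools import product

# Candidate sets per typed word (the lattice above).
lattice = [{"to", "too", "two"}, {"on", "of"}, {"threw", "the", "thaw"}]

# Toy bigram probabilities; unseen bigrams get a tiny floor.
bigram_p = {("two", "of"): 1e-2, ("of", "the"): 5e-2}

def score(seq):
    p = 1.0
    for a, b in zip(seq, seq[1:]):
        p *= bigram_p.get((a, b), 1e-9)
    return p

best = max(product(*lattice), key=score)   # brute force over 3 x 2 x 3 paths
print(best)                                # ('two', 'of', 'the')
```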
Probability of no error
§ What is the channel probability for a correctly typed
word?
§ P(“the”|“the”)
§ If you have a big corpus, you can estimate it as the fraction of words that are typed correctly
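One common way to fold this into the channel model (the 0.95 figure is a placeholder assumption, not from the lecture):

```python
def channel(x, w, p_edit, alpha=0.95):
    """Reserve probability alpha for 'no error'; alpha = 0.95 is a placeholder
    for the corpus-estimated fraction of correctly typed words."""
    return alpha if x == w else (1 - alpha) * p_edit(x, w)
```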