Lecture3 Tolerant Retrieval Handout 6 Per
Introduction to Information Retrieval Sec. 3.1
WILD-CARD QUERIES
Wild-card queries: *
mon*: find all docs containing any word beginning with “mon”.
  Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo
*mon: find words ending in “mon”: harder
  Maintain an additional B-tree for terms backwards.
  Can retrieve all words in range: nom ≤ w < non.
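
The two range lookups above can be sketched with a sorted list standing in for the B-tree. This is a minimal illustration, not the lecture's implementation: the sample lexicon and the function names `prefix_range` and `leading_wildcard` are assumptions, and the prefix is assumed to be non-empty.

```python
from bisect import bisect_left

def prefix_range(sorted_terms, prefix):
    """Return every term w with prefix <= w < upper, e.g. mon <= w < moo."""
    # Upper bound: bump the last character of the prefix ("mon" -> "moo").
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]

# Hypothetical lexicon, kept sorted as a stand-in for the B-tree.
lexicon = sorted(["demon", "lemon", "money", "monk", "month", "moon", "salmon"])
# The additional structure for terms written backwards.
reversed_lexicon = sorted(t[::-1] for t in lexicon)

def leading_wildcard(suffix):
    """Answer *suffix by a prefix lookup on the reversed lexicon."""
    hits = prefix_range(reversed_lexicon, suffix[::-1])  # nom <= w < non
    return sorted(t[::-1] for t in hits)

print(prefix_range(lexicon, "mon"))   # terms matching mon*
print(leading_wildcard("mon"))        # terms matching *mon
```

Each matching term would then be looked up in the inverted index to get the documents.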
Introduction to Information Retrieval Sec. 3.2.1
Query = hel*o
X=hel, Y=o
Lookup o$hel*
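
The lookup key o$hel* has the form used by a permuterm index, which stores every rotation of term + "$". The surviving text does not name the index, so treat that reading as an assumption. A minimal sketch, with a hypothetical vocabulary and hypothetical function names, handling queries with exactly one *:

```python
from bisect import bisect_left

def rotations(term):
    """All rotations of term + '$' (the end-of-term marker)."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]

def build_permuterm(terms):
    """Sorted list of (rotation, original term) pairs."""
    return sorted((rot, t) for t in terms for rot in rotations(t))

def wildcard_lookup(index, query):
    """Answer X*Y by a prefix lookup on Y$X."""
    x, y = query.split("*")          # assumes exactly one '*'
    key = y + "$" + x                # hel*o -> o$hel
    upper = key[:-1] + chr(ord(key[-1]) + 1)
    # A 1-tuple sorts before any (rotation, term) pair sharing its first field.
    lo = bisect_left(index, (key,))
    hi = bisect_left(index, (upper,))
    return sorted({term for _, term in index[lo:hi]})

index = build_permuterm(["hello", "help", "helio", "halo", "hero"])
print(wildcard_lookup(index, "hel*o"))
```

The price of this scheme is index size: a term of length n contributes n + 1 rotations.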
Introduction to Information Retrieval Sec. 3.2.2
SPELLING CORRECTION
Spell correction
Two principal uses
  Correcting document(s) being indexed
  Correcting user queries to retrieve “right” answers
Two main flavors:
  Isolated word
    Check each word on its own for misspelling
    Will not catch typos resulting in correctly spelled words, e.g., from → form
  Context-sensitive
    Look at surrounding words, e.g., I flew form Heathrow to Narita.
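
To see why isolated-word checking misses the from → form typo, here is a toy sketch. The tiny dictionary and the function name `isolated_word_errors` are assumptions made for illustration only.

```python
# Hypothetical toy dictionary; a real one would hold the full vocabulary.
DICTIONARY = {"i", "flew", "from", "form", "heathrow", "to", "narita"}

def isolated_word_errors(sentence, dictionary=DICTIONARY):
    """Flag each word that is not in the dictionary, ignoring its neighbours."""
    words = [w.strip(".,").lower() for w in sentence.split()]
    return [w for w in words if w not in dictionary]

# "form" is a valid word, so nothing is flagged even though "from" was meant.
print(isolated_word_errors("I flew form Heathrow to Narita."))
# A non-word typo is caught.
print(isolated_word_errors("I flew fro Heathrow to Narita."))
```

Catching the first case needs the context-sensitive flavor, which looks at the surrounding words.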
Introduction to Information Retrieval Sec. 3.3.4
Jaccard coefficient (J.C.) of sets X and Y: |X ∩ Y| / |X ∪ Y|
  Equals 1 when X and Y have the same elements and zero when they are disjoint
  X and Y don’t have to be of the same size
  Always assigns a number between 0 and 1
  Now threshold to decide if you have a match
  E.g., if J.C. > 0.8, declare a match
Bigram postings:
  lo → alone, lore, sloth
  or → border, lore, morbid
  rd → ardent, border, card
Standard postings “merge” will enumerate …
Adapt this to using Jaccard (or another) measure.
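
A minimal sketch of thresholding on the Jaccard coefficient over a bigram index. The vocabulary is the set of words in the postings above. The query "lord" is an assumption inferred from the three bigrams lo, or, rd, and the function names and the 0.3 threshold in the usage are illustrative only. Bigrams are taken without word-boundary markers.

```python
from collections import defaultdict

def kgrams(word, k=2):
    """Set of character k-grams of a word."""
    return {word[i:i + k] for i in range(len(word) - k + 1)}

def jaccard(x, y):
    """|X ∩ Y| / |X ∪ Y|; two empty sets are treated as identical."""
    union = x | y
    return len(x & y) / len(union) if union else 1.0

def build_kgram_index(vocabulary, k=2):
    """Map each k-gram to the set of terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index

def candidates(query, index, threshold, k=2):
    """Terms sharing a k-gram with the query whose Jaccard score exceeds threshold."""
    q = kgrams(query, k)
    seen = set()
    for gram in q:                       # union of the query's postings lists
        seen |= index.get(gram, set())
    scored = [(t, jaccard(q, kgrams(t, k))) for t in seen]
    kept = [(t, s) for t, s in scored if s > threshold]
    return sorted(kept, key=lambda p: (-p[1], p[0]))

vocab = ["alone", "lore", "sloth", "border", "morbid", "ardent", "card"]
index = build_kgram_index(vocab)
for term, score in candidates("lord", index, threshold=0.3):
    print(term, round(score, 3))
```

With a threshold as strict as 0.8 none of these words would match "lord", which shows how strongly the threshold controls the candidate set. This sketch recomputes each candidate's k-gram set rather than counting hits during the postings merge.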
Introduction to Information Retrieval Sec. 3.4
http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top
Exercise
  Draw yourself a diagram showing the various indexes in a search engine incorporating all the functionality we have talked about
  Identify some of the key design choices in the index pipeline:
    Does stemming happen before the Soundex index?
    What about n-grams?
  Given a query, how would you parse and dispatch sub-queries to the various indexes?

Resources
  IIR 3, MG 4.2
  Efficient spell retrieval:
    K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys 24(4), Dec 1992.
    J. Zobel and P. Dart. Finding approximate matches in large lexicons. Software - practice and experience 25(3), March 1995.
      http://citeseer.ist.psu.edu/zobel95finding.html
    Mikael Tillenius: Efficient Generation and Ranking of Spelling Error Corrections. Master’s thesis at Sweden’s Royal Institute of Technology.
      http://citeseer.ist.psu.edu/179155.html
  Nice, easy reading on spell correction:
    Peter Norvig: How to write a spelling corrector
      http://norvig.com/spell-correct.html