Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Christopher Manning and Prabhakar Raghavan
Lecture 2: The term vocabulary and postings lists

Recap of the previous lecture
- Basic inverted indexes:
  - Structure: Dictionary and Postings
  - Key step in construction: Sorting
- Boolean query processing
  - Intersection by linear time "merging" lists
  - Simple optimizations
- Overview of course topics
Plan for this lecture
- Elaborate basic indexing
- Preprocessing to form the term vocabulary
  - Documents
  - Tokenization
  - What terms do we put in the index?
- Postings
  - Faster merges: skip lists
  - Positional postings and phrase queries

Recall the basic indexing pipeline

  Documents to be indexed:  Friends, Romans, countrymen.
            |  Tokenizer
  Token stream:             Friends  Romans  Countrymen
            |  Linguistic modules
  Modified tokens:          friend  roman  countryman
            |  Indexer
  Inverted index:           friend     -> 2 -> 4
                            roman      -> 1 -> 2
                            countryman -> 13 -> 16
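The whole pipeline can be sketched in a few lines of Python. This is an illustrative toy: the regex tokenizer and the bare lower-casing "linguistic module" are stand-ins, not the components discussed in the rest of this lecture.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Crude tokenizer: split on non-letters (see the tokenization issues below).
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]

def normalize(token):
    # Stand-in "linguistic module": just case folding here.
    return token.lower()

def build_index(docs):
    # docs: docID -> text; returns term -> sorted list of docIDs.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

print(build_index({2: "Friends, Romans, countrymen.", 4: "friends"}))
# {'friends': [2, 4], 'romans': [2], 'countrymen': [2]}
```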
Parsing a document
- What format is it in? (pdf/word/excel/html?)
- What language is it in?
- What character set is in use?

Each of these is a classification problem, which we will study later in the course.
But these tasks are often done heuristically …

Complications: Format/language
- Documents being indexed can include docs from many different languages
  - A single index may have to contain terms of several languages.
- Sometimes a document or its components can contain multiple languages/formats
  - French email with a German pdf attachment.
- What is a unit document?
  - A file?
  - An email? (Perhaps one of many in an mbox.)
  - An email with 5 attachments?
  - A group of files (PPT or LaTeX as HTML pages)
TOKENS AND TERMS

Tokenization (Sec. 2.2.1)
- Input: "Friends, Romans and Countrymen"
- Output: Tokens
  - Friends
  - Romans
  - Countrymen
- A token is an instance of a sequence of characters
- Each such token is now a candidate for an index entry, after further processing
  - Described below
- But what are valid tokens to emit?
Tokenization (Sec. 2.2.1)
- Issues in tokenization:
  - Finland's capital → Finland? Finlands? Finland's?
  - Hewlett-Packard → Hewlett and Packard as two tokens?
    - state-of-the-art: break up hyphenated sequence.
    - co-education
    - lowercase, lower-case, lower case?
    - It can be effective to get the user to put in possible hyphens
  - San Francisco: one token or two?
    - How do you decide it is one token?

Numbers
- 3/20/91    Mar. 12, 1991    20/3/91
- 55 B.C.
- B-52
- My PGP key is 324a3df234cb23e
- (800) 234-2333
  - Often have embedded spaces
- Older IR systems may not index numbers
  - But often very useful: think about things like looking up error codes/stacktraces on the web
  - (One answer is using n-grams: Lecture 3)
- Will often index "meta-data" separately
  - Creation date, format, etc.
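The hyphen/apostrophe choices are easy to see in code. Here are two toy tokenizers in Python; both regexes are illustrative assumptions, not a recommended design.

```python
import re

def split_tokens(text):
    # Break on everything non-alphanumeric, including hyphens and apostrophes.
    return re.findall(r"[A-Za-z0-9]+", text)

def keep_hyphens(text):
    # Treat internal hyphens and apostrophes as part of the token.
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)

print(split_tokens("Hewlett-Packard's state-of-the-art"))
# ['Hewlett', 'Packard', 's', 'state', 'of', 'the', 'art']
print(keep_hyphens("Hewlett-Packard's state-of-the-art"))
# ["Hewlett-Packard's", 'state-of-the-art']
```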
Tokenization: language issues
- French
  - L'ensemble → one token or two?
    - L ? L' ? Le ?
    - Want l'ensemble to match with un ensemble
      - Until at least 2003, it didn't on Google
        - Internationalization!
- German noun compounds are not segmented
  - Lebensversicherungsgesellschaftsangestellter
  - 'life insurance company employee'
  - German retrieval systems benefit greatly from a compound splitter module
    - Can give a 15% performance boost for German

Tokenization: language issues
- Chinese and Japanese have no spaces between words:
  - 莎拉波娃现在居住在美国东南部的佛罗里达。
  - Not always guaranteed a unique tokenization
- Further complicated in Japanese, with multiple alphabets intermingled
  - Dates/amounts in multiple formats:
    フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
    (Katakana, Hiragana, Kanji, Romaji)
- End-user can express query entirely in hiragana!
Tokenization: language issues (Sec. 2.2.1)
- Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
- Words are separated, but letter forms within a word form complex ligatures

    ←  →  ←  →  ← start
  'Algeria achieved its independence in 1962 after 132 years of French occupation.'

- With Unicode, the surface presentation is complex, but the stored form is straightforward

Stop words (Sec. 2.2.2)
- With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:
  - They have little semantic content: the, a, and, to, be
  - There are a lot of them: ~30% of postings for top 30 words
- But the trend is away from doing this:
  - Good compression techniques (lecture 5) mean the space for including stop words in a system is very small
  - Good query optimization techniques (lecture 7) mean you pay little at query time for including stop words
  - You need them for:
    - Phrase queries: "King of Denmark"
    - Various song titles, etc.: "Let it be", "To be or not to be"
    - "Relational" queries: "flights to London"
Normalization to terms
- We need to "normalize" words in indexed text as well as query words into the same form
  - We want to match U.S.A. and USA
- Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary
- We most commonly implicitly define equivalence classes of terms by, e.g.,
  - deleting periods to form a term
    - U.S.A., USA → USA
  - deleting hyphens to form a term
    - anti-discriminatory, antidiscriminatory → antidiscriminatory

Normalization: other languages
- Accents: e.g., French résumé vs. resume.
- Umlauts: e.g., German: Tuebingen vs. Tübingen
  - Should be equivalent
- Most important criterion:
  - How are your users likely to write their queries for these words?
- Even in languages that standardly have accents, users often may not type them
  - Often best to normalize to a de-accented term
    - Tuebingen, Tübingen, Tubingen → Tubingen
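A sketch of such equivalence classing in Python. Case folding, period/hyphen deletion, and de-accenting via Unicode NFKD decomposition are one plausible combination of transformations, not the definitive set:

```python
import unicodedata

def normalize(token):
    # Map a token to an (assumed) equivalence-class representative:
    # case-fold, delete periods and hyphens, then strip accents/umlauts
    # by dropping combining marks after NFKD decomposition.
    t = token.lower().replace(".", "").replace("-", "")
    decomposed = unicodedata.normalize("NFKD", t)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

for w in ["U.S.A.", "anti-discriminatory", "Tübingen", "Tuebingen"]:
    print(w, "->", normalize(w))
# U.S.A. -> usa
# anti-discriminatory -> antidiscriminatory
# Tübingen -> tubingen
# Tuebingen -> tuebingen   (note: ue vs. u is NOT unified by this sketch)
```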
Normalization: other languages
- Normalization of things like date forms
  - 7月30日 vs. 7/30
  - Japanese use of kana vs. Chinese characters
- Tokenization and normalization may depend on the language and so is intertwined with language detection
  - e.g., Morgen will ich in MIT … Is this German "mit"?
- Crucial: Need to "normalize" indexed text as well as query terms into the same form

Case folding
- Reduce all letters to lower case
  - exception: upper case in mid-sentence?
    - e.g., General Motors
    - Fed vs. fed
    - SAIL vs. sail
  - Often best to lower case everything, since users will use lowercase regardless of 'correct' capitalization…
- Google example:
  - Query C.A.T.
  - #1 result is for "cat" (well, Lolcats) not Caterpillar Inc.
Normalization to terms (Sec. 2.2.3)
- Do we handle synonyms and homonyms?
  - E.g., by hand-constructed equivalence classes
    - car = automobile     color = colour
  - We can rewrite to form equivalence-class terms
    - When the document contains automobile, index it under car-automobile (and vice-versa)
  - Or we can expand a query
    - When the query contains automobile, look under car as well
  - Potentially more powerful, but less efficient

Thesauri and soundex
- An alternative to equivalence classing is to do asymmetric expansion
  - An example of where this may be useful
    - Enter: window     Search: window, windows
    - Enter: windows    Search: Windows, windows, window
    - Enter: Windows    Search: Windows
- What about spelling mistakes?
  - One approach is soundex, which forms equivalence classes of words based on phonetic heuristics
- More in lectures 3 and 9
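For concreteness, a minimal version of classic soundex in Python (simplified: the rule that h/w between same-coded consonants are transparent is ignored here):

```python
def soundex(word):
    # Keep the first letter; code later consonants; drop vowels, h, w, y;
    # collapse adjacent equal codes; pad/truncate to 4 characters.
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

print(soundex("Hermann"), soundex("Herman"))  # H655 H655 -- same class
```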
Lemmatization
- Reduce inflectional/variant forms to base form
- E.g.,
  - am, are, is → be
  - car, cars, car's, cars' → car
- the boy's cars are different colors → the boy car be different color
- Lemmatization implies doing "proper" reduction to dictionary headword form

Stemming
- Reduce terms to their "roots" before indexing
- "Stemming" suggests crude affix chopping
  - language dependent
  - e.g., automate(s), automatic, automation all reduced to automat.

    for example compressed and          for exampl compress and
    compression are both accepted   →   compress ar both accept
    as equivalent to compress.          as equival to compress
Porter's algorithm
- Commonest algorithm for stemming English
  - Results suggest it's at least as good as other stemming options
- Conventions + 5 phases of reductions
  - phases applied sequentially
  - each phase consists of a set of commands
  - sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.

Typical rules in Porter
- sses → ss
- ies → i
- ational → ate
- tional → tion
- Weight-of-word-sensitive rules:
  - (m>1) EMENT →
    - replacement → replac
    - cement → cement
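A toy sketch of applying one such compound command in Python. This is a tiny fragment, not the real five-phase Porter algorithm, and the measure m is approximated very crudely:

```python
import re

def measure(stem):
    # Porter's m counts vowel-consonant sequences in [C](VC)^m[V];
    # crude approximation: count vowel-run/consonant-run transitions.
    return len(re.findall(r"[aeiou]+[^aeiou]+", stem))

def apply_command(word, rules):
    # rules: (suffix, replacement, min_m); fire only if measure(stem) > min_m.
    # Per Porter's convention, the longest matching suffix decides, even if
    # its condition then fails (so "cement" is left alone).
    for suffix, repl, min_m in sorted(rules, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            return stem + repl if measure(stem) > min_m else word
    return word

rules = [("sses", "ss", -1), ("ies", "i", -1),
         ("ational", "ate", -1), ("tional", "tion", -1), ("ement", "", 1)]
for w in ["caresses", "ponies", "relational", "replacement", "cement"]:
    print(w, "->", apply_command(w, rules))
# caresses -> caress   ponies -> poni   relational -> relate
# replacement -> replac   cement -> cement
```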
Other stemmers (Sec. 2.2.4)
- Other stemmers exist, e.g., Lovins stemmer
  http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
  - Single-pass, longest suffix removal (about 250 rules)
- Full morphological analysis – at most modest benefits for retrieval
- Do stemming and other normalizations help?
  - English: very mixed results. Helps recall for some queries but harms precision on others
    - E.g., operative (dentistry) ⇒ oper
  - Definitely useful for Spanish, German, Finnish, …
    - 30% performance gains for Finnish!

Language-specificity (Sec. 2.2.4)
- Many of the above features embody transformations that are
  - Language-specific and
  - Often, application-specific
- These are "plug-in" addenda to the indexing process
- Both open source and commercial plug-ins are available for handling these
Dictionary entries – first cut
  ensemble.french
  時間.japanese

Recall basic merge
- Walk through the two postings simultaneously, in time linear in the total number of postings entries:

    Brutus:  2 -> 4 -> 8 -> 41 -> 48 -> 64 -> 128
    Caesar:  1 -> 2 -> 3 -> 8 -> 11 -> 17 -> 21 -> 31
    Result:  2, 8

- If the list lengths are m and n, the merge takes O(m+n) operations.
- Can we do better? Yes (if index isn't changing too fast).
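For reference, the O(m+n) merge as a Python sketch (the skip-pointer improvement follows below):

```python
def intersect(p1, p2):
    # Linear O(m+n) merge of two sorted postings lists.
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 41, 48, 64, 128], [1, 2, 3, 8, 11, 17, 21, 31]))
# [2, 8]
```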
Augment postings with skip pointers (at indexing time)

              41                  128
    Brutus:  2 -> 4 -> 8 -> 41 -> 48 -> 64 -> 128

             11                31
    Caesar:  1 -> 2 -> 3 -> 8 -> 11 -> 17 -> 21 -> 31

- Why? To skip postings that will not figure in the search results.
- How?
- Where do we place skip pointers?
Query processing with skip pointers (Sec. 2.3)

              41                  128
    Brutus:  2 -> 4 -> 8 -> 41 -> 48 -> 64 -> 128

             11                31
    Caesar:  1 -> 2 -> 3 -> 8 -> 11 -> 17 -> 21 -> 31

- Suppose we have stepped through the lists until we match 8 on each and advance. We then compare 41 (top) with 11 (bottom). 11 is smaller, but its skip successor is 31, which is still ≤ 41, so we can jump straight to 31 without examining 17 and 21.

Where do we place skips? (Sec. 2.3)
- Tradeoff:
  - More skips → shorter skip spans ⇒ more likely to skip. But lots of comparisons to skip pointers.
  - Fewer skips → fewer pointer comparisons, but then long skip spans ⇒ few successful skips.
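A sketch of the skipping intersection in Python, in the spirit of IIR Figure 2.10; the √L placement and the dict-of-skip-targets representation are implementation assumptions:

```python
import math

def add_skips(postings):
    # ~sqrt(L) evenly spaced skips; skips[i] is the index reachable from i.
    step = max(1, math.isqrt(len(postings)))
    return {i: i + step for i in range(0, len(postings) - step, step)}

def intersect_with_skips(p1, p2):
    s1, s2 = add_skips(p1), add_skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            # Take a skip only if its target doesn't overshoot p2[j].
            i = s1[i] if i in s1 and p1[s1[i]] <= p2[j] else i + 1
        else:
            j = s2[j] if j in s2 and p2[s2[j]] <= p1[i] else j + 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))  # [2, 8]
```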
Placing skips
- Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers.
- This ignores the distribution of query terms.
- Easy if the index is relatively static; harder if L keeps changing because of updates.
- This definitely used to help; with modern hardware it may not (Bahle et al. 2002), unless you're memory-based
  - The I/O cost of loading a bigger postings list can outweigh the gains from quicker in-memory merging!

PHRASE QUERIES AND POSITIONAL INDEXES
Phrase queries
- Want to be able to answer queries such as "stanford university" – as a phrase
- Thus the sentence "I went to university at Stanford" is not a match.
  - The concept of phrase queries has proven easily understood by users; one of the few "advanced search" ideas that works
  - Many more queries are implicit phrase queries
- For this, it no longer suffices to store only <term : docs> entries

A first attempt: Biword indexes
- Index every consecutive pair of terms in the text as a phrase
- For example the text "Friends, Romans, Countrymen" would generate the biwords
  - friends romans
  - romans countrymen
- Each of these biwords is now a dictionary term
- Two-word phrase query-processing is now immediate.
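A biword indexer is a few lines of Python (the crude normalization inside is a placeholder for the pipeline described earlier):

```python
from collections import defaultdict

def biword_index(docs):
    # Index every consecutive pair of (already normalized) terms as a phrase.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        terms = text.lower().replace(",", "").split()
        for first, second in zip(terms, terms[1:]):
            index[f"{first} {second}"].add(doc_id)
    return index

idx = biword_index({1: "Friends, Romans, Countrymen"})
print(sorted(idx))  # ['friends romans', 'romans countrymen']
```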
Longer phrase queries (Sec. 2.4.1)
- Longer phrases are processed as we did with wild-cards:
- stanford university palo alto can be broken into the Boolean query on biwords:
    stanford university AND university palo AND palo alto
- Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase.

  Can have false positives!

Extended biwords (Sec. 2.4.1)
- Parse the indexed text and perform part-of-speech-tagging (POST).
- Bucket the terms into (say) Nouns (N) and articles/prepositions (X).
- Call any string of terms of the form NX*N an extended biword.
  - Each such extended biword is now made a term in the dictionary.
- Example: catcher in the rye
             N     X  X   N
- Query processing: parse it into N's and X's
  - Segment query into enhanced biwords
  - Look up in index: catcher rye
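A toy sketch of extracting extended biwords in Python; the two-entry tag lexicon stands in for a real part-of-speech tagger:

```python
# Toy POS "tagger": a hand-made lexicon standing in for real POS tagging.
TAGS = {"catcher": "N", "rye": "N", "in": "X", "the": "X"}

def extended_biwords(terms):
    # Emit a (noun, noun) pair for every maximal N X* N span.
    pairs, last_noun = [], None
    for term in terms:
        tag = TAGS.get(term)
        if tag == "N":
            if last_noun is not None:
                pairs.append((last_noun, term))
            last_noun = term
        elif tag != "X":
            last_noun = None  # anything else breaks the span
    return pairs

print(extended_biwords("catcher in the rye".split()))  # [('catcher', 'rye')]
```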
Issues for biword indexes
- False positives, as noted before
- Index blowup due to bigger dictionary
  - Infeasible for more than biwords, big even for them
- Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy

Solution 2: Positional indexes
- In the postings, store, for each term, the position(s) in which tokens of it appear:

    <term, number of docs containing term;
     doc1: position1, position2 … ;
     doc2: position1, position2 … ;
     etc.>
Positional index example

    <be: 993427;
     1: 7, 18, 33, 72, 86, 231;
     2: 3, 149;
     4: 17, 191, 291, 430, 434;
     5: 363, 367, …>

- Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
- For phrase queries, we use a merge algorithm recursively at the document level
- But we now need to deal with more than just equality

Processing a phrase query
- Extract inverted index entries for each distinct term: to, be, or, not.
- Merge their doc:position lists to enumerate all positions with "to be or not to be".
  - to:
      2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
  - be:
      1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
- Same general method for proximity searches
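A sketch of the document-then-position merge in Python; for clarity it checks set membership per candidate start rather than doing the linear position-list merge a real system would use:

```python
def phrase_query(index, terms):
    # index: term -> {docID: sorted list of positions}.
    # Return docID -> positions where the whole phrase starts.
    postings = [index[t] for t in terms]
    common = set(postings[0]).intersection(*(set(p) for p in postings[1:]))
    answer = {}
    for doc in sorted(common):
        position_sets = [set(p[doc]) for p in postings]
        starts = [pos for pos in postings[0][doc]
                  if all(pos + k in position_sets[k] for k in range(len(terms)))]
        if starts:
            answer[doc] = starts
    return answer

index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
print(phrase_query(index, ["to", "be"]))  # {4: [16, 190, 429, 433]}
```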
Proximity queries (Sec. 2.4.2)
- LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
  - Again, here, /k means "within k words of".
- Clearly, positional indexes can be used for such queries; biword indexes cannot.
- Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?
  - This is a little tricky to do correctly and efficiently
  - See Figure 2.12 of IIR
  - There's likely to be a problem on it!
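As a baseline only (not the efficient linear merge of IIR Figure 2.12, which the exercise asks for), a simple within-k check for one document in Python:

```python
import bisect

def within_k(positions1, positions2, k):
    # True if some occurrence in positions1 is within k words of one in
    # positions2 (both sorted). Binary search makes this O(m log n).
    for p in positions1:
        i = bisect.bisect_left(positions2, p - k)
        if i < len(positions2) and positions2[i] <= p + k:
            return True
    return False

print(within_k([5, 40], [1, 44], 3))  # False
print(within_k([5, 40], [1, 42], 3))  # True (40 and 42 are within 3)
```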
Positional index size (Sec. 2.4.2)
- You can compress position values/offsets: we'll talk about that in lecture 5
- Nevertheless, a positional index expands postings storage substantially
- Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system.
Positional index size
- Need an entry for each occurrence, not just once per document
- Index size depends on average document size (Why?)
  - Average web page has <1000 terms
  - SEC filings, books, even some epic poems … easily 100,000 terms
- Consider a term with frequency 0.1%:

    Document size | Postings | Positional postings
    1,000         | 1        | 1
    100,000       | 1        | 100

  (At 0.1% frequency, a 100,000-term document has ~100 occurrences, hence ~100 position entries, but still only one docID posting.)

Rules of thumb
- A positional index is 2–4 times as large as a non-positional index
- Positional index size is 35–50% of the volume of the original text
- Caveat: all of this holds for "English-like" languages
Combination schemes
- These two approaches can be profitably combined
  - For particular phrases ("Michael Jackson", "Britney Spears") it is inefficient to keep on merging positional postings lists
    - Even more so for phrases like "The Who"
- Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
  - A typical web query mixture was executed in ¼ of the time of using just a positional index
  - It required 26% more space than having a positional index alone

Resources for today's lecture
- IIR 2
- MG 3.6, 4.3; MIR 7.2
- Porter's stemmer: http://www.tartarus.org/~martin/PorterStemmer/
- Skip Lists theory: Pugh (1990)
  - Multilevel skip lists give same O(log n) efficiency as trees
- H.E. Williams, J. Zobel, and D. Bahle. 2004. "Fast Phrase Querying with Combined Indexes", ACM Transactions on Information Systems.
  http://www.seg.rmit.edu.au/research/research.php?author=4
- D. Bahle, H. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. SIGIR 2002, pp. 215–221.