chapter2-MA212-Indexing+&+Preprocessing

Uploaded by alaaabdo347890
© All Rights Reserved

Faculty of Artificial Intelligence – KFS

University
First Semester - 2024-2025

MA212: Information Retrieval and Web Search


Grade: General - Second Year
Dr. Marwa Elseddik
2. Indexing & Preprocessing
Google supports Boolean retrieval.
➢ Yes
➢ No
Boolean retrieval is called "exact-match" because ...
➢ it returns documents that exactly satisfy the Boolean query.
➢ it returns documents that exactly satisfy the information need.
➢ it divides the collection into exactly two subsets of documents.
When we change our query after seeing the search results, ...
➢ we are actually changing our information need.
➢ we are representing the same information need but in a different way.
➢ either of the above cases can happen.
4
Today’s Roadmap
 The anatomy of a search engine

 Indexing

 Preprocessing

5
The IR Black Box

Query + Documents → [ ? ] → Hits
7
Inside the IR Black Box

Online side: Query → Representation Function → Query Representation
Offline side: Documents → Representation Function → Document Representation → Index
The Comparison Function (the Retrieval Model) matches the Query
Representation against the Index → Hits
This course in 1 slide!
8
Indexing process (offline)

Acquisition (web-crawling, RSS feeds, emails): each document (e.g., "A
System and Method for …") is assigned a unique ID and stored in the Data
Store.
● what can you store? disk space? rights? compression?
Preprocessing (transformation): what data do we want to search?
● format conversion; international?
● which part contains "meaning"? word units? stopping? stemming?
Index Creation: builds the Index, a lookup table for quickly finding all
docs containing a word.
9
Search process (online)

User Interaction: help user formulate the query by suggesting what he
could search for; fetch a set of results, present to the user.
Ranking: uses the Index (and the Data Store) to rank documents for the query.
Logging & Analysis: log user's actions (clicks, hovering, giving up) into
log data; used for logging, ranking analysis, performance analysis.
10
5

11

Indexing is done at query time only.


➢ Yes
➢ No, it is done only offline
➢ No, it is done both offline and online

Ranking is done ...


➢ offline
➢ online
➢ both offline and online

12
13
Bigger Collections …
 Consider N = 1 million documents, each with about 1000 words.
 Say there are M = 500K distinct terms among these.
 500K x 1M term-doc incidence matrix has half-a-trillion 0’s and
1’s.

 But it has no more than one billion 1’s. Why?


● matrix is extremely sparse.

What’s a better representation?


14
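The arithmetic behind the sparsity claim can be checked directly; a quick sketch using the slide's own figures:

```python
# Term-document incidence matrix size vs. actual number of 1's,
# using the collection figures from the slide.
N_docs = 1_000_000          # 1 million documents
words_per_doc = 1_000       # ~1000 words each
M_terms = 500_000           # 500K distinct terms

matrix_cells = M_terms * N_docs      # 500 billion 0's and 1's
max_ones = N_docs * words_per_doc    # at most one 1 per token: 1 billion
density = max_ones / matrix_cells

print(matrix_cells)   # 500000000000 (half a trillion)
print(max_ones)       # 1000000000 (one billion)
print(f"at most {density:.2%} of cells are 1")  # at most 0.20% of cells are 1
```

So even in the best case, 99.8% of the matrix is zeros, which is why the incidence matrix is a poor representation.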
Sec. 1.2

Inverted Index
 For each term t, we must store a list of all documents that
contain t.
● Identify each by a docID, a document serial number
Posting

likes 1 2 4 11 31 45 173
wink 1 2 4 5 6 16 57 132
drink 2 31 54 101
Postings List
Dictionary
Sorted by docID (more later on why)
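In code, the dictionary-plus-postings structure above can be sketched as a plain mapping from each term to its sorted list of docIDs (the terms and docIDs are taken from the slide):

```python
# Inverted index: each term maps to a sorted list of docIDs (its postings list).
index = {
    "likes": [1, 2, 4, 11, 31, 45, 173],
    "wink":  [1, 2, 4, 5, 6, 16, 57, 132],
    "drink": [2, 31, 54, 101],
}

def postings(term):
    """Return the postings list for a term (empty if the term is absent)."""
    return index.get(term, [])

print(postings("drink"))   # [2, 31, 54, 101]
```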

15
Inverted Index Construction
Documents to be indexed He likes to wink, he likes to drink.

Tokenizer

Token stream He likes to wink he likes to drink


Preprocessing
Normalizer
(later)
Terms (modified tokens) he like wink he like drink

Indexer
he 2 4

Inverted index like 1 2

wink 3 9
16
Step 1: Term Sequence

Doc 1: I did enact Julius Caesar I was
killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble
Brutus hath told you Caesar was ambitious

Preprocess each document into a sequence of (term, docID) pairs:

Term      docID
I         1
did       1
enact     1
julius    1
caesar    1
I         1
was       1
killed    1
i'        1
the       1
capitol   1
brutus    1
killed    1
me        1
so        2
let       2
it        2
be        2
with      2
caesar    2
the       2
noble     2
brutus    2
hath      2
told      2
you       2
caesar    2
was       2
ambitious 2

Sequence of (term, docID) pairs
Step 2: Sorting

Core indexing step: sort the (term, docID) pairs by term, then by docID.

Term      docID
ambitious 2
be        2
brutus    1
brutus    2
capitol   1
caesar    1
caesar    2
caesar    2
did       1
enact     1
hath      2
I         1
I         1
i'        1
it        2
julius    1
killed    1
killed    1
let       2
me        1
noble     2
so        2
the       1
the       2
told      2
you       2
was       1
was       2
with      2

Sorted sequence of (term, docID) pairs
Step 3: Dictionary & Postings

Duplicate (term, docID) pairs are merged, the result is split into a
Dictionary and Postings, and document frequency (df) information is added:

Term (df)      Postings
ambitious (1)  2
be (1)         2
brutus (2)     1 → 2
capitol (1)    1
caesar (2)     1 → 2
did (1)        1
enact (1)      1
hath (1)       2
I (1)          1
i' (1)         1
it (1)         2
julius (1)     1
killed (1)     1
let (1)        2
me (1)         1
noble (1)      2
so (1)         2
the (2)        1 → 2
told (1)       2
you (1)        2
was (2)        1 → 2
with (1)       2

The result is the Inverted Index.
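The three steps (term sequence → sorting → dictionary & postings) can be sketched in a few lines of Python; the crude lowercasing tokenizer here is a stand-in for the full preprocessing described later in the lecture:

```python
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: sequence of (term, docID) pairs (crude tokenizer as a stand-in).
pairs = [(tok.strip(".,;").lower(), doc_id)
         for doc_id, text in docs.items()
         for tok in text.split()]

# Step 2: sort by term, then docID -- the core indexing step.
pairs.sort()

# Step 3: group into dictionary & postings; duplicates collapse, df is added.
index = {}
for term, group in groupby(pairs, key=lambda p: p[0]):
    postings = sorted({doc_id for _, doc_id in group})
    index[term] = {"df": len(postings), "postings": postings}

print(index["caesar"])  # {'df': 2, 'postings': [1, 2]}
print(index["killed"])  # {'df': 1, 'postings': [1]}
```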
Indexing

The full offline pipeline: preprocess each document into (term, docID)
pairs, sort by term then docID, and build the Dictionary & Postings: the
Inverted Index.

How do we index efficiently? In IR2 course ☺
Query Processing: AND
 Consider processing the query: wink AND drink
1. Locate wink in the Dictionary, Retrieve its postings
2. Locate drink in the Dictionary, Retrieve its postings
3. “Merge” the two postings lists

wink:  2 → 4 → 8 → 16 → 32 → 64 → 128
drink: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Intersection: 2 → 8

 Complexity ?
 Crucial: postings sorted by docID. 21
Intersecting Two Postings Lists:
(a “merge” algorithm)

Document-at-a-time

How to modify for OR?
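A sketch of the document-at-a-time merge in Python; for OR, the loop would emit docIDs from both lists instead of only the matches:

```python
def intersect(p1, p2):
    """Merge-intersect two postings lists, both sorted by docID.
    Runs in O(len(p1) + len(p2)) -- this is why postings are kept sorted."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # same docID in both lists: a match
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer with the smaller docID
            i += 1
        else:
            j += 1
    return answer

wink  = [2, 4, 8, 16, 32, 64, 128]
drink = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(wink, drink))  # [2, 8]
```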


22
6

23

In inverted index, we can get efficiently ...


➢ what terms appear in a specific document
➢ what documents have a specific term
➢ both of the above

One posting belongs to ...


➢ one term
➢ one document
➢ one term in one document

24
25
Proximity Queries
 If 2 words are “near” each other in a document d, they might be
more related than words that are further apart ➔ d might be “more relevant”

 Ex: Find Gates NEAR/3 Microsoft.

How can we support it?

26
Positional Indexes
 In the postings, store for each term the position(s) in which
tokens of it appear:
<likes: 9347;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, …>
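With positions stored, a NEAR/k check inside one document reduces to comparing position lists. A minimal sketch (the positions and terms below are illustrative, not from the slide; the nested loop is quadratic, and real systems use a merge over the sorted position lists instead):

```python
# Positional postings: term -> {docID: sorted positions within that doc}.
# Illustrative data for the "Gates NEAR/3 Microsoft" example.
pos_index = {
    "gates":     {1: [3, 40], 2: [7]},
    "microsoft": {1: [5, 99], 3: [1]},
}

def near(t1, t2, k):
    """Return docIDs where some occurrence of t1 is within k positions of t2."""
    hits = []
    # Only documents containing both terms can match.
    for doc in pos_index.get(t1, {}).keys() & pos_index.get(t2, {}).keys():
        if any(abs(a - b) <= k
               for a in pos_index[t1][doc]
               for b in pos_index[t2][doc]):
            hits.append(doc)
    return sorted(hits)

print(near("gates", "microsoft", 3))  # [1]  (positions 3 and 5 are within 3)
```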

What’s the biggest problem?

28
Sec. 2.4.2

Positional Index Size


 You can compress position values/offsets

 Nevertheless, a positional index expands postings storage


substantially

 Nevertheless, a positional index is now standardly used because


of the power and usefulness of phrase and proximity queries …
whether used explicitly or implicitly in a ranking retrieval system.
Phrase Queries
 Want to be able to answer queries such as “Kafr El Sheikh
university” – as a phrase
 Thus the sentence “I went to university in Kafr El Sheikh” is not a
match.
● The concept of phrase queries has proven easily understood by users;
one of the few “advanced search” ideas that works
● Many more queries are implicit phrase queries

30
7


Phrase queries are a special case of proximity queries


➢ Yes
➢ No

Proximity queries are ......... Boolean queries


➢ more expensive than
➢ less expensive than
➢ of equal cost to

31
33
Zone
A zone is a region of the doc that can contain an arbitrary
amount of text e.g.,
● Title
● Abstract
● References …

 Build inverted indexes on zones as well to permit querying


● e.g., find docs with merchant in the title zone and “gentle rain” in the
body.

33
Example Zone Indexes

Encode zones in dictionary vs. postings.
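One way to encode zones in the postings (rather than in the dictionary) is to store (docID, zone) pairs. A minimal sketch with made-up documents, answering queries like "merchant in the title zone":

```python
# Zone-augmented postings: term -> list of (docID, zone) pairs, sorted by docID.
# Illustrative data, not taken from the slide's figure.
zone_index = {
    "merchant": [(1, "title"), (2, "body")],
    "gentle":   [(1, "body"), (3, "title")],
    "rain":     [(1, "body")],
}

def docs_with(term, zone):
    """docIDs where the term occurs in the given zone."""
    return [d for d, z in zone_index.get(term, []) if z == zone]

print(docs_with("merchant", "title"))  # [1]
```

Storing the zone in the postings (instead of keeping one dictionary entry per term-zone pair) keeps the dictionary small at the cost of slightly larger postings.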

34
Today’s Roadmap
 The anatomy of a search engine

 Indexing

 Preprocessing

35
The Basic Indexing Pipeline
Documents to
Friends, Romans, countrymen
be indexed
Preprocessing
Tokenization

Token stream Friends Romans Countrymen

Normalization

Terms (modified tokens) friend roman countryman

Indexer
friend 2 4
roman 1 2
countryman 3 9
38
Inverted index
Preprocessing

Documents (offline) and the query (online) go through the same text
transformation (preprocessing), each yielding a bag of words (BOW); both
feed the Index, to better match between different forms of words in
documents and query.

37
Preprocessing Steps
1. Tokenization
2. Stopping
3. Stemming
Objective: identify the optimal form of the
term to be indexed to achieve the best
retrieval performance.

38
Before Tokenization …
 Encoding & Parsing a Document Byte sequence ➔
● Which encoding/character set? Character sequence
● What format? pdf/word/excel/html?
● What language?
● Each is a classification problem
● BUT often done heuristically, by user selection, or by metadata
 What is a Unit Document?
● A file? An email? A group of files (PPT)? Where to Stop?
A book (a chapter/paragraph/sentence)?
● Understand collection, user, and usage patterns

39
42
Tokenization
 Sentence → tokenization (splitting) → tokens
 Input: “Friends, Romans and Countrymen”
 Output: Tokens
● Friends
● Romans
● and
● Countrymen
But what are valid tokens to emit?
A token is an instance of a sequence of characters.
 Each such token is now a candidate for an index entry (term),
after further processing.
41
Issues in Tokenization
 Finland’s capital →
Finland? Finlands? Finland’s?
 Hewlett-Packard → one token or two?
● state-of-the-art: break up hyphenated sequence.
● co-education
● lowercase, lower-case, lower case ?
● It can be effective to get the user to put in possible hyphens
 San Francisco: one token or two?
● How do you decide it is one token?
 Numbers?
● 3/20/91 Mar. 12, 1991 20/3/91
● This course code is CMPT621
● (800) 234-2333

42
Issues in Tokenization
 URLs:
● http://www.bbc.co.uk
● http://www.bbc.co.uk/news/world-europe-41376577
 Social Media
● Black lives matter
● #Black_lives_matter
● #BlackLivesMatter
● #blacklivesmatter
● @blacklivesmatter

45
Language-dependent Issues
 French
● L'ensemble → one token or two?
• L ? L’ ? Le ?
• Want l’ensemble to match with un ensemble
– Until at least 2003, it didn’t on Google

 German noun compounds are not segmented


● Lebensversicherungsgesellschaftsangestellter
● ‘life insurance company employee’
● German retrieval systems benefit greatly from a compound splitter module
– Can give a 15% performance boost for German

 Chinese and Japanese have no spaces between words:


● 莎拉波娃现在居住在美国东南部的佛罗里达。
● Tokenization → Segmentation
46
Tokenization: common practice
 Just split at non-letter characters
 Add special cases if required
 Some applications have special setup
● Social media: hashtags/mentions handled differently
● URLs: no split, split at domain only, remove entirely!
● Medical: protein & disease names
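The common practice above can be sketched with a regular expression: split at non-letter characters, with keeping "#"/"@" for social-media text shown as one special case:

```python
import re

def tokenize(text, keep_social=False):
    """Split at non-letter characters; optionally keep #hashtags/@mentions intact."""
    if keep_social:
        return re.findall(r"[#@]?[A-Za-z]+", text)
    return re.findall(r"[A-Za-z]+", text)

print(tokenize("Friends, Romans and Countrymen"))
# ['Friends', 'Romans', 'and', 'Countrymen']
print(tokenize("RT @blacklivesmatter #BlackLivesMatter", keep_social=True))
# ['RT', '@blacklivesmatter', '#BlackLivesMatter']
```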

45
46
Stopping (stop words removal)
 This is a very exciting lecture on the technologies of text
 Stop words: the most common words in collection
→ the, a, is, he, she, I, him, for, on, to, very, …
 They have little semantic contribution
 They appear a lot ≈ 30-40% of text
 New stop words appear in specific domains
● e.g., “RT” in Tweets: “RT @realDonalTrump Mexico will …”
 Stop words
● influence on sentence structure
● less influence on topic (aboutness)
47
Stopping: always apply?
 Sometimes very important:
● Phrase queries: “Let it be”, “To be or not to be”
● Relational queries:
- flights to Doha from London
- flights from Doha to London
 In Web search, trend is to keep them:
● Good compression techniques mean the space for including stop
words in a system is small.
● Good query optimization techniques mean you pay little at query time
for including stop words.
48
Stopping: common practice
 Common practice in many applications
→ remove stop words
 There are common stop word lists for each language
● NLTK (Python)
● Lucene (Java)
● http://members.unine.ch/jacques.savoy/clef/index.html
 There are special stop word lists for some applications

How to create your own list?
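One simple answer to the question above: count term frequencies over your own collection and take the top few; very frequent, low-content terms float to the top. A sketch with a toy collection (in practice the resulting list is reviewed by hand before use):

```python
from collections import Counter

def build_stoplist(docs, top_k):
    """Return the top_k most frequent terms in the collection:
    these are the stop-word candidates."""
    counts = Counter(tok.lower() for doc in docs for tok in doc.split())
    return [term for term, _ in counts.most_common(top_k)]

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]
print(build_stoplist(docs, 3))  # 'the' comes first: it is the most frequent term
```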


49
8

50

Can tokenization affect retrieval effectiveness?


➢ Yes
➢ No

Stop words should usually have very high document frequency


➢ Yes
➢ No

51
52
Normalization
 Objective → make words with different surface forms look the
same
 Document: “there are few CARS!!”
Query: “car”
should “car” match “CARS”?

 Sentence → tokenization → tokens → normalization → terms to


be indexed (vocabulary/dictionary).

53
Case Folding
 “A” & “a” are different strings for computers
 Case folding: convert all letters to lower case

 CAR, Car, caR → car


 Windows → windows
● should we do that?
● Usually yes, users are so lazy
 Upper case in mid-sentence?
● I bought it from General Motors
● Black vs. black

54
Thesauri and Soundex
 Do we handle synonyms?
● e.g., by hand-constructed equivalence classes
• car = automobile color = colour
● We can rewrite to form equivalence-class terms
• When the document contains automobile, index it under car-automobile (and
vice-versa)
● Or we can expand a query
• When the query contains automobile, look under car as well

 What about spelling mistakes?


● One approach is soundex, which forms equivalence classes of words
based on phonetic heuristics

55
Lemmatization
 Lemmatization implies doing “proper” reduction to the “base” or
dictionary form, called lemma.
● Morphological analysis

 Reduce inflectional/variant forms to base form


 e.g.,
● am, are, is → be
● saw → see
● car, cars, car's, cars' → car
56
Stemming
 Search for: “play”
should it match: “plays”, “played”, “playing”, “player”?
 Many morphological variations of words
● inflectional (plurals, tenses)
● derivational (making verbs nouns, etc.)
 In most cases, aboutness does not change.
 Stemmers attempt to reduce morphological variations of words
to a common stem.

59
Stemming
 “Stemming” suggests crude affix chopping
● language dependent
● e.g., automate, automates, automatic, automation all reduced to
automat.
for example compressed and compression are both accepted as
equivalent to compress.

for exampl compress and compress ar both accept as equival to compress

60
Porter Stemmer
 Most common algorithm for stemming English
 Conventions + 5 phases of reductions
● phases applied sequentially
● each phase consists of a set of commands
 Example convention: Of the rules in a compound command, select the
one that applies to the longest suffix.
 Example rules
● sses → ss (processes → process)
● y → i (reply → repli)
● ies → i (replies → repli)
● tional → tion (international → internation)
● (m>1) ement → ε (replacement → replac), (cement → cement)
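A sketch of how such rules fire, including the measure condition m and the longest-matching-suffix convention; this is a toy fragment, nowhere near the full five-phase Porter algorithm:

```python
import re

def measure(stem):
    """Porter's m: the number of VC (vowel-consonant) sequences in the stem.
    (Simplified: treats 'y' as a consonant everywhere.)"""
    groups = re.findall(r"[aeiou]+|[^aeiou]+", stem)
    pattern = "".join("V" if g[0] in "aeiou" else "C" for g in groups)
    return pattern.count("VC")

# (suffix, replacement, minimum m of the remaining stem)
RULES = [
    ("sses", "ss", 0),     # processes -> process
    ("ies", "i", 0),       # replies -> repli
    ("y", "i", 0),         # reply -> repli
    ("tional", "tion", 0), # international -> internation
    ("ement", "", 2),      # replacement -> replac, but cement stays cement
]

def stem(word):
    # Per the convention on the slide: the rule with the longest matching
    # suffix is selected; its m-condition is then checked.
    for suffix, repl, min_m in sorted(RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if measure(candidate) >= min_m:
                return candidate + repl
            return word
    return word

for w in ["processes", "replies", "reply", "international", "replacement", "cement"]:
    print(w, "->", stem(w))
```

For "replacement", the remaining stem "replac" has m = 2, so the rule fires; for "cement" the stem "c" has m = 0, so the word is left alone.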
59
Stemming: is it really useful?
 Usually, it achieves 5-10% improvement in retrieval effectiveness,
e.g. English.
 For highly inflected languages, it is more critical:
● 30% improvement in Finnish IR
● 50% improvement in Arabic IR
They are Ahmad’s children
The children behaved well
Her children are cute
My children are funny
We have to save our children
Parents and children are happy
He loves his children
His children love him
(In the Arabic translations, “children” appears in a different inflected
surface form each time: أبناء، أبنائي، أبناءنا، …; all of these should match
after stemming.)
60
Stemmed words are misspelled ?!
 repli, replac, suppli, inform retriev, anim
 These are not words anymore, these are terms.
 These terms are not seen by the user, but just used by the IR
system (search engine).
 These represent the optimal form for a better match between
different surface forms of a word.
● e.g. replace, replaces, replaced, replacing, replacer, replacers,
replacement, replacements → replac.

61
9

62

Same tokenization/normalization steps should be


applied to documents and queries.
➢ Yes, always!
➢ No, they can be different of course

The dictionary in the index includes ...


➢ words
➢ tokens
➢ terms
➢ all of the above
63
64
Preprocessing: common practice
 Tokenization: split at non-letter characters
● For tweets, you might want to keep “#” and “@”.
 Remove stop words
● find a common list, and filter these words out.
 Apply case folding

 Apply Porter stemmer (or others for other languages)


● Other stemmers are available, but Porter is the most famous with many
implementations available in different programming languages.
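The whole common-practice pipeline, end to end. The tiny stop list and the bare-bones suffix chopper below are illustrative stand-ins for a real stop-word list and a real Porter implementation:

```python
import re

# Toy stop list; in practice use a curated list (e.g., from NLTK or Lucene).
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "it", "this", "how"}

def crude_stem(term):
    # Bare-bones suffix chopping as a stand-in for the Porter stemmer.
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess(text):
    tokens = re.findall(r"[A-Za-z]+", text)            # 1. tokenize
    terms = [t.lower() for t in tokens]                #    case folding
    terms = [t for t in terms if t not in STOP_WORDS]  # 2. stopping
    return [crude_stem(t) for t in terms]              # 3. stemming

print(preprocess("This is an example sentence of how preprocessing is applied."))
# ['example', 'sentence', 'preprocess', 'appli']
```

The same `preprocess` function must be applied to both documents (at indexing time) and queries (at search time) so that their terms can match.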

65
Summary
 Pre-processing:
● Tokenization → Stopping → Stemming

This is an example sentence of how the pre-processing is applied to


text in information retrieval. It includes: Tokenization, Stop Words
Removal, and Stemming

exampl sentenc pre process appli text inform retriev includ token
stop word remov stem

66
How can we know
if a search engine is “good” or “bad”?

67
68
