Chapter 2 - MA212 - Indexing & Preprocessing
University
First Semester - 2024-2025
Google supports Boolean retrieval.
➢ Yes
➢ No
Boolean retrieval is called "exact-match" because ...
➢ it returns documents that exactly satisfy the Boolean query.
➢ it returns documents that exactly satisfy the information need.
➢ it divides the collection into exactly two subsets of documents.
When we change our query after seeing the search results, .....
➢ we are actually changing our information need.
➢ we are representing the same information need but in a different way.
➢ either of the above cases can happen.
Today’s Roadmap
The anatomy of a search engine
Indexing
Preprocessing
The IR Black Box
[Diagram: a query and a collection of documents go into the IR black box; a list of hits comes out.]
Inside the IR Black Box
[Diagram: the query passes through a representation function online; the documents pass through a representation function offline; comparing the two representations produces the hits.]
This course in 1 slide!
Indexing process (offline)
[Diagram: the offline indexing pipeline.]
Acquisition: documents arrive via web crawling, RSS feeds, emails, …; each document gets a unique ID.
Data Store: what can you store? How much disk space? Do you have the rights? Do you compress?
Index Creation: build the index from the stored documents.
Logging & Analysis: log the user's actions (clicks, hovering, giving up); the log data feeds logging, ranking analysis, and performance analysis.
Bigger Collections …
Consider N = 1 million documents, each with about 1000 words.
Say there are M = 500K distinct terms among these.
A 500K × 1M term-document incidence matrix has half a trillion 0's and 1's. Yet it contains at most one billion 1's (each of the 1M documents contributes at most 1000 distinct terms), so at least 99.8% of the cells are 0. A better representation records only the 1 positions: the inverted index.
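A quick back-of-the-envelope check of these figures (a sketch; only the numbers from this slide are used):

```python
# Size of the full term-document incidence matrix vs. how many 1's it can hold.
M = 500_000              # distinct terms
N = 1_000_000            # documents
cells = M * N            # 5e11 cells = half a trillion 0/1 entries
max_ones = N * 1_000     # each doc has ~1000 tokens, so at most 1e9 cells are 1
print(cells, max_ones / cells)   # 500000000000 0.002 -> at least 99.8% zeros
```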
Inverted Index
For each term t, we must store a list of all documents that contain t.
● Identify each by a docID, a document serial number.
The dictionary maps each term to its postings list; each docID in a list is one posting:

likes → 1 2 4 11 31 45 173
wink  → 1 2 4 5 6 16 57 132
drink → 2 31 54 101

Postings lists are sorted by docID (more later on why).
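In its simplest form, the dictionary and postings can be sketched as a Python dict mapping each term to a docID-sorted list (a minimal sketch; real systems compress both structures):

```python
# Dictionary -> postings lists; each docID entry is one posting.
index = {
    "likes": [1, 2, 4, 11, 31, 45, 173],
    "wink":  [1, 2, 4, 5, 6, 16, 57, 132],
    "drink": [2, 31, 54, 101],
}
# Finding a term's postings is a single dictionary lookup:
print(index["drink"])    # [2, 31, 54, 101]
```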
Inverted Index Construction
[Diagram: documents to be indexed (e.g., "He likes to wink, he likes to drink.") flow through a tokenizer into the indexer, which emits the inverted index (e.g., he → 2 4, wink → 3 9).]
Step 1: Term Sequence
Doc 1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me."
Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious"

Preprocessing each document yields a sequence of (term, docID) pairs:

(I, 1) (did, 1) (enact, 1) (julius, 1) (caesar, 1) (I, 1) (was, 1) (killed, 1) (i', 1) (the, 1) (capitol, 1) (brutus, 1) (killed, 1) (me, 1)
(so, 2) (let, 2) (it, 2) (be, 2) (with, 2) (caesar, 2) (the, 2) (noble, 2) (brutus, 2) (hath, 2) (told, 2) (you, 2) (caesar, 2) (was, 2) (ambitious, 2)
Step 2: Sorting
Core indexing step: sort the (term, docID) pairs by term, then by docID.

Sorted sequence:

(ambitious, 2) (be, 2) (brutus, 1) (brutus, 2) (capitol, 1) (caesar, 1) (caesar, 2) (caesar, 2) (did, 1) (enact, 1) (hath, 2) (I, 1) (I, 1) (i', 1) (it, 2) (julius, 1) (killed, 1) (killed, 1) (let, 2) (me, 1) (noble, 2) (so, 2) (the, 1) (the, 2) (told, 2) (you, 2) (was, 1) (was, 2) (with, 2)
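The two steps translate almost line for line into code. A minimal sketch on the Caesar example (the punctuation stripping and case folding here are simplifications of real preprocessing):

```python
# Step 1: emit (term, docID) pairs; Step 2: sort and group into postings lists.
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

pairs = []
for doc_id, text in docs.items():
    for token in text.replace(";", " ").replace(".", " ").replace(",", " ").split():
        pairs.append((token.lower(), doc_id))      # step 1: term sequence

pairs.sort()                                       # step 2: sort by term, then docID

index = {}
for term, doc_id in pairs:
    postings = index.setdefault(term, [])
    if not postings or postings[-1] != doc_id:     # collapse duplicate docIDs
        postings.append(doc_id)

print(index["caesar"], index["killed"])            # [1, 2] [1]
```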
Intersecting Two Postings Lists (a "merge" algorithm)
Document-at-a-time processing. Example:

wink  → 2 4 8 16 32 64 128
drink → 1 2 3 5 8 13 21 34
wink AND drink → 2 8

Complexity? Linear: O(x + y) for postings lists of lengths x and y.
Crucial: postings sorted by docID.
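A sketch of the merge: walk both docID-sorted lists with two pointers, always advancing the one at the smaller docID, in O(x + y) time:

```python
def intersect(p1, p2):
    """Intersect two docID-sorted postings lists in one linear pass."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:               # docID in both lists -> a hit
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:              # advance the pointer at the smaller docID
            i += 1
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]
```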
Proximity Queries
If two words are "near" each other in a document d, they might be more related than words that are further apart ➔ d might be "more relevant".
Positional Indexes
In the postings, store for each term the position(s) at which its tokens appear:

<likes: 9347;
 1: 7, 18, 33, 72, 86, 231;
 2: 3, 149;
 4: 17, 191, 291, 430, 434;
 5: 363, 367, …>
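In Python terms, such a positional posting might be sketched as follows, reading the leading 9347 as the term's overall frequency (an assumption about the slide's notation); the `near` helper is a naive illustration, not how production engines merge position lists:

```python
# term -> (overall frequency, {docID: [positions ...]}), as on the slide.
positional_index = {
    "likes": (9347, {
        1: [7, 18, 33, 72, 86, 231],
        2: [3, 149],
        4: [17, 191, 291, 430, 434],
        5: [363, 367],
    }),
}

def near(positions_a, positions_b, k):
    """True if some occurrence of term A is within k tokens of term B."""
    return any(abs(a - b) <= k for a in positions_a for b in positions_b)
```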
Zone
A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,
● Title
● Abstract
● References …
Example Zone Indexes
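Zone indexes can be built in more than one way; a minimal sketch of one common design, with made-up terms and docIDs, qualifies each dictionary term with its zone:

```python
# Zone-qualified dictionary terms, each with its own postings list
# (terms and docIDs here are illustrative only).
zone_index = {
    "caesar.title":    [2, 4],
    "caesar.abstract": [3, 5, 8],
    "caesar.body":     [1, 2, 5, 8, 16],
}
```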
Today’s Roadmap
The anatomy of a search engine
Indexing
Preprocessing
The Basic Indexing Pipeline
Documents to be indexed ("Friends, Romans, countrymen")
↓ Preprocessing: Tokenization, Normalization
↓ Indexer
Inverted index:
friend → 2 4
roman → 1 2
countryman → 3 9
Preprocessing
[Diagram: documents and the query each pass through the same text transformation, producing bag-of-words (BOW) representations; the document BOWs build the index, against which the query BOW is matched.]
Goal: to better match between different forms of words in documents and query.
Preprocessing Steps
1. Tokenization
2. Stopping
3. Stemming
Objective: identify the optimal form of the
term to be indexed to achieve the best
retrieval performance.
Before Tokenization …
Encoding & Parsing a Document (byte sequence ➔ character sequence)
● Which encoding/character set?
● What format? pdf/word/excel/html?
● What language?
● Each is a classification problem
● BUT often done heuristically, by user selection, or by metadata
What is a Unit Document? Where to stop?
● A file? An email? A group of files (PPT)?
A book (a chapter/paragraph/sentence)?
● Understand collection, user, and usage patterns
Tokenization
Sentence → tokenization (splitting) → tokens
Input: "Friends, Romans and Countrymen"
Output: Tokens
● Friends
● Romans
● and
● Countrymen
But what are valid tokens to emit?
A token is an instance of a sequence of characters.
Each such token is now a candidate for an index entry (term), after further processing.
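A deliberately naive tokenizer sketch; the issues on the next slides show why real tokenizers must be far more careful:

```python
import re

def tokenize(text):
    # Words, optionally with an internal apostrophe (e.g., "Finland's").
    return re.findall(r"\w+(?:'\w+)?", text)

print(tokenize("Friends, Romans and Countrymen"))
# ['Friends', 'Romans', 'and', 'Countrymen']
```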
Issues in Tokenization
Finland’s capital →
Finland? Finlands? Finland’s?
Hewlett-Packard → one token or two?
● state-of-the-art: break up hyphenated sequence.
● co-education
● lowercase, lower-case, lower case ?
● It can be effective to get the user to put in possible hyphens
San Francisco: one token or two?
● How do you decide it is one token?
Numbers?
● 3/20/91 Mar. 12, 1991 20/3/91
● This course code is CMPT621
● (800) 234-2333
Issues in Tokenization
URLs:
● http://www.bbc.co.uk
● http://www.bbc.co.uk/news/world-europe-41376577
Social Media
● Black lives matter
● #Black_lives_matter
● #BlackLivesMatter
● #blacklivesmatter
● @blacklivesmatter
Language-dependent Issues
French
● L'ensemble → one token or two?
• L ? L’ ? Le ?
• Want l’ensemble to match with un ensemble
– Until at least 2003, it didn’t on Google
Stopping (stop words removal)
Example: "This is a very exciting lecture on the technologies of text"
Stop words: the most common words in a collection
→ the, a, is, he, she, I, him, for, on, to, very, …
They have little semantic contribution
They appear a lot: ≈ 30-40% of text
New stop words appear in specific domains
● e.g., "RT" in Tweets: "RT @realDonaldTrump Mexico will …"
Stop words influence sentence structure but have less influence on topic (aboutness)
Stopping: always apply?
Sometimes very important:
● Phrase queries: "Let it be", "To be or not to be"
● Relational queries:
- flights to Doha from London
- flights from Doha to London
In Web search, the trend is to keep them:
● Good compression techniques mean the space for including stop words in a system is small.
● Good query optimization techniques mean you pay little at query time for including stop words.
Stopping: common practice
Common practice in many applications → remove stop words
There are common stop-word lists for each language, e.g.,
● NLTK (Python)
● Lucene (Java)
● http://members.unine.ch/jacques.savoy/clef/index.html
There are special stop-word lists for some applications
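For example, with NLTK's English list (assuming the `stopwords` corpus has been downloaded):

```python
from nltk.corpus import stopwords   # one-time setup: nltk.download('stopwords')

stop = set(stopwords.words('english'))
tokens = "this is a very exciting lecture on the technologies of text".split()
print([t for t in tokens if t not in stop])
# ['exciting', 'lecture', 'technologies', 'text']
```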
Normalization
Objective → make words with different surface forms look the same
Document: "there are few CARS!!"
Query: "car"
Should "car" match "CARS"?
Case Folding
“A” & “a” are different strings for computers
Case folding: convert all letters to lower case
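In Python this is a one-liner; `str.casefold()` is the more aggressive Unicode-aware variant:

```python
print("CARS".lower())       # cars    (stemming, later, maps it to the query term "car")
print("Straße".casefold())  # strasse (casefold handles cases plain lower() misses)
```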
Thesauri and Soundex
Do we handle synonyms?
● e.g., by hand-constructed equivalence classes
• car = automobile; color = colour
● We can rewrite to form equivalence-class terms
• When the document contains automobile, index it under car-automobile (and
vice-versa)
● Or we can expand a query
• When the query contains automobile, look under car as well
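Both options can be sketched with a hand-built table of equivalence classes (the table below is illustrative, not a real thesaurus):

```python
equivalence = {
    "automobile": ["car"], "car": ["automobile"],
    "colour": ["color"],   "color": ["colour"],
}

def expand_query(terms):
    """Query-side expansion: also look under each term's equivalents."""
    expanded = set(terms)
    for t in terms:
        expanded.update(equivalence.get(t, []))
    return expanded

print(expand_query(["automobile"]))   # {'automobile', 'car'}
```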
Lemmatization
Lemmatization implies doing "proper" reduction to the "base" or dictionary form, called the lemma.
● Morphological analysis
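For instance, with NLTK's WordNet-based lemmatizer (assuming the `wordnet` corpus has been downloaded):

```python
from nltk.stem import WordNetLemmatizer   # one-time setup: nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("children"))        # child
print(lemmatizer.lemmatize("was", pos="v"))    # be  (needs the part of speech)
```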
Stemming
"Stemming" suggests crude affix chopping
● language dependent
● e.g., automate, automates, automatic, automation are all reduced to automat.
For example, compressed and compression are both accepted as equivalent to compress.
Porter Stemmer
Most common algorithm for stemming English
Conventions + 5 phases of reductions
● phases applied sequentially
● each phase consists of a set of commands
Example convention: of the rules in a compound command, select the one that applies to the longest suffix.
Example rules
● sses → ss (processes → process)
● y → i (reply → repli)
● ies → i (replies → repli)
● tional → tion (conditional → condition)
● (m>1) ement → ∅ (replacement → replac; cement → cement)
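The rules above can be tried directly with NLTK's implementation (note that NLTK's version adds some extensions to the original 1980 algorithm):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["processes", "reply", "replies", "replacement", "cement"]:
    print(word, "->", stemmer.stem(word))
# processes -> process, reply -> repli, replies -> repli,
# replacement -> replac, cement -> cement
```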
Stemming: is it really useful?
Usually, it achieves a 5-10% improvement in retrieval effectiveness, e.g., for English.
For highly inflected languages, it is more critical:
● 30% improvement in Finnish IR
● 50% improvement in Arabic IR
They are Ahmad's children: هؤلاء أبناء أحمد
The children behaved well: الأبناء تصرفوا جيدا
Her children are cute: أبناؤها لطاف
My children are funny: أبنائي ظرفاء
We have to save our children: علينا أن نحمي أبناءنا
Parents and children are happy: الآباء والأبناء سعداء
He loves his children: يحب أبناءه
His children love him: أبناؤه يحبونه
Every sentence uses a different surface form of the word for "children" (أبناء); stemming maps them all to one term, which is why it matters so much in Arabic IR.
Stemmed words are misspelled?!
repli, replac, suppli, inform, retriev, anim
These are not words anymore; these are terms.
These terms are not seen by the user; they are just used internally by the IR system (search engine).
They represent the optimal form for a better match between different surface forms of a word.
● e.g., replace, replaces, replaced, replacing, replacer, replacers, replacement, replacements → replac.
Summary
Pre-processing:
● Tokenization → Stopping → Stemming
An example sentence after the full pipeline:
exampl sentenc pre process appli text inform retriev includ token stop word remov stem
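A sketch of the whole pipeline, reproducing the line above (assuming NLTK and its `stopwords` corpus are available):

```python
import re
from nltk.corpus import stopwords      # one-time setup: nltk.download('stopwords')
from nltk.stem import PorterStemmer

stop = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    tokens = re.findall(r"\w+", text.lower())                  # tokenize + case-fold
    return [stemmer.stem(t) for t in tokens if t not in stop]  # stop + stem

print(" ".join(preprocess(
    "An example sentence, pre-processed and applied to text information "
    "retrieval, including tokenization, stop-word removal and stemming.")))
# exampl sentenc pre process appli text inform retriev includ token stop word remov stem
```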
How can we know
if a search engine is “good” or “bad”?