Chapter 3: IR Construction
Adama Science and Technology University
School of Electrical Engineering and
Computing
Department of CSE
Kibrom T.
Indexing Subsystem

documents
→ Assign document identifier → document IDs
→ Tokenization → tokens
→ Stop word removal → non-stop-list tokens
→ Stemming & Normalization → stemmed terms
→ Term weighting
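The pipeline above can be sketched in a few lines of Python; the stop list and the suffix-stripping "stemmer" here are toy stand-ins for real components, and the function name is mine:

```python
import re

# Toy stop list; real systems use much larger lists.
STOP_WORDS = {"the", "a", "is", "was", "it", "be", "with"}

def index_document(doc_id: int, text: str) -> list[tuple[str, int]]:
    """Run one document through the indexing pipeline."""
    tokens = re.findall(r"[a-z]+", text.lower())           # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop word removal
    stems = [t.rstrip("s") for t in tokens]                # crude "stemming"
    return [(term, doc_id) for term in stems]              # (term, doc ID) pairs

pairs = index_document(1, "Friends, Romans, countrymen.")
print(pairs)  # [('friend', 1), ('roman', 1), ('countrymen', 1)]
```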
Indexes are built using a web crawler, which retrieves each page on the Web for indexing. Search engines use web crawlers to traverse the web and download web pages. These crawlers follow links from one web page to another, and can also discover new pages through sitemaps, RSS feeds, and other sources. After indexing, the local copy of each page is discarded unless it is stored in a cache.
Step             Description
1. Crawling      Automatically or semi-automatically gather web pages and other types of content
3. Preprocessing Tokenize, remove stop words, and apply stemming or other text normalization techniques
4. Indexing      Create an inverted index that maps terms to the documents in which they appear
5. Ranking       Assign a score to each document based on its relevance to the query
Automated indexing is typically faster and more scalable, but may not be as accurate as semi-automated indexing for complex documents. Semi-automated indexing is typically slower and more labor-intensive, but can produce a more accurate and comprehensive index.
Major Steps in Index Construction

Documents to be indexed: "Friends, Romans, countrymen."
Tokenizer → token stream: Friends | Romans | countrymen
Running time:
- Indexing time
- Access/search time
- Update time (insertion time, deletion time, modification time, ...)
Space overhead:
- Computer storage space consumed
Example documents:
Doc 1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me."
Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious."
Sorting the Vocabulary

Sequential file
After all documents have been tokenized, stop words are removed, and normalization and stemming are applied to generate index terms. The index terms in the sequential file are then sorted in alphabetical order.

Sequential file, (term, doc #) pairs in parsing order:

Term       Doc #
I          1
did        1
enact      1
julius     1
caesar     1
I          1
was        1
killed     1
I          1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
the        2
noble      2
brutus     2
hath       2
told       2
you        2
caesar     2
was        2
ambitious  2

Sorted index terms (after stop word removal and stemming):

No.  Term      Doc #
1    ambition  2
2    brutus    1
3    brutus    2
4    caesar    1
5    caesar    2
6    caesar    2
7    capitol   1
8    enact     1
9    julius    1
10   kill      1
11   kill      1
12   noble     2
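Building and sorting the sequential file can be sketched in Python: collect the (term, doc #) pairs in parsing order, then sort them (the token lists come from the two example documents; stop word removal and stemming are omitted for brevity):

```python
# Tokens of the two example documents, in parsing order.
docs = {
    1: ["i", "did", "enact", "julius", "caesar", "i", "was", "killed",
        "i", "the", "capitol", "brutus", "killed", "me"],
    2: ["so", "let", "it", "be", "with", "caesar", "the", "noble",
        "brutus", "hath", "told", "you", "caesar", "was", "ambitious"],
}

# Sequential file: (term, doc #) pairs in parsing order ...
sequential_file = [(term, doc_id)
                   for doc_id, tokens in docs.items()
                   for term in tokens]

# ... then sorted alphabetically (and by doc # for equal terms),
# which groups all occurrences of a term together.
sequential_file.sort()
print(sequential_file[:4])  # [('ambitious', 2), ('be', 2), ('brutus', 1), ('brutus', 2)]
```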
Sequential File
Why vocabulary?
Having information about the vocabulary (the list of terms) speeds up the search for relevant documents.
Why location?
Having information about the location of each term within the document helps with:
- User interface design: highlighting the location of search terms;
- Proximity-based ranking: adjacency and NEAR operators (in Boolean searching).
Why frequencies?
Having information about frequencies is used for:
- Calculating term weights (such as TF, TF*IDF, ...);
- Optimizing query processing.
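As an illustration of term weighting, a common TF*IDF formulation can be computed directly from the stored frequencies (this is one of several variants; the log base and scaling here are my choice, not mandated by the slides):

```python
import math

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """TF*IDF weight of a term in a document: tf * log2(N / df).
    A term that occurs in every document gets IDF = 0, i.e. it
    does not discriminate between documents."""
    return tf * math.log2(n_docs / df)

# "caesar" occurs in both documents of a 2-document collection,
# so it carries no discriminating weight:
print(tf_idf(2, 2, 2))  # 0.0
# A term unique to one document is weighted by its TF:
print(tf_idf(1, 1, 2))  # 1.0
```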
Inverted File
The record kept for each term j in the word list contains the following:
- Term j
- Number of documents in which term j occurs (DFj)
- Total frequency of term j in the collection (CFj)
- Pointer to the postings (inverted) list for term j
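A minimal sketch of this per-term record (class and field names are mine; the slide only specifies the four components). The "pointer to postings" is modelled here as an in-memory list of (doc #, TF) pairs; in a disk-based index it would be a file offset:

```python
from dataclasses import dataclass, field

@dataclass
class TermRecord:
    term: str                          # term j
    df: int = 0                        # DFj: number of documents containing term j
    cf: int = 0                        # CFj: total frequency of term j
    postings: list[tuple[int, int]] = field(default_factory=list)
                                       # stand-in for the pointer to the postings list

rec = TermRecord("caesar", df=2, cf=3, postings=[(1, 1), (2, 2)])
print(rec)
```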
Postings File (Inverted List)
The index has three parts: the vocabulary (word list), the postings (inverted lists), and the actual documents. Each vocabulary entry points to its inverted list:

Term   No. of Docs  Tot. freq  Pointer to postings
Act    3            3          → inverted list
Bus    3            4          → inverted list
pen    1            1          → inverted list
total  2            3          → inverted list
Example: Indexing

Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious."

Sorting the Vocabulary
After all documents have been tokenized, the (term, doc #) pairs of the sequential file are sorted alphabetically:

Unsorted          Sorted
Term     Doc #    Term      Doc #
I        1        ambitious 2
did      1        be        2
enact    1        brutus    1
julius   1        brutus    2
caesar   1        caesar    1
I        1        caesar    2
was      1        caesar    2
killed   1        capitol   1
...               enact     1
                  ...
Duplicate (term, doc #) pairs are then merged, recording the term frequency (TF) of each term in each document:

Term       Doc #  TF
ambitious  2      1
brutus     1      1
brutus     2      1
capitol    1      1
caesar     1      1
caesar     2      2
enact      1      1
julius     1      1
kill       1      2
noble      2      1

Finally the file is split into the vocabulary, which records DF and CF per term, and the postings file of (doc #, TF) pairs; each vocabulary entry holds a pointer to its postings list:

Vocabulary                 Postings (doc #, TF)
Term       DF  CF
ambitious  1   1           (2, 1)
brutus     2   2           (1, 1), (2, 1)
capitol    1   1           (1, 1)
caesar     2   3           (1, 1), (2, 2)
enact      1   1           (1, 1)
julius     1   1           (1, 1)
kill       1   2           (1, 2)
noble      1   1           (2, 1)
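The merge-and-split step can be sketched as follows: count the TF of each (term, doc #) pair, then derive the postings lists and the per-term DF/CF statistics (a small excerpt of the sorted pairs serves as input):

```python
from collections import Counter

# Excerpt of the sorted sequential file: (term, doc #) pairs.
pairs = [("brutus", 1), ("brutus", 2), ("caesar", 1),
         ("caesar", 2), ("caesar", 2), ("noble", 2)]

tf = Counter(pairs)  # TF of each term in each document

# Postings: for each term, the list of (doc #, TF) pairs.
postings: dict[str, list[tuple[int, int]]] = {}
for (term, doc_id), freq in sorted(tf.items()):
    postings.setdefault(term, []).append((doc_id, freq))

# Vocabulary: for each term, (DF, CF).
vocabulary = {term: (len(plist), sum(f for _, f in plist))
              for term, plist in postings.items()}

print(vocabulary["caesar"], postings["caesar"])  # (2, 3) [(1, 1), (2, 2)]
```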
Complexity Analysis
Term Documents
----------------------
a 2
brown 1, 2
dog 2
fox 1, 2
jumps 1, 2
lazy 1, 2
over 1, 2
quick 1, 2
the 1
Complexity Analysis
Storage of text:
- The need for text compression: to reduce storage space.
Indexing text
Storage of indexes:
- Is compression required? Do we store the indexes in memory or on disk?
Accessing text
Accessing indexes:
- How do we access the indexes? What data/file structure should we use?
Processing indexes:
- How do we search the index for a given query? How do we update the index?
Accessing documents
Text Compression

Huffman coding example: compress "MISSISSIPPI RIVER" (17 characters, counting the space, written "_" below).

Symbol frequencies:
I:5  S:4  P:2  R:2  M:1  V:1  E:1  _:1

The Huffman tree is built bottom-up by repeatedly merging the two lowest-frequency nodes, giving the internal nodes MVE_ (4), PRMVE_ (8), and the root ISPRMVE_ (17); each left branch is labelled 0 and each right branch 1.

Resulting code table:
Symbol: I    S    P    R    M     V     E     _
Code:   00   01   100  101  1100  1101  1110  1111

Example:
MISSISSIPPI RIVER =
1100 00 01 01 00 01 01 00 100 100 00 1111 101 00 1101 1110 101 =
1100000101000101001001000011111010011011110101 (46 bits)
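The encoding can be checked mechanically against the code table above (the "_" symbol is written as a space here):

```python
# Fixed Huffman code table from the example above.
codes = {
    "I": "00", "S": "01", "P": "100", "R": "101",
    "M": "1100", "V": "1101", "E": "1110", " ": "1111",
}

def huffman_encode(text: str, codes: dict[str, str]) -> str:
    """Concatenate the code word of every symbol in the text."""
    return "".join(codes[ch] for ch in text)

encoded = huffman_encode("MISSISSIPPI RIVER", codes)
print(len(encoded))  # 46 bits, versus 17 * 8 = 136 bits of 8-bit ASCII
```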
[Figure: Huffman tree over the symbols a-g, with cumulative probabilities (0.1, 0.2, 0.3, 0.4, 0.6, 1) at the internal nodes and 0/1 labels on the branches.]

Using Huffman coding, a code table can be constructed by working down the tree from the root, left to right, reading off the 0s and 1s along each branch. This gives the binary code word for each symbol.

What is the Huffman binary representation for 'café'?
Exercise
1. Given the following character frequencies, apply the Huffman algorithm to find an optimal binary code:
   Character:  a   b   c   d   e   t
   Frequency: 16   5  12  17  10  25
2. Given the text "for each rose, a rose is a rose", construct the Huffman coding.
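A minimal heap-based sketch of the Huffman algorithm (mine, not from the slides) can be used to check Exercise 1. Individual code assignments vary with tie-breaking, but the total encoded length of an optimal code is fixed:

```python
import heapq
from itertools import count

def huffman_codes(freqs: dict[str, int]) -> dict[str, str]:
    """Build a Huffman code table by repeatedly merging the two
    lowest-frequency subtrees (0 = left branch, 1 = right branch)."""
    tiebreak = count()  # keeps heap entries comparable
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # lowest-frequency subtree
        f2, _, right = heapq.heappop(heap)  # second lowest
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

freqs = {"a": 16, "b": 5, "c": 12, "d": 17, "e": 10, "t": 25}
codes = huffman_codes(freqs)
total_bits = sum(freqs[s] * len(codes[s]) for s in freqs)
print(total_bits)  # 212 bits for 85 characters, vs. 85 * 3 = 255 with fixed 3-bit codes
```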
Ziv-Lempel Compression
07/12/23 55
Thank You !!!