
CS276 PA1 Report

Rukmani Ravi Sundaram Tayyab Tariq


1. Description of the structure of the program: index.py
The algorithm used for indexing is Blocked Sort-Based Indexing (BSBI). The data is divided into 10 blocks. Each separate file is considered a document. Tokenization is done simply on whitespace characters.
Indexing - Key Steps: Each block is read into memory one at a time, so at any point in time only one block is in memory. The documents are parsed and (termID, docID) tuples are collected. These are sorted according to termID, then concatenated into a postings list for each termID; each postings list is in turn sorted according to docID (needed for merging at query time). Each block is written to disk as two files. One holds (termID, pointer to postings list, length of postings in bytes) tuples (posting.dict). The other contains the postings lists of all termIDs (corpus.index). All files are written as binary files.
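A minimal sketch of this per-block step, assuming Python and illustrative file/variable names (the actual layout of index.py is not reproduced in this report), might look like:

import os
import struct
from collections import defaultdict

def index_block(block_dir, doc_ids, term_ids, out_prefix):
    """Parse one block of documents, collect (termID, docID) pairs, sort them,
    and write the block's posting.dict / corpus.index files."""
    pairs = []
    for fname in sorted(os.listdir(block_dir)):
        doc_id = doc_ids.setdefault(fname, len(doc_ids))
        with open(os.path.join(block_dir, fname)) as f:
            # Tokenization is simply done on whitespace.
            for token in f.read().split():
                term_id = term_ids.setdefault(token, len(term_ids))
                pairs.append((term_id, doc_id))

    # Sort by termID, then concatenate into a docID-sorted postings list per term.
    postings = defaultdict(list)
    for term_id, doc_id in sorted(set(pairs)):
        postings[term_id].append(doc_id)

    with open(out_prefix + '.posting.dict', 'w') as dict_f, \
         open(out_prefix + '.corpus.index', 'wb') as index_f:
        for term_id in sorted(postings):
            plist = postings[term_id]
            offset = index_f.tell()
            data = struct.pack('%dI' % len(plist), *plist)  # binary postings list
            index_f.write(data)
            # (termID, pointer to postings list, length of postings in bytes)
            dict_f.write('%d %d %d\n' % (term_id, offset, len(data)))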
After all 10 blocks have been parsed this way, the next step is to merge them. A binomial merge is done, i.e., we merge two intermediate indexes at a time. Merging proceeds as follows. Read a line from the posting.dict file of each of the two blocks. Each termID is looked up in its block index's posting.dict along with its file position and length (in bytes), so we know where the postings list for that termID ends. Next, depending on whether the termIDs are the same or different, either merge the corresponding postings lists or write the single postings list to the intermediate combined output postings file, and update the intermediate combined postings dictionary file accordingly. Repeat until we get a single index. The number of disk seeks/reads/writes is proportional to the sum of the number of terms in each block index, i.e., O(T), where T is the total number of terms in the vocabulary.
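A sketch of the pairwise merge of two intermediate indexes, under the same assumptions as the block sketch above (helper names and the binary format are illustrative, not the report's actual code):

import struct

def read_postings(index_f, entry):
    """Read one postings list given a (termID, file offset, length in bytes) entry."""
    term_id, offset, length = entry
    index_f.seek(offset)
    return list(struct.unpack('%dI' % (length // 4), index_f.read(length)))

def write_postings(out_dict, out_idx, term_id, plist):
    """Append one postings list to the combined index and dictionary files."""
    offset = out_idx.tell()
    data = struct.pack('%dI' % len(plist), *plist)
    out_idx.write(data)
    out_dict.write('%d %d %d\n' % (term_id, offset, len(data)))

def merge_blocks(dict1, idx1, dict2, idx2, out_dict, out_idx):
    """Merge two intermediate block indexes, one dictionary entry at a time.
    dict1/dict2 are iterators over (termID, offset, length) tuples sorted by termID."""
    e1, e2 = next(dict1, None), next(dict2, None)
    while e1 is not None or e2 is not None:
        if e2 is None or (e1 is not None and e1[0] < e2[0]):
            term, plist = e1[0], read_postings(idx1, e1)
            e1 = next(dict1, None)
        elif e1 is None or e2[0] < e1[0]:
            term, plist = e2[0], read_postings(idx2, e2)
            e2 = next(dict2, None)
        else:  # same termID in both blocks: merge the two postings lists
            term = e1[0]
            plist = sorted(set(read_postings(idx1, e1)) |
                           set(read_postings(idx2, e2)))
            e1, e2 = next(dict1, None), next(dict2, None)
        write_postings(out_dict, out_idx, term, plist)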
Gamma Encoding - Each postings list was separately gamma-encoded and right-padded with 1s for ease of access, since posting.dict maintains the length of each postings list in bytes.
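A small illustrative sketch of gamma-encoding a postings list with gap encoding and right-padding with 1s; the exact bit and gap conventions used by index.py are assumptions here:

def gamma_encode(n):
    """Gamma code of a positive integer: unary code of the offset length, then the offset."""
    offset = bin(n)[3:]                       # binary representation without the leading 1
    return '1' * len(offset) + '0' + offset

def gamma_encode_postings(postings):
    """Gap-encode a docID-sorted postings list with gamma codes and right-pad
    with 1s to a whole number of bytes (posting.dict keeps the length in bytes)."""
    bits, prev = '', 0
    for doc_id in postings:
        bits += gamma_encode(doc_id - prev + 1)   # +1 so every gap is >= 1 (illustrative)
        prev = doc_id
    bits += '1' * (-len(bits) % 8)                # right-pad with 1s to a byte boundary
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))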

Query - Key Steps: At query time, the query terms are sorted in ascending order of document frequency, which is maintained in the posting.dict file. The postings lists of the query terms are merged two at a time, and the docIDs in the resulting postings list are finally mapped back to their document names and displayed.
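A sketch of this query step (the doc_freq, read_postings and doc_names helpers are assumed for illustration):

def intersect(p1, p2):
    """Intersect two docID-sorted postings lists with a two-pointer walk."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

def answer_query(terms, doc_freq, read_postings, doc_names):
    """Conjunctive query: process the rarest terms first, intersecting two at a time."""
    # Sort query terms in ascending order of document frequency (from posting.dict).
    terms = sorted(terms, key=lambda t: doc_freq[t])
    result = read_postings(terms[0])
    for term in terms[1:]:
        result = intersect(result, read_postings(term))
        if not result:                  # stop early once the intersection is empty
            break
    return [doc_names[d] for d in result]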

Size of Index (postings list):

Uncompressed: 55.3 MB
Variable Byte Compression: 16.4 MB (29% of the uncompressed size)
Gamma Encoding: 11.5 MB (20% of the uncompressed size)
Encoding         Query Time    Index Load Time
None             0.23s         6.07s
Variable Byte    0.098s        5.842s
Gamma            0.5s          5.2s


Indexing Time:
Uncompressed: 10 minutes, 11 seconds
Variable Byte Compression: 7 minutes, 40 seconds
Gamma Encoding: 43 minutes, 58 seconds

2.a. Larger block sizes will likely reduce indexing time, but not by a whole lot. Since the bulk of the indexing time is spent on disk I/O, only a small performance gain is obtained by processing (parsing/sorting) larger blocks in main memory. The size of the blocks does have an impact during the merge phase, especially if the binomial merge is done. The merge step alone is O(T log K) in disk I/O, where K is the number of blocks and T is an upper bound on the number of terms in any one block. So the merge step is upper-bounded by the time needed to read/write T terms, log K times. By reducing the number of blocks, or in other words increasing the size of each block, we can improve the merging time. However, block sizes are restricted by the amount of main memory available. So a general strategy for minimizing indexing time is to use the largest possible block size, i.e., one equal to the size of main memory.
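As a rough worked example (block counts assumed purely for illustration): with K = 10 blocks, a pairwise merge needs ceil(log2 10) = 4 levels of merging, each reading and writing on the order of T dictionary entries; doubling the block size so that K = 5 only cuts this to ceil(log2 5) = 3 levels, so the saving from larger blocks grows only logarithmically.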

2.b. The merging of each block's postings lists in its current form limits scalability to large datasets. The number of disk seeks/reads/writes performed during each merge is proportional to 3 times the number of terms in each block index, because we read in one postings list at a time, perform the merge step, and write it back to disk. This step can be optimized. We could read in a subsection (say K postings lists) of each intermediate index, perform the merge in memory, and write back to disk only when the merged postings lists exceed the available main memory. This way the number of disk reads/writes is reduced to one every K terms (assuming we read in K terms from each of the two block postings files); a sketch is given below.
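A sketch of the buffered output side of this idea (batching the reads of K postings lists per index would be symmetric); the generator, threshold, and helper names are assumptions:

def merged_entries(dict1, idx1, dict2, idx2, read_postings):
    """Yield (termID, merged postings list) pairs from two sorted block indexes."""
    e1, e2 = next(dict1, None), next(dict2, None)
    while e1 is not None or e2 is not None:
        if e2 is None or (e1 is not None and e1[0] < e2[0]):
            yield e1[0], read_postings(idx1, e1)
            e1 = next(dict1, None)
        elif e1 is None or e2[0] < e1[0]:
            yield e2[0], read_postings(idx2, e2)
            e2 = next(dict2, None)
        else:
            yield e1[0], sorted(set(read_postings(idx1, e1)) |
                                set(read_postings(idx2, e2)))
            e1, e2 = next(dict1, None), next(dict2, None)

def buffered_merge(dict1, idx1, dict2, idx2, read_postings, flush,
                   max_buffered_postings=1000000):
    """Accumulate merged postings in memory and flush them to disk in large
    batches, so writes happen once per batch instead of once per term."""
    buffer, buffered = [], 0
    for term, plist in merged_entries(dict1, idx1, dict2, idx2, read_postings):
        buffer.append((term, plist))
        buffered += len(plist)
        if buffered >= max_buffered_postings:   # buffer is about to exceed the memory budget
            flush(buffer)
            buffer, buffered = [], 0
    if buffer:
        flush(buffer)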

Another part that can be optimized is to populate each term's postings list directly as we parse the documents, as opposed to collecting the (termID, docID) pairs, sorting them, and then combining them into (termID, postings list) entries. This saves us the extra combining step.
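A sketch of this direct population of postings lists during parsing (names are illustrative):

from collections import defaultdict

def parse_block_direct(docs, term_ids):
    """Build termID -> postings list directly while parsing, skipping the
    intermediate (termID, docID) pair collection and sort."""
    postings = defaultdict(list)
    for doc_id, text in docs:                 # docs yields (docID, document text)
        seen = set()
        for token in text.split():
            term_id = term_ids.setdefault(token, len(term_ids))
            if term_id not in seen:           # add each docID to a postings list only once
                postings[term_id].append(doc_id)
                seen.add(term_id)
    # If documents arrive in increasing docID order, every postings list
    # is already sorted by docID and needs no further sorting.
    return postings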

A third optimization of indexing time is to compress only the final postings list, instead of compressing/decompressing each intermediate postings file. This can only be done if disk space is sufficient to hold all the uncompressed postings.

2.c. To improve retrieval time, one optimization that has already been done is to merge the postings lists of the query terms in increasing order of document frequency, to minimize the number of steps taken to merge. We could also take advantage of multithreading, by retrieving the postings list for the next query term while the merging of the previous two query terms is taking place.

Using skip lists can help reduce query retrieval time, at an additional cost in disk space and indexing time, since we now have to populate the postings lists with additional skip pointers. But the extra cost of building and maintaining skip pointers may not be practical in dynamic IR systems.
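For illustration, a sketch of postings intersection with skip pointers, using the common sqrt(length) skip-interval heuristic (the report does not specify an interval):

import math

def add_skips(postings):
    """Attach evenly spaced skip pointers: skips[i] is the index jumped to from i."""
    skip = int(math.sqrt(len(postings))) or 1
    return {i: i + skip for i in range(0, len(postings) - skip, skip)}

def intersect_with_skips(p1, skips1, p2, skips2):
    """Two-pointer intersection that follows skip pointers to jump ahead."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip only if it does not overshoot the other list's current docID.
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                i = skips1[i]
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                j = skips2[j]
            else:
                j += 1
    return result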
Variable byte encoding seems to perform a lot better (with respect to retrieval time and indexing time) than a variable bit encoding such as gamma, since computers are better optimized for byte-level operations than for manipulations at the bit level.
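For comparison, a sketch of the standard variable byte scheme (7 payload bits per byte, with the high bit marking the last byte of a number); whether index.py uses exactly this layout is an assumption:

def vb_encode_number(n):
    """Variable byte code: 7 payload bits per byte, high bit set on the final byte."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                      # continuation bit marks the last byte of the number
    return bytes(out)

def vb_decode(data):
    """Decode a stream of variable-byte-coded numbers."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte          # more bytes of the current number follow
        else:
            numbers.append(128 * n + byte - 128)
            n = 0
    return numbers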
