CS 3308 Programming Assignment Unit 2
After running the modified indexer against the CACM corpus, the following statistics were
observed:
1. Efficiency Considerations
16GB RAM), which is reasonable but shows clear areas for optimization.
around 2GB of RAM usage. This became particularly noticeable when processing
the processing time was spent on I/O operations. For example, processing a batch
of 100 documents took an average of 8 seconds, with 4.8 seconds dedicated to file
reading operations.
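The figures above work out to 4.8 / 8 = 60% of batch time spent reading files. A minimal sketch of how such a split could be measured is shown here. The function name `time_batch` and the whitespace tokenizing stand-in are hypothetical and are not taken from the indexer itself.

```python
import time
from pathlib import Path
from typing import Iterable, Tuple


def time_batch(paths: Iterable[Path]) -> Tuple[float, float]:
    """Return (read_seconds, total_seconds) for one batch of files."""
    read_seconds = 0.0
    batch_start = time.perf_counter()
    for path in paths:
        # Time only the file read, so I/O can be separated from processing.
        read_start = time.perf_counter()
        text = path.read_text(encoding="utf-8", errors="ignore")
        read_seconds += time.perf_counter() - read_start
        # Stand-in for tokenizing and indexing (assumption: the real
        # indexer does considerably more work at this point).
        _ = text.lower().split()
    total_seconds = time.perf_counter() - batch_start
    return read_seconds, total_seconds
```

Dividing `read_seconds` by `total_seconds` gives the I/O share for the batch, which can then be compared across batch sizes.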
ones. For instance, terms like "electronic computer" in older articles evolved into
found that documents with lengthy content required special handling to prevent
Path handling across operating systems required special attention. Initially, I used
This reduced peak memory usage by approximately 40% for large documents,
These metrics helped identify bottlenecks and guide optimization efforts. The
"""
CACM Corpus Indexer
Author: Unknown
Date: November 27, 2024
Description: This program implements a single-pass in-memory indexer for
processing the CACM corpus and generating an inverted index.
"""
import sys
import os
import re
import math
import sqlite3
import time
from typing import Dict, Set
class Term:
    """
    Class to represent term information in the index.
    Stores term frequency, document frequency, and posting information.
    """

    def __init__(self):
        self.termid: int = 0
        self.termfreq: int = 0
        self.docs: int = 0
        # Posting information, keyed by document id
        self.docids: Dict[int, int] = {}
return token_list
try:
    # Get all files and directories
    all_items = [f for f in os.listdir(dirname)
                 if os.path.isdir(os.path.join(dirname, f))
                 or os.path.isfile(os.path.join(dirname, f))]
except Exception as e:
    print(f"Error walking directory {dirname}: {e}")
    return False
# Term Dictionary
cursor.execute("DROP TABLE IF EXISTS TermDictionary")
cursor.execute("""
    CREATE TABLE IF NOT EXISTS TermDictionary (
        Term TEXT,
        TermId INTEGER PRIMARY KEY
    )
""")
# Posting Table
cursor.execute("DROP TABLE IF EXISTS Posting")
cursor.execute("""
    CREATE TABLE IF NOT EXISTS Posting (
        TermId INTEGER,
        DocId INTEGER,
        tfidf REAL,
        docfreq INTEGER,
        termfreq INTEGER,
        FOREIGN KEY(TermId) REFERENCES TermDictionary(TermId),
        FOREIGN KEY(DocId) REFERENCES DocumentDictionary(DocId)
    )
""")
# Create indexes
cursor.execute("CREATE INDEX IF NOT EXISTS idx_term ON TermDictionary(Term)")
cursor.execute("CREATE INDEX IF NOT EXISTS idx_posting_term ON Posting(TermId)")
cursor.execute("CREATE INDEX IF NOT EXISTS idx_posting_doc ON Posting(DocId)")
def main():
    """Main execution function."""
    # Record start time
    start_time = time.localtime()
    print(f"Start Time: {start_time.tm_hour:02d}:{start_time.tm_min:02d}")

    # Initialize database
    db_path = "cacm_index.db"
    conn = sqlite3.connect(db_path)
    conn.isolation_level = None  # Enable autocommit
    cursor = conn.cursor()

    # Process corpus
    corpus_path = "./cacm"  # Update this path to match your environment
    if not os.path.exists(corpus_path):
        print(f"Error: Corpus directory not found at {corpus_path}")
        conn.close()  # Release the database handle before exiting early
        return
    walkdir(cursor, corpus_path)

    # Print statistics
    end_time = time.localtime()
    print("\nIndexing Statistics:")
    print(f"Documents Processed: {documents}")
    print(f"Total Terms (Tokens): {tokens}")
    print(f"Unique Terms: {terms}")
    print(f"End Time: {end_time.tm_hour:02d}:{end_time.tm_min:02d}")

    # Close the database connection once indexing is complete
    conn.close()


if __name__ == '__main__':
    main()
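The Posting table stores a tfidf value alongside docfreq and termfreq, but the listing does not show how that weight is derived. The following is a minimal sketch, assuming the common tf × log10(N / df) formulation described by Manning et al. The function name `tfidf_weights` and its parameters are hypothetical, and the indexer may use a different variant.

```python
import math
from typing import Dict


def tfidf_weights(doc_term_freqs: Dict[int, int], total_docs: int) -> Dict[int, float]:
    """Map DocId -> tf-idf weight for a single term.

    doc_term_freqs maps each DocId containing the term to the term's
    frequency in that document; total_docs is the corpus size N.
    """
    doc_freq = len(doc_term_freqs)  # df: number of documents containing the term
    if doc_freq == 0 or total_docs <= 0:
        return {}
    # Assumed weighting: idf = log10(N / df); rarer terms get larger weights.
    idf = math.log10(total_docs / doc_freq)
    return {doc_id: tf * idf for doc_id, tf in doc_term_freqs.items()}


# Example: a term that appears in 2 of 100 documents.
postings = {1: 3, 7: 1}
weights = tfidf_weights(postings, total_docs=100)
```

Under this formulation a term that occurs in every document receives a weight of zero, since log10(N / N) = 0.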
Through this implementation, I gained practical insight into the tradeoffs between memory
usage, processing speed, and index quality that characterize real-world information retrieval
systems (Manning, Raghavan & Schütze, 2009). The experience highlighted the importance of
careful system design and the need to consider both theoretical and practical constraints when
Manning, C. D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval.
http://nlp.stanford.edu/IR-book/information-retrieval-book.html