Lab 2
Instructor
Parth Mehta (parth [email protected])
Teaching Assistants
Adarsh Gupta ([email protected]),
Bhavesh Baraiya ([email protected])
August 2024
Lab Manual
Topics covered from the Introduction to IR book (Manning et al.): Pre-processing (Section 2.2), Boolean Matching
(Section 1.3), Vector Space Scoring (Section 6.3).
You are given a dataset consisting of approximately 32,000 news articles. The data is structured in JSON format; each article
has four fields: id, title, summary and text. In Lab 1 you created a Boolean index and a tf-idf index from the articles. For this
lab session, we will use the title field of each article as a query to search the Boolean and tf-idf indexes created previously.
Task 1: Boolean Search: For Boolean search, use the disjunction (OR) operator for querying. This means that documents
containing at least one of the query terms are considered relevant. Extend this matching model to a scoring function
by counting the number of query terms that appear in the document. Rank documents based on that score.
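The scoring rule in Task 1 can be sketched as follows. This is only an illustration, assuming the Boolean index from Lab 1 is a dictionary mapping each term to the set of document ids that contain it; the names boolean_or_rank and toy_index are hypothetical and your own index layout may differ.

```python
from collections import Counter

def boolean_or_rank(query_terms, inverted_index):
    """Rank documents by the number of distinct query terms they contain.

    inverted_index is assumed to map term -> set of doc ids (hypothetical layout).
    """
    scores = Counter()
    for term in set(query_terms):              # count each query term once
        for doc_id in inverted_index.get(term, ()):
            scores[doc_id] += 1                # one point per matching term
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Toy index, invented purely for illustration.
toy_index = {
    "rain": {1, 3},
    "flood": {1, 2},
    "relief": {2},
}
ranking = boolean_or_rank(["rain", "flood", "warning"], toy_index)
print(ranking)
```

Documents that match no query term never enter the score table, which is exactly the OR semantics: only documents with at least one query term are returned.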
Task 2: Vector space matching: Compute the tf-idf scores for each document for a given query and rank based
on the scores.
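One possible shape for Task 2 is sketched below. It assumes the tf-idf index stores raw term frequencies as term -> {doc id: tf}, uses idf = log10(N / df), weights each distinct query term by its idf, and normalizes by document vector length (cosine-style; the query length is dropped because it is the same for every document). These weighting choices and the names tfidf_rank and toy_postings are assumptions for illustration, not a prescribed solution.

```python
import math
from collections import defaultdict

def tfidf_rank(query_terms, postings, num_docs):
    """Rank documents by a length-normalized tf-idf dot product with the query.

    postings is assumed to map term -> {doc_id: raw term frequency}.
    """
    # Inverse document frequency for every indexed term.
    idf = {t: math.log10(num_docs / len(p)) for t, p in postings.items()}

    # Squared Euclidean length of each document's tf-idf vector.
    sq_norms = defaultdict(float)
    for term, plist in postings.items():
        for doc_id, tf in plist.items():
            sq_norms[doc_id] += (tf * idf[term]) ** 2

    # Accumulate the dot product term by term; unknown query terms are skipped.
    scores = defaultdict(float)
    for term in set(query_terms):
        for doc_id, tf in postings.get(term, {}).items():
            scores[doc_id] += idf[term] * (tf * idf[term])

    ranked = [(d, s / math.sqrt(sq_norms[d]))
              for d, s in scores.items() if sq_norms[d] > 0]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))

# Toy postings, invented purely for illustration.
toy_postings = {
    "rain": {1: 2, 3: 1},
    "flood": {1: 1, 2: 3},
    "relief": {2: 1},
}
tfidf_ranking = tfidf_rank(["rain", "flood"], toy_postings, num_docs=3)
print(tfidf_ranking)
```

Note that a term occurring in every document gets idf 0 under this weighting and therefore contributes nothing to any score.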
You are expected to complete the list of tasks mentioned below during the lab hours. You can use existing Python
libraries for preprocessing (NLTK or spaCy) and basic matrix manipulation (e.g. NumPy). For all other problems, you are
expected to write a solution from scratch. Specifically, you cannot use tools like SciPy or scikit-learn to vectorize the data.
Note: The use of GPT for such trivial tasks is generally frowned upon and highlights your lack of interest and/or
ability. Also, since the instructor uses it almost daily in his other life (to solve real problems, not tf-idf), he can easily
detect it with a few simple questions. Save yourself some embarrassment.
1. Explore PyTerrier