Lab 2

Uploaded by

Rishab Gupta

Lab 1 - Boolean and TF-IDF Matching

Instructor
Parth Mehta (parth [email protected])

Teaching Assistants
Adarsh Gupta ([email protected]),
Bhavesh Baraiya ([email protected])
August 2024

Lab Manual
Topics covered from the Introduction to IR book (Manning et al.): Pre-processing (Section 2.2), Boolean Matching
(Section 1.3), Vector Space Scoring (Section 6.3).

You are given a dataset of approximately 32,000 news articles. The data is in JSON format; each article
has four fields: id, title, summary and text. In Lab 1 you created a boolean index and a tf-idf index from the articles.
In this lab session, we will use the title field of each article as a query to search those two indexes.
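The setup above can be sketched as follows: load the articles and turn each title into a list of query terms. This is a minimal sketch under stated assumptions, not the required solution. It assumes the dataset is a single JSON array of objects; the sample records, the function names and the regex tokenizer are all hypothetical. The tokenizer is only a stand-in for NLTK or spaCy, and whatever preprocessing you used to build the index in Lab 1 should be applied to the queries as well.

```python
import json
import re

# Hypothetical sample in the assumed layout: one JSON array of article objects.
SAMPLE = """[
  {"id": "a1", "title": "Stock market falls", "summary": "s1", "text": "t1"},
  {"id": "a2", "title": "Market rally continues!", "summary": "s2", "text": "t2"}
]"""

def load_articles(raw_json):
    """Parse the JSON text and key each article by its id."""
    return {article["id"]: article for article in json.loads(raw_json)}

def tokenize(text):
    """Lowercase and keep alphanumeric runs (stand-in for NLTK/spaCy)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def title_queries(articles):
    """Map each article id to the token list of its title."""
    return {doc_id: tokenize(a["title"]) for doc_id, a in articles.items()}

articles = load_articles(SAMPLE)
queries = title_queries(articles)
```

For the real dataset you would read the file contents and pass them to `load_articles`; if the file holds one JSON object per line instead, parse it line by line.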

Task 1: Boolean Search: For boolean search, use the disjunction (OR) operator for querying. This means that any
document containing at least one of the query terms is relevant. Extend this matching model to a scoring function
by counting the number of distinct query terms that appear in the document. Rank documents by that score.
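One way to read Task 1 is sketched below, assuming the boolean index maps each term to the set of document ids containing it. The function names, the toy documents and the tie-breaking rule (by document id) are illustrative assumptions, not part of the lab specification.

```python
from collections import Counter, defaultdict

def build_boolean_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            index[term].add(doc_id)
    return index

def boolean_or_rank(query_tokens, index):
    """OR matching: score = number of distinct query terms found in the doc."""
    scores = Counter()
    for term in set(query_tokens):          # count each query term once
        for doc_id in index.get(term, ()):  # unknown terms match nothing
            scores[doc_id] += 1
    # Highest score first; ties broken by document id (an assumption).
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy, already-tokenized documents (hypothetical).
toy_docs = {
    "d1": ["stock", "market", "falls"],
    "d2": ["market", "rally", "continues"],
    "d3": ["rain", "delays", "match"],
}
toy_index = build_boolean_index(toy_docs)
ranking = boolean_or_rank(["stock", "market", "falls"], toy_index)
```

Documents that share no term with the query never enter `scores`, so they are excluded from the ranking rather than listed with a score of zero.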

Task 2: Vector space matching: For a given query, compute the tf-idf score of each document and rank the
documents by that score.
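A minimal from-scratch sketch of Task 2 follows. It assumes one particular weighting: raw term frequency multiplied by idf = log10(N / df), with documents ranked by cosine similarity to the query vector. Other tf-idf variants are equally valid, so use whichever scheme your Lab 1 index was built with. All names and the toy data are hypothetical.

```python
import math
from collections import Counter

def build_tfidf(docs):
    """Return (idf, vectors): idf per term and a sparse tf-idf vector per doc."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))              # document frequency of each term
    idf = {t: math.log10(n_docs / d) for t, d in df.items()}
    vectors = {
        doc_id: {t: c * idf[t] for t, c in Counter(tokens).items()}
        for doc_id, tokens in docs.items()
    }
    return idf, vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def tfidf_rank(query_tokens, idf, vectors):
    """Rank documents by cosine similarity to the query's tf-idf vector."""
    q_tf = Counter(t for t in query_tokens if t in idf)  # drop unseen terms
    q_vec = {t: c * idf[t] for t, c in q_tf.items()}
    scored = [(d, cosine(q_vec, vec)) for d, vec in vectors.items()]
    scored = [(d, s) for d, s in scored if s > 0]
    return sorted(scored, key=lambda kv: (-kv[1], kv[0]))

# Toy, already-tokenized documents (hypothetical).
corpus = {
    "d1": ["stock", "market", "falls"],
    "d2": ["market", "rally", "continues"],
    "d3": ["rain", "delays", "match"],
}
idf, vectors = build_tfidf(corpus)
results = tfidf_rank(["stock", "market"], idf, vectors)
```

Scanning every document vector is fine for a toy corpus. With about 32,000 articles, it is cheaper to walk the postings of the query terms only, as in the term-at-a-time scoring described in the book.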

You are expected to complete the tasks listed above during the lab hours. You can use existing Python
libraries for preprocessing (NLTK or spaCy) and basic matrix manipulation (e.g. NumPy). For all other problems, you are
expected to write a solution from scratch. Specifically, you cannot use tools like SciPy or scikit-learn to vectorize the data.

Note: The use of GPT for such trivial tasks is generally frowned upon and highlights your lack of interest and/or
ability. Also, since the instructor uses it almost daily in his other life (to solve real problems, not tf-idf), he can easily
detect it with a few simple questions. Save yourself some embarrassment.

Advanced Exploratory Topics

This lab is designed as a warm-up exercise and some of you might find it too easy. In that case you can look ahead and
experiment with the following problems, which we will cover in a future lab session.

1. Explore PyTerrier

2. Implement preprocessing and tf-idf indexing pipeline in PyTerrier

3. Compare the vocabulary size, index size and tf-idf values from the index created by PyTerrier with the one created
in the previous exercise.

4. [new] Perform query matching and ranking using PyTerrier.
