CSE508: Information Retrieval Assignment 2: Question 1 - (40 Points) Scoring and Term-Weighting

This document outlines 3 questions for an information retrieval assignment. Question 1 involves implementing Jaccard coefficient and TF-IDF scoring on a dataset to retrieve top relevant documents for a query. Question 2 involves ranking documents using Microsoft learning to rank dataset and evaluating using DCG and nDCG. Question 3 involves implementing a Naive Bayes text classifier on 20 newsgroups dataset using TF-ICF feature selection and reporting results across different train-test splits. Students need to submit code and a report on GitHub explaining their methodology and results.

Uploaded by

Pranshu Patel

CSE508: Information Retrieval

Assignment 2
Max Marks: 100

Instructions-
• The assignment is to be attempted in groups (max 2 members).
• Language allowed: Python
• For plagiarism, institute policy will be followed.
• You need to submit README.pdf and code files. The code should be well commented.
• You are allowed to use libraries such as NLTK for data preprocessing.
• Mention methodology, preprocessing steps, and assumptions you may have in README.pdf.
• You will be required to use Github for code management.
– Each group must create a GitHub repository with the name IR2022 A2 GroupNo (e.g. IR2022 A2 1
for Group No-1).
– Each group must add the assigned TA as a collaborator to the GitHub repository. TAs’ GitHub
handles will be shared shortly.
– When uploading on Classroom, each group must upload a link to the GitHub repository.
Only one member needs to submit.
• You may not use a ready-made API or library for the tasks you have been assigned. You have to do the
implementation from scratch. For instance, if you have been asked to implement IDF, then you can’t
use a library function that computes it.
• You will have 10 days to complete the assignment.

Question 1 - [40 Points] Scoring and Term-Weighting

Jaccard Coefficient [20 points]

The goal is to compute the Jaccard coefficient between a given query and each document. The formula
is given below:

Jaccard Coefficient = |doc ∩ query| / |doc ∪ query|

where doc and query are the sets of tokens in the document and in the query. The higher the value of
the Jaccard coefficient, the more relevant the document is to the query.

1. Use the same data given in assignment 1 and carry out the same preprocessing steps as mentioned
before.
2. To calculate this, make a set of the document tokens and a set of the query tokens, then compute the
intersection and union between the query and each document.
3. Report the top 5 relevant documents based on the value of the Jaccard coefficient.
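
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference solution: the function names, the toy corpus, and the choice to score two empty sets as 0.0 are all assumptions, and the tokens are assumed to be already preprocessed.

```python
def jaccard(doc_tokens, query_tokens):
    """Jaccard coefficient between two token collections, treated as sets."""
    doc_set, query_set = set(doc_tokens), set(query_tokens)
    union = doc_set | query_set
    if not union:  # both empty: score defined as 0.0 here (an assumption)
        return 0.0
    return len(doc_set & query_set) / len(union)


def top_k_jaccard(docs, query_tokens, k=5):
    """Return the k (doc_id, score) pairs with the highest Jaccard score."""
    scores = [(doc_id, jaccard(tokens, query_tokens)) for doc_id, tokens in docs.items()]
    # Sort by descending score; ties are broken by doc_id so the result is stable.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:k]


# Toy, made-up corpus of already preprocessed tokens.
docs = {
    "d1": ["information", "retrieval", "system"],
    "d2": ["retrieval", "model"],
    "d3": ["naive", "bayes", "classifier"],
}
ranking = top_k_jaccard(docs, ["information", "retrieval"], k=2)
print(ranking)
```

Because both sides are converted to sets, repeated tokens do not affect the score, which is one of the limitations worth discussing in the report.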

TF-IDF Matrix [20 points]

The goal is to generate a TF-IDF matrix covering every word in the vocab and to obtain a TF-IDF score
for a given query. TF-IDF has two parts: Term Frequency and Inverse Document Frequency.
• Computing Term Frequency involves calculating the raw count of each word in each document and
storing the counts as a nested dictionary, with one inner dictionary per document.
• To calculate the document frequency of each word, find the postings list of the word and count
the no. of documents in it.

• The IDF value of each word is calculated using the smoothed formula below:
IDF(word) = log(total no. of documents / (document frequency(word) + 1))
• The Term Frequency is calculated using 5 different variants:

Weighting Scheme        TF Weight

Binary                  0, 1
Raw count               f(t,d)
Term frequency          f(t,d) / Σ f(t',d), summed over all terms t' in d
Log normalization       log(1 + f(t,d))
Double normalization    0.5 + 0.5 * (f(t,d) / max f(t',d)), maximum over all terms t' in d
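
As a rough sketch, the document frequency, the smoothed IDF, and the five TF variants above could be written as below. The function names and scheme labels are hypothetical. The sketch assumes the +1 belongs in the denominator of the IDF formula and that the logarithm is natural, neither of which the assignment states explicitly.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Map each term to the number of documents that contain it."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per document
    return df


def idf(term, df, n_docs):
    """Smoothed IDF, read here as log(N / (df + 1)) with a natural log (assumptions)."""
    return math.log(n_docs / (df.get(term, 0) + 1))


def tf_weight(term, tokens, scheme):
    """TF weight of `term` in one tokenized document under one of the five schemes."""
    counts = Counter(tokens)
    f = counts[term]
    if scheme == "binary":
        return 1.0 if f > 0 else 0.0
    if scheme == "raw":
        return float(f)
    if scheme == "term_frequency":
        # The sum of all term counts in a document equals its token count.
        return f / len(tokens) if tokens else 0.0
    if scheme == "log":
        return math.log(1 + f)
    if scheme == "double":
        return 0.5 + 0.5 * f / max(counts.values()) if counts else 0.0
    raise ValueError(f"unknown scheme: {scheme}")


# Toy, made-up documents.
docs = {"d1": ["a", "a", "b"], "d2": ["b", "c"]}
df = document_frequency(docs)
print(idf("a", df, len(docs)))
print(tf_weight("a", docs["d1"], "double"))
```

Two side effects of these formulas are worth noting: under this reading of the smoothing, a term that occurs in every document gets a negative IDF, and double normalization assigns 0.5 even to a term that is absent from the document.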

1. Use the same data given in assignment 1 and carry out the same preprocessing steps as mentioned
before.
2. Build a matrix of size (no. of documents) x (vocab size).
3. Fill in the TF-IDF value of each vocab word for each document.
4. Make a query vector of size equal to the vocab size.
5. Compute the TF-IDF score for the query using the TF-IDF matrix. Report the top 5 relevant
documents based on the score.
6. Use all 5 weighting schemes for term frequency calculation and report the TF-IDF score and
results for each scheme separately.
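
One possible way to put steps 2 to 5 together is sketched below. The assignment does not say how the query vector is weighted or how the score is combined, so this sketch assumes a binary query vector and a dot product with each document row. All names and the toy corpus are hypothetical, and only the raw-count TF variant is shown.

```python
import math
from collections import Counter


def build_tfidf_matrix(docs, tf_fn):
    """Return (doc_ids, vocab, matrix) with matrix[i][j] = tf * idf of vocab[j] in doc i."""
    doc_ids = sorted(docs)
    vocab = sorted({t for tokens in docs.values() for t in tokens})
    n_docs = len(doc_ids)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    # Smoothed IDF read as log(N / (df + 1)); natural log is an assumption.
    idf = {t: math.log(n_docs / (df[t] + 1)) for t in vocab}
    matrix = []
    for doc_id in doc_ids:
        counts = Counter(docs[doc_id])
        matrix.append([tf_fn(counts[t], counts) * idf[t] for t in vocab])
    return doc_ids, vocab, matrix


def score_query(query_tokens, doc_ids, vocab, matrix, k=5):
    """Rank documents by the dot product of each row with a binary query vector."""
    query_terms = set(query_tokens)
    query_vec = [1.0 if t in query_terms else 0.0 for t in vocab]
    scores = [
        (doc_id, sum(w * q for w, q in zip(row, query_vec)))
        for doc_id, row in zip(doc_ids, matrix)
    ]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:k]


def raw_count(f, counts):
    """Raw-count TF variant; the other four variants would plug in the same way."""
    return float(f)


# Toy, made-up corpus of already preprocessed tokens.
docs = {
    "d1": ["apple", "apple", "banana"],
    "d2": ["banana", "cherry"],
    "d3": ["cherry", "date"],
    "d4": ["date", "elderberry"],
}
doc_ids, vocab, matrix = build_tfidf_matrix(docs, raw_count)
top = score_query(["apple"], doc_ids, vocab, matrix, k=2)
print(top)
```

Passing the TF variant in as a function keeps the matrix-building code unchanged when switching between the five weighting schemes.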

Note- State the pros and cons of using each scoring scheme to find the relevance of documents in your
report.

Question 2 - [25 points] Ranked-Information Retrieval and Evaluation

Use the data file provided here. This has been taken from the Microsoft Learning to Rank dataset, which
can be found here. Read the dataset description carefully to understand what it contains.

1. Consider only the queries with qid:4 and the relevance judgement labels as relevance score.

2. (10 points) Make a file rearranging the query-url pairs in order of max DCG. State how many such
files could be made.
3. (5 points) Compute nDCG

(a) At 50
(b) For the whole dataset

4. (10 points) Assume a model that simply ranks URLs on the basis of the value of feature 75 (sum of
TF-IDF on the whole document), i.e. the higher the value, the more relevant the URL. Treat any
non-zero relevance judgment value as relevant. Plot a Precision-Recall curve for query “qid:4”.
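
The evaluation measures in this question can be sketched as follows. The assignment does not fix a DCG formula, so this sketch assumes gain rel_i and discount log2(i + 1) with ranks starting at 1. An exponential gain (2^rel - 1) is a common alternative and would change the numbers. The function names and the sample labels are made up, and the count of max-DCG orderings treats every row as distinct.

```python
import math
from collections import Counter


def dcg(relevances, k=None):
    """DCG with gain rel_i and discount log2(i + 1), ranks i starting at 1 (assumed form)."""
    rels = relevances if k is None else relevances[:k]
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))


def ndcg(relevances, k=None):
    """DCG of the given order divided by the DCG of the ideal (descending) order."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


def count_max_dcg_orderings(relevances):
    """Orderings of the full list that reach max DCG: permutations within equal labels."""
    result = 1
    for group_size in Counter(relevances).values():
        result *= math.factorial(group_size)
    return result


def precision_recall_points(scores, relevances):
    """(recall, precision) after each rank when sorting by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_relevant = sum(1 for rel in relevances if rel > 0)
    points, hits = [], 0
    for rank, i in enumerate(order, start=1):
        if relevances[i] > 0:  # any non-zero label counts as relevant
            hits += 1
        recall = hits / total_relevant if total_relevant else 0.0
        points.append((recall, hits / rank))
    return points


# Made-up relevance labels and feature values, for illustration only.
labels = [3, 2, 3, 0, 1, 2]
print(ndcg(labels, k=3), ndcg(labels))
print(count_max_dcg_orderings(labels))
pr_points = precision_recall_points([0.9, 0.8, 0.1], [1, 0, 2])
print(pr_points)
```

The (recall, precision) pairs can then be handed to any plotting library to draw the curve. Plotting itself is left out of the sketch.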

Question 3 - [ 35 points ] Naive Bayes Classifier

Download the 20 Newsgroups dataset. Pick the documents of comp.graphics, sci.med, talk.politics.misc,
rec.sport.hockey, and sci.space [5 classes] for text classification.

Implement the Naive Bayes algorithm for text classification using TF-ICF (a modification of TF-IDF)
as a feature selection technique.

TF-ICF score for a given term belonging to a class can be calculated as follows:

Term Frequency (TF): Number of occurrences of a term in all documents of a particular class

Class Frequency (CF): Number of classes in which that term occurs

Inverse-Class Frequency (ICF): log( N / CF), where N represents the number of classes
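
A minimal sketch of these definitions is given below, together with the top-k selection that the implementation points ask for. The function names and the toy data are hypothetical, and the natural logarithm is an assumption since the base is not specified.

```python
import math
from collections import Counter


def tf_icf_scores(class_docs):
    """Return {class: {term: TF * ICF}} from {class: [token lists]} training data."""
    n_classes = len(class_docs)
    # TF: occurrences of each term over all documents of one class.
    tf = {c: Counter(t for doc in docs for t in doc) for c, docs in class_docs.items()}
    # CF: number of classes in which each term occurs.
    cf = Counter()
    for counts in tf.values():
        cf.update(counts.keys())
    return {
        c: {t: f * math.log(n_classes / cf[t]) for t, f in counts.items()}
        for c, counts in tf.items()
    }


def select_vocabulary(scores, k):
    """Union over classes of the k highest-scoring terms of each class."""
    vocab = set()
    for term_scores in scores.values():
        # Ties are broken alphabetically so the selection is deterministic.
        ranked = sorted(term_scores, key=lambda t: (-term_scores[t], t))
        vocab.update(ranked[:k])
    return vocab


# Toy, made-up training data with two classes.
class_docs = {
    "space": [["orbit", "rocket", "the"], ["orbit", "the"]],
    "med": [["doctor", "the"], ["patient", "doctor"]],
}
scores = tf_icf_scores(class_docs)
print(select_vocabulary(scores, k=1))
```

A term that occurs in every class has ICF = log(N / N) = 0, so it scores zero no matter how frequent it is. That is why "the" is never selected in the toy data.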

Implementation Points:
1. Perform suitable pre-processing steps for the given dataset.
2. Split your dataset randomly according to the train:test ratio. You need to select the documents randomly
for splitting. Do not split documents in sequential order, for instance by choosing the first 800
documents for the train set and the last 200 for the test set for a train:test ratio of 80:20.
3. Implement the TF-ICF scoring technique for efficient feature selection. Select the top k features for
each class. Subsequently, the effective vocabulary shall be the union of the top k features of each class.

4. For each class, train your Naive Bayes Classifier on the training data.
5. Test your classifier on testing data and report the confusion matrix and overall accuracy.
6. Perform the above steps on 50:50, 70:30, and 80:20 training and testing split ratios.
7. Analyze the performance of the classification algorithm for the feature selection technique across
different train:test ratios.
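
The splitting, training, and testing steps might look roughly like the sketch below. The assignment does not say which Naive Bayes variant to use, so this assumes a multinomial model with add-one smoothing. All function names, the fixed seed, and the toy data are hypothetical, and `vocab` stands in for the vocabulary produced by the TF-ICF feature selection.

```python
import math
import random
from collections import Counter


def random_split(labelled_docs, train_fraction, seed=0):
    """Shuffle (tokens, label) pairs and split them at the given train fraction."""
    shuffled = list(labelled_docs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train_nb(train, vocab):
    """Multinomial Naive Bayes with add-one smoothing (assumed), restricted to `vocab`."""
    classes = sorted({label for _, label in train})
    doc_counts = Counter(label for _, label in train)
    term_counts = {c: Counter() for c in classes}
    for tokens, label in train:
        term_counts[label].update(t for t in tokens if t in vocab)
    log_prior = {c: math.log(doc_counts[c] / len(train)) for c in classes}
    log_likelihood = {}
    for c in classes:
        total = sum(term_counts[c].values()) + len(vocab)
        log_likelihood[c] = {t: math.log((term_counts[c][t] + 1) / total) for t in vocab}
    return classes, log_prior, log_likelihood


def predict(tokens, model):
    """Return the class with the highest posterior log-probability."""
    classes, log_prior, log_likelihood = model

    def log_posterior(c):
        # Tokens outside the selected vocabulary are ignored.
        return log_prior[c] + sum(log_likelihood[c][t] for t in tokens if t in log_likelihood[c])

    return max(classes, key=log_posterior)


def confusion_matrix(test, model):
    """Return (matrix, accuracy); matrix[actual][predicted] counts test documents."""
    classes = model[0]
    matrix = {a: {p: 0 for p in classes} for a in classes}
    correct = 0
    for tokens, label in test:
        predicted = predict(tokens, model)
        matrix[label][predicted] += 1
        correct += predicted == label
    return matrix, (correct / len(test) if test else 0.0)


# Toy, made-up data: 10 documents for each of two classes, split 80:20.
data = [(["orbit", "rocket"], "space")] * 10 + [(["doctor", "patient"], "med")] * 10
vocab = {"orbit", "rocket", "doctor", "patient"}
train, test = random_split(data, 0.8)
model = train_nb(train, vocab)
matrix, accuracy = confusion_matrix(test, model)
print(matrix, accuracy)
```

Calling `random_split` with 0.5, 0.7, and 0.8 covers the three required ratios. The confusion matrix assumes every test label also appears in the training set, which holds for large random splits but should be checked on small data.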
