ENGG*6600: Special Topics in Information Retrieval - Fall 2022
Assignment 3: Retrieval Models (Total : 100 points)
Description
This is a coding assignment where you will implement three retrieval models. Basic proficiency
in Python is recommended.
Instructions
To start working on the assignment, you first need to save the notebook to your
local Google Drive. For this purpose, you can click on the Copy to Drive button. You can
alternatively click the Share button located at the top right corner and click on Copy Link
under Get Link to get a link and copy this notebook to your Google Drive.
For questions with descriptive answers, please replace the text in the cell which states
"Enter your answer here!" with your answer. If you are using mathematical notation in your
answers, please define the variables.
You should implement all the functions yourself and should not use a library or tool for the
computation.
For coding questions, you can add code where it says "enter code here" and execute the
cell to print the output.
To create the final PDF submission file, execute Runtime->Run All from the menu to re-
execute all the cells and then generate a PDF using File->Print->Save as PDF. Make sure
that the generated PDF contains all the code and printed outputs before submission. To
create the final Python submission file, click on File->Download .py.
Submission Details
Due date: Nov. 03, 2022 at 11:59 PM (EST).
The final PDF and Python file must be uploaded on CourseLink.
After copying this notebook to your Google Drive, please paste a link to it below. Use the
same process given above to generate a link. You will not receive any credit if you don't
paste the link! Make sure we can access the file.
*LINK: https://colab.research.google.com/drive/biUAN6FHIE2_Pf0hrKcZMEAL-Xxg3J3C *
Academic Honesty
Please follow the guidelines under the Collaboration and Help section in the first lecture.
> Download input files and code
Please execute the cell below to download the input files.
import os
import zipfile

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate the Colab user and build a PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Download the assignment archive from Google Drive and extract it
download = drive.CreateFile({'id': 'LobnYvxGG8-x02552U8aVYFayBcSmFQsw'})
download.GetContentFile('HW03.zip')
with zipfile.ZipFile('HW03.zip', 'r') as zip_file:
    zip_file.extractall('./')
os.remove('HW03.zip')

# We will use HW03 as our working directory
os.chdir('HW03')
Setting the input files
queries_file = "queries_tok_clean_kstem"
col = "antique-collection.tok.clean_kstem"
qrel_file = "test.qrel"
> 1: Initial Data Setup (10 points)
We use files from the ANTIQUE [https://arxiv.org/pdf/1905.08957.pdf] dataset for this
assignment. As described in the previous assignments, this is a passage retrieval dataset.
The description of the input files provided for this assignment is given below.
Query File
We randomly sampled a set of 15 queries from the test set of the ANTIQUE dataset. Each row of
the input file contains the following information:
queryid query_text
The id and text information is tab separated. queryid is a unique identifier for a query and the
query text has been pre-processed to remove punctuation, tokenised and stemmed using the Krovetz
stemmer.
Query Relevance (qrel) file
The qrel file contains the relevance judgements (ground truth) for the query passage
combinations. Each row of the file contains the following information.
queryid topicid passageid relevance_judgement
Please note that the entries are space separated. The second column (topicid) can be ignored.
Given below are a couple of rows of a sample qrel file.
2146313 Q0 2146313_0 4
2146313 Q0 2146313_23 2
The relevance judgements range from 1 to 4. The description of the labels is given below.
Label 1: Non-Relevant
Label 2: Slightly Relevant
Label 3: Relevant
Label 4: Highly Relevant
Note that for metrics with binary relevance assumptions, Labels 1 and 2 are considered non-
relevant and Labels 3 and 4 are considered relevant.
Note: if a query-document pair is not listed in the qrels file, we assume that the document is not
relevant to the query.
Collection file
Each row of the file consists of the following information:
passage_id passage_text
The id and text information is tab separated. The passage text has been pre-processed to
remove punctuation, tokenised and stemmed using the Krovetz stemmer (same as queries).
The terms in the passage text can be accessed by splitting the text based on space.
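For illustration, a minimal sketch of reading one collection line (the passage id and text below are made up):

# Illustrative only: parsing one line of the collection file.
line = '2020338_0\tthe cat sat on the mat'   # made-up sample row
passage_id, passage_text = line.strip().split('\t')
terms = passage_text.split(' ')              # access terms by splitting on space
print(passage_id, terms)                     # 2020338_0 ['the', 'cat', ...]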
In this section, you have to implement the following:
+ Load the queries from the query file into a datastructure
+ Load the query relevance information into datastructure(s). You can reuse some of the code
written in Assignment 1 for this and make modifications to it as needed.
You can use additional datastructures beyond the suggested ones for your implementation.
This function is used to load query file information into datastructure(s).
Return Variables:
queries - mapping from queryid to querytext
import pandas as pd

def loadQueries(queries_file):
    queries = pd.DataFrame(columns=['q_id', 'q_text'])
    with open(queries_file) as f:
        for line in f:
            # queryid and query text are tab separated
            q_id, q_text = line.strip().split('\t')
            queries = queries.append({'q_id': q_id, 'q_text': q_text},
                                     ignore_index=True)
    return queries
This function is used to load qrel file information into datastructure(s).
The qrel file format is the same as the one provided in Assignment 1 and is given below:
"queryid topicid passageid relevance_judgement"
The entries are space separated.
You can copy your qrel loading code from Assignment 1 and make modifications if necessary.
Return Variables:
num_queries - number of queries in the qrel file
qrels - query relevance information
def loadQrels(qrel_file):
    qrels_list = []
    num_queries = set()
    with open(qrel_file) as f:
        for line in f:
            # queryid topicid passageid relevance_judgement (space separated)
            z = line.split()
            num_queries.add(z[0])
            # the second column (topicid) is ignored
            qrels_list.append([z[0], z[2], z[3]])
    qrels = pd.DataFrame(qrels_list,
                         columns=['queryid', 'passageid', 'relevance_score'])
    return num_queries, qrels
# You can return additional datastructures for your implementation.
queries = loadQueries(queries_file)
num_queries, qrels = loadQrels(qrel_file)
print('Total Num of queries in the query file : {0}'.format(len(queries)))
print('Total Num of queries in the qrel file : {0}'.format(num_queries))
print('Queries in the qrel file : {0}'.format(num_queries))
Total Num of queries in the query file : 15
Total Num of queries in the qrel file : {'4185501', '1844896', '1262692', '3396066
Queries in the qrel file : {'4185501', '1844896', '1262692', '3396066',
In the cell below, an inverted index with count has been created in memory. Please run the cell
and use the variables for implementing the retrieval models.
An inverted index with count information.
class indexCount:
    pcount = 0
    ctf = {}
    sumdl = 0
    avgdl = 0
    doclen = {}
    index = {}
    probctf = {}

    def __init__(self, col):
        self.col = col

    def create_index(self):
        for line in open(self.col):
            # passage id and passage text are tab separated
            pid, ptext = line.strip().split('\t')
            self.pcount += 1
            if pid not in self.doclen:
                self.doclen[pid] = 0
            pfreq = {}
            for term in ptext.split(' '):
                self.sumdl += 1
                if term not in self.ctf:
                    self.ctf[term] = 0
                self.ctf[term] += 1
                self.doclen[pid] += 1
                if term not in pfreq:
                    pfreq[term] = 0
                pfreq[term] += 1
            # append "passage_id:term frequency" to each term's posting list
            for k, v in pfreq.items():
                if k not in self.index:
                    self.index[k] = []
                self.index[k].append(pid + ':' + str(v))
        for k, v in self.ctf.items():
            self.probctf[k] = v / float(self.sumdl)
        self.avgdl = self.sumdl / float(self.pcount)

buildIndex = indexCount(col)
buildIndex.create_index()
# Inverted index with count: dict with term as key and posting list as value.
# Posting list is a list with each element in the format "passage_id:term frequency"
# Example - {'the': ['2020338_0:11', '3174498_1:4']}
index = buildIndex.index

# Total number of passages in the collection
num_passages = buildIndex.pcount

# Average passage length
avgdl = buildIndex.avgdl

# Collection Term Frequency : dict with term as key and the term frequency in collection as value
ctf = buildIndex.ctf

# Probability Term Frequencies : dict with term as key and its probability of occurrence in the collection as value
probctf = buildIndex.probctf

# dict with passageId as key and number of tokens in the passage as value
doclen = buildIndex.doclen

# Total number of tokens in the collection
totNumTerms = buildIndex.sumdl
print('Total number of passages in the collection :{0}'.format(num_passages))
print('Average passage length :{0}'.format(avgdl))
print('Total num of unique terms :{0}'.format(len(ctf)))
print('Total num of terms in the collection :{0}'.format(totNumTerms))
Total number of passages in the collection :403492
Average passage length :41.11619809066846
Total num of unique terms :149467
Total num of terms in the collection :16590057
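Before moving to the retrieval models, here is a short sketch of how df(w) and count(w, p) can be read back out of the index; the two postings below mirror the example format shown above for the term 'the'.

# Sketch: reading df(w) and count(w, p) out of the inverted index.
postings = ['2020338_0:11', '3174498_1:4']   # same format as index['the']
df_w = len(postings)                         # df(w): passages containing the term
for entry in postings:
    pid, tf = entry.split(':')
    print(pid, int(tf))                      # passage id and count(w, p)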
> 2: Vector Space model (VSM model) (30 points)
In the cell below, implement the VSM model given in Slide 19 of 'Basic Retrieval Models Part 1'.
The score function has been given below for reference.

$$\mathrm{score}(q,p) = \sum_{w \in q \cap p} \mathrm{count}(w,q) \cdot \frac{\ln\big(1+\ln(1+\mathrm{count}(w,p))\big)}{1-b+b\frac{|p|}{avgdl}} \cdot \ln\frac{|C|+1}{df(w)}$$
score(q, p) - score assigned to a passage p for a query q
count(w, q) - number of times term w occurs in query q
count(w, p) - number of times term w occurs in passage p
b - set this to 0.75
|p| - number of tokens in passage p
avgdl - average number of tokens in passages in the collection
|C| - number of passages in collection C
df(w) - number of passages containing term w
Please note that we consider each query term once, since this is equivalent to a dot product.
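To make the formula concrete before implementing it, the sketch below scores a single matching term with made-up numbers (count(w, q) = 1, count(w, p) = 3, |p| = 50 and df(w) = 1000 are hypothetical); b, avgdl and |C| are the values used in this assignment.

import numpy as np

# Hypothetical values for one query term w and one passage p
count_wq, count_wp = 1, 3           # term count in query / passage (made up)
p_len, df_w = 50, 1000              # passage length and df(w) (made up)
b, avgdl, C = 0.75, 41.116, 403492  # values from this assignment

tf_part = np.log(1 + np.log(1 + count_wp))  # dampened term frequency
norm = 1 - b + b * (p_len / avgdl)          # pivoted length normalization
idf = np.log((C + 1) / df_w)                # idf component
print(count_wq * (tf_part / norm) * idf)    # approx. 4.49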
For each query, you have to return the top 5 retrieved passages ranked based on the score
returned by the VSM model using "term at a time" scoring method.
Rank passages for each query and return top 5 passages.
Return Variables:
final_ranking_vsm : map with query id as key and list of top 5 ranked passages as value
import numpy as np
import operator

def vsm(queries, index, avgdl, num_passages, doclen):
    final_ranking_vsm = {}
    for line in open(queries_file, encoding='utf8'):
        queryid, querytext = line.strip().split('\t')
        # collect the unique query terms
        query_vocabulary = []
        for word in querytext.split():
            if word not in query_vocabulary:
                query_vocabulary.append(word)
        # count(w, q): number of times each term occurs in the query
        query_wc = {}
        for word in query_vocabulary:
            query_wc[word] = querytext.lower().split().count(word)
        final_score = {}
        for w in query_vocabulary:
            m = len(index[w])                 # df(w)
            for i in index[w]:
                pid, pcount = i.split(':')
                # tf upper part: ln(1 + ln(1 + count(w, p)))
                k = np.log(1 + np.log(1 + int(pcount)))
                # tf lower part: 1 - b + b * (|p| / avgdl), b = 0.75
                p = (1 - 0.75) + (0.75 * (int(doclen[pid]) / avgdl))
                # first part
                f = k / p
                # idf part: ln((|C| + 1) / df(w))
                c = (num_passages + 1) / m
                sc = np.log(c)
                # score calculation
                scoreofi = query_wc[w] * f * sc
                # adding into final score (term at a time)
                if pid not in final_score:
                    final_score[pid] = scoreofi
                else:
                    final_score[pid] = final_score[pid] + scoreofi
        # keep the top 5 passages per query
        final_ranking_vsm[queryid] = sorted(final_score.items(),
                                            key=operator.itemgetter(1),
                                            reverse=True)[:5]
    return final_ranking_vsm

final_ranking_vsm = vsm(queries, index, avgdl, num_passages, doclen)
# Hint: The score would be in the interval: [13,14]
print('The top retrieved passage and score for query id "3698636" using VSM is : {0}'.format(final_ranking_vsm['3698636'][:1]))

The top retrieved passage and score for query id "3698636" using VSM is : [('754739_
> 3: BM25 (30 points)
In the cell below, implement the BM25 model given in Slide 31 of 'Basic Retrieval Models Part 3'.
$$\mathrm{score}(q,p) = \sum_{w \in q} \frac{(k_1+1)\,\mathrm{count}(w,p)}{k_1\big(1-b+b\frac{|p|}{avgdl}\big)+\mathrm{count}(w,p)} \cdot \ln\frac{|C|-df(w)+0.5}{df(w)+0.5}$$
score(q, p) - score assigned to a passage p for a query q
count(w, p) - number of times term w occurs in passage p
b - set this to 0.75
|p| - number of tokens in passage p
avgdl - average number of tokens in passages in the collection
|C| - number of passages in collection C
df(w) - number of passages containing term w
k_1 - set to 1.2
Please note that we iterate over all query tokens including repetitions.
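Analogously to the VSM check above, this sketch evaluates one term's BM25 contribution with the same made-up values (count(w, p) = 3, |p| = 50 and df(w) = 1000 are hypothetical); k_1, b, avgdl and |C| come from this assignment.

import numpy as np

# Hypothetical values for one query term w and one passage p
count_wp = 3                        # term count in passage (made up)
p_len, df_w = 50, 1000              # passage length and df(w) (made up)
k1, b, avgdl, C = 1.2, 0.75, 41.116, 403492  # values from this assignment

tf_part = ((k1 + 1) * count_wp) / (k1 * (1 - b + b * p_len / avgdl) + count_wp)
idf = np.log((C - df_w + 0.5) / (df_w + 0.5))
print(tf_part * idf)                # approx. 9.0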
Similar to the previous model, return the top 5 retrieved passages for each query ranked based
on the BM25 scoring using "term at a time" scoring method.
Rank passages for each query and return top 5 passages.
Return Variables:
final_ranking_bm25 : map with query id as key and list of top 5 ranked passages as value
def bm25(queries, index, avgdl, num_passages, doclen):
    final_ranking_bm25 = {}
    for line in open(queries_file, encoding='utf8'):
        qid, qtext = line.strip().split('\t')
        # collect the unique query terms
        query_vocabulary = []
        for word in qtext.split():
            if word not in query_vocabulary:
                query_vocabulary.append(word)
        # count(w, q): weighting each unique term by its query count is
        # equivalent to iterating over all query tokens including repetitions
        query_wc = {}
        for word in query_vocabulary:
            query_wc[word] = qtext.lower().split().count(word)
        final_score = {}
        for w in query_vocabulary:
            m = len(index[w])                 # df(w)
            for i in index[w]:
                pid, pcount = i.split(':')
                # first part: (k1 + 1) * count(w, p) / (k1 * (1 - b + b*|p|/avgdl) + count(w, p))
                fu = (1.2 + 1) * int(pcount)
                u = int(doclen[pid]) / avgdl
                u1 = 1 - 0.75 + (0.75 * u)
                fl = (1.2 * u1) + int(pcount)
                f = fu / fl
                # second part: ln((|C| - df(w) + 0.5) / (df(w) + 0.5))
                lu = num_passages - m + 0.5
                ll = m + 0.5
                l = np.log(lu / ll)
                # score calculation
                scoreofi = query_wc[w] * f * l
                # adding into final score (term at a time)
                if pid not in final_score:
                    final_score[pid] = scoreofi
                else:
                    final_score[pid] = final_score[pid] + scoreofi
        # keep the top 5 passages per query
        final_ranking_bm25[qid] = sorted(final_score.items(),
                                         key=operator.itemgetter(1),
                                         reverse=True)[:5]
    return final_ranking_bm25
final_ranking_bm25 = bm25(queries, index, avgdl, num_passages, doclen)
# Hint: The score would be in the interval: [18,19]
print('The top retrieved passage and score for query id "3698636" using BM25 is : {0}'.format(final_ranking_bm25['3698636'][:1]))

The top retrieved passage and score for query id "3698636" using BM25 is : [('3698636
> 4: Evaluation (30 points)
In the cell below, evaluate the top 5 retrieved passages corresponding to each of the models using
Precision@5 and Recall@5 metrics. You can use the code from Assignment 1, modified as
needed.
# return precision of top 5 retrieved passages
def calcPrecision(top, qrels, rank_in):
    # enter your code here
    return 0

# return recall of top 5 retrieved passages
def calcRecall(top, qrels, rank_in):
    # enter your code here
    return 0
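As a reference point, here is a minimal sketch of one way these functions could be filled in, assuming qrels is the DataFrame returned by loadQrels and rank_in maps each query id to its ranked (passageid, score) pairs; the binary relevance mapping and the treatment of unjudged pairs as non-relevant follow the notes in Section 1, and the function names are illustrative.

# Sketch only: possible Precision@k and Recall@k implementations under the
# binary relevance assumption (labels 3 and 4 are relevant).
def relevantSet(qrels, qid):
    judged = qrels[qrels['queryid'] == qid]
    return set(judged[judged['relevance_score'].astype(int) >= 3]['passageid'])

def calcPrecisionSketch(top, qrels, rank_in):
    vals = []
    for qid, ranking in rank_in.items():
        relevant = relevantSet(qrels, qid)
        hits = sum(1 for pid, _ in ranking[:top] if pid in relevant)
        vals.append(hits / top)
    return sum(vals) / len(vals)      # macro-average over queries

def calcRecallSketch(top, qrels, rank_in):
    vals = []
    for qid, ranking in rank_in.items():
        relevant = relevantSet(qrels, qid)
        hits = sum(1 for pid, _ in ranking[:top] if pid in relevant)
        vals.append(hits / len(relevant) if relevant else 0.0)
    return sum(vals) / len(vals)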
# Hint: Precision value interval [0.1,0.2], Recall value interval [0.04,0.05]
print("Evaluate VSM model")
print('Precision at top 5 : {0}'.format(calcPrecision(5, qrels, final_ranking_vsm)))
print('Recall at top 5 : {0}'.format(calcRecall(5, qrels, final_ranking_vsm)))
print('#' * 60)

# Hint: Precision value interval [0.3,0.4], Recall value interval [0.10,0.20]
print("Evaluate BM25 model")
print('Precision at top 5 : {0}'.format(calcPrecision(5, qrels, final_ranking_bm25)))
print('Recall at top 5 : {0}'.format(calcRecall(5, qrels, final_ranking_bm25)))
print('#' * 60)
Evaluate VSM model
Precision at top 5 : 0
Recall at top 5 : 0
Evaluate BM25 model
Precision at top 5 : 0
Recall at top 5 : 0