IR - Assignment 3
Uploaded by Khushali Mehta
ENGG*6600: Special Topics in Information Retrieval - Fall 2022

Assignment 3: Retrieval Models (Total: 100 points)

Description

This is a coding assignment where you will implement three retrieval models. Basic proficiency in Python is recommended.

Instructions

To start working on the assignment, you would first need to save the notebook to your local Google Drive. For this purpose, you can click on the "Copy to Drive" button. You can alternatively click the "Share" button located at the top right corner and click on "Copy Link" under "Get Link" to get a link and copy this notebook to your Google Drive.

For questions with descriptive answers, please replace the text in the cell which states "Enter your answer here!" with your answer. If you are using mathematical notation in your answers, please define the variables.

You should implement all the functions yourself and should not use a library or tool for the computation.

For coding questions, you can add code where it says "enter code here" and execute the cell to print the output.

To create the final PDF submission file, execute Runtime->Run All from the menu to re-execute all the cells and then generate a PDF using File->Print->Save as PDF. Make sure that the generated PDF contains all the code and printed outputs before submission. To create the final Python submission file, click on File->Download .py.

Submission Details

Due date: Nov. 03, 2022 at 11:59 PM (EST). The final PDF and Python file must be uploaded on CourseLink.

After copying this notebook to your Google Drive, please paste a link to it below. Use the same process given above to generate a link. You will not receive any credit if you don't paste the link! Make sure we can access the file.

LINK: https://colab.research.google.com/drive/biUAN6FHIE2_Pf0hrKcZMEAL-Xxg3J3C

Academic Honesty

Please follow the guidelines under the Collaboration and Help section in the first lecture.
Download input files and code

Please execute the cell below to download the input files.

```python
import os
import zipfile

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

download = drive.CreateFile({'id': '1obnYVxGG8-xo2552U8aVYFayBcSmFQsw'})
download.GetContentFile('[email protected]')

with zipfile.ZipFile('[email protected]', 'r') as zip_file:
    zip_file.extractall('./')
os.remove('[email protected]')

# We will use HW03 as our working directory
os.chdir('HW03')
```

Setting the input files

```python
queries_file = "queries_tok_clean_kstem"
col = "antique-collection.tok.clean_kstem"
qrel_file = "test.qrel"
```

1: Initial Data Setup (10 points)

We use files from the ANTIQUE [https://arxiv.org/pdf/1905.08957.pdf] dataset for this assignment. As described in the previous assignments, this is a passage retrieval dataset. The description of the input files provided for this assignment is given below.

Query File

We randomly sampled a set of 15 queries from the test set of the ANTIQUE dataset. Each row of the input file contains the following information:

queryid query_text

The id and text information is tab separated. queryid is a unique identifier for a query, and query_text has been pre-processed to remove punctuation, tokenised and stemmed using the Krovetz stemmer.
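As a quick illustration of the tab-separated query format described above (the query id and text below are made up; only the format matters), a row can be split into its id and its space-separated terms:

```python
# A hypothetical row of the query file: id and text are tab separated,
# and the text is already tokenised and Krovetz-stemmed.
line = "1234567\twhy do cat purr\n"

queryid, query_text = line.strip().split('\t')
terms = query_text.split(' ')

print(queryid)  # -> 1234567
print(terms)    # -> ['why', 'do', 'cat', 'purr']
```

Splitting first on the tab and only then on spaces matters, because the query text itself contains spaces.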
Query Relevance (qrel) file

The qrel file contains the relevance judgements (ground truth) for the query-passage combinations. Each row of the file contains the following information:

queryid topicid passageid relevance_judgement

Please note that the entries are space separated. The second column (topicid) can be ignored. Given below are a couple of rows of a sample qrel file:

2146313 Q0 2146313_0 4
2146313 Q0 2146313_23 2

The relevance judgements range from values 1-4. The description of the labels is given below:

Label 1: Non-Relevant
Label 2: Slightly Relevant
Label 3: Relevant
Label 4: Highly Relevant

Note that for metrics with binary relevance assumptions, Labels 1 and 2 are considered non-relevant and Labels 3 and 4 are considered relevant.

Note: if a query-document pair is not listed in the qrels file, we assume that the document is not relevant to the query.

Collection file

Each row of the file consists of the following information:

passage_id passage_text

The id and text information is tab separated. The passage text has been pre-processed to remove punctuation, tokenised and stemmed using the Krovetz stemmer (same as the queries). The terms in the passage text can be accessed by splitting the text on spaces.

In this section, you have to implement the following:

- Load the queries from the query file into a data structure
- Load the query relevance information into a data structure

You can reuse some of the code written in Assignment 1 for this and make modifications to it as needed. You can use additional data structures beyond the suggested ones for your implementation.

This function is used to load query file information into data structure(s).

Return Variables:
queries - mapping from queryid to querytext

```python
import pandas as pd

def loadQueries(queries_file):
    queries = pd.DataFrame(columns=['q_id', 'q_text'])
    with open(queries_file) as f:
        for line in f:
            # id and text are tab separated; the text itself contains spaces
            z = line.strip().split('\t')
            queries = queries.append({'q_id': z[0], 'q_text': z[1]},
                                     ignore_index=True)
    return queries
```

This function is used to load qrel file information into data structure(s). The qrel file format is the same as the one provided in Assignment 1 and is given below:

"queryid topicid passageid relevance_judgement"

The entries are space separated. You can copy your qrel loading code from Assignment 1 and make modifications if necessary.

Return Variables:
num_queries - number of queries in the qrel file
qrels - query relevance information

```python
def loadQrels(qrel_file):
    num_queries = set()
    rows = []
    with open(qrel_file) as f:
        for line in f:
            z = line.strip().split()
            num_queries.add(z[0])
            # keep query id, passage id, and the graded label; topicid is ignored
            rows.append({'queryid': z[0], 'passageid': z[2],
                         'relevance_score': z[3]})
    qrels = pd.DataFrame(rows, columns=['queryid', 'passageid', 'relevance_score'])
    # You can return additional data structures for your implementation.
    return num_queries, qrels
```
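The binary-relevance convention described earlier (labels 3-4 relevant; labels 1-2 and unjudged pairs non-relevant) can be sketched as a small helper. The nested-dict layout and the function name here are illustrative, not part of the required assignment code:

```python
# Hypothetical layout: {queryid: {passageid: graded_label}}
def is_relevant(qrels_dict, queryid, passageid):
    # Labels 3 and 4 count as relevant; labels 1-2 and unjudged pairs do not.
    return qrels_dict.get(queryid, {}).get(passageid, 0) >= 3

sample = {'2146313': {'2146313_0': 4, '2146313_23': 2}}
print(is_relevant(sample, '2146313', '2146313_0'))   # True (highly relevant)
print(is_relevant(sample, '2146313', '2146313_23'))  # False (slightly relevant)
print(is_relevant(sample, '2146313', '9999999_0'))   # False (unjudged)
```

Defaulting unjudged pairs to label 0 implements the note above that unlisted query-document pairs are assumed non-relevant.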
```python
queries = loadQueries(queries_file)
num_queries, qrels = loadQrels(qrel_file)

print('Total Num of queries in the query file : {0}'.format(len(queries)))
print('Total Num of queries in the qrel file : {0}'.format(len(num_queries)))
print('Queries in the qrel file : {0}'.format(num_queries))
```

Total Num of queries in the query file : 15
Total Num of queries in the qrel file : 15
Queries in the qrel file : {'4185501', '1844896', '1262692', '3396066', ...}

In the cell below, an inverted index with count information has been created in memory. Please run the cell and use the variables for implementing the retrieval models.

```python
# An inverted index with count information.
class indexCount:
    pcount = 0    # number of passages in the collection
    ctf = {}      # collection term frequency per term
    sumdl = 0     # total number of tokens in the collection
    avgdl = 0     # average passage length
    doclen = {}   # passage id -> number of tokens in the passage
    index = {}    # term -> posting list ["passage_id:term_frequency", ...]
    probctf = {}  # term -> probability of the term over the collection

    def __init__(self, col):
        self.col = col

    def create_index(self):
        for line in open(self.col):
            pid, ptext = line.strip().split('\t')
            self.pcount += 1
            if pid not in self.doclen:
                self.doclen[pid] = 0
            pfreq = {}
            for term in ptext.split(' '):
                self.sumdl += 1
                if term not in self.ctf:
                    self.ctf[term] = 0
                self.ctf[term] += 1
                self.doclen[pid] += 1
                if term not in pfreq:
                    pfreq[term] = 0
                pfreq[term] += 1
            for k, v in pfreq.items():
                if k not in self.index:
                    self.index[k] = []
                self.index[k].append(pid + ':' + str(v))
        for k, v in self.ctf.items():
            self.probctf[k] = v / float(self.sumdl)
        self.avgdl = self.sumdl / float(self.pcount)

buildIndex = indexCount(col)
buildIndex.create_index()
```

```python
# inverted index with count: dict with term as key and posting list as value.
# posting list is a list with each element in the format "passage_id:term_frequency"
# Example - {'the': ['2020338_0:11', '3174498_1:4']}
index = buildIndex.index

# Total number of passages in the collection
num_passages = buildIndex.pcount

# Average passage length
avgdl = buildIndex.avgdl

# Collection Term Frequency : dict with term as key and the term frequency in the collection as value
ctf = buildIndex.ctf

# Probability Term Frequencies : dict with term as key and its probability over the collection as value
probctf = buildIndex.probctf

# dict with passageId as key and number of tokens in the passage as value
doclen = buildIndex.doclen

# Total number of tokens in the collection
totNumTerms = buildIndex.sumdl

print('Total number of passages in the collection :{0}'.format(num_passages))
print('Average passage length :{0}'.format(avgdl))
print('Total num of unique terms :{0}'.format(len(ctf)))
print('Total num of terms in the collection :{0}'.format(totNumTerms))
```

Total number of passages in the collection :403492
Average passage length :41.11619809066846
Total num of unique terms :149467
Total num of terms in the collection :16590057

2: Vector Space model (VSM model) (30 points)
In the cell below, implement the VSM model given in Slide 19 of 'Basic Retrieval Models Part 1'. The score function has been given below for reference.

score(q, p) = sum over w in (q ∩ p) of:
    count(w, q) * [ ln(1 + ln(1 + count(w, p))) / (1 - b + b * |p| / avgdl) ] * ln((|C| + 1) / df(w))

score(q, p) - score assigned to a passage p for a query q
count(w, q) - number of times term w occurs in query q
count(w, p) - number of times term w occurs in passage p
b - set this to 0.75
|p| - number of tokens in passage p
avgdl - average number of tokens in passages in the collection
|C| - number of passages in collection C
df(w) - number of passages containing term w

Please note that we consider each query term once, since this is equivalent to a dot product.

For each query, you have to return the top 5 retrieved passages ranked based on the score returned by the VSM model using the "term at a time" scoring method.

Rank passages for each query and return the top 5 passages.

Return Variables:
final_ranking_vsm : map with query id as key and list of top 5 ranked passages as value

```python
import numpy as np
import operator

def vsm(queries, index, avgdl, num_passages, doclen):
    final_ranking_vsm = {}
    for line in open('queries_tok_clean_kstem', encoding='utf8'):
        queryid, querytext = line.strip().split('\t')

        # unique query terms
        query_vocabulary = []
        for word in querytext.split():
            if word not in query_vocabulary:
                query_vocabulary.append(word)

        # count(w, q) for each unique query term
        query_wc = {}
        for word in query_vocabulary:
            query_wc[word] = querytext.lower().split().count(word)

        final_score = {}
        for w in query_vocabulary:
            if w not in index:
                continue
            m = len(index[w])  # df(w)
            for i in index[w]:
                pid, pcount = i.split(':')
                # tf part (numerator)
                k = np.log(1 + np.log(1 + int(pcount)))
                # length normalisation (denominator), b = 0.75
                p = (1 - 0.75) + (0.75 * (int(doclen[pid]) / avgdl))
                f = k / p
                # idf part
                sc = np.log((num_passages + 1) / m)
                # score contribution of this posting
                scoreofi = query_wc[w] * f * sc
                if pid not in final_score:
                    final_score[pid] = scoreofi
                else:
                    final_score[pid] += scoreofi
        final_ranking_vsm[queryid] = sorted(final_score.items(),
                                            key=operator.itemgetter(1),
                                            reverse=True)[:5]
    return final_ranking_vsm
```

```python
final_ranking_vsm = vsm(queries, index, avgdl, num_passages, doclen)

# Hint: The score would be in the interval: [13,14]
print('The top retrieved passage and score for query id "3698636" using VSM is : {0}'.format(final_ranking_vsm['3698636'][0]))
```

The top retrieved passage and score for query id "3698636" using VSM is : ('754739_...

3: BM25 (30 points)

In the cell below, implement the BM25 model given in Slide 31 of 'Basic Retrieval Models Part 3'.

score(q, p) = sum over w in q of:
    [ (k1 + 1) * count(w, p) / (k1 * (1 - b + b * |p| / avgdl) + count(w, p)) ] * ln((|C| - df(w) + 0.5) / (df(w) + 0.5))

score(q, p) - score assigned to a passage p for a query q
count(w, p) - number of times term w occurs in passage p
b - set this to 0.75
|p| - number of tokens in passage p
avgdl - average number of tokens in passages in the collection
|C| - number of passages in collection C
df(w) - number of passages containing term w
k1 - set to 1.2

Please note that we iterate over all query tokens including repetitions.

Similar to the previous model, return the top 5 retrieved passages for each query ranked based on the BM25 scoring using the "term at a time" scoring method.

Rank passages for each query and return the top 5 passages.

Return Variables:
final_ranking_bm25 : map with query id as key and list of top 5 ranked passages as value

```python
def bm25(queries, index, avgdl, num_passages, doclen):
    final_ranking_bm25 = {}
    for line in open('queries_tok_clean_kstem', encoding='utf8'):
        qid, qtext = line.strip().split('\t')

        final_score = {}
        # iterate over all query tokens, including repetitions
        for w in qtext.split():
            if w not in index:
                continue
            m = len(index[w])  # df(w)
            for i in index[w]:
                pid, pcount = i.split(':')
                # tf part, k1 = 1.2, b = 0.75
                fu = (1.2 + 1) * int(pcount)
                u = int(doclen[pid]) / avgdl
                u1 = (1 - 0.75) + (0.75 * u)
                fl = (1.2 * u1) + int(pcount)
                f = fu / fl
                # idf part
                lu = num_passages - m + 0.5
                ll = m + 0.5
                l = np.log(lu / ll)
                # score contribution of this posting
                scoreofi = f * l
                if pid not in final_score:
                    final_score[pid] = scoreofi
                else:
                    final_score[pid] += scoreofi
        final_ranking_bm25[qid] = sorted(final_score.items(),
                                         key=operator.itemgetter(1),
                                         reverse=True)[:5]
    return final_ranking_bm25

final_ranking_bm25 = bm25(queries, index, avgdl, num_passages, doclen)
```
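As a sanity check on the BM25 formula above, the contribution of a single (term, passage) pair can be computed directly. All the numbers below are made up for illustration and are not taken from the ANTIQUE index:

```python
import math

k1, b = 1.2, 0.75      # parameters fixed by the assignment
count_wp = 3           # count(w, p), hypothetical
p_len = 50             # |p|, hypothetical
avgdl = 41.0           # average passage length, hypothetical
num_passages = 400000  # |C|, hypothetical
df_w = 1000            # df(w), hypothetical

# tf part: saturates as count(w, p) grows, bounded above by k1 + 1
tf_part = ((k1 + 1) * count_wp) / (k1 * (1 - b + b * p_len / avgdl) + count_wp)
# idf part: large for rare terms, small (or negative) for very common terms
idf_part = math.log((num_passages - df_w + 0.5) / (df_w + 0.5))
print(tf_part * idf_part)
```

Note that however often the term repeats in the passage, the tf part can never exceed k1 + 1 = 2.2; this saturation is what distinguishes BM25 from raw term-frequency weighting.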
```python
# Hint: The score would be in the interval: [18,19]
print('The top retrieved passage and score for query id "3698636" using BM25 is : {0}'.format(final_ranking_bm25['3698636'][0]))
```

The top retrieved passage and score for query id "3698636" using BM25 is : ('369863...

4: Evaluation (30 points)

In the cell below, evaluate the top 5 retrieved passages corresponding to each of the models using Precision@5 and Recall@5 metrics. You can use the code from Assignment 1 modified as needed.

```python
# return precision of top 5 retrieved passages
def calcPrecision(top, qrels, rank_in):
    # enter your code here
    return 0

# return recall of top 5 retrieved passages
def calcRecall(top, qrels, rank_in):
    # enter your code here
    return 0

# Hint: Precision value interval [0.1,0.2], Recall value interval [0.04,0.05]
print("Evaluate VSM model")
print('Precision at top 5 : {0}'.format(calcPrecision(5, qrels, final_ranking_vsm)))
print('Recall at top 5 : {0}'.format(calcRecall(5, qrels, final_ranking_vsm)))
print('#' * 60)

# Hint: Precision value interval [0.3,0.4], Recall value interval [0.10,0.20]
print("Evaluate BM25 model")
print('Precision at top 5 : {0}'.format(calcPrecision(5, qrels, final_ranking_bm25)))
print('Recall at top 5 : {0}'.format(calcRecall(5, qrels, final_ranking_bm25)))
print('#' * 60)
```

Evaluate VSM model
Precision at top 5 : 0
Recall at top 5 : 0
Evaluate BM25 model
Precision at top 5 : 0
Recall at top 5 : 0
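For reference, one possible shape for the two metrics, macro-averaged over queries, assuming the nested `{queryid: {passageid: label}}` qrel layout and the binary-relevance convention from Section 1 (labels 3-4 relevant). This is an illustrative sketch with made-up sample data, not the required implementation:

```python
def precision_at_k(k, qrels_dict, rankings):
    # rankings: {queryid: [(passageid, score), ...]} sorted by score descending
    vals = []
    for qid, ranked in rankings.items():
        rel = sum(1 for pid, _ in ranked[:k]
                  if qrels_dict.get(qid, {}).get(pid, 0) >= 3)
        vals.append(rel / k)
    return sum(vals) / len(vals)

def recall_at_k(k, qrels_dict, rankings):
    vals = []
    for qid, ranked in rankings.items():
        # total number of relevant passages judged for this query
        total_rel = sum(1 for lbl in qrels_dict.get(qid, {}).values() if lbl >= 3)
        if total_rel == 0:
            continue
        rel = sum(1 for pid, _ in ranked[:k]
                  if qrels_dict.get(qid, {}).get(pid, 0) >= 3)
        vals.append(rel / total_rel)
    return sum(vals) / len(vals)

# Hypothetical query with 2 relevant passages; 1 of them retrieved in the top 5.
sample_qrels = {'q1': {'p1': 4, 'p2': 1, 'p3': 3}}
sample_rank = {'q1': [('p1', 9.0), ('p2', 5.0), ('p4', 4.0), ('p5', 3.0), ('p6', 2.0)]}
print(precision_at_k(5, sample_qrels, sample_rank))  # -> 0.2 (1 relevant of 5)
print(recall_at_k(5, sample_qrels, sample_rank))     # -> 0.5 (1 of 2 relevant)
```

Precision divides by the cutoff k, while recall divides by the number of judged-relevant passages for the query; unjudged passages in the ranking (p4, p5, p6 above) count as non-relevant in both.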
