ENGG*6600: Special Topics in Information Retrieval - Fall 2022
Assignment 3: Retrieval Models (Total : 100 points)
Description
This is a coding assignment where you will implement three retrieval models. Basic proficiency
in Python is recommended.
Instructions
To start working on the assignment, you first need to save the notebook to your
local Google Drive. For this purpose, you can click on the Copy to Drive button. You can
alternatively click the Share button located at the top right corner and click on Copy Link
under Get Link to get a link and copy this notebook to your Google Drive.
For questions with descriptive answers, please replace the text in the cell which states
"Enter your answer here!" with your answer. If you are using mathematical notation in your
answers, please define the variables.
You should implement all the functions yourself and should not use a library or tool for the
computation.
For coding questions, you can add code where it says "enter code here" and execute the
cell to print the output.
To create the final PDF submission file, execute Runtime->Run All from the menu to re-
execute all the cells and then generate a PDF using File->Print->Save as PDF. Make sure
that the generated PDF contains all the code and printed outputs before submission. To
create the final Python submission file, click on File->Download .py.
Submission Details
Due date: Nov. 03, 2022 at 11:59 PM (EST).
The final PDF and Python file must be uploaded on CourseLink.
After copying this notebook to your Google Drive, please paste a link to it below. Use the
same process given above to generate a link. You will not receive any credit if you don't
paste the link! Make sure we can access the file.
*LINK: https://colab.research.google.com/drive/biUAN6FHIE2_Pf0hrKcZMEAL-Xxg3J3C *
Academic Honesty
Please follow the guidelines under the Collaboration and Help section in the first lecture.
> Download input files and code
Please execute the cell below to download the input files.
import os
import zipfile

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate the Colab user and build a PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Download the assignment archive from Google Drive and extract it
download = drive.CreateFile({'id': 'LobnYvxGG8-x02552U8aVYFayBcSmFQsw'})
download.GetContentFile('HW03.zip')
with zipfile.ZipFile('HW03.zip', 'r') as zip_file:
    zip_file.extractall('./')
os.remove('HW03.zip')

# We will use HW03 as our working directory
os.chdir('HW03')
Setting the input files
queries_file = "queries_tok_clean_kstem"
col = "antique-collection.tok.clean_kstem"
qrel_file = "test.qrel"
> 1: Initial Data Setup (10 points)
We use files from the ANTIQUE [https://arxiv.org/pdf/1905.08957.pdf] dataset for this
assignment. As described in the previous assignments, this is a passage retrieval dataset.
The description of the input files provided for this assignment is given below.
Query File
We randomly sampled a set of 15 queries from the test set of the ANTIQUE dataset. Each row of
the input file contains the following information:
queryid query_text
The id and text information is tab separated. queryid is a unique identifier for a query and the
query text has been pre-processed to remove punctuation, tokenised and stemmed using the Krovetz
stemmer.
Query Relevance (qrel) file
The qrel file contains the relevance judgements (ground truth) for the query passage
combinations. Each row of the file contains the following information.
queryid topicid passageid relevance_judgement
Please note that the entries are space separated. The second column (topicid) can be ignored.
Given below are a couple of rows of a sample qrel file.
2146313 Q0 2146313_0 4
2146313 Q0 2146313_23 2
The relevance judgements range from 1 to 4. The description of the labels is given below.
Label 1: Non-Relevant
Label 2: Slightly Relevant
Label 3: Relevant
Label 4: Highly Relevant
Note that for metrics with binary relevance assumptions, Labels 1 and 2 are considered non-
relevant and Labels 3 and 4 are considered relevant.
Note: if a query-document pair is not listed in the qrels file, we assume that the document is not
relevant to the query.
Collection file
Each row of the file consists of the following information:
passage_id passage_text
The id and text information is tab separated. The passage text has been pre-processed to
remove punctuation, tokenised and stemmed using the Krovetz stemmer (same as queries).
The terms in the passage text can be accessed by splitting the text based on space.
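For illustration, a minimal sketch of reading one collection line (the passage id and text below are made up):

# Illustrative only: parsing one line of the collection file.
line = '2020338_0\tthe cat sat on the mat'   # made-up sample row
passage_id, passage_text = line.strip().split('\t')
terms = passage_text.split(' ')              # access terms by splitting on space
print(passage_id, terms)                     # 2020338_0 ['the', 'cat', ...]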
In this section, you have to implement the following:
+ Load the queries from the query file into a datastructure
+ Load the query relevance information into datastructure(s). You can reuse some of the code
written in Assignment 1 for this and make modifications to it as needed.
You can use additional datastructures beyond the suggested ones for your implementation.
This function is used to load query file information into datastructure(s).
Return Variables:
queries - mapping from queryid to querytext
import pandas as pd

def loadQueries(queries_file):
    queries = pd.DataFrame(columns=['q_id', 'q_text'])
    with open(queries_file) as f:
        for line in f:
            # queryid and query text are tab separated
            q_id, q_text = line.strip().split('\t')
            queries = queries.append({'q_id': q_id, 'q_text': q_text},
                                     ignore_index=True)
    return queries
This function is used to load qrel file information into datastructure(s).
The qrel file format is the same as the one provided in Assignment 1 and is given below:
"queryid topicid passageid relevance_judgement"
The entries are space separated.
You can copy your qrel loading code from Assignment 1 and make modifications if necessary.
Return Variables:
num_queries - number of queries in the qrel file
qrels - query relevance information
def loadQrels(qrel_file):
    qrels_list = []
    num_queries = set()
    with open(qrel_file) as f:
        for line in f:
            # queryid topicid passageid relevance_judgement (space separated)
            z = line.split()
            num_queries.add(z[0])
            # the second column (topicid) is ignored
            qrels_list.append([z[0], z[2], z[3]])
    qrels = pd.DataFrame(qrels_list,
                         columns=['queryid', 'passageid', 'relevance_score'])
    return num_queries, qrels
# You can return additional datastructures for your implementation.
queries = loadQueries(queries_file)
num_queries, qrels = loadQrels(qrel_file)
print('Total Num of queries in the query file : {0}'.format(len(queries)))
print('Total Num of queries in the qrel file : {0}'.format(num_queries))
print('Queries in the qrel file : {0}'.format(num_queries))
Total Num of queries in the query file : 15
Total Num of queries in the qrel file : {'4185501', '1844896', '1262692', '3396066
Queries in the qrel file : {'4185501', '1844896', '1262692', '3396066',
In the cell below, an inverted index with count has been created in memory. Please run the cell
and use the variables for implementing the retrieval models.
An inverted index with count information.
class indexCount:
    pcount = 0
    ctf = {}
    sumdl = 0
    avgdl = 0
    doclen = {}
    index = {}
    probctf = {}

    def __init__(self, col):
        self.col = col

    def create_index(self):
        for line in open(self.col):
            # passage id and passage text are tab separated
            pid, ptext = line.strip().split('\t')
            self.pcount += 1
            if pid not in self.doclen:
                self.doclen[pid] = 0
            pfreq = {}
            for term in ptext.split(' '):
                self.sumdl += 1
                if term not in self.ctf:
                    self.ctf[term] = 0
                self.ctf[term] += 1
                self.doclen[pid] += 1
                if term not in pfreq:
                    pfreq[term] = 0
                pfreq[term] += 1
            # append "passage_id:term frequency" to each term's posting list
            for k, v in pfreq.items():
                if k not in self.index:
                    self.index[k] = []
                self.index[k].append(pid + ':' + str(v))
        for k, v in self.ctf.items():
            self.probctf[k] = v / float(self.sumdl)
        self.avgdl = self.sumdl / float(self.pcount)

buildIndex = indexCount(col)
buildIndex.create_index()
# Inverted index with count: dict with term as key and posting list as value.
# Posting list is a list with each element in the format "passage_id:term frequency"
# Example - {'the': ['2020338_0:11', '3174498_1:4']}
index = buildIndex.index

# Total number of passages in the collection
num_passages = buildIndex.pcount

# Average passage length
avgdl = buildIndex.avgdl

# Collection Term Frequency : dict with term as key and the term frequency in collection as value
ctf = buildIndex.ctf

# Probability Term Frequencies : dict with term as key and its probability of occurrence in the collection as value
probctf = buildIndex.probctf

# dict with passageId as key and number of tokens in the passage as value
doclen = buildIndex.doclen

# Total number of tokens in the collection
totNumTerms = buildIndex.sumdl
print('Total number of passages in the collection :{0}'.format(num_passages))
print('Average passage length :{0}'.format(avgdl))
print('Total num of unique terms :{0}'.format(len(ctf)))
print('Total num of terms in the collection :{0}'.format(totNumTerms))
Total number of passages in the collection :403492
Average passage length :41.11619809066846
Total num of unique terms :149467
Total num of terms in the collection :16590057
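Before moving to the retrieval models, here is a short sketch of how df(w) and count(w, p) can be read back out of the index; the two postings below mirror the example format shown above for the term 'the'.

# Sketch: reading df(w) and count(w, p) out of the inverted index.
postings = ['2020338_0:11', '3174498_1:4']   # same format as index['the']
df_w = len(postings)                         # df(w): passages containing the term
for entry in postings:
    pid, tf = entry.split(':')
    print(pid, int(tf))                      # passage id and count(w, p)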
> 2: Vector Space model (VSM model) (30 points)
In the cell below, implement the VSM model given in Slide 19 of 'Basic Retrieval Models Part 1'.
The score function has been given below for reference.

$$\mathrm{score}(q,p) = \sum_{w \in q \cap p} \mathrm{count}(w,q) \cdot \frac{\ln\big(1+\ln(1+\mathrm{count}(w,p))\big)}{1-b+b\frac{|p|}{avgdl}} \cdot \ln\frac{|C|+1}{df(w)}$$
score(q, p) - score assigned to a passage p for a query q
count(w, q) - number of times term w occurs in query q
count(w, p) - number of times term w occurs in passage p
b - set this to 0.75
|p| - number of tokens in passage p
avgdl - average number of tokens in passages in the collection
|C| - number of passages in collection C
df(w) - number of passages containing term w
Please note that we consider each query term once, since this is equivalent to a dot product.
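To make the formula concrete before implementing it, the sketch below scores a single matching term with made-up numbers (count(w, q) = 1, count(w, p) = 3, |p| = 50 and df(w) = 1000 are hypothetical); b, avgdl and |C| are the values used in this assignment.

import numpy as np

# Hypothetical values for one query term w and one passage p
count_wq, count_wp = 1, 3           # term count in query / passage (made up)
p_len, df_w = 50, 1000              # passage length and df(w) (made up)
b, avgdl, C = 0.75, 41.116, 403492  # values from this assignment

tf_part = np.log(1 + np.log(1 + count_wp))  # dampened term frequency
norm = 1 - b + b * (p_len / avgdl)          # pivoted length normalization
idf = np.log((C + 1) / df_w)                # idf component
print(count_wq * (tf_part / norm) * idf)    # approx. 4.49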
For each query, you have to return the top 5 retrieved passages ranked based on the score
returned by the VSM model using "term at a time" scoring method.
Rank passages for each query and return top 5 passages.
Return Variables:
final_ranking_vsm : map with query id as key and list of top 5 ranked passages as value
import numpy as np
import operator

def vsm(queries, index, avgdl, num_passages, doclen):
    final_ranking_vsm = {}
    for line in open(queries_file, encoding='utf8'):
        queryid, querytext = line.strip().split('\t')
        # collect the unique query terms
        query_vocabulary = []
        for word in querytext.split():
            if word not in query_vocabulary:
                query_vocabulary.append(word)
        # count(w, q): number of times each term occurs in the query
        query_wc = {}
        for word in query_vocabulary:
            query_wc[word] = querytext.lower().split().count(word)
        final_score = {}
        for w in query_vocabulary:
            m = len(index[w])                 # df(w)
            for i in index[w]:
                pid, pcount = i.split(':')
                # tf upper part: ln(1 + ln(1 + count(w, p)))
                k = np.log(1 + np.log(1 + int(pcount)))
                # tf lower part: 1 - b + b * (|p| / avgdl), b = 0.75
                p = (1 - 0.75) + (0.75 * (int(doclen[pid]) / avgdl))
                # first part
                f = k / p
                # idf part: ln((|C| + 1) / df(w))
                c = (num_passages + 1) / m
                sc = np.log(c)
                # score calculation
                scoreofi = query_wc[w] * f * sc
                # adding into final score (term at a time)
                if pid not in final_score:
                    final_score[pid] = scoreofi
                else:
                    final_score[pid] = final_score[pid] + scoreofi
        # keep the top 5 passages per query
        final_ranking_vsm[queryid] = sorted(final_score.items(),
                                            key=operator.itemgetter(1),
                                            reverse=True)[:5]
    return final_ranking_vsm

final_ranking_vsm = vsm(queries, index, avgdl, num_passages, doclen)
# Hint: The score would be in the interval: [13,14]
print('The top retrieved passage and score for query id "3698636" using VSM is : {0}'.format(final_ranking_vsm['3698636'][:1]))

The top retrieved passage and score for query id "3698636" using VSM is : [('754739_
> 3: BM25 (30 points)
In the cell below, implement the BM25 model given in Slide 31 of 'Basic Retrieval Models Part 3'.
$$\mathrm{score}(q,p) = \sum_{w \in q} \frac{(k_1+1)\,\mathrm{count}(w,p)}{k_1\big(1-b+b\frac{|p|}{avgdl}\big)+\mathrm{count}(w,p)} \cdot \ln\frac{|C|-df(w)+0.5}{df(w)+0.5}$$
score(q, p) - score assigned to a passage p for a query q
count(w, p) - number of times term w occurs in passage p
b - set this to 0.75
|p| - number of tokens in passage p
avgdl - average number of tokens in passages in the collection
|C| - number of passages in collection C
df(w) - number of passages containing term w
k_1 - set to 1.2
Please note that we iterate over all query tokens including repetitions.
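Analogously to the VSM check above, this sketch evaluates one term's BM25 contribution with the same made-up values (count(w, p) = 3, |p| = 50 and df(w) = 1000 are hypothetical); k_1, b, avgdl and |C| come from this assignment.

import numpy as np

# Hypothetical values for one query term w and one passage p
count_wp = 3                        # term count in passage (made up)
p_len, df_w = 50, 1000              # passage length and df(w) (made up)
k1, b, avgdl, C = 1.2, 0.75, 41.116, 403492  # values from this assignment

tf_part = ((k1 + 1) * count_wp) / (k1 * (1 - b + b * p_len / avgdl) + count_wp)
idf = np.log((C - df_w + 0.5) / (df_w + 0.5))
print(tf_part * idf)                # approx. 9.0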
Similar to the previous model, return the top 5 retrieved passages for each query ranked based
on the BM25 scoring using "term at a time" scoring method.
Rank passages for each query and return top 5 passages.
Return Variables:
final_ranking_bm25 : map with query id as key and list of top 5 ranked passages as value
def bm25(queries, index, avgdl, num_passages, doclen):
    final_ranking_bm25 = {}
    for line in open(queries_file, encoding='utf8'):
        qid, qtext = line.strip().split('\t')
        # collect the unique query terms
        query_vocabulary = []
        for word in qtext.split():
            if word not in query_vocabulary:
                query_vocabulary.append(word)
        # count(w, q): weighting each unique term by its query count is
        # equivalent to iterating over all query tokens including repetitions
        query_wc = {}
        for word in query_vocabulary:
            query_wc[word] = qtext.lower().split().count(word)
        final_score = {}
        for w in query_vocabulary:
            m = len(index[w])                 # df(w)
            for i in index[w]:
                pid, pcount = i.split(':')
                # first part: (k1 + 1) * count(w, p) / (k1 * (1 - b + b*|p|/avgdl) + count(w, p))
                fu = (1.2 + 1) * int(pcount)
                u = int(doclen[pid]) / avgdl
                u1 = 1 - 0.75 + (0.75 * u)
                fl = (1.2 * u1) + int(pcount)
                f = fu / fl
                # second part: ln((|C| - df(w) + 0.5) / (df(w) + 0.5))
                lu = num_passages - m + 0.5
                ll = m + 0.5
                l = np.log(lu / ll)
                # score calculation
                scoreofi = query_wc[w] * f * l
                # adding into final score (term at a time)
                if pid not in final_score:
                    final_score[pid] = scoreofi
                else:
                    final_score[pid] = final_score[pid] + scoreofi
        # keep the top 5 passages per query
        final_ranking_bm25[qid] = sorted(final_score.items(),
                                         key=operator.itemgetter(1),
                                         reverse=True)[:5]
    return final_ranking_bm25
final_ranking_bm25 = bm25(queries, index, avgdl, num_passages, doclen)
# Hint: The score would be in the interval: [18,19]
print('The top retrieved passage and score for query id "3698636" using BM25 is : {0}'.format(final_ranking_bm25['3698636'][:1]))

The top retrieved passage and score for query id "3698636" using BM25 is : [('3698636
> 4: Evaluation (30 points)
In the cell below, evaluate the top 5 retrieved passages corresponding to each of the models using
Precision@5 and Recall@5 metrics. You can use the code from Assignment 1, modified as
needed.
# return precision of top 5 retrieved passages
def calcPrecision(top, qrels, rank_in):
    # enter your code here
    return 0

# return recall of top 5 retrieved passages
def calcRecall(top, qrels, rank_in):
    # enter your code here
    return 0
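As a reference point, here is a minimal sketch of one way these functions could be filled in, assuming qrels is the DataFrame returned by loadQrels and rank_in maps each query id to its ranked (passageid, score) pairs; the binary relevance mapping and the treatment of unjudged pairs as non-relevant follow the notes in Section 1, and the function names are illustrative.

# Sketch only: possible Precision@k and Recall@k implementations under the
# binary relevance assumption (labels 3 and 4 are relevant).
def relevantSet(qrels, qid):
    judged = qrels[qrels['queryid'] == qid]
    return set(judged[judged['relevance_score'].astype(int) >= 3]['passageid'])

def calcPrecisionSketch(top, qrels, rank_in):
    vals = []
    for qid, ranking in rank_in.items():
        relevant = relevantSet(qrels, qid)
        hits = sum(1 for pid, _ in ranking[:top] if pid in relevant)
        vals.append(hits / top)
    return sum(vals) / len(vals)      # macro-average over queries

def calcRecallSketch(top, qrels, rank_in):
    vals = []
    for qid, ranking in rank_in.items():
        relevant = relevantSet(qrels, qid)
        hits = sum(1 for pid, _ in ranking[:top] if pid in relevant)
        vals.append(hits / len(relevant) if relevant else 0.0)
    return sum(vals) / len(vals)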
# Hint: Precision value interval [0.1,0.2], Recall value interval [0.04,0.05]
print("Evaluate VSM model")
print('Precision at top 5 : {0}'.format(calcPrecision(5, qrels, final_ranking_vsm)))
print('Recall at top 5 : {0}'.format(calcRecall(5, qrels, final_ranking_vsm)))
print('#' * 60)

# Hint: Precision value interval [0.3,0.4], Recall value interval [0.10,0.20]
print("Evaluate BM25 model")
print('Precision at top 5 : {0}'.format(calcPrecision(5, qrels, final_ranking_bm25)))
print('Recall at top 5 : {0}'.format(calcRecall(5, qrels, final_ranking_bm25)))
print('#' * 60)
Evaluate VSM model
Precision at top 5 : 0
Recall at top 5 : 0
Evaluate BM25 model
Precision at top 5 : 0
Recall at top 5 : 0