Assignment 2: Scoring and Evaluation (Deadline: 05.11.2023, 11:59 PM)

This assignment is on building a TF-IDF based ranked retrieval system to answer free-text queries. You have to use Python for this assignment, since Python provides many features that will ease up your workload, as compared to other programming languages like C++.

Dataset:

The data for this assignment can be found at this link (CRAN folder):
https://drive.google.com/drive/folders/19C-WbeYCiSValdl_KGAQRKGDAgeHon

You will require two files for this assignment:

• cran.all.1400: This is the main document file, containing the information for 1400 documents. To parse each document, you must read, sequentially, the following records from the file (a minimal parsing sketch is given at the end of Task 2A):
  • Each document starts with the .I field, indicating the ID
  • Followed by the .T field, indicating the title
  • Followed by the .A field, indicating the author
  • Followed by the .B field, indicating the source/location
  • Followed by the .W field, which actually contains the text of the document
  (e.g., the document titled "the boundary layer in simple shear flow past a flat plate" by m.b.glauert)
• cran.qry: the query file, containing the free-text queries to be answered.

Task 2A (Ranking)

Use the inverted index built in Assignment 1, i.e., the model_queries_<ROLL_NO>.bin (pickle) file saved in the main code directory.

To build a ranked retrieval model, you have to vectorize each query and each document available in the corpus.

➢ Consider all the terms (keys) in the inverted index to be your vocabulary V. Obtain the Document Frequency of each term, DF(t), as the size of the corresponding postings list in the inverted index.
➢ The Term Frequency TF(t, d) of term t in document d is defined as the number of times t occurs in d. The TF-IDF weight W(t, d) of each term t is thus obtained as W(t, d) = TF(t, d) x IDF(t), where IDF(t) = log(N / DF(t)) and N is the number of documents in the corpus.

Obtain the query and document texts as previously done in Assignment 1. Our goal now is to obtain |V|-dimensional TF-IDF vectors for each query and each document in the corpus. Represent each query q as [q] = [W(t, q) ∀ t in V]. Similarly, represent each document d in the corpus as [d] = [W(t, d) ∀ t in V], where V is the vocabulary defined above.

Refer to slide #41 of Lecture 5 and write code implementing the following three ddd.qqq schemes for weighting and normalizing the |V|-dimensional TF-IDF vectors (a sketch of scheme A appears at the end of this task):

➢ lnc.ltc
➢ Lnc.Lpc
➢ anc.apc

Rank all the documents in the corpus corresponding to each query using the cosine similarity metric, as described in slide #36 of Lecture 5.

➢ For each of the schemes, store the query ids and their corresponding top 50 document names/ids in ranked order in a two-column csv file with a format similar to "rankedRelevantDocList.csv".

Save the following three files in your main code directory:
  Assignment2_<ROLL_NO>_ranked_list_A.csv for "lnc.ltc"
  Assignment2_<ROLL_NO>_ranked_list_B.csv for "Lnc.Lpc"
  Assignment2_<ROLL_NO>_ranked_list_C.csv for "anc.apc"

➢ Name your code file as: Assignment2_<ROLL_NO>_ranker.py
➢ Running the file: Your code should take the path to the dataset and the inverted index file, i.e., model_queries_<ROLL_NO>.bin (obtained in Assignment 1), as input, and it should run in the following manner:

$>> python Assignment2_<ROLL_NO>_ranker.py <path to the dataset folder> <path to model_queries_<ROLL_NO>.bin>
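The following is a minimal sketch of the sequential field parsing described in the Dataset section. The function name is illustrative, and the choice to build each document's text from its .T and .W fields is an assumption (follow whatever pre-processing you used in Assignment 1):

    def parse_cran_documents(path):
        """Return {doc_id: text}; here text is assumed to combine .T and .W."""
        docs = {}
        doc_id, field, fields = None, None, {}

        def flush():
            if doc_id is not None:
                docs[doc_id] = " ".join(fields.get(".T", []) + fields.get(".W", []))

        with open(path, "r") as f:
            for line in f:
                line = line.rstrip("\n")
                if line.startswith(".I"):          # new document record begins
                    flush()
                    doc_id, field, fields = line.split()[1], None, {}
                elif line in (".T", ".A", ".B", ".W"):
                    field = line                   # a new field of the current doc
                elif field is not None:
                    fields.setdefault(field, []).append(line)
        flush()                                    # don't forget the last document
        return docs

    # e.g. docs = parse_cran_documents("CRAN/cran.all.1400"); len(docs) should be 1400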
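Below is a minimal sketch of scheme A (lnc.ltc) together with cosine-similarity ranking. It assumes DF(t) has been obtained as the postings-list length from the unpickled inverted index and that tokens are pre-processed as in Assignment 1; the other two schemes differ only in the per-slot tf/idf/normalization choices on slide #41 of Lecture 5. Function names are illustrative:

    import math
    from collections import Counter

    def lnc_vector(tokens):
        """Document side: logarithmic tf, no idf, cosine normalization."""
        vec = {t: 1 + math.log10(c) for t, c in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else vec

    def ltc_vector(tokens, df, n_docs):
        """Query side: logarithmic tf, idf, cosine normalization."""
        tf = Counter(t for t in tokens if t in df)   # restrict to vocabulary V
        vec = {t: (1 + math.log10(c)) * math.log10(n_docs / df[t])
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else vec

    def cosine(qvec, dvec):
        """Dot product of two already cosine-normalized sparse vectors."""
        return sum(w * dvec.get(t, 0.0) for t, w in qvec.items())

    # Ranking: score every document against the query and keep the top 50, e.g.
    # ranked = sorted(doc_vecs, key=lambda d: cosine(qvec, doc_vecs[d]),
    #                 reverse=True)[:50]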
Task 2B (Evaluation)

1. For each query, consider the top 20 ranked documents from the list obtained in the previous step.
2. For each query, calculate and report the following metrics with respect to the gold-standard ranked list of documents provided in "qrels.csv" (minimal sketches of the qrels handling and of these metrics are given at the end of this document):
   a. Average Precision (AP) @10
   b. Average Precision (AP) @20
   c. Normalized Discounted Cumulative Gain (NDCG) @10
   d. Normalized Discounted Cumulative Gain (NDCG) @20

   "qrels.csv" has 4 fields: topic_id (represents the query no.), iteration, cord_id, judgement (values 0-2).
   ➢ Use the iteration field to resolve conflicts between multiple entries (if any) of the same topic_id and cord_id (take the record having the higher iteration value).
   ➢ For binary relevance, consider non-zero judgement values to be relevant.
   ➢ Assume the relevance of any pair of (topic_id, cord_id) not present in "qrels.csv" to be 0.

3. Finally, calculate and report the Mean Average Precision (mAP@10 and mAP@20) and the average NDCG (averNDCG@10 and averNDCG@20) by averaging over all the queries.
4. For each of the three ranked lists Assignment2_<ROLL_NO>_ranked_list_<K>.csv (K in A, B, C) obtained in the previous step, create a separate file in the main code directory with the name Assignment2_<ROLL_NO>_metrics_<K>.csv and systematically save the values of the above-mentioned evaluation metrics (both query-wise and average).
5. Name your code file as: Assignment2_<ROLL_NO>_evaluator.py
6. Running the file: For each value of K (A/B/C), your code should take the path to the obtained ranked list and the gold-standard ranked list as input, and it should run in the following manner:

$>> python Assignment2_<ROLL_NO>_evaluator.py <path to the gold-standard ranked list> <path to Assignment2_<ROLL_NO>_ranked_list_<K>.csv>

Submit the files:
  Assignment2_<ROLL_NO>_ranker.py
  Assignment2_<ROLL_NO>_evaluator.py
  Assignment2_<ROLL_NO>_metrics_A.csv
  Assignment2_<ROLL_NO>_metrics_B.csv
  Assignment2_<ROLL_NO>_metrics_C.csv
  README.txt
in a zipped file named: Assignment2_<ROLL_NO>.zip

Your README should contain any specific library requirements to run your code and the specific Python version you are using. Any other special information about your code or logic that you wish to convey should be in the README file. Further, provide details of your design in the README, such as the vocabulary length, pre-processing pipeline, etc. Also, mention your roll number in the first line of your README.

IMPORTANT: PLEASE FOLLOW THE EXACT NAMING CONVENTION OF THE FILES AND THE SPECIFIC INSTRUCTIONS IN THE TASKS CAREFULLY. ANY DEVIATION FROM THEM WILL RESULT IN DEDUCTION OF MARKS.

Python library restrictions: You can use simple Python libraries like nltk, numpy, os, sys, collections, timeit, etc. However, you cannot use libraries like lucene, elasticsearch, or any other search API. If your code is found to use any such library, you will be awarded zero marks for this assignment without any evaluation. You also cannot use parsing libraries for parsing the corpus and query files; do it by writing your own code.

Plagiarism rules: We will be employing strict plagiarism checking. If your code matches another student's code, all the students whose codes match will be awarded zero marks without any evaluation. Therefore, it is your responsibility to ensure that you neither copy anyone's code nor let anyone copy yours.

Code errors: If your code doesn't run or gives errors while running, marks will be awarded based on the correctness of the logic. If required, you might be called to meet the TAs and explain your code.
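For reference, a minimal sketch of loading "qrels.csv" with the iteration-based conflict resolution described in Task 2B. The column order follows the field list above; the presence of a header row and the function name are assumptions:

    import csv

    def load_qrels(path):
        """Return {(topic_id, cord_id): judgement}, resolving duplicates by iteration."""
        best = {}   # (topic_id, cord_id) -> (iteration, judgement)
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        if rows and not rows[0][1].isdigit():      # skip a header row, if present
            rows = rows[1:]
        for topic_id, iteration, cord_id, judgement in rows:
            key = (topic_id, cord_id)
            if key not in best or int(iteration) > best[key][0]:
                best[key] = (int(iteration), int(judgement))
        # Any (topic_id, cord_id) pair absent from this dict has relevance 0.
        return {key: judgement for key, (_, judgement) in best.items()}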
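And a minimal sketch of AP@k and NDCG@k against the loaded qrels. Conventions differ on the AP@k denominator; normalizing by the total number of relevant documents for the query, as below, is an assumption, as is using the raw graded judgement (0-2) as the NDCG gain:

    import math

    def average_precision_at_k(ranked_docs, qrels, topic_id, k):
        """AP@k with binary relevance (non-zero judgement => relevant)."""
        n_rel = sum(1 for (t, _), judg in qrels.items() if t == topic_id and judg > 0)
        hits, precision_sum = 0, 0.0
        for i, doc in enumerate(ranked_docs[:k], start=1):
            if qrels.get((topic_id, doc), 0) > 0:
                hits += 1
                precision_sum += hits / i          # precision at rank i
        return precision_sum / n_rel if n_rel else 0.0

    def ndcg_at_k(ranked_docs, qrels, topic_id, k):
        """NDCG@k using the graded judgements as gains."""
        gains = [qrels.get((topic_id, doc), 0) for doc in ranked_docs[:k]]
        dcg = sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))
        ideal = sorted((judg for (t, _), judg in qrels.items() if t == topic_id),
                       reverse=True)[:k]
        idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
        return dcg / idcg if idcg else 0.0

    # mAP@k and averNDCG@k are then plain means of these per-query values.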
