Programming Assignment 2
Overview
In this assignment, you will use the index you created in Assignment 1 to rank documents and
create a search engine. You will implement different scoring functions and compare their results
against a baseline ranking produced by expert analysts.
Running Queries
For this assignment, you will need two files: topics.xml, which contains the queries to run, and a QREL file, which contains the relevance judgments. Each line of the QREL describes one assessed (query, document) pair using the following fields:
o <topic> is the ID of the query for which the document was assessed.
o <doc> is the name of one of the documents which you have indexed.
o <grade> is a value in the set {1, 2, 3, 4}, where a higher value means that the document is more relevant to the query. The value 1 indicates a document which is non-relevant.
This QREL does not have assessments for every (query, document) pair. If an assessment
is missing, we assume the correct grade for the pair is 1 (non-relevant).
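For reference, the judgments can be loaded into a dictionary keyed by (topic, doc), with the missing-pair rule applied at lookup time. Below is a minimal sketch in Python; it assumes whitespace-separated <topic> <doc> <grade> fields per line, so adjust the indices if your QREL carries extra columns:

    def load_qrel(path):
        """Load relevance judgments as {(topic, doc): grade}."""
        grades = {}
        with open(path) as f:
            for line in f:
                fields = line.split()
                if not fields:
                    continue  # skip blank lines
                topic, doc, grade = fields[0], fields[1], int(fields[2])
                grades[(topic, doc)] = grade
        return grades

    def grade_of(grades, topic, doc):
        # Unjudged (query, document) pairs are assumed non-relevant (grade 1).
        return grades.get((topic, doc), 1)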
You will write a program which takes the name of a scoring function as a command line argument (for example, --score BM25) and which prints a ranked list of documents for all queries found in topics.xml using that scoring function. Each line of the output should contain the following fields:
o <topic> is the ID of the query for which the document was ranked.
o <docid> is the document identifier.
o <rank> is the order in which to present the document to the user. The document with the highest score will be assigned a rank of 1, the second highest a rank of 2, and so on.
o <score> is the actual score the document obtained for that query.
o <run> is the name of the run. You can use any value here; it is meant to allow research teams to submit multiple runs for evaluation in competitions such as TREC.
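As an illustration, once a scoring function has produced a score for every document, one way to emit the ranked list in the field order above is sketched below. This is a minimal example; the run name my-run and the single-space column separator are placeholders:

    def write_run(out, topic, doc_scores, run="my-run"):
        """Write one topic's ranking: <topic> <docid> <rank> <score> <run>."""
        # Sort by descending score; break ties by docid for reproducibility.
        ranked = sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))
        for rank, (docid, score) in enumerate(ranked, start=1):
            out.write(f"{topic} {docid} {rank} {score:.4f} {run}\n")

Called once per topic, e.g. write_run(sys.stdout, topic, scores), this produces one ranked list per query.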
Query Processing
Before running any scoring function, you should process the text of the query in exactly the same way that you processed the text of a document for your inverted index (for example, the same tokenization, case folding, stopword removal, and stemming, if any).
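A sketch of what such a pipeline might look like, assuming a simple lowercase-tokenize-stop-stem chain; STOPWORDS and stem here are placeholders that must be replaced by exactly what your Assignment 1 indexer used:

    import re

    STOPWORDS = set()  # placeholder: the same stopword list as Assignment 1, if any

    def stem(token):
        # Placeholder: apply the same stemmer as Assignment 1, if any.
        return token

    def process_query(text):
        """Normalize query text exactly as documents were normalized at index time."""
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return [stem(t) for t in tokens if t not in STOPWORDS]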
The parameter --score TF-IDF directs your program to use a vector space model with TF-IDF scores. This should be very similar to the TF score, but use the following scoring function:

    score(q, d) = Σ_{i ∈ q} tf(i, d) · log( D / df(i) )

where D is the total number of documents, tf(i, d) is the frequency of term i in document d, and df(i) is the number of documents which contain term i.
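A minimal sketch of this scorer in Python, assuming a hypothetical index layout index[term] -> {docid: term frequency} (adapt the access code to your Assignment 1 structures):

    import math

    def tf_idf_scores(query_terms, index, num_docs):
        """Accumulate TF-IDF scores over all documents containing a query term."""
        scores = {}
        for term in query_terms:
            postings = index.get(term, {})  # hypothetical: docid -> tf
            df = len(postings)              # documents containing the term
            if df == 0:
                continue                    # term absent from the collection
            idf = math.log(num_docs / df)
            for docid, tf in postings.items():
                scores[docid] = scores.get(docid, 0.0) + tf * idf
        return scores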
Implement BM25 scores. This should use the following scoring function for document d and query q:

    BM25(q, d) = Σ_{i ∈ q} log( (D - df(i) + 0.5) / (df(i) + 0.5) ) · ( (k1 + 1) · f(i, d) ) / ( K + f(i, d) ) · ( (k2 + 1) · qf(i) ) / ( k2 + qf(i) )

    with K = k1 · ( (1 - b) + b · dl / avdl )

where f(i, d) is the frequency of term i in document d, qf(i) is its frequency in the query, dl is the length of d, avdl is the average document length in the collection, and k1, k2, and b are constants. To start, you can use the values suggested in the lecture on BM25 (k1 = 1.2, k2 varies from 0 to 1000, b = 0.75). Feel free to experiment with different values for these constants to learn their effect and to try to improve performance.
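A sketch of the same formula in Python, under the same hypothetical index layout as above plus a doc_lengths mapping (docid -> document length); the default k2 = 100 is an arbitrary starting point within the suggested range:

    import math

    def bm25_scores(query_terms, index, doc_lengths, k1=1.2, k2=100.0, b=0.75):
        """BM25 as given above; k2 lies in [0, 1000] per the lecture."""
        num_docs = len(doc_lengths)
        avdl = sum(doc_lengths.values()) / num_docs
        qf = {}  # term -> frequency in the query
        for t in query_terms:
            qf[t] = qf.get(t, 0) + 1
        scores = {}
        for term, term_qf in qf.items():
            postings = index.get(term, {})
            df = len(postings)
            if df == 0:
                continue
            idf = math.log((num_docs - df + 0.5) / (df + 0.5))
            query_part = (k2 + 1) * term_qf / (k2 + term_qf)
            for docid, f in postings.items():
                K = k1 * ((1 - b) + b * doc_lengths[docid] / avdl)
                doc_part = (k1 + 1) * f / (K + f)
                scores[docid] = scores.get(docid, 0.0) + idf * doc_part * query_part
        return scores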
Implement a language model with Dirichlet smoothing. The parameter mu should be set equal to the average document length in the collection.
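The usual query-likelihood formulation of this score is

    score(q, d) = Σ_{i ∈ q} log( ( f(i, d) + mu · cf(i) / |C| ) / ( dl + mu ) )

where cf(i) is the number of times term i occurs in the whole collection and |C| is the total number of term occurrences in the collection. A minimal sketch in Python, using the same hypothetical index and doc_lengths structures as above plus a coll_freq mapping (term -> collection frequency):

    import math

    def dirichlet_scores(query_terms, index, doc_lengths, coll_freq):
        """Query-likelihood scoring with Dirichlet smoothing."""
        coll_size = sum(coll_freq.values())                 # |C|
        mu = sum(doc_lengths.values()) / len(doc_lengths)   # mu = average doc length
        scores = {}
        for term in query_terms:
            p_coll = coll_freq.get(term, 0) / coll_size     # P(term | collection)
            if p_coll == 0:
                continue  # drop query terms unseen in the collection
            postings = index.get(term, {})
            for docid, dl in doc_lengths.items():  # smoothing scores every document
                f = postings.get(docid, 0)
                scores[docid] = scores.get(docid, 0.0) + math.log(
                    (f + mu * p_coll) / (dl + mu)
                )
        return scores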
Evaluation
To evaluate your results, we will write a program that computes NDCG@5, NDCG@10, NDCG@15, and NDCG@20. The input to the program will be the QREL file (relevance judgments) and the run file containing the ranked lists of documents.
These measures should be computed for each query, and the average over all queries should also be computed.
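For reference, NDCG@k divides the DCG of the ranking by the DCG of the ideal ordering of the judged documents. The sketch below is one reasonable instantiation: it maps grade g to gain g - 1 (so the non-relevant grade 1 contributes nothing) and uses a log2(rank + 1) discount; the gain mapping is an assumption, not something specified above.

    import math

    def dcg(gains):
        # DCG = sum over ranks i of gain_i / log2(i + 1)
        return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

    def ndcg_at_k(ranked_docs, grades, topic, k):
        """NDCG@k for one topic; grades is {(topic, doc): grade} from the QREL."""
        # Assumed gain mapping: gain = grade - 1, so grade 1 (non-relevant) -> 0.
        gains = [grades.get((topic, d), 1) - 1 for d in ranked_docs[:k]]
        ideal = sorted((g - 1 for (t, _), g in grades.items() if t == topic),
                       reverse=True)[:k]
        best = dcg(ideal)
        return dcg(gains) / best if best > 0 else 0.0

The per-query values at each cutoff are then averaged over all topics to obtain the final numbers.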
Report
Submission Checklist
Submit your files in a zipped folder named after your roll number on Google Classroom.