IR Exercise LAB1
Exercise 1:
Draw the inverted index that would be built for the following document collection:
Doc 1 - new home sales top forecasts
Doc 2 - home sales rise in july
Doc 3 - increase in home sales in july
Doc 4 - july new home sales rise
Solution:
We are given 4 documents, each with certain words in them. Each word is called a term.
An index is a matrix that captures the presence of each term in each document.
An inverted index captures, for each term, the documents in which the term occurs. To
create the inverted index:
First, list each unique term: new, home, sales, top, forecasts, rise, in, july, increase.
Then, arrange the terms in alphabetical order: forecasts, home, in, increase, july, new,
rise, sales, top.
Finally, for each term, list the documents in which the term occurs, thus creating the
inverted index:
forecasts -> Doc 1
home -> Doc 1, Doc 2, Doc 3, Doc 4
in -> Doc 2, Doc 3
increase -> Doc 3
july -> Doc 2, Doc 3, Doc 4
new -> Doc 1, Doc 4
rise -> Doc 2, Doc 4
sales -> Doc 1, Doc 2, Doc 3, Doc 4
top -> Doc 1
In the inverted index above, the list of terms is called the vocabulary or lexicon.
Each document in which a term occurs is called a posting. The list of documents in
which a given term occurs is called a postings list. All the postings lists taken
together are called the postings.
The vocabulary and the postings lists are kept sorted: the vocabulary alphabetically,
and each postings list by document ID.
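The construction steps used in the solution to Exercise 1 can be sketched in Python. This is a minimal illustration, not a definitive implementation: the function name build_inverted_index and the whitespace tokenization are assumptions made for this example, and document IDs are represented as the integers 1 to 4.

```python
from collections import defaultdict

# The four documents from Exercise 1, keyed by document ID.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():       # assumption: split on whitespace only
            postings[term].add(doc_id)  # a set ignores repeats such as "in" in Doc 3
    # Sort the vocabulary alphabetically and each postings list by document ID.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_inverted_index(docs)
for term, ids in index.items():
    print(term, "->", ", ".join(f"Doc {i}" for i in ids))
```

Collecting document IDs in a set before sorting means a term that occurs twice in one document still contributes a single posting.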
Exercise 2: Consider these documents
a. Draw the term document incidence matrix for this document collection.
b. Draw the inverted index representation for this collection.
Solution:
a.
The term document incidence matrix has the list of terms as rows and the list of
documents as columns. Each cell in the matrix represents whether the term is present in
the document (value 1 if present, else value 0).
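As a rough sketch, the incidence matrix can be derived from the postings listed in the inverted index of part (b): each row has a 1 in the column of every document in that term's postings list and a 0 elsewhere. The function name incidence_matrix and the dictionary layout are assumptions made for this illustration.

```python
# Postings copied from the inverted index in part (b); values are document IDs.
postings = {
    "approach": [3],
    "breakthrough": [1],
    "drug": [1, 2],
    "for": [1, 3, 4],
    "hopes": [4],
    "new": [2, 3, 4],
    "of": [3],
    "patients": [4],
    "schizophrenia": [1, 2, 3, 4],
    "treatment": [3],
}
doc_ids = [1, 2, 3, 4]

def incidence_matrix(postings, doc_ids):
    """Return one 0/1 row per term, with one column per document."""
    return {
        term: [1 if d in ids else 0 for d in doc_ids]
        for term, ids in postings.items()
    }

matrix = incidence_matrix(postings, doc_ids)

# Print the matrix with terms as rows and documents as columns.
print(f"{'term':<15}" + " ".join(f"Doc {d}" for d in doc_ids))
for term, row in matrix.items():
    print(f"{term:<15}" + " ".join(f"{bit:^5}" for bit in row))
```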
b.
The inverted index for the above collection is as below
approach -> Doc 3
breakthrough -> Doc 1
drug -> Doc 1, Doc 2
for -> Doc 1, Doc 3, Doc 4
hopes -> Doc 4
new -> Doc 2, Doc 3, Doc 4
of -> Doc 3
patients -> Doc 4
schizophrenia -> Doc 1, Doc 2, Doc 3, Doc 4
treatment -> Doc 3
Exercise 3:
For the document collection shown in Exercise 2, what are the returned results for these
queries?
a. schizophrenia AND drug
b. for AND NOT (drug OR approach)
Solution:
a. schizophrenia AND drug
Here we use the term-document incidence matrix to perform Boolean retrieval for the
given query.
For the terms schizophrenia and drug, we take the rows (or vectors) that indicate the
documents in which each term appears:
schizophrenia - 1 1 1 1
drug - 1 1 0 0
Performing a bitwise AND on the two term vectors gives
1 1 1 1 AND 1 1 0 0 = 1 1 0 0
The result vector 1 1 0 0 identifies Doc 1 and Doc 2 as the documents in which both
schizophrenia and drug are present.
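The bitwise AND over the two term vectors can be checked with a short Python sketch. The helper names vec_and and matching_docs are made up for this example, and each vector is stored as a list with one 0/1 entry per document, in the order Doc 1 to Doc 4.

```python
# Incidence vectors for the two query terms (one entry per document, Doc 1..Doc 4).
schizophrenia = [1, 1, 1, 1]
drug = [1, 1, 0, 0]

def vec_and(a, b):
    """Element-wise AND of two equal-length 0/1 vectors."""
    return [x & y for x, y in zip(a, b)]

def matching_docs(vector):
    """Translate a 0/1 result vector into document names."""
    return [f"Doc {i}" for i, bit in enumerate(vector, start=1) if bit]

result = vec_and(schizophrenia, drug)
print(result)                 # [1, 1, 0, 0]
print(matching_docs(result))  # ['Doc 1', 'Doc 2']
```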
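b. for AND NOT (drug OR approach)
From the term-document incidence matrix, the term vectors are
for - 1 0 1 1
drug - 1 1 0 0
approach - 0 0 1 0
First, drug OR approach gives 1 1 0 0 OR 0 0 1 0 = 1 1 1 0.
Complementing this vector gives NOT 1 1 1 0 = 0 0 0 1.
Finally, 1 0 1 1 AND 0 0 0 1 = 0 0 0 1.
Thus the only document that satisfies for AND NOT (drug OR approach) is Doc 4.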
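The query for AND NOT (drug OR approach) can also be evaluated directly on the postings lists, treating each list as a set of document IDs. This is only a sketch: the variable names are illustrative, and NOT is interpreted as the complement with respect to the four documents in the collection.

```python
# Postings for the three query terms, taken from the inverted index in Exercise 2.
all_docs = {1, 2, 3, 4}
postings = {
    "for": {1, 3, 4},
    "drug": {1, 2},
    "approach": {3},
}

# drug OR approach: union of the two postings sets.
drug_or_approach = postings["drug"] | postings["approach"]

# NOT (...): every document in the collection that is not in the union.
not_drug_or_approach = all_docs - drug_or_approach

# for AND NOT (...): intersection with the postings of "for".
result = postings["for"] & not_drug_or_approach
print(sorted(result))  # [4]
```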
Exercise 4:
Consider the following document collection: (Note: Please do NOT apply stemming and
stopword removal.)