
IR Exercise LAB1

Information system analysis

Uploaded by

lobnaselgahed

IR-Lab1

Exercise 1:
Draw the inverted index that would be built for the following document collection
Doc 1 - new home sales top forecasts
Doc 2 - home sales rise in july
Doc 3 - increase in home sales in july
Doc 4 - july new home sales rise

Solution:
We are given 4 documents, each consisting of a sequence of words. Each distinct word is called a term.
A term-document index can be viewed as a matrix that records whether each term is present in each document.

An inverted index records, for each term, the documents in which the term occurs. To
create the inverted index:

First, list each unique term - new, home, sales, top, forecasts, rise, in, july, increase.

Then, arrange the terms in alphabetical order - forecasts, home, in, increase, july, new,
rise, sales, top.

For each term, list the documents in which the term occurs, thus creating the Inverted
Index.
forecasts -> Doc 1
home -> Doc 1, Doc 2, Doc 3, Doc 4
in -> Doc 2, Doc 3
increase -> Doc 3
july -> Doc 2, Doc 3, Doc 4
new -> Doc 1, Doc 4
rise -> Doc 2, Doc 4
sales -> Doc 1, Doc 2, Doc 3, Doc 4
top -> Doc 1
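
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original lab: the function name build_inverted_index is hypothetical, and splitting on whitespace is an assumed tokenization that happens to suit these short lowercase documents.

```python
from collections import defaultdict

# The four documents from Exercise 1, keyed by document number.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: terms are separated by whitespace, no stemming.
        for term in text.split():
            postings[term].add(doc_id)
    # Sort the vocabulary alphabetically and each postings list by doc id.
    return {term: sorted(postings[term]) for term in sorted(postings)}

index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(term, "->", ", ".join(f"Doc {d}" for d in doc_ids))
```

Using a set per term means a word that occurs twice in one document (such as "in" in Doc 3) still produces a single posting.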

In the inverted index above, the list of terms is called the vocabulary or lexicon.

Each document in which a term occurs is called a posting. The list of documents in
which a given term occurs is called a postings list. All the postings lists taken together
are called the postings.

The vocabulary is typically sorted alphabetically, and each postings list is sorted by
document id.
Exercise 2: Consider these documents

Doc 1 - breakthrough drug for schizophrenia
Doc 2 - new schizophrenia drug
Doc 3 - new approach for treatment of schizophrenia
Doc 4 - new hopes for schizophrenia patients

a. Draw the term document incidence matrix for this document collection.
b. Draw the inverted index representation for this collection.

Solution:
a.
The term document incidence matrix has the list of terms as rows and the list of
documents as columns. Each cell in the matrix represents whether the term is present in
the document (value 1 if present, else value 0).

The term document incidence matrix is created as below

Term            Doc 1   Doc 2   Doc 3   Doc 4
approach          0       0       1       0
breakthrough      1       0       0       0
drug              1       1       0       0
for               1       0       1       1
hopes             0       0       0       1
new               0       1       1       1
of                0       0       1       0
patients          0       0       0       1
schizophrenia     1       1       1       1
treatment         0       0       1       0
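
As a way to check a hand-drawn matrix, the construction can be sketched in Python. This is a minimal sketch under the assumption of whitespace tokenization; the function name incidence_matrix is hypothetical and not taken from the lab.

```python
# The four documents from Exercise 2, in order Doc 1..Doc 4.
docs = [
    "breakthrough drug for schizophrenia",
    "new schizophrenia drug",
    "new approach for treatment of schizophrenia",
    "new hopes for schizophrenia patients",
]

def incidence_matrix(docs):
    """Return {term: [0/1 per document]} with terms in alphabetical order."""
    # Assumption: terms are separated by whitespace.
    tokenized = [set(text.split()) for text in docs]
    vocabulary = sorted(set().union(*tokenized))
    return {
        term: [1 if term in doc_terms else 0 for doc_terms in tokenized]
        for term in vocabulary
    }

matrix = incidence_matrix(docs)
for term, row in matrix.items():
    # One row per term, one column per document.
    print(f"{term:<14}", *row)
```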

b.
The inverted index for the above collection is as below

approach -> Doc 3
breakthrough -> Doc 1
drug -> Doc 1, Doc 2
for -> Doc 1, Doc 3, Doc 4
hopes -> Doc 4
new -> Doc 2, Doc 3, Doc 4
of -> Doc 3
patients -> Doc 4
schizophrenia -> Doc 1, Doc 2, Doc 3, Doc 4
treatment -> Doc 3
Exercise 3:
For the document collection shown in Exercise 2, what are the returned results for these
queries?
a. schizophrenia AND drug
b. for AND NOT (drug OR approach)

Solution:
a. schizophrenia AND drug
Here we use the term-document incidence matrix from Exercise 2 to perform Boolean
retrieval for the given query.

For each of the terms schizophrenia and drug, we take the row (or vector) that indicates
the documents in which the term appears:

schizophrenia - 1 1 1 1
drug - 1 1 0 0
A bitwise AND of the two term vectors gives
1 1 1 1 AND 1 1 0 0 = 1 1 0 0

The result vector 1 1 0 0 identifies Doc 1 and Doc 2 as the documents in which both
schizophrenia and drug are present.
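
The same AND can be written out in Python as a small illustration. The helper names boolean_and and matching_docs are hypothetical; the vectors are copied by hand from the incidence matrix, with position 1 standing for Doc 1.

```python
# Term vectors from the incidence matrix (columns are Doc 1..Doc 4).
vectors = {
    "schizophrenia": [1, 1, 1, 1],
    "drug": [1, 1, 0, 0],
}

def boolean_and(a, b):
    """Element-wise AND of two equal-length 0/1 vectors."""
    return [x & y for x, y in zip(a, b)]

def matching_docs(vector):
    """Translate a 0/1 result vector into document names."""
    return [f"Doc {i}" for i, bit in enumerate(vector, start=1) if bit]

result = boolean_and(vectors["schizophrenia"], vectors["drug"])
print(result, matching_docs(result))
```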

b. for AND NOT (drug OR approach)

Term vectors
for - 1 0 1 1
drug - 1 1 0 0
approach - 0 0 1 0

First we do a bitwise OR of drug and approach, which gives

1 1 0 0 OR 0 0 1 0 = 1 1 1 0

Then we do a NOT operation on 1 1 1 0 (i.e. on drug OR approach), which gives 0 0 0 1

Finally we do an AND operation on 1 0 1 1 (i.e. for) and 0 0 0 1 (i.e. NOT (drug OR
approach)), which gives 0 0 0 1

Thus the document that contains for AND NOT (drug OR approach) is Doc 4.
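
A sketch of the same query with each term vector packed into a Python integer may help show one practical detail: NOT has to be restricted to the number of documents. The names N_DOCS, MASK, to_bits and matching_docs are hypothetical, and the leftmost bit is assumed to stand for Doc 1, as in the vectors above.

```python
N_DOCS = 4
MASK = (1 << N_DOCS) - 1  # 0b1111, keeps only the four document bits

def to_bits(vector_text):
    """Convert a vector written as '1 0 1 1' into an integer (0b1011)."""
    return int(vector_text.replace(" ", ""), 2)

for_vec = to_bits("1 0 1 1")
drug = to_bits("1 1 0 0")
approach = to_bits("0 0 1 0")

union = drug | approach    # drug OR approach        -> 1 1 1 0
negated = ~union & MASK    # NOT (drug OR approach)  -> 0 0 0 1
result = for_vec & negated # for AND NOT (...)       -> 0 0 0 1

def matching_docs(bits):
    """List the documents whose bit is set; the leftmost bit is Doc 1."""
    return [
        f"Doc {i + 1}"
        for i in range(N_DOCS)
        if (bits >> (N_DOCS - 1 - i)) & 1
    ]

print(format(result, "04b"), matching_docs(result))
```

Without the mask, ~union in Python is a negative number, because Python integers are not fixed-width; masking with 1 1 1 1 keeps the complement within the four-document collection.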
Exercise 4:
Consider the following document collection: (Note: Please do NOT apply stemming and
stopword removal.)

Doc1 new home sales top forecasts

Doc2 home sales rise in july

Doc3 increase in home sales in july

Doc4 july new home sales rise

a) Draw the term-document count matrix for this document collection.

b) Draw the inverted index for this collection. (Note: No need for positional
information.)

c) What are the returned results for the queries:

(1) sales AND rise

(2) july AND (new OR increase)
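
For part (a), note that a count matrix differs from an incidence matrix in that each cell holds the number of occurrences of the term in the document rather than 0 or 1. One possible way to check a hand-drawn answer is sketched below; the function name count_matrix is hypothetical, and whitespace tokenization with no stemming or stopword removal is assumed, in line with the note in the exercise.

```python
from collections import Counter

# The four documents from Exercise 4.
docs = {
    "Doc1": "new home sales top forecasts",
    "Doc2": "home sales rise in july",
    "Doc3": "increase in home sales in july",
    "Doc4": "july new home sales rise",
}

def count_matrix(docs):
    """Return {term: [occurrence count per document]} in alphabetical order."""
    # Assumption: terms are separated by whitespace.
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    vocabulary = sorted(set().union(*counts.values()))
    # A Counter returns 0 for terms that do not occur in a document.
    return {term: [counts[name][term] for name in docs] for term in vocabulary}

matrix = count_matrix(docs)
print(f"{'':<10}", *docs)
for term, row in matrix.items():
    print(f"{term:<10}", *row, sep="    ")
```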
