
Solution.: Increase - 3

The question asks to calculate the precision and recall of an IR system given the following information: 1) The system returned 3 relevant documents 2) It also returned 2 irrelevant documents 3) There are a total of 8 relevant documents in the collection The precision is 3/5 = 0.6 since there were 3 true positives out of the 5 documents returned. The recall is 3/8 = 0.375 since it returned 3 of the 8 relevant documents.

Uploaded by

Ehab Emam
Copyright
© © All Rights Reserved

Q1.

Draw the inverted index that would be built for the following document collection

Doc 1 new home sales top forecasts

Doc 2 home sales rise in july

Doc 3 increase in home sales in july

Doc 4 july new home sales rise


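SOLUTION. Inverted index (terms are stemmed, so "forecasts" becomes "forecast" and "sales" becomes "sale"):
forecast -> 1
home -> 1 -> 2 -> 3 -> 4
in -> 2 -> 3
increase -> 3
july -> 2 -> 3 -> 4
new -> 1 -> 4
rise -> 2 -> 4
sale -> 1 -> 2 -> 3 -> 4
top -> 1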
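A minimal sketch of how such an index could be built for the four documents of Q1. The helper `crude_stem` is a hypothetical stand-in for a real stemmer (it only strips one trailing "s"), chosen so that the terms come out as "forecast" and "sale"; it is an assumption, not part of the original solution.

```python
from collections import defaultdict

# Document collection from Q1.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}

def crude_stem(term):
    # Hypothetical stand-in for a real stemmer: strip one trailing "s".
    return term[:-1] if term.endswith("s") and len(term) > 3 else term

def build_inverted_index(docs):
    # Map each stemmed term to the sorted list of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[crude_stem(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}

index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, "->", " -> ".join(map(str, postings)))
```

Note that "july" occurs in documents 2, 3 and 4, so its postings list has three entries.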

Q2. Consider these documents:


Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients

a. Draw the term-document incidence matrix for this document collection.


b. Draw the inverted index representation for this collection, as in Figure 1.3 (page 7).

SOLUTION.
Term-document matrix:

term            d1  d2  d3  d4
approach         0   0   1   0
breakthrough     1   0   0   0
drug             1   1   0   0
for              1   0   1   1
hopes            0   0   0   1
new              0   1   1   1
of               0   0   1   0
patients         0   0   0   1
schizophrenia    1   1   1   1
treatment        0   0   1   0

Inverted index:
approach -> 3
breakthrough -> 1
drug -> 1 -> 2
for -> 1 -> 3 -> 4
hopes -> 4
new -> 2 -> 3 -> 4
of -> 3
patients -> 4
schizophrenia -> 1 -> 2 -> 3 -> 4
treatment -> 3

Q3. For the document collection shown in Exercise 1.2 (Q2 above), what are the returned results for these queries:
a. schizophrenia AND drug
b. for AND NOT(drug OR approach)
SOLUTION.
a. doc1, doc2
b. doc4
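The two Boolean queries can be checked with a short sketch that treats each postings list as a set of document IDs. The variable names and the `postings` helper are illustrative assumptions; the documents are those of Q2.

```python
# Document collection from Q2.
docs = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}

# Inverted index: term -> set of document IDs.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# Term-document incidence matrix: one 0/1 row per term.
matrix = {t: [int(d in ids) for d in sorted(docs)] for t, ids in sorted(index.items())}

def postings(term):
    # Hypothetical helper: unknown terms get an empty postings set.
    return index.get(term, set())

# a. schizophrenia AND drug
query_a = postings("schizophrenia") & postings("drug")
# b. for AND NOT(drug OR approach)
query_b = postings("for") - (postings("drug") | postings("approach"))

print(sorted(query_a))  # [1, 2]
print(sorted(query_b))  # [4]
```

AND maps to set intersection, OR to union, and AND NOT to set difference, which avoids materialising the complement of a postings list.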

Q4. Recommend a query processing order for

(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes) given the following postings list sizes:

Term Postings size

eyes 213312

kaleidoscope 87009

marmalade 107913
skies 271658

tangerine 46653

trees 316812
SOLUTION. Using the conservative estimate of the length of unioned postings lists, the recommended order is:
(kaleidoscope OR eyes) (300,321)
AND (tangerine OR trees) (363,465)
AND (marmalade OR skies) (379,571)
However, depending on the actual distribution of postings, (tangerine OR trees) may well be longer than (marmalade OR skies) because the two components of the former are more asymmetric. For example, the union of 11 and 9990 is expected to be longer than the union of 5000 and 5000 even though the conservative estimate predicts otherwise.
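The conservative estimate can be reproduced with a small sketch: the size of each OR group is bounded by the sum of its postings list sizes, and the groups are then processed in increasing order of that bound. The names `sizes`, `groups` and `estimate` are illustrative assumptions.

```python
# Postings list sizes from the table in Q4.
sizes = {
    "eyes": 213312,
    "kaleidoscope": 87009,
    "marmalade": 107913,
    "skies": 271658,
    "tangerine": 46653,
    "trees": 316812,
}

# The three OR groups of the query, in the order they are written.
groups = [
    ("tangerine", "trees"),
    ("marmalade", "skies"),
    ("kaleidoscope", "eyes"),
]

def estimate(group):
    # Conservative upper bound on the size of the union: sum of the sizes.
    return sum(sizes[term] for term in group)

# Process the smallest estimated union first.
order = sorted(groups, key=estimate)
for group in order:
    print(" OR ".join(group), estimate(group))
```

This is only an upper bound; the true union sizes depend on how much the postings lists overlap.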

S. Singh's solution

Time for processing:
(i) (tangerine OR trees) = O(46653+316812) = O(363465)
(ii) (marmalade OR skies) = O(107913+271658) = O(379571)
(iii) (kaleidoscope OR eyes) = O(87009+213312) = O(300321)

Order of processing:
a. Process (i), (ii), (iii) in any order as the first 3 steps (the total time for these steps is O(363465+379571+300321) in any case).
b. Merge (i) AND (iii) = (iv): For the AND operator, the cost of merging two postings lists depends on the length of the shorter list. Therefore, the shorter the smaller postings list, the less time is spent. (i) is chosen instead of (ii) because the output list (iv) is more likely to be shorter if (i) is chosen.
c. Merge (iv) AND (ii): This is the only merging operation left.

Q5. Are the following statements true or false?

a. In a Boolean retrieval system, stemming never lowers precision

b. In a Boolean retrieval system, stemming never lowers recall.

c. Stemming increases the size of the vocabulary.


d. Stemming should be invoked at indexing time but not while processing a query.

SOLUTION.
a. False. Stemming can increase the retrieved set without increasing the number of relevant documents.
b. True. Stemming can only increase the retrieved set, which means increased or unchanged recall.
c. False. Stemming decreases the size of the vocabulary.
d. False. The same processing should be applied to documents and queries to ensure matching terms.

Q6. We have a two-word query. For one term the postings list consists of the following 16 entries:

[4,6,10,12,14,16,18,20,22,32,47,81,120,122,157,180]

and for the other it is the one-entry postings list: [47].

Work out how many comparisons would be done to intersect the two postings lists with the following two strategies. Briefly
justify your answers:

a. Using standard postings lists


b. Using postings lists stored with skip pointers, with a skip length of √P, as suggested in Section 2.3.
SOLUTION.
a. Applying MERGE on the standard postings lists, comparisons are made until either postings list ends, i.e. until we reach 47 in the upper postings list, after which the lower list ends and no more processing is needed. Number of comparisons = 11.

b. Using skip pointers of length 4 for the longer list and of length 1 for the shorter list, the following comparisons are made:
1. 4 & 47
2. 14 & 47
3. 22 & 47
4. 120 & 47
5. 81 & 47
6. 47 & 47
Number of comparisons = 6.
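The count for the standard postings lists in part (a) can be checked with a minimal sketch of the linear merge. It assumes that each loop iteration counts as one comparison of two docIDs; the function name `intersect_counting` is illustrative. The skip-pointer variant is not sketched here, since its count depends on how skip checks are tallied.

```python
def intersect_counting(p1, p2):
    # Linear merge of two sorted postings lists.
    # Returns the intersection and the number of docID comparisons,
    # counting one comparison per loop iteration (an assumption).
    i = j = comparisons = 0
    result = []
    while i < len(p1) and j < len(p2):
        comparisons += 1
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result, comparisons

long_list = [4, 6, 10, 12, 14, 16, 18, 20, 22, 32, 47, 81, 120, 122, 157, 180]
short_list = [47]

matches, count = intersect_counting(long_list, short_list)
print(matches, count)  # [47] 11
```

The merge stops as soon as the one-entry list is exhausted, so the entries after 47 in the longer list are never inspected.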
Q7. Consider a postings intersection between this postings list, with skip pointers:

Trace through the Postings lists intersection with skip pointers.

a. How often is a skip pointer followed?


b. How many postings comparisons will be made by this algorithm while intersecting the two lists? Identify them.
c. How many postings comparisons would be made if the postings lists are intersected without the use of skip pointers?

SOLUTION.
a. The skip pointer is followed once (from 24 to 75).
b. 19 comparisons are made. Let (x, y) denote a posting comparison. The comparisons are: (3,3), (5,5), (9,89), (15,89), (24,89), (73,89), (75,89), (92,89), (81,89), (84,89), (89,89), (92,95), (115,95), (96,95), (96,97), (97,9), (100,99), (100 …
c. 19

Q8. Shown below is a portion of a positional index in the format: term: doc1: (position1, position2, ...); doc2: (position1, position2, ...); etc.

angels: 2: (36,174,252,651); 4: (12,22,102,432); 7: (17);

fools: 2: (1,17,74,222); 4: (8,78,108,458); 7: (3,13,23,193);

fear: 2: (87,704,722,901); 4: (13,43,113,433); 7: (18,328,528);

in: 2: (3,37,76,444,851); 4: (10,20,110,470,500); 7: (5,15,25,195);

rush: 2: (2,66,194,321,702); 4: (9,69,149,429,569); 7: (4,14,404);

to: 2: (47,86,234,999); 4: (14,24,774,944); 7: (199,319,599,709);

tread: 2: (57,94,333); 4: (15,35,155); 7: (20,320);

where: 2: (67,124,393,1001); 4: (11,41,101,421,431); 7: (16,36,736);

Which document(s) if any meet each of the following queries, where each expression within quotes is a
phrase query?

a. "fools rush in"

b. "fools rush in" AND "angels fear to tread"

SOLUTION.
a. All three documents (2, 4, and 7) satisfy the query.
b. Only document 4.
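Both answers can be verified with a short sketch that checks, for each candidate document, whether the phrase terms occur at consecutive positions. The dictionary layout and the function name `phrase_docs` are illustrative assumptions; the positions are copied from the index in Q8.

```python
# Positional index from Q8: term -> {doc_id: [positions]}.
index = {
    "angels": {2: [36, 174, 252, 651], 4: [12, 22, 102, 432], 7: [17]},
    "fools": {2: [1, 17, 74, 222], 4: [8, 78, 108, 458], 7: [3, 13, 23, 193]},
    "fear": {2: [87, 704, 722, 901], 4: [13, 43, 113, 433], 7: [18, 328, 528]},
    "in": {2: [3, 37, 76, 444, 851], 4: [10, 20, 110, 470, 500], 7: [5, 15, 25, 195]},
    "rush": {2: [2, 66, 194, 321, 702], 4: [9, 69, 149, 429, 569], 7: [4, 14, 404]},
    "to": {2: [47, 86, 234, 999], 4: [14, 24, 774, 944], 7: [199, 319, 599, 709]},
    "tread": {2: [57, 94, 333], 4: [15, 35, 155], 7: [20, 320]},
    "where": {2: [67, 124, 393, 1001], 4: [11, 41, 101, 421, 431], 7: [16, 36, 736]},
}

def phrase_docs(phrase):
    # Return the documents in which the terms of the phrase
    # appear at consecutive positions, in order.
    terms = phrase.split()
    candidates = set.intersection(*(set(index[t]) for t in terms))
    hits = []
    for doc in sorted(candidates):
        for start in index[terms[0]][doc]:
            if all(start + k in index[t][doc] for k, t in enumerate(terms)):
                hits.append(doc)
                break
    return hits

answer_a = phrase_docs("fools rush in")
answer_b = sorted(set(answer_a) & set(phrase_docs("angels fear to tread")))
print(answer_a)  # [2, 4, 7]
print(answer_b)  # [4]
```

In document 4, for example, "angels fear to tread" matches at positions 12, 13, 14, 15.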

Q9. Write down the entries in the permuterm index dictionary that are generated by the term
mama.

SOLUTION.
mama$, ama$m, ma$ma, a$mam, $mama.

Q10. If you wanted to search for s*ng in a permuterm wildcard index, what key(s) would one do the lookup on?
SOLUTION. ng$s*
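The answers to Q9 and Q10 can be reproduced with a small sketch. The function names are illustrative assumptions, and `wildcard_lookup_key` only handles queries with exactly one `*`.

```python
def permuterm_rotations(term):
    # Append the end marker "$" and list every rotation of the result.
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

def wildcard_lookup_key(query):
    # For a query X*Y with a single "*", rotate so that the "*" ends up
    # at the end: the lookup key is Y$X*, i.e. a prefix search on Y$X.
    prefix, suffix = query.split("*")
    return suffix + "$" + prefix + "*"

print(permuterm_rotations("mama"))  # ['mama$', 'ama$m', 'ma$ma', 'a$mam', '$mama']
print(wildcard_lookup_key("s*ng"))  # ng$s*
```

Rotating the wildcard to the end turns the wildcard query into a prefix lookup, which a sorted dictionary or B-tree can answer directly.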

Q11. Compute the edit distance between paris and alice.

SOLUTION.
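Each cell holds four values: top-left is the cost via the diagonal (copy or substitute), top-right the cost via the cell above, bottom-left the cost via the cell to the left, and bottom-right the minimum of the three, which is the cell's edit distance.

  |   | a   | l   | i   | c   | e
  | 0 | 1 1 | 2 2 | 3 3 | 4 4 | 5 5
p | 1 | 1 2 | 2 3 | 3 4 | 4 5 | 5 6
  | 1 | 2 1 | 2 2 | 3 3 | 4 4 | 5 5
a | 2 | 1 2 | 2 3 | 3 4 | 4 5 | 5 6
  | 2 | 3 1 | 2 2 | 3 3 | 4 4 | 5 5
r | 3 | 3 2 | 2 3 | 3 4 | 4 5 | 5 6
  | 3 | 4 2 | 3 2 | 3 3 | 4 4 | 5 5
i | 4 | 4 3 | 3 3 | 2 4 | 4 5 | 5 6
  | 4 | 5 3 | 4 3 | 4 2 | 3 3 | 4 4
s | 5 | 5 4 | 4 4 | 4 3 | 3 4 | 4 5
  | 5 | 6 4 | 5 4 | 5 3 | 4 3 | 4 4

The edit distance between paris and alice is the bottom-right value of the last cell: 4.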
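The edit distance for Q11 can be cross-checked with a minimal sketch of the standard dynamic-programming (Levenshtein) computation, assuming unit cost for insert, delete and substitute. The function name `edit_distance_matrix` is illustrative.

```python
def edit_distance_matrix(s, t):
    # m[i][j] is the edit distance between s[:i] and t[:j],
    # with unit cost for insert, delete and substitute.
    m = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        m[i][0] = i
    for j in range(len(t) + 1):
        m[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            m[i][j] = min(
                m[i - 1][j - 1] + cost,  # copy or substitute (diagonal)
                m[i - 1][j] + 1,         # from the cell above
                m[i][j - 1] + 1,         # from the cell to the left
            )
    return m

matrix = edit_distance_matrix("paris", "alice")
for row in matrix:
    print(row)
print(matrix[-1][-1])  # 4
```

The last entry of the last row is the edit distance between the two full strings, here 4.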

Q12. Starting from the following documents collection, build the documents-terms incidence
matrix as required by the Boolean model

d1 = “Big cats are nice and funny”

d2 = “Small dogs are better than big dogs”

d3 = “Small cats are afraid of small dogs”

d4 = “Big cats are not afraid of small dogs”

d5 = “Funny cats are not afraid of small dogs”


Q13. An IR system returns 3 relevant documents, and 2 irrelevant documents. There are a total of
8 relevant documents in the collection. What is the precision of the system on this search, and
what is its recall?

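SOLUTION. Here tp = 3 (relevant documents returned), fp = 2 (irrelevant documents returned), and fn = 8 - 3 = 5 (relevant documents not returned).

The precision is given by tp/(tp+fp) = 3/(3+2) = 3/5 = 0.6

The recall is given by tp/(tp+fn) = 3/(3+5) = 3/8 = 0.375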
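The same arithmetic as a short sketch, with the variable names chosen for illustration:

```python
# Counts from Q13.
relevant_returned = 3      # true positives (tp)
irrelevant_returned = 2    # false positives (fp)
total_relevant = 8         # tp + fn

tp = relevant_returned
fp = irrelevant_returned
fn = total_relevant - relevant_returned  # relevant documents that were missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(precision)  # 0.6
print(recall)     # 0.375
```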
