
UNIVERSITY OF OTTAWA

FACULTY OF ENGINEERING
SCHOOL OF IT AND ENGINEERING
CSI 4107
Midterm
March 2, 2005, 4-5:30 pm
Examiner: Diana Inkpen

Name:
Student Number:
Total marks: 48
Duration: 80 minutes
Total Number of pages: 9

Important Regulations:
1. Students are allowed to bring in a page of notes (written on one side).
2. Calculators are allowed.
3. A student identification card (or other photo ID and signature) is required.
4. An attendance sheet shall be circulated and should be signed by each student.
5. Please answer all questions on this paper, in the indicated spaces.

Marks:
A: / 13
B: / 4
C: / 10
D: / 10
E: / 10
Total: / 47

Part A
Short answers and explanations.

[13 marks]

1. (2 marks) Explain the difference between an information retrieval system and a search
engine.

- a search engine contains a crawler to collect web pages
- the scale is much larger (large collection, efficiency issues)
- the collection is dynamic: new pages appear, some pages disappear
- HTML format can be used in weighting (headings, large font, etc.)

2. (2 marks) Why is tf-idf a good weighting scheme? Why are inverse document
frequencies (idf weights) expected to improve IR performance when added to term
frequencies (tf)? (Remember that the idf value for a term is computed from the number of
documents in which it appears.)

- idf gives higher weight to terms that appear in few documents and therefore are
likely to be important in those documents.
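For illustration only (not part of the exam), here is a minimal Python sketch of tf-idf weighting on the small collection from Part C, assuming raw counts for tf and the common log(N/df) form for idf; the function name tfidf_vectors is just for this example:

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        # tf = raw count of the term in the document, idf = log(N / df)
        N = len(docs)
        df = Counter()                          # document frequency of each term
        for doc in docs:
            df.update(set(doc))                 # count each term once per document
        return [{t: tf[t] * math.log(N / df[t]) for t in tf}
                for tf in (Counter(doc) for doc in docs)]

    docs = [["red", "green", "rainbow"], ["red", "green", "blue"], ["yellow", "rainbow"]]
    print(tfidf_vectors(docs))                  # terms in few documents get higher weights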

3. (2 marks) Explain the difference between relevance feedback and pseudo-relevance
feedback. Which one do you think would achieve better retrieval performance? Why?

- relevance feedback asks a user to judge the first N answers to a query in order to
revise the query for a better search; pseudo-relevance feedback blindly assumes
that the first N documents are relevant.
- relevance feedback is likely to achieve higher performance because the
judgements for the N documents won't be incorrect.

4. (2 marks) In IR systems, a possible pre-processing step is stemming the words.
Do you think the performance of the system (the average precision) would be higher with
or without stemming? Why?

Usually the performance is higher with stemming.
- stemming allows for higher recall by retrieving inflected forms (plurals, verb forms, etc.)
without much loss of precision.

5. (3 marks) Compute the edit distance between the following strings. Remember that the
edit distance is the minimum number of deletions, insertions and substitutions needed to
transform the first string into the second.
How would you normalize the score? Why is the normalization needed?
String 1: abracadabra
String 2: nabucodor

Edit distance = 7
Normalize by dividing by the length of the longer string.
Why: to make comparisons fair when two string pairs need the same number of deletions,
insertions and substitutions but have different lengths. For shorter strings, the same
number of edits should count as a larger (normalized) distance.
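For illustration, a minimal Python sketch of the standard dynamic-programming computation (not required on the exam); on the two strings above it returns 7, and 7/11 after normalizing by the longer length:

    def edit_distance(s1, s2):
        # d[i][j] = edit distance between s1[:i] and s2[:j]
        m, n = len(s1), len(s2)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i                             # delete all of s1[:i]
        for j in range(n + 1):
            d[0][j] = j                             # insert all of s2[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution or match
        return d[m][n]

    dist = edit_distance("abracadabra", "nabucodor")
    print(dist, dist / max(len("abracadabra"), len("nabucodor")))   # 7  0.636...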

6. (2 marks) Below is a sample robot META tag in the HEAD section of an HTML
document. Explain what this tag means.
<meta name="robots" content="index,nofollow">

- spiders are allowed to index the web page but not to follow the links in it

Part B

[4 marks]

Assume that you are given a query vector q=(2,0,3,1,0), three documents identified as
relevant by a user: d1, d2, d3, and two irrelevant documents: d4, d5.
d1 = (3,1,2,1,0)
d2 = (4,1,3,2,2)
d3 = (1,0,5,0,3)
d4 = (1,3,0,1,2)
d5 = (0,4,0,2,2)
Compute the modified query, using the Ide regular method. Remember that the Ide
regular method is given by the formula:
qm = α·q + β·Σ_{dj in Dr} dj − γ·Σ_{dj in Dn} dj

where Dr is the set of the known relevant documents and Dn is the set of irrelevant documents.
Use equal weight for the original query, the relevant documents, and the irrelevant ones:
α = β = γ = 1.
qm = q + (d1 + d2 + d3) − (d4 + d5)
   = (2,0,3,1,0) + (8,2,10,3,5) − (1,7,0,3,4)
   = (9, -5, 13, 1, 1)
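As a quick check (not part of the exam), a small Python sketch of the Ide regular update with alpha = beta = gamma = 1:

    q      = [2, 0, 3, 1, 0]
    rel    = [[3, 1, 2, 1, 0], [4, 1, 3, 2, 2], [1, 0, 5, 0, 3]]   # d1, d2, d3
    nonrel = [[1, 3, 0, 1, 2], [0, 4, 0, 2, 2]]                    # d4, d5

    # add the sum of relevant vectors, subtract the sum of irrelevant vectors
    qm = [qi + sum(d[i] for d in rel) - sum(d[i] for d in nonrel)
          for i, qi in enumerate(q)]
    print(qm)   # [9, -5, 13, 1, 1]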

Part C
[10 marks]
Consider a very small collection C that consists of the following three documents:
d1: red green rainbow
d2: red green blue
d3: yellow rainbow
For all the documents, calculate the tf scores for all the terms in C. Assume that the words
in the vectors are ordered alphabetically. Ignore idf values and normalization by
maximum frequency.
Given the following query: blue green rainbow, calculate the tf vector for the query,
and compute the score of each document in C relative to this query, using the cosine
similarity measure. (Don't forget to compute the lengths of the vectors.)
What is the final order in which the documents are presented as the result for the query?
       blue   green   rainbow   red   yellow   length
d1     0      1       1         1     0        sqrt(3)
d2     1      1       0         1     0        sqrt(3)
d3     0      0       1         0     1        sqrt(2)
q      1      1       1         0     0        sqrt(3)

cos(d1,q) = (1+1) / (sqrt(3) * sqrt(3)) = 2/3
cos(d2,q) = (1+1) / (sqrt(3) * sqrt(3)) = 2/3
cos(d3,q) = 1 / (sqrt(3) * sqrt(2)) = 0.408

=> d1 and d2 are returned first (in any order), d3 is third.
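A minimal Python sketch of the same cosine computation (for illustration only; the term order is blue, green, rainbow, red, yellow):

    import math

    def cosine(u, v):
        # cosine similarity = dot product / (length of u * length of v)
        dot = sum(a * b for a, b in zip(u, v))
        len_u = math.sqrt(sum(a * a for a in u))
        len_v = math.sqrt(sum(b * b for b in v))
        return dot / (len_u * len_v)

    q  = [1, 1, 1, 0, 0]
    d1 = [0, 1, 1, 1, 0]
    d2 = [1, 1, 0, 1, 0]
    d3 = [0, 0, 1, 0, 1]
    for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]:
        print(name, round(cosine(d, q), 3))   # d1 0.667, d2 0.667, d3 0.408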

Part D

[10 marks]

Given a query q, where the relevant documents are d1, d3, d6, d7, d10, d12, d13,
an IR system retrieves the following ranking: d2, d6, d5, d8, d3, d12, d11, d14, d7, d13.
1. What are the precision and recall for this ranking at each retrieved document?
Doc    Recall          Precision
d2     0/7 = 0.00      0/1 = 0.00
d6     1/7 = 0.14      1/2 = 0.50
d5     1/7 = 0.14      1/3 = 0.33
d8     1/7 = 0.14      1/4 = 0.25
d3     2/7 = 0.28      2/5 = 0.40
d12    3/7 = 0.42      3/6 = 0.50
d11    3/7 = 0.42      3/7 = 0.42
d14    3/7 = 0.42      3/8 = 0.37
d7     4/7 = 0.57      4/9 = 0.44
d13    5/7 = 0.71      5/10 = 0.50
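For illustration, a small Python sketch that reproduces the recall and precision columns above directly from the ranking:

    relevant = {"d1", "d3", "d6", "d7", "d10", "d12", "d13"}
    ranking  = ["d2", "d6", "d5", "d8", "d3", "d12", "d11", "d14", "d7", "d13"]

    found = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
        # recall = relevant retrieved so far / all relevant; precision = relevant retrieved / rank
        print(f"{doc}: recall = {found}/{len(relevant)}, precision = {found}/{rank}")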

2. Interpolate the precision scores at 11 recall levels.


Remember that the interpolated precision at the j-th standard recall level is the maximum
known precision at any recall level between the j-th and (j+1)-th level:
P(rj) = max { P(r) : rj <= r <= rj+1 }

Recall    Interpolated Precision
0%        0.5
10%       0.5
20%       0.5
30%       0.5
40%       0.5
50%       0.5
60%       0.5
70%       0.5
80%       0
90%       0
100%      0
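For illustration, a Python sketch using the usual monotone rule (the maximum precision at any recall level greater than or equal to the standard level), which reproduces the values above; the (recall, precision) pairs are taken from the table in question 1:

    points = [(0/7, 0/1), (1/7, 1/2), (1/7, 1/3), (1/7, 1/4), (2/7, 2/5),
              (3/7, 3/6), (3/7, 3/7), (3/7, 3/8), (4/7, 4/9), (5/7, 5/10)]

    for j in range(11):                 # standard recall levels 0.0, 0.1, ..., 1.0
        level = j / 10
        # best precision among all points with recall at or above this level
        best = max((p for r, p in points if r >= level), default=0.0)
        print(f"{level:.0%}: {best:.2f}")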

3. Why is interpolation of precision scores necessary when evaluating an IR system?


- so that precision can be compared and averaged over all queries at the same standard recall levels

4. What is the value of the R-precision? (The precision at the first R retrieved documents,
where R is the total number of relevant documents.)

R-Precision = 3/7

5. Assume we have two users that judged the documents before the search. The first user
knew before the search that d3, d6, d7, d10 are relevant to the query, and the second
user knew that d1, d3, d12 are relevant to the query. What is the coverage ratio and the
novelty ratio for these two users? (Remember that the coverage ratio is the proportion of
relevant items retrieved out of the total relevant documents known to a user prior to the
search. The novelty ratio is the proportion of retrieved items, judged relevant by the user,
of which they were previously unaware.)
          Coverage ratio    Novelty ratio
User 1    3/4               2/5
User 2    2/3               3/5
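For illustration, a small Python sketch of these two ratios; relevant_retrieved is the set of relevant documents that appear in the ranking above:

    relevant_retrieved = {"d6", "d3", "d12", "d7", "d13"}
    known = {"User 1": {"d3", "d6", "d7", "d10"},
             "User 2": {"d1", "d3", "d12"}}

    for user, known_relevant in known.items():
        # coverage: relevant retrieved that the user already knew / all docs the user knew
        coverage = len(relevant_retrieved & known_relevant) / len(known_relevant)
        # novelty: relevant retrieved that the user did not know / all relevant retrieved
        novelty = len(relevant_retrieved - known_relevant) / len(relevant_retrieved)
        print(user, round(coverage, 2), round(novelty, 2))
    # User 1: coverage 0.75 (= 3/4), novelty 0.4 (= 2/5)
    # User 2: coverage 0.67 (= 2/3), novelty 0.6 (= 3/5)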

Part E

[10 marks]

Consider the following web pages and the set of web pages they link to:

Page A points to pages B, C, and D.
Page B points to pages A and C. (There was a typo in the exam: "A and B"; it is OK if you used that graph.)
Page C points to page D.
Page D points to page A.

E.1. Run the Hubs and Authorities algorithm on this subgraph of pages. Show the
authority and hub scores for each page for two iterations. Present the results in the order
A, B, C, D. To simplify the calculation, do not normalize the scores.
Remember that the Hubs and Authorities algorithm can be described in pseudo-code as:
  Initialize for all p in S: ap = hp = 1
  For i = 1 to No_iterations:
    For all p in S: ap = sum of hq over all pages q that point to p    (update authority scores)
    For all p in S: hp = sum of aq over all pages q that p points to   (update hub scores)

     In     Out      It 0       It 1       It 2
                     a    h     a    h     a    h
A    B,D    B,C,D    1    1     2    3     3    5
B    A      A,C      1    1     1    2     3    4
C    A,B    D        1    1     2    1     5    2
D    A,C    A        1    1     2    1     4    2
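For illustration, a small Python sketch of the unnormalized computation; to reproduce the table, both updates use the scores from the previous iteration:

    out_links = {"A": ["B", "C", "D"], "B": ["A", "C"], "C": ["D"], "D": ["A"]}
    in_links  = {p: [q for q, outs in out_links.items() if p in outs] for p in out_links}

    a = {p: 1 for p in out_links}
    h = {p: 1 for p in out_links}
    for it in range(1, 3):
        new_a = {p: sum(h[q] for q in in_links[p])  for p in out_links}   # authority update
        new_h = {p: sum(a[q] for q in out_links[p]) for p in out_links}   # hub update
        a, h = new_a, new_h
        print(f"iteration {it}: a = {a}, h = {h}")
    # iteration 1: a = A:2 B:1 C:2 D:2, h = A:3 B:2 C:1 D:1
    # iteration 2: a = A:3 B:3 C:5 D:4, h = A:5 B:4 C:2 D:2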

E.2. For the same graph, run the PageRank algorithm for two iterations.
Remember that one way to describe the algorithm is:
PR(A) = (1-d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where T1 ... Tn are the pages that point to page A (the incoming links), d is the damping factor
(usually d = 0.85; you can consider it 1 for simplicity), C(A) is the number of links going out of
page A, and PR(A) is the PageRank of page A. NOTE: the sum of all pages' PageRanks is 1 (but
you can ignore the normalization step for simplicity).

How many iterations do you need for convergence?


One possible solution:
P(A) = 1
P(B) = P(A) / 3 = 1/3
P(C) = P(A) / 3 + P(B) / 2 = 1/3 + 1/6 = 1/2
P(D) = P(A) / 3 + P(C) / 1 = 1/3 + 1/2 = 5/6

P(A) = P(B) / 2 + P(D) / 1 = 1/6 + 5/6 = 1

A second pass gives the same values for all pages, so in this case one iteration is
sufficient for convergence.
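For illustration, a small Python sketch of this simplified computation (d = 1, no normalization), updating the pages in the same order as the solution above:

    out_links = {"A": ["B", "C", "D"], "B": ["A", "C"], "C": ["D"], "D": ["A"]}
    in_links  = {p: [q for q, outs in out_links.items() if p in outs] for p in out_links}

    pr = {p: 1.0 for p in out_links}
    for it in range(1, 3):
        for p in ["B", "C", "D", "A"]:      # compute B, C, D from P(A), then recompute P(A)
            pr[p] = sum(pr[q] / len(out_links[q]) for q in in_links[p])
        print(it, {p: round(v, 3) for p, v in pr.items()})
    # 1: A = 1.0, B = 0.333, C = 0.5, D = 0.833
    # 2: the same values, so one iteration is enough for convergence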
