Midterm2006 Sol Csi4107
FACULTY OF ENGINEERING
SCHOOL OF IT AND ENGINEERING
CSI 4107
Midterm
March 2, 2005, 4-5:30 pm
Examiner: Diana Inkpen
Name
Student Number
Total marks: 48
Duration: 80 minutes
Total number of pages: 9
Important Regulations:
1. Students are allowed to bring in a page of notes (written on one side).
2. Calculators are allowed.
3. A student identification card (or another photo ID and signature) is required.
4. An attendance sheet shall be circulated and should be signed by each student.
5. Please answer all questions on this paper, in the indicated spaces.
Marks:
A: / 13    B: / 4    C: / 10    D: / 10    E: / 10    Total: / 47
Part A
Short answers and explanations.
[18 marks]
1. (2 marks) Explain the difference between an information retrieval system and a search
engine.
2. (2 marks) Why is tf-idf a good weighting scheme? Why are inverse document
frequencies (idf weights) expected to improve IR performance when combined with term
frequencies (tf)? (Remember that the idf value for a term is computed from the number of
documents in which it appears.)
- idf gives higher weight to terms that appear in few documents and therefore are
likely to be important in those documents.
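The answer can be checked numerically. Below is a minimal sketch on a made-up four-document collection, using the common idf = log(N/df) variant (an assumption — the exam does not fix the exact formula):

```python
import math

# Toy collection (hypothetical; not from the exam).
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "the cat and the dog",
    "quantum retrieval",
]
N = len(docs)

def df(term):
    """Document frequency: number of documents containing the term."""
    return sum(term in d.split() for d in docs)

def tfidf(term, doc):
    """tf * idf, with the common idf = log(N / df) variant (an assumption)."""
    return doc.split().count(term) * math.log(N / df(term))

# "the" occurs in 3 of 4 documents -> small idf; "quantum" in 1 of 4 -> large
# idf, so "quantum" is weighted higher even though "the" is more frequent.
```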
3. (2 marks) Explain the difference between relevance feedback and pseudo-relevance
feedback. Which one do you think would achieve better retrieval performance? Why?
-
5. (3 marks) Compute the edit distance between the following strings. Remember that the
edit distance is the minimum number of deletions, insertions and substitutions needed to
transform the first string into the second.
How would you normalize the score? Why is the normalization needed?
String 1: abracadabra
String 2: nabucodor
Edit distance = 7
Normalize by dividing by the length of the longer string.
Why: raw edit distance is not comparable across string pairs of different lengths.
The same number of deletions, insertions, and substitutions should count as a
greater difference for short strings than for long ones, so the normalized score
is higher when the strings are short.
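The standard dynamic-programming computation confirms the answer of 7; a minimal sketch:

```python
def edit_distance(s1, s2):
    """Levenshtein distance: minimum number of deletions, insertions,
    and substitutions needed to transform s1 into s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] = distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

d = edit_distance("abracadabra", "nabucodor")               # 7
normalized = d / max(len("abracadabra"), len("nabucodor"))  # 7/11
```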
6. (2 marks) Below is a sample robot META tag in the HEAD section of an HTML
document. Explain what this tag means.
<meta name="robots" content="index,nofollow">
- spiders are allowed to index the webpage but not to follow the links in it
Part B
[4 marks]
Assume that you are given a query vector q=(2,0,3,1,0), three documents identified as
relevant by a user: d1, d2, d3, and two irrelevant documents: d4, d5.
d1 = (3,1,2,1,0)
d2 = (4,1,3,2,2)
d3 = (1,0,5,0,3)
d4 = (1,3,0,1,2)
d5 = (0,4,0,2,2)
Compute the modified query, using the Ide regular method. Remember that the Ide
regular method is given by the formula:

    qm = q + Σ_{dj ∈ Dr} dj − Σ_{dj ∈ Dn} dj

where Dr is the set of the known relevant documents and Dn is the set of irrelevant
documents. Use equal weight for the original query, the relevant documents, and the
irrelevant ones: α = β = γ = 1.

qm =
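Since the Ide regular update is plain vector arithmetic, the modified query can be computed directly; a sketch using the vectors above (the numeric result is not stated in the extracted solution, so it is computed here):

```python
q          = [2, 0, 3, 1, 0]
relevant   = [[3, 1, 2, 1, 0], [4, 1, 3, 2, 2], [1, 0, 5, 0, 3]]  # d1, d2, d3
irrelevant = [[1, 3, 0, 1, 2], [0, 4, 0, 2, 2]]                   # d4, d5

def ide_regular(q, rel, nonrel):
    """Ide regular: qm = q + sum(relevant) - sum(irrelevant), all weights 1."""
    qm = list(q)
    for d in rel:
        qm = [a + b for a, b in zip(qm, d)]
    for d in nonrel:
        qm = [a - b for a, b in zip(qm, d)]
    return qm

qm = ide_regular(q, relevant, irrelevant)  # [9, -5, 13, 1, 1]
```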
Part C
[10 marks]
Consider a very small collection C that consists of the following three documents:
d1: red green rainbow
d2: red green blue
d3: yellow rainbow
For all the documents, calculate the tf scores for all the terms in C. Assume that the words
in the vectors are ordered alphabetically. Ignore idf values and normalization by
maximum frequency.
Given the following query: blue green rainbow, calculate the tf vector for the query,
and compute the score of each document in C relative to this query, using the cosine
similarity measure. (Don't forget to compute the lengths of the vectors).
What is the final order in which the documents are presented as result to the query?
        blue  green  rainbow  red  yellow | length
d1       0     1       1       1     0    | sqrt(3)
d2       1     1       0       1     0    | sqrt(3)
d3       0     0       1       0     1    | sqrt(2)
q        1     1       1       0     0    | sqrt(3)
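The cosine scores and the resulting ranking follow from the tf vectors; a short script to check them (the ranking itself is not spelled out in the extracted solution):

```python
import math

# tf vectors over the alphabetical vocabulary (blue, green, rainbow, red, yellow).
q  = [1, 1, 1, 0, 0]   # blue green rainbow
d1 = [0, 1, 1, 1, 0]   # red green rainbow
d2 = [1, 1, 0, 1, 0]   # red green blue
d3 = [0, 0, 1, 0, 1]   # yellow rainbow

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

scores = {name: cosine(q, d) for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]}
# d1 and d2 tie at 2/3; d3 scores 1/sqrt(6); ranking: d1 = d2 > d3
```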
Part D
[10 marks]
Given a query q for which the relevant documents are d1, d3, d6, d7, d10, d12, d13,
an IR system retrieves the following ranking: d2, d6, d5, d8, d3, d12, d11, d14, d7, d13.
1. What are the precision and recall for this ranking at each retrieved document?

Rank  Doc   Recall        Precision
 1    d2    0/7 = 0.00    0/1 = 0.00
 2    d6    1/7 = 0.14    1/2 = 0.50
 3    d5    1/7 = 0.14    1/3 = 0.33
 4    d8    1/7 = 0.14    1/4 = 0.25
 5    d3    2/7 = 0.29    2/5 = 0.40
 6    d12   3/7 = 0.43    3/6 = 0.50
 7    d11   3/7 = 0.43    3/7 = 0.43
 8    d14   3/7 = 0.43    3/8 = 0.38
 9    d7    4/7 = 0.57    4/9 = 0.44
10    d13   5/7 = 0.71    5/10 = 0.50
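The two columns can be reproduced mechanically; a sketch:

```python
relevant = {"d1", "d3", "d6", "d7", "d10", "d12", "d13"}
ranking  = ["d2", "d6", "d5", "d8", "d3", "d12", "d11", "d14", "d7", "d13"]

precisions, recalls = [], []
hits = 0  # number of relevant documents seen so far
for rank, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
    precisions.append(hits / rank)           # precision at this rank
    recalls.append(hits / len(relevant))     # recall at this rank
# e.g. precision at rank 5 is 2/5 = 0.40; recall at rank 10 is 5/7
```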
Recall                  0%   10%  20%  30%  40%  50%  60%  70%  80%  90%  100%
Interpolated Precision  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0    0    0
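The 11-point values above follow from taking, at each standard recall level, the maximum precision achieved at any recall at or beyond that level; a sketch that reproduces them:

```python
relevant = {"d1", "d3", "d6", "d7", "d10", "d12", "d13"}
ranking  = ["d2", "d6", "d5", "d8", "d3", "d12", "d11", "d14", "d7", "d13"]

points, hits = [], 0
for rank, doc in enumerate(ranking, start=1):
    hits += doc in relevant
    points.append((hits / len(relevant), hits / rank))  # (recall, precision)

def interp(level):
    """Interpolated precision: max precision at any recall >= level."""
    eligible = [p for r, p in points if r >= level]
    return max(eligible) if eligible else 0.0

interpolated = [interp(level / 10) for level in range(11)]
# 0.5 at every level up to 70% recall, 0 beyond (max recall reached is 5/7)
```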
4. What is the value of the R-precision? (The precision at the first R retrieved
documents, where R is the total number of relevant documents.)
R-Precision
3/7
5. Assume we have two users that judged the documents before the search. The first user
knew before the search that d3, d6, d7, d10 are relevant to the query, and the second
user knew that d1, d3, d12 are relevant to the query. What are the coverage ratio and the
novelty ratio for these two users? (Remember that the coverage ratio is the proportion of
relevant items retrieved out of the total relevant documents known to a user prior to the
search. The novelty ratio is the proportion of retrieved items, judged relevant by the user,
of which they were previously unaware.)
         Coverage ratio   Novelty ratio
User 1        3/4              2/5
User 2        2/3              3/5
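Both ratios are simple set operations over the relevant documents that appear in the ranking; a sketch (variable names are mine):

```python
# Relevant documents that appear in the retrieved ranking
# (d2, d6, d5, d8, d3, d12, d11, d14, d7, d13).
relevant_retrieved = {"d6", "d3", "d12", "d7", "d13"}

def coverage_and_novelty(known):
    """Coverage: fraction of the user's known relevant docs that were retrieved.
    Novelty: fraction of retrieved relevant docs the user did not already know."""
    known_retrieved = relevant_retrieved & known
    new_retrieved   = relevant_retrieved - known
    return (len(known_retrieved) / len(known),
            len(new_retrieved) / len(relevant_retrieved))

user1 = coverage_and_novelty({"d3", "d6", "d7", "d10"})  # (3/4, 2/5)
user2 = coverage_and_novelty({"d1", "d3", "d12"})        # (2/3, 3/5)
```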
Part E
[10 marks]
E.1. Consider the following web pages and the set of web pages they link to:
For all p ∈ S:

    ap = Σ_{q: q→p} hq    (authority score of p: sum of hub scores of the pages pointing to p)
    hp = Σ_{q: p→q} aq    (hub score of p: sum of authority scores of the pages p points to)

[Figure and worked table lost in extraction: for each page, its In and Out link sets
(the recoverable fragments are B,D; B,C,D; A,C; A,B; A,C) and the a and h values at
iterations 0, 1, and 2.]
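Since the exam's link graph was a figure that did not survive extraction, here is the HITS update on a small hypothetical three-page graph, just to illustrate the iteration scheme:

```python
# Hypothetical graph (NOT the exam's): links[p] lists the pages p points to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)

auth = {p: 1.0 for p in pages}  # authority scores, initialized to 1
hub  = {p: 1.0 for p in pages}  # hub scores, initialized to 1

for _ in range(2):  # two iterations, as the exam asks
    # a_p = sum of hub scores of the pages linking to p (uses previous hubs)
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # h_p = sum of authority scores of the pages p links to (uses new authorities)
    hub  = {p: sum(auth[q] for q in links[p]) for p in pages}
```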
E.2. For the same graph, run the PageRank algorithm for two iterations.
Remember that one way to describe the algorithm is:

    PR(A) = (1-d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where T1 ... Tn are the pages that point to page A (the incoming links), d is the damping
factor (usually d = 0.85; you can take it to be 1 for simplicity), C(A) is the number of links
going out of page A, and PR(A) is the PageRank of page A. NOTE: the sum of all pages'
PageRank values is 1 (but you can ignore the normalization step for simplicity).
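A PageRank sketch on a hypothetical three-page graph (again, the exam's actual graph was a lost figure), taking d = 1 and skipping normalization as the question allows:

```python
# Hypothetical graph (NOT the exam's): links[p] lists the pages p points to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
d = 1.0  # damping factor, simplified to 1 as the question permits

pr = {p: 1.0 for p in pages}  # unnormalized starting scores

for _ in range(2):  # two iterations, updating all pages simultaneously
    pr = {p: (1 - d) + d * sum(pr[q] / len(links[q])
                               for q in pages if p in links[q])
          for p in pages}
```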