Midterm2006 Sol Csi4107
FACULTY OF ENGINEERING
SCHOOL OF IT AND ENGINEERING
CSI 4107
Midterm
March 2, 2005, 4-5:30 pm
Examiner: Diana Inkpen
Name
Student Number
Total marks: 48
Duration: 80 minutes
Total number of pages: 9
Important Regulations:
1. Students are allowed to bring in a page of notes (written on one side).
2. Calculators are allowed.
3. A student identification card (or another photo ID and signature) is required.
4. An attendance sheet shall be circulated and should be signed by each student.
5. Please answer all questions on this paper, in the indicated spaces.
Marks:
A: / 13    B: / 4    C: / 10    D: / 10    E: / 10    Total: / 47
Part A
Short answers and explanations.
[18 marks]
1. (2 marks) Explain the difference between an information retrieval system and a search
engine.
2. (2 marks) Why is tf-idf a good weighting scheme? Why are inverse document
frequencies (idf weights) expected to improve IR performance when combined with term
frequencies (tf)? (Remember that the idf value for a term is computed from the number of
documents in which it appears.)
- idf gives higher weight to terms that appear in few documents and therefore are
likely to be important in those documents.
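The answer can be checked numerically. Below is a minimal sketch on a made-up four-document collection, using the common idf = log(N/df) variant (an assumption — the exam does not fix the exact formula):

```python
import math

# Toy collection (hypothetical; not from the exam).
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "the cat and the dog",
    "quantum retrieval",
]
N = len(docs)

def df(term):
    """Document frequency: number of documents containing the term."""
    return sum(term in d.split() for d in docs)

def tfidf(term, doc):
    """tf * idf, with the common idf = log(N / df) variant (an assumption)."""
    return doc.split().count(term) * math.log(N / df(term))

# "the" occurs in 3 of 4 documents -> small idf; "quantum" in 1 of 4 -> large
# idf, so "quantum" is weighted higher even though "the" is more frequent.
```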
3. (2 marks) Explain the difference between relevance feedback and pseudo-relevance
feedback. Which one do you think would achieve better retrieval performance? Why?
-
5. (3 marks) Compute the edit distance between the following strings. Remember that the
edit distance is the minimum number of deletions, insertions and substitutions needed to
transform the first string into the second.
How would you normalize the score? Why is the normalization needed?
String 1: abracadabra
String 2: nabucodor
Edit distance = 7
Normalize by dividing by the length of the longer string.
Why: raw edit distance is not comparable across string pairs of different lengths.
The same number of deletions, insertions, and substitutions should count as a
greater difference for short strings than for long ones, so the normalized score
is higher when the strings are short.
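The standard dynamic-programming computation confirms the answer of 7; a minimal sketch:

```python
def edit_distance(s1, s2):
    """Levenshtein distance: minimum number of deletions, insertions,
    and substitutions needed to transform s1 into s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] = distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

d = edit_distance("abracadabra", "nabucodor")               # 7
normalized = d / max(len("abracadabra"), len("nabucodor"))  # 7/11
```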
6. (2 marks) Below is a sample robot META tag in the HEAD section of an HTML
document. Explain what this tag means.
<meta name="robots" content="index,nofollow">
- spiders are allowed to index the webpage but not to follow the links in it
Part B
[4 marks]
Assume that you are given a query vector q=(2,0,3,1,0), three documents identified as
relevant by a user: d1, d2, d3, and two irrelevant documents: d4, d5.
d1 = (3,1,2,1,0)
d2 = (4,1,3,2,2)
d3 = (1,0,5,0,3)
d4 = (1,3,0,1,2)
d5 = (0,4,0,2,2)
Compute the modified query, using the Ide regular method. Remember that the Ide
regular method is given by the formula:

    qm = q + Σ_{dj ∈ Dr} dj − Σ_{dj ∈ Dn} dj

where Dr is the set of the known relevant documents and Dn is the set of irrelevant
documents. Use equal weight for the original query, the relevant documents, and the
irrelevant ones: α = β = γ = 1.

qm =
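Since the Ide regular update is plain vector arithmetic, the modified query can be computed directly; a sketch using the vectors above (the numeric result is not stated in the extracted solution, so it is computed here):

```python
q          = [2, 0, 3, 1, 0]
relevant   = [[3, 1, 2, 1, 0], [4, 1, 3, 2, 2], [1, 0, 5, 0, 3]]  # d1, d2, d3
irrelevant = [[1, 3, 0, 1, 2], [0, 4, 0, 2, 2]]                   # d4, d5

def ide_regular(q, rel, nonrel):
    """Ide regular: qm = q + sum(relevant) - sum(irrelevant), all weights 1."""
    qm = list(q)
    for d in rel:
        qm = [a + b for a, b in zip(qm, d)]
    for d in nonrel:
        qm = [a - b for a, b in zip(qm, d)]
    return qm

qm = ide_regular(q, relevant, irrelevant)  # [9, -5, 13, 1, 1]
```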
Part C
[10 marks]
Consider a very small collection C that consists of the following three documents:
d1: red green rainbow
d2: red green blue
d3: yellow rainbow
For all the documents, calculate the tf scores for all the terms in C. Assume that the words
in the vectors are ordered alphabetically. Ignore idf values and normalization by
maximum frequency.
Given the following query: blue green rainbow, calculate the tf vector for the query,
and compute the score of each document in C relative to this query, using the cosine
similarity measure. (Don't forget to compute the lengths of the vectors).
What is the final order in which the documents are presented as result to the query?
        blue  green  rainbow  red  yellow | length
d1       0     1       1       1     0    | sqrt(3)
d2       1     1       0       1     0    | sqrt(3)
d3       0     0       1       0     1    | sqrt(2)
q        1     1       1       0     0    | sqrt(3)
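The cosine scores and the resulting ranking follow from the tf vectors; a short script to check them (the ranking itself is not spelled out in the extracted solution):

```python
import math

# tf vectors over the alphabetical vocabulary (blue, green, rainbow, red, yellow).
q  = [1, 1, 1, 0, 0]   # blue green rainbow
d1 = [0, 1, 1, 1, 0]   # red green rainbow
d2 = [1, 1, 0, 1, 0]   # red green blue
d3 = [0, 0, 1, 0, 1]   # yellow rainbow

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

scores = {name: cosine(q, d) for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]}
# d1 and d2 tie at 2/3; d3 scores 1/sqrt(6); ranking: d1 = d2 > d3
```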
Part D
[10 marks]
Given a query q for which the relevant documents are d1, d3, d6, d7, d10, d12, d13,
an IR system retrieves the following ranking: d2, d6, d5, d8, d3, d12, d11, d14, d7, d13.
1. What are the precision and recall for this ranking at each retrieved document?

Rank  Doc   Recall        Precision
 1    d2    0/7 = 0.00    0/1 = 0.00
 2    d6    1/7 = 0.14    1/2 = 0.50
 3    d5    1/7 = 0.14    1/3 = 0.33
 4    d8    1/7 = 0.14    1/4 = 0.25
 5    d3    2/7 = 0.29    2/5 = 0.40
 6    d12   3/7 = 0.43    3/6 = 0.50
 7    d11   3/7 = 0.43    3/7 = 0.43
 8    d14   3/7 = 0.43    3/8 = 0.38
 9    d7    4/7 = 0.57    4/9 = 0.44
10    d13   5/7 = 0.71    5/10 = 0.50
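The two columns can be reproduced mechanically; a sketch:

```python
relevant = {"d1", "d3", "d6", "d7", "d10", "d12", "d13"}
ranking  = ["d2", "d6", "d5", "d8", "d3", "d12", "d11", "d14", "d7", "d13"]

precisions, recalls = [], []
hits = 0  # number of relevant documents seen so far
for rank, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
    precisions.append(hits / rank)           # precision at this rank
    recalls.append(hits / len(relevant))     # recall at this rank
# e.g. precision at rank 5 is 2/5 = 0.40; recall at rank 10 is 5/7
```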
Recall                  0%   10%  20%  30%  40%  50%  60%  70%  80%  90%  100%
Interpolated Precision  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0    0    0
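The 11-point values above follow from taking, at each standard recall level, the maximum precision achieved at any recall at or beyond that level; a sketch that reproduces them:

```python
relevant = {"d1", "d3", "d6", "d7", "d10", "d12", "d13"}
ranking  = ["d2", "d6", "d5", "d8", "d3", "d12", "d11", "d14", "d7", "d13"]

points, hits = [], 0
for rank, doc in enumerate(ranking, start=1):
    hits += doc in relevant
    points.append((hits / len(relevant), hits / rank))  # (recall, precision)

def interp(level):
    """Interpolated precision: max precision at any recall >= level."""
    eligible = [p for r, p in points if r >= level]
    return max(eligible) if eligible else 0.0

interpolated = [interp(level / 10) for level in range(11)]
# 0.5 at every level up to 70% recall, 0 beyond (max recall reached is 5/7)
```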
4. What is the value of the R-precision? (The precision at the first R retrieved
documents, where R is the total number of relevant documents.)
R-Precision
3/7
5. Assume we have two users that judged the documents before the search. The first user
knew before the search that d3, d6, d7, d10 are relevant to the query, and the second
user knew that d1, d3, d12 are relevant to the query. What are the coverage ratio and the
novelty ratio for these two users? (Remember that the coverage ratio is the proportion of
relevant items retrieved out of the total relevant documents known to a user prior to the
search. The novelty ratio is the proportion of retrieved items, judged relevant by the user,
of which they were previously unaware.)
         Coverage ratio   Novelty ratio
User 1        3/4              2/5
User 2        2/3              3/5
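Both ratios are simple set operations over the relevant documents that appear in the ranking; a sketch (variable names are mine):

```python
# Relevant documents that appear in the retrieved ranking
# (d2, d6, d5, d8, d3, d12, d11, d14, d7, d13).
relevant_retrieved = {"d6", "d3", "d12", "d7", "d13"}

def coverage_and_novelty(known):
    """Coverage: fraction of the user's known relevant docs that were retrieved.
    Novelty: fraction of retrieved relevant docs the user did not already know."""
    known_retrieved = relevant_retrieved & known
    new_retrieved   = relevant_retrieved - known
    return (len(known_retrieved) / len(known),
            len(new_retrieved) / len(relevant_retrieved))

user1 = coverage_and_novelty({"d3", "d6", "d7", "d10"})  # (3/4, 2/5)
user2 = coverage_and_novelty({"d1", "d3", "d12"})        # (2/3, 3/5)
```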
Part E
[10 marks]
E.1. Consider the following web pages and the set of web pages they link to:
For all p ∈ S:

    ap = Σ_{q: q→p} hq    (authority score of p: sum of hub scores of the pages pointing to p)
    hp = Σ_{q: p→q} aq    (hub score of p: sum of authority scores of the pages p points to)

[Figure and worked table lost in extraction: for each page, its In and Out link sets
(the recoverable fragments are B,D; B,C,D; A,C; A,B; A,C) and the a and h values at
iterations 0, 1, and 2.]
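Since the exam's link graph was a figure that did not survive extraction, here is the HITS update on a small hypothetical three-page graph, just to illustrate the iteration scheme:

```python
# Hypothetical graph (NOT the exam's): links[p] lists the pages p points to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)

auth = {p: 1.0 for p in pages}  # authority scores, initialized to 1
hub  = {p: 1.0 for p in pages}  # hub scores, initialized to 1

for _ in range(2):  # two iterations, as the exam asks
    # a_p = sum of hub scores of the pages linking to p (uses previous hubs)
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # h_p = sum of authority scores of the pages p links to (uses new authorities)
    hub  = {p: sum(auth[q] for q in links[p]) for p in pages}
```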
E.2. For the same graph, run the PageRank algorithm for two iterations.
Remember that one way to describe the algorithm is:

    PR(A) = (1-d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where T1 ... Tn are the pages that point to page A (the incoming links), d is the damping
factor (usually d = 0.85; you can take it to be 1 for simplicity), C(A) is the number of links
going out of page A, and PR(A) is the PageRank of page A. NOTE: the sum of all pages'
PageRank values is 1 (but you can ignore the normalization step for simplicity).
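A PageRank sketch on a hypothetical three-page graph (again, the exam's actual graph was a lost figure), taking d = 1 and skipping normalization as the question allows:

```python
# Hypothetical graph (NOT the exam's): links[p] lists the pages p points to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
d = 1.0  # damping factor, simplified to 1 as the question permits

pr = {p: 1.0 for p in pages}  # unnormalized starting scores

for _ in range(2):  # two iterations, updating all pages simultaneously
    pr = {p: (1 - d) + d * sum(pr[q] / len(links[q])
                               for q in pages if p in links[q])
          for p in pages}
```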