
IR - Midsem Question Paper - 2024 - Solutionfull

This document is an examination paper for the B. Tech. (SOT-CE) course on Information Retrieval at Pandit Deendayal Energy University, dated September 23, 2024. It includes various questions related to information retrieval concepts, inverted indexes, edit distance, SOUNDEX algorithm, query processing, and ranking documents using cosine similarity. The exam consists of mandatory questions with internal choices and has a maximum score of 50 marks.


Roll No. ___________
Pandit Deendayal Energy University
Mid Semester Examination - September 2024
B. Tech. (SOT-CE) (Elective)
Semester – VII
Date: 23/09/2024
Course Name : Information Retrieval Time: 2 hours
Course Code : 20CP417T Max. Marks: 50
Instructions:
1. Do not write anything other than your roll number on the question paper.
2. Assume suitable data wherever essential and mention it clearly.
3. Writing appropriate units, nomenclature, and drawing neat sketches/schematics wherever required is an integral part of
the answer.
NOTE: All questions are mandatory to attempt; however, some internal choices are given.
Q1. Explain the term Information Retrieval and illustrate its goal. How is it different from Database Retrieval? [Marks: 4*1 = 4] [CO2]

Ans: Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). [1 mark]

Goals: To retrieve documents with information that is relevant to the user’s information need and helps the user complete a task. A good retrieval model will find documents that are likely to be considered relevant by the person who submitted the query. [1 mark]

Difference: [2 marks]

| IR | DR |
| --- | --- |
| Deals with unstructured or semi-structured data, such as documents, web pages, emails, or multimedia. | Operates on structured data, typically organized in tables with predefined schemas (like in relational databases). |
| Uses keyword-based queries or natural language input. | Uses formal query languages like SQL. |
| Queries can be vague or ambiguous, and the system ranks results by relevance using algorithms. | Queries are well-defined and expect an exact match, with structured conditions. |
| The system tries to return results that are most relevant to the query, even if they are not exact matches. | Returns all data that matches the query exactly, with no concept of ranking by relevance. |
| Data is indexed as inverted index. | Data is stored in structured formats like tables, with rows and columns. |

Q2. With respect to inverted index answer the following (any 4): [Marks: 4*3 = 12] [CO1]
1. What are the possible components of a posting list?
The components of a posting list are: Doc ID, term frequency, positional information, skip pointers. [1 mark, and 2 marks with full definition of each]
2. How do positional indexes differ from standard indexes? Explain with an example. [2 differences: 2 marks; example: 1 mark]

| Positional Indexes | Standard Indexes |
| --- | --- |
| In addition to the document IDs, it also stores the exact positions (word offsets) of the term within the document. | Stores only the DocIDs of documents and term frequency. It does not record the positions or locations of the terms within the document. |
| Along with keyword queries, it also supports phrase and proximity queries. | Cannot directly support phrase or proximity queries; only keyword queries are supported. |
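
As an example of the difference, the sketch below answers the phrase query "information retrieval" over two made-up documents. The documents, index contents, and function names are illustrative assumptions and not part of the original solution. A standard index can only intersect the document lists, while the positional index can check that the two terms are adjacent.

```python
# Hypothetical collection:
#   D1: "information retrieval system"
#   D2: "retrieval of information"
# Positional index: term -> {doc_id: [1-based word positions]}
positional_index = {
    "information": {"D1": [1], "D2": [3]},
    "retrieval": {"D1": [2], "D2": [1]},
}


def boolean_and(index, first, second):
    """Standard-index behaviour: only document IDs are compared."""
    return sorted(set(index.get(first, {})) & set(index.get(second, {})))


def phrase_match(index, first, second):
    """Positional-index behaviour: `second` must directly follow `first`."""
    matches = []
    for doc_id, first_positions in index.get(first, {}).items():
        second_positions = set(index.get(second, {}).get(doc_id, []))
        if any(pos + 1 in second_positions for pos in first_positions):
            matches.append(doc_id)
    return sorted(matches)


and_result = boolean_and(positional_index, "information", "retrieval")
phrase_result = phrase_match(positional_index, "information", "retrieval")
print("AND query:   ", and_result)
print("Phrase query:", phrase_result)
```

The AND query returns both documents, whereas the phrase query returns only D1, because D2 contains the two words in a different order.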

3. What are the main difficulties in determining vocabulary terms in languages characterized by complicated word structures? Elaborate them. [3 marks; 2 marks without examples]
• Hyphenated phrases, e.g. co-education, state-of-the-art
• Numeric data such as dates and phone numbers, e.g. 20/3/91, 3/20/91, Mar 20, 1991, B-52, 100.2.86.144, (800) 234-2333, 800.234.2333
• No whitespace between words, e.g. Chinese
• Use of apostrophes
• Language-specific text
• Accents and diacritics
• Multilingual text
• Language reading direction, e.g. Arabic

4. What are the potential challenges of Boolean retrieval when dealing with large document collections or ambiguous queries? [3 marks]
• It has no provision for document ranking.
• It applies only logical operators, so it returns exact matches without considering context or degree of relevance.
• It cannot handle phrase queries.
• For a large document collection, storage becomes an issue because the term-document incidence matrix is extremely sparse.
• Queries using "OR" operators can return an overwhelming number of results, especially in large collections. This is because any document that matches any of the terms will be included, leading to low precision.
• Queries using "AND" or "NOT" operators may exclude important documents if they lack one or more of the specified terms, even if they are relevant to the user’s information need. This results in lower recall.

5. How is a weighted term-document matrix more beneficial than a binary term-document incidence matrix in Information Retrieval? [3 marks]

In a binary term-document matrix, entries are either 0 or 1, indicating only the presence or absence of a term in a document. This approach doesn't consider how important or frequent a term is within a document or across the collection, so all terms are treated equally in relevance calculations. The weighted matrix assigns numerical weights to terms based on their significance. This allows for distinguishing between terms that are more informative or relevant for a document and common terms that carry less significance. By applying weights like TF-IDF (Term Frequency-Inverse Document Frequency), the matrix downweights the common terms and boosts the importance of rare but informative terms, improving the ranking of documents.
Q3. Find the edit distance between the terms "intention" and "execution". [Marks: 3*1 = 3] [CO6]
Ans: 5, with unit cost for each insertion, deletion, and substitution [1 mark], and the edit-distance matrix [2 marks]
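
The matrix for this answer can be reproduced with the standard dynamic-programming recurrence. The sketch below assumes unit cost for insertion, deletion, and substitution (Levenshtein distance); the function name is illustrative and not part of the original solution.

```python
def edit_distance_matrix(source, target):
    """Return the full DP matrix; entry [i][j] is the distance between
    the first i characters of source and the first j characters of target."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters of source
    for j in range(cols):
        dist[0][j] = j  # insert all j characters of target
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist


matrix = edit_distance_matrix("intention", "execution")
for row in matrix:
    print(" ".join(f"{cell:2d}" for cell in row))
print("Edit distance:", matrix[-1][-1])
```

One optimal edit sequence is: delete "i", substitute n→e, substitute t→x, insert "c", substitute n→u.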

Q4. Construct an Inverted Index with Document Frequency and Positional Index Information for the given collection of three documents: [Marks: 6*1 = 6] [CO4]
Document 1 (DocID = D1): "Machine learning is transforming industries."
Document 2 (DocID = D2): "Artificial intelligence and machine learning are reshaping the future."
Document 3 (DocID = D3): "The future is learning for transformation."

Note: Apply stop word removal and lemmatization wherever necessary to extract terms, as per your understanding.
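
One possible construction is sketched below. The stop-word list, the hand-written lemma table, and the choice to number positions from 1 on the original token sequence (before stop-word removal) are all assumptions made for illustration; other reasonable choices give a different but equally valid index.

```python
import re
from collections import defaultdict

# Assumed stop words and lemma table (not given in the question paper).
STOP_WORDS = {"is", "and", "are", "the", "for"}
LEMMAS = {
    "transforming": "transform",
    "industries": "industry",
    "reshaping": "reshape",
}

DOCS = {
    "D1": "Machine learning is transforming industries.",
    "D2": "Artificial intelligence and machine learning are reshaping the future.",
    "D3": "The future is learning for transformation.",
}


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}; df is the number of doc_ids."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        for position, token in enumerate(tokens, start=1):
            if token in STOP_WORDS:
                continue  # stop words keep their position but are not indexed
            term = LEMMAS.get(token, token)
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


index = build_positional_index(DOCS)
for term in sorted(index):
    postings = index[term]
    print(f"{term} (df={len(postings)}): {postings}")
```

Under these assumptions, "learning" has document frequency 3 with postings D1:[2], D2:[5], D3:[4], and "future" has document frequency 2 with postings D2:[9], D3:[2].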

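Q5. (a) State the significance of adopting the SOUNDEX algorithm for spelling correction. [Marks: 2+5 = 7] [CO4]
Ans: To resolve spelling errors and homophones during query searching: terms that sound alike are mapped to the same code, so a phonetically misspelled query term can still be matched. [2 marks]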
(b) From the given set of terms find all the terms that have same SOUNDEX
codes.
Sea, Plane, Flower, Hear, See, Here, Flour, Barry, Burrow, Berry, Bury,
Smith, Smyth, Smithe
Hint:
1. B, F, P, V → 1
2. C, G, J, K, Q, S, X, Z → 2
3. D, T → 3
4. L → 4
5. M, N → 5
6. R → 6
Ans: [Marks = number of groups identified (5, 4, 3, 2, 1)]
1) Sea, See: S000
2) Flower, Flour: F460
3) Hear, Here: H600
4) Barry, Burrow, Berry, Bury: B600
5) Smith, Smyth, Smithe: S530
Plane: P450 (no other term in the list shares this code)
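
The codes above can be checked with a short sketch of the algorithm. It uses the digit table from the hint and assumes the common convention that H and W are skipped without separating repeated codes, while vowels and Y do separate them; for the terms in this question either treatment of H and W gives the same codes. Function and variable names are illustrative.

```python
from collections import defaultdict

# Digit table taken from the hint in the question.
CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex code: first letter plus three digits."""
    word = word.upper()
    first = word[0]
    digits = []
    previous = CODES.get(first, "")
    for letter in word[1:]:
        if letter in "HW":
            continue  # assumed rule: H and W do not break a run of equal codes
        code = CODES.get(letter, "")  # vowels and Y map to the empty code
        if code and code != previous:
            digits.append(code)
        previous = code
    return (first + "".join(digits) + "000")[:4]


TERMS = ["Sea", "Plane", "Flower", "Hear", "See", "Here", "Flour", "Barry",
         "Burrow", "Berry", "Bury", "Smith", "Smyth", "Smithe"]

groups = defaultdict(list)
for term in TERMS:
    groups[soundex(term)].append(term)

for code, members in sorted(groups.items()):
    print(code, members)
```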
Q6. With reference to Query Processing answer the following (any 3): [Marks: 3*3 = 9] [CO6]
a) Given a Boolean query with three posting lists, state and analyze how they will be resolved to produce a set of documents relevant to the user. [3 marks]
Any Boolean query is resolved using logical operators like AND, OR and NOT. A merge operation is applied to the posting lists, starting with the smallest posting list, that is, the term with the lowest document frequency. For a conjunctive (AND) query, every intermediate result is then no larger than the smallest list, which optimizes the merge operation across more than two posting lists.
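
A minimal sketch of this merge for a three-term AND query follows. The posting lists are made-up sorted document IDs and the function names are illustrative; the point is only the processing order, shortest list first.

```python
def intersect(p1, p2):
    """Linear merge of two sorted posting lists (AND)."""
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def intersect_many(posting_lists):
    """AND several posting lists, processing them by increasing length."""
    if not posting_lists:
        return []
    ordered = sorted(posting_lists, key=len)  # length = document frequency
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, postings)
    return result


# Hypothetical posting lists for three query terms.
term_a = [1, 2, 4, 11, 31, 45, 173, 174]
term_b = [1, 2, 4, 5, 6, 16, 57, 132, 173]
term_c = [2, 31, 54, 101]

answer = intersect_many([term_a, term_b, term_c])
print(answer)
```

Here term_c (4 postings) is merged first, so the intermediate result never holds more than 4 document IDs.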

b) Distinguish between stemming and lemmatization [2 marks] with an example [1 mark].
Differences:
Output form: Stemming may produce non-words or root forms, whereas lemmatization produces actual dictionary words.
Complexity: Lemmatization is more complex and computationally intensive because it involves understanding the context and the grammatical form of the word.
Accuracy: Lemmatization is generally more accurate than stemming, especially in handling irregular forms.
Example:
Original sentence: "He was enjoying the beautiful singing."
Stemmed sentence: "He was enjoy the beauti sing."
Lemmatized sentence: "He was enjoy the beautiful sing."

c) How do skip pointers improve the efficiency of intersecting large postings lists in information retrieval? What are the potential drawbacks of using too many or too few skip pointers in a postings list?
Skip pointers are effectively shortcuts that allow us to avoid processing parts of the postings list that will not figure in the search results. [1 mark]

If there are too many skip pointers, the skip spans are short: skips are likely to be taken, but there are many comparisons against skip pointers and a lot of space is spent storing them. [1 mark]
If there are too few skip pointers, the skip spans are long: there are fewer comparisons against skip pointers, but also fewer opportunities to skip. [1 mark]
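
The trade-off can be seen in a small sketch of intersection with skip pointers. It assumes the commonly used heuristic of roughly √P evenly spaced skip pointers for a postings list of length P, and models a skip pointer simply as "the entry a fixed number of positions ahead"; the lists and names are illustrative.

```python
import math


def intersect_with_skips(p1, p2):
    """Intersect two sorted posting lists, following skip pointers when safe."""
    skip1 = max(1, int(math.sqrt(len(p1))))  # assumed skip span for p1
    skip2 = max(1, int(math.sqrt(len(p2))))  # assumed skip span for p2
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # A skip pointer exists at every skip1-th entry; take it only if
            # its target does not overshoot the current entry of p2.
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return result


list_a = [2, 4, 8, 16, 19, 23, 28, 43]
list_b = [1, 2, 3, 5, 8, 41, 51, 60, 71]
common = intersect_with_skips(list_a, list_b)
print(common)
```

A skip is taken only when the skip target is still less than or equal to the other list's current document ID, so no matching document can be jumped over.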
d) How does the Permuterm Index enable efficient search for prefix, suffix, and infix wildcard queries?
A Permuterm Index is a data structure used in information retrieval systems, particularly in search engines, to support efficient wildcard searches. Wildcard searches allow users to search for variations of a word by using symbols like * to stand for zero or more characters. The permuterm index helps in handling such queries by storing all possible rotations of a term along with the term itself. [2 marks]
Add a $ to the end of each term, rotate the resulting string, and index all rotations in a B-tree. A wildcard query is rotated so that the * falls at the end, which turns it into a prefix lookup in the B-tree (for example, X*Y is looked up as Y$X*). [1 mark]
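
The rotation and lookup steps can be sketched as follows. A sorted list with binary search stands in for the B-tree, the vocabulary is made up, and only queries with exactly one * are handled; all names are illustrative assumptions.

```python
import bisect


def rotations(term):
    """All rotations of term + '$', e.g. 'hall$', 'all$h', ..., '$hall'."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]


def build_permuterm_index(vocabulary):
    """Sorted list of (rotation, original term); stands in for a B-tree."""
    return sorted((rotation, term)
                  for term in vocabulary
                  for rotation in rotations(term))


def wildcard_lookup(index, query):
    """Answer a query containing exactly one '*' by a prefix search."""
    prefix, suffix = query.split("*")  # raises ValueError unless one '*'
    key = suffix + "$" + prefix        # rotate so the '*' is at the end
    start = bisect.bisect_left(index, (key,))
    matches = []
    for rotation, term in index[start:]:
        if not rotation.startswith(key):
            break  # rotations sharing the prefix are contiguous
        matches.append(term)
    return sorted(matches)


permuterm = build_permuterm_index(["hello", "help", "hall", "shell"])
prefix_hits = wildcard_lookup(permuterm, "hel*")  # prefix query
suffix_hits = wildcard_lookup(permuterm, "*ll")   # suffix query
infix_hits = wildcard_lookup(permuterm, "h*l")    # wildcard in the middle
print(prefix_hits, suffix_hits, infix_hits)
```

All three query types reduce to the same operation: a prefix search over the rotated keys.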

Q7. Given a Query for searching, Q = “data is future”. Rank the given retrieved documents using cosine similarity and using vector space model with Inverse Document Frequency formulation. [Marks: 9*1 = 9] [CO3]
Document 1: "Data science is the future."
Document 2: "Data drives intelligence."
Document 3: "Science and data are future."

Note: Apply stop word removal and lemmatization wherever necessary to extract terms, as per your understanding.
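
One possible working is sketched below. The assumptions, none of which are fixed by the question, are: stop words {is, the, and, are}; the lemma "drives" → "drive"; raw term counts as tf; idf = log10(N/df); and tf × idf weights for both the query and the documents. Other weighting choices can change the scores.

```python
import math
import re
from collections import Counter

# Assumed preprocessing resources (not given in the question paper).
STOP_WORDS = {"is", "the", "and", "are"}
LEMMAS = {"drives": "drive"}

DOCS = {
    "D1": "Data science is the future.",
    "D2": "Data drives intelligence.",
    "D3": "Science and data are future.",
}
QUERY = "data is future"


def extract_terms(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]


def tfidf_vector(terms, idf):
    """Sparse vector {term: tf * idf}; unseen terms get weight 0."""
    return {term: tf * idf.get(term, 0.0) for term, tf in Counter(terms).items()}


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


doc_terms = {doc_id: extract_terms(text) for doc_id, text in DOCS.items()}
n_docs = len(DOCS)
df = Counter(term for terms in doc_terms.values() for term in set(terms))
idf = {term: math.log10(n_docs / count) for term, count in df.items()}

query_vector = tfidf_vector(extract_terms(QUERY), idf)
scores = {doc_id: cosine(query_vector, tfidf_vector(terms, idf))
          for doc_id, terms in doc_terms.items()}
# Ties are broken by document ID.
ranking = sorted(scores, key=lambda d: (-round(scores[d], 9), d))

for doc_id in ranking:
    print(doc_id, round(scores[doc_id], 4))
```

Under these assumptions "data" occurs in all three documents, so its idf is 0 and it contributes nothing. Documents 1 and 3 share the term "future" with the query and tie at a cosine of about 0.707, while Document 2 scores 0, giving the ranking D1 = D3 > D2.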

***********
