Assignment #1 Text Retrieval & Search Engine
Instructions:
1- You must use the Answer template (see the assignment folder) to write
down your answers to the theoretical questions.
2- Write a separate Python script for each practical task and give each script a
descriptive name, e.g., “Web crawler.py”.
3- Carefully read the entire assignment first and focus on the rubrics and
submission requirements.
1. Define Information Retrieval (IR) and explain its relationship with search engines.
(3 Points)
2. Discuss the major challenges in designing search engines, referring to the "Big Issues"
highlighted in the chapter. Provide examples. (4 Points)
3. What are the roles of a Search Engineer, and how do they contribute to the development
of search systems? (3 Points)
1. Describe the basic architecture of a search engine. Use a diagram to illustrate the key
components. (5 Points)
2. Compare and contrast Text Acquisition and Index Creation in the context of search
engines. (5 Points)
1. Explain the concept of Web Crawling and the challenges associated with maintaining
freshness and handling the deep web. (5 Points)
2. What are Document Feeds, and how do they differ from crawling methods? Provide
real-world examples. (5 Points)
1. Create a small corpus of 10 documents. Generate an Inverted Index for the corpus
programmatically in Python. Include counts and positions for each term. (5 Points)
2. Write pseudocode for query evaluation using a document-at-a-time evaluation method.
Implement the pseudocode in Python and demonstrate its execution on your corpus.
(5 Points)
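For reference, a minimal Python sketch covering both questions 1 and 2 above is given below. It builds a positional inverted index over a tiny illustrative corpus (the three sample documents and the naive whitespace tokenizer are assumptions of the sketch, not part of the assignment; your own corpus must contain 10 documents) and then runs a document-at-a-time AND evaluation over that index.

    from collections import defaultdict

    corpus = {
        1: "search engines use an inverted index to rank documents",
        2: "an inverted index stores postings with term positions",
        3: "web crawlers fetch documents for the search engine",
    }

    def tokenize(text):
        # Naive whitespace tokenization and lowercasing; real systems also strip punctuation.
        return text.lower().split()

    def build_index(docs):
        # index[term][doc_id] -> list of positions; term counts follow from len(positions).
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            for pos, term in enumerate(tokenize(text)):
                index[term][doc_id].append(pos)
        return index

    def daat_and(index, query):
        # Document-at-a-time conjunctive (AND) evaluation: keep one cursor per
        # query term's postings list and score a document only when every cursor
        # points at the same doc_id.
        postings = []
        for term in tokenize(query):
            if term not in index:
                return []                                  # a missing term empties an AND result
            postings.append(sorted(index[term].items()))   # [(doc_id, positions), ...]
        cursors = [0] * len(postings)
        results = []
        while all(c < len(p) for c, p in zip(cursors, postings)):
            doc_ids = [p[c][0] for c, p in zip(cursors, postings)]
            candidate = max(doc_ids)
            if all(d == candidate for d in doc_ids):
                # Simple score: total frequency of the query terms in this document.
                score = sum(len(p[c][1]) for c, p in zip(cursors, postings))
                results.append((candidate, score))
                cursors = [c + 1 for c in cursors]
            else:
                # Advance every cursor whose doc_id lags behind the current candidate.
                cursors = [c + 1 if p[c][0] < candidate else c
                           for c, p in zip(cursors, postings)]
        return sorted(results, key=lambda r: -r[1])

    if __name__ == "__main__":
        index = build_index(corpus)
        print(dict(index["inverted"]))            # {1: [4], 2: [1]}
        print(daat_and(index, "inverted index"))  # [(1, 2), (2, 2)]

The cursor-advancing loop is what makes the evaluation document-at-a-time: each matching document is fully scored before any later document is examined.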
Requirements
1. Website Selection
Select a website from the following list (or propose a similar website for approval):
1. CNN
2. BBC
3. The New York Times
4. Wikipedia
5. National Geographic
2. Focus of Extraction
You must focus on extracting meaningful and structured information from each page.
3. Crawling the Website
1. Use Python and libraries like requests, BeautifulSoup, or Scrapy to crawl the selected
website.
2. Respect the website's robots.txt file and abide by ethical crawling practices.
3. Limit the number of pages you crawl (e.g., 50 pages) to prevent overloading the server.
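A minimal crawling sketch under the three constraints above is given below; BBC News is assumed as the approved site, and the start URL, page limit, User-Agent string, and one-second delay are illustrative choices to adapt to the site you actually select.

    import time
    import urllib.robotparser
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    START_URL = "https://www.bbc.com/news"
    MAX_PAGES = 50

    # Check robots.txt before fetching anything.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urljoin(START_URL, "/robots.txt"))
    robots.read()

    seen, queue, pages = set(), [START_URL], []
    while queue and len(pages) < MAX_PAGES:
        url = queue.pop(0)
        if url in seen or not robots.can_fetch("*", url):
            continue
        seen.add(url)
        resp = requests.get(url, timeout=10, headers={"User-Agent": "course-crawler"})
        if resp.status_code != 200:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        pages.append({"url": url,
                      "title": soup.title.get_text(strip=True) if soup.title else "",
                      "text": soup.get_text(" ", strip=True)[:2000]})
        # Stay on the same host and be polite between requests.
        for link in soup.find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            if urlparse(absolute).netloc == urlparse(START_URL).netloc:
                queue.append(absolute)
        time.sleep(1)

    print(f"Crawled {len(pages)} pages")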
4. Data Storage
1. Store the extracted information in an Excel file using libraries like pandas or openpyxl.
2. Ensure the Excel file is well-structured and includes columns for all extracted fields.
3. Save the file with a clear and descriptive name (e.g., CNN_Articles.xlsx).
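A minimal storage sketch for steps 1-3 above, reusing the pages list of dictionaries from the crawling sketch; the column names and file name are illustrative.

    import pandas as pd

    # pages is the list of {"url": ..., "title": ..., "text": ...} dictionaries
    # collected by the crawler sketch above.
    df = pd.DataFrame(pages, columns=["url", "title", "text"])
    df.to_excel("BBC_Articles.xlsx", index=False, engine="openpyxl")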
5. Keyword Search
a) Implement a script or interface that allows users to search for specific keywords in the
crawled data and retrieve matching results.
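One possible shape of such a search script, assuming the Excel file and column names used in the earlier sketches.

    import pandas as pd

    def search(keyword, path="BBC_Articles.xlsx"):
        # Case-insensitive substring match against the title or body text columns.
        df = pd.read_excel(path)
        mask = (df["title"].str.contains(keyword, case=False, na=False)
                | df["text"].str.contains(keyword, case=False, na=False))
        return df[mask][["url", "title"]]

    print(search("climate"))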
6. Topic Categorization
a) Use basic natural language processing (NLP) techniques to categorize the extracted
articles or information into relevant topics (e.g., "Politics," "Science," "Technology").
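A minimal rule-based sketch is shown below; the topic keyword lists are illustrative assumptions, and a trained classifier (for example with scikit-learn) would be an equally valid approach.

    TOPIC_KEYWORDS = {
        "Politics":   {"election", "parliament", "minister", "policy"},
        "Science":    {"research", "study", "scientists", "space"},
        "Technology": {"ai", "software", "tech", "startup"},
    }

    def categorize(text):
        # Count keyword overlaps per topic and pick the best-matching one.
        tokens = set(text.lower().split())
        scores = {topic: len(tokens & words) for topic, words in TOPIC_KEYWORDS.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else "Other"

    print(categorize("Scientists publish a new space research study"))  # Science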
7. Data Visualization
a) If crawling multiple websites, analyze and compare the extracted data, such as the
frequency of specific topics or the publication volume across sites.
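A minimal comparison sketch with pandas and matplotlib, assuming the crawled data from all sites has been merged into one spreadsheet with "site" and "topic" columns (the file and column names are assumptions of the sketch).

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_excel("All_Articles.xlsx")
    # Count articles per (site, topic) pair and plot the comparison as grouped bars.
    counts = df.groupby(["site", "topic"]).size().unstack(fill_value=0)
    counts.plot(kind="bar")
    plt.ylabel("Number of articles")
    plt.title("Topic frequency per site")
    plt.tight_layout()
    plt.savefig("topic_frequency.png")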
9. Generate Summaries
a) Implement text summarization (manual or automated) for the crawled articles or sections
using tools like spaCy or NLTK.
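A minimal extractive-summarization sketch with NLTK is shown below; scoring each sentence by the summed frequency of its content words is one simple approach, not the only acceptable one.

    from collections import Counter

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import sent_tokenize, word_tokenize

    nltk.download("punkt", quiet=True)       # newer NLTK versions may also need "punkt_tab"
    nltk.download("stopwords", quiet=True)

    def summarize(text, n_sentences=2):
        stops = set(stopwords.words("english"))
        words = [w for w in word_tokenize(text.lower()) if w.isalpha() and w not in stops]
        freq = Counter(words)
        sentences = sent_tokenize(text)
        # Rank sentences by the summed frequency of the words they contain.
        ranked = sorted(sentences,
                        key=lambda s: sum(freq[w] for w in word_tokenize(s.lower())),
                        reverse=True)
        # Return the top sentences in their original document order.
        top = set(ranked[:n_sentences])
        return " ".join(s for s in sentences if s in top)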
10. Query API
a) Create a simple API using Flask or FastAPI that allows users to query the crawled data
programmatically.
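A minimal Flask sketch, reusing the Excel file and column names assumed in the earlier sketches.

    import pandas as pd
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    df = pd.read_excel("BBC_Articles.xlsx")

    @app.route("/search")
    def search():
        # e.g. GET /search?q=climate returns matching urls and titles as JSON.
        keyword = request.args.get("q", "")
        mask = df["title"].str.contains(keyword, case=False, na=False)
        return jsonify(df[mask][["url", "title"]].to_dict(orient="records"))

    if __name__ == "__main__":
        app.run(debug=True)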
Grading Criteria
1. Task Overview: Design and implement a basic information retrieval system that
processes a small document collection to perform indexing and querying. This task
integrates concepts from Chapters 1–6, including text processing, indexing, and query
handling.
2. Requirements:
a) Corpus Setup:
3. Implementation Details:
1. Use Python and libraries like nltk, numpy, and pandas for processing and
calculations.
2. Store the inverted index in memory or serialize it as a JSON/CSV file.
3. Use modular coding practices for clarity and reuse.
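For item 2, one minimal way to serialize the positional index (term -> {doc_id: positions}) as JSON and load it back; the file name is illustrative.

    import json

    def save_index(index, path="inverted_index.json"):
        # JSON object keys must be strings, so document ids are stringified on the way out.
        serializable = {term: {str(d): pos for d, pos in docs.items()}
                        for term, docs in index.items()}
        with open(path, "w", encoding="utf-8") as f:
            json.dump(serializable, f, indent=2)

    def load_index(path="inverted_index.json"):
        with open(path, encoding="utf-8") as f:
            raw = json.load(f)
        # Restore integer document ids on the way back in.
        return {term: {int(d): pos for d, pos in docs.items()} for term, docs in raw.items()}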
4. Deliverables:
a. Python script(s) with detailed comments.
b. A report including:
i. Description of the dataset.
ii. Steps for text preprocessing and index construction.
iii. Sample queries and results.
iv. Any challenges faced and solutions implemented.
c. Screenshots or logs showing the execution of the system.
Example Input/Output:
5. Testing:
1. Test the IR system with at least 5 different queries.
2. Provide a summary of observations on its performance and accuracy.
6. Requirements:
a. Use a small corpus of 5–10 text documents.
b. Preprocess the text, including:
i. Tokenization and text cleaning.
ii. Removing stopwords.
iii. Applying stemming or lemmatization.
c. Compute the term-document matrix using TF or TF-IDF.
d. Calculate the cosine similarity between all pairs of documents.
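A minimal sketch for requirements b-d, using scikit-learn and seaborn (library choices of this sketch, not requirements of the assignment) on a tiny illustrative corpus.

    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "The search engine ranks documents by relevance.",
        "Relevance ranking is central to every search engine.",
        "Web crawlers download pages for later indexing.",
    ]

    # TfidfVectorizer handles tokenization, lowercasing, and English stopword removal;
    # stemming or lemmatization (e.g. with NLTK) would be an extra preprocessing pass.
    tfidf = TfidfVectorizer(stop_words="english")
    matrix = tfidf.fit_transform(docs)      # document-term TF-IDF matrix
    sim = cosine_similarity(matrix)         # pairwise document similarities

    labels = [f"Doc{i + 1}" for i in range(len(docs))]
    sns.heatmap(sim, annot=True, xticklabels=labels, yticklabels=labels, cmap="Blues")
    plt.title("Document cosine similarity")
    plt.tight_layout()
    plt.savefig("similarity_heatmap.png")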
8. Deliverables:
a. Python script(s) with detailed comments.
b. A report summarizing:
i. Corpus description.
ii. Preprocessing steps.
iii. Results, including the similarity matrix and heatmap visualization.
iv. Key observations or insights.
Example Output:
Key Observations:
- Doc1 and Doc2 are the most similar documents, with a score of 0.85.
- Doc3 is less similar to both Doc1 and Doc2.
Grading Rubric
Submission Guidelines:
1. Submit your assignment in a single PDF document. Include the theoretical answers and
practical exercise outputs.
2. Provide Python code files for practical exercises as separate attachments.
3. Ensure that all code is properly commented and tested.