
CP423 – Text Retrieval & Search Engine Assignment #1

Instructions:

1- Use the answer template (see the assignment folder) to write down your answers to the theoretical questions.
2- Write a separate Python script for each programming task and give each script a descriptive name, such as "Web crawler.py".
3- Carefully read the entire assignment first and pay close attention to the rubrics and submission requirements.

Part 1: Theoretical Questions (40 Points)

Chapter 1: Search Engines and Information Retrieval (10 Points)

1. Define Information Retrieval (IR) and explain its relationship with search engines.
(3 Points)
2. Discuss the major challenges in designing search engines, referring to the "Big Issues"
highlighted in the chapter. Provide examples. (4 Points)
3. What are the roles of a Search Engineer, and how do they contribute to the development
of search systems? (3 Points)

Chapter 2: Architecture of a Search Engine (10 Points)

1. Describe the basic architecture of a search engine. Use a diagram to illustrate the key
components. (5 Points)
2. Compare and contrast Text Acquisition and Index Creation in the context of search
engines. (5 Points)

Chapter 3: Crawls and Feeds (10 Points)

1. Explain the concept of Web Crawling and the challenges associated with maintaining
freshness and handling the deep web. (5 Points)
2. What are Document Feeds, and how do they differ from crawling methods? Provide
real-world examples. (5 Points)

Chapter 4: Processing Text (10 Points)

1. Explain the steps involved in Tokenization, Stemming, and Stopping. How do these
processes impact search engine performance? (5 Points)
2. Discuss Zipf’s Law and its relevance in understanding text processing for search
engines. (5 Points)

Part 2: Practical Exercises (30 Points)

Chapter 5: Ranking with Indexes (10 Points)

1. Create a small corpus of 10 documents. Generate an Inverted Index for the corpus
programmatically in Python. Include counts and positions for each term. (5 Points)
2. Write pseudocode for query evaluation using a document-at-a-time evaluation method.
Implement the pseudocode in Python and demonstrate its execution on your corpus.
(5 Points)
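
A minimal sketch of what an inverted index with term counts and positions, plus a document-at-a-time AND evaluation, could look like in Python is given below; the toy corpus and the function names (build_index, daat_and) are illustrative assumptions, not a prescribed design:

from collections import defaultdict

def build_index(corpus):
    """Map each term to {doc_id: [positions]}; the count of a term in a document is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def daat_and(index, terms):
    """Document-at-a-time AND: walk the posting lists of all query terms in parallel."""
    postings = [sorted(index.get(t, {})) for t in terms]
    if not postings or any(not p for p in postings):
        return []
    results, pointers = [], [0] * len(postings)
    while all(pointers[i] < len(postings[i]) for i in range(len(postings))):
        current = [postings[i][pointers[i]] for i in range(len(postings))]
        if len(set(current)) == 1:              # every list is on the same doc -> match
            results.append(current[0])
            pointers = [p + 1 for p in pointers]
        else:                                   # advance the list with the smallest doc_id
            pointers[current.index(min(current))] += 1
    return results

corpus = {1: "fish live in the tank", 2: "clean the fish tank weekly", 3: "dogs bark"}
index = build_index(corpus)
print(index["tank"])                        # {1: [4], 2: [3]}
print(daat_and(index, ["fish", "tank"]))    # [1, 2]

The document-at-a-time loop scores or reports one document at a time, advancing a pointer per posting list and emitting a document only when every query term's list is positioned on it.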

Chapter 3: Web Crawler Implementation (20 Points)

Requirements

1. Choose a Website to Crawl

Select a website from the following list (or propose a similar website for approval):

1. CNN
2. BBC
3. The New York Times
4. Wikipedia
5. National Geographic

2. Focus of Extraction

You must focus on extracting meaningful and structured information, such as:

1. News Websites (e.g., CNN, BBC, NYT):
a. Article titles
b. Publication dates
c. Article URLs
d. Article summaries or full text
2. Wikipedia:
a. Section headings
b. Summary paragraphs
c. URLs of internal or external references
3. National Geographic:
a. Article titles
b. Publication dates
c. Key topics or categories

3. Web Crawling Implementation

1. Use Python and libraries like requests, BeautifulSoup, or Scrapy to crawl the selected
website.
2. Respect the website's robots.txt file and abide by ethical crawling practices.
3. Limit the number of pages you crawl (e.g., 50 pages) to prevent overloading the server.
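
As a rough illustration of these three points, the sketch below uses requests, BeautifulSoup, and urllib.robotparser; the seed URL, User-Agent string, and 50-page cap are example choices, and the parsing logic would need site-specific selectors for titles, dates, and summaries:

import time
from urllib import robotparser
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "https://www.bbc.com/news"       # example seed; replace with your approved site
MAX_PAGES = 50                               # stay within the assignment's crawl limit

rp = robotparser.RobotFileParser()           # respect the site's robots.txt
rp.set_url(urljoin(START_URL, "/robots.txt"))
rp.read()

seen, frontier, records = set(), [START_URL], []
while frontier and len(seen) < MAX_PAGES:
    url = frontier.pop(0)
    if url in seen or not rp.can_fetch("*", url):
        continue
    seen.add(url)
    resp = requests.get(url, timeout=10, headers={"User-Agent": "CP423-A1-crawler"})
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    records.append({"url": url, "title": title})      # add site-specific fields here
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if urlparse(link).netloc == urlparse(START_URL).netloc and link not in seen:
            frontier.append(link)
    time.sleep(1)                            # be gentle with the server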

4. Data Storage in Excel

1. Store the extracted information in an Excel file using libraries like pandas or openpyxl.
2. Ensure the Excel file is well-structured and includes columns for all extracted fields.
3. Save the file with a clear and descriptive name (e.g., CNN_Articles.xlsx).
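
A short sketch of the storage step, assuming a records list of dictionaries produced by the crawler and pandas with openpyxl installed; the file name and column names are placeholders that should match the fields you actually extract:

import pandas as pd

df = pd.DataFrame(records, columns=["title", "publication_date", "url", "summary"])
df.to_excel("BBC_Articles.xlsx", index=False, sheet_name="Articles")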

5. Keyword Search Functionality

a) Implement a script or interface that allows users to search for specific keywords in the
crawled data and retrieve matching results.
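
One possible shape for this functionality, assuming the Excel file and column names from the storage sketch above (a case-insensitive substring match; more sophisticated matching is equally acceptable):

import pandas as pd

def keyword_search(keyword, path="BBC_Articles.xlsx"):
    """Return rows whose title or summary contains the keyword."""
    df = pd.read_excel(path)
    mask = df["title"].str.contains(keyword, case=False, na=False) | \
           df["summary"].str.contains(keyword, case=False, na=False)
    return df[mask]

print(keyword_search("climate")[["title", "url"]])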

6. Topic Categorization

a) Use basic natural language processing (NLP) techniques to categorize the extracted
articles or information into relevant topics (e.g., "Politics," "Science," "Technology").
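
A deliberately simple keyword-lexicon approach is sketched below, assuming the DataFrame df from the storage step; the topic word lists are made-up examples, and an NLTK- or classifier-based approach would also satisfy the requirement:

# Illustrative topic lexicon; extend or replace with your own categories.
TOPIC_KEYWORDS = {
    "Politics":   ["election", "parliament", "president", "policy"],
    "Science":    ["research", "study", "space", "climate"],
    "Technology": ["ai", "software", "chip", "smartphone"],
}

def categorize(text):
    """Assign the first topic whose keywords appear in the text, else 'Other'."""
    words = text.lower()
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(k in words for k in keywords):
            return topic
    return "Other"

df["topic"] = (df["title"].fillna("") + " " + df["summary"].fillna("")).apply(categorize)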

7. Data Visualization

1. Generate visual summaries of the extracted data, such as:
a. Bar charts showing the distribution of topics
b. Line graphs tracking the number of articles published over time
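
For example, a bar chart of the topic distribution could be produced with matplotlib, assuming the topic column added in the categorization sketch:

import matplotlib.pyplot as plt

df["topic"].value_counts().plot(kind="bar", title="Articles per topic")
plt.xlabel("Topic")
plt.ylabel("Number of articles")
plt.tight_layout()
plt.savefig("topic_distribution.png")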

8. Comparison Across Sources (Optional)

a) If crawling multiple websites, analyze and compare the extracted data, such as the
frequency of specific topics or the publication volume across sites.

9. Generate Summaries

a) Implement text summarization (manual or automated) for the crawled articles or sections
using tools like spaCy or NLTK.
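
A minimal frequency-based extractive summarizer with NLTK is sketched below (it assumes the punkt and stopwords corpora have been downloaded via nltk.download); spaCy or a dedicated summarization library would work just as well:

from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

def summarize(text, n_sentences=2):
    """Keep the sentences whose words are most frequent in the article overall."""
    stop = set(stopwords.words("english"))
    words = [w.lower() for w in word_tokenize(text) if w.isalpha() and w.lower() not in stop]
    freq = Counter(words)
    sentences = sent_tokenize(text)
    scored = sorted(sentences,
                    key=lambda s: sum(freq[w.lower()] for w in word_tokenize(s)),
                    reverse=True)
    return " ".join(scored[:n_sentences])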

10. API Creation

a) Create a simple API using Flask or FastAPI that allows users to query the crawled data
programmatically.
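
A bare-bones Flask sketch is shown below; the /search route, the q query parameter, and the Excel file name are illustrative assumptions:

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
df = pd.read_excel("BBC_Articles.xlsx").fillna("")   # load the crawled data once at startup

@app.route("/search")
def search():
    q = request.args.get("q", "")
    hits = df[df["title"].str.contains(q, case=False, na=False)]
    return jsonify(hits.to_dict(orient="records"))

if __name__ == "__main__":
    app.run(debug=True)   # e.g. GET http://127.0.0.1:5000/search?q=climate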

Deliverables

1. Python scripts for web crawling, data processing, and storage.
2. The Excel file(s) containing the extracted data.
3. Documentation explaining (the explanations for all Python files may be combined into one document):
a) The crawling process and logic.
b) The challenges you faced and how you resolved them.
4. Visualizations, such as bar charts of the extracted data.


Submission Guidelines: Submit your code, Excel files, and documentation as a compressed
folder.

Grading Criteria

1. Crawling Implementation: 30%
2. Data Storage & Structure: 20%
3. Functionalities (Keyword Search, Categorization, Visualization): 40%
4. Documentation: 10%

Part 3: Building an IR System Prototype (30 Points)

Building an Information Retrieval System Prototype

1. Task Overview: Design and implement a basic information retrieval system that
processes a small document collection to perform indexing and querying. This task
integrates concepts from Chapters 1–6, including text processing, indexing, and query
handling.
2. Requirements:
a. Corpus Setup:
i. Use a small collection of at least 10 documents (text files) to simulate a dataset.
ii. Each document should contain structured text with a title and body.
b. Text Processing:
i. Implement tokenization, stopword removal, and stemming using Python
libraries like nltk or spacy.
ii. Generate a clean and preprocessed version of the corpus.
c. Index Construction:
i. Build an inverted index using the processed corpus. Each term should map
to:
1. Document IDs where it occurs.
2. Frequency of the term in each document.
3. Positions of the term in the document.
d. Query Processing:
i. Accept user queries and support Boolean and ranked retrieval methods.

ii. Boolean queries should include operators like AND, OR, and NOT.
iii. Ranked retrieval should use TF-IDF scoring.
e. User Interface:
i. Create a simple command-line interface for:
1. Adding documents to the corpus.
2. Searching the corpus with a query.
3. Displaying ranked results with document titles and scores.
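
A condensed sketch of how the preprocessing, indexing, and ranked retrieval in 2(b)-(d) could fit together using NLTK is shown below; the tiny corpus, the function names, and the simplified TF-IDF weighting are assumptions rather than the required design (the punkt and stopwords corpora are assumed to be downloaded):

import math
from collections import Counter, defaultdict

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOP, STEM = set(stopwords.words("english")), PorterStemmer()

def preprocess(text):
    """Tokenize, drop stopwords and punctuation, and stem."""
    return [STEM.stem(w.lower()) for w in word_tokenize(text)
            if w.isalpha() and w.lower() not in STOP]

def index_corpus(docs):
    """Build term -> {doc_id: {"freq": n, "positions": [...]}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(preprocess(text)):
            entry = index[term].setdefault(doc_id, {"freq": 0, "positions": []})
            entry["freq"] += 1
            entry["positions"].append(pos)
    return index

def tfidf_search(query, index, n_docs):
    """Rank documents by a simple TF-IDF score for the query terms."""
    scores = Counter()
    for term in preprocess(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, entry in postings.items():
            scores[doc_id] += entry["freq"] * idf
    return scores.most_common()

docs = {
    "Aquarium Basics": "A fish tank needs a filter and regular water changes.",
    "Tropical Fish Care": "Tropical fish thrive in a heated, well planted tank.",
    "Dog Training": "Train your dog with short, consistent sessions.",
}
index = index_corpus(docs)
print(tfidf_search("fish tank", index, len(docs)))

Boolean queries can be layered on top of the same index by intersecting (AND), uniting (OR), or subtracting (NOT) the sets of document IDs in each term's posting list, and the command-line interface in 2(e) can simply wrap these functions.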

3. Implementation Details:
1. Use Python and libraries like nltk, numpy, and pandas for processing and
calculations.
2. Store the inverted index in memory or serialize it as a JSON/CSV file.
3. Use modular coding practices for clarity and reuse.
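
If you choose to serialize the index, a JSON round trip is enough for a structure like the one sketched above (note that JSON object keys must be strings):

import json

# Write the in-memory inverted index to disk ...
with open("inverted_index.json", "w", encoding="utf-8") as f:
    json.dump(index, f, indent=2)

# ... and reload it in a later run.
with open("inverted_index.json", encoding="utf-8") as f:
    index = json.load(f)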

4. Deliverables:
a. Python script(s) with detailed comments.
b. A report including:
i. Description of the dataset.
ii. Steps for text preprocessing and index construction.
iii. Sample queries and results.
iv. Any challenges faced and solutions implemented.
c. Screenshots or logs showing the execution of the system.

Example Input/Output:

Query Input: fish AND tank

Ranked Results:
1. Document: Aquarium Basics, Score: 0.75
2. Document: Tropical Fish Care, Score: 0.65

5. Testing:
1. Test the IR system with at least 5 different queries.
2. Provide a summary of observations on its performance and accuracy.

6. Requirements (Document Similarity Task):
a. Use a small corpus of 5–10 text documents.
b. Preprocess the text, including:
i. Tokenization and text cleaning.
ii. Removing stopwords.
iii. Applying stemming or lemmatization.
c. Compute the term-document matrix using TF or TF-IDF.
d. Calculate the cosine similarity between all pairs of documents.

7. Implementation Details:
1. Use Python libraries such as scikit-learn, numpy, and nltk/spacy for
preprocessing and calculations.
2. Provide a visualization (e.g., heatmap) of the similarity matrix.
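
A compact sketch of points 6(c)-(d) and the heatmap, using scikit-learn's TfidfVectorizer (which performs tokenization, stopword removal, and TF-IDF weighting in one step) together with seaborn; the docs dict of document names to text, the output file name, and the choice of seaborn are assumptions:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = list(docs)                       # document names from the small corpus
texts = list(docs.values())

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(texts)      # term-document matrix (documents x terms)
sim = cosine_similarity(matrix)          # pairwise document similarity

sns.heatmap(sim, annot=True, fmt=".2f", xticklabels=names, yticklabels=names, cmap="Blues")
plt.title("Document Similarity Matrix")
plt.tight_layout()
plt.savefig("similarity_heatmap.png")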

8. Deliverables:
a. Python script(s) with detailed comments.
b. A report summarizing:
i. Corpus description.
ii. Preprocessing steps.
iii. Results, including the similarity matrix and heatmap visualization.
iv. Key observations or insights.

Example Output:

Document Similarity Matrix:
---------------------------
       Doc1  Doc2  Doc3
Doc1   1.00  0.85  0.60
Doc2   0.85  1.00  0.45
Doc3   0.60  0.45  1.00

Key Observations:
- Doc1 and Doc2 are the most similar documents, with a score of 0.85.
- Doc3 is less similar to both Doc1 and Doc2.

Grading Rubric

Criteria                         Points   Description
Theoretical Questions              40     Completeness, clarity, and depth of explanation for all parts.
Practical Exercises                30     Accuracy of implementation, code quality, and thorough explanations.
Document Similarity Task           30     Correct implementation, insightful analysis, and clear presentation.
Presentation and Documentation     10     Overall organization, clarity of writing, and proper citations where necessary.

Submission Guidelines:

1. Submit your assignment in a single PDF document. Include the theoretical answers and
practical exercise outputs.
2. Provide Python code files for practical exercises as separate attachments.
3. Ensure that all code is properly commented and tested.
