Assignment #1 Text Retrieval & Search Engine
Instructions:
1- You must use the Answer template (see the assignment folder) to write
down your answers to the theoretical questions.
2- Write a separate Python script for each practical task and give each script a
descriptive name, e.g., “Web crawler.py”.
3- Carefully read the entire assignment first and focus on the rubrics and
submission requirements.
1. Define Information Retrieval (IR) and explain its relationship with search engines.
(3 Points)
2. Discuss the major challenges in designing search engines, referring to the "Big Issues"
highlighted in the chapter. Provide examples. (4 Points)
3. What are the roles of a Search Engineer, and how do they contribute to the development
of search systems? (3 Points)
1. Describe the basic architecture of a search engine. Use a diagram to illustrate the key
components. (5 Points)
2. Compare and contrast Text Acquisition and Index Creation in the context of search
engines. (5 Points)
1. Explain the concept of Web Crawling and the challenges associated with maintaining
freshness and handling the deep web. (5 Points)
2. What are Document Feeds, and how do they differ from crawling methods? Provide
real-world examples. (5 Points)
1. Create a small corpus of 10 documents. Generate an Inverted Index for the corpus
programmatically in Python. Include counts and positions for each term. (5 Points)
2. Write pseudocode for query evaluation using a document-at-a-time evaluation method.
Implement the pseudocode in Python and demonstrate its execution on your corpus.
(5 Points)
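For reference, a minimal Python sketch covering both questions 1 and 2 above is given below. It builds a positional inverted index over a tiny illustrative corpus (the three sample documents and the naive whitespace tokenizer are assumptions of the sketch, not part of the assignment; your own corpus must contain 10 documents) and then runs a document-at-a-time AND evaluation over that index.

    from collections import defaultdict

    corpus = {
        1: "search engines use an inverted index to rank documents",
        2: "an inverted index stores postings with term positions",
        3: "web crawlers fetch documents for the search engine",
    }

    def tokenize(text):
        # Naive whitespace tokenization and lowercasing; real systems also strip punctuation.
        return text.lower().split()

    def build_index(docs):
        # index[term][doc_id] -> list of positions; term counts follow from len(positions).
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            for pos, term in enumerate(tokenize(text)):
                index[term][doc_id].append(pos)
        return index

    def daat_and(index, query):
        # Document-at-a-time conjunctive (AND) evaluation: keep one cursor per
        # query term's postings list and score a document only when every cursor
        # points at the same doc_id.
        postings = []
        for term in tokenize(query):
            if term not in index:
                return []                                  # a missing term empties an AND result
            postings.append(sorted(index[term].items()))   # [(doc_id, positions), ...]
        cursors = [0] * len(postings)
        results = []
        while all(c < len(p) for c, p in zip(cursors, postings)):
            doc_ids = [p[c][0] for c, p in zip(cursors, postings)]
            candidate = max(doc_ids)
            if all(d == candidate for d in doc_ids):
                # Simple score: total frequency of the query terms in this document.
                score = sum(len(p[c][1]) for c, p in zip(cursors, postings))
                results.append((candidate, score))
                cursors = [c + 1 for c in cursors]
            else:
                # Advance every cursor whose doc_id lags behind the current candidate.
                cursors = [c + 1 if p[c][0] < candidate else c
                           for c, p in zip(cursors, postings)]
        return sorted(results, key=lambda r: -r[1])

    if __name__ == "__main__":
        index = build_index(corpus)
        print(dict(index["inverted"]))            # {1: [4], 2: [1]}
        print(daat_and(index, "inverted index"))  # [(1, 2), (2, 2)]

The cursor-advancing loop is what makes the evaluation document-at-a-time: each matching document is fully scored before any later document is examined.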
Requirements
1. Website Selection
Select a website from the following list (or propose a similar website for approval):
1. CNN
2. BBC
3. The New York Times
4. Wikipedia
5. National Geographic
2. Focus of Extraction
You must focus on extracting meaningful and structured information from each page.
3. Crawling the Website
1. Use Python and libraries like requests, BeautifulSoup, or Scrapy to crawl the selected
website.
2. Respect the website's robots.txt file and abide by ethical crawling practices.
3. Limit the number of pages you crawl (e.g., 50 pages) to prevent overloading the server.
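A minimal crawling sketch under the three constraints above is given below; BBC News is assumed as the approved site, and the start URL, page limit, User-Agent string, and one-second delay are illustrative choices to adapt to the site you actually select.

    import time
    import urllib.robotparser
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    START_URL = "https://www.bbc.com/news"
    MAX_PAGES = 50

    # Check robots.txt before fetching anything.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urljoin(START_URL, "/robots.txt"))
    robots.read()

    seen, queue, pages = set(), [START_URL], []
    while queue and len(pages) < MAX_PAGES:
        url = queue.pop(0)
        if url in seen or not robots.can_fetch("*", url):
            continue
        seen.add(url)
        resp = requests.get(url, timeout=10, headers={"User-Agent": "course-crawler"})
        if resp.status_code != 200:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        pages.append({"url": url,
                      "title": soup.title.get_text(strip=True) if soup.title else "",
                      "text": soup.get_text(" ", strip=True)[:2000]})
        # Stay on the same host and be polite between requests.
        for link in soup.find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            if urlparse(absolute).netloc == urlparse(START_URL).netloc:
                queue.append(absolute)
        time.sleep(1)

    print(f"Crawled {len(pages)} pages")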
4. Data Storage
1. Store the extracted information in an Excel file using libraries like pandas or openpyxl.
2. Ensure the Excel file is well-structured and includes columns for all extracted fields.
3. Save the file with a clear and descriptive name (e.g., CNN_Articles.xlsx).
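A minimal storage sketch for steps 1-3 above, reusing the pages list of dictionaries from the crawling sketch; the column names and file name are illustrative.

    import pandas as pd

    # pages is the list of {"url": ..., "title": ..., "text": ...} dictionaries
    # collected by the crawler sketch above.
    df = pd.DataFrame(pages, columns=["url", "title", "text"])
    df.to_excel("BBC_Articles.xlsx", index=False, engine="openpyxl")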
5. Keyword Search
a) Implement a script or interface that allows users to search for specific keywords in the
crawled data and retrieve matching results.
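One possible shape of such a search script, assuming the Excel file and column names used in the earlier sketches.

    import pandas as pd

    def search(keyword, path="BBC_Articles.xlsx"):
        # Case-insensitive substring match against the title or body text columns.
        df = pd.read_excel(path)
        mask = (df["title"].str.contains(keyword, case=False, na=False)
                | df["text"].str.contains(keyword, case=False, na=False))
        return df[mask][["url", "title"]]

    print(search("climate"))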
6. Topic Categorization
a) Use basic natural language processing (NLP) techniques to categorize the extracted
articles or information into relevant topics (e.g., "Politics," "Science," "Technology").
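A minimal rule-based sketch is shown below; the topic keyword lists are illustrative assumptions, and a trained classifier (for example with scikit-learn) would be an equally valid approach.

    TOPIC_KEYWORDS = {
        "Politics":   {"election", "parliament", "minister", "policy"},
        "Science":    {"research", "study", "scientists", "space"},
        "Technology": {"ai", "software", "tech", "startup"},
    }

    def categorize(text):
        # Count keyword overlaps per topic and pick the best-matching one.
        tokens = set(text.lower().split())
        scores = {topic: len(tokens & words) for topic, words in TOPIC_KEYWORDS.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else "Other"

    print(categorize("Scientists publish a new space research study"))  # Science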
7. Data Visualization
a) If crawling multiple websites, analyze and compare the extracted data, such as the
frequency of specific topics or the publication volume across sites.
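A minimal comparison sketch with pandas and matplotlib, assuming the crawled data from all sites has been merged into one spreadsheet with "site" and "topic" columns (the file and column names are assumptions of the sketch).

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_excel("All_Articles.xlsx")
    # Count articles per (site, topic) pair and plot the comparison as grouped bars.
    counts = df.groupby(["site", "topic"]).size().unstack(fill_value=0)
    counts.plot(kind="bar")
    plt.ylabel("Number of articles")
    plt.title("Topic frequency per site")
    plt.tight_layout()
    plt.savefig("topic_frequency.png")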
9. Generate Summaries
a) Implement text summarization (manual or automated) for the crawled articles or sections
using tools like spaCy or NLTK.
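A minimal extractive-summarization sketch with NLTK is shown below; scoring each sentence by the summed frequency of its content words is one simple approach, not the only acceptable one.

    from collections import Counter

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import sent_tokenize, word_tokenize

    nltk.download("punkt", quiet=True)       # newer NLTK versions may also need "punkt_tab"
    nltk.download("stopwords", quiet=True)

    def summarize(text, n_sentences=2):
        stops = set(stopwords.words("english"))
        words = [w for w in word_tokenize(text.lower()) if w.isalpha() and w not in stops]
        freq = Counter(words)
        sentences = sent_tokenize(text)
        # Rank sentences by the summed frequency of the words they contain.
        ranked = sorted(sentences,
                        key=lambda s: sum(freq[w] for w in word_tokenize(s.lower())),
                        reverse=True)
        # Return the top sentences in their original document order.
        top = set(ranked[:n_sentences])
        return " ".join(s for s in sentences if s in top)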
10. Query API
a) Create a simple API using Flask or FastAPI that allows users to query the crawled data
programmatically.
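A minimal Flask sketch, reusing the Excel file and column names assumed in the earlier sketches.

    import pandas as pd
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    df = pd.read_excel("BBC_Articles.xlsx")

    @app.route("/search")
    def search():
        # e.g. GET /search?q=climate returns matching urls and titles as JSON.
        keyword = request.args.get("q", "")
        mask = df["title"].str.contains(keyword, case=False, na=False)
        return jsonify(df[mask][["url", "title"]].to_dict(orient="records"))

    if __name__ == "__main__":
        app.run(debug=True)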
Grading Criteria
1. Task Overview: Design and implement a basic information retrieval system that
processes a small document collection to perform indexing and querying. This task
integrates concepts from Chapters 1–6, including text processing, indexing, and query
handling.
2. Requirements:
a) Corpus Setup:
3. Implementation Details:
1. Use Python and libraries like nltk, numpy, and pandas for processing and
calculations.
2. Store the inverted index in memory or serialize it as a JSON/CSV file.
3. Use modular coding practices for clarity and reuse.
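For item 2, one minimal way to serialize the positional index (term -> {doc_id: positions}) as JSON and load it back; the file name is illustrative.

    import json

    def save_index(index, path="inverted_index.json"):
        # JSON object keys must be strings, so document ids are stringified on the way out.
        serializable = {term: {str(d): pos for d, pos in docs.items()}
                        for term, docs in index.items()}
        with open(path, "w", encoding="utf-8") as f:
            json.dump(serializable, f, indent=2)

    def load_index(path="inverted_index.json"):
        with open(path, encoding="utf-8") as f:
            raw = json.load(f)
        # Restore integer document ids on the way back in.
        return {term: {int(d): pos for d, pos in docs.items()} for term, docs in raw.items()}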
4. Deliverables:
a. Python script(s) with detailed comments.
b. A report including:
i. Description of the dataset.
ii. Steps for text preprocessing and index construction.
iii. Sample queries and results.
iv. Any challenges faced and solutions implemented.
c. Screenshots or logs showing the execution of the system.
Example Input/Output:
5. Testing:
1. Test the IR system with at least 5 different queries.
2. Provide a summary of observations on its performance and accuracy.
6. Requirements:
a. Use a small corpus of 5–10 text documents.
b. Preprocess the text, including:
i. Tokenization and text cleaning.
ii. Removing stopwords.
iii. Applying stemming or lemmatization.
c. Compute the term-document matrix using TF or TF-IDF.
d. Calculate the cosine similarity between all pairs of documents.
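A minimal sketch for requirements b-d, using scikit-learn and seaborn (library choices of this sketch, not requirements of the assignment) on a tiny illustrative corpus.

    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "The search engine ranks documents by relevance.",
        "Relevance ranking is central to every search engine.",
        "Web crawlers download pages for later indexing.",
    ]

    # TfidfVectorizer handles tokenization, lowercasing, and English stopword removal;
    # stemming or lemmatization (e.g. with NLTK) would be an extra preprocessing pass.
    tfidf = TfidfVectorizer(stop_words="english")
    matrix = tfidf.fit_transform(docs)      # document-term TF-IDF matrix
    sim = cosine_similarity(matrix)         # pairwise document similarities

    labels = [f"Doc{i + 1}" for i in range(len(docs))]
    sns.heatmap(sim, annot=True, xticklabels=labels, yticklabels=labels, cmap="Blues")
    plt.title("Document cosine similarity")
    plt.tight_layout()
    plt.savefig("similarity_heatmap.png")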
8. Deliverables:
a. Python script(s) with detailed comments.
b. A report summarizing:
i. Corpus description.
ii. Preprocessing steps.
iii. Results, including the similarity matrix and heatmap visualization.
iv. Key observations or insights.
Example Output:
Key Observations:
- Doc1 and Doc2 are the most similar documents, with a score of 0.85.
- Doc3 is less similar to both Doc1 and Doc2.
Grading Rubric
Submission Guidelines:
1. Submit your assignment in a single PDF document. Include the theoretical answers and
practical exercise outputs.
2. Provide Python code files for practical exercises as separate attachments.
3. Ensure that all code is properly commented and tested.