Detailed IR Document 3
Web search engines use crawlers (also called spiders or bots) to systematically browse and
index the web. Crawling involves fetching web pages and following links to discover new
content. The indexed pages are then used to serve search results. Challenges in crawling
include dealing with duplicate content, dynamic pages, and ensuring coverage. Effective
crawling is essential for comprehensive and up-to-date IR systems.
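A minimal breadth-first crawler sketch, assuming the requests and beautifulsoup4 packages; the seed URL and page limit are placeholders:
```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=10):
    """Fetch pages starting from seed_url, following links breadth-first."""
    seen = {seed_url}          # simple duplicate-URL check
    frontier = deque([seed_url])
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue  # skip unreachable pages
        soup = BeautifulSoup(response.text, "html.parser")
        pages[url] = soup.get_text()  # store extracted text for indexing
        for link in soup.find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```
The text extracted from crawled pages is then indexed and can be represented as document vectors. The example below uses scikit-learn's TF-IDF weighting to build such vectors for a toy corpus.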
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy documents to represent as TF-IDF vectors
docs = ["IR is fun", "Information retrieval is important"]

# Learn the vocabulary and compute a TF-IDF weight for each term in each document
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
print(X.toarray())
```
Each row is a document vector, and cosine similarity between vectors can be used to measure relevance. The next example builds another small corpus and scores the first document against every document in it.
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A small corpus; the first document is used as the "query" below
docs = ["information retrieval", "retrieval of data", "search engine"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
print(X.toarray())
```
Calculating cosine similarity:
```python
from sklearn.metrics.pairwise import cosine_similarity

# Compare the first document's vector against every vector in the corpus
similarity = cosine_similarity(X[0:1], X)
print("Similarity scores:", similarity)
```
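To produce a ranking, the documents can be sorted by their similarity score in descending order; a minimal sketch, reusing the docs list and similarity matrix computed above (numpy is assumed):
```python
import numpy as np

# Indices of documents ordered by decreasing similarity to the first document
ranking = np.argsort(-similarity[0])
for rank, idx in enumerate(ranking, start=1):
    print(rank, docs[idx], similarity[0][idx])
```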
This model supports partial matching and ranked retrieval. Its main limitation is that it treats terms as independent, ignoring semantic relationships between words. Extensions such as latent semantic indexing (LSI) and word embeddings address this issue.
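One way to approximate LSI is to apply a truncated SVD to the TF-IDF matrix; a minimal sketch, assuming the X from the corpus above and an arbitrary choice of two latent dimensions:
```python
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Project the TF-IDF vectors into a 2-dimensional latent semantic space
svd = TruncatedSVD(n_components=2)
X_lsi = svd.fit_transform(X)

# Similarities in the latent space can reflect term co-occurrence patterns
# that the raw term vectors treat as unrelated
print(cosine_similarity(X_lsi[0:1], X_lsi))
```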
The vector space model is widely used in practical IR applications such as search engines
and document classification systems.
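As an illustration of the classification use case, TF-IDF vectors can be fed to a standard classifier; a minimal sketch, using a hypothetical two-class training set and scikit-learn's logistic regression:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled documents: 1 = about retrieval, 0 = not
train_docs = [
    "information retrieval systems",
    "search engine indexing",
    "cooking pasta recipes",
    "gardening tips for spring",
]
train_labels = [1, 1, 0, 0]

# Chain TF-IDF vectorization with a logistic regression classifier
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_docs, train_labels)

print(classifier.predict(["web search and retrieval"]))
```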