
Activity-based Project Report on

Artificial Intelligence Project
Module - I

Submitted to Vishwakarma University, Pune
Under the Initiative of

Contemporary Curriculum, Pedagogy, and Practice (C2P2)


BY:
NAME: OJAS KHAWAS
ROLL NO.: 10
SRN: 202100264
DIV: C
Third Year Engineering

Faculty Incharge: Prof. N. Z. Tarapore
Date of Project 1:

Department of Computer Engineering
Faculty of Science and Technology

Academic Year 2023-2024, Term II


REPORT
Web Crawling and Page Indexing using Breadth-First Search

1. Introduction
Web crawling and page indexing are fundamental processes in web search engines, enabling
the discovery and retrieval of information from the vast expanse of the World Wide Web.
Breadth-first search (BFS) is a popular crawling strategy because it explores linked pages
systematically, level by level, starting from a set of seed pages.
This report aims to provide a detailed overview of web crawling and page indexing using
BFS.

2. Web Crawling
Web crawling, also known as web spidering, is the process of systematically browsing the
web to gather information from its pages.
It involves fetching and analyzing web pages, following hyperlinks, and extracting relevant
data for indexing or other purposes; web scraping refers more narrowly to the data-extraction
step. Web crawlers, also called spiders or bots, are automated programs designed to perform this task.
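
As a concrete illustration of the fetch-and-extract step, the sketch below downloads one page
and collects its hyperlinks using only the Python standard library. The names LinkExtractor
and fetch_links are invented for this example and are not part of any particular crawler.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    # Collects the href target of every <a> tag encountered on a page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch_links(url):
    # Fetch the page and return the absolute URLs of all hyperlinks on it.
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(url, href) for href in parser.links]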

2.1 Breadth-First Search (BFS)


Breadth-first search is a graph traversal algorithm that systematically explores all the nodes
of a graph level by level. In the context of web crawling, BFS is used to explore the web
graph, where web pages are represented as nodes and hyperlinks as edges.
BFS starts from a given web page (or set of pages), known as the seed URLs, and
systematically explores all reachable pages in a breadth-first manner.

2.1.1 Breadth-First Search Algorithm


1. Initialization:
• Create a queue to store the nodes waiting to be visited.
• Create a set (or array) to keep track of visited nodes.
• Enqueue the starting node (or nodes) into the queue and mark it as visited.
2. Exploration Loop:
• While the queue is not empty:
• Dequeue a node from the queue.
• Process the node (e.g., print its value or perform other operations).
• Enqueue every unvisited neighboring node of the dequeued node into the queue,
marking each one as visited so that no node is enqueued twice.
3. Termination:
• The loop ends when the queue is empty, i.e., when every node reachable from the
start node has been visited.

2.1.2 Breadth-First Search Pseudocode (Python)


from collections import deque

def bfs(graph, start_node):
    # Queue of nodes waiting to be explored and a set of nodes already seen.
    queue = deque([start_node])
    visited = {start_node}
    while queue:
        current_node = queue.popleft()
        process(current_node)
        # Enqueue every neighboring node that has not been seen yet.
        for neighbor in graph[current_node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)

def process(node):
    print(node)
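
As a quick illustration, the function can be run on a small graph represented as an adjacency
dictionary (the graph below is invented for this example):

graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}
bfs(graph, "A")  # visits and prints A, B, C, D level by level
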
2.2 Steps Involved in BFS Web Crawling
1. Seed URL Selection: The process begins with selecting a set of seed URLs, typically
starting points from which the web crawler begins its exploration.
2. URL Frontier: A queue data structure, known as the URL frontier, is used to store
URLs waiting to be crawled. The seed URLs are initially placed in this queue.
3. Crawling: The crawler dequeues URLs from the frontier, fetches the corresponding
web pages, and extracts relevant information. It then parses the HTML content to
discover hyperlinks, which are added to the URL frontier for subsequent crawling.
4. Duplicate URL Detection: To avoid revisiting the same URLs multiple times, the
crawler maintains a list of visited URLs and checks for duplicates before enqueueing
URLs into the frontier.
5. Content Processing: Extracted content from crawled web pages is processed and
may undergo filtering, normalization, or other preprocessing steps based on the
requirements of the indexing system.
6. Indexing: The extracted data is indexed, i.e., organized and stored in a searchable
format. This indexing facilitates efficient retrieval of relevant information in response
to user queries.
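
A minimal sketch of steps 1-4 above, reusing the hypothetical LinkExtractor helper from
Section 2; the max_pages limit is an assumption added only to keep the example bounded:

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed_urls, max_pages=50):
    # Breadth-first crawl: fetch each page once, enqueue links not seen before.
    frontier = deque(seed_urls)   # URL frontier (FIFO queue)
    visited = set(seed_urls)      # duplicate-URL detection
    pages = {}                    # url -> raw HTML, kept for later indexing
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue              # unreachable or broken page: skip it
        pages[url] = html
        parser = LinkExtractor()  # hyperlink extractor sketched in Section 2
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return pages

The returned dictionary of fetched pages is the kind of input an indexing stage, such as the
one described in Section 3, could consume.
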
3. Page Indexing
Page indexing is the process of creating an organized database of web pages and their
associated content to enable efficient search and retrieval. It involves parsing the content of
web pages, extracting relevant information, and storing it in a structured format for quick
access.

3.1 Indexing Techniques


Several indexing techniques can be employed to organize and store the information extracted
from web pages. These include:
• Inverted Indexing: This technique maps terms to the documents/pages in which they
appear, enabling efficient full-text search.
• Keyword Indexing: Keywords or key phrases extracted from web pages are indexed
to facilitate keyword-based searches.
• Metadata Indexing: Metadata such as title, author, publication date, and other
attributes are indexed for more refined search capabilities.
• Anchor Text Indexing: Anchor text extracted from hyperlinks can be indexed to
enhance the relevance of search results.
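
To make the first technique concrete, the sketch below builds a toy inverted index over the
pages gathered by the crawler above. The tokenization (lower-cased alphanumeric runs) is a
simplifying assumption, and build_inverted_index and search are invented names.

import re
from collections import defaultdict

def build_inverted_index(pages):
    # Map each term to the set of URLs whose text contains it.
    index = defaultdict(set)
    for url, text in pages.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(url)
    return index

def search(index, query):
    # Return the pages containing every query term (simple AND semantics).
    term_sets = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()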

3.2 Challenges in Page Indexing


• Scalability: Indexing a large number of web pages efficiently poses scalability
challenges, requiring distributed indexing systems and optimized algorithms.
• Freshness: Maintaining up-to-date indexes in the face of dynamic web content
necessitates continuous crawling and indexing processes.
• Quality and Relevance: Ensuring the quality and relevance of indexed content is
crucial for providing accurate search results. This involves addressing issues such as
spam, duplicates, and low-quality content.
• Multimedia Content: Indexing multimedia content such as images, videos, and audio
files requires specialized techniques beyond text-based indexing.

4. Conclusion
Web crawling and page indexing play vital roles in enabling efficient search and retrieval of
information from the web. Breadth-first search (BFS) is a widely used algorithm for web
crawling due to its systematic exploration of the web graph. Page indexing involves parsing
and organizing web content to create searchable indexes, employing various techniques such
as inverted indexing, keyword indexing, and metadata indexing. Despite challenges such as
scalability and maintaining index freshness, advancements in technology continue to improve
the effectiveness and efficiency of web crawling and indexing systems.
