
Activity-based Project Report on

Artificial Intelligence Project
Module - I

Submitted to Vishwakarma University, Pune
Under the Initiative of

Contemporary Curriculum, Pedagogy, and Practice (C2P2)


BY:
NAME: OJAS KHAWAS
ROLL NO.: 10
SRN: 202100264
DIV: C
Third Year Engineering

Faculty Incharge: Prof. N. Z. Tarapore
Date of Project 1:

Department of Computer Engineering
Faculty of Science and Technology

Academic Year 2023-2024, Term II


REPORT
Web Crawling and Page Indexing using Breadth-First Search

1. Introduction
Web crawling and page indexing are fundamental processes in web search engines, enabling
the discovery and retrieval of information from the vast expanse of the World Wide Web.
Breadth-first search (BFS) is a popular crawling strategy because it explores linked pages
systematically, level by level, starting from a set of seed pages.
This report aims to provide a detailed overview of web crawling and page indexing using
BFS.

2. Web Crawling
Web crawling, also known as web spidering, is the process of systematically browsing the
web to gather information from its pages.
It involves fetching and analyzing web pages, following hyperlinks, and extracting relevant
data for indexing or other purposes; web scraping refers more narrowly to the data-extraction
step. Web crawlers, also called spiders or bots, are automated programs designed to perform this task.
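
As a concrete illustration of the fetch-and-extract step, the sketch below downloads one page
and collects its hyperlinks using only the Python standard library. The names LinkExtractor
and fetch_links are invented for this example and are not part of any particular crawler.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    # Collects the href target of every <a> tag encountered on a page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch_links(url):
    # Fetch the page and return the absolute URLs of all hyperlinks on it.
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(url, href) for href in parser.links]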

2.1 Breadth-First Search (BFS)


Breadth-first search is a graph traversal algorithm that systematically explores all the nodes
of a graph level by level. In the context of web crawling, BFS is used to explore the web
graph, where web pages are represented as nodes and hyperlinks as edges.
BFS starts from a given web page (or set of pages), known as the seed URLs, and
systematically explores all reachable pages in a breadth-first manner.

2.1.1 Breadth-First Search Algorithm


1. Initialization:
• Create a queue to store the nodes waiting to be visited.
• Create a set (or array) to keep track of visited nodes.
• Enqueue the starting node (or nodes) into the queue and mark it as visited.
2. Exploration Loop:
• While the queue is not empty:
• Dequeue a node from the queue.
• Process the node (e.g., print its value or perform other operations).
• Enqueue every unvisited neighboring node of the dequeued node into the queue,
marking each one as visited so that no node is enqueued twice.
3. Termination:
• The loop ends when the queue is empty, i.e., when every node reachable from the
start node has been visited.

2.1.2 Breadth-First Search Pseudocode (Python)


from collections import deque

def bfs(graph, start_node):
    # Queue of nodes waiting to be explored and a set of nodes already seen.
    queue = deque([start_node])
    visited = {start_node}
    while queue:
        current_node = queue.popleft()
        process(current_node)
        # Enqueue every neighboring node that has not been seen yet.
        for neighbor in graph[current_node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)

def process(node):
    print(node)
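
As a quick illustration, the function can be run on a small graph represented as an adjacency
dictionary (the graph below is invented for this example):

graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}
bfs(graph, "A")  # visits and prints A, B, C, D level by level
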
2.2 Steps Involved in BFS Web Crawling
1. Seed URL Selection: The process begins with selecting a set of seed URLs, typically
starting points from which the web crawler begins its exploration.
2. URL Frontier: A queue data structure, known as the URL frontier, is used to store
URLs waiting to be crawled. The seed URLs are initially placed in this queue.
3. Crawling: The crawler dequeues URLs from the frontier, fetches the corresponding
web pages, and extracts relevant information. It then parses the HTML content to
discover hyperlinks, which are added to the URL frontier for subsequent crawling.
4. Duplicate URL Detection: To avoid revisiting the same URLs multiple times, the
crawler maintains a list of visited URLs and checks for duplicates before enqueueing
URLs into the frontier.
5. Content Processing: Extracted content from crawled web pages is processed and
may undergo filtering, normalization, or other preprocessing steps based on the
requirements of the indexing system.
6. Indexing: The extracted data is indexed, i.e., organized and stored in a searchable
format. This indexing facilitates efficient retrieval of relevant information in response
to user queries.
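
A minimal sketch of steps 1-4 above, reusing the hypothetical LinkExtractor helper from
Section 2; the max_pages limit is an assumption added only to keep the example bounded:

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed_urls, max_pages=50):
    # Breadth-first crawl: fetch each page once, enqueue links not seen before.
    frontier = deque(seed_urls)   # URL frontier (FIFO queue)
    visited = set(seed_urls)      # duplicate-URL detection
    pages = {}                    # url -> raw HTML, kept for later indexing
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue              # unreachable or broken page: skip it
        pages[url] = html
        parser = LinkExtractor()  # hyperlink extractor sketched in Section 2
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return pages

The returned dictionary of fetched pages is the kind of input an indexing stage, such as the
one described in Section 3, could consume.
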
3. Page Indexing
Page indexing is the process of creating an organized database of web pages and their
associated content to enable efficient search and retrieval. It involves parsing the content of
web pages, extracting relevant information, and storing it in a structured format for quick
access.

3.1 Indexing Techniques


Several indexing techniques can be employed to organize and store the information extracted
from web pages. These include:
• Inverted Indexing: This technique maps terms to the documents/pages in which they
appear, enabling efficient full-text search.
• Keyword Indexing: Keywords or key phrases extracted from web pages are indexed
to facilitate keyword-based searches.
• Metadata Indexing: Metadata such as title, author, publication date, and other
attributes are indexed for more refined search capabilities.
• Anchor Text Indexing: Anchor text extracted from hyperlinks can be indexed to
enhance the relevance of search results.
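
To make the first technique concrete, the sketch below builds a toy inverted index over the
pages gathered by the crawler above. The tokenization (lower-cased alphanumeric runs) is a
simplifying assumption, and build_inverted_index and search are invented names.

import re
from collections import defaultdict

def build_inverted_index(pages):
    # Map each term to the set of URLs whose text contains it.
    index = defaultdict(set)
    for url, text in pages.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(url)
    return index

def search(index, query):
    # Return the pages containing every query term (simple AND semantics).
    term_sets = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()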

3.2 Challenges in Page Indexing


• Scalability: Indexing a large number of web pages efficiently poses scalability
challenges, requiring distributed indexing systems and optimized algorithms.
• Freshness: Maintaining up-to-date indexes in the face of dynamic web content
necessitates continuous crawling and indexing processes.
• Quality and Relevance: Ensuring the quality and relevance of indexed content is
crucial for providing accurate search results. This involves addressing issues such as
spam, duplicates, and low-quality content.
• Multimedia Content: Indexing multimedia content such as images, videos, and audio
files requires specialized techniques beyond text-based indexing.

4. Conclusion
Web crawling and page indexing play vital roles in enabling efficient search and retrieval of
information from the web. Breadth-first search (BFS) is a widely used algorithm for web
crawling due to its systematic exploration of the web graph. Page indexing involves parsing
and organizing web content to create searchable indexes, employing various techniques such
as inverted indexing, keyword indexing, and metadata indexing. Despite challenges such as
scalability and maintaining index freshness, advancements in technology continue to improve
the effectiveness and efficiency of web crawling and indexing systems.
