I) Web Crawling: Yash Pahlani D17B 49
Aim:
1. Write a Python program to crawl a specific website and extract all the URLs
found on the home page.
2. Create a web crawler that collects information from multiple pages of a
website and saves the data in a structured format such as CSV or JSON.
Theory:
I) Web Crawling
Web crawlers use HTTP requests to communicate with web servers and retrieve
web pages, imitating how humans access websites. By extracting Uniform
Resource Locators (URLs) from web pages, crawlers discover new pages to visit,
expanding their exploration.
The data collected by web crawlers needs to be organized for effective use.
Structured formats like CSV and JSON are commonly used. CSV presents data in
rows and columns, while JSON uses key-value pairs for hierarchical organization.
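As a small illustration of the two formats, the snippet below writes the same record (with placeholder field names "url" and "title" chosen for this example) to both a CSV file and a JSON file using Python's standard csv and json modules.

import csv
import json

record = {"url": "https://example.com/page1", "title": "Example Page"}

# CSV: flat rows and columns.
with open("record.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "title"])
    writer.writeheader()
    writer.writerow(record)

# JSON: key-value pairs, which can also hold nested (hierarchical) data.
with open("record.json", "w", encoding="utf-8") as f:
    json.dump(record, f, indent=2)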
II) Working
A web crawler, also known as a web spider or web robot, is a computer program
that systematically browses the World Wide Web, typically for the purpose of
indexing websites for later retrieval. Web crawlers are an essential part of the
internet infrastructure, as they allow search engines to index the vast amount of
information available on the web.
The web crawler starts with a list of seed URLs, which are known websites that it
will crawl first. It then visits each seed URL and extracts all the hyperlinks from
the page. These hyperlinks are then added to the crawler's queue of pages to crawl.
The crawler continues to crawl pages from its queue until it reaches a
predetermined limit, such as the number of pages it can crawl per day or the
amount of time it can spend crawling. Pages that are blocked or otherwise
inaccessible are skipped.
As the crawler crawls pages, it extracts the text, images, and other content from the
pages. It also parses the HTML code of the pages to learn about the structure of the
website. This information is then stored in the crawler's database.
The crawler periodically updates its database with new information about the
websites it has crawled. This information is used by search engines to index
websites and to provide search results to users.
Here are some of the key steps involved in how a web crawler works (a minimal Python sketch of this loop follows the list):
➢ Start with a list of seed URLs: The web crawler starts with a list of known
websites that it will crawl first. These seed URLs can be provided by the
crawler's developer or they can be generated by the crawler itself.
➢ Extract hyperlinks from pages: Once the web crawler visits a seed URL, it
extracts all the hyperlinks from the page. These hyperlinks are then added to
the crawler's queue of pages to crawl.
➢ Crawl pages from the queue: The web crawler continues to crawl pages from
its queue until it reaches a predetermined limit. Pages that are blocked or
otherwise inaccessible are skipped.
➢ Extract content from pages: As the web crawler crawls pages, it extracts the
text, images, and other content from the pages. It also parses the HTML
code of the pages to learn about the structure of the website.
➢ Store information in a database: The web crawler stores the information it
extracts from pages in a database. This information is used by search engines
to index websites and to provide search results to users.
➢ Periodically update the database: The web crawler periodically updates its
database with new information about the websites it has crawled. This
ensures that the search engine's index is always up-to-date.
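The sketch below illustrates these steps in Python, using the requests and BeautifulSoup libraries. The seed URL (https://example.com), the 20-page limit, and the in-memory dictionary standing in for the crawler's database are illustrative assumptions, not part of any particular crawler.

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=20):
    queue = deque(seed_urls)   # pages waiting to be crawled
    visited = set()            # pages already crawled
    store = {}                 # stand-in for the crawler's database

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            continue           # skip blocked or inaccessible pages

        visited.add(url)
        soup = BeautifulSoup(response.text, "html.parser")

        # Extract content (here just the page title) and store it.
        store[url] = soup.title.string if soup.title and soup.title.string else ""

        # Extract hyperlinks and add them to the queue of pages to crawl.
        for link in soup.find_all("a", href=True):
            queue.append(urljoin(url, link["href"]))

    return store

if __name__ == "__main__":
    results = crawl(["https://example.com"])
    print(results)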
Code:
i) Python program to crawl a specific website and extract all the URLs on the
homepage
Code:
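A minimal sketch of such a program, assuming the requests and BeautifulSoup libraries are installed; https://example.com is only a placeholder for the specific website being crawled.

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder for the target website
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Collect every hyperlink on the home page, resolving relative links.
urls = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

for u in urls:
    print(u)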
Output:
ii) Create a web crawler that collects information from multiple pages of a
website and saves the data in a structured format like CSV or JSON
Code:
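A minimal sketch of a multi-page crawler that saves its results to Data.csv, again assuming requests and BeautifulSoup; the seed URL, the 10-page limit, and the recorded fields (URL and page title) are illustrative choices.

import csv
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

seed = "https://example.com"  # placeholder seed URL
queue = deque([seed])
visited = set()
rows = []

# Crawl up to 10 pages, recording each page's URL and <title>.
while queue and len(visited) < 10:
    url = queue.popleft()
    if url in visited:
        continue
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        continue  # skip blocked or inaccessible pages
    visited.add(url)

    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    rows.append({"url": url, "title": title})

    # Queue up the hyperlinks found on this page.
    for a in soup.find_all("a", href=True):
        queue.append(urljoin(url, a["href"]))

# Save the collected data in structured (CSV) form.
with open("Data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "title"])
    writer.writeheader()
    writer.writerows(rows)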
Output:
Data.csv
Conclusion:
Web crawlers are essential to the internet infrastructure. They allow search engines
to index the vast amount of information available on the web, making it possible
for users to find the information they need quickly and easily. Web crawlers are
also used for a variety of other purposes, such as collecting data from websites,
monitoring websites for changes, and analyzing website traffic.