
Yash Pahlani D17B 49

Aim:

1. Write a Python program to crawl a specific website and extract all the URLs
found on the home page.
2. Create a web crawler that collects information from multiple pages of a
website and saves the data in a structured format like CSV or JSON.

Theory:

I) Web Crawling

Web crawling is an automated process that explores websites and gathers
information from them. A crawler acts like a digital explorer, navigating the
internet by following links and collecting data from web pages. The technique is
closely related to web scraping, which focuses on extracting data from the pages a
crawler visits, and it is crucial for tasks like building search engine indexes,
monitoring content changes, and gathering data for analysis.

Web crawlers use HTTP requests to communicate with web servers and retrieve
web pages, imitating how humans access websites. By extracting Uniform
Resource Locators (URLs) from web pages, crawlers discover new pages to visit,
expanding their exploration.
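
For instance, the retrieval step can be sketched with the requests library (an assumed dependency; the URL and User-Agent string below are placeholders used purely for illustration):

import requests

# Placeholder URL and User-Agent string for the example
url = "https://example.com/"
headers = {"User-Agent": "SimpleCrawler/1.0 (educational example)"}

# The crawler fetches the page over HTTP, much as a browser does when a
# person visits the site
response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)   # 200 means the page was retrieved successfully
print(response.text[:200])    # first 200 characters of the returned HTML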

The data collected by web crawlers needs to be organized for effective use.
Structured formats like CSV and JSON are commonly used. CSV presents data in
rows and columns, while JSON uses key-value pairs for hierarchical organization.
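
As a small illustration, the same records can be written in both formats with Python's built-in csv and json modules (the field names and values here are invented for the example):

import csv
import json

# A few illustrative records, as a crawler might collect them
pages = [
    {"url": "https://example.com/", "title": "Example Domain"},
    {"url": "https://example.com/about", "title": "About"},
]

# CSV: one row per page, one column per field
with open("pages.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "title"])
    writer.writeheader()
    writer.writerows(pages)

# JSON: the same data as a list of key-value objects
with open("pages.json", "w", encoding="utf-8") as f:
    json.dump(pages, f, indent=2)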

II) Working

A web crawler, also known as a web spider or web robot, is a computer program
that systematically browses the World Wide Web, typically for the purpose of
indexing websites for later retrieval. Web crawlers are an essential part of the
internet infrastructure, as they allow search engines to index the vast amount of
information available on the web.

Here is a diagram of how a web crawler works:

The web crawler starts with a list of seed URLs, which are known websites that it
will crawl first. It then visits each seed URL and extracts all the hyperlinks from
the page. These hyperlinks are then added to the crawler's queue of pages to crawl.

The crawler continues to crawl pages from its queue until it reaches a
predetermined limit, such as the number of pages it can crawl per day or the
amount of time it can spend crawling. It also stops crawling pages if it encounters a
page that is blocked or that is not accessible.

As the crawler crawls pages, it extracts the text, images, and other content from the
pages. It also parses the HTML code of the pages to learn about the structure of the
website. This information is then stored in the crawler's database.

The crawler periodically updates its database with new information about the
websites it has crawled. This information is used by search engines to index
websites and to provide search results to users.

Here are some of the key steps involved in how a web crawler works (a minimal code sketch follows the list):

➢ Start with a list of seed URLs: The web crawler starts with a list of known
websites that it will crawl first. These seed URLs can be provided by the
crawler's developer or they can be generated by the crawler itself.
➢ Extract hyperlinks from pages: Once the web crawler visits a seed URL, it
extracts all the hyperlinks from the page. These hyperlinks are then added to
the crawler's queue of pages to crawl.
➢ Crawl pages from the queue: The web crawler continues to crawl pages from
its queue until it reaches a predetermined limit. It also stops crawling pages
if it encounters a page that is blocked or that is not accessible.
➢ Extract content from pages: As the web crawler crawls pages, it extracts the
text, images, and other content from the pages. It also parses the HTML
code of the pages to learn about the structure of the website.
➢ Store information in a database: The web crawler stores the information it
extracts from pages in a database. This information is used by search engines
to index websites and to provide search results to users.
➢ Periodically update the database: The web crawler periodically updates its
database with new information about the websites it has crawled. This
ensures that the search engine's index is always up-to-date.
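
A minimal sketch of this crawl loop, assuming the requests and BeautifulSoup (bs4) libraries and a placeholder seed URL, is shown below; it maintains a queue of pages to visit, a set of pages already crawled, and a simple in-memory stand-in for the crawler's database:

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

seed_urls = ["https://example.com/"]   # placeholder seed list
max_pages = 20                         # predetermined crawl limit

queue = deque(seed_urls)   # pages waiting to be crawled
visited = set()            # pages already crawled
database = {}              # simple stand-in for the crawler's database

while queue and len(visited) < max_pages:
    url = queue.popleft()
    if url in visited:
        continue
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        continue                       # skip blocked or inaccessible pages
    visited.add(url)

    soup = BeautifulSoup(response.text, "html.parser")
    # Store the extracted content (here, just the page title and text)
    database[url] = {
        "title": soup.title.string if soup.title else "",
        "text": soup.get_text(" ", strip=True),
    }
    # Extract hyperlinks and add them to the queue of pages to crawl
    for link in soup.find_all("a", href=True):
        queue.append(urljoin(url, link["href"]))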

III) Difficulties in Web Crawling:

➢ Variability in Website Structures: Websites vary in design and structure,
making uniform data extraction challenging.
➢ Dynamic Content Loading: Asynchronous loading of content can
complicate capturing data effectively.
➢ Handling Large Data Volumes: The sheer volume of internet data requires
efficient storage and management.
➢ Changing URLs and Redirects: URL changes and redirections must be
managed for accurate data retrieval.
➢ Robots.txt and Crawl Restrictions: Crawlers must adhere to websites'
"robots.txt" rules to avoid prohibited areas.
➢ Legal and Ethical Concerns: Data privacy, intellectual property rights, and
terms of use need consideration.

➢ Server Responses and Error Handling: Robust error handling is essential to
manage server errors and timeouts (a short sketch follows this list).
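
To illustrate the robots.txt and error-handling points above, the fragment below (a sketch assuming the requests library and a placeholder URL) checks robots.txt with Python's urllib.robotparser and wraps the request in basic error handling:

from urllib import robotparser

import requests

url = "https://example.com/some-page"   # placeholder URL

# Respect robots.txt before fetching the page
rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()
if not rp.can_fetch("*", url):
    print("Crawling this page is disallowed by robots.txt")
else:
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()        # raise on 4xx/5xx responses
    except requests.Timeout:
        print("Request timed out")
    except requests.RequestException as err:
        print(f"Request failed: {err}")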

IV) Examples of web crawlers


Most popular search engines have their own web crawlers that use a specific algorithm to
gather information about webpages. Web crawler tools can be desktop- or cloud-based.
Some examples of web crawlers used for search engine indexing include the following:
● Amazonbot is Amazon's web crawler.
● Bingbot is Microsoft's search engine crawler for Bing.
● DuckDuckBot is the crawler for the search engine DuckDuckGo.
● Googlebot is the crawler for Google's search engine.
● Yahoo Slurp is the crawler for Yahoo's search engine.
● Yandex Bot is the crawler for the Yandex search engine.

Code:
i) Python program to crawl a specific website and extract all the URLs found on
the home page
Code:
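A minimal version of such a program, assuming the requests and BeautifulSoup libraries and a placeholder home-page URL, might look like the following:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

homepage = "https://example.com/"   # placeholder home page

# Fetch the home page and parse its HTML
response = requests.get(homepage, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Collect every hyperlink, converting relative links to absolute URLs
urls = {urljoin(homepage, a["href"]) for a in soup.find_all("a", href=True)}

for url in sorted(urls):
    print(url)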

Output:

ii) Web crawler that collects information from multiple pages of a website and
saves the data in a structured format like CSV or JSON
Code:
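A minimal version of such a crawler, assuming the requests and BeautifulSoup libraries, a placeholder starting URL, and CSV as the output format, might look like the following:

import csv
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

start_url = "https://example.com/"    # placeholder starting page
domain = urlparse(start_url).netloc
max_pages = 10

queue = deque([start_url])
visited = set()
rows = []

while queue and len(visited) < max_pages:
    url = queue.popleft()
    if url in visited:
        continue
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        continue                       # skip inaccessible pages
    visited.add(url)

    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    rows.append({"url": url, "title": title})

    # Follow only links that stay on the same website
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if urlparse(link).netloc == domain:
            queue.append(link)

# Save the collected data in a structured CSV file
with open("Data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "title"])
    writer.writeheader()
    writer.writerows(rows)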

Output:

Data.csv

Conclusion:
Web crawlers are essential to the internet infrastructure. They allow search engines
to index the vast amount of information available on the web, making it possible
for users to find the information they need quickly and easily. Web crawlers are
also used for a variety of other purposes, such as collecting data from websites,
monitoring websites for changes, and analyzing website traffic.
