Web Crawling - python

This document provides an overview of web scraping and web crawling using Python, focusing on tools like BeautifulSoup and Scrapy. It covers the basics of web scraping, including ethical considerations, HTTP requests, and the differences between scraping and crawling. Additionally, it includes practical lab activities and quizzes to reinforce learning.

Uploaded by

sakshamskill3
Copyright
© All Rights Reserved

Module - 3

Interaction with Web Pages


by Python
Disclaimer
The content is curated from
online/offline resources and used for
educational purposes only.
Chapters for Discussion

Chapter - 1
Python Fundamentals

Chapter - 2
Python File Handling and Exception Handling

Chapter - 3
Python Object-Oriented Programming

Chapter - 4
Interaction with web pages by Python
Chapter – 4.2

Interaction with web pages by Python


Learning Objectives
In this chapter, you will learn to:
• Understand the basics of web scraping, including its
purpose, key concepts, and ethical considerations.
• Apply BeautifulSoup to parse HTML and extract
specific data from web pages effectively.
• Implement web requests to retrieve web page
content needed for web scraping.
• Analyze the differences between web scraping and
web crawling, identifying the appropriate use cases
for each.
• Develop a basic web crawler using Scrapy or
BeautifulSoup to systematically collect data from
websites.
What is Web Scraping?
• Web scraping is also known as web harvesting or web data extraction.
• It is the process of automatically extracting information from websites.
• It involves downloading web pages and extracting the relevant data from them.

Source
https://medium.com/geekculture/web-scraping-with-python-a-complete-step-by-step-guide-code-5174e52340ea
Uses of Web Scraping
Web scraping has countless applications for both
personal and commercial needs. Every organization
or person has unique requirements when it comes to
data collection.

Source
https://www.webharvy.com/articles/web-scraper-use-cases.html
How Do Web Scrapers Work?

Source
https://avinetworks.com/glossary/web-scraping/
Basic Concepts
Tools and Libraries
• requests: A Python library for sending HTTP
requests and handling responses.
• BeautifulSoup: A Python library for parsing HTML
and XML documents. It provides methods to
navigate and search the parse tree.
• lxml: A Python library for parsing and processing
XML and HTML documents, known for its speed.
• Selenium: A tool for automating web browsers. It is
used for scraping dynamic content loaded by
JavaScript.
• Scrapy: A comprehensive web scraping framework
that provides tools for extracting, processing, and
storing data.
Steps in Web Scraping
1. Identify the Data: Determine what data you need
and locate where it is on the website.
2. Inspect the Web Page: Use browser developer
tools to inspect the HTML structure of the page and
identify the elements containing the data.
3. Send a Request: Use an HTTP request to fetch the
page content.
4. Parse the HTML: Use a parser to process the
HTML and extract data.
5. Extract Data: Find and retrieve the specific pieces
of data you are interested in.
6. Store or Use the Data: Save the extracted data to
a file, database, or use it as needed.
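The six steps above can be sketched end to end as follows. This is a minimal illustration, not a definitive implementation: the sample HTML, the `book`, `title` and `price` class names, and the helper names `download`, `extract_books` and `to_csv` are all assumptions made for the example. The demo parses an embedded page so that it runs without network access.

```python
import csv
import io

import requests
from bs4 import BeautifulSoup

# Steps 1-2: the data we want (title and price) and where it lives in the HTML.
# This sample page and its class names are hypothetical.
SAMPLE_HTML = """
<html><body>
  <ul id="books">
    <li class="book"><span class="title">Dune</span><span class="price">9.99</span></li>
    <li class="book"><span class="title">Emma</span><span class="price">4.50</span></li>
  </ul>
</body></html>
"""


def download(url):
    """Step 3: fetch the page content over HTTP (not called in this offline demo)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_books(html):
    """Steps 4-5: parse the HTML and pull out the chosen fields."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("li", class_="book"):
        title = item.find("span", class_="title").get_text(strip=True)
        price = float(item.find("span", class_="price").get_text(strip=True))
        rows.append({"title": title, "price": price})
    return rows


def to_csv(rows):
    """Step 6: serialise the rows as CSV text, ready to be written to a file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


books = extract_books(SAMPLE_HTML)  # for a live page, pass download(url) instead
print(to_csv(books))
```

For a real site, the selectors would come from inspecting the page in the browser's developer tools, as described in step 2.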
Ethical Considerations

Source
https://fastercapital.com/topics/ethical-considerations-in-web-scraping.html
Web Requests
Web requests are essential for web scraping and
interacting with websites. When you access a
website, your browser sends a web request to the
server, which responds with the requested resource
(usually an HTML page). Web requests allow you to
interact with web servers programmatically, retrieve
data, and automate certain tasks.
Types of HTTP Requests
• GET: The GET request is used to retrieve data from
the server. When you enter a URL in your browser,
it sends a GET request to fetch the web page.
• POST: The POST request is used to send data to
the server, often when submitting form data on a
web page.
• PUT: The PUT request is used to update existing
data on the server. It sends a complete new version
of the target resource, which replaces the current
representation.
• DELETE: The DELETE request is used to remove
data from the server. It tells the server to delete the
specified resource.
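The four request types can be compared side by side with the requests library. The sketch below only builds the requests and does not send them, so it runs offline. The endpoint `https://api.example.com/books` and the payload fields are hypothetical placeholders.

```python
import json

import requests

BASE = "https://api.example.com/books"  # hypothetical endpoint, never contacted here

# Request(...).prepare() builds the exact HTTP request without sending it.
prepared = {
    "GET": requests.Request("GET", BASE, params={"author": "Austen"}).prepare(),
    "POST": requests.Request("POST", BASE, data={"title": "Emma"}).prepare(),
    "PUT": requests.Request(
        "PUT", f"{BASE}/1", json={"title": "Emma", "year": 1815}
    ).prepare(),
    "DELETE": requests.Request("DELETE", f"{BASE}/1").prepare(),
}

for method, request in prepared.items():
    # GET carries its data in the URL; POST and PUT carry it in the body.
    print(method, request.url, request.body)
```

To actually send a request, the shortcuts `requests.get()`, `requests.post()`, `requests.put()` and `requests.delete()` accept the same `params`, `data` and `json` arguments.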
Using the Requests Module
• The requests module in Python makes it easy to send HTTP requests.
Installing the Requests Module:
• To install the requests module, you can use pip:

pip install requests


Example in Python
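A minimal sketch of a GET request with requests, assuming a placeholder URL and a hypothetical helper named `fetch_html`. The `get` parameter and the `User-Agent` value are illustrative choices that make the function easy to try without a network connection.

```python
import requests


def fetch_html(url, timeout=10.0, get=requests.get):
    """Return the page body as text, or None if the request fails.

    `get` defaults to requests.get; it is a parameter only so that the
    function can be exercised with a stand-in during testing.
    """
    try:
        response = get(
            url,
            timeout=timeout,  # avoid hanging forever on a slow server
            headers={"User-Agent": "course-demo/0.1"},  # hypothetical identifier
        )
        response.raise_for_status()  # raise an error for 4xx/5xx status codes
    except requests.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.text


# Typical use (requires network access):
#     html = fetch_html("https://example.com/")
#     if html is not None:
#         print(html[:200])
```

Setting a timeout and checking the status code keeps a scraper from silently processing error pages.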
Parsing HTML using BeautifulSoup
What is BeautifulSoup?
BeautifulSoup is a Python library used for parsing HTML and XML documents. It creates a parse
tree that can be used to extract data from HTML.

Installing BeautifulSoup
To install BeautifulSoup, you can use pip:

pip install beautifulsoup4

Source
https://stackabuse.com/guide-to-parsing-html-with-beautifulsoup-in-python/
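As a small illustration of the parse tree, the sketch below parses a made-up HTML string with Python's built-in `html.parser` backend and walks from one node to another. The HTML content is an assumption chosen for the example.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Tags can be reached as attributes of the tree.
print(soup.title.string)  # Demo

paragraph = soup.body.p
print(paragraph.get_text())  # Hello world

# Every node knows its parent and its children.
print(paragraph.b.parent.name)  # p
print([child.name for child in soup.html.children])  # ['head', 'body']
```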
Extracting Data with BeautifulSoup
• BeautifulSoup provides methods like find() and find_all() to search for specific elements.
• For instance, we can use find() to locate the first matching element and find_all() to retrieve a list of all
elements that match the given criteria.
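The difference between the two methods can be seen on a small, made-up HTML fragment. The tag names, the `course` class and the link targets are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

html = """
<div id="main">
  <h1>Course Catalogue</h1>
  <a class="course" href="/python">Python</a>
  <a class="course" href="/sql">SQL</a>
  <a href="/about">About</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1")  # first <h1> element
first_link = soup.find("a")  # first <a> only, even though there are three
course_links = soup.find_all("a", class_="course")  # list of every match
missing = soup.find("table")  # find() returns None when nothing matches

print(heading.get_text())
print(first_link["href"])
print([a["href"] for a in course_links])
print(missing)
```

Because find() returns None when there is no match, it is worth checking the result before calling methods on it.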

Source
https://www.scrapingdog.com/blog/beautifulsoup-tutorial-web-scraping-with-python/
Lab Activity

• Lab 59. Implementation of Parsing HTML using BeautifulSoup and Extracting data
from web pages
• Lab 60. Implementation of web requests using requests, a user-friendly tool for sending
HTTP requests to access or interact with web content
Introduction to Web Crawling
What is Web Crawling?
• Web crawling is the process of systematically
browsing the World Wide Web, typically for the
purpose of web indexing.
• A web crawler (also known as a spider or bot) starts
with a list of URLs to visit, known as seeds.

Source
https://www.simplilearn.com/what-is-a-web-crawler-article
How Web Crawling Works

Source
https://www.akamai.com/glossary/what-is-a-web-crawler
Key Concepts in Web Crawling

1. Seed URLs: The starting points for a web crawler.
2. Frontier: The list of URLs to be crawled.
3. Politeness Policy: Respecting the robots.txt file and rate limiting to avoid overloading the server.
4. Depth: The number of levels of links the crawler follows from the seed URL.
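A politeness policy can be sketched with the standard library's `urllib.robotparser`. The robots.txt content, the bot name `course-demo-bot` and the helper `wait_time` below are hypothetical; the rules are parsed from a string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a live crawler would instead call
# parser.set_url("https://example.com/robots.txt") followed by parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

AGENT = "course-demo-bot"  # made-up crawler name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def wait_time(robots, agent, default=1.0):
    """Seconds to pause between requests: the site's Crawl-delay, else a default."""
    delay = robots.crawl_delay(agent)
    return default if delay is None else float(delay)


print(parser.can_fetch(AGENT, "https://example.com/index.html"))
print(parser.can_fetch(AGENT, "https://example.com/private/data.html"))
print(wait_time(parser, AGENT))
```

A crawler would call `time.sleep(wait_time(parser, AGENT))` between requests and skip any URL for which `can_fetch()` returns False.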
Example on Web Crawling Using Python
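A minimal sketch of a breadth-first crawler built on BeautifulSoup, showing seed URLs, the frontier and a depth limit. To keep it runnable offline, the pages come from a hypothetical in-memory dictionary named `PAGES`; the function names `fetch` and `crawl` are also assumptions.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin

from bs4 import BeautifulSoup

# Hypothetical in-memory "website" standing in for real HTTP responses.
PAGES = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/c">C</a> <a href="/">Home</a>',
    "https://example.com/b": '<a href="https://example.com/c">C</a>',
    "https://example.com/c": '<a href="/d">D</a>',
    "https://example.com/d": "<p>No links here</p>",
}


def fetch(url):
    """Return the HTML for a URL, or None if the page does not exist."""
    return PAGES.get(url)


def crawl(seeds, max_depth=2, fetch_page=fetch):
    """Visit pages breadth-first, following links up to max_depth levels."""
    frontier = deque((url, 0) for url in seeds)  # URLs waiting to be crawled
    seen = set(seeds)  # avoids queueing the same URL twice
    visited = []
    while frontier:
        url, depth = frontier.popleft()
        html = fetch_page(url)
        if html is None:
            continue
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        soup = BeautifulSoup(html, "html.parser")
        for anchor in soup.find_all("a", href=True):
            # Resolve relative links and drop any #fragment part.
            link, _fragment = urldefrag(urljoin(url, anchor["href"]))
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited


print(crawl(["https://example.com/"], max_depth=2))
```

For a live site, `fetch_page` could be replaced by a function that calls `requests.get()`, ideally with a delay between requests and a robots.txt check.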


Source
https://techjury.net/blog/web-crawling-vs-web-scraping/
Difference between Web Scraping and Web Crawling

Source
https://techjury.net/blog/web-crawling-vs-web-scraping/
Crawling Websites Using Scrapy or BeautifulSoup
• Both Scrapy and BeautifulSoup are popular for web scraping, but they are used differently:
• Scrapy is a powerful web scraping and crawling framework that makes it easy to extract
structured data. It’s best suited for large-scale scraping projects where you need to scrape
multiple pages or websites.
• BeautifulSoup is a library used for parsing HTML and XML. It's often paired with requests to
make HTTP requests but doesn't have the built-in crawling capabilities of Scrapy.
Lab Activity

• Lab 61. To implement a simple web crawler in Python, using requests and
BeautifulSoup
Summary
• Web Scraping: The process of extracting data from
websites.
• BeautifulSoup: A Python library used for parsing HTML
and XML.
• Requests Module: A Python library used for sending
HTTP requests.
• HTTP Requests: GET (fetch data), POST (send data),
PUT (update data), DELETE (remove data).
• Web Crawling: Systematically browsing the web to collect
data.
• Difference: Web scraping focuses on specific data from
specific pages, while web crawling involves browsing
multiple pages and sites.
• BeautifulSoup: Suitable for basic web scraping and
crawling tasks when combined with requests.
QUIZ
Let’s Start
Quiz
1. Which Python module is commonly used for sending
HTTP requests?

a) urllib
b) requests
c) http.client
d) request

Answer: B
requests
Quiz
2. Which of the following is an important consideration when web
scraping?

a) The color of the website


b) The size of the images on the website
c) Website’s scraping policies (robots.txt)
d) The font used on the website

Answer: C
Website’s scraping policies (robots.txt)
Quiz
3. What is the purpose of the find_all() method in BeautifulSoup?

a) It finds the first element that matches the given criteria.


b) It finds all elements that match the given criteria.
c) It removes all elements that match the given criteria.
d) It replaces all elements that match the given criteria.

Answer: B
It finds all elements that match the given criteria.
Quiz
4. Which of the following best describes the term "seed URL" in
web crawling?

a) A URL where a crawler starts its operation


b) A URL that a crawler skips
c) A URL that is dynamically generated
d) A URL that a crawler uses for authentication

Answer: A
A URL where a crawler starts its operation
Quiz
5. What is the main difference between a web scraper and a web
crawler?

a) A web scraper collects data, a web crawler indexes web pages.


b) A web scraper indexes web pages, a web crawler collects data.
c) A web scraper is faster than a web crawler.
d) A web scraper follows links, a web crawler does not.

Answer: A
A web scraper collects data, a web crawler indexes web pages.
Reference
• https://techjury.net/blog/web-crawling-vs-web-scraping/
• https://soax.com/blog/web-crawling-vs-web-scraping
• https://www.linkedin.com/pulse/http-standard-methods-why-you-should-use-them-mahmoud-mahmoud
• https://medium.com/geekculture/web-scraping-with-python-a-complete-step-by-step-guide-code-5174e52340ea
• https://www.simplilearn.com/what-is-a-web-crawler-article
• https://www.scrapingdog.com/blog/beautifulsoup-tutorial-web-scraping-with-python/
Thank You