Web Crawling - Python
Chapter - 1
Python Fundamentals
Chapter - 2
Python File Handling and Exception Handling
Chapter - 3
Python Object-Oriented Programming
Chapter - 4
Interaction with Web Pages Using Python
Chapter - 4.2
Source
https://medium.com/geekculture/web-scraping-with-python-a-complete-step-by-step-guide-code-5174e52340ea
Uses of Web Scraping
Web scraping has countless applications for both
personal and commercial needs. Every organization
or person has unique requirements when it comes to
data collection.
Source
https://www.webharvy.com/articles/web-scraper-use-cases.html
How Do Web Scrapers Work?
Source
https://avinetworks.com/glossary/web-scraping/
Basic Concepts
Tools and Libraries
• requests: A Python library for sending HTTP
requests and handling responses.
• BeautifulSoup: A Python library for parsing HTML
and XML documents. It provides methods to
navigate and search the parse tree.
• lxml: A Python library for parsing and processing
XML and HTML documents, known for its speed.
• Selenium: A tool for automating web browsers. It is
used for scraping dynamic content loaded by
JavaScript.
• Scrapy: A comprehensive web scraping framework
that provides tools for extracting, processing, and
storing data.
Steps in Web Scraping
1. Identify the Data: Determine what data you need
and locate where it is on the website.
2. Inspect the Web Page: Use browser developer
tools to inspect the HTML structure of the page and
identify the elements containing the data.
3. Send a Request: Use an HTTP request to fetch the
page content.
4. Parse the HTML: Use a parser to process the
HTML and extract data.
5. Extract Data: Find and retrieve the specific pieces
of data you are interested in.
6. Store or Use the Data: Save the extracted data to
a file, database, or use it as needed.
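A minimal sketch tying these six steps together, assuming the public practice site quotes.toscrape.com as the target (a real project would substitute its own URL and selectors):

import csv
import requests
from bs4 import BeautifulSoup

# Steps 1-2 happen in the browser: inspecting the page shows each
# quote and its author live inside div.quote elements.

# Step 3: send a request to fetch the page content
response = requests.get("https://quotes.toscrape.com/")
response.raise_for_status()

# Step 4: parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# Step 5: extract the specific pieces of data
rows = []
for quote in soup.find_all("div", class_="quote"):
    text = quote.find("span", class_="text").get_text(strip=True)
    author = quote.find("small", class_="author").get_text(strip=True)
    rows.append((text, author))

# Step 6: store the extracted data in a CSV file
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["quote", "author"])
    writer.writerows(rows)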
Ethical Considerations
• Check the website's robots.txt file and terms of service before scraping.
• Rate-limit your requests so you do not overload the server.
• Respect copyright and privacy when storing or republishing scraped data.
Source
https://fastercapital.com/topics/ethical-considerations-in-web-scraping.html
Web Requests
Web requests are essential for web scraping and
interacting with websites. When you access a
website, your browser sends a web request to the
server, which responds with the requested resource
(usually an HTML page). Web requests allow you to
interact with web servers programmatically, retrieve
data, and automate certain tasks.
Types of HTTP Requests
• GET: The GET request is used to retrieve data from
the server. When you enter a URL in your browser,
it sends a GET request to fetch the web page.
• POST: The POST request is used to send data to
the server, often when submitting form data on a
web page.
• PUT: The PUT request is used to update existing
data on the server. It sends data to the server to
replace the current representation of the target
resource.
• DELETE: The DELETE request is used to remove
data from the server. It tells the server to delete the
specified resource.
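A short sketch of all four request types using the requests module (introduced in the next section); httpbin.org, a public HTTP testing service, is assumed as the server purely for illustration:

import requests

# GET: retrieve data from the server
r = requests.get("https://httpbin.org/get", params={"q": "python"})
print(r.status_code, r.json()["args"])

# POST: send data to the server, e.g. submitted form fields
r = requests.post("https://httpbin.org/post", data={"name": "Alice"})
print(r.json()["form"])

# PUT: replace the current representation of the target resource
r = requests.put("https://httpbin.org/put", data={"name": "Bob"})
print(r.json()["form"])

# DELETE: ask the server to remove the specified resource
r = requests.delete("https://httpbin.org/delete")
print(r.status_code)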
Using the Requests Module
• The requests module in Python makes it easy to send HTTP requests.
Installing the Requests Module:
• To install the requests module, you can use pip:
pip install requests
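Once installed, a quick check that the module works: fetch a page and inspect the response (example.com is a domain reserved for documentation examples):

import requests

response = requests.get("https://example.com")

print(response.status_code)              # 200 on success
print(response.headers["Content-Type"])  # e.g. text/html; charset=UTF-8
print(response.text[:200])               # first 200 characters of the HTML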
Installing BeautifulSoup
To install BeautifulSoup, you can use pip:
pip install beautifulsoup4
Source
https://stackabuse.com/guide-to-parsing-html-with-beautifulsoup-in-python/
Extracting Data with BeautifulSoup
• BeautifulSoup provides methods like find() and find_all() to search for specific elements.
• For instance, we can use find() to locate a single element and find_all() to retrieve a list of all elements that
match given criteria.
Source
https://www.scrapingdog.com/blog/beautifulsoup-tutorial-web-scraping-with-python/
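A small sketch of both methods, run on an inline HTML snippet invented for illustration:

from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Headlines</h1>
  <a class="story" href="/a">First story</a>
  <a class="story" href="/b">Second story</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element (or None if nothing matches)
print(soup.find("h1").text)                     # Headlines

# find_all() returns a list of every matching element
for link in soup.find_all("a", class_="story"):
    print(link.text, "->", link["href"])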
Lab Activity
• Lab 59. Implementation of parsing HTML using BeautifulSoup and extracting data
from web pages
• Lab 60. Implementation of web requests using the requests module, a user-friendly
tool for sending HTTP requests to access or interact with web content
Introduction to Web Crawling
What is Web Crawling?
• Web crawling is the process of systematically
browsing the World Wide Web, typically for the
purpose of web indexing.
• A web crawler (also known as a spider or bot) starts
with a list of URLs to visit, known as seeds.
Source
https://www.simplilearn.com/what-is-a-web-crawler-article
How Web Crawling Works
Source
https://www.akamai.com/glossary/what-is-a-web-crawler
Key Concepts in Web Crawling
3. Respecting the robots.txt file and rate limiting to avoid overloading the server.
4. The number of levels of links the crawler follows from the seed URL.
Example of Web Crawling Using Python
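A minimal crawler sketch using requests and BeautifulSoup, assuming the public practice site quotes.toscrape.com as the seed URL; it stays on the seed's domain and stops at a fixed depth:

from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_depth=1):
    """Breadth-first crawl from seed_url, up to max_depth link levels."""
    visited = set()
    frontier = [(seed_url, 0)]
    domain = urlparse(seed_url).netloc

    while frontier:
        url, depth = frontier.pop(0)
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        print(f"[depth {depth}] {url}")

        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue  # skip pages that fail to load

        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"])
            if urlparse(next_url).netloc == domain:  # stay on one domain
                frontier.append((next_url, depth + 1))

crawl("https://quotes.toscrape.com/")

Running it prints each visited URL tagged with its depth, starting from the seed.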
Source
https://techjury.net/blog/web-crawling-vs-web-scraping/
Difference between Web Scraping and Web Crawling
• Web scraping extracts specific data from specific pages.
• Web crawling systematically browses many pages by following links, typically to index them or to supply URLs for a scraper to process.
Source
https://techjury.net/blog/web-crawling-vs-web-scraping/
Crawling Websites Using Scrapy or BeautifulSoup
• Both Scrapy and BeautifulSoup are popular for web scraping, but they are used differently:
• Scrapy is a powerful web scraping and crawling framework that makes it easy to extract
structured data. It’s best suited for large-scale scraping projects where you need to scrape
multiple pages or websites.
• BeautifulSoup is a library used for parsing HTML and XML. It's often paired with requests to
make HTTP requests but doesn't have the built-in crawling capabilities of Scrapy.
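For comparison, a minimal Scrapy spider sketch, again assuming quotes.toscrape.com; it yields structured records and follows the "next page" link, the kind of built-in crawling BeautifulSoup lacks:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # yield one structured item per quote on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # follow pagination; Scrapy schedules the new request itself
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Saved as quotes_spider.py, it can be run with: scrapy runspider quotes_spider.py -o quotes.json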
Lab Activity
• Lab 61. To implement a simple web crawler in Python, using requests and
BeautifulSoup
Summary
• Web Scraping: The process of extracting data from
websites.
• BeautifulSoup: A Python library used for parsing HTML
and XML.
• Requests Module: A Python library used for sending
HTTP requests.
• HTTP Requests: GET (fetch data), POST (send data),
PUT (update data), DELETE (remove data).
• Web Crawling: Systematically browsing the web to collect
data.
• Difference: Web scraping focuses on specific data from
specific pages, while web crawling involves browsing
multiple pages and sites.
• BeautifulSoup: Suitable for basic web scraping and
crawling tasks when combined with requests.
QUIZ
Let’s Start
Quiz
1. Which Python module is commonly used for sending
HTTP requests?
a) urllib
b) requests
c) http.client
d) request
Answer: B
requests
Quiz
2. Which of the following is an important consideration when web
scraping?
Answer: C
Website’s scraping policies (robots.txt)
Quiz
3. What is the purpose of the find_all() method in BeautifulSoup?
Answer: B
It finds all elements that match the given criteria.
Quiz
4. Which of the following best describes the term "seed URL" in
web crawling?
Answer: A
A URL where a crawler starts its operation
Quiz
5. What is the main difference between a web scraper and a web
crawler?
Answer: A
A web scraper collects data, a web crawler indexes web pages.
Reference
• https://techjury.net/blog/web-crawling-vs-web-scraping/
• https://soax.com/blog/web-crawling-vs-web-scraping
• https://www.linkedin.com/pulse/http-standard-methods-why-you-should-use-them-mahmoud-mahmoud
• https://medium.com/geekculture/web-scraping-with-python-a-complete-step-by-step-guide-code-5174e52340ea
• https://www.simplilearn.com/what-is-a-web-crawler-article
• https://www.scrapingdog.com/blog/beautifulsoup-tutorial-web-scraping-with-python/
Thank You