Web Crawling - Python
Chapter - 1
Python Fundamentals
Chapter - 2
Python File Handling and Exception Handling
Chapter - 3
Python Object-Oriented Programming
Chapter - 4
Interaction with Web Pages Using Python
Chapter - 4.2
Source
https://medium.com/geekculture/web-scraping-with-python-a-complete-step-by-step-guide-code-5174e52340ea
Uses of Web Scraping
Web scraping has countless applications for both
personal and commercial needs. Every organization
or person has unique requirements when it comes to
data collection.
Source
https://www.webharvy.com/articles/web-scraper-use-cases.html
How Do Web Scrapers Work?
Source
https://avinetworks.com/glossary/web-scraping/
Basic Concepts
Tools and Libraries
• requests: A Python library for sending HTTP
requests and handling responses.
• BeautifulSoup: A Python library for parsing HTML
and XML documents. It provides methods to
navigate and search the parse tree.
• lxml: A Python library for parsing and processing
XML and HTML documents, known for its speed.
• Selenium: A tool for automating web browsers. It is
used for scraping dynamic content loaded by
JavaScript.
• Scrapy: A comprehensive web scraping framework
that provides tools for extracting, processing, and
storing data.
Steps in Web Scraping
1. Identify the Data: Determine what data you need
and locate where it is on the website.
2. Inspect the Web Page: Use browser developer
tools to inspect the HTML structure of the page and
identify the elements containing the data.
3. Send a Request: Use an HTTP request to fetch the
page content.
4. Parse the HTML: Use a parser to process the
HTML and extract data.
5. Extract Data: Find and retrieve the specific pieces
of data you are interested in.
6. Store or Use the Data: Save the extracted data to
a file, database, or use it as needed.
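A minimal sketch tying these six steps together, assuming the public practice site quotes.toscrape.com as the target (a real project would substitute its own URL and selectors):

import csv
import requests
from bs4 import BeautifulSoup

# Steps 1-2 happen in the browser: inspecting the page shows each
# quote and its author live inside div.quote elements.

# Step 3: send a request to fetch the page content
response = requests.get("https://quotes.toscrape.com/")
response.raise_for_status()

# Step 4: parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# Step 5: extract the specific pieces of data
rows = []
for quote in soup.find_all("div", class_="quote"):
    text = quote.find("span", class_="text").get_text(strip=True)
    author = quote.find("small", class_="author").get_text(strip=True)
    rows.append((text, author))

# Step 6: store the extracted data in a CSV file
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["quote", "author"])
    writer.writerows(rows)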
Ethical Considerations
• Check the website's robots.txt file and terms of service before scraping.
• Rate-limit your requests so you do not overload the server.
• Respect copyright and privacy when storing or republishing scraped data.
Source
https://fastercapital.com/topics/ethical-considerations-in-web-scraping.html
Web Requests
Web requests are essential for web scraping and
interacting with websites. When you access a
website, your browser sends a web request to the
server, which responds with the requested resource
(usually an HTML page). Web requests allow you to
interact with web servers programmatically, retrieve
data, and automate certain tasks.
Types of HTTP Requests
• GET: The GET request is used to retrieve data from
the server. When you enter a URL in your browser,
it sends a GET request to fetch the web page.
• POST: The POST request is used to send data to
the server, often when submitting form data on a
web page.
• PUT: The PUT request is used to update existing
data on the server. It sends data to the server to
replace the current representation of the target
resource.
• DELETE: The DELETE request is used to remove
data from the server. It tells the server to delete the
specified resource.
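A short sketch of all four request types using the requests module (introduced in the next section); httpbin.org, a public HTTP testing service, is assumed as the server purely for illustration:

import requests

# GET: retrieve data from the server
r = requests.get("https://httpbin.org/get", params={"q": "python"})
print(r.status_code, r.json()["args"])

# POST: send data to the server, e.g. submitted form fields
r = requests.post("https://httpbin.org/post", data={"name": "Alice"})
print(r.json()["form"])

# PUT: replace the current representation of the target resource
r = requests.put("https://httpbin.org/put", data={"name": "Bob"})
print(r.json()["form"])

# DELETE: ask the server to remove the specified resource
r = requests.delete("https://httpbin.org/delete")
print(r.status_code)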
Using the Requests Module
• The requests module in Python makes it easy to send HTTP requests.
Installing the Requests Module:
• To install the requests module, you can use pip:
pip install requests
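Once installed, a quick check that the module works: fetch a page and inspect the response (example.com is a domain reserved for documentation examples):

import requests

response = requests.get("https://example.com")

print(response.status_code)              # 200 on success
print(response.headers["Content-Type"])  # e.g. text/html; charset=UTF-8
print(response.text[:200])               # first 200 characters of the HTML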
Installing BeautifulSoup
To install BeautifulSoup, you can use pip:
pip install beautifulsoup4
Source
https://stackabuse.com/guide-to-parsing-html-with-beautifulsoup-in-python/
Extracting Data with BeautifulSoup
• BeautifulSoup provides methods like find() and find_all() to search for specific elements.
• For instance, we can use find() to locate a single element and find_all() to retrieve a list of all elements that
match given criteria.
Source
https://www.scrapingdog.com/blog/beautifulsoup-tutorial-web-scraping-with-python/
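A small sketch of both methods, run on an inline HTML snippet invented for illustration:

from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Headlines</h1>
  <a class="story" href="/a">First story</a>
  <a class="story" href="/b">Second story</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element (or None if nothing matches)
print(soup.find("h1").text)                     # Headlines

# find_all() returns a list of every matching element
for link in soup.find_all("a", class_="story"):
    print(link.text, "->", link["href"])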
Lab Activity
• Lab 59. Implementation of parsing HTML using BeautifulSoup and extracting data
from web pages
• Lab 60. Implementation of web requests using the requests module, a user-friendly
tool for sending HTTP requests to access or interact with web content
Introduction to Web Crawling
What is Web Crawling?
• Web crawling is the process of systematically
browsing the World Wide Web, typically for the
purpose of web indexing.
• A web crawler (also known as a spider or bot) starts
with a list of URLs to visit, known as seeds.
Source
https://www.simplilearn.com/what-is-a-web-crawler-article
How Web Crawling Works
Source
https://www.akamai.com/glossary/what-is-a-web-crawler
Key Concepts in Web Crawling
3. Respecting the robots.txt file and rate limiting to avoid overloading the server.
4. The number of levels of links the crawler follows from the seed URL.
Example of Web Crawling Using Python
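A minimal crawler sketch using requests and BeautifulSoup, assuming the public practice site quotes.toscrape.com as the seed URL; it stays on the seed's domain and stops at a fixed depth:

from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_depth=1):
    """Breadth-first crawl from seed_url, up to max_depth link levels."""
    visited = set()
    frontier = [(seed_url, 0)]
    domain = urlparse(seed_url).netloc

    while frontier:
        url, depth = frontier.pop(0)
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        print(f"[depth {depth}] {url}")

        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue  # skip pages that fail to load

        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"])
            if urlparse(next_url).netloc == domain:  # stay on one domain
                frontier.append((next_url, depth + 1))

crawl("https://quotes.toscrape.com/")

Running it prints each visited URL tagged with its depth, starting from the seed.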
Source
https://techjury.net/blog/web-crawling-vs-web-scraping/
Difference between Web Scraping and Web Crawling
• Web scraping extracts specific data from specific pages.
• Web crawling systematically browses many pages by following links, typically to index them or to supply URLs for a scraper to process.
Source
https://techjury.net/blog/web-crawling-vs-web-scraping/
Crawling Websites Using Scrapy or BeautifulSoup
• Both Scrapy and BeautifulSoup are popular for web scraping, but they are used differently:
• Scrapy is a powerful web scraping and crawling framework that makes it easy to extract
structured data. It’s best suited for large-scale scraping projects where you need to scrape
multiple pages or websites.
• BeautifulSoup is a library used for parsing HTML and XML. It's often paired with requests to
make HTTP requests but doesn't have the built-in crawling capabilities of Scrapy.
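For comparison, a minimal Scrapy spider sketch, again assuming quotes.toscrape.com; it yields structured records and follows the "next page" link, the kind of built-in crawling BeautifulSoup lacks:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # yield one structured item per quote on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # follow pagination; Scrapy schedules the new request itself
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Saved as quotes_spider.py, it can be run with: scrapy runspider quotes_spider.py -o quotes.json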
Lab Activity
• Lab 61. To implement a simple web crawler in Python, using requests and
BeautifulSoup
Summary
• Web Scraping: The process of extracting data from
websites.
• BeautifulSoup: A Python library used for parsing HTML
and XML.
• Requests Module: A Python library used for sending
HTTP requests.
• HTTP Requests: GET (fetch data), POST (send data),
PUT (update data), DELETE (remove data).
• Web Crawling: Systematically browsing the web to collect
data.
• Difference: Web scraping focuses on specific data from
specific pages, while web crawling involves browsing
multiple pages and sites.
• BeautifulSoup: Suitable for basic web scraping and
crawling tasks when combined with requests.
QUIZ
Let’s Start
Quiz
1. Which Python module is commonly used for sending
HTTP requests?
a) urllib
b) requests
c) http.client
d) request
Answer: B
requests
Quiz
2. Which of the following is an important consideration when web
scraping?
Answer: C
Website’s scraping policies (robots.txt)
Quiz
3. What is the purpose of the find_all() method in BeautifulSoup?
Answer: B
It finds all elements that match the given criteria.
Quiz
4. Which of the following best describes the term "seed URL" in
web crawling?
Answer: A
A URL where a crawler starts its operation
Quiz
5. What is the main difference between a web scraper and a web
crawler?
Answer: A
A web scraper collects data, a web crawler indexes web pages.
Reference
• https://techjury.net/blog/web-crawling-vs-web-scraping/
• https://soax.com/blog/web-crawling-vs-web-scraping
• https://www.linkedin.com/pulse/http-standard-methods-why-you-should-use-them-mahmoud-mahmoud
• https://medium.com/geekculture/web-scraping-with-python-a-complete-step-by-step-guide-code-5174e52340ea
• https://www.simplilearn.com/what-is-a-web-crawler-article
• https://www.scrapingdog.com/blog/beautifulsoup-tutorial-web-scraping-with-python/
Thank You