Scraping

Web scraping is the automated process of extracting data from websites, beginning with an HTTP request to retrieve HTML content, followed by parsing and extracting specific data. Understanding HTML structure is crucial for effective scraping, as it involves navigating tags and attributes to locate the desired information. Tools like Python's Requests and BeautifulSoup, along with pandas for table extraction, are essential for implementing web scraping techniques.

Uploaded by

gharikrishnan710

Web Scraping and HTML Basics

Introduction to web scraping

Web scraping, also known as web harvesting or web data extraction, is the process of
extracting information from websites or web pages. It involves automated retrieval of data
from web sources. People use it for various applications such as data analysis, mining,
price comparison, content aggregation, and more.

How web scraping works

HTTP request

The process typically begins with an HTTP request. A web scraper sends an HTTP request to
a specific URL, similar to how a web browser would when you visit a website. The request is
usually an HTTP GET request, which retrieves the web page's content.
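
As a minimal sketch of this step, the snippet below builds a GET request with the Requests library and inspects it without sending anything, then defines a small helper that would actually fetch a page. The URL, the query parameter, and the helper name fetch_html are placeholders chosen for illustration, not part of the original material.

```python
import requests

# Placeholder URL and query parameter (assumptions for illustration only).
URL = "https://example.com/search"

# Build the GET request without sending it, to see what the scraper would send.
prepared = requests.Request("GET", URL, params={"q": "scraping"}).prepare()
print(prepared.method, prepared.url)


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Send an HTTP GET request and return the response body as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text
```

Calling fetch_html(URL) performs the real network request; the timeout and raise_for_status() call are optional safeguards, not requirements of the technique.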

Web page retrieval

The web server hosting the website responds to the request by returning the requested web
page's HTML content. This content includes the visible text and media elements and the
underlying HTML structure that defines the page's layout.

HTML parsing

Once the HTML content is received, you need to parse it. Parsing involves
breaking down the HTML structure into components, such as tags, attributes, and text
content. In Python, you can use the Beautiful Soup library, which creates a structured
representation of the HTML content that can be easily navigated and manipulated.
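
The parsing step might look like the following sketch. The HTML string is made up and stands in for a downloaded page; 'html.parser' is Python's built-in parser, used here so that no extra parser package is assumed.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1 id="main">Welcome</h1>
    <p class="intro">First paragraph.</p>
  </body>
</html>
"""

# Parse the text into a navigable tree of tags, attributes, and strings.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string   # text inside the <title> tag
heading = soup.find("h1")        # first <h1> tag in the document

print(title_text)
print(heading.name, heading.attrs, heading.get_text())
```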

Data extraction

With the HTML content parsed, web scrapers can now identify and extract the specific data
they need. This data can include text, links, images, tables, product prices, news articles,
and more. Scrapers locate the data by searching for relevant HTML tags, attributes, and
patterns in the HTML structure.
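
A small sketch of extraction by tag and attribute is shown below. The product markup, class names, and field names are hypothetical; a real page would need its own selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical product listing; real pages will use different tags and classes.
html = """
<div class="product"><h2>Widget</h2><span class="price">$9.99</span>
  <a href="/widget">Details</a></div>
<div class="product"><h2>Gadget</h2><span class="price">$24.50</span>
  <a href="/gadget">Details</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

products = []
for item in soup.find_all("div", class_="product"):
    products.append({
        "name": item.find("h2").get_text(strip=True),                     # text of a tag
        "price": item.find("span", class_="price").get_text(strip=True),  # tag + class
        "link": item.find("a")["href"],                                   # attribute value
    })

print(products)
```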

Data transformation

Extracted data may need further processing and transformation. For instance, you can
remove HTML tags from text, convert data formats, or clean up messy data. This step
ensures the data is ready for analysis or other use cases.
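
For example, price strings scraped with leftover markup could be cleaned roughly as follows. The input values and the helper name to_number are invented for illustration, and the cleaning rules assume US-style dollar amounts.

```python
from bs4 import BeautifulSoup

# Invented raw values, as they might come out of a scraped table.
raw_values = ["<b> $1,299.00 </b>", "<span>$24.50</span>", "<td>  $7 </td>"]


def to_number(fragment: str) -> float:
    """Strip HTML tags, whitespace, and currency formatting, then convert to float."""
    text = BeautifulSoup(fragment, "html.parser").get_text()   # remove the tags
    cleaned = text.strip().replace("$", "").replace(",", "")   # remove formatting
    return float(cleaned)


prices = [to_number(value) for value in raw_values]
print(prices)
```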

Storage

After extraction and transformation, you can store the scraped data in various formats,
such as databases, spreadsheets, JSON, or CSV files. The choice of storage format
depends on the specific project's requirements.
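
As an illustration, the sketch below writes the same records as CSV and as JSON using only the standard library. The records are invented, and in-memory buffers are used so the example has no side effects; in a real project you would likely write to files opened with open(..., "w", newline="") instead.

```python
import csv
import io
import json

# Invented records standing in for scraped, cleaned data.
records = [
    {"name": "Widget", "price": 9.99},
    {"name": "Gadget", "price": 24.5},
]

# CSV: one header row, then one row per record.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# JSON: the list of dictionaries serialized as-is.
json_text = json.dumps(records, indent=2)

print(csv_text)
print(json_text)
```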

Automation

In many cases, scripts or programs automate web scraping. These automation tools allow
recurring data extraction from multiple web pages or websites. Automated scraping is
especially useful for collecting data from dynamic websites that regularly update their
content.
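
One possible shape for such a script is sketched below. The function scrape_many, its parameters, and the stand-in fetcher fake_fetch are all hypothetical; in practice the fetch argument would be a function that performs a real HTTP request, and the delay would be set to something polite for the target site.

```python
import time
from typing import Callable, Dict, Iterable


def scrape_many(urls: Iterable[str],
                fetch: Callable[[str], str],
                delay_seconds: float = 1.0) -> Dict[str, str]:
    """Fetch each URL in turn, pausing between requests and skipping failures."""
    results: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        try:
            results[url] = fetch(url)
        except Exception as error:     # keep going if one page fails
            print(f"Skipping {url}: {error}")
    return results


def fake_fetch(url: str) -> str:
    """Stand-in fetcher so the sketch runs without network access."""
    return f"<html><body>Content of {url}</body></html>"


pages = scrape_many(["https://example.com/a", "https://example.com/b"],
                    fake_fetch, delay_seconds=0)
print(list(pages))
```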

HTML structure

Hypertext Markup Language (HTML) serves as the foundation of web pages. Understanding
its structure is crucial for web scraping.

• <html> is the root element of an HTML page.

• <head> contains meta-information about the HTML page.

• <body> displays the content on the web page, often the data of interest.

• <h3> tags are level-3 headings, which render text larger and bold. In a page that lists
players, for instance, they might hold the player names.

• <p> tags represent paragraphs. In the same kind of page, they might contain player
salary information.
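
The tags listed above can be seen together in a small page. In this sketch the page is a Python string, and the player names and salaries are made-up placeholders rather than real data.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page using the tags described above.
html = """<html>
<head><title>Player salaries</title></head>
<body>
<h3>Player A</h3>
<p>Salary: $1,000,000</p>
<h3>Player B</h3>
<p>Salary: $2,000,000</p>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

names = [h3.get_text() for h3 in soup.find_all("h3")]   # headings hold the names
salaries = [p.get_text() for p in soup.find_all("p")]   # paragraphs hold the salaries

print(dict(zip(names, salaries)))
```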

Composition of an HTML tag

HTML tags define the structure of web content and can contain attributes.

• Most HTML elements consist of an opening (start) tag, some content, and a closing (end) tag.

• Tags have names (for example, <a> is an anchor tag).

• Tags may contain attributes with an attribute name and value, providing additional
information to the tag.
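
These parts of a tag can be inspected with Beautiful Soup, as in the short sketch below. The anchor tag and its attribute values are invented for illustration.

```python
from bs4 import BeautifulSoup

# An invented anchor tag with two attributes and some text content.
soup = BeautifulSoup('<a href="https://example.com" id="home">Example site</a>',
                     "html.parser")
tag = soup.a

print(tag.name)     # the tag name
print(tag.attrs)    # all attributes as a dictionary
print(tag["href"])  # a single attribute value
print(tag.string)   # the text between the opening and closing tags
```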

HTML document tree

You can visualize HTML documents as trees with tags as nodes.

• Tags can contain strings and other tags; these are the tag's children.

• Tags within the same parent tag are considered siblings.

• For example, the <html> tag contains both <head> and <body> tags, making them
children (and therefore descendants) of <html>. <head> and <body> are siblings.
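
The parent, child, and sibling relationships can be explored as in this sketch. The one-line document is made up and deliberately contains no whitespace between tags, so that whitespace strings do not appear as extra siblings.

```python
from bs4 import BeautifulSoup, Tag

# Made-up document with no whitespace between tags.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Direct children of <html>.
children_of_html = [child.name for child in soup.html.children]

# All tags nested anywhere below <html> (strings are filtered out).
descendants_of_html = [node.name for node in soup.html.descendants
                       if isinstance(node, Tag)]

sibling_of_head = soup.head.next_sibling.name   # the tag that follows <head>
parent_of_body = soup.body.parent.name          # the tag that contains <body>

print(children_of_html, descendants_of_html, sibling_of_head, parent_of_body)
```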

HTML tables

HTML tables are essential for presenting structured data.

• Define an HTML table using the <table> tag.

• Each table row is defined with a <tr> tag.

• Cells in the first row are often header cells, defined with the <th> tag.

• Other cells are defined with <td> tags, one per cell in a row.
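
A table built from these tags can be read row by row, as in the sketch below. The table contents are made-up placeholders.

```python
from bs4 import BeautifulSoup

# Made-up table: one header row (<th>) and two data rows (<td>).
html = """
<table>
  <tr><th>Name</th><th>Salary</th></tr>
  <tr><td>Player A</td><td>1000000</td></tr>
  <tr><td>Player B</td><td>2000000</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find("table").find_all("tr"):
    cells = tr.find_all(["th", "td"])            # header or data cells
    rows.append([cell.get_text(strip=True) for cell in cells])

header, data = rows[0], rows[1:]
print(header)
print(data)
```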

Web scraping

Web scraping involves extracting information from web pages using Python. It can save
time and automate data collection.

Required tools

Web scraping in Python typically relies on two modules: Requests and Beautiful Soup.
Ensure both are installed in your Python environment, for example with
pip install requests beautifulsoup4.

    # Import Beautiful Soup to parse the web page content
    from bs4 import BeautifulSoup

Fetching and parsing HTML

To start web scraping, you need to fetch the HTML content of a webpage and parse it using
Beautiful Soup. Here's a step-by-step example:

    import requests
    from bs4 import BeautifulSoup

    # Specify the URL of the webpage you want to scrape
    url = 'https://en.wikipedia.org/wiki/IBM'

    # Send an HTTP GET request to the webpage
    response = requests.get(url)

    # Store the HTML content in a variable
    html_content = response.text

    # Create a BeautifulSoup object to parse the HTML
    soup = BeautifulSoup(html_content, 'html.parser')

    # Display a snippet of the HTML content
    print(html_content[:500])

Navigating the HTML structure

Beautiful Soup represents HTML content as a tree-like structure, allowing for easy
navigation. You can use methods like find_all to filter and extract specific HTML elements.
For example, to find all anchor tags (<a>) and print their text:

    # Find all <a> tags (anchor tags) in the HTML
    links = soup.find_all('a')

    # Iterate through the list of links and print their text
    for link in links:
        print(link.text)
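
Beyond the link text, you often want each link's target. The sketch below keeps only anchors that have an href attribute and resolves relative links against a base URL. The HTML snippet and the base URL are placeholders, and the list is named link_targets here only to keep it distinct from the links variable above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Placeholder base URL and HTML (assumptions for illustration only).
base_url = "https://example.com/wiki/Start"
html = ('<p><a href="/wiki/Python">Python</a> '
        '<a name="top">no link</a> '
        '<a href="https://example.org/">External</a></p>')
soup = BeautifulSoup(html, "html.parser")

link_targets = []
for anchor in soup.find_all("a", href=True):       # skip anchors without href
    absolute = urljoin(base_url, anchor["href"])   # resolve relative links
    link_targets.append((anchor.get_text(strip=True), absolute))

print(link_targets)
```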

Custom data extraction

Web scraping allows you to navigate the HTML structure and extract specific information
based on your requirements. This process may involve finding specific tags, attributes, or
text content within the HTML document.

Using BeautifulSoup for HTML parsing

Beautiful Soup is a powerful tool for navigating a web page and extracting specific parts
of it. It lets you find elements by tag, attribute, or text, which makes it easier to
extract the information you're interested in.

Using pandas read_html for table extraction

pandas, a Python library, provides a function called read_html that automatically
extracts the tables from a web page and returns them as a list of DataFrames, a format
suitable for analysis. It's similar to taking a table from a webpage and importing it
into a spreadsheet for further analysis.
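
A minimal sketch of read_html is shown below. The table is made up and passed in as a string through io.StringIO, so no network access is needed; read_html also accepts a URL. It assumes that one of the HTML parsers pandas relies on (lxml, or Beautiful Soup together with html5lib) is installed.

```python
import io

import pandas as pd

# Made-up HTML table; with a real site you could pass the page URL instead.
html = """
<table>
  <thead><tr><th>Name</th><th>Salary</th></tr></thead>
  <tbody>
    <tr><td>Player A</td><td>1000000</td></tr>
    <tr><td>Player B</td><td>2000000</td></tr>
  </tbody>
</table>
"""

tables = pd.read_html(io.StringIO(html))  # one DataFrame per <table> found
df = tables[0]

print(df)
```

Because read_html returns a list, a page with several tables needs an index (or the match argument) to pick out the one you want.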
