Notes For Web Scraping - BeautifulSoup-3903

The document provides an overview of web scraping, focusing on tools like BeautifulSoup and the requests module in Python for extracting data from HTML and XML. It explains the Document Object Model (DOM), HTML structure, and the importance of parsing in web scraping, along with methods for data extraction. Additionally, it highlights the limitations of BeautifulSoup and suggests using more comprehensive frameworks for complex scraping tasks.

Uploaded by

rs9154040

Web Scraping - BeautifulSoup:

Web Scraping:

● A web scraping tool is a software program built to extract ('scrape') relevant data from
websites. A scraper, also called a website scraper, makes HTTP requests to a target
website and extracts data from the pages it returns. It parses content that is publicly
accessible, visible to users, and rendered by the server as HTML.

● Web scraping, also known as web data extraction, involves the automated collection of
both structured and unstructured data from the internet. It serves various purposes,
including price monitoring, price intelligence, news tracking, lead generation, and
market research. Individuals and businesses leverage web scraping to extract valuable
insights from publicly available web data, enabling them to make informed decisions. If
you've ever manually copied and pasted information from a website, you've essentially
performed the same function as a web scraper. Web scraping differs in that it uses
automation to retrieve large amounts of data quickly, eliminating the tedious
manual extraction process. Whether you choose to employ a web scraper or
collaborate with a web data extraction partner, understanding the basics of web
scraping is crucial for successful data retrieval.

Requests
● In Python, the "requests" module is a popular library used for making HTTP requests. It
provides a simple and convenient way to interact with web services, allowing users to
send HTTP requests and handle the corresponding responses. The requests module
supports various HTTP methods, such as GET, POST, and others, making it versatile for
tasks like fetching data from web servers or interacting with web APIs.

!pip3 install requests
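
A minimal sketch of a GET request with requests follows. The helper name fetch_html, the example.com URL, and the q parameter are illustrative assumptions, not part of the original notes. The second half builds a request without sending it, which shows how query parameters are encoded into the URL.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


# Prepare (but do not send) a GET request to see the final URL.
# The URL and the "q" parameter are placeholders.
prepared = requests.Request(
    "GET", "https://example.com/search", params={"q": "borders"}
).prepare()
print(prepared.method, prepared.url)
```

Calling fetch_html(url) on a real page returns the HTML text that can then be handed to a parser.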

BeautifulSoup:

● Beautiful Soup, a Python library, is designed for extracting data from HTML, XML, and
various markup languages. In scenarios where webpages present relevant data for
research, like date or address information without a direct download option, Beautiful
Soup proves instrumental. This tool aids in retrieving specific content from webpages,
eliminating HTML markup, and facilitating the saving of extracted information.
Essentially, Beautiful Soup serves as a web scraping tool, assisting in the cleaning and
parsing of documents obtained from the web.

Uses of Beautiful Soup:

● Beautiful Soup library proves valuable in isolating titles and links from webpages by
efficiently extracting all text within HTML tags. Additionally, it facilitates the
modification of HTML within the document under consideration.
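
The uses listed above can be sketched on a small inline page. The HTML string below is a made-up sample, used instead of a live download so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page (an assumption for illustration).
html = """
<html><head><title>Sample page</title></head>
<body>
  <h1>Borders</h1>
  <p>See <a href="/land">land borders</a> and <a href="/sea">maritime borders</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string                    # text inside <title>
links = [a["href"] for a in soup.find_all("a")]   # href attribute of every <a>
plain_text = soup.p.get_text()                    # paragraph text without markup

# Modify the document in place: replace the heading text.
soup.h1.string = "Borders of India"

print(page_title, links, plain_text, soup.h1, sep="\n")
```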

Installing BeautifulSoup:

● The installation of Beautiful Soup is a breeze if you already have pip or another Python
installer in place. In case pip is not installed, a brief tutorial on installing Python
modules can guide you through the process. Once pip is set up, simply run the following
command in the terminal to install Beautiful Soup.

!pip3 install beautifulsoup4

● To import BeautifulSoup

from bs4 import BeautifulSoup

● Beautiful Soup also needs a "parser" to interpret the HTML. Python's built-in
html.parser works without any extra installation; lxml is an optional, faster alternative:

!pip3 install lxml
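
If lxml may not be installed on every machine, one option is to fall back to Python's built-in html.parser. This is a sketch under that assumption; make_soup is a hypothetical helper name.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when it is installed, otherwise with the built-in parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is unavailable.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p>Hello, <b>soup</b></p>")
print(soup.b.get_text())
```

Different parsers can build slightly different trees from malformed HTML, so naming one parser explicitly keeps results consistent across machines.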

Introduction to the DOM:

● The Document Object Model (DOM) serves as the data representation of the elements
forming the structure and content of a web document. This guide provides an
introduction to the DOM, delving into how it internally represents an HTML document
and exploring the use of APIs for creating web content and applications.

● The Document Object Model (DOM) functions as a programming interface for web
documents, offering a representation of the page that allows programs to dynamically
alter the document's structure, style, and content. Utilizing nodes and objects, the
DOM enables interaction between programming languages and the web page.
● A web page exists as a document, viewable in the browser window, or accessible as an
HTML source. Despite being the same document in both instances, the Document
Object Model (DOM) representation enables manipulation. Serving as an
object-oriented depiction of the web page, it becomes alterable through scripting
languages like JavaScript.

HTML:

● HTML, an acronym for Hyper Text Markup Language, serves as the fundamental
language for crafting web pages and web applications. Let's delve into the definition of
Hypertext Markup Language and understand its role in the creation of web pages.

● HyperText refers to "Text within Text," where a text contains links within it, creating a
hypertext. Clicking on a link that directs you to a new webpage signifies interacting
with a hypertext. This concept enables the linking of two or more web pages (HTML
documents) to each other.

● A markup language is a computer language used to apply layout and formatting
conventions to a text document. Markup turns plain text into structured elements such
as images, tables, and links.

● A web page, usually written in HTML and rendered by a web browser, is a document
accessible through a specific URL. Web pages are either static or dynamic; a static
page can be built with HTML alone.

Tags:

● Tags are used to represent HTML elements. These can be seen as keywords that define
how a web browser will format and display the website content.
● HTML tags usually come in pairs: an opening tag and a closing tag, which is the same
tag name preceded by '/'.

Eg: <html> and </html> form a pair, whereas <hr> has no closing tag.
NOTE: Tags without a closing tag can also be written in "self-closing" form; a br tag, for
eg., may appear as "<br/>" instead of simply "<br>".
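
A short sketch of how a parser treats these forms: both spellings of the line break are read as the same kind of element. The HTML string is a made-up sample.

```python
from bs4 import BeautifulSoup

# Sample markup mixing a paired tag (<p>) with tags that have no closing tag.
html = "<p>first line<br>second line<br/>third line</p><hr>"
soup = BeautifulSoup(html, "html.parser")

# "<br>" and "<br/>" are both found as br elements.
breaks = soup.find_all("br")

# find_all(True) matches every tag, in document order.
names = [tag.name for tag in soup.find_all(True)]

print(len(breaks), names)
```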

Some important tags when scraping with BeautifulSoup:

● <a> (Anchor): Useful for extracting hyperlinks, which can lead to data sources
● <p> (Paragraph): Commonly used for text content, including textual data that you may
want to scrape.
● <h1>, <h2>, <h3>,...<h6> (Headings): Often used for titles and section headings that
provide structure to web pages and data.
● <ul>, <ol> (Unordered and Ordered Lists): Useful for extracting structured lists of
data.
● <li> (List Item): Elements within lists, which can contain valuable data.
● <table>: Used for presenting tabular data. Essential for scraping structured data tables.
● <tr> (Table Row): Rows within tables, containing data points.
● <div>: Frequently used as a container for various content, including data.
● <form>: Used for web forms, which may be a source of data.
● <input>: Used for form fields such as text boxes, radio buttons, and checkboxes; the
values they hold can be extracted.
● <textarea>: Often used for multi-line text input, where textual data might be present.
● <label>: Linked to form elements, providing descriptions or labels for input fields.
● <img>: Useful for extracting image data or checking image attributes.
● <iframe>: May contain data or content embedded from other sources.
● <script>: Contains JavaScript code, which sometimes loads or manipulates data.
● <select>: Used for dropdown lists, which may contain selectable data.
● <body>: Contains all the visible content of an HTML document, such as headings,
paragraphs, hyperlinks, tables, lists, etc.
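
As an illustration of why list and table tags matter, the sketch below turns a small table and a list into Python lists. The markup and its values are placeholders, not real data. The cell tags <th> and <td> are standard HTML table cell elements.

```python
from bs4 import BeautifulSoup

# Placeholder markup (an assumption for illustration).
html = """
<table>
  <tr><th>Country</th><th>Border (km)</th></tr>
  <tr><td>A</td><td>100</td></tr>
  <tr><td>B</td><td>250</td></tr>
</table>
<ul><li>first</li><li>second</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# One inner list per <tr>, holding the text of each header or data cell.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in soup.find_all("tr")
]

# Text of every list item.
items = [li.get_text() for li in soup.find_all("li")]

print(rows)
print(items)
```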

Extra:

To get the list of all valid tags in HTML5, visit:

https://developer.mozilla.org/en-US/docs/Web/HTML/Element

Navigation of the structure in the DOM:

● Parents are the tags that encapsulate an element, forming a hierarchical structure
within HTML. In the Document Object Model (DOM), these parent tags contain and
surround child elements, creating a tree-like representation. Understanding the
parent-child relationship is crucial in navigating and manipulating the structure of
HTML documents using languages like JavaScript. It allows developers to access and
modify content within specific sections of a webpage, providing flexibility in web
development and design.

● Siblings are other tags that share the same parent within an HTML document. In the
Document Object Model (DOM), siblings sit at the same hierarchical level under a
common parent. They need not be directly next to each other in the HTML markup.
● Children refer to the tags contained within a specific HTML element. In the context of
the Document Object Model (DOM), children are elements nested inside their parent
tag. This relationship creates a hierarchical structure where the parent is the
encompassing tag, and the children are the tags residing inside it.
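
The parent, sibling, and child relationships described above can be sketched with BeautifulSoup's navigation attributes. The HTML string is a made-up sample written without whitespace between tags; in markup that does contain whitespace, .children also yields the text nodes between tags.

```python
from bs4 import BeautifulSoup

# Sample markup (an assumption for illustration).
html = "<div id='box'><h2>Title</h2><p>First</p><p>Second</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")

parent = first_p.parent                               # the enclosing <div>
next_p = first_p.find_next_sibling("p")               # the second <p>
previous_tag = first_p.find_previous_sibling()        # the <h2>
children = [child.name for child in parent.children]  # tags directly inside <div>

print(parent["id"], next_p.get_text(), previous_tag.name, children)
```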

What is Parsing in Web Scraping?

● In the realm of web scraping, parsing refers to the transformation of unstructured data
into a more organized format, such as a parse tree, making it simpler to comprehend,
utilize, and extract valuable information.

Parse the HTML:

● To initiate the HTML parsing process, generate a BeautifulSoup object and include the
HTML to be parsed as a mandatory argument. The resulting soup object will represent
a parsed rendition of the provided HTML.

import requests
from bs4 import BeautifulSoup

# Download the page and keep the HTML body as text
url = 'https://en.wikipedia.org/wiki/Borders_of_India'
response = requests.get(url).text
# Name the parser explicitly so results are the same on every machine
soup = BeautifulSoup(response, 'html.parser')
print(soup)
● This creates a BeautifulSoup object from the HTML response and prints the parsed
document.

Leverage BeautifulSoup's Object Methods for Extracting Information from HTML.

● The BeautifulSoup library incorporates numerous built-in methods designed for
extracting data from HTML. Employ methods such as soup.find() or soup.find_all() to
retrieve specific elements from the parsed HTML.

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Borders_of_India'
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')
# Print the text of every <h2> heading on the page
for heading in soup.find_all("h2"):
    print(heading.text)
● We then retrieve every matching HTML element (the <h2> headings in this case) from the
BeautifulSoup object with the soup.find_all() method (findAll() is its older alias).
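
The difference between find() and find_all() can be sketched on a small inline document. The headings and class names below are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup (an assumption for illustration).
html = """
<h2 class="section">History</h2>
<h2 class="section">Geography</h2>
<h2 class="note">See also</h2>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("h2")                           # first match only
all_h2 = soup.find_all("h2")                      # every match, as a list
sections = soup.find_all("h2", class_="section")  # filter by CSS class
missing = soup.find("h3")                         # None when nothing matches

print(first.get_text(), len(all_h2), [h.get_text() for h in sections], missing)
```

Because find() returns None when there is no match, it is worth checking the result before calling methods on it.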
Limitations:

● While BeautifulSoup is a powerful tool, it comes with certain limitations and
drawbacks. Notably, it cannot execute JavaScript or handle dynamic web pages that
involve user interaction or AJAX requests. For those cases, consider a browser
automation tool like Selenium. (Scrapy is a crawling framework and does not execute
JavaScript on its own.)

● BeautifulSoup also faces limitations when dealing with intricate web scraping tasks,
including tasks like crawling multiple pages, managing data storage in databases,
and handling errors and retries. To tackle such scenarios, it's advisable to turn to a
comprehensive web scraping framework like Scrapy or PySpider. BeautifulSoup
excels in simpler, static web scraping tasks, primarily focused on parsing and
extracting data from individual web pages.
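
For small jobs that do not justify a full framework, retries can be handled at the HTTP layer before the HTML reaches BeautifulSoup. The sketch below is one possible approach, not something the notes prescribe: make_session is a hypothetical helper, and the retry count, backoff, and status codes are assumed values.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(retries=3, backoff=0.5):
    """Return a requests.Session that retries failed requests automatically."""
    retry = Retry(
        total=retries,                               # maximum number of retries
        backoff_factor=backoff,                      # growing delay between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # retry on these HTTP statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


session = make_session()
print(session.get_adapter("https://example.com").max_retries.total)
```

A loop over several page URLs can then call session.get(url) and pass each response's text to BeautifulSoup.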
