Data Science with Python
Web Scraping with BeautifulSoup
Learning Objectives
By the end of this lesson, you will be able to:
Define web scraping and explain its importance
List the steps involved in the web scraping process
Describe basic terminologies, such as parser, object, and
tree associated with the BeautifulSoup
Explain various operations, such as searching, modifying, and
navigating the tree to yield the required result
Web Scraping
What Is Web Scraping?
Web scraping is a computer software technique for extracting information from websites in an automated fashion.
What Is Web Scraping?
Why Web Scraping
Every day, you find yourself in a situation where you need to extract data from the web.
Why Web Scraping
Web Scraping Process
Web Scraping Process: Basic Preparation
There are two basic things to consider before setting up the web scraping process:
Understanding the target data on the Internet
Finalizing the list of websites
Web Scraping Process
Once you have understood the target data and finalized the list of websites, you need to design the web scraping
process.
The steps involved in a typical web scraping process are as follows :
Step 1: A web request is sent to the targeted website to collect the required data.
Web Scraping Process
Once you have understood the target data and finalized the list of websites, you need to design the web scraping
process.
The steps involved in a typical web scraping process are as follows:
Step 2: The information is retrieved from the targeted website in HTML or XML format from web.
Web Scraping Process
Once you have understood the target data and finalized the list of websites, you need to design the web scraping
process.
The steps involved in a typical web scraping process are as follows:
Step 3: The retrieved information is parsed to the several parsers based on the data format. Parsing is a technique
to read data and extract information from the available document.
Web Scraping Process
Once you have understood the target data and finalized the list of websites, you need to design the web scraping
process.
The steps involved in a typical web scraping process are as follows:
Step 4: The parsed data is stored in the desired format. You can follow the same process to scrap another targeted
web.
Web Scraping Software
A web scraping software will interact with websites in the same way as your web browser.
A Web scraper is used to extract the information from web in routine and automated manner.
Web Browser Web Scraping Software
Displays the data Saves data from the web page to the local file or
database
Web Scraping Considerations
Reading and understanding the legal information along with terms and conditions mentioned in the website is
important.
Web Scraping Considerations
Legal Constraints
Notice
Copyright
Trademark Material
Patented Information
Web Scraping Tool: BeautifulSoup
SymPy Requests SQLAlchemy BeautifulSoup Twisted
Scrapy wxPython Pillow Pyglet matplotlib
Nose IPython SciPy Pygame NumPy
Web Scraping Tool: BeautifulSoup
BeautifulSoup, is an easy, intuitive, and a robust Python library designed for web scraping.
SymPy Requests SQLAlchemy BeautifulSoup Twisted
Scrapy wxPython Pillow Pyglet matplotlib
Nose IPython SciPy Pygame NumPy
Features of BeautifulSoup
Efficient tool for dissecting documents and extracting information from the web pages
Has powerful sets of built-in methods for navigating, searching, and modifying a parse tree
Contains a parser that supports both html and xml documents
Converts all incoming documents to unicode automatically
Converts all outgoing documents to UTF-8 automatically
Common Data/Page Formats on the Web
Common Data/Page Formats on the Web
An HTML page is one of the
oldest, easiest, and the most
popular methods to upload
information on the web.
Common Data/Page Formats on the Web
An HTML 5 is a new HTML
standard which gained
popularity with the mobile
devices.
Common Data/Page Formats on the Web
XML is another popular way
to upload your information
on the web.
Common Data/Page Formats on the Web
CSS is mainly used for the
consistent presentation of
data using cascaded style
sheets.
Common Data/Page Formats on the Web
Application Program
Interface or APIs have now
become a common practice
to extract information from
the web.
Common Data/Page Formats on the Web
PDF is also widely used
to upload information
and reports.
Common Data/Page Formats on the Web
JavaScript Object
Notation, or JSON, is a
lightweight and popular
format used for
information exchange
on the web.
Parser
Parser
What is a parser?
How does it help Data Scientists in
the web scraping process?
Parser
A Parser is a basic tool to interpret or render information from a web document.
A Parser is also used to validate the input information before processing it.
Program instructions Objects
Input Output
Commands Parser Methods
Markup tags Attributes
Importance of Parsing
Parsing data is one of the most important steps in the web scraping process.
Failing to parse the data would eventually lead to a failure of the entire process.
! !
Parser
Parser
Various Parsers
Various parsers supported by BeautifulSoup are:
html.parser HTML parser is Python-based, fast, and lenient.
Lxml html is not built using Python and it depends on C.
lxml html
However, it is fast and lenient in nature.
lxml xml Lxml xml is the only xml parser available and it also depends on
C.
HTML5lib is another Python-based parser; however, it is slow
html5lib
and can create valid HTML5.
Importance of Objects
A web document gets transformed into a complex tree of objects.
Objects
Object
Relationship
HTML
Parser
Tree
A tree is defined as a collection of simple and complex objects.
Types of Objects
BeautifulSoup transforms a complex HTML document into a complex tree of Python objects.
There are four types of objects. They are:
A tag object is an XML or HTML tag in the web document. Tags
Tag
have a lot of attributes and methods.
A NavigableString is a string or set of characters that
NavigableString
correspond to the text present within a tag.
BeautifulSoup A BeautifulSoup represents the entire web document and
supports navigating and searching the document tree.
A Comment represents the comment or information section of
Comment
the document. It is a special type of NavigableString.
Parsing Web Documents and Extracting Data Using Objects
Demonstrate how to scrape a web document, parse it, and use objects to extract information.
Understanding Tree
html
head Body
tiltle meta meta h1 p ul
a li li li
Understanding Tree
html tag
Body tag
Division or a Section
Cascaded style sheets
Understanding Tree
BeautifulSoup Parent
html Direct child
div:
oraganizationlist
ul: ul: ul:
IT HR Finance
Siblings
li: li:
HRmanager HRmanager
div: div: div: div:
Class: name Class: ID Class: name Class: ID
Various Operations
Searching Tree: Filters
With the help of the search filters technique, you can extract specific information from the parsed document.
The filters can be treated as search criteria for extracting the information based on the elements present in the
document.
Searching Tree: Filters
There are various kinds of filters used for searching information from a tree.
A string is the simplest filter. BeautifulSoup will perform a
String
match against the search string.
Regular A regular expression filters the match against the search
Expressions criteria.
List A list filters the string that matches against the search item in
the list.
A function filters the elements that match against its only
Function
argument.
Searching the Tree: find_all()
BeautifulSoup defines a lot of methods for searching the parsed tree.
Methods and
Attributes
Searching methods
find_all() find()
Searching the tree with find_all()
The find_all() searches and retrieves all tags’ descendants that matche your filters.
The syntax for find_all(): Arguments
find_all(name, attrs, recursive, string, limit, **kwargs)
Method
Pass argument for tags with Filter multiple attributes by
names passing multiple keywords in the
argument
Pass argument for tags
with attributes
Limit the search result to numeric value
Pass argument as Boolean value passed in the argument
for recursive operation
Search for string instead
of tags
Searching the tree with find ()
The find_all() finds the entire document looking for results.
To find one result, use find().
The find() method has a syntax similar to that of the find_all() method; however, there are some key differences.
Method Name Search Scope Match Found Match Not Found
Find_all() Scans entire document Returns list with values Returns empty list
Searches only for passed Returns only the first
Find() Returns Nothing
argument match value
Searching the Tree with Other Methods
Searching the parse tree can also be performed by various other methods such as:
Tree Search
find_parents() find_parent()
find_previous_siblings() find_previous_sibling()
find_all_next() find_next()
find_all_previous() find_previous()
find_next_siblings() find_next_sibling()
scans for all the matches scans only for the first match
Searching in a Tree with Filters
Demonstrate the ways to search in a tree using filters.
Navigating Options
With the help of BeautifulSoup, it is easy to navigate the parse tree based on the need.
There are four options to navigate the tree. They are:
Navigating Down
Navigating Up
Navigating Sideways
Navigating Back and
Forth
Navigating Options
There are four options to navigate the tree. They are:
Navigating Down This technique shows you how to extract information from children tags.
Following are the attributes used to navigate down:
• .contents and .children
Navigating Up • .descendants
• .string
• .strings and stripped_strings
Navigating Sideways
Navigating Back and
Forth
Navigating Options
There are four options to navigate the tree:
Navigating Down
Every tag has a parent and two attributes, .parents and .parent, to help navigate
up the family tree.
Navigating Up
Navigating Sideways
Navigating Back and
Forth
Navigating Options
There are four options to navigate the tree:
Navigating Down
This technique shows you how to extract information from the same level in the tree.
Navigating Up The attributes used to navigate sideways are: .next_sibling and .previous_sibling.
Navigating Sideways
Navigating Back and
Forth
Navigating Options
There are four options to navigate the tree:
Navigating Down
This technique shows you how to parse the tree back and forth.
Navigating Up The attributes used to navigate back and forth are:
.next_element and .previous_element
.next_elements and .previous_elements
Navigating Sideways
Navigating Back and
Forth
Navigating a Tree
Demonstrate how to navigate the web tree using various techniques.
Modifying the Tree
With BeautifulSoup, you can also modify the tree and write your changes as a new HTML or XML document.
There are several methods to modify the tree:
.string()
append()
NavigableString()
.new_tag()
insert()
Insert_before() and insert_after()
clear()
extract()
decompose()
replace with ()
wrap()
unwrap()
Modifying the Tree
Demonstrate how to modify a web tree to get the desired result with the help of an example.
Parsing Only Part of the Document
But, how can you overcome this problem?
Use SoupStrainer class
Allows you to choose the part of the document to
be parsed
This feature of parsing a part of the document will not work with the html5lib parser.
Parsing Part of the Document
Demonstrate how to parse only a part of document with the help of an example.
Output: Printing and Formatting
Output
Prettify()
Printing
Unicode() or str()
Formatting
Output: Printing and Formatting
Prettify()
The prettify or pretty printing method turns a parse tree into
a decorative formatted Unicode string.
Output
Printing
Formatting
Unicode() or str()
Output: Printing and Formatting
Prettify()
Output
Unicode() or str()
Printing The unicode()or str() method turns a parse tree into a non-
decorative formatting string.
Formatting
Output: Printing and Formatting
The formatters are used to generate different types of output with the desired formatting.
Output
Html and xml
Printing
Minimal
Formatting Html and xml formatting
will convert unicode
None
characters into html and
xml entities respectively.
Uppercase and
lowercase
Output: Printing and Formatting
The formatters are used to generate different types of output with the desired formatting.
Output
Html and xml
Printing
Minimal
Formatting The minimal formatting
None will process content with
valid html/ xml tags.
Uppercase and
lowercase
Output: Printing and Formatting
The formatters are used to generate different types of output with the desired formatting.
Output
Html and xml
Printing
Minimal
Formatting None formatting will not
None modify the content or
string on output.
Uppercase and
lowercase
Output: Printing and Formatting
The formatters are used to generate different types of output with the desired formatting.
Output
Html and xml
Printing
Minimal
Uppercase and
Formatting lowercase formatting will
None convert string values to
uppercase and
lowercase, respectively.
Uppercase and
lowercase
Formatting and Printing
Demonstrate how to format, print, and encode the web document.
Encoding
Document Encoding Output Encoding
• HTML or XML documents are written in
specific encodings, such as ASCII or UTF-8.
• When you write a document from
BeautifulSoup, you get a UTF-8 document
• When you load the document into irrespective of the original encoding.
BeautifulSoup, it gets converted into
Unicode.
• If some other encoding is required, you can
pass it to prettify.
• The original encoding can be extracted
from attribute .original encoding of the
BeautifulSoup object.
Web Scraping
Scrape the Simplilearn website page and perform the following tasks:
• View and print the Simplilearn web page content in a proper format
• View the head and title
• Print all the href links present in the Simplilearn web page
Simplilearn website URL: http://www.simplilearn.com/
Web Scraping
Scrape the Simplilearn website resource page and perform the following tasks:
• View and print the Simplilearn web page content in a proper format
• View the head and title
• Print all the href links present in the Simplilearn web page
• Search and print the resource headers of the Simplilearn web page
• Search resource topics
• View the article names and navigate through them
Simplilearn website URL: http://www.simplilearn.com/resources
Knowledge Check
Knowledge
Check
Which of the following is the only xml parser?
1
a. html.parser
b. lxml
c. lxml.xml
d. html5lib
Knowledge
Check
Which of the following is the only xml parser?
1
a. html.parser
b. lxml
c. lxml.xml
d. html5lib
The correct answer is c
lxml.xml is the only xml parser available for BeautifulSoup object.
Knowledge
Check
In which of the following formats is the BeautifulSoup output encoded?
2
a. ASCII
b. Unicode
c. latin-1
d. UTF-8
Knowledge
Check
In which of the following formats is the BeautifulSoup output encoded?
2
a. ASCII
b. Unicode
c. latin-1
d. UTF-8
The correct answer is d
The output of the BeautifulSoup is always UTF-8 encoded.
Knowledge
Check
Which of the following libraries is used to extract a web page?
3
a. Beautiful Soup
b. Pandas
c. Requests
d. Numpy
Knowledge
Check
Which of the following libraries is used to extract a web page?
3
a. Beautiful Soup
b. Pandas
c. Requests
d. Numpy
The correct answer is c
Requests is the right API to extract the web page.
Knowledge
Check
Which of the following is NOT an object in BeautifulSoup?
4
a. Tag
b. NextSibling
c. NavigableString
d. Comment
Knowledge
Check
Which of the following is NOT an object in BeautifulSoup?
4
a. Tag
b. NextSibling
c. NavigableString
d. Comment
The correct answer is b
NextSibling is a navigation method.
Key Takeaways
You are now able to:
Define web scraping and explain its importance
List the steps involved in the web scraping process
Describe basic terminologies, such as parser, object, and
tree associated with the BeautifulSoup
Explain various operations, such as searching, modifying, and
navigating the tree to yield the required result
Thank You