Getting Started with
Beautiful Soup
Build your own web scraper and learn all about web
scraping with Beautiful Soup
Vineeth G. Nair
BIRMINGHAM - MUMBAI
Getting Started with Beautiful Soup
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the author, nor Packt
Publishing, nor its dealers and distributors, will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78328-955-4
www.packtpub.com
Credits
Reviewers: John J. Czaplewski, Christian S. Perone, Zhang Xiang
Acquisition Editor: Nikhil Karkal
Copy Editor: Janbal Dharmaraj
Proofreader: Maria Gould
Indexer: Hemangini Bari
Graphics: Sheetal Aute
About the Author
He developed an interest in Python during his college days and began working as a
freelance programmer. This led him to work on several web scraping projects using
Beautiful Soup. It helped him gain a fair level of mastery on the technology and a
good reputation in the freelance arena. He can be reached at
vineethgnair.mec@gmail.com. You can visit his website at www.kochi-coders.com.
About the Reviewers
www.PacktPub.com
Did you know that Packt offers eBook versions of every book published, with PDF
and ePub files available? You can upgrade to the eBook version at www.PacktPub.com
and as a print book customer, you are entitled to a discount on the eBook copy.
Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign
up for a range of free newsletters and receive exclusive discounts and offers on Packt
books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online
digital book library. Here, you can access, read and search across Packt's entire
library of books.
Why Subscribe?
• Fully searchable across every book published by Packt
• Copy and paste, print and bookmark content
• On demand and accessible via web browser
Table of Contents
Preface
Chapter 1: Installing Beautiful Soup
Installing Beautiful Soup
Installing Beautiful Soup in Linux
Installing Beautiful Soup using package manager
Installing Beautiful Soup using pip or easy_install
Installing Beautiful Soup using pip
Installing Beautiful Soup using easy_install
Installing Beautiful Soup in Windows
Verifying Python path in Windows
Installing Beautiful Soup using setup.py
Using Beautiful Soup without installation
Verifying the installation
Quick reference
Summary
Chapter 2: Creating a BeautifulSoup Object
Creating a BeautifulSoup object
Creating a BeautifulSoup object from a string
Creating a BeautifulSoup object from a file-like object
Creating a BeautifulSoup object for XML parsing
Understanding the features argument
Tag
Accessing the Tag object from BeautifulSoup
Name of the Tag object
Attributes of a Tag object
The NavigableString object
Quick reference
Summary
Preface
Web scraping is now widely used to get data from websites. Whether it is e-mails,
contact information, or the selling prices of items, we rely on web scraping
techniques, as they allow us to collect large amounts of data with minimal effort.
We also don't require database or other backend access to get this data, since it is
available in the form of web pages.
Beautiful Soup allows us to get data from HTML and XML pages. This book helps
us by explaining the installation and creation of a sample website scraper using
Beautiful Soup. Searching and navigation methods are explained with the help of
simple examples, screenshots, and code samples in this book. The different parsers
supported by Beautiful Soup, support for scraping pages with different encodings,
formatting the output, and other tasks related to scraping a page are all explained
in detail. Apart from these, practical approaches to understanding patterns on a
page using the developer tools in browsers will enable you to write similar scrapers
for any other website.
Also, the practical approach followed in this book will help you to design a simple
web scraper to scrape and compare the selling prices of various books from three
websites, namely, Amazon, Barnes and Noble, and PacktPub.
Chapter 3, Search Using Beautiful Soup, discusses in detail the different search methods
in Beautiful Soup, namely, find(), find_all(), find_next(), and find_parents();
code examples for a scraper using search methods to get information from a website;
and understanding the application of search methods in combination.
Chapter 4, Navigation Using Beautiful Soup, discusses in detail the different navigation
methods provided by Beautiful Soup, methods specific to navigating downwards
and upwards, and sideways, to the previous and next elements of the HTML tree.
Chapter 5, Modifying Content Using Beautiful Soup, discusses modifying the HTML
tree using Beautiful Soup, and the creation and deletion of HTML tags. Altering the
HTML tag attributes is also covered with the help of simple examples.
Chapter 8, Creating a Web Scraper, discusses creating a web scraper for three websites,
namely, Amazon, Barnes and Noble, and PacktPub, to get the book selling price based
on ISBN. Searching and navigation methods used to create the parser, use of developer
tools so as to identify the patterns required to create the parser, and the full code
sample for scraping the mentioned websites are also explained in this chapter.
For Chapter 3, Search Using Beautiful Soup and Chapter 8, Creating a Web Scraper,
you must have an Internet connection to scrape different websites using the code
examples provided.
Conventions
In this book, you will find a number of styles of text that distinguish between
different kinds of information. Here are some examples of these styles, and an
explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows:
"The prettify() method can be called either on a Beautiful Soup object or any of
the Tag objects."
When we wish to draw your attention to a particular part of a code block, the
relevant lines or items are set in bold:
New terms and important words are shown in bold. Words that you see on the
screen, in menus or dialog boxes for example, appear in the text like this: "The output
methods in Beautiful Soup escape only the HTML entities of >, <, and & as &gt;, &lt;,
and &amp;."
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about
this book—what you liked or may have disliked. Reader feedback is important for us
to develop titles that you really get the most out of.
If there is a topic that you have expertise in and you are interested in either writing
or contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to
help you to get the most from your purchase.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you find a mistake in one of our books—maybe a mistake in the text or
the code—we would be grateful if you would report this to us. By doing so, you can
save other readers from frustration and help us improve subsequent versions of this
book. If you find any errata, please report them by visiting
http://www.packtpub.com/submit-errata, selecting your book, clicking on the
errata submission form link,
and entering the details of your errata. Once your errata are verified, your submission
will be accepted and the errata will be uploaded on our website, or added to any list of
existing errata, under the Errata section of that title. Any existing errata can be viewed
by selecting your title from http://www.packtpub.com/support.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media.
At Packt, we take the protection of our copyright and licenses very seriously. If you
come across any illegal copies of our works, in any form, on the Internet, please
provide us with the location address or website name immediately so that we can
pursue a remedy.
We appreciate your help in protecting our authors, and our ability to bring you
valuable content.
Questions
You can contact us at [email protected] if you are having a problem with
any aspect of the book, and we will do our best to address it.
Installing Beautiful Soup
Before we begin using Beautiful Soup, we should ensure that it is properly installed
on our machine. The steps required are so simple that any user can install this in no
time. In this chapter, we will be covering the following topics:
Normally, these are the following three ways to install Beautiful Soup in
Linux machines:
The choices are ranked by complexity, to avoid trial and error. The easiest method
is using the package manager, since it requires the least effort from the user, so we
will cover it first. If the installation succeeds with one method, we don't need to try
the next, because all three methods mentioned previously do the same thing.
sudo apt-get install python-bs4
The preceding command will install Beautiful Soup Version 4 in our Linux
operating system. Installing new packages in the system normally requires root
user privileges, which is why we prepend sudo to the apt-get command. If we
didn't use sudo, we would end up with a permission denied error. If the packages
are already updated, we will see the following success message in the command
line itself:
Since we are using a recent version of Ubuntu or Debian, python-bs4 will be listed
in the apt repository. But if the preceding command fails with Package Not Found
Error, it means that the package list is not up-to-date. This normally happens if we
have just installed our operating system and the package list is not downloaded from
the package repository. In this case, we need to first update the package list using the
following command:
sudo apt-get update
The preceding command will update the necessary package list from the online
package repositories. After this, we need to try the preceding command to install
Beautiful Soup.
In the older versions of the Linux operating system, even after running the
apt-get update command, we might not be able to install Beautiful Soup because it
might not be available in the repositories. In these scenarios, we can rely on the other
methods of installation using either pip or easy_install.
sudo pip install beautifulsoup4
The preceding command will install Beautiful Soup Version 4 in the system after
downloading the necessary packages from http://pypi.python.org/.
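Similarly, if we prefer easy_install over pip, a sketch of the equivalent command
(assuming setuptools is already installed) is as follows:
sudo easy_install beautifulsoup4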
All the previous methods to install Beautiful Soup in Linux will fail if you do
not have an active network connection. In case everything else fails, we can still
install Beautiful Soup manually. The last option is to use the setup.py script that
comes with every Python package downloaded from pypi.python.org. This
method is also the recommended method to install Beautiful Soup in Windows and
in Mac OS X machines. So, we will discuss this method in the Installing Beautiful Soup
in Windows section.
The preceding command will work without any errors if the path to Python is
already added in the environment path variable or we are already within the Python
installed directory. But, it would be good to check the path variable for the Python
directory entry.
If it doesn't exist in the path variable, we have to find out the actual path, which
depends entirely on where you installed Python. For Python 2.x, it will be
C:\Python2x by default, and for Python 3.x, the path will be C:\Python3x by default.
We have to add this to the Path environment variable in the Windows machine.
For this, right-click on My Computer | Properties | Environment Variables |
System Variable.
Pick the Path variable and append an entry of the form ;C:\PythonXY to it, for
example, ;C:\Python27.
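To confirm that the change took effect, we can open a new command prompt and
ask for the interpreter's version; a minimal check (the version string shown here is
illustrative and will vary with your installation) is as follows:
C:\> python --version
Python 2.7.6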
After the Python path is ready, we can follow the steps for installing Beautiful Soup
on a Windows machine.
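As a sketch of those steps (the archive name shown here is hypothetical and depends
on the version downloaded): after downloading and extracting the Beautiful Soup
source package from pypi.python.org, we run the bundled setup.py script from
inside the extracted directory:
cd beautifulsoup4-4.3.2
python setup.py install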
We are not done with the list of possible options to use Beautiful
Soup. We can use Beautiful Soup in our applications even if all of the
options outlined until now fail.
After we perform all the preceding steps, we are good to use Beautiful Soup. In
order to import Beautiful Soup in this case, either we need to open the terminal
in the directory where the bs4 directory exists or add this directory to the Python
Path variable; otherwise, we will get the module not found error. This extra step
is required because this method is specific to a project where the bs4 directory is
included. But with the installation methods we saw previously, Beautiful Soup
will be available globally and can be used in any project, so the additional steps
are not required.
If we did not install Beautiful Soup and instead copied the bs4 directory in the
workspace, we have to change to the directory where we have placed the bs4
directory before using the preceding commands.
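A minimal sketch of this setup, assuming the bs4 directory was copied into a folder
(the path shown is illustrative), is as follows:
import sys
sys.path.append("/home/user/Soup")  # illustrative path that contains the bs4 directory
from bs4 import BeautifulSoup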
Quick reference
The following is an overview of the commands covered in this chapter and
their implications:
• sudo apt-get install python-bs4: installs Beautiful Soup using the Linux package manager
• sudo pip install beautifulsoup4: installs Beautiful Soup using pip
• sudo easy_install beautifulsoup4: installs Beautiful Soup using easy_install
• python setup.py install: installs Beautiful Soup from the downloaded source package
• from bs4 import BeautifulSoup: imports Beautiful Soup to verify the installation
Summary
In this chapter, we covered the various options to install Beautiful Soup in Linux
machines. We also discussed a way of installing Beautiful Soup in Windows, Linux,
and Mac OS X using the Python setup.py script itself. We also discussed the method
to use Beautiful Soup without even installing it. The verification of the Beautiful
Soup installation was also covered.
In the next chapter, we are going to have a first look at Beautiful Soup by learning
the different methods of converting HTML/XML content to different Beautiful Soup
objects and thereby understanding the properties of Beautiful Soup.
Creating a BeautifulSoup
Object
We saw how to install Beautiful Soup in Linux, Windows, and Mac OS X machines in
Chapter 1, Installing Beautiful Soup.
Beautiful Soup is widely used for getting data from web pages. We can use Beautiful
Soup to extract any data in an HTML/XML document, for example, to get all links in
a page or to get text inside tags on the page. In order to achieve this, Beautiful Soup
offers us different objects, and simple searching and navigation methods. The
markup is represented using the following objects:
• BeautifulSoup
• Tag
• NavigableString
helloworld = "<p>Helloworld</p>"
soup_string = BeautifulSoup(helloworld)
The previous code will create the BeautifulSoup object based on the input string
helloworld. We can see that the input has been treated as HTML, and the content of
the object can be verified by print(soup_string).
<html><body><p>Helloworld</p></body></html>
The output of the previous code can be different in some systems based
on the parser used. This is explained later in this chapter.
During the creation of the object, Beautiful Soup converts the input markup (HTML/
XML) to a tree structure using the supported parsers. While doing so, the markup
will be represented as different Beautiful Soup objects such as BeautifulSoup, Tag,
and NavigableString.
For example, consider the case where we need to get a list of all the books published
by Packt Publishing, which is available at http://www.packtpub.com/books. In
order to reduce the overhead of visiting this URL from our browser to get the page
content as String, it is appropriate to create the BeautifulSoup object by providing
the file-like object of the URL.
import urllib2
from bs4 import BeautifulSoup
url = "http://www.packtpub.com/books"
page = urllib2.urlopen(url)
soup_packtpage = BeautifulSoup(page)
In the previous Python script, we have used the urllib2 module, which is a native
Python module, to open the http://www.packtpub.com/books page. The
urllib2.urlopen() method returns a file-like object for the input URL. Then we create
the BeautifulSoup object, soup_packtpage, by passing this file-like object.
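As a side note, the examples in this book use Python 2. On Python 3, where urllib2
was split into urllib.request, an equivalent sketch (not from the original text)
would be:
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.packtpub.com/books"
page = urllib.request.urlopen(url)
soup_packtpage = BeautifulSoup(page)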
For this, create a local folder in your machine by executing the command mkdir
Soup from a terminal. Create an HTML file, foo.html, in this folder using touch
Soup/foo.html. From the same terminal, change to the directory just created
using cd Soup.
Now let us see the creation of BeautifulSoup using the file foo.html.
with open("foo.html","r") as foo_file:
soup_foo = BeautifulSoup(foo_file)
The previous lines of code create a BeautifulSoup object based on the contents
of the local file, foo.html.
Beautiful Soup has a basic warning mechanism to notify whether we have passed a
filename instead of the file object.
But a BeautifulSoup object is still created, treating the string ("foo.html") that
we passed itself as HTML.
The same warning mechanism also notifies us if we tried to pass in a URL instead of
the URL file object.
soup_url = BeautifulSoup("http://www.packtpub.com/books")
Here also the BeautifulSoup object is created by considering the string (URL)
as HTML.
print(soup_url)
So we should pass either the file handle or string to the BeautifulSoup constructor.
• lxml
• html5lib
• html.parser
The features argument of the BeautifulSoup constructor can accept either a list of
strings or a string value. The main feature values and the underlying parsers they
select are as follows:
• html.parser: Python's built-in HTML parser
• lxml: the lxml HTML parser
• xml (or lxml-xml): the lxml XML parser
• html5lib: the html5lib parser
soup_xml = BeautifulSoup(helloworld, features="xml")
In the previous code, we passed xml as the value for the features argument and
created the soup_xml object. We can see that the same content ("<p>Helloworld</p>")
is now being treated as XML instead of HTML.
print(soup_xml)
#output
<?xml version="1.0" encoding="utf-8"?>
<p>Helloworld</p>
The previous code will fail on a system where lxml is not installed, throwing the
following error:
bs4.FeatureNotFound: Couldn't find a tree builder with the
features you requested: xml. Do you need to install a parser
library?
In this case, we should install the required parsers using easy_install, pip, or
setup.py install.
Consider, for example, an input fragment such as <a invalid content. Here, the
HTML is invalid, since there is no closing </a> tag. The processing of this invalid
HTML by the previously mentioned parsers is given as follows:
From the output, it is clear that the lxml parser has processed the invalid
HTML. It added the closing </a> tag and also considered the invalid content
as an attribute of the <a> tag. Apart from this, it has also added the <html>
and <body> tags, which were not present in the input. The addition of the <html>
and <body> tags will be done by default if we use lxml.
From the output, it is clear that the html5lib parser has added the <html>,
<head>, and <body> tags, which were not present in the input. Like the lxml
parser, the html5lib parser will also add these tags for any input. But at the
same time, it has discarded the invalid <a> tag to produce a different
representation of the input.
So, it is good to specify the parser by giving the features argument because this
helps to ensure that the input is processed in the same manner across different
machines. Otherwise, there is a possibility that the same code will break in one of the
machines if some invalid HTML is present, as the default parser that is picked up
by Beautiful Soup will produce a different tree. Specifying the features argument
helps to ensure that the tree generated is identical across all machines.
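As a small sketch of this point (assuming both the lxml and html5lib parsers are
installed), the same invalid fragment can produce two different trees, so pinning the
features argument makes the result reproducible:
invalid_html = "<a invalid content"
soup_lxml = BeautifulSoup(invalid_html, features="lxml")
soup_html5lib = BeautifulSoup(invalid_html, features="html5lib")
print(soup_lxml)       # lxml's repair of the markup
print(soup_html5lib)   # html5lib's repair, which can differ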
While creating the BeautifulSoup object, other objects are also created, which
include the following:
• Tag
• NavigableString
Tag
The Tag object represents different tags of HTML and XML documents. The creation
of Tag objects is done when parsing the documents. The different HTML/XML tags
identified during parsing are represented as corresponding Tag objects and these
objects will have attributes and contents of the HTML/XML tag. The Tag objects can
be used for searching and navigation within the HTML/XML document.
html_atag = """<html><body><p>Test html a tag example</p>
<a href="http://www.packtpub.com">Home</a>
<a href="http://www.packtpub.com/books">Books</a>
</body></html>"""
soup_atag = BeautifulSoup(html_atag, "lxml")
atag = soup_atag.a
print(type(atag))
The previous script will print the first <a> tag in the document. We can see that
type(atag) is <class 'bs4.element.Tag'>.
HTML/XML tags have names (for example, the name of the tag <a> is a and that of
<p> is p) and attributes (for example, class, id, and style). The Tag object allows us
to get the name and attributes associated with each HTML tag.
print(atag.name)
#output
a
The previous code prints the name of the atag object, which is nothing but the name
of the tag <a>.
We can change the name of the tag by changing the value of the .name property:
atag.name = "p"
print(soup_atag)
#output
<html><body><p>Test html a tag example</p>
<p href="http://www.packtpub.com">Home</p>
<a href="http://www.packtpub.com/books">Books</a>
</body></html>
From the output, we can see that first the <a> tag was replaced with the <p> tag.
print(atag['href'])
#output
http://www.packtpub.com
The previous code prints the URL (http://www.packtpub.com) associated with the
first <a> tag by accessing the value of the href attribute.
Different attributes associated with a tag can be accessed using the .attrs property.
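As a quick sketch on a fresh snippet (the markup here is illustrative), .attrs returns
a dictionary of all the attributes of a tag:
sample_soup = BeautifulSoup('<a href="http://www.packtpub.com" id="home">Home</a>')
print(sample_soup.a.attrs)
#output (the ordering of the keys may vary)
{'href': 'http://www.packtpub.com', 'id': 'home'}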
Apart from the name and attributes, a Tag object has helper methods for navigating
and searching through the document, which we will discuss in the following chapters.
We can get the text stored inside a particular tag by using ".string".
first_a_string = soup_atag.string
Quick reference
You can view the following references to get an overview of creating the
following objects:
• BeautifulSoup:
◦ soup = BeautifulSoup(string)
◦ soup = BeautifulSoup(string, features="xml") #for xml
• Tag:
◦ tag = soup.tag #accessing a tag
◦ tag.name #Tag name
◦ tag['attribute'] #Tag attribute
• NavigableString:
◦ soup.tag.string #get Tag's string
Summary
In this chapter, we learned the different objects in the Beautiful Soup module. We
understood how the HTML/XML document is converted to a BeautifulSoup
object with the help of underlying TreeBuilders. We also had a look at the creation
of BeautifulSoup by passing a string and a file object (for a local file and URL).
Creating BeautifulSoup for XML parsing and the use of the features argument in
the constructor were also explained. We saw how the different tags and texts within
the HTML/XML document are represented as Tag and NavigableString objects in
Beautiful Soup.
In the next chapter, we will learn the different searching methods, such as find(),
find_all(), and find_next(), provided by Beautiful Soup. With the help of these
searching methods, we will be able to get data out of the HTML/XML document,
which is indeed the most powerful feature of Beautiful Soup.
Search Using Beautiful Soup
We saw the creation of the BeautifulSoup object and other objects, such as Tag
and NavigableString in Chapter 2, Creating a BeautifulSoup Object. The HTML/XML
document is converted to these objects for the ease of searching and navigating
within the document.
In this chapter, we will learn the different searching methods provided by Beautiful
Soup to search based on tag name, attribute values of tag, text within the document,
regular expression, and so on. At the end, we will make use of these searching
methods to scrape data from an online web page.
• find()
• find_all()
• find_parent()
• find_parents()
• find_next_sibling()
• find_next_siblings()
• find_previous_sibling()
• find_previous_siblings()
• find_previous()
• find_all_previous()
• find_next()
• find_all_next()
<div class="number">100</div>
</li>
</ul>
<ul id="tertiaryconsumers">
<li class="tertiaryconsumerlist">
<div class="name">lion</div>
<div class="number">80</div>
</li>
<li class="tertiaryconsumerlist">
<div class="name">tiger</div>
<div class="number">50</div>
</li>
</ul>
</body>
</html>
Now, we can change to the Soup directory using the following command:
cd Soup
with open("ecologicalpyramid.html", "r") as ecological_pyramid:
    soup = BeautifulSoup(ecological_pyramid, "lxml")
producer_entries = soup.find("ul")
print(producer_entries.li.div.string)
#output
plants
Since producers come as the first entry for the <ul> tag, we can use the find()
method, which normally searches for only the first occurrence of a particular tag in a
BeautifulSoup object. We store this in producer_entries. The next line prints the
name of the first producer. From the previous HTML diagram, we can understand
that the first producer is stored inside the first <div> tag of the first <li> tag that
immediately follows the first <ul> tag, as shown in the following code:
<ul id="producers">
<li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
</ul>
So, after running the preceding code, we will get plants, which is the first producer,
as the output.
Explaining find()
At this point, we know that find() is used to search for the first occurrence of
any items within a BeautifulSoup object. The signature of the find() method is
as follows:
find(name,attrs,recursive,text,**kwargs)
As the signature implies, the find() method accepts the parameters name, attrs,
recursive, text, and **kwargs. The name, attrs, and text parameters are the
filters that can be applied on a find() method.
tag_li = soup.find("li")
print(type(tag_li))
#output
<class 'bs4.element.Tag'>
The preceding code finds the first occurrence of the li tag within the HTML
document and then prints the type of tag_li.
tag_li = soup.find(name="li")
print(type(tag_li))
#output
<class 'bs4.element.Tag'>
By default, find() returns the first Tag object whose name equals the string
we passed.
search_for_stringonly = soup.find(text="fox")
#output
fox
The preceding code will search for the occurrence of the fox text within the
ecological pyramid. Searching for the text using Beautiful Soup is case sensitive. For
example, case_sensitive_string = soup.find(text="Fox") will return None.
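If a case-insensitive match is needed, one option (a sketch, building on the regular
expression support described next) is to pass a compiled pattern with the
re.IGNORECASE flag:
import re
case_insensitive_string = soup.find(text=re.compile("fox", re.IGNORECASE))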
Let us take an example where we are given a page with e-mail IDs, as mentioned in
the following code, and we are asked to find the first e-mail ID:
<br/>
<div>The below HTML has the information that has email ids.</div>
[email protected]
<div>[email protected]</div>
<span>[email protected]</span>
Here, the e-mail IDs are scattered across the page with one inside the <div> tag,
another inside the <span> tag, and the first one, which is not enclosed by any tag.
It is difficult here to find the first e-mail ID. But if we can represent the e-mail ID
using a regular expression, find() can search based on the expression to get the
first e-mail ID.
So in this case, we just need to form the regular expression for the e-mail ID and
pass it to the find() method. The find() method will use the regular expression's
match() method to find a match for the given regular expression.
import re
from bs4 import BeautifulSoup

email_id_example = """<br/>
<div>The below HTML has the information that has email ids.</div>
[email protected]
<div>[email protected]</div>
<span>[email protected]</span>
"""
soup = BeautifulSoup(email_id_example,"lxml")
emailid_regexp = re.compile("\w+@\w+\.\w+")
first_email_id = soup.find(text=emailid_regexp)
print(first_email_id)
#output
[email protected]
In the preceding code, we created the regular expression for the e-mail ID in the
emailid_regexp variable. The pattern we used is \w+@\w+\.\w+. The \w+
symbol represents one or more alphanumeric characters; this is followed by @, then
one or more alphanumeric characters again, then a . symbol, and finally one or
more alphanumeric characters. This matches the e-mail IDs in the preceding example.
We then passed the emailid_regexp variable to the find() method to find the first
text that matches the preceding pattern.
primary_consumers = soup.find(id="primaryconsumers")
print(primary_consumers.li.div.string)
#output
deer
If we analyze the HTML, we can see that the first primary consumer is stored as
follows:
<ul id="primaryconsumers">
<li class="primaryconsumerlist">
<div class="name">deer</div>
<div class="number">1000</div>
</li>
</ul>
We can see that the first primary consumer is stored inside the first <div> tag of the
first <li> tag. The second line prints the string stored inside this <div> tag, which is
the first primary consumer name, which is deer.
Searching based on attribute values will work for most of the attributes, such as id,
style, and title. But, there are some exceptions in the case of a couple of attributes
as follows:
• Custom attributes
• Class
For searching based on the id attribute, we used the following code line:
soup.find(id="primaryconsumers")
But, if we use the attribute value the same way for the following HTML, the code
will throw an error as keyword can't be an expression:
customattr = """<p data-custom="custom">custom attribute
example</p>"""
customsoup = BeautifulSoup(customattr, 'lxml')
customsoup.find(data-custom="custom")
The error is thrown because Python keyword arguments cannot contain the -
character, and the data-custom keyword that we passed contains a - character.
In such cases, we need to pass in the keyword arguments as a dictionary in the attrs
argument as follows:
using_attrs = customsoup.find(attrs={'data-custom':'custom'})
print(using_attrs)
#output
'<p data-custom="custom">custom attribute example</p>'
css_class = soup.find(attrs={'class': 'primaryconsumerlist'})
print(css_class)
#output
<li class="primaryconsumerlist">
<div class="name">deer</div>
<div class="number">1000</div>
</li>
Since searching based on class is a common thing, Beautiful Soup has a special
keyword argument that can be passed for matching the CSS class. The keyword
argument that we can use is class_ and since this is not a reserved keyword in
Python, it won't throw an error.
Line 1:
css_class = soup.find(class_="primaryconsumerlist")
Line 2:
css_class = soup.find(attrs={'class': 'primaryconsumerlist'})
Let's take an example of finding the secondary consumers using functions within the
find() method:
def is_secondary_consumers(tag):
    return tag.has_attr('id') and tag.get('id') == 'secondaryconsumers'
The function checks whether the tag has the id attribute and if its value is
secondaryconsumers. If the two conditions are met, the function will return true,
and so, we will get the particular tag we were looking for in the following code:
secondary_consumer = soup.find(is_secondary_consumers)
print(secondary_consumer.li.div.string)
#output
fox
We use the find() method by passing the function that returns either True or False,
and so, the tag for which the function returns True is displayed, which in our case
corresponds to the first secondary consumer.
But, what if there were multiple tags with the same attribute value? For example,
refer to the following code:
<p class="identical">
Example of p tag with class identical
</p>
<div class="identical">
Example of div tag with class identical
</div>
Here, we have a div tag and a p tag with the same class attribute value "identical".
In this case, if we want to search only for the div tag with the class attribute's value
identical, we can use a combination of the tag name and the attribute value within
the find() method, for example, soup.find("div", class_="identical"):
#output
<div class="identical">
Example of div tag with class identical
</div>
all_tertiaryconsumers = soup.find_all(class_="tertiaryconsumerlist")
The preceding code line finds all the tags with the tertiaryconsumerlist class. If
we run a type check on this variable, we can see that it is nothing but a list of
Tag objects, as follows:
print(type(all_tertiaryconsumers))
#output
<class 'list'>
We can iterate through this list to display all tertiary consumer names by using the
following code:
for tertiaryconsumer in all_tertiaryconsumers:
print(tertiaryconsumer.div.string)
#output
lion
tiger
The limit parameter is used to specify a limit to the number of results that we get.
For example, from the e-mail ID sample we saw, we can use find_all() to get all
the e-mail IDs. Refer to the following code:
email_ids = soup.find_all(text=emailid_regexp)
print(email_ids)
#output
[u'[email protected]',u'[email protected]',u'[email protected]']
Here, if we pass limit, it will limit the result set to the limit we impose, as shown in
the following example:
email_ids_limited = soup.find_all(text=emailid_regexp,limit=2)
print(email_ids_limited)
#output
[u'[email protected]',u'[email protected]']
From the output, we can see that the result is limited to two.
We can pass True or False values to the find methods. If we pass True to
find_all(), it will return all tags in the soup object. In the case of find(), it will be
the first tag within the object. The line print(soup.find_all(True)) will print out
all the tags associated with the soup object.
In the case of searching for text, passing True will return all text within the document
as follows:
all_texts = soup.find_all(text=True)
print(all_texts)
#output
[u'\n', u'\n', u'\n', u'\n', u'\n', u'plants', u'\n', u'100000',
u'\n', u'\n', u'\n', u'algae', u'\n', u'100000', u'\n', u'\n',
u'\n', u'\n', u'\n', u'deer', u'\n', u'1000', u'\n', u'\n',
u'\n', u'rabbit', u'\n', u'2000', u'\n', u'\n', u'\n',
u'\n', u'\n', u'fox', u'\n', u'100', u'\n', u'\n', u'\n',
u'bear', u'\n', u'100', u'\n', u'\n', u'\n', u'\n',
u'\n', u'lion', u'\n', u'80', u'\n', u'\n', u'\n',
u'tiger', u'\n', u'50', u'\n', u'\n', u'\n', u'\n',
u'\n']
The preceding output prints every text content within the soup object including the
new-line characters too.
Also, in the case of text, we can pass a list of strings and find_all() will find every
string defined in the list:
all_texts_in_list = soup.find_all(text=["plants","algae"])
print(all_texts_in_list)
#output
[u'plants', u'algae']
The same applies when searching for tags, attribute values of tags, custom
attributes, and the CSS class.
For finding all the div and li tags, we can use the following code line:
div_li_tags = soup.find_all(["div","li"])
Both find() and find_all() search an object's descendants (that is, all children
coming after it in the tree), their children, and so on. We can control this behavior by
using the recursive parameter. If recursive = False, search happens only on an
object's direct children.
For example, in the following code, search happens only at direct children for div
and li tags. Since the direct child of the soup object is html, the following code will
give an empty list:
div_li_tags = soup.find_all(["div","li"],recursive=False)
print(div_li_tags)
#output
[]
If find_all() can't find results, it will return an empty list, whereas find()
returns None.
Sometimes, the tags we intend to visit are in a direct relationship with a tag we have
already found. For example, we may need to find the immediate parent tag of a
particular tag. There will also be situations where we need to find the previous tag,
the next tag, tags at the same level (siblings), and so on. In these cases, there are
searching methods provided within the BeautifulSoup object, for example,
find_parents(), find_next_siblings(), and so on. Normally, we use these methods
after a find() or find_all() call, since those methods are used for finding one
particular tag and we are then interested in finding the other tags that are in relation
with this tag.
In the primary consumer example, we can find the immediate parent tag associated
with primaryconsumer as follows:
primaryconsumers = soup.find_all(class_="primaryconsumerlist")
primaryconsumer = primaryconsumers[0]
parent_ul = primaryconsumer.find_parents('ul')
print(parent_ul)
#output
<ul id="primaryconsumers">
<li class="primaryconsumerlist">
<div class="name">deer</div>
<div class="number">1000</div>
</li>
<li class="primaryconsumerlist">
<div class="name">rabbit</div>
<div class="number">2000</div>
</li>
</ul>
The first line will store all the primary consumers in the primaryconsumers variable.
We take the first entry and store that in primaryconsumer. We are trying to find all the
<ul> tags, which are the parent tags of the primaryconsumer list that we got. From the
preceding output, we can understand that the result will contain the whole structure of
the parent, which will also include the tag for which we found the parent. We can use
the find_parent() method to find the immediate parent of the tag.
parent_p = primaryconsumer.find_parent("p")
The preceding code will search the parents of the <li> tag with the
primaryconsumerlist class for the first <p> tag in the ecological pyramid
example.
An easy way to get the immediate parent tag for a particular tag is to use the
find_parent() method without any parameter, as follows:
immediateprimary_consumer_parent = primaryconsumer.find_parent()
This means that in our example, the ul tags with the IDs producers,
primaryconsumers, secondaryconsumers, and tertiaryconsumers are siblings.
Also, in the following diagram for producers, we can see that the plants value,
which is the first producer, and algae, which is the second producer, cannot be
treated as siblings, since they are not at the same level.
But both the div tag holding the value plants and the div tag holding the number
100000 can be considered siblings, as they are at the same level.
Beautiful Soup comes with methods to help us find the siblings too.
The find_next_siblings() method allows us to find the next siblings, whereas
find_next_sibling() allows us to find the next sibling. In the following example,
we can find the siblings of the producers:
producers= soup.find(id='producers')
next_siblings = producers.find_next_siblings()
print(next_siblings)
#output
[<ul id="primaryconsumers">
<li class="primaryconsumerlist">
<div class="name">deer</div>
<div class="number">1000</div>
</li>
<li class="primaryconsumerlist">
<div class="name">rabbit</div>
<div class="number">2000</div>
</li>
</ul>, <ul id="secondaryconsumers">
<li class="secondaryconsumerlist">
<div class="name">fox</div>
<div class="number">100</div>
</li>
<li class="secondaryconsumerlist">
<div class="name">bear</div>
<div class="number">100</div>
</li>
</ul>, <ul id="tertiaryconsumers">
<li class="tertiaryconsumerlist">
<div class="name">lion</div>
<div class="number">80</div>
</li>
<li class="tertiaryconsumerlist">
<div class="name">tiger</div>
<div class="number">50</div>
</li>
</ul>]
So, we find the next siblings for the producer, which are the primary consumers,
secondary consumers, and tertiary consumers.
Like the other find methods, we can pass different filters, such as text, regular
expressions, attribute values, and tag names, to these methods to find the siblings
accordingly.
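For example, a small sketch (on the same ecological pyramid soup) that filters the
sibling search by an attribute value is as follows:
producers = soup.find(id="producers")
primary_ul = producers.find_next_sibling(id="primaryconsumers")
print(primary_ul["id"])
#output
primaryconsumers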
For example, we can find all the li tags that come after the first div tag using the
following code:
first_div = soup.div
all_li_tags = first_div.find_all_next("li")
#output
[<li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>, <li class="producerlist">
<div class="name">algae</div>
<div class="number">100000</div>
</li>, <li class="primaryconsumerlist">
<div class="name">deer</div>
<div class="number">1000</div>
</li>, <li class="primaryconsumerlist">
<div class="name">rabbit</div>
<div class="number">2000</div>
</li>, <li class="secondaryconsumerlist">
<div class="name">fox</div>
<div class="number">100</div>
</li>, <li class="secondaryconsumerlist">
<div class="name">bear</div>
<div class="number">100</div>
</li>, <li class="tertiaryconsumerlist">
<div class="name">lion</div>
<div class="number">80</div>
</li>, <li class="tertiaryconsumerlist">
<div class="name">tiger</div>
<div class="number">50</div>
</li>]
The first step involved in this exercise is to analyze the HTML document and to
understand its logical structure. This will help us to understand how the required
information is stored within the HTML document.
We can use the Google Chrome developer tools to understand the logical structure
of the page, for example, by right-clicking on a book title and choosing the Inspect
Element option. The content of the previous URL under the Google Chrome
developer tools will be shown as follows for a particular book title:
We can see that every book title is stored under a <span> tag with
class="field-content". If we take a closer look, we can see that both the published
date information and the price information are stored inside similar <span> tags
with class="field-content". In order to uniquely identify each piece of data, we
can't rely on the <span> tag alone, since all three pieces of information are stored
under <span> tags with the same class. So, it is better to identify another tag that
encloses each piece of information uniquely. In this case, we can see that the <span>
tag corresponding to the book title is stored inside a <div> tag with
class="views-field-title". Similarly, for the published date information, the
corresponding <span> tag is stored inside a <div> tag with
class="views-field-field-date-of-publication-value". Likewise, the <span> tag for
price is stored inside the <div> tag with class="views-field-sell-price". From the
web page source, we can see that the structure is as follows, and it supports the
findings that we had:
<div class="views-field-title">
<span class="field-content">
<a href="/angularjs-web-application-development/book">
Mastering Web Application Development with AngularJS</a>
</span>
</div>
<div class="views-field-field-date-of-publication-value">
<span class="field-content">Published: August 2013</span>
</div>
<div class="views-field-sell-price">
<label class="views-label-sell-price">
Our price:
</label>
<span class="field-content">
1,386.00</span>
</div>
This is the same for each book title on the page. Further analysis of the page shows that
all the book titles are wrapped inside a table with class="views-view-grid", as
shown in the following screenshot:
So, we have formed the logical structure for finding all the books in the web page.
The next step is to form a parsing strategy based on the logical structure we found
out. For the preceding case, we can form a parsing strategy as follows:
4. To find the price of the corresponding title, we should search for the next div
tag with class="views-field-sell-price". From the logical structure,
we can see that for the published date also, we should get the string inside
the span tag.
We can apply the preceding parsing strategy for all the titles within the page. The
next step is to convert this into a script so that we get all information related to the
book title from the page.
We will first create the BeautifulSoup object based on the web page by taking the
example from the previous chapter as follows:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.packtpub.com/books"
page = urllib2.urlopen(url)
soup_packtpage = BeautifulSoup(page)
page.close()
We will first find the table that wraps all the book titles, and then search for all the
div tags with the views-field-title class within this table tag, as follows:
all_books_table = soup_packtpage.find("table", class_="views-view-grid")
all_book_titles = all_books_table.find_all("div", class_="views-field-title")
As the next step, we will iterate through these tag objects, which represent the book
titles, and will find the published date and price for each title, as follows:
for book_title in all_book_titles:
    book_title_span = book_title.span
    print("Title Name is :"+book_title_span.a.string)
    published_date = book_title.find_next("div", class_="views-field-field-date-of-publication-value")
    print("Published Date is :"+published_date.span.string)
    price = book_title.find_next("div", class_="views-field-sell-price")
    print("Price is :"+price.span.string)
#output
Title Name is :Mastering Web Application Development with AngularJS
In the preceding script, we print the title of the book by getting the string inside
the a tag, which is inside the first span tag. We searched for the next <div> tag using
find_next("div", class_="views-field-field-date-of-publication-value") to get the
published date information. Similarly, for the price, we used
find_next("div", class_="views-field-sell-price"). The published date and price
information is stored as the string of the first span tag, and so we used span.string
to print this information. We will get all the book titles, published dates, and prices
from the web page at http://www.packtpub.com/books by running the preceding script.
We have created the program based on the HTML structure of the current
page. The HTML code of this page can change over the course of time
resulting in a change of the logical structure and parsing strategy.
Quick reference
Here, we will see the following searching methods in Beautiful Soup:
Summary
In this chapter, we dealt with the various search methods in Beautiful Soup, such
as find(), find_all(), and find_next(). The different parameters that can
be used with these methods were also explained with the help of sample code.
Combinations of the different filters for the search methods and also finding the tags
in relationships were also discussed in this chapter. We also looked at forming the
logical structure and parsing strategy for finding all the information related to the
book title from an online web page using the different search methods.
In the next chapter, we will learn the different navigation methods in Beautiful Soup.
Navigation Using
Beautiful Soup
In Chapter 3, Search Using Beautiful Soup, we saw how to apply searching methods
to search tags, texts, and more in an HTML document. Beautiful Soup does much
more than just searching. Beautiful Soup can also be used to navigate through the
HTML/XML document. Beautiful Soup comes with attributes to help in the case of
navigation. We can find the same information up to some level using the searching
methods, but in some cases due to the structure of the page, we have to combine
both searching and navigation mechanisms to get the desired result. Navigation
techniques come in handy in those cases. In this chapter, we will get into navigation
using Beautiful Soup in detail.
html_markup = """<div class="ecopyramid">
<ul id="producers">
<li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
<li class="producerlist">
<div class="name">algae</div>
<div class="number">100000</div>
</li>
</ul>
</div>"""
For the previous code snippet, the following HTML tree is formed:
BeautifulSoup
└── html
    └── div: ecopyramid
        └── ul: producers
            ├── li: producerlist
            └── li: producerlist
In the previous figure, we can see that the BeautifulSoup object is the root of the
tree, the Tag objects make up the different nodes of the tree, while the
NavigableString objects make up the leaves of the tree.
Navigation in Beautiful Soup is intended to help us visit the nodes of this HTML/
XML tree. From a particular node, it is possible to:
Navigating down
Any object, such as Tag or BeautifulSoup, that has children can use this navigation.
Navigating down can be achieved in two ways.
soup = BeautifulSoup(html_markup, "lxml")
producer_entries = soup.ul
In the previous code, by using soup.ul, we navigate to the first entry of the ul tag
within the soup object's children.
This can also be done for Tag objects by using the following code:
first_producer = producer_entries.li
print(first_producer)
#output
<li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
The previous code used navigation on the tag object, producer_entries, to find
the first entry of the <li> tag. We can verify this from the output. But this cannot be
used on a NavigableString object, as it doesn't have any children.
producer_name = first_producer.div.string
Trying to navigate further down from this NavigableString object, for example,
producer_name.li, will throw the following AttributeError, since NavigableString
can't have any child objects:
AttributeError: 'NavigableString' object has no attribute 'li'
• Direct children: These come immediately after a node in an HTML tree. For
example, in the following figure, html is the direct child of BeautifulSoup.
• Descendants: These are all the nodes that come below a particular node in the
tree, at any depth. For example, in the following figure, the div, ul, and li tags
are all descendants of the html tag:
BeautifulSoup
└── html
    └── div: ecopyramid
        └── ul: producers
            ├── li: producerlist
            └── li: producerlist
Based on the previous categorization, there are the following different attributes for
navigating to the children:
• .contents
• .children
• .descendants
These attributes will be present in all Tag objects and the BeautifulSoup object that
facilitates navigation to the children.
print(type(soup.contents))
#output
<class 'list'>
From the output, we know that .contents is a list that holds the children. In this
case, the number of children of the BeautifulSoup object can be understood from
the following code snippet:
print(len(soup.contents))
#output
1
We can use any type of list navigation in Python on .contents too. For example, we
can print the name of all children using the following code:
for tag in soup.contents:
print(tag.name)
#output
html
Now let us see that in the case of the Tag object producer_entries using the
following code snippet:
for child in producer_entries.contents:
print(child)
#output
<li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
<li class="producerlist">
<div class="name">algae</div>
<div class="number">100000</div>
</li>
print(type(soup.children))
#output
<class 'list_iterator'>
We can iterate over .children of the BeautifulSoup object, and get the children as
in the example code given as follows:
for tag in soup.children:
print(tag.name)
#output
html
print(len(list(soup.descendants)))
#output
13
From the output, we can see that .descendants gives 13, whereas .contents or
.children gave only 1.
for element in soup.descendants:
    # NavigableString objects have no .name, so fall back to the string itself
    name = getattr(element, "name", None)
    if name:
        print(name)
    else:
        print(element)
#output
html
body
p
ul
li
div
plants
div
100000
li
div
algae
div
100000
Here we are iterating through all the descendants of the soup object. Since
NavigableString doesn't have the .name attribute, we check for it and print the
string itself in the previous code. But for a Tag object, we just print the
.name attribute.
The output for the code is entirely different from the ones in which we used
.contents or .children.
#output
plants
Navigating up
Like navigating down to find children, Beautiful Soup allows users to find the
parents of a particular Tag/NavigableString object. Navigating up is done
using .parent and .parents.
print(soup.ul.parent.name)
#output
div
The .parent attribute of the topmost <html>/<xml> tag is the BeautifulSoup
object itself.
html_tag = soup.html
print(html_tag.parent.name)
#output
u'[document]'
Since the soup object is at the root of the tree, it didn't have a parent. So .parent on
the soup object will return None.
print(soup.parent)
#output
None
third_div = soup.find_all("div")[2]
In the previous code, we store the third <div> entry, which is
<div class="name">algae</div>, in third_div.
for parent in third_div.parents:
    print(parent.name)
#output
li
ul
body
html
[document]
In the previous code, we navigate to the <li> tag, which is the immediate parent
object of third_div, then to the <ul> tag, which is the parent of the <li> tag.
Likewise, navigation to the html tag and finally [document], which represents
the soup object, is done.
print(third_div.string)
#output
u'algae'
#output
<li class="producerlist"><div class="name">plants</div><div
class="number">100000</div></li>
If a tag doesn't have a previous sibling, .previous_sibling will return None; that
is, print(first_producer.previous_sibling) will give us None, since there is no
previous sibling for this tag.
The previous code snippet will give only the <li> tag, which is the only previous
sibling. The same iteration can be used for next_siblings to find the siblings
coming after an object.
For example, the immediate element parsed after the first <li> tag is the <div> tag.
first_producer = soup.li
print(first_producer.next_element)
#output
<div class="name">plants</div>
The previous code prints the next element, which is <div class="name">plants</div>.
In the same way, .previous_element can be used to find the immediate element
parsed before a particular tag or string.
second_div = soup.find_all("div")[1]
print(second_div.previous_element)
#output
plants
From the output, it is clear that the one parsed immediately before the second <div>
tag is the string plants.
Quick reference
The following commands will help you to navigate down the tree:
The following commands help you to navigate to the previous or next element:
Summary
In this chapter, we discussed the various navigation techniques in Beautiful Soup.
We discussed navigating up, down, and sideways, as well as moving to the next
and previous parsed elements, with the help of examples.
In the next chapter, we will learn about the different ways to modify the parsed
HTML tree by adding new contents, for example, tags, strings, modifying tags,
and deleting existing ones.
Modifying Content Using
Beautiful Soup
Beautiful Soup can be used effectively to search or navigate within an HTML/XML
document. Apart from this, we can also use Beautiful Soup to change the content
of an HTML/XML document. Modification of the content means the addition or
deletion of a new tag, changing the tag name, altering tag attribute values, changing
text content, and so on. This is useful in situations where we need to parse the
HTML/XML document, change its structure, and write it back as an HTML/XML
document with the modification.
Consider a case where we have an HTML document with a <table> tag holding
around 1,000 or more rows (the <tr> tag) with an existing set of two columns (the
<td> tag) per row. We want to add a new set of two <td> tags to each row. It
would be highly inappropriate to manually add these <td> tags for each of the <tr>
tags. In this case, we can use Beautiful Soup to search and/or navigate through
each of the <tr> tags to add these <td> tags. We can then save these changes as a
new HTML document, which will then have the four <td> tags per <tr> tag. This
chapter deals with the different methods to modify content using Beautiful Soup by
considering the ecological pyramid example that we used in the previous chapters.
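A minimal sketch of that idea is shown in the following code; the filenames and the
cell contents here are illustrative, not part of the original example:
from bs4 import BeautifulSoup

with open("report.html", "r") as html_file:       # hypothetical input document
    soup = BeautifulSoup(html_file, "lxml")

for row in soup.find_all("tr"):                   # visit every row of the table
    for _ in range(2):                            # add two new cells per row
        new_td = soup.new_tag("td")
        new_td.string = "N/A"                     # illustrative cell content
        row.append(new_td)

with open("report_modified.html", "w") as out_file:
    out_file.write(soup.prettify())               # save the modified document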
soup = BeautifulSoup(html_markup,"lxml")
producer_entries = soup.ul
print(producer_entries.name)
#output
'ul'
From the preceding output, we can see that producer_entries has the name ul; we
can easily change this by modifying the .name property.
producer_entries.name = "div"
print(soup.prettify())
#output
<html>
<body>
<div class="ecopyramid">
<div id="producers">
<li class="producerlist">
<div class="name">
plants
</div>
<div class="number">
100000
</div>
</li>
<li class="producerlist">
<div class="name">
algae
</div>
<div class="number">
100000
</div>
</li>
</div>
</div>
</body>
</html>
This also changes the structure of the HTML tree, since the <ul> tag has now become a <div> tag. We used the prettify() method to show the formatted output in the preceding code. As we can see, changing a Tag object's name property also changes the HTML tree, so we should be careful when renaming tags since it can lead to malformed HTML.
We can modify the attribute values of a tag in the same way as a dictionary entry. For example, we can update the value of the id attribute as follows:
producer_entries['id'] = "producers_new_value"
Now, if we print the producer_entries object, we can see the change in place
as follows:
print(producer_entries.prettify())
#output
<div id="producers_new_value">
<li class="producerlist">
<div class="name">
plants
</div>
<div class="number">
100000
</div>
</li>
<li class="producerlist">
<div class="name">
algae
</div>
<div class="number">
100000
</div>
</li>
</div>
Adding a new attribute uses the same syntax. For example, we can add a class attribute as follows:
producer_entries['class'] = "newclass"
This will add the new attribute if it doesn't exist, or update the value if the attribute already exists in the HTML tree.
print(producer_entries.prettify())
#output
<div class="newclass" id="producers_new_value">
<li class="producerlist">
<div class="name">
plants
</div>
<div class="number">
100000
</div>
</li>
<li class="producerlist">
<div class="name">
algae
</div>
<div class="number">
100000
</div>
</li>
</div>
From the preceding output, we can verify that the new class attribute has been added to the HTML tree.
We can delete a tag attribute using the del operator:
del producer_entries['class']
The preceding code removes the class attribute associated with the tag. Refer to the following code:
print(producer_entries.prettify())
#output
<div id="producers_new_value">
<li class="producerlist">
<div class="name">
plants
</div>
<div class="number">
100000
</div>
</li>
<li class="producerlist">
<div class="name">
algae
</div>
<div class="number">
100000
</div>
</li>
</div>
We can create a new tag using the new_tag() method. For example, to create the new li tag for the third producer:
new_li_tag = soup.new_tag("li")
The preceding code will create and store the new li tag in the new_li_tag variable. The new_tag() method requires only the tag name as a mandatory argument. We can pass attribute values or other properties as optional parameters. That is, we can have the following code:
new_atag = soup.new_tag("a", href="www.example.com")
In the preceding example, we created the <a> tag by giving a name as well as the href attribute and its value.
It is also possible to add a new attribute value to the previously created li tag using
the following code:
new_li_tag.attrs = {'class': 'producerlist'}
We can add this new tag to the HTML tree using the append() method. Since the earlier examples renamed the ul tag, we first recreate the soup from the original markup:
soup = BeautifulSoup(html_markup, "lxml")
producer_entries = soup.ul
producer_entries.append(new_li_tag)
The preceding code will append the newly created li tag to .contents of the ul tag. So, the li tag will now be the last child of the ul tag. The ul tag structure will look like the following code:
print(producer_entries.prettify())
#output
<ul id="producers">
<li class="producerlist">
<div class="name">
plants
</div>
<div class="number">
100000
</div>
</li>
<li class="producerlist">
<div class="name">
algae
</div>
<div class="number">
100000
</div>
</li>
<li class="producerlist">
</li>
</ul>
From the preceding output, we can see that the newly created li tag is added as a new child of ul. Now, we have to add the two div tags inside this li tag.
Next, we create the two div tags and set their class attributes:
new_div_name_tag = soup.new_tag("div")
new_div_name_tag["class"] = "name"
new_div_number_tag = soup.new_tag("div")
new_div_number_tag["class"] = "number"
The preceding lines of code will create the two new div tags with the corresponding class attributes. We can now place them inside the li tag at specific positions using the insert() method as follows:
new_li_tag.insert(0,new_div_name_tag)
new_li_tag.insert(1,new_div_number_tag)
print(new_li_tag.prettify())
#output
<li class="producerlist">
<div class="name">
</div>
<div class="number">
</div>
</li>
Now, we can see that the new tags have been inserted into the li tag. But the respective strings are still missing in these tags.
We can add a string to a tag using the append() method, which adds the string as the last content of the tag. For example, we can add the name of the third producer as follows:
new_div_name_tag.append("phytoplankton")
print(producer_entries.prettify())
#output
<ul id="producers">
<li class="producerlist">
<div class="name">
plants
</div>
<div class="number">
100000
</div>
</li>
<li class="producerlist">
<div class="name">
algae
</div>
<div class="number">
100000
</div>
</li>
<li class="producerlist">
<div class="name">
phytoplankton
</div>
<div class="number">
</div>
</li>
</ul>
From the preceding output, we can see that the string has been added.
The append() method always adds a new string at the end of the tag's contents. For example, if we append one more string, producer, it comes after phytoplankton:
new_div_name_tag.append("producer")
print(soup.prettify())
#output
<html>
<body>
<div class="ecopyramid">
<ul id="producers">
<li class="producerlist">
<div class="name">
plants
</div>
<div class="number">
100000
</div>
</li>
<li class="producerlist">
<div class="name">
algae
</div>
<div class="number">
100000
</div>
</li>
<li class="producerlist">
<div class="name">
phytoplankton
producer
</div>
<div class="number">
</div>
</li>
</ul>
</div>
</body>
</html>
There is one more method, new_string(), that will help in creating a new string as
follows:
new_string_toappend = soup.new_string("producer")
new_div_name_tag.append(new_string_toappend)
The preceding code will create the new string, and we can then use either append() or insert() to add the newly created string to the tree. Like append(), we can also use insert() for inserting strings, as follows:
new_string_toinsert = soup.new_string("10000")
new_div_number_tag.insert(0, new_string_toinsert)
The resulting tree after these string additions will look like the following code:
<html>
<body>
<div class="ecopyramid">
<ul id="producers">
<li class="producerlist">
<div class="name">
plants
</div>
<div class="number">
100000
</div>
</li>
<li class="producerlist">
<div class="name">
algae
</div>
<div class="number">
100000
</div>
</li>
<li class="producerlist">
<div class="name">
phytoplankton
producer
</div>
<div class="number">
10000
</div>
</li>
</ul>
</div>
</body>
</html>
We can remove this producer entry by removing the div tags first and then the li tag from the ul tag. We will remove the div tag with class="name" using the decompose() method.
third_producer = soup.find_all("li")[2]
div_name = third_producer.div
div_name.decompose()
print(third_producer.prettify())
#output
<li class="producerlist">
<div class="number">
10000
</div>
</li>
In the preceding code, we used find_all() to find all the li entries and stored the third producer in the third_producer variable. Then, we found its first div tag and removed it using the decompose() method.
Likewise, we can remove the div tag with the class="number" entry too. Refer to
the following code:
third_producer = soup.find_all("li")[2]
div_number = third_producer.div
div_number.decompose()
print(third_producer.prettify())
#output
<li class="producerlist">
</li>
The extract() method is used to remove a tag or string from an HTML tree. Additionally, it also returns a handle to the tag or string that was removed. Unlike decompose(), extract() can be used on strings as well. This is shown as follows:
third_producer_removed = third_producer.extract()
print(soup.prettify())
#output
<html>
<body>
<div class="ecopyramid">
<ul id="producers">
<li class="producerlist">
<div class="name">
plants
</div>
<div class="number">
100000
</div>
</li>
<li class="producerlist">
<div class="name">
algae
</div>
<div class="number">
100000
</div>
</li>
</ul>
</div>
</body>
</html>
After executing the preceding code, we can see that the third producer, which is
phytoplankton, has been removed from the ecological pyramid example.
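Since extract() also works on NavigableString objects, we could, for example, pull just the string plants out of the tree; a small sketch:

plants_string = soup.find(text="plants")
removed_string = plants_string.extract()  # returns the removed NavigableString
print(removed_string)
#output
plants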
The clear() method removes the contents of a tag while leaving the tag itself in place. For example, we can remove the <div> tag holding the string plants and the corresponding <div> tag with the class number from the first li tag using the clear() method; this is shown as follows:
li_plants = soup.li
Since it is the first li tag, we will be able to select it using the tag name; this is shown as follows:
li_plants.clear()
This will remove all .contents of the tag. So, clear() will remove all the strings
and the children within a particular tag. This is shown as follows:
print(li_plants)
#output
<li class="producerlist"></li>
We can also insert a tag or string at a position relative to another element using the insert_after() method. For example, we can add a new div tag holding the string soil after the number div of the first producer; this is shown as follows:
soup = BeautifulSoup(html_markup, "lxml")
div_ecosystem = soup.new_tag("div")
div_ecosystem['class'] = "ecosystem"
div_ecosystem.append("soil")
div_number = soup.find("div", class_="number")
div_number.insert_after(div_ecosystem)
print(soup.prettify())
#output
<html>
<body>
<div class="ecopyramid">
<ul id="producers">
<li class="producerlist">
<div class="name">
plants
</div>
<div class="number">
100000
</div>
<div class="ecosystem">
soil
</div>
</li>
<li class="producerlist">
<div class="name">
algae
</div>
<div class="number">
100000
</div>
</li>
</ul>
</div>
</body>
</html>
Here, we have created a new div tag and appended the string soil to it.
We used the insert_after() method to insert the tag in the correct place.
Likewise, we can use insert_before() to insert a Tag or NavigableString
object before something else in the tree.
The replace_with() method can be used to replace a tag or string in the tree with another tag or string. For example, we can replace the string plants with phytoplankton; this is shown as follows:
plants_string = soup.find(text="plants")
plants_string.replace_with("phytoplankton")
print(soup.prettify())
#output
<html>
<body>
<div class="ecopyramid">
<ul id="producers">
<li class="producerlist">
<div class="name">
phytoplankton
</div>
<div class="number">
100000
</div>
<div class="ecosystem">
soil
</div>
</li>
<li class="producerlist">
<div class="name">
algae
</div>
<div class="number">
100000
</div>
</li>
</ul>
</div>
</body>
</html>
The wrap() method is used to wrap a tag or string with another tag that we pass. For example, we can wrap each li tag with another <div> tag in the following example:
li_tags = soup.find_all("li")
for li in li_tags:
    new_divtag = soup.new_tag("div")
    li.wrap(new_divtag)
print(soup.prettify())
#output
<html>
<body>
<div class="ecopyramid">
<ul id="producers">
<div>
<li class="producerlist">
<div class="name">
phytoplankton
</div>
<div class="number">
100000
</div>
<div class="ecosystem">
soil
</div>
</li>
</div>
<div>
<li class="producerlist">
<div class="name">
algae
</div>
<div class="number">
100000
</div>
</li>
</div>
</ul>
</div>
</body>
</html>
From the preceding output, it is clear that each li tag is now wrapped in a new div tag.
The unwrap() method does the opposite of wrap() and is used to unwrap
the contents as follows:
soup = BeautifulSoup(html_markup,"lxml")
li_tags = soup.find_all("li")
for li in li_tags:
    li.div.unwrap()
print(soup.prettify())
#output
<html>
<body>
<div class="ecopyramid">
<ul id="producers">
<li class="producerlist">
plants
<div class="number">
100000
</div>
</li>
<li class="producerlist">
algae
<div class="number">
100000
</div>
</li>
</ul>
</div>
</body>
</html>
Here, the first div tag inside each li tag, that is, the div tag with class="name", has been unwrapped, leaving its string contents in place.
Quick reference
You can take a look at the following references to get an overview of modifying content:
• tag.name = "newname": This renames the tag
• tag['attribute'] = "value": This adds or updates the attribute value of a tag
• del tag['attribute']: This deletes an attribute of a tag
• soup.new_tag("tagname"): This creates a new tag
• soup.new_string("string"): This creates a new string
• tag.append(content): This appends a tag or string at the end of the tag's contents
• tag.insert(position, content): This inserts a tag or string at the given position
• tag.decompose(): This removes a tag and its contents from the tree
• tag.extract(): This removes a tag or string and returns a handle to it
• tag.clear(): This removes the contents of a tag
• Special functions:
The following are the special functions used to add or alter tags:
°° replace_with(): This replaces a tag or string with the one passed
°° insert_before(): This inserts a tag or string before another element
°° insert_after(): This inserts a tag or string after another element
°° wrap(): This wraps a tag or string with the tag passed
°° unwrap(): This replaces a tag with its contents
Summary
In this chapter, we took a look at the content modification techniques in Beautiful Soup. The creation and addition of new tags and the modification of attribute values were discussed with the help of examples. The deletion and replacement of content were also explained. Finally, we dealt with some special functions, such as replace_with(), wrap(), and unwrap(), which are very helpful when it comes to changing the contents.
In the next chapter, we will discuss the encoding support in Beautiful Soup.
Encoding Support in Beautiful Soup
All web pages have an encoding associated with them. Modern websites use different encodings, such as UTF-8 and Latin-1; nowadays, UTF-8 is the standard encoding used in websites. So, while dealing with the scraping of such pages, it is important that the scraper is also capable of understanding those encodings. Otherwise, where the user sees the correct characters in a web browser, the result you get after using a scraper could be gibberish characters. For example, consider a sample web content from Wikipedia where we are able to see the Spanish character ñ.
If we run the same content through a scraper with no support for the previous
encoding used by the website, we might end up with the following content:
The Spanish language is written using the Spanish alphabet, which
is the Latin alphabet with one additional letter,
e単e (単), for a total of 27 letters.
Beautiful Soup uses the UnicodeDammit library to automatically detect the encoding
of the document. Beautiful Soup converts the content to Unicode while creating soup
objects from the document.
However, the detection can go wrong. For example, creating the soup from the preceding content and printing the paragraph string might give the following:
soup = BeautifulSoup(html_markup, "lxml")
print(soup.p.string)
#output
The Spanish language is written using the Spanish alphabet, which
is the Latin alphabet with one additional letter, e単e (単), for
a total of 27 letters.
From the previous output, we can see that there is a difference in the additional letter part (e単e (単)): there is a gibberish character instead of the actual representation. This is due to a wrong interpretation of the document's original encoding by UnicodeDammit.
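We can also run content through UnicodeDammit directly to see what it detects; a minimal sketch, where the byte string is a hypothetical stand-in for the page content:

from bs4 import UnicodeDammit
dammit = UnicodeDammit(b"<p>e\xc3\xb1e</p>")  # hypothetical UTF-8 bytes for eñe
print(dammit.original_encoding)  # the encoding UnicodeDammit detected
print(dammit.unicode_markup)  # the content converted to Unicode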
We can fix this by explicitly specifying the document's encoding using the from_encoding parameter while creating the BeautifulSoup object:
soup = BeautifulSoup(html_markup, "lxml", from_encoding="utf-8")
print(soup.prettify())
#output
<html>
<body>
<p>
The Spanish language is written using the Spanish alphabet,
which is the Latin alphabet with one additional letter,
eñe ⟨ñ⟩, for a total of 27 letters.
</p>
</body>
</html>
There are no longer gibberish characters because we have specified the correct encoding, and we can verify this from the output.
The encoding that we pass should be correct; otherwise, the character conversion will be wrong. For example, if we pass the encoding as latin-1 for the preceding HTML fragment, the result will be different:
soup = BeautifulSoup(html_markup,"lxml",from_encoding="latin-1")
print(soup.prettify())
#output
The Spanish language is written using the Spanish alphabet, which
is the Latin alphabet with one additional letter, e単e (単), for
a total of 27 letters.
So it is better to pass the encoding only if we are sure about the encoding used
in the document.
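One way to be sure is to first check the encoding Beautiful Soup detected on its own through the soup.original_encoding attribute (listed in the quick reference at the end of this chapter); a small sketch:

soup = BeautifulSoup(html_markup, "lxml")
print(soup.original_encoding)  # for example, 'utf-8' when detection succeeds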
Output encoding
Encoding support is also present for the output text from Beautiful Soup. Certain output methods in Beautiful Soup, for example, prettify(), give the output in the UTF-8 encoding by default. Even if the document's encoding was something different, such as ISO-8859-2, the output will be in UTF-8. For example, the following HTML content is in the ISO-8859-2 encoding:
html_markup = """
<html>
<meta http-equiv="Content-Type"
content="text/html;charset=ISO8859-2"/>
<p>cédille (from French), is a hook or tail ( ž ) added under
certain letters as a diacritical mark to modify their
pronunciation
</p>"""
soup = BeautifulSoup(html_markup,"lxml")
print(soup.prettify())
#output
<html>
<head>
<meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
</head>
<body>
<p>
cĂŠdille (from French), is a hook or tail ( Ĺž ) added under
certain letters as a diacritical mark to modify their
pronunciation
</p>
</body>
</html>
Note that the meta tag got changed to utf-8 to reflect the changes; also, the
characters are different from the original content.
This is the default behavior of the prettify() method. But if we don't want the
encoding to be changed to UTF-8, we can specify the output encoding by passing
it in the prettify() method as shown in the following code snippet:
print(soup.prettify("ISO8859-2"))
The preceding code line will give the output in ISO8859-2 itself.
<html>
<head>
<meta content="text/html;charset=ISO-8859-2" http-
equiv="Content-Type"/>
</head>
<body>
<p>
cédille (from French), is a hook or tail ( ž ) added under
certain letters as a diacritical mark to modify their
pronunciation
</p>
</body>
</html>
We can also call encode() on any Beautiful Soup object and represent it in the
encoding we pass. The encode() method also considers UTF-8 encoding by
default. This is shown in the following code snippet:
print(soup.p.encode())
#output
cĂŠdille (from French), is a hook or tail ( Ĺž ) added under
certain letters as a diacritical mark to modify their
pronunciation
We can pass a different encoding to the encode() method to get the output in that encoding; for example, passing ISO8859-2 gives back the original characters:
print(soup.p.encode("ISO8859-2"))
#output
cédille (from French), is a hook or tail ( ž ) added under certain
letters as a diacritical mark to modify their pronunciation
Quick reference
You can take a look at the following references to get an overview of the code present
in this chapter:
• soup = BeautifulSoup(html_markup, "lxml", from_encoding="latin-1"): Here, from_encoding is used while creating BeautifulSoup to specify the document encoding.
• soup.original_encoding: This gives the original encoding detected by
Beautiful Soup.
• The output content in a specific encoding can be produced using the following methods:
°° soup.prettify()
°° soup.encode()
Summary
In this chapter, we saw the encoding support in Beautiful Soup. We understood how
to get the original encoding detected by Beautiful Soup. We also learned to create a
BeautifulSoup object by explicitly specifying the encoding. The output encoding
was also discussed in this chapter. The next chapter deals with the different methods
provided by Beautiful Soup to display content.
Output in Beautiful Soup
Beautiful Soup not only searches, navigates, and modifies HTML/XML, but can also output the content in a good format. Beautiful Soup can deal with different types of printing, such as:
• Formatted printing
• Unformatted printing
Apart from these, Beautiful Soup provides different formatters to format the output.
Since the HTML tree can undergo modification after creation, these output methods
will help in viewing the modified HTML tree.
Also in this chapter, we will discuss a simple method of getting only the text stored
within a web page.
Formatted printing
Beautiful Soup has two supported ways of printing. The first one is formatted printing, which prints the current Beautiful Soup object as a formatted Unicode string. Each tag is printed on a separate line with proper indentation, which gives the output the right look and feel. Beautiful Soup has the built-in method prettify() for formatted printing. For example:
html_markup = """<p class="ecopyramid">
<ul id="producers">
<li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
<li class="producerlist">
<div class="name">algae</div>
<div class="number">100000</div>
</li>
</ul>"""
soup = BeautifulSoup(html_markup,"lxml")
print(soup.prettify())
In the output, we can see that the <html> and <body> tags get appended. This is because Beautiful Soup uses the lxml parser, which treats any string passed to it as an HTML document and adds the missing top-level tags before printing.
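A quick way to observe this (a small sketch; html.parser is Python's built-in parser, which does not append the extra tags):

print(BeautifulSoup("<li>plants</li>", "lxml"))
#output
<html><body><li>plants</li></body></html>
print(BeautifulSoup("<li>plants</li>", "html.parser"))
#output
<li>plants</li>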
The prettify() method can be called either on a Beautiful Soup object or any of the
tag objects. For example:
producer_entry = soup.ul
print(producer_entry.prettify())
Unformatted printing
Beautiful Soup supports the plain printing of the BeautifulSoup and Tag objects.
This will return only the plain string without any formatting.
If we use the str() function on the BeautifulSoup or the Tag object, we get a normal Python string, shown as follows:
print(str(soup))
#output
'<html><body><p class="ecopyramid"></p><ul id="producers"><li
class="producerlist"><div class="name">plants</div><div
class="number">100000</div></li><li class="producerlist"><div
class="name">algae</div><div
class="number">100000</div></li></ul></body></html>'
We can use the encode() method that we used in Chapter 6, Encoding Support in
Beautiful Soup, to encode the output in a specific encoding format.
We can use the decode() function on the BeautifulSoup or Tag object to get the
Unicode string.
print(soup.decode())
#output
u'<html><body><p class="ecopyramid"></p><ul id="producers"><li
class="producerlist"><div class="name">plants</div><div
class="number">100000</div></li><li class="producerlist"><div
class="name">algae</div><div
class="number">100000</div></li></ul></body></html>'
Apart from this, Beautiful Soup supports different formatters to format the output.
Consider the following HTML, which displays a few symbols along with their corresponding HTML entity codes in a table:
<html>
<body>
<table>
<tr>
<th>&</th>
<td align="center">&amp;</td>
<td align="center">ampersand</td>
</tr>
<tr>
<th>¢</th>
<td align="center">&cent;</td>
<td align="center">cent</td>
</tr>
<tr>
<th>©</th>
<td align="center">&copy;</td>
<td align="center">copyright</td>
</tr>
<tr>
<th>÷</th>
<td align="center">&divide;</td>
<td align="center">divide</td>
</tr>
<tr>
<th>></th>
<td align="center">&gt;</td>
<td align="center">greater than</td>
</tr>
</table>
</body>
</html>
The left-most column represents the symbols, and the corresponding HTML entities are represented in the next column. For the symbol &, the corresponding HTML entity code is &amp;; likewise, for the symbol ©, the code is &copy;.
The output methods in Beautiful Soup escape only the characters >, <, and &, as &gt;, &lt;, and &amp;. The rest of the special entities are converted to Unicode characters while constructing the BeautifulSoup object, so upon output using prettify() or other methods, we get only the UTF-8 characters rather than the HTML entities. We won't get the HTML entities back (except for &amp;, &gt;, and &lt;).
html_markup = """<html>
<body>& & ampersand
¢ ¢ cent
© © copyright
÷ ÷ divide
> > greater than
</body>
</html>
"""
soup = BeautifulSoup(html_markup,"lxml")
print(soup.prettify())
Here, we have created the soup object based on the preceding markup. In the prettify() output, the &amp; code is replaced by the & symbol, and likewise for the other entities. To control this behavior, Beautiful Soup supports the following output formatters:
• minimal
• html
• None
• function
We can pass different formatters as parameters to any of the output methods, such as
prettify(), encode(), and decode().
The difference between the minimal and html formatters is that the html formatter changes whatever Unicode characters possible back to HTML entities, whereas minimal (the default) escapes only &, <, and >.
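A small sketch of the difference on a tag containing the © character:

snippet = BeautifulSoup("<p>© copyright</p>", "lxml")
print(snippet.p.prettify(formatter="minimal"))
#output
<p>
 © copyright
</p>
print(snippet.p.prettify(formatter="html"))
#output
<p>
 &copy; copyright
</p>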
We can also pass a function as the formatter; every string passes through this function before output. Here, we define a function to remove the character a from the strings passed:
def remove_chara(markup):
    return markup.replace("a", "")
print(soup.prettify(formatter=remove_chara))
We use this function to strip out the a characters from the output. We should note that, with a function formatter, Beautiful Soup will not escape the three special characters &, >, and < in the output; the function itself is responsible for any escaping.
Using get_text()
Getting just text from websites is a common task. Beautiful Soup provides the
method get_text() for this purpose.
If we want to get only the text of a Beautiful Soup or a Tag object, we can use the
get_text() method. For example:
soup = BeautifulSoup(html_markup, "lxml")
print(soup.get_text())
#output
plants
100000
algae
100000
The get_text() method returns the text inside the Beautiful Soup or Tag object as
a single Unicode string. But get_text() has issues when dealing with web pages.
Web pages often have JavaScript code, and the get_text() method returns the
JavaScript code as well. For example, in Chapter 3, Search Using Beautiful Soup, we
saw the example of scraping book details from packtpub.com.
import urllib2
from bs4 import BeautifulSoup
url = "http://www.packtpub.com/books"
page = urllib2.urlopen(url)
soup_packtpage = BeautifulSoup(page,"lxml")
We can print the text inside the page using get_text(); this is shown in the
following code snippet:
print(soup_packtpage.get_text())
With the previous code line, we will also get JavaScript code in the output as shown:
$(window).load(function() {
$("img[data-original]").addClass("lazy");
$("img.lazy").lazyload();
setTimeout(function() {
addthis_config = Drupal.settings.addthis.config_default;
addthis_share = Drupal.settings.addthis.share_default;
if (typeof pageTracker != "undefined")
{addthis_config.data_ga_tracker = pageTracker;}
var at = document.createElement("script"); at.type =
"text/javascript"; at.async = true;
at.src = "//dgdsbygo8mp3h.cloudfront.net/sites/default/
files/addthis/addthis_widget.js";
var s = document.getElementsByTagName("script")[0];
s.parentNode.insertBefore(at, s);
}, 5);
});
Removing the JavaScript and printing only the text within a document can be
achieved using the following code line:
[x.extract() for x in soup_packtpage.find_all('script')]
The previous code line will remove all the script elements from the document. After
this, the print(soup_packtpage.get_text()) code line will print only the text
stored within the page.
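The same technique extends to other non-text elements, such as style tags; a small sketch:

for tag in soup_packtpage.find_all(['script', 'style']):
    tag.extract()  # remove both script and style elements
print(soup_packtpage.get_text())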
Quick reference
You can take a look at the following references to get an overview of the output methods:
• prettify(): This method gives the formatted Unicode output
• str(): This gives the unformatted output as a normal Python string
• encode(): This encodes the output in a specific encoding
• decode(): This gives the output as a Unicode string
• formatter="minimal"/"html"/None/function: These formatters can be passed to the output methods to control the formatting
• get_text(): This method gives you all the text content of an object or a tag
Summary
This chapter dealt with the different output methods in Beautiful Soup. We saw the formatted and unformatted printing of the BeautifulSoup object and also the different output formatters used to control the output formatting. We also saw a method of getting only the text from a web page.
In the next chapter, we will create a web scraper using searching, navigation, and
other techniques we have studied in this book so far.
Creating a Web Scraper
In this chapter, we will create a website scraper using the different searching and navigation techniques we studied in the previous chapters. The scraper will visit three websites to find the selling price of books based on the ISBN. We will first find the book list and selling prices from packtpub.com. We will also find the ISBN of each book from packtpub.com and search other websites, such as www.amazon.com and www.barnesandnoble.com, based on this ISBN. By doing this, we will automate the process of finding the selling price of the books on three websites and will also gain hands-on experience in implementing scrapers for these websites. Since the website structures may change later, the code examples used in this chapter may also become invalid, so it is better to take the examples as a reference and change the code accordingly. It is good to visit these websites for a better understanding.
We have seen how to scrape the book title and the selling price from packtpub.com
in Chapter 3, Search Using Beautiful Soup. The example we discussed in that chapter
considered only the first page and didn't include the other pages that also had the list
of books. So in the next topic, we will find different pages containing a list of books.
So, we need to find a method for getting the multiple pages that contain the list of books. The logical approach is to follow the link pointed to by the next element on the current page. Taking a look at the next element for page 49 using the Google Chrome developer tools, we can see that it points to the next page link, that is, /books?page=49. If we observe different pages using the developer tools, we can see that the link to the next page follows the pattern /books?page=n for the (n+1)th page, that is, n=49 for the 50th page.
We can further see that the next element is within the <li> tag with class="pager-next last". Inside this <li> tag, there is an <a> tag that holds the link to the next page. In this case, the corresponding value is /books?page=49, which points to the 50th page. We have to add www.packtpub.com to this value to make a valid URL, that is, www.packtpub.com/books?page=49.
If we analyze the packtpub.com website, we can see that the list of published books ends at page 50, so we need to ensure that our program stops at this page. The program can stop looking for more pages if it is unable to find the next element; for example, page 50 does not have a next element.
We can use the following code to find pages containing a list of books from
packtpub.com:
import urllib2
import re
from bs4 import BeautifulSoup

packtpub_url = "http://www.packtpub.com/"

def get_bookurls(url):
    page = urllib2.urlopen(url)
    soup_packtpage = BeautifulSoup(page, "lxml")
    page.close()
    next_page_li = soup_packtpage.find("li", class_="pager-next last")
    if next_page_li is None:
        next_page_url = None
    else:
        next_page_url = packtpub_url + next_page_li.a.get('href')
    return next_page_url
The preceding get_bookurls() function returns the next page URL if we provide
the current page URL. For the last page, it returns None.
In order to collect these page URLs in a list, use the following code:
start_url = "http://www.packtpub.com/books"
continue_scraping = True
books_url = [start_url]
while continue_scraping:
    next_page_url = get_bookurls(start_url)
    if next_page_url is None:
        continue_scraping = False
    else:
        books_url.append(next_page_url)
        start_url = next_page_url
We can use the following code to find the details of each book:
def get_bookdetails(url):
    page = urllib2.urlopen(url)
    soup_packtpage = BeautifulSoup(page, "lxml")
    page.close()
    all_books_table = soup_packtpage.find("table", class_="views-view-grid")
    all_book_titles = all_books_table.find_all("div", class_="views-field-title")
    isbn_list = []
    for book_title in all_book_titles:
        book_title_span = book_title.span
        print("Title Name:" + book_title_span.a.string)
        print("Url:" + book_title_span.a.get('href'))
        price = book_title.find_next("div", class_="views-field-sell-price")
        print("PacktPub Price:" + price.span.string)
        isbn_list.append(get_isbn(book_title_span.a.get('href')))
    return isbn_list
The preceding code is the same as the code we used in Chapter 3, Search Using Beautiful Soup, to get the book details. The additions are the isbn_list variable that holds the ISBN numbers and the get_isbn() function that returns the ISBN for a particular book.
The ISBN of a book is stored in the book's details page. Viewing the details page through the developer tools, we can see that the label ISBN : is stored inside a <b> tag, and the ISBN value is the text that immediately follows it.
Now, in the following code, let us see how we can find the ISBN using the get_isbn() function:
def get_isbn(url):
    book_title_url = packtpub_url + url
    page = urllib2.urlopen(book_title_url)
    soup_bookpage = BeautifulSoup(page, "lxml")
    page.close()
    isbn_tag = soup_bookpage.find('b', text=re.compile("ISBN :"))
    return isbn_tag.next_sibling
In the preceding code, we searched for the b tag with text that matches the pattern ISBN :. The ISBN is the next_sibling of this b tag.
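For instance, if the details page contains <b>ISBN :</b>978-1-78328-955-4, the find() call matches the <b> tag, and next_sibling returns the text 978-1-78328-955-4 (an illustrative value).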
On each main page, there will be a list of books, and for each book, there will be an ISBN. So, we need to call the get_bookdetails() function for each URL in the books_url list, as follows:
isbns = []
for bookurl in books_url:
    isbns += get_bookdetails(bookurl)
The print(isbns) statement will print the list of ISBNs for all the books that are currently published by packtpub.com.
We scraped the selling price, book title, and ISBN from the PacktPub website. We will
use the ISBN to search for the selling price of the same books in both www.amazon.com
and www.barnesandnoble.com. With that, our scraper will be complete.
The page generated after searching in Amazon will have a URL structure as follows:
http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=1783289554
From this, we can conclude that if we substitute the field-keywords value of the URL with the corresponding ISBN, we will get the details for that ISBN.
On the results page, the price is stored inside a div tag with the newPrice class. We can find the selling price from Amazon using the following code:
def get_sellingprice_amazon(isbn):
    url_foramazon = "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords="
    url_for_isbn_inamazon = url_foramazon + isbn
    page = urllib2.urlopen(url_for_isbn_inamazon)
    soup_amazon = BeautifulSoup(page, "lxml")
    page.close()
    selling_price_tag = soup_amazon.find('div', class_="newPrice")
    if selling_price_tag:
        print("Amazon Price:" + selling_price_tag.span.string)
We created the soup object based on this URL. After creating the soup object, we found the div tag with the newPrice class. The selling price is stored inside the <span> tag within it, and we print it using print("Amazon Price:" + selling_price_tag.span.string).
The URL that can be used for Barnes and Noble is http://www.barnesandnoble.com/s/ISBN, where ISBN is the ISBN value, for example, http://www.barnesandnoble.com/s/1904811590.
Now, we have to find the selling price from the page. The page at Barnes and Noble lists the selling price in a div tag with the class price hilight.
We can find the selling price from Barnes and Noble using the following code:
def get_sellingprice_barnes(isbn):
    url_forbarnes = "http://www.barnesandnoble.com/s/"
    url_for_isbn_inbarnes = url_forbarnes + isbn
    page = urllib2.urlopen(url_for_isbn_inbarnes)
    soup_barnes = BeautifulSoup(page, "lxml")
    page.close()
    selling_price_tag = soup_barnes.find('div', class_="price hilight")
    if selling_price_tag:
        print("Barnes Price:" + selling_price_tag.string)
The entire code for creating the web scraper is available in the code bundle.
Summary
In this chapter, we created a sample scraper using Beautiful Soup. We used the search and navigation methods of Beautiful Soup to get information from packtpub.com, amazon.com, and barnesandnoble.com.