Python BeautifulSoup - Parse HTML, XML Documents in Python
Python BeautifulSoup
last modified July 27, 2020
BeautifulSoup
BeautifulSoup is a Python library for parsing HTML and XML documents. It is often used for web
scraping. BeautifulSoup transforms a complex HTML document into a tree of Python
objects, such as tags, navigable strings, or comments.
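These object types can be inspected directly; the following sketch uses the standard library's html.parser and a tiny made-up document:

```python
from bs4 import BeautifulSoup

# A tiny document with a tag, a text node, and a comment
soup = BeautifulSoup('<p>Hello<!-- a comment --></p>', 'html.parser')

p = soup.p
print(type(p).__name__)              # Tag
print(type(p.contents[0]).__name__)  # NavigableString
print(type(p.contents[1]).__name__)  # Comment
```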
Installing BeautifulSoup
We use the pip3 command to install the necessary modules.
$ pip3 install beautifulsoup4 lxml
We install the beautifulsoup4 module, which provides the bs4 package, and the lxml parser used in the examples.
In the examples, we use the following HTML file:
index.html
<!DOCTYPE html>
<html>
<head>
<title>Header</title>
<meta charset="utf-8">
</head>
<body>
<h2>Operating systems</h2>
<ul id="mylist" style="width:150px">
<li>Solaris</li>
<li>FreeBSD</li>
<li>Debian</li>
<li>NetBSD</li>
<li>Windows</li>
</ul>
<p>
FreeBSD is an advanced computer operating system used to
power modern servers, desktops, and embedded platforms.
</p>
<p>
Debian is a Unix-like computer operating system that is
composed entirely of free software.
</p>
</body>
</html>
Python BeautifulSoup simple example
In the first example, we use the BeautifulSoup module to get three tags.
simple.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

print(soup.h2)
print(soup.head)
print(soup.li)
We import the BeautifulSoup class from the bs4 module. The BeautifulSoup is the main class
for doing work.
with open('index.html', 'r') as f:

    contents = f.read()
We open the index.html file and read its contents with the read method.
soup = BeautifulSoup(contents, 'lxml')
A BeautifulSoup object is created; the HTML data is passed to the constructor. The second option
specifies the parser.
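Besides 'lxml', BeautifulSoup also works with the standard library's 'html.parser', which needs no extra package; a minimal sketch:

```python
from bs4 import BeautifulSoup

# html.parser ships with Python; lxml must be installed separately
soup = BeautifulSoup('<h2>Operating systems</h2>', 'html.parser')

print(soup.h2.text)  # Operating systems
```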
print(soup.h2)
print(soup.head)
print(soup.li)
There are multiple li elements; the line prints the first one.
$ ./simple.py
<h2>Operating systems</h2>
<head>
<title>Header</title>
<meta charset="utf-8"/>
</head>
<li>Solaris</li>
The name attribute of a tag gives its name, and the text attribute retrieves its text content.
tags_names.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

h2 = soup.h2
print(f'HTML: {h2}, name: {h2.name}, text: {h2.text}')
The code example prints the HTML code, name, and text of the h2 tag.
$ ./tags_names.py
HTML: <h2>Operating systems</h2>, name: h2, text: Operating systems
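In addition to name and text, a tag exposes its attributes through the attrs dictionary and the get method. A small sketch, using an element modeled after the document's ul tag:

```python
from bs4 import BeautifulSoup

# An element with attributes, modeled after the ul tag of index.html
soup = BeautifulSoup('<ul id="mylist" style="width:150px"></ul>', 'html.parser')

ul = soup.ul
print(ul.attrs)      # all attributes as a dictionary
print(ul.get('id'))  # mylist; get returns None for a missing attribute
```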
With the recursiveChildGenerator method, we can traverse the whole HTML document.
traverse_tree.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

for child in soup.recursiveChildGenerator():

    if child.name:
        print(child.name)
The example goes through the document tree and prints the names of all HTML tags.
$ ./traverse_tree.py
html
head
title
meta
body
h2
ul
li
li
li
li
li
p
p
With the children attribute, we can get the children of a tag.
get_children.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

root = soup.html

root_childs = [e.name for e in root.children if e.name is not None]
print(root_childs)
The example retrieves children of the html tag, places them into a Python list, and prints them to
the console. Since the children attribute also returns whitespace text nodes between the tags, we
add a condition to include only the tag names.
$ ./get_children.py
['head', 'body']
With the descendants attribute, we can get all descendants (children of all levels) of a tag.
get_descendants.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

root = soup.body

root_childs = [e.name for e in root.descendants if e.name is not None]
print(root_childs)

The example retrieves all descendants of the body tag.
$ ./get_descendants.py
['h2', 'ul', 'li', 'li', 'li', 'li', 'li', 'p', 'p']
BeautifulSoup can also work with HTML retrieved from the web; here we use the requests module to download a page.
scraping.py
#!/usr/bin/python

from bs4 import BeautifulSoup
import requests as req

resp = req.get('http://webcode.me')

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.title)
print(soup.title.text)
print(soup.title.parent)
The example retrieves the title of a simple web page. It also prints its parent.
resp = req.get('http://webcode.me')

soup = BeautifulSoup(resp.text, 'lxml')

We get the HTML data of the web page.

print(soup.title)
print(soup.title.text)
print(soup.title.parent)

We retrieve the HTML code of the title, its text, and the HTML code of its parent.
$ ./scraping.py
<title>My html page</title>
My html page
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>My html page</title>
</head>
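Note that dotted access such as soup.title is a shorthand for soup.find('title'); both yield the first matching element. A minimal sketch with an inline document:

```python
from bs4 import BeautifulSoup

html = '<html><head><title>My html page</title></head><body></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Dotted access and find return the same first element
print(soup.title == soup.find('title'))  # True
print(soup.title.text)                   # My html page
```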
With the prettify method, we can make the HTML code look better.
prettify.py
#!/usr/bin/python

from bs4 import BeautifulSoup
import requests as req

resp = req.get('http://webcode.me')

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.prettify())

We prettify the HTML code of a simple web page.
$ ./prettify.py
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>
My html page
</title>
</head>
<body>
<p>
Today is a beautiful day. We go swimming and fishing.
</p>
<p>
Hello there. How are you?
</p>
</body>
</html>
In the following example, we serve the index.html file locally and scrape it.

$ mkdir public
$ cp index.html public/

We create a public directory and copy the index.html file there.

$ python -m http.server --directory public

Then we launch the built-in Python HTTP server, which serves files from the public directory on port 8000.

scraping2.py
#!/usr/bin/python

from bs4 import BeautifulSoup
import requests as req

resp = req.get('http://localhost:8000/')

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.title)
print(soup.body)

The example retrieves the title and the body of the locally served page.
With the find method, we can find elements by various means, including by the element id.
find_by_id.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

#print(soup.find('ul', attrs={'id': 'mylist'}))
print(soup.find('ul', id='mylist'))

The code example finds the ul tag that has the mylist id. The commented line is an alternative way
of doing the same task.
With the find_all method, we can find all elements that meet the given criteria.
find_all.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

for tag in soup.find_all('li'):
    print(f'{tag.name}: {tag.text}')

The example finds and prints all li tags.
$ ./find_all.py
li: Solaris
li: FreeBSD
li: Debian
li: NetBSD
li: Windows
The find_all method can also take a list of elements to search for.
find_all2.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

tags = soup.find_all(['h2', 'p'])

for tag in tags:
    print(' '.join(tag.text.split()))
The example finds all h2 and p elements and prints their text.
The find_all method can also take a function which determines what elements should be
returned.
find_by_fun.py
#!/usr/bin/python

from bs4 import BeautifulSoup

def myfun(tag):

    return tag.is_empty_element

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

tags = soup.find_all(myfun)
print(tags)
print(tags)
The example prints empty elements.
$ ./find_by_fun.py
[<meta charset="utf-8"/>]
Regular expressions can be applied to the content of elements.
regex.py
#!/usr/bin/python

import re
from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

strings = soup.find_all(string=re.compile('BSD'))

for txt in strings:
    print(' '.join(txt.split()))

The example prints the content of elements that contain the 'BSD' string.
$ ./regex.py
FreeBSD
NetBSD
FreeBSD is an advanced computer operating system used to power modern servers, desktops, and embedded platforms.
With the select and select_one methods, we can use CSS selectors to find elements.
select_nth_tag.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

print(soup.select_one('li:nth-of-type(3)'))

This example uses a CSS selector to print the HTML code of the third li element.
$ ./select_nth_tag.py
<li>Debian</li>
select_by_id.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

print(soup.select_one('#mylist'))

The example prints the element that has the mylist id.
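The select method returns a list of all matches for any CSS selector; for example, a selector group can collect several element types in document order. A small sketch with a made-up document:

```python
from bs4 import BeautifulSoup

html = '<body><h2>Systems</h2><p>First</p><p>Second</p></body>'
soup = BeautifulSoup(html, 'html.parser')

# A selector group matches both h2 and p elements
tags = soup.select('h2, p')

for tag in tags:
    print(f'{tag.name}: {tag.text}')
```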
With the append method, we can append a new tag to an element.
append_tag.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

newtag = soup.new_tag('li')
newtag.string = 'OpenBSD'

ultag = soup.ul

ultag.append(newtag)

print(ultag.prettify())
newtag = soup.new_tag('li')
newtag.string = 'OpenBSD'

A new li tag is created with the new_tag method, and its string content is set.

ultag = soup.ul

ultag.append(newtag)

We get a reference to the ul tag and append the newly created tag to it.

print(ultag.prettify())

Finally, we print the ul tag in a neat format.
The insert method inserts a tag at the specified position.
insert_tag.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

newtag = soup.new_tag('li')
newtag.string = 'OpenBSD'

ultag = soup.ul

ultag.insert(2, newtag)

print(ultag.prettify())
The example inserts a li tag at the third position within the ul tag.
With the replace_with method, we can replace the text of an element.
replace_text.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

tag = soup.find(text='Windows')
tag.replace_with('OpenBSD')

print(soup.ul.prettify())
The example finds a specific element with the find method and replaces its content with the
replace_with method.
The decompose method removes a tag from the tree and destroys it.
decompose_tag.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

ptag2 = soup.select_one('p:nth-of-type(2)')

ptag2.decompose()

print(soup.body.prettify())

The example removes the second p element.
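A related method, extract, also removes a tag from the tree, but returns it instead of destroying it, so the tag can be reused elsewhere. A minimal sketch:

```python
from bs4 import BeautifulSoup

html = '<body><p>First</p><p>Second</p></body>'
soup = BeautifulSoup(html, 'html.parser')

# extract removes the first p tag and hands it back
ptag = soup.p.extract()

print(ptag)       # <p>First</p>
print(soup.body)  # <body><p>Second</p></body>
```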