Python BeautifulSoup - Parse HTML, XML Documents in Python
Python BeautifulSoup
last modified July 27, 2020
BeautifulSoup
BeautifulSoup is a Python library for parsing HTML and XML documents. It is often used for web
scraping. BeautifulSoup transforms a complex HTML document into a tree of Python
objects, such as tags, navigable strings, or comments.
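These object types can be inspected directly; the following sketch uses the standard library's html.parser and a tiny made-up document:

```python
from bs4 import BeautifulSoup

# A tiny document with a tag, a text node, and a comment
soup = BeautifulSoup('<p>Hello<!-- a comment --></p>', 'html.parser')

p = soup.p
print(type(p).__name__)              # Tag
print(type(p.contents[0]).__name__)  # NavigableString
print(type(p.contents[1]).__name__)  # Comment
```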
Installing BeautifulSoup
We use the pip3 command to install the necessary modules.
$ pip3 install beautifulsoup4 lxml
We install the beautifulsoup4 module, which provides the bs4 package, and the lxml parser used in the examples.
In the examples, we use the following HTML file:
index.html
<!DOCTYPE html>
<html>
<head>
<title>Header</title>
<meta charset="utf-8">
</head>
<body>
<h2>Operating systems</h2>
<ul id="mylist" style="width:150px">
<li>Solaris</li>
<li>FreeBSD</li>
<li>Debian</li>
<li>NetBSD</li>
<li>Windows</li>
</ul>
<p>
FreeBSD is an advanced computer operating system used to
power modern servers, desktops, and embedded platforms.
</p>
<p>
Debian is a Unix-like computer operating system that is
composed entirely of free software.
</p>
</body>
</html>
Python BeautifulSoup simple example
In the first example, we use the BeautifulSoup module to get three tags.
simple.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

print(soup.h2)
print(soup.head)
print(soup.li)
We import the BeautifulSoup class from the bs4 module. The BeautifulSoup is the main class
for doing work.
with open('index.html', 'r') as f:

    contents = f.read()
We open the index.html file and read its contents with the read method.
soup = BeautifulSoup(contents, 'lxml')
A BeautifulSoup object is created; the HTML data is passed to the constructor. The second option
specifies the parser.
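Besides 'lxml', BeautifulSoup also works with the standard library's 'html.parser', which needs no extra package; a minimal sketch:

```python
from bs4 import BeautifulSoup

# html.parser ships with Python; lxml must be installed separately
soup = BeautifulSoup('<h2>Operating systems</h2>', 'html.parser')

print(soup.h2.text)  # Operating systems
```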
print(soup.h2)
print(soup.head)
print(soup.li)
There are multiple li elements; the line prints the first one.
$ ./simple.py
<h2>Operating systems</h2>
<head>
<title>Header</title>
<meta charset="utf-8"/>
</head>
<li>Solaris</li>
The name attribute of a tag gives its name, and the text attribute retrieves its text content.
tags_names.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

h2 = soup.h2
print(f'HTML: {h2}, name: {h2.name}, text: {h2.text}')
The code example prints the HTML code, name, and text of the h2 tag.
$ ./tags_names.py
HTML: <h2>Operating systems</h2>, name: h2, text: Operating systems
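In addition to name and text, a tag exposes its attributes through the attrs dictionary and the get method. A small sketch, using an element modeled after the document's ul tag:

```python
from bs4 import BeautifulSoup

# An element with attributes, modeled after the ul tag of index.html
soup = BeautifulSoup('<ul id="mylist" style="width:150px"></ul>', 'html.parser')

ul = soup.ul
print(ul.attrs)      # all attributes as a dictionary
print(ul.get('id'))  # mylist; get returns None for a missing attribute
```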
With the recursiveChildGenerator method, we can traverse the whole HTML document.
traverse_tree.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

for child in soup.recursiveChildGenerator():

    if child.name:
        print(child.name)
The example goes through the document tree and prints the names of all HTML tags.
$ ./traverse_tree.py
html
head
title
meta
body
h2
ul
li
li
li
li
li
p
p
With the children attribute, we can get the children of a tag.
get_children.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

root = soup.html

root_childs = [e.name for e in root.children if e.name is not None]
print(root_childs)
The example retrieves children of the html tag, places them into a Python list, and prints them to
the console. Since the children attribute also returns whitespace text nodes between the tags, we
add a condition to include only the tag names.
$ ./get_children.py
['head', 'body']
With the descendants attribute, we can get all descendants (children of all levels) of a tag.
get_descendants.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

root = soup.body

root_childs = [e.name for e in root.descendants if e.name is not None]
print(root_childs)

The example retrieves all descendants of the body tag.
$ ./get_descendants.py
['h2', 'ul', 'li', 'li', 'li', 'li', 'li', 'p', 'p']
BeautifulSoup can also work with HTML retrieved from the web; here we use the requests module to download a page.
scraping.py
#!/usr/bin/python

from bs4 import BeautifulSoup
import requests as req

resp = req.get('http://webcode.me')

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.title)
print(soup.title.text)
print(soup.title.parent)
The example retrieves the title of a simple web page. It also prints its parent.
resp = req.get('http://webcode.me')

soup = BeautifulSoup(resp.text, 'lxml')

We get the HTML data of the web page.

print(soup.title)
print(soup.title.text)
print(soup.title.parent)

We retrieve the HTML code of the title, its text, and the HTML code of its parent.
$ ./scraping.py
<title>My html page</title>
My html page
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>My html page</title>
</head>
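Note that dotted access such as soup.title is a shorthand for soup.find('title'); both yield the first matching element. A minimal sketch with an inline document:

```python
from bs4 import BeautifulSoup

html = '<html><head><title>My html page</title></head><body></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Dotted access and find return the same first element
print(soup.title == soup.find('title'))  # True
print(soup.title.text)                   # My html page
```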
With the prettify method, we can make the HTML code look better.
prettify.py
#!/usr/bin/python

from bs4 import BeautifulSoup
import requests as req

resp = req.get('http://webcode.me')

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.prettify())

We prettify the HTML code of a simple web page.
$ ./prettify.py
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>
My html page
</title>
</head>
<body>
<p>
Today is a beautiful day. We go swimming and fishing.
</p>
<p>
Hello there. How are you?
</p>
</body>
</html>
In the following example, we serve the index.html file locally and scrape it.

$ mkdir public
$ cp index.html public/

We create a public directory and copy the index.html file there.

$ python -m http.server --directory public

Then we launch the built-in Python HTTP server, which serves files from the public directory on port 8000.

scraping2.py
#!/usr/bin/python

from bs4 import BeautifulSoup
import requests as req

resp = req.get('http://localhost:8000/')

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.title)
print(soup.body)

The example retrieves the title and the body of the locally served page.
With the find method, we can find elements by various means, including by the element id.
find_by_id.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

#print(soup.find('ul', attrs={'id': 'mylist'}))
print(soup.find('ul', id='mylist'))

The code example finds the ul tag that has the mylist id. The commented line is an alternative way
of doing the same task.
With the find_all method, we can find all elements that meet the given criteria.
find_all.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

for tag in soup.find_all('li'):
    print(f'{tag.name}: {tag.text}')

The example finds and prints all li tags.
$ ./find_all.py
li: Solaris
li: FreeBSD
li: Debian
li: NetBSD
li: Windows
The find_all method can also take a list of elements to search for.
find_all2.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

tags = soup.find_all(['h2', 'p'])

for tag in tags:
    print(' '.join(tag.text.split()))
The example finds all h2 and p elements and prints their text.
The find_all method can also take a function which determines what elements should be
returned.
find_by_fun.py
#!/usr/bin/python

from bs4 import BeautifulSoup

def myfun(tag):

    return tag.is_empty_element

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

tags = soup.find_all(myfun)
print(tags)
print(tags)
The example prints empty elements.
$ ./find_by_fun.py
[<meta charset="utf-8"/>]
Regular expressions can be applied to the content of elements.
regex.py
#!/usr/bin/python

import re
from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

strings = soup.find_all(string=re.compile('BSD'))

for txt in strings:
    print(' '.join(txt.split()))

The example prints the content of elements that contain the 'BSD' string.
$ ./regex.py
FreeBSD
NetBSD
FreeBSD is an advanced computer operating system used to power modern servers, desktops, and embedded platforms.
With the select and select_one methods, we can use CSS selectors to find elements.
select_nth_tag.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

print(soup.select_one('li:nth-of-type(3)'))

This example uses a CSS selector to print the HTML code of the third li element.
$ ./select_nth_tag.py
<li>Debian</li>
select_by_id.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

print(soup.select_one('#mylist'))

The example prints the element that has the mylist id.
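The select method returns a list of all matches for any CSS selector; for example, a selector group can collect several element types in document order. A small sketch with a made-up document:

```python
from bs4 import BeautifulSoup

html = '<body><h2>Systems</h2><p>First</p><p>Second</p></body>'
soup = BeautifulSoup(html, 'html.parser')

# A selector group matches both h2 and p elements
tags = soup.select('h2, p')

for tag in tags:
    print(f'{tag.name}: {tag.text}')
```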
With the append method, we can append a new tag to an element.
append_tag.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

newtag = soup.new_tag('li')
newtag.string = 'OpenBSD'

ultag = soup.ul

ultag.append(newtag)

print(ultag.prettify())
newtag = soup.new_tag('li')
newtag.string = 'OpenBSD'

A new li tag is created with the new_tag method, and its string content is set.

ultag = soup.ul

ultag.append(newtag)

We get a reference to the ul tag and append the newly created tag to it.

print(ultag.prettify())

Finally, we print the ul tag in a neat format.
The insert method inserts a tag at the specified position.
insert_tag.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

newtag = soup.new_tag('li')
newtag.string = 'OpenBSD'

ultag = soup.ul

ultag.insert(2, newtag)

print(ultag.prettify())
The example inserts a li tag at the third position within the ul tag.
With the replace_with method, we can replace the text of an element.
replace_text.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

tag = soup.find(text='Windows')
tag.replace_with('OpenBSD')

print(soup.ul.prettify())
The example finds a specific element with the find method and replaces its content with the
replace_with method.
The decompose method removes a tag from the tree and destroys it.
decompose_tag.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:

    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

ptag2 = soup.select_one('p:nth-of-type(2)')

ptag2.decompose()

print(soup.body.prettify())

The example removes the second p element.
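A related method, extract, also removes a tag from the tree, but returns it instead of destroying it, so the tag can be reused elsewhere. A minimal sketch:

```python
from bs4 import BeautifulSoup

html = '<body><p>First</p><p>Second</p></body>'
soup = BeautifulSoup(html, 'html.parser')

# extract removes the first p tag and hands it back
ptag = soup.p.extract()

print(ptag)       # <p>First</p>
print(soup.body)  # <body><p>Second</p></body>
```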