How to Remove HTML Tags from String in Python
Last Updated: 26 Nov, 2024
Removing HTML tags from a string in Python can be achieved using various methods, including regular expressions and specialized libraries like Beautiful Soup. Each approach is suitable for specific scenarios, depending on your requirements. Let’s explore how to efficiently remove HTML tags.
Using Regular Expressions
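The simplest way to remove HTML tags is with the built-in re module. This approach needs no third-party packages and works well for short, well-formed snippets. A regular expression does not actually parse HTML, though, so it can give wrong results on comments, script blocks, or attribute values that contain a > character.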
Python
import re
# Sample string with HTML tags
s1 = "<h1>Welcome to Python Programming</h1>"
# Removing HTML tags using regex
s2 = re.sub(r"<.*?>", "", s1)
print(s2)
Output:
Welcome to Python Programming
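The re.sub() method replaces every occurrence of the pattern <.*?> with an empty string, removing the HTML tags from the input string. The ? after .* makes the match non-greedy, so each tag is matched on its own rather than everything between the first < and the last > being removed at once.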
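The difference between the non-greedy and greedy patterns can be seen in a short sketch. The helper name strip_tags_regex, the re.DOTALL flag (so that tags spanning several lines also match), and the html.unescape() step (to decode entities such as &amp;) are assumptions added for illustration and are not part of the example above.

```python
import html
import re

# Non-greedy pattern: "<", then as few characters as possible, then ">".
# re.DOTALL is an added assumption so that tags broken across lines match too.
TAG_PATTERN = re.compile(r"<.*?>", re.DOTALL)


def strip_tags_regex(text: str) -> str:
    """Hypothetical helper: remove tag-like spans, then decode HTML entities."""
    without_tags = TAG_PATTERN.sub("", text)
    return html.unescape(without_tags)


print(strip_tags_regex("<p>Fish &amp; <b>Chips</b></p>"))  # Fish & Chips

# A greedy pattern swallows everything from the first "<" to the last ">".
print(repr(re.sub(r"<.*>", "", "<b>bold</b> text")))  # ' text'
```

This remains a pattern match rather than a parser: a > inside an attribute value, for instance, ends the match early and leaves part of the tag behind, which is one reason the parser-based methods may be preferable for real-world pages.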
Let's explore other methods of removing HTML tags from a string in Python:
Using Beautiful Soup (for Nested HTML Structures)
For more robust and complex cases, especially when dealing with malformed HTML, Beautiful Soup is a preferred choice. It ensures better accuracy and handles a wider range of edge cases.
Python
from bs4 import BeautifulSoup
# Sample string with HTML tags
s1 = "<h1>Welcome to <b>Python Programming</b></h1>"
# Removing HTML tags using Beautiful Soup
soup = BeautifulSoup(s1, "html.parser")
s2 = soup.get_text()
print(s2)
Output:
Welcome to Python Programming
Beautiful Soup parses the string as HTML and extracts only the text content using the get_text() method.
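When adjacent elements carry separate pieces of text, get_text() joins them with no space in between by default. The sketch below, which assumes the beautifulsoup4 package is installed, uses the separator and strip arguments of get_text() and drops script and style elements first. The function name strip_tags is hypothetical and chosen only for this illustration.

```python
from bs4 import BeautifulSoup


def strip_tags(markup: str) -> str:
    """Hypothetical helper: return the visible text of an HTML fragment."""
    soup = BeautifulSoup(markup, "html.parser")
    # Remove <script> and <style> elements so their contents are not returned
    for element in soup(["script", "style"]):
        element.decompose()
    # Join the text pieces with a single space and trim each piece
    return soup.get_text(separator=" ", strip=True)


print(strip_tags("<div><p>Hello</p><p>World</p><script>alert(1)</script></div>"))
```

With the default arguments, the two paragraphs in this sample would run together as one word, so a separator is usually worth setting when the text comes from block-level elements.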
Using lxml
The lxml library is another efficient option, especially for performance-critical applications. It is a third-party package built on the C library libxml2, which keeps the overhead of parsing the HTML and extracting the text content low.
Python
from lxml.html import fromstring
# Sample string with HTML tags
s1 = "<h1>Welcome to Python Programming</h1>"
# Removing HTML tags using lxml
tree = fromstring(s1)
s2 = tree.text_content()
print(s2)
Output:
Welcome to Python Programming
The text_content() method extracts text while ignoring all HTML tags.