Rule-Based Data Extraction in HTML using Python textminer Module
Last Updated: 21 Jul, 2021
When working with HTML, it is often necessary to extract data from plain HTML tags in an orderly way into Python data containers such as lists, dictionaries, and integers. This article covers a library, textminer, that achieves this using a rule-based approach.
Features of Python textminer:
- Extracts data in the form of lists, dictionaries, and text from HTML.
- Uses a rule-based system, with rules written in YAML format.
- Supports extraction from a URL, in the form of scraping.
Installation:
Use the below command to install Python textminer:
pip install textminer
Functions Description:
The following functions come in handy while extracting data from HTML:
Syntax:
extract(html, rule)
Parameters:
- html: The HTML to extract data from.
- rule: Rule in YAML format to apply on HTML to extract data.
Syntax:
extract_from_url(url, rule)
Parameters:
- url: The URL of the HTML page to extract data from.
- rule: Rule in YAML format to apply on HTML to extract data.
Example 1: Extracting data from HTML
This basic YAML rule extracts the data found between a prefix and a suffix.
Python3
import textminer
# input html
inp_html = '<html><body><div>GFG is best for Geeks</div></body></html>'
# yaml rule string
rule = '''
value:
  prefix: <div>
  suffix: </div>
'''
# using extract() to get required data
res = textminer.extract(inp_html, rule)
print("The data extracted between divs : ")
print(res)
Output :
Extracted data between divs
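To make the prefix/suffix idea concrete, the sketch below performs the same kind of lookup with plain string methods. It only illustrates the concept and is not textminer's actual implementation; the helper name `between` is hypothetical.

```python
def between(text, prefix, suffix):
    """Return the text between the first `prefix` and the next `suffix`.

    Hypothetical helper for illustration only.
    Returns None if either marker is missing.
    """
    start = text.find(prefix)
    if start == -1:
        return None
    start += len(prefix)  # skip past the prefix itself
    end = text.find(suffix, start)
    if end == -1:
        return None
    return text[start:end]


inp_html = '<html><body><div>GFG is best for Geeks</div></body></html>'
print(between(inp_html, '<div>', '</div>'))  # prints: GFG is best for Geeks
```

Returning None when a marker is absent is a design choice of this sketch; how textminer reports a missing match is not covered here.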
Example 2: Extracting a list from HTML
A Python list can be extracted from HTML list markup by using <li> and </li> as the prefix and suffix of the rule. Additionally, the rule must use the "list" keyword.
Python3
import textminer
# input html
inp_html = """<html>
<body>
<ul>
<li>Gfg</li>
<li>is</li>
<li>best</li>
</ul>
</body>
</html>"""
# yaml rule string
# extracting list using <li>
# using "list" keyword
rule = '''
list:
  prefix: <li>
  suffix: </li>
'''
# using extract() to get required data
res = textminer.extract(inp_html, rule)
print("The data extracted between list tags : ")
print(res)
Output :
Extracted list
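The list case amounts to collecting every match instead of only the first one. The sketch below shows that idea with the standard `re` module. It is an illustration of the concept under that assumption, not the library's own code, and the helper name `all_between` is hypothetical.

```python
import re


def all_between(text, prefix, suffix):
    """Return every non-overlapping substring found between `prefix` and `suffix`.

    Hypothetical helper for illustration only.
    """
    # re.escape keeps characters such as "<" and "/" literal;
    # the non-greedy group stops at the nearest suffix.
    pattern = re.escape(prefix) + r"(.*?)" + re.escape(suffix)
    return re.findall(pattern, text, flags=re.DOTALL)


inp_html = """<html>
<body>
<ul>
<li>Gfg</li>
<li>is</li>
<li>best</li>
</ul>
</body>
</html>"""

print(all_between(inp_html, "<li>", "</li>"))  # prints: ['Gfg', 'is', 'best']
```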
Example 3: Extracting a dictionary from HTML using defined data types
Similar to the above example, a dictionary can be extracted using the "dict" keyword. Each entry names the "key" to map the result to, and its value is extracted by defining prefix and suffix tags with a specific id. The data type can be specified using the "type" keyword.
Python3
import textminer
# input html
inp_html = """<html>
<body>
<div id="Gfg">Best</div>
<div id="4">Geeks</div>
</body>
</html>"""
# yaml rule string
# extracting a dictionary using the "dict" keyword
# using the "type" keyword to request integer format
rule = '''
dict:
- key: gfg
  prefix: <div id="Gfg">
  suffix: </div>
- key: 4
  prefix: <div id="4">
  suffix: </div>
  type: int
'''
'''
# using extract() to get required data
res = textminer.extract(inp_html, rule)
print("The data extracted between dictionary tags : ")
print(res)
Output :
Extracted Dictionary
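The dictionary rule can be pictured as a list of per-key prefix/suffix lookups with an optional type conversion. The sketch below models that in plain Python. The helper names, the `CONVERTERS` table, and the sample HTML are all hypothetical, and applying the type to the extracted value is an assumption of this sketch rather than confirmed textminer behavior.

```python
def between(text, prefix, suffix):
    """Return the text between the first `prefix` and the next `suffix`, or None."""
    start = text.find(prefix)
    if start == -1:
        return None
    start += len(prefix)
    end = text.find(suffix, start)
    return None if end == -1 else text[start:end]


# Hypothetical mapping from a rule's "type" name to a Python converter.
CONVERTERS = {"str": str, "int": int, "float": float}


def extract_dict(text, fields):
    """Build a dict with one entry per field spec (key, prefix, suffix, optional type)."""
    result = {}
    for field in fields:
        raw = between(text, field["prefix"], field["suffix"])
        convert = CONVERTERS[field.get("type", "str")]
        # Assumption: the type is applied to the extracted value, not to the key.
        result[field["key"]] = None if raw is None else convert(raw)
    return result


# Hypothetical sample data, chosen so that the integer conversion succeeds.
sample_html = '<div id="name">Geeks</div><div id="count">4</div>'
fields = [
    {"key": "name", "prefix": '<div id="name">', "suffix": "</div>"},
    {"key": "count", "prefix": '<div id="count">', "suffix": "</div>", "type": "int"},
]

print(extract_dict(sample_html, fields))  # prints: {'name': 'Geeks', 'count': 4}
```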
Example 4: Extracting data from a URL
Apart from being passed as a string, HTML can also be supplied through a URL by using extract_from_url().
Python3
import textminer
# required url
target_url = "https://www.geeksforgeeks.org/"
# extracting title from url
rule = '''
value:
  prefix: <title>
  suffix: </title>
'''
# using extract_from_url() to get required data
res = textminer.extract_from_url(target_url, rule)
print("The data extracted between title tags from url : ")
print(res)
Output :
Extraction from URL.
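Conceptually, extracting from a URL is a fetch followed by the same prefix/suffix lookup. The sketch below shows that two-step flow using only the standard library. It is an assumption about the general approach, not textminer's implementation, and the names `fetch_html` and `extract_between_from_url` are hypothetical. A `data:` URL is used so that the example runs without network access.

```python
from urllib.parse import quote
from urllib.request import urlopen


def between(text, prefix, suffix):
    """Return the text between the first `prefix` and the next `suffix`, or None."""
    start = text.find(prefix)
    if start == -1:
        return None
    start += len(prefix)
    end = text.find(suffix, start)
    return None if end == -1 else text[start:end]


def fetch_html(url, timeout=10):
    """Download a page and decode it, falling back to UTF-8 if no charset is declared."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_between_from_url(url, prefix, suffix):
    """Fetch `url`, then extract the text between `prefix` and `suffix`."""
    return between(fetch_html(url), prefix, suffix)


# Offline demo page embedded in a data: URL (hypothetical content).
demo_url = "data:text/html," + quote("<html><head><title>Demo page</title></head></html>")
print(extract_between_from_url(demo_url, "<title>", "</title>"))  # prints: Demo page
```

Passing an `https://` address instead of `demo_url` would perform a real request, which needs network access and may raise `urllib.error.URLError` on failure.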