Python Pandas - Working with HTML Data



The Pandas library can read and write data in many formats. One such format is HTML (HyperText Markup Language), the standard markup for structuring web content. HTML documents often contain tabular data, which Pandas can extract and analyze.

An HTML table represents tabular data as rows and columns within a web page. The pandas.read_html() function extracts such tables from an HTML document, and the DataFrame.to_html() method writes a Pandas DataFrame back out as an HTML table.

In this tutorial, we will learn how to work with HTML data using Pandas, including reading HTML tables and writing Pandas DataFrames to HTML tables.

Reading HTML Tables from a URL

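The pandas.read_html() function reads tables from HTML files, file-like objects, or URLs. It parses the <table> elements in the HTML and returns a list of pandas.DataFrame objects, one per table found. It relies on an HTML parser library being installed, such as lxml, or BeautifulSoup4 together with html5lib.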

Example

Here is a basic example of reading tables from a URL using the pandas.read_html() function.

import pandas as pd

# Read HTML table from a URL
url = "https://www.tutorialspoint.com/sql/sql-clone-tables.htm"
tables = pd.read_html(url)

# Access the first table from the URL
df = tables[0]

# Display the first rows of the resulting DataFrame
print('Output First DataFrame:')
print(df.head())

Following is the output of the above code −

Output First DataFrame:
   ID      NAME  AGE    ADDRESS  SALARY
0   1    Ramesh   32  Ahmedabad  2000.0
1   2    Khilan   25      Delhi  1500.0
2   3   Kaushik   23       Kota  2000.0
3   4  Chaitali   25     Mumbai  6500.0
4   5    Hardik   27     Bhopal  8500.0
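
Because pandas.read_html() returns a list rather than a single DataFrame, it can help to inspect that list before indexing into it. The following is a minimal offline sketch of that idea. The two-table HTML page and its contents are hypothetical, made up for illustration, and it is wrapped in io.StringIO so that no network access is needed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables (illustrative data, not from a real site)
html_page = """
<html><body>
<table>
  <tr><th>ID</th><th>NAME</th></tr>
  <tr><td>1</td><td>Ramesh</td></tr>
  <tr><td>2</td><td>Khilan</td></tr>
</table>
<table>
  <tr><th>CITY</th><th>COUNT</th></tr>
  <tr><td>Delhi</td><td>3</td></tr>
</table>
</body></html>
"""

# read_html returns a list with one DataFrame per <table> element
tables = pd.read_html(StringIO(html_page))

print(type(tables).__name__, len(tables))

# Show the column names and shape of each table before picking one
for position, table in enumerate(tables):
    print(position, list(table.columns), table.shape)
```

Listing the columns and shapes first makes it easier to choose the right index instead of assuming the table of interest is always tables[0].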

Reading HTML Data from a String

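HTML data held in a Python string can be read by wrapping the string in an io.StringIO object, which makes it behave like a file. Recent Pandas versions (2.1 and later) deprecate passing a literal HTML string directly to pandas.read_html(), so wrapping it in StringIO is the recommended approach.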

Example

The following example demonstrates how to read the HTML string using StringIO without saving to a file.

import pandas as pd
from io import StringIO

# Create an HTML string
html_str = """
<table>
   <tr><th>C1</th><th>C2</th><th>C3</th></tr>
   <tr><td>a</td><td>b</td><td>c</td></tr>
   <tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

# Read the HTML string
dfs = pd.read_html(StringIO(html_str))
print(dfs[0])

Following is the output of the above code −


  C1 C2 C3
0  a  b  c
1  x  y  z

Example

This is an alternative way of reading the HTML string without using io.StringIO. Here we save the HTML string to a file named temp.html and read it back using the pandas.read_html() function.

import pandas as pd

# Create an HTML string
html_str = """
<table>
   <tr><th>C1</th><th>C2</th><th>C3</th></tr>
   <tr><td>a</td><td>b</td><td>c</td></tr>
   <tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

# Save to a file in the current directory and read it back
with open("temp.html", "w") as f:
    f.write(html_str)

df = pd.read_html("temp.html")[0]
print(df)

Following is the output of the above code −


  C1 C2 C3
0  a  b  c
1  x  y  z
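
Note that the example above leaves temp.html behind in the current directory. One possible variation, sketched below, writes the file inside a directory created with the standard library's tempfile.TemporaryDirectory so that it is cleaned up automatically. The file name table.html is an arbitrary choice for illustration.

```python
import os
import tempfile

import pandas as pd

# Same HTML string as in the example above
html_str = """
<table>
   <tr><th>C1</th><th>C2</th><th>C3</th></tr>
   <tr><td>a</td><td>b</td><td>c</td></tr>
   <tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

# The directory and everything in it are deleted when the with-block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "table.html")  # arbitrary file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_str)

    # Read the table while the file still exists
    df = pd.read_html(path)[0]

print(df)
print("File still exists:", os.path.exists(path))
```

The DataFrame remains usable after the with-block because its data has already been loaded into memory.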

Handling Multiple Tables from an HTML file

When an HTML document contains multiple tables, the match parameter of the pandas.read_html() function restricts the result to the tables that contain text matching a given string or regular expression.

Example

The following example uses the match parameter to read only the table that contains specific text from a page with multiple tables.

import pandas as pd

# Read tables from a SQL tutorial
url = "https://www.tutorialspoint.com/sql/sql-clone-tables.htm"
tables = pd.read_html(url, match='Field')

# Access the table
df = tables[0]
print(df.head())

Following is the output of the above code −


     Field           Type  Null  Key  Default  Extra
1       ID        int(11)    NO  PRI      NaN    NaN
2     NAME    varchar(20)    NO  NaN      NaN    NaN
3      AGE        int(11)    NO  NaN      NaN    NaN
4  ADDRESS       char(25)   YES  NaN      NaN    NaN
5   SALARY  decimal(18,2)   YES  NaN      NaN    NaN
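
The effect of match can also be checked without network access. The sketch below uses a hypothetical two-table HTML string (illustrative data only) and compares the result with and without match. It also shows that pandas.read_html() raises a ValueError when no table matches, so the call may need to be guarded.

```python
from io import StringIO

import pandas as pd

# Hypothetical page: only the second table contains the text "Field"
html_page = """
<table>
  <tr><th>ID</th><th>NAME</th></tr>
  <tr><td>1</td><td>Ramesh</td></tr>
</table>
<table>
  <tr><th>Field</th><th>Type</th></tr>
  <tr><td>ID</td><td>int(11)</td></tr>
  <tr><td>NAME</td><td>varchar(20)</td></tr>
</table>
"""

# Without match, every table on the page is returned
all_tables = pd.read_html(StringIO(html_page))

# With match, only tables containing matching text are returned
matched = pd.read_html(StringIO(html_page), match="Field")

print(len(all_tables), len(matched))
print(matched[0])

# If nothing matches, read_html raises ValueError instead of returning []
try:
    pd.read_html(StringIO(html_page), match="NoSuchText")
    no_match_error = False
except ValueError:
    no_match_error = True

print("ValueError on no match:", no_match_error)
```

Even when match narrows the result to one table, the return value is still a list, so the table is accessed with an index such as matched[0].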

Writing DataFrames to HTML

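Pandas DataFrame objects can be converted to HTML tables using the DataFrame.to_html() method. This method returns the HTML as a string when the buf parameter is left at its default value of None; otherwise, it writes the HTML to the given buffer or file path and returns None.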

Example

The following example demonstrates how to write a Pandas DataFrame to an HTML Table using the DataFrame.to_html() method.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# Convert the DataFrame to HTML table
html = df.to_html()

# Display the HTML string
print(html)

Following is the output of the above code −

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>A</th>
      <th>B</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>2</td>
    </tr>
    <tr>
      <th>1</th>
      <td>3</td>
      <td>4</td>
    </tr>
  </tbody>
</table>
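
To illustrate the role of the buf parameter, here is a minimal sketch that writes the same DataFrame to an in-memory io.StringIO buffer instead of returning a string. It also uses the index=False option of DataFrame.to_html(), which is not covered above and is included here only as an optional extra for omitting the row labels.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# With buf left as None, to_html returns the HTML as a string
html_text = df.to_html()

# With buf given, the HTML is written to the buffer and None is returned
buffer = StringIO()
result = df.to_html(buf=buffer)

print(result is None)
print(buffer.getvalue() == html_text)

# index=False leaves out the row-label cells such as <th>0</th>
html_no_index = df.to_html(index=False)
print("<th>0</th>" in html_text, "<th>0</th>" in html_no_index)
```

Passing a file path as buf works the same way and is a convenient way to save the table directly to an .html file.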