Read HTML File in Python Using Pandas

Last Updated : 12 Jun, 2025

We are given an HTML file that contains one or more tables, and our task is to extract these tables as DataFrames using Python. For example, if we have an HTML file with a table like this:

<table>
<tr><th>Code</th><th>Language</th><th>Difficulty</th></tr>
<tr><td>Python</td><td>Python</td><td>Intermediate</td></tr>
</table>

Then the output should be a DataFrame:

     Code  Language    Difficulty
0  Python    Python  Intermediate

Pandas provides multiple ways to read HTML tables, including using read_html() directly or in combination with other tools like requests, BeautifulSoup, or the lxml parser. Let’s explore each of these methods with code examples.

Using read_html()

This method uses Pandas’ built-in read_html() function, which extracts every table from an HTML file and returns them as a list of DataFrames. It needs a parser library to be installed (lxml, or BeautifulSoup with html5lib) and works best on simple HTML files with clean tabular structures.

Python
import pandas as pd

def read_html_file(path):
    df = pd.read_html(path)[0]
    return df

path = 'data/geeks_for_geeks.html'
df = read_html_file(path)
print(df)

Output:

                      Name             Topic    Difficulty
0   Introduction to Python            Python      Beginner
1          Data Structures        Algorithms  Intermediate
2  Machine Learning Basics  Machine Learning      Advanced

Explanation:

  • pd.read_html(path) returns a list of tables; [0] accesses the first one.
  • This is the simplest and most direct way to read HTML tables using Pandas.
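
Because read_html() returns a list, a page with several tables needs a way to pick the right one. The sketch below is one possible approach, using the match parameter of read_html() to keep only tables that contain a given text. The HTML string and its contents are made up for illustration and are not part of the files used in this article.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two tables, made up for this example
HTML = """
<table>
  <tr><th>Name</th><th>Topic</th></tr>
  <tr><td>Introduction to Python</td><td>Python</td></tr>
</table>
<table>
  <tr><th>Author</th><th>Articles</th></tr>
  <tr><td>Jane Smith</td><td>12</td></tr>
</table>
"""

# Without extra arguments, every table in the markup is returned
all_tables = pd.read_html(StringIO(HTML))
print(len(all_tables))

# match keeps only tables containing text that matches the string or regex
author_tables = pd.read_html(StringIO(HTML), match="Author")
print(author_tables[0])
```

Indexing with [0] is fine when the page has a single table; match is usually more robust when the table order on the page may change.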

Using BeautifulSoup with read_html()

This method first parses the HTML file using BeautifulSoup to allow finer control over the content, then passes the parsed HTML to pd.read_html() for table extraction. It's helpful when you need to clean or inspect the HTML before loading the data.

Python
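from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

def read_with_bs(path):
    # Parse the file first so the markup can be inspected or cleaned
    with open(path, 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'lxml')
    # Wrap the markup in StringIO: passing a raw HTML string to
    # read_html() is deprecated since pandas 2.1
    tables = pd.read_html(StringIO(str(soup)))
    return tables[0]

path = 'data/languages.html'
df = read_with_bs(path)
print(df)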

Output:

         Code    Language    Difficulty
0        HTML    HTML/CSS      Beginner
1      Python      Python  Intermediate
2  JavaScript  JavaScript      Advanced

Explanation:

  • BeautifulSoup(f, 'lxml') loads and parses the HTML.
  • pd.read_html() then extracts the tables from the parsed markup, which str(soup) serializes back to HTML text.
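
The cleaning step mentioned above could look like the following sketch. It assumes a hypothetical table with id "languages" whose cells contain footnote markers in sup tags; both the markup and the id are invented for this example. BeautifulSoup's built-in "html.parser" is used here instead of 'lxml' only to keep the parsing step free of extra dependencies.

```python
from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup: footnote markers inside cells would pollute the data
HTML = """
<table id="languages">
  <tr><th>Code</th><th>Language</th><th>Difficulty</th></tr>
  <tr><td>HTML</td><td>HTML/CSS<sup>[1]</sup></td><td>Beginner</td></tr>
  <tr><td>Python</td><td>Python</td><td>Intermediate</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Remove every <sup> footnote marker before pandas sees the table
for sup in soup.find_all("sup"):
    sup.decompose()

# Pick one table by its id, then hand only that markup to read_html()
table = soup.find("table", id="languages")
df = pd.read_html(StringIO(str(table)))[0]
print(df)
```

Without the decompose() loop, the first Language cell would include the "[1]" marker.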

Using requests with read_html()

This method fetches an HTML page from a URL using the requests library and passes the response content to pd.read_html(). It’s useful when the table data is hosted online.

Python
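from io import StringIO

import requests
import pandas as pd

def read_from_url(url):
    res = requests.get(url, timeout=10)
    res.raise_for_status()  # fail early on 4xx/5xx responses
    # Wrap the text in StringIO: raw HTML strings are
    # deprecated input for read_html() since pandas 2.1
    tables = pd.read_html(StringIO(res.text))
    return tables[0]

url = 'https://example.com/topics.html'
df = read_from_url(url)
print(df)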

Note: The URL above is a placeholder. To run this code, host the HTML on a live server or substitute a real URL whose page contains a similar table.

Output:

           Category       Topic    Difficulty
0   Data Structures  Algorithms      Beginner
1   Web Development    HTML/CSS  Intermediate
2  Machine Learning      Python      Advanced

Explanation:

  • requests.get(url) fetches HTML content from the web.
  • pd.read_html() parses the tables from the response body (res.text).
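
Pages fetched from the web do not always contain a table, and read_html() raises a ValueError when it finds none. The sketch below shows one way to handle that case. The helper name first_table_or_none and both HTML strings are hypothetical; the strings stand in for res.text so the example runs without network access.

```python
from io import StringIO

import pandas as pd

def first_table_or_none(html_text):
    """Return the first table in html_text, or None if there is none."""
    try:
        tables = pd.read_html(StringIO(html_text))
    except ValueError:
        # read_html() raises ValueError when it finds no tables
        return None
    return tables[0]

# Hypothetical response bodies standing in for res.text
page_with_table = "<table><tr><th>Topic</th></tr><tr><td>Python</td></tr></table>"
page_without_table = "<html><body><p>No data here.</p></body></html>"

print(first_table_or_none(page_with_table))
print(first_table_or_none(page_without_table))
```

Returning None is a design choice for this sketch; re-raising with a clearer message may suit a script that cannot continue without the data.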

Using lxml Parser with read_html()

This method explicitly selects the lxml parser when reading the HTML file with read_html(). lxml is fast, which helps with large HTML documents, but it is strict about malformed markup.
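
A minimal sketch of this approach follows. The function name read_with_lxml and the file name books.html are hypothetical, and the sample rows mirror the output shown in this section. The file is written to a temporary directory only so the sketch runs as-is.

```python
import tempfile
from pathlib import Path

import pandas as pd

def read_with_lxml(path):
    # flavor="lxml" restricts pandas to the lxml parser (no fallback)
    tables = pd.read_html(path, flavor="lxml")
    return tables[0]

# Hypothetical sample file contents, mirroring this section's output
HTML = """
<table>
  <tr><th>Title</th><th>Author</th><th>Difficulty</th></tr>
  <tr><td>Python Basics</td><td>John Doe</td><td>Beginner</td></tr>
  <tr><td>Data Analysis</td><td>Jane Smith</td><td>Intermediate</td></tr>
  <tr><td>Machine Learning Algorithms</td><td>David Johnson</td><td>Advanced</td></tr>
</table>
"""

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "books.html"
    path.write_text(HTML, encoding="utf-8")
    df = read_with_lxml(str(path))

print(df)
```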


Output:

                         Title         Author    Difficulty
0                Python Basics       John Doe      Beginner
1                Data Analysis     Jane Smith  Intermediate
2  Machine Learning Algorithms  David Johnson      Advanced

Explanation:

  • flavor='lxml' tells pandas to use only the lxml parser. With the default (flavor=None), pandas already tries lxml first and falls back to BeautifulSoup with html5lib if that fails.
  • Returns the first table found in the file.
