Read HTML File in Python Using Pandas

Last Updated : 12 Jun, 2025

We are given an HTML file that contains one or more tables, and our task is to extract these tables as DataFrames using Python. For example, if we have an HTML file with a table like this:

<table>
<tr><th>Code</th><th>Language</th><th>Difficulty</th></tr>
<tr><td>Python</td><td>Python</td><td>Intermediate</td></tr>
</table>

Then the output should be a DataFrame:

     Code  Language    Difficulty
0  Python    Python  Intermediate

Pandas reads HTML tables with a single function, read_html(), which can be called directly on a file, combined with tools like requests or BeautifulSoup, or told to use a specific parser such as lxml. Let’s explore each of these methods with code examples.

Using read_html()

This method uses Pandas’ built-in read_html() function, which automatically extracts all tables from an HTML file and returns them as a list of DataFrames. It needs a parser installed (lxml, or BeautifulSoup together with html5lib) and is ideal for simple HTML files with clean tabular structures.

Python

import pandas as pd

def read_html_file(path):
    df = pd.read_html(path)[0]
    return df

path = 'data/geeks_for_geeks.html'
df = read_html_file(path)
print(df)

HTML

<!DOCTYPE html>
<html>
<head>
    <title>Table Example</title>
</head>
<body>

<table border="1">
    <tr>
        <th>Name</th>
        <th>Topic</th>
        <th>Difficulty</th>
    </tr>
    <tr>
        <td>Introduction to Python</td>
        <td>Python</td>
        <td>Beginner</td>
    </tr>
    <tr>
        <td>Data Structures</td>
        <td>Algorithms</td>
        <td>Intermediate</td>
    </tr>
    <tr>
        <td>Machine Learning Basics</td>
        <td>Machine Learning</td>
        <td>Advanced</td>
    </tr>
</table>

</body>
</html>

Output:

                      Name             Topic    Difficulty
0   Introduction to Python            Python      Beginner
1          Data Structures        Algorithms  Intermediate
2  Machine Learning Basics  Machine Learning      Advanced

Explanation:

pd.read_html(path) returns a list of DataFrames, one per table; [0] selects the first one. If the file contains no tables, it raises a ValueError.
This is the simplest and most direct way to read HTML tables using Pandas.
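
Because read_html() always returns a list, a file with several tables needs some way to pick the right one. The sketch below is a minimal illustration, assuming a hypothetical two-table document inlined as a string (wrapped in StringIO so no file is needed). It uses the match parameter of read_html() to keep only tables whose text matches a string or regex.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table document, inlined so the sketch runs without a file.
html = """
<table>
  <tr><th>Name</th><th>Topic</th></tr>
  <tr><td>Data Structures</td><td>Algorithms</td></tr>
</table>
<table>
  <tr><th>Title</th><th>Author</th></tr>
  <tr><td>Python Basics</td><td>John Doe</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> element it finds.
tables = pd.read_html(StringIO(html))
print(len(tables))

# match keeps only the tables whose text matches the given string or regex.
books = pd.read_html(StringIO(html), match="Author")
print(books[0])
```

Selecting by content with match tends to be more robust than a hard-coded index such as [0], since the position of a table can change when the page is edited.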

Using BeautifulSoup with read_html()

This method first parses the HTML file using BeautifulSoup to allow finer control over the content, then passes the parsed HTML to pd.read_html() for table extraction. It's helpful when you need to clean or inspect the HTML before loading the data.

Python

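from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

def read_with_bs(path):
    with open(path, 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'lxml')
    # Wrap the markup in StringIO: passing a raw HTML string
    # to read_html() is deprecated since pandas 2.1.
    tables = pd.read_html(StringIO(str(soup)))
    return tables[0]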

path = 'data/languages.html'
df = read_with_bs(path)
print(df)

HTML

<!DOCTYPE html>
<html>
<head>
    <title>Programming Languages</title>
</head>
<body>

<table border="1">
    <tr>
        <th>Code</th>
        <th>Language</th>
        <th>Difficulty</th>
    </tr>
    <tr>
        <td>HTML</td>
        <td>HTML/CSS</td>
        <td>Beginner</td>
    </tr>
    <tr>
        <td>Python</td>
        <td>Python</td>
        <td>Intermediate</td>
    </tr>
    <tr>
        <td>JavaScript</td>
        <td>JavaScript</td>
        <td>Advanced</td>
    </tr>
</table>

</body>
</html>

Output:

         Code    Language    Difficulty
0        HTML    HTML/CSS      Beginner
1      Python      Python  Intermediate
2  JavaScript  JavaScript      Advanced

Explanation:

BeautifulSoup(f, 'lxml') loads the file and parses it into a tree you can inspect or modify.
str(soup) serializes that tree back to markup, which pd.read_html() then scans for tables.
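
The benefit of parsing with BeautifulSoup first shows up when the markup needs cleaning. As a rough sketch, assuming a hypothetical page with a layout table and a data table whose cells carry footnote markers, the code below selects one table by its id, removes the sup elements, and only then hands the markup to read_html(). The ids and the footnote markup are made up for illustration.

```python
from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical page: a layout table plus the data table we actually want.
html = """
<table id="nav"><tr><td>Home</td><td>About</td></tr></table>
<table id="languages">
  <tr><th>Code</th><th>Language</th><th>Difficulty</th></tr>
  <tr><td>Python</td><td>Python<sup>[1]</sup></td><td>Intermediate</td></tr>
</table>
"""

soup = BeautifulSoup(html, "lxml")

# Pick the data table by its id instead of relying on its position.
table = soup.find("table", id="languages")

# Drop footnote markers so they do not end up inside the cell values.
for sup in table.find_all("sup"):
    sup.decompose()

# Serialize only the cleaned table and let pandas parse it.
df = pd.read_html(StringIO(str(table)))[0]
print(df)
```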

Using requests with read_html()

This method fetches an HTML page from a URL using the requests library and passes the response content to pd.read_html(). It’s useful when the table data is hosted online.

Python

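from io import StringIO

import requests
import pandas as pd

def read_from_url(url):
    res = requests.get(url, timeout=10)
    res.raise_for_status()  # fail early on 4xx/5xx responses
    # Wrap the markup in StringIO: passing a raw HTML string
    # to read_html() is deprecated since pandas 2.1.
    tables = pd.read_html(StringIO(res.text))
    return tables[0]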

url = 'https://example.com/topics.html'
df = read_from_url(url)
print(df)

HTML

<!DOCTYPE html>
<html>
<head>
    <title>Topics in Different Categories</title>
</head>
<body>

<table border="1">
    <tr>
        <th>Category</th>
        <th>Topic</th>
        <th>Difficulty</th>
    </tr>
    <tr>
        <td>Data Structures</td>
        <td>Algorithms</td>
        <td>Beginner</td>
    </tr>
    <tr>
        <td>Web Development</td>
        <td>HTML/CSS</td>
        <td>Intermediate</td>
    </tr>
    <tr>
        <td>Machine Learning</td>
        <td>Python</td>
        <td>Advanced</td>
    </tr>
</table>

</body>
</html>

Note: You'll need to host this HTML content on a live server or use an existing URL with a similar table for the code to work.

Output:

           Category       Topic    Difficulty
0   Data Structures  Algorithms      Beginner
1   Web Development    HTML/CSS  Intermediate
2  Machine Learning      Python      Advanced

Explanation:

requests.get(url) fetches the page over HTTP.
res.text holds the decoded markup of the response, which pd.read_html() then scans for tables.
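
Network code is easier to trust when it can be exercised without a live server. The sketch below is one possible arrangement, not the only one: a hypothetical helper read_tables_from_url() accepts an optional session object, so a small stand-in (FakeSession and FakeResponse, both invented for this example) can replace the real HTTP call. The helper sets a timeout, since requests does not apply one by default, and calls raise_for_status() so an error page is not parsed as if it were data.

```python
from io import StringIO

import pandas as pd
import requests


def read_tables_from_url(url, session=None, timeout=10):
    """Fetch a page and return every table in it as a list of DataFrames."""
    http = session or requests.Session()
    res = http.get(url, timeout=timeout)  # requests has no default timeout
    res.raise_for_status()                # raise on 4xx/5xx instead of parsing an error page
    return pd.read_html(StringIO(res.text))


# Stand-ins for a real session and response, so the sketch runs offline.
class FakeResponse:
    text = """
    <table>
      <tr><th>Category</th><th>Topic</th><th>Difficulty</th></tr>
      <tr><td>Data Structures</td><td>Algorithms</td><td>Beginner</td></tr>
    </table>
    """

    def raise_for_status(self):
        pass  # pretend the request succeeded


class FakeSession:
    def get(self, url, timeout=None):
        return FakeResponse()


tables = read_tables_from_url("https://example.com/topics.html", session=FakeSession())
print(tables[0])
```

With a real page, the same helper is called without the session argument and creates its own requests.Session.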

Using lxml Parser with read_html()

This method explicitly selects the lxml parser through the flavor argument of read_html(). lxml is fast, which helps with large HTML documents, but it is strict about malformed markup.

Python

import pandas as pd

def read_html_lxml(path):
    tables = pd.read_html(path, flavor='lxml')
    return tables[0]

html_path = 'books.html'
df = read_html_lxml(html_path)
print(df)

HTML

<!DOCTYPE html>
<html>
<head>
    <title>Book Information</title>
</head>
<body>

<table border="1">
    <tr>
        <th>Title</th>
        <th>Author</th>
        <th>Difficulty</th>
    </tr>
    <tr>
        <td>Python Basics</td>
        <td>John Doe</td>
        <td>Beginner</td>
    </tr>
    <tr>
        <td>Data Analysis</td>
        <td>Jane Smith</td>
        <td>Intermediate</td>
    </tr>
    <tr>
        <td>Machine Learning Algorithms</td>
        <td>David Johnson</td>
        <td>Advanced</td>
    </tr>
</table>

</body>
</html>

Output:

                         Title         Author    Difficulty
0                Python Basics       John Doe      Beginner
1                Data Analysis     Jane Smith  Intermediate
2  Machine Learning Algorithms  David Johnson      Advanced

Explanation:

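flavor='lxml' tells pandas to use only the lxml parser, which is fast but strict about malformed markup. With the default flavor=None, pandas tries lxml first and falls back to BeautifulSoup with html5lib.
tables[0] returns the first table found in the file.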
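
The flavor argument also accepts a list of parsers to try in order. The sketch below, assuming a small well-formed table inlined as a string, reads the same markup once with lxml only and once with an explicit lxml-then-bs4 order, then compares the results. On clean markup lxml succeeds on the first attempt, so the fallback parser is not expected to be used here.

```python
from io import StringIO

import pandas as pd

# Small well-formed table, inlined so the sketch runs without a file.
html = """
<table>
  <tr><th>Title</th><th>Author</th><th>Difficulty</th></tr>
  <tr><td>Python Basics</td><td>John Doe</td><td>Beginner</td></tr>
  <tr><td>Data Analysis</td><td>Jane Smith</td><td>Intermediate</td></tr>
</table>
"""

# Use lxml only: a parse failure is reported instead of trying another parser.
lxml_only = pd.read_html(StringIO(html), flavor="lxml")[0]

# Try lxml first, then fall back to BeautifulSoup if lxml cannot parse the input.
with_fallback = pd.read_html(StringIO(html), flavor=["lxml", "bs4"])[0]

print(lxml_only.equals(with_fallback))
```

Pinning flavor='lxml' is a reasonable choice when the markup is known to be valid and speed matters; keeping a fallback is the safer choice for pages of unknown quality.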

saif_ali
