BUG: Fix pd.read_html handling of rowspan in table header #60464

Merged
4 commits merged on Dec 3, 2024

@snitish commented Dec 2, 2024

From the original thread:

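import io

import pandas as pd

# The header cell "A" has rowspan="2", so it extends past the header row into the first body row.
s = '<table><tr><th rowspan="2">A</th><th>B</th></tr><tr><td>1</td></tr><tr><td>C</td><td>2</td></tr></table>'
buf = io.StringIO(s)
print(pd.read_html(buf)[0])
# Actual output (before this fix):
#    A                  B
#    A Unnamed: 1_level_1
# 0  1                NaN

# Expected:
#    A  B
# 0  A  1
# 1  C  2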

The bug is caused by a header cell with rowspan > 1: the cell spans past the end of the header and overflows into the body rows, and the current logic does not handle this case. I fix it by carrying the partially filled rows over from the header into the body (and similarly from the body into the footer, if there is one).
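
For illustration only, the overflow idea can be sketched outside pandas roughly as follows. This is a minimal sketch and not the code in this PR: `expand_section`, the `(text, rowspan)` cell tuples and the `pending` mapping are hypothetical names, and colspan is ignored. Each section returns the cells that still span further rows, and the next section starts from them.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical simplified model: a cell is (text, rowspan); colspan is ignored.
Cell = Tuple[str, int]
# Column index -> (text, number of rows the cell still has to fill).
Pending = Dict[int, Cell]


def expand_section(
    rows: List[List[Cell]], pending: Optional[Pending] = None
) -> Tuple[List[List[str]], Pending]:
    """Expand rowspans in one table section (header, body or footer).

    Returns the expanded rows plus the cells that still span past the end
    of the section, so the caller can pass them on to the next section.
    """
    pending = dict(pending or {})
    expanded = []
    for row in rows:
        texts: List[str] = []
        carried: Pending = {}
        queue = list(row)
        while queue or pending:
            col = len(texts)
            if col in pending:
                # A cell from an earlier row (or section) occupies this column.
                text, span = pending.pop(col)
            elif queue:
                text, span = queue.pop(0)
            else:
                # Row ran out of cells: place leftover spanning cells in column order.
                text, span = pending.pop(min(pending))
            if span > 1:
                carried[col] = (text, span - 1)
            texts.append(text)
        expanded.append(texts)
        pending = carried
    return expanded, pending


# The table from the example above: "A" spans the header row and the first body row.
header = [[("A", 2), ("B", 1)]]
body = [[("1", 1)], [("C", 1), ("2", 1)]]

header_rows, overflow = expand_section(header)
body_rows, leftover = expand_section(body, overflow)
print(header_rows)  # [['A', 'B']]
print(body_rows)    # [['A', '1'], ['C', '2']]
```

Passing `overflow` from the header call into the body call is what puts "A" in front of "1" in the first body row; a footer section, if present, would be expanded the same way starting from `leftover`.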

@mroeschke left a comment

Looks good. Could you add a whatsnew note in v3.0.0.rst under the I/O section?

@mroeschke added the IO HTML (read_html, to_html, Styler.apply, Styler.applymap) label Dec 2, 2024
@snitish requested a review from mroeschke December 2, 2024 18:47
@rhshadrach added the Bug label Dec 2, 2024
@mroeschke added this to the 3.0 milestone Dec 3, 2024
@mroeschke merged commit d9dfaa9 into pandas-dev:main Dec 3, 2024
51 of 55 checks passed
@mroeschke

Thanks @snitish

@snitish deleted the 60210 branch February 6, 2025 19:46
KevsterAmp pushed a commit to KevsterAmp/pandas that referenced this pull request Mar 12, 2025
…#60464)

* BUG: Fix pd.read_html handling of rowspan in table header

* BUG: Fix docstring error in _expand_colspan_rowspan

* BUG: Update return type for _expand_colspan_rowspan

* BUG: Address review and add note to whatsnew
Labels
Bug, IO HTML (read_html, to_html, Styler.apply, Styler.applymap)
Development

Successfully merging this pull request may close these issues.

BUG: rowspan in read_html failed