Extracting Data From HTML Table

The user is trying to extract data from an HTML table in a shell script. Several solutions are provided, including using Python with the BeautifulSoup library to parse the HTML and extract the table data into variables. Another suggestion is to use Perl regular expressions to match the <th> and <td> tags and output the data in key-value pairs. Pandas is also recommended for reading the HTML table directly into a DataFrame.

Uploaded by

Jayadevan

Extracting data from HTML table

stackoverflow.com/questions/11790535/extracting-data-from-html-table

I am looking for a way to get certain info from HTML in a Linux shell environment.

This is the bit that I'm interested in:

<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">

<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>

And I want to store these in shell variables, or echo them as key-value pairs extracted from
the above HTML. Example:

Tests : 103
Failures : 24
Success Rate : 76.70 %
and so on..

What I can do at the moment is create a Java program that uses a SAX parser or an HTML
parser such as jsoup to extract this info.

But using Java here seems like overhead, since the runnable jar has to be included inside the
"wrapper" script you want to execute.

I'm sure that there must be "shell" languages out there that can do the same, i.e. Perl,
Python, bash etc.

My problem is that I have zero experience with these. Can somebody help me resolve this
"fairly easy" issue?

Quick update:

I forgot to mention that I've got more tables and more rows in the .html document sorry
about that (early morning).

Update #2:

Tried to install Bsoup like this since I don't have root access :

$ wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
$ tar -zxvf beautifulsoup4-4.1.0.tar.gz
$ cp -r beautifulsoup4-4.1.0/bs4 .
$ vi htmlParse.py # paste the code from Tichodromas' answer; just in case, this (http://pastebin.com/4Je11Y9q) is what I pasted
$ python htmlParse.py # run the file

error:

$ python htmlParse.py
Traceback (most recent call last):
  File "htmlParse.py", line 1, in ?
    from bs4 import BeautifulSoup
  File "/home/gdd/setup/py/bs4/__init__.py", line 29
    from .builder import builder_registry
         ^
SyntaxError: invalid syntax
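
The `SyntaxError` above points at `from .builder import ...`, which is an explicit relative import. That syntax was introduced by PEP 328 and first shipped in Python 2.5, and the permission error quoted in the comments mentions `/usr/lib/python2.4`, so an interpreter that is too old is a plausible cause. A minimal sketch of a guard that could run before the import; the helper name `supports_relative_imports` is hypothetical:

```python
import sys

# Explicit relative imports ("from .builder import ...") come from PEP 328
# and are only understood by Python 2.5 and later.
MIN_VERSION = (2, 5)


def supports_relative_imports(version_info=sys.version_info):
    """Return True if this interpreter can parse `from .module import name`."""
    return tuple(version_info[:2]) >= MIN_VERSION


if not supports_relative_imports():
    sys.stderr.write(
        "Python %d.%d cannot parse explicit relative imports\n"
        % tuple(sys.version_info[:2])
    )
```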

Update #3 :

Running Tichodromas' answer, I get this error:

Traceback (most recent call last):
  File "test.py", line 27, in ?
    headings = [th.get_text() for th in table.find("tr").find_all("th")]
TypeError: 'NoneType' object is not callable

Any ideas?

python linux perl bash

edited Aug 3 '12 at 9:02

asked Aug 3 '12 at 6:38

Gandalf StormCrow
24.5k, 66 gold badges, 159 silver badges, 248 bronze badges

There is a nice library for python that might help: BeautifulSoup ->
crummy.com/software/BeautifulSoup/bs4/doc . – Jakob S. Aug 3 '12 at 6:53

@Jakob S. thank you for the comment. As I told you, I'm a newbie, so I downloaded the
tarball and tried to install it with python setup.py install, but I get this permission error:
error: could not create '/usr/lib/python2.4/site-packages/bs4': Permission denied.
How do I specify which directory to install it in? Is there
something similar to -prefix when installing other commands? – Gandalf
StormCrow Aug 3 '12 at 7:06

I have to admit I am not sure how to achieve this if you don't have root access - and
I don't have Linux here at the moment to try. In principle it should be possible to
simply copy the package to the correct directory relative to your source .py file, so
that it can be found by the interpreter. – Jakob S. Aug 3 '12 at 7:14

See the doc: "If all else fails, the license for Beautiful Soup allows you to package
the entire library with your application. You can download the tarball, copy its bs4
directory into your application’s codebase, and use Beautiful Soup without installing
it at all." ( crummy.com/software/BeautifulSoup/bs4/doc/… ) – Jakob S. Aug 3 '12 at
7:16
1

You could/should install bs4 in a separate virtualenv. You'll have pseudo root
privileges in it. – Balthazar Rouberol Aug 3 '12 at 7:29

7 Answers
49

A Python solution using BeautifulSoup4 (Edit: with proper skipping. Edit3: Using
class="details" to select the table ):

from bs4 import BeautifulSoup

html = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>"""

soup = BeautifulSoup(html)
table = soup.find("table", attrs={"class":"details"})

# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]

datasets = []
for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(dataset)

print datasets

The result looks like this:

[[(u'Tests', u'103'),
(u'Failures', u'24'),
(u'Success Rate', u'76.70%'),
(u'Average Time', u'71 ms'),
(u'Min Time', u'0 ms'),
(u'Max Time', u'829 ms')]]

Edit2: To produce the desired output, use something like this:

for dataset in datasets:
    for field in dataset:
        print "{0:<16}: {1}".format(field[0], field[1])

Result:

Tests : 103
Failures : 24
Success Rate : 76.70%
Average Time : 71 ms
Min Time : 0 ms
Max Time : 829 ms
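
The question also asks about storing the values in shell variables. One way to bridge the gap, sketched here under the assumption that the extracted data is available as (heading, value) pairs like the ones printed above, is to have Python emit shell assignments. The helper name `to_shell_assignments` is hypothetical, and `shlex.quote` requires Python 3.3 or later:

```python
import re
import shlex


def to_shell_assignments(pairs):
    """Turn (heading, value) pairs into lines such as Average_Time='71 ms'."""
    lines = []
    for heading, value in pairs:
        # Shell variable names may only contain letters, digits and
        # underscores, so every other run of characters becomes "_".
        # (A heading starting with a digit would still need a prefix.)
        name = re.sub(r"\W+", "_", heading.strip())
        # shlex.quote adds quoting only when the value needs it.
        lines.append("{0}={1}".format(name, shlex.quote(value)))
    return lines


pairs = [("Tests", "103"), ("Failures", "24"), ("Average Time", "71 ms")]
print("\n".join(to_shell_assignments(pairs)))
```

If this were saved as a script (say `extract.py`, a made-up name), a wrapper script could load the variables with `eval "$(python3 extract.py)"` and then use `$Tests` or `$Average_Time`.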

edited May 27 '14 at 19:34

kmonsoor
5,873, 6 gold badges, 38 silver badges, 51 bronze badges

answered Aug 3 '12 at 7:15

user647772

Thank you for your answer. In reply to your comment above: can I use the class as the
identifier? I don't have an ID; the class would be details. – Gandalf StormCrow Aug 3
'12 at 7:41

@GandalfStormCrow Yes, this can be done. I've edited my answer. – user647772
Aug 3 '12 at 7:46

Is it certain that this answer actually works in Python 2.4? @Gandalf, you said in a
comment that you installed "the older version of bsoup" (BeautifulSoup 3, I
presume). And the line saying "I'm using Python 2.4.3" is gone. So this is a bit
confusing. – mzjn Aug 3 '12 at 11:18

Python 2.4.3 was released on 29-MAR-2006! I think an update would be advisable.
– user647772 Aug 3 '12 at 14:00
2

I've got: print(datasets) [<zip object at 0x0000000013406708>, <zip object at
0x0000000013406448>] while headings are ok. – Peter.k Nov 15 '17 at 21:46
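
The `<zip object ...>` output in the comment above is what Python 3 prints, because `zip()` there returns a lazy iterator rather than a list. A small sketch of the usual remedy, using plain lists as stand-ins for the parsed headings and cells:

```python
# Stand-in data; in the answer these come from the parsed table.
headings = ["Tests", "Failures", "Success Rate"]
rows = [["103", "24", "76.70%"]]

datasets = []
for cells in rows:
    # zip() is lazy in Python 3; list() (or dict()) turns it into real data.
    datasets.append(list(zip(headings, cells)))

print(datasets)
```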


Use pandas.read_html:

import pandas as pd
html_tables = pd.read_html('resources/test.html')
df = html_tables[0]
df.T # transpose to align
0
Tests 103
Failures 24
Success Rate 76.70%
Average Time 71 ms

answered Oct 3 '19 at 18:02

Jordan Valansi
91, 2 silver badges, 4 bronze badges

4

Here is the top answer, adapted for Python3 compatibility, and improved by stripping
whitespace in cells:

from bs4 import BeautifulSoup

# s is assumed to hold the HTML document as a string (it is not defined in this snippet).
soup = BeautifulSoup(s, 'html.parser')

table = soup.find("table")

# The first tr contains the field names.

headings = [th.get_text().strip() for th in table.find("tr").find_all("th")]

print(headings)

datasets = []
for row in table.find_all("tr")[1:]:
    dataset = dict(zip(headings, (td.get_text() for td in row.find_all("td"))))
    datasets.append(dataset)

print(datasets)

answered May 31 '17 at 4:07

Michel Müller
4,703, 3 gold badges, 22 silver badges, 47 bronze badges

Assuming your html code is stored in a mycode.html file, here is a bash way:

paste -d: <(grep '<th>' mycode.html | sed -e 's,</*th>,,g') <(grep '<td>' mycode.html | sed -e 's,</*td>,,g')

note: the output is not perfectly aligned

edited Aug 3 '12 at 17:44

answered Aug 3 '12 at 7:53

Stephane Rouberol
4,092, 16 silver badges, 18 bronze badges

Thanks for the answer, I need to get particular table, have more than one table –
Gandalf StormCrow Aug 3 '12 at 7:59
4

I heard that parsing HTML or XML with regexes is broken by definition. – ychaouche
Jan 12 '14 at 14:36
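
For a file with several tables, a real parser avoids the pitfalls raised in the two comments above. The following is a minimal sketch using only the standard-library `html.parser` module (Python 3). The class name `TableGrabber` and the sample HTML are made up for illustration, and nested tables are not handled:

```python
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collect the cell text of every <table> whose class matches."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.tables = []        # each table is a list of rows (lists of str)
        self._in_table = False  # currently inside a matching table?
        self._cell = None       # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            if self.wanted_class in classes:
                self._in_table = True
                self.tables.append([])
        elif self._in_table and tag == "tr":
            self.tables[-1].append([])
        elif self._in_table and tag in ("th", "td"):
            if not self.tables[-1]:
                self.tables[-1].append([])  # cell without an enclosing <tr>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            self._in_table = False
        elif tag in ("th", "td") and self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None


html = """
<table class="summary"><tr><td>ignored</td></tr></table>
<table class="details">
<tr><th>Tests</th><th>Failures</th></tr>
<tr><td>103</td><td>24</td></tr>
</table>"""

parser = TableGrabber("details")
parser.feed(html)
header, *data_rows = parser.tables[0]
for row in data_rows:
    for key, value in zip(header, row):
        print("{0:<16}: {1}".format(key, value))
```

Matching on the class attribute is what lets the sketch skip the unrelated `summary` table; matching on an `id` would work the same way.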

1

undef $/;
$text = <DATA>;

@tabs = $text =~ m!<table.*?>(.*?)</table>!gms;

for (@tabs) {
    @th = m!<th>(.*?)</th>!gms;
    @td = m!<td>(.*?)</td>!gms;
}
for $i (0..$#th) {
    printf "%-16s\t: %s\n", $th[$i], $td[$i];
}

__DATA__
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>

output as follows:

Tests : 103
Failures : 24
Success Rate : 76.70%
Average Time : 71 ms
Min Time : 0 ms
Max Time : 829 ms

edited Aug 3 '12 at 7:23

answered Aug 3 '12 at 6:56

cdtits
1,128, 6 silver badges, 7 bronze badges

@cdtits thank you for your response, will this work if my file contains several tables?
– Gandalf StormCrow Aug 3 '12 at 7:06
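
On the several-tables question: as posted, the Perl loop overwrites `@th` and `@td` for each table, so only the last table's fields reach the `printf`. A hedged Python sketch of the same regex idea that keeps one result per table follows. The function name `tables_as_pairs` and the sample text are hypothetical, and only the first data row of each table is paired with the headings because `zip` stops at the shorter sequence:

```python
import re

TABLE_RE = re.compile(r"<table.*?>(.*?)</table>", re.DOTALL | re.IGNORECASE)
TH_RE = re.compile(r"<th>(.*?)</th>", re.DOTALL | re.IGNORECASE)
TD_RE = re.compile(r"<td>(.*?)</td>", re.DOTALL | re.IGNORECASE)


def tables_as_pairs(text):
    """Return one list of (heading, value) pairs per <table> in `text`."""
    results = []
    for body in TABLE_RE.findall(text):
        headings = TH_RE.findall(body)
        values = TD_RE.findall(body)
        results.append(list(zip(headings, values)))
    return results


text = (
    '<table class="a"><tr><th>Tests</th></tr><tr><td>103</td></tr></table>\n'
    '<table class="b"><tr><th>Failures</th></tr><tr><td>24</td></tr></table>'
)
for pairs in tables_as_pairs(text):
    for key, value in pairs:
        print("%-16s: %s" % (key, value))
```

Like the Perl original, this only matches bare `<th>` and `<td>` tags without attributes, so it remains a quick hack rather than a robust parser.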

If you're going to use perl, I'd recommend HTML::TableExtract...IMO it even beats
the python ugly soup solutions. – runrig Apr 24 '19 at 21:53

1

A Python solution that uses only the standard library (takes advantage of the fact that the
HTML happens to be well-formed XML). More than one row of data can be handled.

(Tested with Python 2.6 and 2.7. The question was updated saying that the OP uses
Python 2.4, so this answer may not be very useful in this case. ElementTree was added
in Python 2.5)

from xml.etree.ElementTree import fromstring

HTML = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
<tr valign="top" class="whatever">
<td>A</td>
<td>B</td>
<td>C</td>
<td>D</td>
<td>E</td>
<td>F</td>
</tr>
</table>"""

tree = fromstring(HTML)
rows = tree.findall("tr")
headrow = rows[0]
datarows = rows[1:]

for num, h in enumerate(headrow):
    data = ", ".join([row[num].text for row in datarows])
    print "{0:<16}: {1}".format(h.text, data)

Output:

Tests : 103, A
Failures : 24, B
Success Rate : 76.70%, C
Average Time : 71 ms, D
Min Time : 0 ms, E
Max Time : 829 ms, F

edited Aug 3 '12 at 9:19

answered Aug 3 '12 at 7:39

mzjn
41.9k, 9 gold badges, 98 silver badges, 207 bronze badges

thank you for your answer. Instead of reading from a particular html string, can I
specify like this : get me a table with class="details" from this html file and do
what you've just done? – Gandalf StormCrow Aug 3 '12 at 7:42

Now it works with more than one data row. I have tested this with Python 2.6 and
2.7, but now I see that you use 2.4.3 (which I don't have). So it may not help you.
Anyway, I wanted to show that it is possible to do this kind of thing without extra
libraries. – mzjn Aug 3 '12 at 8:56
1

The string formatting syntax that I (and @Tichodroma) use will not work in 2.4. –
mzjn Aug 3 '12 at 9:02

get me a table with class="details" from this html file. Yes, that can be done using
ElementTree (but not with Python 2.4). ElementTree was added in Python 2.5. –
mzjn Aug 3 '12 at 9:09
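
Selecting the table by class with ElementTree, as the comment above suggests, could look roughly like this. It is a sketch that assumes the whole document is well-formed XML and that the interpreter is Python 2.7 or later (where ElementTree's limited XPath understands attribute predicates). The sample document is made up:

```python
import xml.etree.ElementTree as ET

DOC = """<html><body>
<table class="summary"><tr><td>ignored</td></tr></table>
<table class="details">
<tr><th>Tests</th><th>Failures</th></tr>
<tr><td>103</td><td>24</td></tr>
</table>
</body></html>"""

root = ET.fromstring(DOC)
# The predicate is an exact string match on the attribute value, so
# class="details extra" would not be found by this expression.
table = root.find(".//table[@class='details']")
rows = table.findall("tr")
headings = [th.text for th in rows[0]]
for row in rows[1:]:
    values = [td.text for td in row]
    print(dict(zip(headings, values)))
```

To read from a file instead of a string, `ET.parse(path).getroot()` can replace `ET.fromstring(DOC)`, again only if the file parses as XML.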

1

Below is a python regex based solution that I have tested on python 2.7. It doesn't rely on
xml module--so will work in case xml is not fully well formed.

import re
# input args: html string
# output: tables as a list, column max length
def extract_html_tables(html):
    tables=[]
    maxlen=0
    rex1=r'<table.*?/table>'
    rex2=r'<tr.*?/tr>'
    rex3=r'<(td|th).*?/(td|th)>'
    s = re.search(rex1,html,re.DOTALL)
    while s:
        t = s.group() # the table
        s2 = re.search(rex2,t,re.DOTALL)
        table = []
        while s2:
            r = s2.group() # the row
            s3 = re.search(rex3,r,re.DOTALL)
            row=[]
            while s3:
                d = s3.group() # the cell
                #row.append(strip_tags(d).strip() )
                row.append(d.strip() )

                r = re.sub(rex3,'',r,1,re.DOTALL)
                s3 = re.search(rex3,r,re.DOTALL)

            table.append( row )
            if maxlen<len(row):
                maxlen = len(row)

            t = re.sub(rex2,'',t,1,re.DOTALL)
            s2 = re.search(rex2,t,re.DOTALL)

        html = re.sub(rex1,'',html,1,re.DOTALL)
        tables.append(table)
        s = re.search(rex1,html,re.DOTALL)
    return tables, maxlen
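
As written, each cell in the result still carries its markup (for example `<td>103</td>`), because the `strip_tags` call is commented out and the helper is not defined in the answer. One hypothetical way to write such a helper, as a sketch rather than the author's implementation:

```python
import re

# Anything between "<" and the next ">" is treated as a tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(fragment):
    """Remove anything that looks like an HTML tag from `fragment`."""
    return TAG_RE.sub("", fragment)


print(strip_tags("<td>103</td>").strip())
```

This does not decode entities such as `&amp;`; the standard library's `html.unescape` could be applied afterwards if that matters.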

answered Oct 5 '17 at 3:35

paolov
1,631, 1 gold badge, 24 silver badges, 30 bronze badges
