Extracting Data From HTML Table
stackoverflow.com/questions/11790535/extracting-data-from-html-table
32
I am looking for a way to extract certain information from HTML in a Linux shell environment.
I want to store the values in shell variables, or echo them as key-value pairs extracted from
the HTML table. Example:
Tests : 103
Failures : 24
Success Rate : 76.70 %
and so on..
What I can do at the moment is write a Java program that uses a SAX parser or an HTML
parser such as jsoup to extract this info.
But using Java here seems like overhead, since the runnable jar has to be shipped with the
"wrapper" script you want to execute.
I'm sure there must be "shell" languages out there that can do the same, i.e. Perl,
Python, bash etc.
My problem is that I have zero experience with these. Can somebody help me resolve this
"fairly easy" issue?
Quick update:
I forgot to mention that I've got more tables and more rows in the .html document. Sorry
about that (early morning).
Update #2:
I tried to install BeautifulSoup like this, since I don't have root access:
$ wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
$ tar -zxvf beautifulsoup4-4.1.0.tar.gz
$ cp -r beautifulsoup4-4.1.0/bs4 .
$ vi htmlParse.py # paste the code from Tichodroma's answer; just in case, this (http://pastebin.com/4Je11Y9q) is what I pasted
$ run file (python htmlParse.py)
error:
$ python htmlParse.py
Traceback (most recent call last):
  File "htmlParse.py", line 1, in ?
    from bs4 import BeautifulSoup
  File "/home/gdd/setup/py/bs4/__init__.py", line 29
    from .builder import builder_registry
         ^
SyntaxError: invalid syntax
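The failing line is an explicit relative import (`from .builder import ...`). That syntax was introduced in Python 2.5, so an older interpreter such as the Python 2.4 mentioned in this thread cannot parse the package, wherever it is installed. A minimal sketch of a guard that a wrapper script could run before attempting the import; the function name `supports_relative_imports` is hypothetical:

```python
import sys

def supports_relative_imports(version_info=None):
    """Return True if the interpreter can parse `from .module import name`.

    Explicit relative imports were added in Python 2.5, so older
    interpreters stop with a SyntaxError on such a line.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= (2, 5)

# Report on the interpreter that is running this sketch.
print("relative imports supported:", supports_relative_imports())
```

If the check fails, choosing a different parser or a newer interpreter is the only way forward; copying the package elsewhere does not change what the interpreter can parse.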
Update #3 :
any ideas?
Gandalf StormCrow
There is a nice library for python that might help: BeautifulSoup ->
crummy.com/software/BeautifulSoup/bs4/doc . – Jakob S. Aug 3 '12 at 6:53
@Jakob S. thank you for the comment. As I said, I'm a newbie, so I downloaded the
tarball and tried to install it with python setup.py install, but I get this permission error:
error: could not create '/usr/lib/python2.4/site-packages/bs4':
Permission denied. How do I specify which directory to install it in? Is there
something similar to -prefix when installing other commands? – Gandalf
StormCrow Aug 3 '12 at 7:06
I have to admit I am not sure how to achieve this if you don't have root access, and
I don't have Linux here at the moment to try. In principle it should be possible to
simply copy the package to the correct directory relative to your source .py file, so
that it can be found by the interpreter. – Jakob S. Aug 3 '12 at 7:14
See the doc: "If all else fails, the license for Beautiful Soup allows you to package
the entire library with your application. You can download the tarball, copy its bs4
directory into your application’s codebase, and use Beautiful Soup without installing
it at all." ( crummy.com/software/BeautifulSoup/bs4/doc/… ) – Jakob S. Aug 3 '12 at
7:16
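A minimal sketch of the copy-the-package approach from these comments: put the directory that holds the copied `bs4` folder on `sys.path` before importing. The helper name `add_vendor_dir` and the `vendor` folder name are assumptions made for illustration only.

```python
import os
import sys

def add_vendor_dir(path):
    """Put `path` at the front of sys.path so packages copied there are found first."""
    path = os.path.abspath(path)
    if path not in sys.path:
        sys.path.insert(0, path)
    return path

# Hypothetical layout: ./vendor/bs4/ holds the copied package.
vendor = add_vendor_dir(os.path.join(os.getcwd(), "vendor"))
# From here on, `from bs4 import BeautifulSoup` would be looked up in ./vendor
# first (provided the interpreter is new enough to parse the package).
print(sys.path[0])
```

When `bs4` sits right next to the script, as with the `cp -r` step in the question, no path change is needed, because the script's own directory is already searched.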
1
You could/should install bs4 in a separate virtualenv. You'll have pseudo root
privileges in it. – Balthazar Rouberol Aug 3 '12 at 7:29
7 Answers
49
A Python solution using BeautifulSoup4 (Edit: with proper skipping. Edit3: Using
class="details" to select the table ):
from bs4 import BeautifulSoup
html = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>"""
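soup = BeautifulSoup(html)
table = soup.find("table", attrs={"class":"details"})

# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]

# Pair the cells of each data row with the headings.
datasets = []
for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(dataset)

print datasets
The result looks like this: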
[[(u'Tests', u'103'),
(u'Failures', u'24'),
(u'Success Rate', u'76.70%'),
(u'Average Time', u'71 ms'),
(u'Min Time', u'0 ms'),
(u'Max Time', u'829 ms')]]
Printing each heading and value as a "key : value" line gives this result:
Tests : 103
Failures : 24
Success Rate : 76.70%
Average Time : 71 ms
Min Time : 0 ms
Max Time : 829 ms
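A sketch of one way to turn a list of (heading, value) tuples into `key : value` lines like the ones in the result, assuming Python 3. `format_pairs` and its `width` parameter are hypothetical names, not part of the answer.

```python
def format_pairs(pairs, width=16):
    """Render (heading, value) pairs as left-aligned 'key : value' lines."""
    return ["{0:<{1}}: {2}".format(key, width, value) for key, value in pairs]

# Sample pairs in the same shape as one entry of `datasets`.
pairs = [("Tests", "103"), ("Failures", "24"), ("Success Rate", "76.70%")]
for line in format_pairs(pairs):
    print(line)
```

Padding the key to a fixed width keeps the colons aligned; pass a larger `width` if a heading is longer than 16 characters.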
edited May 27 '14 at 19:34
kmonsoor
user647772
Thank you for your answer. Replying to your comment above: can I use the class as the
identifier? I don't have an ID. The class would be details. – Gandalf StormCrow Aug 3
'12 at 7:41
Is it certain that this answer actually works in Python 2.4? @Gandalf, you said in a
comment that you installed "the older version of bsoup" (BeautifulSoup 3, I
presume). And the line saying "I'm using Python 2.4.3" is gone. So this is a bit
confusing. – mzjn Aug 3 '12 at 11:18
Use pandas.read_html:
import pandas as pd
html_tables = pd.read_html('resources/test.html')
df = html_tables[0]
df.T # transpose to align
                   0
Tests            103
Failures          24
Success Rate  76.70%
Average Time   71 ms
Jordan Valansi
4
Here is the top answer, adapted for Python3 compatibility, and improved by stripping
whitespace in cells:
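from bs4 import BeautifulSoup

html = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class": "details"})

# The first tr contains the field names; strip stray whitespace.
headings = [th.get_text().strip() for th in table.find("tr").find_all("th")]
print(headings)

datasets = []
for row in table.find_all("tr")[1:]:
    # One dict per data row, keyed by heading.
    dataset = dict(zip(headings, (td.get_text().strip() for td in row.find_all("td"))))
    datasets.append(dataset)

print(datasets)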
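Since the question asks for shell variables, one row dictionary of this shape could be emitted as shell assignments. This is only a sketch using the standard library; `to_shell_assignments` is a hypothetical helper, and the variable naming rule (upper case, underscores) is an assumption rather than anything the thread specifies.

```python
import re
import shlex

def to_shell_assignments(record):
    """Turn {'Average Time': '71 ms'} into ["AVERAGE_TIME='71 ms'"]."""
    lines = []
    for key, value in record.items():
        # Shell variable names may only contain letters, digits and underscores.
        name = re.sub(r"[^A-Za-z0-9_]+", "_", key.strip()).strip("_").upper()
        # shlex.quote adds quotes only when the value needs them.
        lines.append("{0}={1}".format(name, shlex.quote(value.strip())))
    return lines

record = {"Tests": "103", "Failures": "24", "Average Time": "71 ms"}
print("\n".join(to_shell_assignments(record)))
```

A wrapper script could then evaluate the printed lines to define the variables. Headings that start with a digit would still produce invalid names, so a real script would need to handle that case.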
Michel Müller
Assuming your html code is stored in a mycode.html file, here is a bash way:
Stephane Rouberol
Thanks for the answer, I need to get particular table, have more than one table –
Gandalf StormCrow Aug 3 '12 at 7:59
4
I heard that parsing HTML or XML with regexes is broken by definition. – ychaouche
Jan 12 '14 at 14:36
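For comparison with the regex concern raised here, and for the case of several tables in one file, this is a sketch that uses Python 3's standard-library `html.parser` to collect rows only from tables with a given class. `TableByClass` and `extract_rows` are hypothetical names. Nested tables are handled only roughly: their rows are mixed into the same list.

```python
from html.parser import HTMLParser

class TableByClass(HTMLParser):
    """Collect rows of cell text from every <table> carrying a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0          # > 0 while inside a matching table
        self.in_cell = False    # True between <td>/<th> and its end tag
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            if self.depth or self.wanted_class in classes:
                self.depth += 1
        elif self.depth and tag == "tr":
            self.rows.append([])
        elif self.depth and tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.depth and self.in_cell:
            self.rows[-1][-1] += data

def extract_rows(html, wanted_class):
    """Return the rows of all matching tables as lists of stripped cell text."""
    parser = TableByClass(wanted_class)
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]

sample = (
    '<table class="other"><tr><td>x</td></tr></table>'
    '<table class="details"><tr><th>Tests</th><th>Failures</th></tr>'
    '<tr><td> 103 </td><td>24</td></tr></table>'
)
for heading, value in zip(*extract_rows(sample, "details")):
    print(heading, ":", value)
```

The parser tolerates markup that is not well-formed XML, which is the main reason to prefer it over both regexes and an XML parser for real-world pages.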
1
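undef $/;          # slurp mode: read all of DATA in one go
$text = <DATA>;

# Assumed reconstruction: the extraction step is missing from this listing.
# Collect the <th> and <td> contents with regexes and print them pairwise.
@th = $text =~ m!<th>(.*?)</th>!gs;
@td = $text =~ m!<td>(.*?)</td>!gs;
printf "%s : %s\n", $th[$_], $td[$_] for 0 .. $#th;

__DATA__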
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>
output as follows:
Tests : 103
Failures : 24
Success Rate : 76.70%
Average Time : 71 ms
Min Time : 0 ms
Max Time : 829 ms
cdtits
@cdtits thank you for your response, will this work if my file contains several tables?
– Gandalf StormCrow Aug 3 '12 at 7:06
1
A Python solution that uses only the standard library (takes advantage of the fact that the
HTML happens to be well-formed XML). More than one row of data can be handled.
(Tested with Python 2.6 and 2.7. The question was updated saying that the OP uses
Python 2.4, so this answer may not be very useful in this case. ElementTree was added
in Python 2.5)
from xml.etree.ElementTree import fromstring

HTML = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
<tr valign="top" class="whatever">
<td>A</td>
<td>B</td>
<td>C</td>
<td>D</td>
<td>E</td>
<td>F</td>
</tr>
</table>"""

tree = fromstring(HTML)
rows = tree.findall("tr")
headrow = rows[0]
datarows = rows[1:]

# For each heading, join that column's value from every data row.
for num, h in enumerate(headrow):
    data = ", ".join([row[num].text for row in datarows])
    print("{0:<16}: {1}".format(h.text, data))
Output:
Tests : 103, A
Failures : 24, B
Success Rate : 76.70%, C
Average Time : 71 ms, D
Min Time : 0 ms, E
Max Time : 829 ms, F
edited Aug 3 '12 at 9:19
mzjn
thank you for your answer. Instead of reading from a particular html string, can I
specify like this : get me a table with class="details" from this html file and do
what you've just done? – Gandalf StormCrow Aug 3 '12 at 7:42
Now it works with more than one data row. I have tested this with Python 2.6 and
2.7, but now I see that you use 2.4.3 (which I don't have). So it may not help you.
Anyway, I wanted to show that it is possible to do this kind of thing without extra
libraries. – mzjn Aug 3 '12 at 8:56
1
The string formatting syntax that I (and @Tichodroma) use will not work in 2.4. –
mzjn Aug 3 '12 at 9:02
get me a table with class="details" from this html file. Yes, that can be done using
ElementTree (but not with Python 2.4). ElementTree was added in Python 2.5. –
mzjn Aug 3 '12 at 9:09
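A sketch of what selecting the table with class="details" could look like with ElementTree. It assumes the document is well-formed XML, as this answer does, and relies on the XPath attribute predicates that ElementTree supports from Python 2.7 onward. `rows_of_table` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def rows_of_table(xml_text, css_class):
    """Return the rows of the first <table class=css_class> as lists of text."""
    root = ET.fromstring(xml_text)
    # The predicate compares the whole attribute value, so class="details big"
    # would not match this simple lookup. ".//table" searches below the root,
    # so the table must not be the root element itself.
    table = root.find(".//table[@class='{0}']".format(css_class))
    if table is None:
        return []
    # iter("tr") also finds rows wrapped in <tbody>.
    return [[(cell.text or "").strip() for cell in row] for row in table.iter("tr")]

doc = (
    "<html><body>"
    "<table class='summary'><tr><td>x</td></tr></table>"
    "<table class='details'><tr><th>Tests</th></tr><tr><td>103</td></tr></table>"
    "</body></html>"
)
print(rows_of_table(doc, "details"))
```

To read from a file instead of a string, `ET.parse(path).getroot()` can replace the `ET.fromstring` call, under the same well-formedness assumption.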
1
Below is a Python regex-based solution that I have tested on Python 2.7. It doesn't rely on
the xml module, so it will work even if the markup is not fully well-formed XML.
import re

# input args: html string
# output: tables as a list, column max length
def extract_html_tables(html):
    tables=[]
    maxlen=0
    rex1=r'<table.*?/table>'
    rex2=r'<tr.*?/tr>'
    rex3=r'<(td|th).*?/(td|th)>'
    s = re.search(rex1,html,re.DOTALL)
    while s:
        t = s.group() # the table
        s2 = re.search(rex2,t,re.DOTALL)
        table = []
        while s2:
            r = s2.group() # the row
            s3 = re.search(rex3,r,re.DOTALL)
            row=[]
            while s3:
                d = s3.group() # the cell
                #row.append(strip_tags(d).strip() )
                row.append(d.strip() )
                r = re.sub(rex3,'',r,1,re.DOTALL)
                s3 = re.search(rex3,r,re.DOTALL)
            table.append( row )
            if maxlen<len(row):
                maxlen = len(row)
            t = re.sub(rex2,'',t,1,re.DOTALL)
            s2 = re.search(rex2,t,re.DOTALL)
        html = re.sub(rex1,'',html,1,re.DOTALL)
        tables.append(table)
        s = re.search(rex1,html,re.DOTALL)
    return tables, maxlen
html = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>"""
print extract_html_tables(html)
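The listing mentions `strip_tags` only in a commented-out line and never defines it, so as written each cell keeps its surrounding `<td>` or `<th>` markup. A hypothetical `strip_tags` could be as small as the sketch below; it is an assumption about what that helper was meant to do, and it does not decode HTML entities.

```python
import re

TAG = re.compile(r"<[^>]+>")

def strip_tags(fragment):
    """Drop anything that looks like a tag and trim the remaining text."""
    return TAG.sub("", fragment).strip()

# Cells in the shape that the regex-based extractor collects them.
cells = ["<th>Tests</th>", '<td class="n"> 103 </td>']
print([strip_tags(c) for c in cells])
```

With a helper like this in scope, the commented-out `row.append(strip_tags(d).strip())` line could replace the plain `row.append(d.strip())`.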
answered Oct 5 '17 at 3:35
paolov