How Can I Get Href Links From HTML Using Python?: 6 Answers
import urllib2

website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
html = openwebsite.read()
print html
So far so good.
But I want only href links from the plain text HTML. How can I solve this problem?
6 Answers
import urllib2
from BeautifulSoup import BeautifulSoup

html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')
In case you just want links starting with http:// , you can filter the href attribute with a regular expression.
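A minimal sketch of that filter, assuming the re module is imported and soup is the page parsed above:

import re

for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
    print link.get('href')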
BeautifulSoup cannot automatically close meta tags, for example. The DOM model is invalid, and
there is no guarantee that you'll find what you are looking for. – Antonio Dec 28 '13 at 16:16
Another problem with bsoup is that the format of the link will change from its original. So, if you want to
change the original link to point to another resource, at the moment I still have no idea how to do this
with bsoup. Any suggestion? – swdev Oct 28 '14 at 0:54
Not all links contain http . E.g., if you code your site to remove the protocol, the links will start with
// . This means just use whatever protocol the site is loaded with (either http: or https: ). –
reubano Jan 15 '17 at 17:19
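A minimal sketch of resolving such protocol-relative links against the page's own URL with urlparse.urljoin (the URLs here are illustrative):

import urlparse

base = "https://www.yourwebsite.com/"
href = "//cdn.example.com/lib.js"
print urlparse.urljoin(base, href)  # -> https://cdn.example.com/lib.js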
You can use the HTMLParser module.

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Print the href attribute of every anchor tag encountered.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print value

parser = MyHTMLParser()
parser.feed(your_html_string)
Note: The HTMLParser module has been renamed to html.parser in Python 3.0. The 2to3 tool will
automatically adapt imports when converting your sources to 3.0.
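If you need the same source to run on both versions without 2to3, a minimal sketch of a guarded import:

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2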
answered Jun 19 '10 at 13:02 – Stephen
I came to realize that, if a link contains a special HTML character such as &amp; , it gets converted
into its textual representation, & in this case. How do you preserve the original string? –
swdev Oct 28 '14 at 3:20
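HTMLParser hands you attribute values with entities already unescaped. One way to get the original form back when writing a link out again is to re-escape it; a minimal sketch with the standard xml.sax.saxutils helper:

from xml.sax.saxutils import escape

href = "http://example.com/?a=1&b=2"  # value as the parser delivers it
print escape(href)                 # -> http://example.com/?a=1&amp;b=2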
I like this solution best, since it doesn't need external dependencies. – DomTomCat Apr 27 '16 at 6:09
http://www.crummy.com/software/BeautifulSoup/
import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup(html)
for link in soup.findAll("a"):
    print link.get("href")
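Under the newer bs4 package (the successor to that module), only the import and the method name change; a minimal sketch, assuming bs4 is installed:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a"):
    print link.get("href")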
My answer probably sucks compared to the real gurus out there, but using some simple math,
string slicing, find, and urllib, this little script will create a list containing link elements. I tested it on
Google and my output seems right. Hope it helps!
import urllib

test = urllib.urlopen("http://www.google.com").read()
sane = 0
needlestack = []
while sane == 0:
    curpos = test.find("href")
    if curpos >= 0:
        testlen = len(test)
        test = test[curpos:testlen]
        curpos = test.find('"')
        testlen = len(test)
        test = test[curpos+1:testlen]
        curpos = test.find('"')
        needle = test[0:curpos]
        # startswith takes a tuple; "http" or "www" would only ever test "http"
        if needle.startswith(("http", "www")):
            needlestack.append(needle)
    else:
        sane = 1
for item in needlestack:
    print item
import itertools
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def reset(self):
        HTMLParser.reset(self)
        self.links = iter([])

    def handle_starttag(self, tag, attrs):
        # Chain each anchor's href onto the pending iterator of links.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links = itertools.chain(self.links, [value])

def links_from(f, encoding='utf-8'):
    # Stream links out as each line is fed in (the helper name and the
    # default encoding are illustrative).
    parser = LinkParser()
    for line in f:
        parser.feed(line.decode(encoding))
        yield from parser.links
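A sketch of driving that generator (the links_from helper name comes from the reconstruction above, and the URL is a placeholder):

from urllib.request import urlopen

with urlopen("http://www.example.com") as f:
    for href in links_from(f):
        print(href)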
Try instead:
import urllib2
import re

website = urllib2.urlopen('http://10.123.123.5/foo_images/Repo/')
html = website.read()
files = re.findall('href="(.*tgz|.*tar.gz)"', html)
print sorted(files)
I tested it only on my scenario of extracting a list of files from a web folder that exposes the
files and folders in it.
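One caveat with the pattern above: the unescaped dots are regex wildcards, so a name like "footgz" would also match. A tighter sketch of the same idea:

files = re.findall(r'href="([^"]*\.(?:tgz|tar\.gz))"', html)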