How Can I Get Href Links From HTML Using Python?: 6 Answers
import urllib2

website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
html = openwebsite.read()
print html
So far so good.
But I want only href links from the plain text HTML. How can I solve this problem?
6 Answers
import urllib2
from BeautifulSoup import BeautifulSoup

html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')
In case you just want links starting with http:// , you can filter the href attribute with a regular expression.
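A minimal sketch of that filter, assuming the re module is imported and soup is the page parsed above:

import re

for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
    print link.get('href')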
BeautifulSoup cannot automatically close meta tags, for example. The DOM model is invalid, and
there is no guarantee that you'll find what you are looking for. – Antonio Dec 28 '13 at 16:16
Another problem with bsoup is that the format of the link will change from its original. So, if you want to
change the original link to point to another resource, at the moment I still have no idea how to do this
with bsoup. Any suggestion? – swdev Oct 28 '14 at 0:54
Not all links contain http . E.g., if you code your site to remove the protocol, the links will start with
// . This means just use whatever protocol the site is loaded with (either http: or https: ). –
reubano Jan 15 '17 at 17:19
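A minimal sketch of resolving such protocol-relative links against the page's own URL with urlparse.urljoin (the URLs here are illustrative):

import urlparse

base = "https://www.yourwebsite.com/"
href = "//cdn.example.com/lib.js"
print urlparse.urljoin(base, href)  # -> https://cdn.example.com/lib.js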
You can use the HTMLParser module.

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Print the href attribute of every anchor tag encountered.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print value

parser = MyHTMLParser()
parser.feed(your_html_string)
Note: The HTMLParser module has been renamed to html.parser in Python 3.0. The 2to3 tool will
automatically adapt imports when converting your sources to 3.0.
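If you need the same source to run on both versions without 2to3, a minimal sketch of a guarded import:

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2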
answered Jun 19 '10 at 13:02 – Stephen
I came to realize that, if a link contains a special HTML character such as &amp; , it gets converted
into its textual representation, & in this case. How do you preserve the original string? –
swdev Oct 28 '14 at 3:20
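HTMLParser hands you attribute values with entities already unescaped. One way to get the original form back when writing a link out again is to re-escape it; a minimal sketch with the standard xml.sax.saxutils helper:

from xml.sax.saxutils import escape

href = "http://example.com/?a=1&b=2"  # value as the parser delivers it
print escape(href)                 # -> http://example.com/?a=1&amp;b=2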
I like this solution best, since it doesn't need external dependencies. – DomTomCat Apr 27 '16 at 6:09
http://www.crummy.com/software/BeautifulSoup/
import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup(html)
for link in soup.findAll("a"):
    print link.get("href")
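Under the newer bs4 package (the successor to that module), only the import and the method name change; a minimal sketch, assuming bs4 is installed:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a"):
    print link.get("href")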
My answer probably sucks compared to the real gurus out there, but using some simple math,
string slicing, find, and urllib, this little script will create a list containing link elements. I tested it on
Google and my output seems right. Hope it helps!
import urllib

test = urllib.urlopen("http://www.google.com").read()
sane = 0
needlestack = []
while sane == 0:
    curpos = test.find("href")
    if curpos >= 0:
        testlen = len(test)
        test = test[curpos:testlen]
        curpos = test.find('"')
        testlen = len(test)
        test = test[curpos+1:testlen]
        curpos = test.find('"')
        needle = test[0:curpos]
        # startswith takes a tuple; "http" or "www" would only ever test "http"
        if needle.startswith(("http", "www")):
            needlestack.append(needle)
    else:
        sane = 1
for item in needlestack:
    print item
import itertools
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def reset(self):
        HTMLParser.reset(self)
        self.links = iter([])

    def handle_starttag(self, tag, attrs):
        # Chain each anchor's href onto the pending iterator of links.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links = itertools.chain(self.links, [value])

def links_from(f, encoding='utf-8'):
    # Stream links out as each line is fed in (the helper name and the
    # default encoding are illustrative).
    parser = LinkParser()
    for line in f:
        parser.feed(line.decode(encoding))
        yield from parser.links
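A sketch of driving that generator (the links_from helper name comes from the reconstruction above, and the URL is a placeholder):

from urllib.request import urlopen

with urlopen("http://www.example.com") as f:
    for href in links_from(f):
        print(href)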
Try instead:
import urllib2
import re

website = urllib2.urlopen('http://10.123.123.5/foo_images/Repo/')
html = website.read()
files = re.findall('href="(.*tgz|.*tar.gz)"', html)
print sorted(files)
I tested it only on my scenario of extracting a list of files from a web folder that exposes the
files and folders in it.
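One caveat with the pattern above: the unescaped dots are regex wildcards, so a name like "footgz" would also match. A tighter sketch of the same idea:

files = re.findall(r'href="([^"]*\.(?:tgz|tar\.gz))"', html)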