Python Web Crawler
Frank McCown
Harding University
Spring 2010
Download a Web Page
• urllib2 library
http://docs.python.org/library/urllib2.html
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
# Sending a custom User-Agent header with the request
import urllib2
request = urllib2.Request('http://python.org/')
request.add_header("User-Agent", "My Python Crawler")
opener = urllib2.build_opener()
response = opener.open(request)
html = response.read()
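• urlopen() raises an exception when a page cannot be fetched, which a crawler will hit often; a minimal sketch of catching the two urllib2 exception types (the URL shown is only illustrative):
import urllib2
try:
    response = urllib2.urlopen('http://python.org/no-such-page')
    html = response.read()
except urllib2.HTTPError, e:
    print 'Server returned an error:', e.code      # e.g. 404
except urllib2.URLError, e:
    print 'Failed to reach the server:', e.reason  # bad hostname, no network, etc.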
Getting the HTTP headers
• Use response.info()
response = urllib2.urlopen('http://python.org/')
content_type = response.info().get('Content-Type')
>>> content_type
'text/html'
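• response.info() holds all of the HTTP headers; a few other response fields that are handy when crawling (a small illustrative sketch):
import urllib2
response = urllib2.urlopen('http://python.org/')
print response.getcode()                      # HTTP status code, e.g. 200
print response.geturl()                       # final URL after any redirects
print response.info().get('Content-Length')   # response size, if the server reports it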
Saving the Response to Disk
• Output html content to myfile.html
f = open('myfile.html', 'w')
f.write(html)
f.close()
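• Opening the file in binary mode ('wb') writes the bytes exactly as received, which also works for images and other non-HTML resources (a small variation on the code above):
f = open('myfile.html', 'wb')   # binary mode: no newline translation on Windows
f.write(html)
f.close()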
Download BeautifulSoup
• Use BeautifulSoup to easily extract links
• Download BeautifulSoup-3.2.0.tar.gz from
http://www.crummy.com/software/BeautifulSoup/download/3.x/
C:\>cd BeautifulSoup-3.2.0
C:\BeautifulSoup-3.2.0>python setup.py install
running install
running build
running build_py
creating build
Etc…
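• One quick way to confirm the install worked is to import the module from the interpreter (the version string shown assumes the 3.2.0 download above):
C:\>python
>>> import BeautifulSoup
>>> BeautifulSoup.__version__
'3.2.0'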
Extract Links
• Use BeautifulSoup to extract links
import urllib2
from BeautifulSoup import BeautifulSoup
html = urllib2.urlopen('http://python.org/').read()
soup = BeautifulSoup(html)
links = soup('a')
>>> len(links)
94
>>> links[4]
<a href="/about/" title="About The Python Language">About</a>
>>> link = links[4]
>>> link.attrs
[(u'href', u'/about/'), (u'title', u'About The Python Language')]
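• To work with just the URLs, pull the href value out of each anchor's attributes (a short sketch reusing the soup object above):
hrefs = []
for link in links:
    attrs = dict(link.attrs)     # list of (name, value) pairs -> dictionary
    if 'href' in attrs:
        hrefs.append(attrs['href'])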
Convert Relative URL to Absolute
• Links from BeautifulSoup may be relative
• Make absolute using urljoin()
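• A minimal example using urljoin() from the urlparse module:
from urlparse import urljoin
base = 'http://python.org/'
print urljoin(base, '/about/')              # http://python.org/about/
print urljoin(base, 'http://example.com/')  # already-absolute URLs pass through unchanged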
[Diagram: the crawler takes a URL from the Frontier, downloads the resource, extracts its URLs, and records the page in the Visited URLs repo]
Primary Data Structures
• Frontier
– Links that have not yet been visited
• Visited
– Links that have been visited
• Discovered
– Links found on the page currently being crawled (used to avoid adding duplicates to the frontier)
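• One straightforward way to represent them (the names match the crawler code on the next slide; the seed URL is only an example):
frontier = ['http://python.org/']   # URLs waiting to be crawled
visited_urls = set()                 # URLs that have already been downloaded
# discovered_urls starts as an empty set() for each page that is crawled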
Simple Crawler Pseudocode
import time, urllib2
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup

frontier = ['http://python.org/']   # seed URL is only an example
visited_urls = set()                 # URLs already crawled

while frontier:
    crawl_url = frontier.pop(0)      # next URL to crawl
    visited_urls.add(crawl_url)
    c = urllib2.urlopen(crawl_url)
    content_type = c.info().get('Content-Type')
    if not content_type.startswith('text/html'):
        continue                     # only parse HTML responses
    soup = BeautifulSoup(c.read())
    discovered_urls = set()
    links = soup('a')                # Get all anchor tags
    for link in links:
        if ('href' in dict(link.attrs)):
            url = urljoin(crawl_url, link['href'])
            if (url[0:4] == 'http' and url not in visited_urls
                    and url not in discovered_urls and url not in frontier):
                discovered_urls.add(url)
    frontier += discovered_urls      # add newly discovered links
    time.sleep(2)                    # be polite: pause between downloads
Assignment
• Add an optional parameter limit, with a default of 10, to the crawl()
function; it is the maximum number of web pages to download
• Save each downloaded page to the pages directory, using the MD5 hash of
its URL as the filename
import hashlib
filename = 'pages/' + hashlib.md5(url).hexdigest() + '.html'
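• A possible shape for crawl() with the limit parameter (a sketch only: seed_url, pages_downloaded, and the elided link-extraction step are illustrative, not part of the assignment text):
import hashlib, urllib2

def crawl(seed_url, limit=10):
    # Crawl at most `limit` pages, starting from seed_url
    frontier = [seed_url]
    visited_urls = set()
    pages_downloaded = 0
    while frontier and pages_downloaded < limit:
        crawl_url = frontier.pop(0)
        visited_urls.add(crawl_url)
        html = urllib2.urlopen(crawl_url).read()
        pages_downloaded += 1
        # save the page under the MD5 hash of its URL
        # (assumes the pages directory already exists)
        filename = 'pages/' + hashlib.md5(crawl_url).hexdigest() + '.html'
        f = open(filename, 'w')
        f.write(html)
        f.close()
        # ... extract links with BeautifulSoup and extend the frontier,
        #     as in the crawler code on the earlier slides ...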