
Web Scraping with Python

Miguel Miranda de Mattos


@mmmattos - mmmattos.net
Porto Alegre, Brazil.
2012
Web Scraping with Python
Tools:
BeautifulSoup
Mechanize
BeautifulSoup
An HTML/XML parser for Python that can turn even invalid
markup into a parse tree. It provides simple, idiomatic ways
of navigating, searching, and modifying the parse tree. It
commonly saves programmers hours or days of work.
In Summary:
Navigate the "soup" of HTML/XML tags, programmatically
Access tags' properties and values
Search for tags by name and attributes.
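Navigation and attribute access get no example of their own in the slides below, so here is a minimal sketch (BeautifulSoup 3 on Python 2, as in the rest of the deck; the markup is made up):
from BeautifulSoup import BeautifulSoup
doc = '<html><body><a href="http://example.com" title="Example">a link</a></body></html>'
soup = BeautifulSoup(doc)
print soup.html.body.a       # navigate the tree by tag name
print soup.a['title']        # read a tag attribute: Example
soup.a['title'] = 'Changed'  # modify the parse tree in place
print soup.a                 # <a href="http://example.com" title="Changed">a link</a>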
BeautifulSoup
Example:
from BeautifulSoup import BeautifulSoup
doc = "<html><h1>Heading</h1><p>Text"
soup = BeautifulSoup(doc)
print soup.prettify()
# <html>
# <h1>
# Heading
# </h1>
# <p>
# Text
# </p>
# </html>

BeautifulSoup
Searching / Looking for things
'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild',
'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings',
'findParent', 'findParents', 'findPrevious', 'findPreviousSibling',
'findPreviousSiblings'
findAll
findAll(self, name=None, attrs={}, recursive=True,
text=None, limit=None, **kwargs)
Extracts a list of Tag objects that match the given
criteria. You can specify the name of the Tag and any
attributes you want the Tag to have.
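The slides that follow exercise only name and attrs; a quick sketch of the text and limit arguments, under the same BeautifulSoup 3 / Python 2 setup:
from BeautifulSoup import BeautifulSoup
doc = "<table><tr><td>one</td><td>two</td></tr></table>"
soup = BeautifulSoup(doc)
print soup.findAll('td', limit=1)  # stop after the first match: [<td>one</td>]
print soup.findAll(text='two')     # match on tag text instead of tag names: [u'two']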

BeautifulSoup
Example:
>>> from BeautifulSoup import BeautifulSoup
>>> doc = "<table><tr><td>one</td><td>two</td></tr></table>"
>>> docSoup = BeautifulSoup(doc)

>>> print docSoup.findAll('tr')
[<tr><td>one</td><td>two</td></tr>]
>>> print docSoup.findAll('td')
[<td>one</td>, <td>two</td>]
BeautifulSoup
findAll (contd.):
>>> for t in docSoup.findAll('td'):
...     print t
<td>one</td>
<td>two</td>
>>> for t in docSoup.findAll('td'):
...     print t.getText()
one
two
BeautifulSoup
findAll using attributes to qualify:
>>> soup.findAll('div', attrs={'class': 'Menus'})
[<div class="Menus">musicMenu</div>, <div class="Menus">videoMenu</div>]
For more options:
dir(BeautifulSoup)
help(yourSoup.<command>)
Use BeautifulSoup rather than regexp patterns. Replace:
patFinderTitle = re.compile(r'<a[^>]*\stitle="(.*?)"')
re.findall(patFinderTitle, html)
with:
soup = BeautifulSoup(html)
for tag in soup.findAll('a'):
    print tag['title']
Mechanize
Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize.
mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:
any URL can be opened, not just http:
mechanize.UserAgentBase offers easy dynamic configuration of
user-agent features like protocol, cookie, redirection and robots.txt
handling, without having to make a new OpenerDirector each
time, e.g. by calling build_opener().
Easy HTML form filling (sketched below).
Convenient link parsing and following.
Browser history (.back() and .reload() methods).
The Referer HTTP header is added properly (optional).
Automatic observance of robots.txt.
Automatic handling of HTTP-Equiv and Refresh.
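Neither the configuration hooks nor form filling appear in the examples that follow, so here is a minimal sketch. The set_handle_* calls, addheaders, and select_form are real mechanize API; the form name "search" and field "q" are hypothetical:
import mechanize
br = mechanize.Browser()
# Dynamic user-agent configuration, without building a new OpenerDirector:
br.set_handle_robots(False)   # skip robots.txt checks
br.set_handle_redirect(True)  # follow HTTP redirects
br.addheaders = [('User-agent', 'Mozilla/5.0')]
# Form filling; "search" and "q" are made-up names for illustration:
br.open("http://www.example.com/")
br.select_form(name="search")
br["q"] = "web scraping"      # fill a text control by name
response = br.submit()
print response.read()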
Mechanize
Navigation commands:
open(url)
follow_link(link)
back()
submit()
reload()
Examples
br = mechanize.Browser()
br.open("http://www.python.org/")  # a full URL, including the scheme, is required
gothtml = br.response().read()
for link in br.links(url_regex="python.org"):
    print link
br.follow_link(link)  # takes EITHER a Link instance OR keyword args
br.back()
Mechanize
Example:
import re
import mechanize
br = mechanize.Browser()
br.open("http://www.example.com/")
# follow second link with element text matching
# regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop")
assert br.viewing_html()
print br.title()
print response1.geturl()
print response1.info() # headers
print response1.read() # body
Mechanize
Example: Combining Mechanize and BeautifulSoup
import re
import mechanize
from BeautifulSoup import BeautifulSoup
url = "http://www.hp.com"
br = mechanize.Browser()
br.open(url)
assert br.viewing_html()
html = br.response().read()
result_soup = BeautifulSoup(html)
found_divs = result_soup.findAll('div')
print "Found " + str(len(found_divs))
for d in found_divs:
    print d
Mechanize
Example: Combining Mechanize and BeautifulSoup
import re
import mechanize
from BeautifulSoup import BeautifulSoup
url = "http://www.hp.com"
br = mechanize.Browser()
br.open(url)
assert br.viewing_html()
html = br.response().read()
result_soup = BeautifulSoup(html)
found_divs = result_soup.findAll('div')
print "Found " + str(len(found_divs))
for d in found_divs:
    if d.has_key('class'):
        print d['class']
