
Web Scraping with Python

Miguel Miranda de Mattos


@mmmattos - mmmattos.net
Porto Alegre, Brazil.
2012
Web Scraping with Python
Tools:
BeautifulSoup
Mechanize
BeautifulSoup
An HTML/XML parser for Python that can turn even invalid
markup into a parse tree. It provides simple, idiomatic ways
of navigating, searching, and modifying the parse tree. It
commonly saves programmers hours or days of work.
In Summary:
Navigate the "soup" of HTML/XML tags, programmatically
Access tags' properties and values
Search for tags by name and attributes.
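Navigation and attribute access get no example of their own in the slides below, so here is a minimal sketch (BeautifulSoup 3 on Python 2, as in the rest of the deck; the markup is made up):
from BeautifulSoup import BeautifulSoup
doc = '<html><body><a href="http://example.com" title="Example">a link</a></body></html>'
soup = BeautifulSoup(doc)
print soup.html.body.a       # navigate the tree by tag name
print soup.a['title']        # read a tag attribute: Example
soup.a['title'] = 'Changed'  # modify the parse tree in place
print soup.a                 # <a href="http://example.com" title="Changed">a link</a>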
BeautifulSoup
Example:
from BeautifulSoup import BeautifulSoup
doc = "<html><h1>Heading</h1><p>Text"
soup = BeautifulSoup(doc)
print soup.prettify()
# <html>
# <h1>
# Heading
# </h1>
# <p>
# Text
# </p>
# </html>

BeautifulSoup
Searching / Looking for things
'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild',
'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings',
'findParent', 'findParents', 'findPrevious', 'findPreviousSibling',
'findPreviousSiblings'
findAll
findAll(self, name=None, attrs={}, recursive=True,
text=None, limit=None, **kwargs)
Extracts a list of Tag objects that match the given
criteria. You can specify the name of the Tag and any
attributes you want the Tag to have.
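The slides that follow exercise only name and attrs; a quick sketch of the text and limit arguments, under the same BeautifulSoup 3 / Python 2 setup:
from BeautifulSoup import BeautifulSoup
doc = "<table><tr><td>one</td><td>two</td></tr></table>"
soup = BeautifulSoup(doc)
print soup.findAll('td', limit=1)  # stop after the first match: [<td>one</td>]
print soup.findAll(text='two')     # match on tag text instead of tag names: [u'two']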

BeautifulSoup
Example:
>>> from BeautifulSoup import BeautifulSoup
>>> doc = "<table><tr><td>one</td><td>two</td></tr></table>"
>>> docSoup = BeautifulSoup(doc)

>>> print docSoup.findAll('tr')
[<tr><td>one</td><td>two</td></tr>]
>>> print docSoup.findAll('td')
[<td>one</td>, <td>two</td>]
BeautifulSoup
findAll (contd.):
>>> for t in docSoup.findAll('td'):
...     print t
<td>one</td>
<td>two</td>
>>> for t in docSoup.findAll('td'):
...     print t.getText()
one
two
BeautifulSoup
findAll using attributes to qualify:
>>> soup.findAll('div', attrs={'class': 'Menus'})
[<div class="Menus">musicMenu</div>, <div class="Menus">videoMenu</div>]
For more options:
dir(BeautifulSoup)
help(yourSoup.<command>)
Use BeautifulSoup rather than regexp patterns. Replace:
patFinderTitle = re.compile(r'<a[^>]*\stitle="(.*?)"')
re.findall(patFinderTitle, html)
with:
soup = BeautifulSoup(html)
for tag in soup.findAll('a'):
    print tag['title']
Mechanize
Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize.
mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:
any URL can be opened, not just http:
mechanize.UserAgentBase offers easy dynamic configuration of
user-agent features like protocol, cookie, redirection and robots.txt
handling, without having to make a new OpenerDirector each
time, e.g. by calling build_opener().
Easy HTML form filling (sketched below).
Convenient link parsing and following.
Browser history (.back() and .reload() methods).
The Referer HTTP header is added properly (optional).
Automatic observance of robots.txt.
Automatic handling of HTTP-Equiv and Refresh.
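Neither the configuration hooks nor form filling appear in the examples that follow, so here is a minimal sketch. The set_handle_* calls, addheaders, and select_form are real mechanize API; the form name "search" and field "q" are hypothetical:
import mechanize
br = mechanize.Browser()
# Dynamic user-agent configuration, without building a new OpenerDirector:
br.set_handle_robots(False)   # skip robots.txt checks
br.set_handle_redirect(True)  # follow HTTP redirects
br.addheaders = [('User-agent', 'Mozilla/5.0')]
# Form filling; "search" and "q" are made-up names for illustration:
br.open("http://www.example.com/")
br.select_form(name="search")
br["q"] = "web scraping"      # fill a text control by name
response = br.submit()
print response.read()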
Mechanize
Navigation commands:
open(url)
follow_link(link)
back()
submit()
reload()
Examples
br = mechanize.Browser()
br.open("http://www.python.org/")  # a full URL, including the scheme, is required
gothtml = br.response().read()
for link in br.links(url_regex="python.org"):
    print link
br.follow_link(link)  # takes EITHER a Link instance OR keyword args
br.back()
Mechanize
Example:
import re
import mechanize
br = mechanize.Browser()
br.open("http://www.example.com/")
# follow second link with element text matching
# regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop")
assert br.viewing_html()
print br.title()
print response1.geturl()
print response1.info() # headers
print response1.read() # body
Mechanize
Example: Combining Mechanize and BeautifulSoup
import re
import mechanize
from BeautifulSoup import BeautifulSoup
url = "http://www.hp.com"
br = mechanize.Browser()
br.open(url)
assert br.viewing_html()
html = br.response().read()
result_soup = BeautifulSoup(html)
found_divs = result_soup.findAll('div')
print "Found " + str(len(found_divs))
for d in found_divs:
    print d
Mechanize
Example: Combining Mechanize and BeautifulSoup
import re
import mechanize
from BeautifulSoup import BeautifulSoup
url = "http://www.hp.com"
br = mechanize.Browser()
br.open(url)
assert br.viewing_html()
html = br.response().read()
result_soup = BeautifulSoup(html)
found_divs = result_soup.findAll('div')
print "Found " + str(len(found_divs))
for d in found_divs:
    if d.has_key('class'):
        print d['class']
