Pycon 2014 Presentation
Extract data
How to download
wget / curl
urllib
requests
scrapy / ...
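As a rough illustration of the urllib option, here is a minimal sketch using only the standard library. The function name `download`, the 10-second timeout, and the UTF-8 fallback are assumptions made for this example, not something taken from the slides.

```python
from urllib.request import urlopen


def download(url, timeout=10.0):
    """Fetch a URL and return its body decoded as text."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Prefer the charset declared in the Content-Type header;
        # falling back to UTF-8 is an assumption of this sketch.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling `download("https://example.com/")` would return the page source as a string. requests or scrapy add conveniences such as sessions, retries and scheduling on top of this basic step.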
HTML
<html>
<head></head>
<body>
<div>TEXT-1</div>
<div>
TEXT-2 <b>TEXT-3</b>
</div>
<b>TEXT-4</b>
</body>
</html>
XPath
//b
//div/b
//div[2]/text()
//div[2]//text()
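The four expressions can be tried against the sample document. This sketch assumes the standard-library `xml.etree.ElementTree`, which only supports a subset of XPath (no `text()`) and needs well-formed markup, so the two text queries are emulated with `.text`, `.tail` and `itertext()`. For real-world HTML a full XPath engine such as lxml is the more likely choice.

```python
import xml.etree.ElementTree as ET

HTML = """<html>
<head></head>
<body>
<div>TEXT-1</div>
<div>
TEXT-2 <b>TEXT-3</b>
</div>
<b>TEXT-4</b>
</body>
</html>"""

root = ET.fromstring(HTML)


def clean(chunks):
    """Strip whitespace and drop empty or missing text chunks."""
    return [c.strip() for c in chunks if c and c.strip()]


# //b : every <b> element anywhere in the document
all_b = [b.text for b in root.findall(".//b")]

# //div/b : <b> elements that are direct children of a <div>
div_b = [b.text for b in root.findall(".//div/b")]

# //div[2] : the second <div> among its siblings
second_div = root.find(".//div[2]")

# //div[2]/text() : text nodes directly inside the second <div>
direct_text = clean([second_div.text] + [child.tail for child in second_div])

# //div[2]//text() : all text nodes inside the second <div>, at any depth
all_text = clean(second_div.itertext())

print(all_b)        # ['TEXT-3', 'TEXT-4']
print(div_b)        # ['TEXT-3']
print(direct_text)  # ['TEXT-2']
print(all_text)     # ['TEXT-2', 'TEXT-3']
```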
HTML information extraction
re (regular expressions)
XPath selectors
CSS3 selectors
jQuery selectors
parsley selectors
...
Without selectors
Scrapely (https://github.com/scrapy/scrapely)
Portia (https://github.com/scrapinghub/portia)
Portia: demo
Hard cases:
many websites, all of them different;
the site is unknown.
Tasks
Usually rule-based:
follow /contact, /about, etc. links (depending on the task);
follow links only to the original domain;
depth limits;
total pages limit;
...
pagination?
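The rules listed above (task-relevant links first, same domain only, depth limit, total pages limit) can be sketched as a small breadth-first crawler. Everything here is illustrative: `crawl`, its parameters, and the caller-supplied `get_links` function are hypothetical names, and the `SITE` dictionary stands in for real pages so the example runs without network access.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(start_url, get_links, max_depth=2, max_pages=50,
          priority_paths=("/contact", "/about")):
    """Breadth-first crawl restricted to the domain of start_url.

    get_links(url) is supplied by the caller and returns the hrefs on a page.
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:      # total pages limit
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:                     # depth limit
            continue
        links = [urljoin(url, href) for href in get_links(url)]
        # Task-relevant paths go first; the stable sort keeps page order otherwise.
        links.sort(key=lambda link: not urlparse(link).path.startswith(priority_paths))
        for link in links:
            # Follow links only to the original domain, and only once.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical in-memory site used instead of real HTTP requests.
SITE = {
    "http://example.com/": ["/products", "/about", "http://other.example.org/"],
    "http://example.com/about": ["/contact", "/"],
    "http://example.com/products": ["/products/1"],
    "http://example.com/contact": [],
    "http://example.com/products/1": ["/products/2"],
}

pages = crawl("http://example.com/", lambda url: SITE.get(url, []),
              max_depth=2, max_pages=4)
print(pages)
```

Pagination is not handled in this sketch; it would need its own rule, for example treating "next page" links as not increasing depth.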
Scraping
Rules / regexes work ~OK for phones, faxes, etc.
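A minimal sketch of the regex approach for phone and fax numbers follows. The pattern is deliberately loose and purely illustrative (the 7-digit minimum and the set of allowed separators are assumptions); real number formats vary a lot between countries and sites.

```python
import re

# Optional "+", a digit, then digits mixed with spaces, dots, dashes or
# parentheses, ending in a digit. Illustrative only, not a validated pattern.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{5,}\d")


def extract_phones(text):
    """Return phone-like substrings that contain at least 7 digits."""
    candidates = (match.group(0) for match in PHONE_RE.finditer(text))
    return [c for c in candidates if sum(ch.isdigit() for ch in c) >= 7]


sample = "Call us: +1 (555) 010-9999, fax 555.010.8888. Founded in 1999."
print(extract_phones(sample))  # ['+1 (555) 010-9999', '555.010.8888']
```

The year "1999" is skipped because it is too short to match, which shows both why such rules work tolerably for numbers and why they need per-case tuning.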
https://github.com/xtannier/WebAnnotator
https://gate.ac.uk/
http://brat.nlplab.org/
WebAnnotator (Firefox extension): demo
Named Entity Recognition
For English it is often solved using machine learning
[Figure: annotated page text "© Old Tea Cafe" with an ORG label; the rest of the text is cut off ("Rights Reserv")]
This data format is not convenient for ML algorithms
IOB encoding
it is possible to improve;
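To make the IOB encoding concrete, here is a small sketch that turns token-level entity spans into per-token tags: `B-` marks the first token of an entity, `I-` marks its continuation, and `O` marks tokens outside any entity. The function `to_iob`, the pre-split tokens, and the assumption that the ORG annotation covers "Old Tea Cafe" are all illustrative.

```python
def to_iob(tokens, entities):
    """Tag tokens with IOB labels.

    entities is a list of (start, end, label) token spans, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the entity
    return list(zip(tokens, tags))


tokens = ["©", "Old", "Tea", "Cafe"]
# Assumption: the ORG annotation covers the tokens "Old Tea Cafe".
tagged = to_iob(tokens, [(1, 4, "ORG")])
print(tagged)
# [('©', 'O'), ('Old', 'B-ORG'), ('Tea', 'I-ORG'), ('Cafe', 'I-ORG')]
```

With one tag per token, the annotated page becomes a plain sequence-labelling dataset that standard ML algorithms can consume.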