
How to extract data from web pages

Mikhail Korobov, ScrapingHub
PyCon RU 2014
Plan

Download a web page

Extract data
How to download
wget / curl

urllib

requests

twisted / tornado / gevent / ...

scrapy / ...
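A minimal sketch of downloading a page with requests, one of the options listed above; the URL is only a placeholder:

import requests

# Fetch a page; the URL is just an example.
response = requests.get("https://en.wikipedia.org/wiki/Web_scraping", timeout=10)
response.raise_for_status()   # fail loudly on HTTP errors
html = response.text          # decoded HTML as a unicode string
print(len(html), "characters downloaded")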
HTML
<html>
  <head></head>
  <body>
    <div>TEXT-1</div>
    <div>
      TEXT-2 <b>TEXT-3</b>
    </div>
    <b>TEXT-4</b>
  </body>
</html>
XPath

//b              - all <b> elements (TEXT-3, TEXT-4)

//div/b          - <b> elements that are direct children of a <div> (TEXT-3)

//div[2]/text()  - text nodes directly inside the second <div> (TEXT-2)

//div[2]//text() - all text nodes inside the second <div> (TEXT-2, TEXT-3)
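The same queries can be run with lxml; a small sketch against the example page above (whitespace in the text() results is elided in the comments):

from lxml import html

page = """
<html>
  <head></head>
  <body>
    <div>TEXT-1</div>
    <div>
      TEXT-2 <b>TEXT-3</b>
    </div>
    <b>TEXT-4</b>
  </body>
</html>
"""

tree = html.fromstring(page)

print([b.text for b in tree.xpath("//b")])       # ['TEXT-3', 'TEXT-4']
print([b.text for b in tree.xpath("//div/b")])   # ['TEXT-3']
print(tree.xpath("//div[2]/text()"))             # text directly in the 2nd div: TEXT-2
print(tree.xpath("//div[2]//text()"))            # all text in the 2nd div: TEXT-2, TEXT-3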
HTML information
extraction
re (regular expressions)

XPath selectors

CSS3 selectors

jquery selectors

parsley selectors

...
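For comparison, a small sketch of the same extraction with CSS3 selectors (lxml needs the cssselect package installed for this):

from lxml import html

tree = html.fromstring(
    "<html><body><div>TEXT-1</div>"
    "<div>TEXT-2 <b>TEXT-3</b></div><b>TEXT-4</b></body></html>")

# "div b" selects <b> elements anywhere inside a <div>,
# roughly the CSS equivalent of the XPath //div//b.
print([b.text for b in tree.cssselect("div b")])   # ['TEXT-3']
print([b.text for b in tree.xpath("//div//b")])    # ['TEXT-3']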
Without selectors

Scrapely (https://github.com/scrapy/scrapely)

Portia (https://github.com/scrapinghub/portia)
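A rough sketch of the Scrapely "train by example" workflow; the URLs and field values below are placeholders, not from the talk:

from scrapely import Scraper

scraper = Scraper()

# Show Scrapely one page together with the values we want from it;
# it infers an extraction template, no selectors written by hand.
scraper.train("http://example.com/product/1",
              {"name": "Example product", "price": "9.99"})

# The learned template is then applied to similar pages.
print(scraper.scrape("http://example.com/product/2"))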
Portia: demo
Hard cases:

many websites, all of them are different;

the structure of a web site is unknown in advance.
Tasks

Crawling (traverse a web site)

Scraping (extract structured information)


Crawling:
usually - rules

Follow /contact, /about, etc. links (depending
on the task);
follow links only to the original domain;
depth limits;
total pages limit;
...
pagination?
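A minimal sketch of such rule-based crawling with Scrapy (modern import paths; the domain, link patterns and limits are placeholders):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ContactSpider(CrawlSpider):
    name = "contacts"
    allowed_domains = ["example.com"]      # follow links only on the original domain
    start_urls = ["http://example.com/"]
    custom_settings = {
        "DEPTH_LIMIT": 3,                  # depth limit
        "CLOSESPIDER_PAGECOUNT": 100,      # total pages limit
    }

    rules = (
        # Follow /contact, /about, etc. links and parse them.
        Rule(LinkExtractor(allow=(r"/contact", r"/about")), callback="parse_page"),
    )

    def parse_page(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}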
Scraping
Rules / regexes work ~OK for phones, faxes,
etc. (a small regex sketch follows below)

Rules work worse for more complex tasks:
people names, organization names, etc.

In scientific terms the problem can be
formulated as a Named Entity
Recognition (NER) problem.
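The regex sketch mentioned above; the pattern is deliberately simple, which is exactly why rules top out at "~OK" even for phones:

import re

# Matches digit runs with optional +, spaces, dots, dashes and parentheses.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

text = "Call us: +7 (343) 123-45-67, fax +7 343 765-43-21. Ask for John Smith."
print(PHONE_RE.findall(text))   # finds the phone-like substrings,
                                # but 'John Smith' needs something smarter (NER)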
Named Entity
Recognition
For English it is often solved using machine
learning:

1. Define what we want to find.

2. Annotate web pages manually.

3. Train a ML model using annotated pages.

4. Extract information from new, unseen pages.


Named Entity
Recognition
For English it is often solved using machine
learning:

1. Define what we want to find.

2. Annotate web pages manually.

3. Train a ML model using annotated pages.

4. Extract information from new, unseen pages.


1. Named Entities:
examples
organization name
person name
person function/position/job title
street address
city
state
country
phone
fax
open hours
1. Named Entities:
examples
organization name - ORG
person name - PER
person function/position/job title - FUNC
street address - STREET
city - CITY
state - STATE
country - COUNTRY
phone - TEL
fax - FAX
open hours - HOURS
Named Entity
Recognition
For English it is often solved using machine
learning:

1. Define what we want to find.

2. Annotate web pages manually.

3. Train a ML model using annotated pages.

4. Extract information from new, unseen pages.


2. Tools for manual
annotation

https://github.com/xtannier/WebAnnotator

https://gate.ac.uk/

http://brat.nlplab.org/
WebAnnotator
(Firefox extension):
demo
Named Entity
Recognition
For English it is often solved using machine
learning:

1. Define what we want to find.

2. Annotate web pages manually.

3. Train a ML model using annotated pages.

4. Extract information from new, unseen pages.


3. Reduce the problem to a form
suitable for machine learning

Web page => an array of tokens;

for each token, keep information about its
position in the HTML tree;

each token is assigned a label (one of the
named entity labels).
Tool:
https://github.com/scrapinghub/webstruct
Named entity -> one
or more tokens

ORG = "Old Tea Cafe"
(from the text "© Old Tea Cafe Rights Reserved")

This data format is not convenient for ML
algorithms
IOB encoding

©   Old     Tea     Cafe    Rights   Reserved
O   B-ORG   I-ORG   I-ORG   O        O

Tokens 'outside' named entities - tag O
The first token of an entity - tag B-ENTITY
Other tokens of an entity - tag I-ENTITY
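A toy sketch of IOB encoding (not webstruct's actual API): turn tokens plus entity spans into one tag per token.

def iob_encode(tokens, entities):
    """entities: list of (start, end, label) token spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["©", "Old", "Tea", "Cafe", "Rights", "Reserved"]
print(iob_encode(tokens, [(1, 4, "ORG")]))
# ['O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O']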
The problem is reduced to a
"standard" ML classification task

Input data - information about tokens
(== features)

Output - named entity label, encoded as IOB

... + an important detail: to get better
prediction quality, use a classifier which takes
the sequence of predicted labels into account
(Conditional Random Fields is a common choice)
Examples of features
token == "Cafe"?

is the first letter uppercase?

is the token a name of a month?

are the two previous tokens "© 2014"?

is the token inside a <title> HTML element?

is the token the last one in its HTML element?
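A sketch of such features as the per-token dicts that python-crfsuite accepts; the token attributes passed in are assumptions about whatever token objects your pipeline produces:

MONTHS = {"january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"}

def token_features(text, prev2, prev1, parent_tag, is_last_in_elem):
    return {
        "token": text.lower(),                                   # token == "cafe"?
        "first_upper": text[:1].isupper(),                       # first letter uppercase?
        "is_month": text.lower() in MONTHS,                      # a name of a month?
        "after_copyright": (prev2 == "©" and prev1 == "2014"),   # previous tokens "© 2014"?
        "in_title": parent_tag == "title",                       # inside a <title> element?
        "last_in_element": is_last_in_elem,                      # last in its HTML element?
    }

print(token_features("Cafe", "Old", "Tea", "p", False))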


Putting it all together
(one of the approaches)
Use WebAnnotator to annotate pages manually

Use WebStruct to load training data (annotated
pages) and encode named entity labels to IOB

Write Python functions to extract features (and/or
use some of the WebStruct feature extraction
functions)

Train a CRF model using python-crfsuite

Use WebStruct to combine all the pieces
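A condensed sketch of the training/tagging step with python-crfsuite; load_annotated_pages() is a placeholder standing in for the WebStruct loading / tokenizing / IOB-encoding steps, not a real API:

import pycrfsuite

def load_annotated_pages():
    """Placeholder: yields (feature_dicts, iob_tags) for each annotated page."""
    yield (
        [{"token": "old", "first_upper": True},
         {"token": "tea", "first_upper": True},
         {"token": "cafe", "first_upper": True}],
        ["B-ORG", "I-ORG", "I-ORG"],
    )

trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in load_annotated_pages():
    trainer.append(xseq, yseq)

trainer.set_params({"c1": 1.0, "c2": 1e-3, "max_iterations": 50})
trainer.train("ner.crfsuite")           # writes the model to a file

tagger = pycrfsuite.Tagger()
tagger.open("ner.crfsuite")
print(tagger.tag([{"token": "old"}, {"token": "tea"}, {"token": "cafe"}]))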


Disadvantages
Many training pages are necessary to get
good quality (it is good to have at least
several hundred manually annotated pages)

100% accuracy is impossible

When features are extracted using many
Python functions it can become slow (5-20
pages/sec)
Advantages
It works;

it is possible to improve;

it is possible to adapt to a new problem
domain;

parts of the job (manual data annotation) can
be done by non-developers.
Hints

Understand what's inside and how it works;

don't make changes blindly, don't treat the
libraries as black boxes;

Coursera / ... courses are helpful;

to dive deeper, read books.
