
How to extract data from web pages

Mikhail Korobov, ScrapingHub
PyCon RU 2014
Plan

Download a web page

Extract data
How to download
wget / curl

urllib

requests

twisted / tornado / gevent / ...

scrapy / ...
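A minimal sketch of downloading a page with requests, one of the options listed above; the URL is only a placeholder:

import requests

# Fetch a page; the URL is just an example.
response = requests.get("https://en.wikipedia.org/wiki/Web_scraping", timeout=10)
response.raise_for_status()   # fail loudly on HTTP errors
html = response.text          # decoded HTML as a unicode string
print(len(html), "characters downloaded")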
HTML
<html>
  <head></head>
  <body>
    <div>TEXT-1</div>
    <div>
      TEXT-2 <b>TEXT-3</b>
    </div>
    <b>TEXT-4</b>
  </body>
</html>
XPath

//b              - all <b> elements (TEXT-3, TEXT-4)

//div/b          - <b> elements that are direct children of a <div> (TEXT-3)

//div[2]/text()  - text nodes directly inside the second <div> (TEXT-2)

//div[2]//text() - all text nodes inside the second <div> (TEXT-2, TEXT-3)
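The same queries can be run with lxml; a small sketch against the example page above (whitespace in the text() results is elided in the comments):

from lxml import html

page = """
<html>
  <head></head>
  <body>
    <div>TEXT-1</div>
    <div>
      TEXT-2 <b>TEXT-3</b>
    </div>
    <b>TEXT-4</b>
  </body>
</html>
"""

tree = html.fromstring(page)

print([b.text for b in tree.xpath("//b")])       # ['TEXT-3', 'TEXT-4']
print([b.text for b in tree.xpath("//div/b")])   # ['TEXT-3']
print(tree.xpath("//div[2]/text()"))             # text directly in the 2nd div: TEXT-2
print(tree.xpath("//div[2]//text()"))            # all text in the 2nd div: TEXT-2, TEXT-3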
HTML information
extraction
re (regular expressions)

XPath selectors

CSS3 selectors

jquery selectors

parsley selectors

...
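For comparison, a small sketch of the same extraction with CSS3 selectors (lxml needs the cssselect package installed for this):

from lxml import html

tree = html.fromstring(
    "<html><body><div>TEXT-1</div>"
    "<div>TEXT-2 <b>TEXT-3</b></div><b>TEXT-4</b></body></html>")

# "div b" selects <b> elements anywhere inside a <div>,
# roughly the CSS equivalent of the XPath //div//b.
print([b.text for b in tree.cssselect("div b")])   # ['TEXT-3']
print([b.text for b in tree.xpath("//div//b")])    # ['TEXT-3']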
Without selectors

Scrapely (https://github.com/scrapy/scrapely)

Portia (https://github.com/scrapinghub/portia)
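A rough sketch of the Scrapely "train by example" workflow; the URLs and field values below are placeholders, not from the talk:

from scrapely import Scraper

scraper = Scraper()

# Show Scrapely one page together with the values we want from it;
# it infers an extraction template, no selectors written by hand.
scraper.train("http://example.com/product/1",
              {"name": "Example product", "price": "9.99"})

# The learned template is then applied to similar pages.
print(scraper.scrape("http://example.com/product/2"))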
Portia: demo
Hard cases:

many websites, all of them are different;

the structure of a web site is unknown in advance.
Tasks

Crawling (traverse a web site)

Scraping (extract structured information)


Crawling:
usually - rules

Follow /contact, /about, etc. links (depending
on the task);
follow links only to the original domain;
depth limits;
total pages limit;
...
pagination?
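A minimal sketch of such rule-based crawling with Scrapy (modern import paths; the domain, link patterns and limits are placeholders):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ContactSpider(CrawlSpider):
    name = "contacts"
    allowed_domains = ["example.com"]      # follow links only on the original domain
    start_urls = ["http://example.com/"]
    custom_settings = {
        "DEPTH_LIMIT": 3,                  # depth limit
        "CLOSESPIDER_PAGECOUNT": 100,      # total pages limit
    }

    rules = (
        # Follow /contact, /about, etc. links and parse them.
        Rule(LinkExtractor(allow=(r"/contact", r"/about")), callback="parse_page"),
    )

    def parse_page(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}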
Scraping
Rules / regexes work ~OK for phones, faxes,
etc. (a small regex sketch follows below)

Rules work worse for more complex tasks:
people names, organization names, etc.

In scientific terms the problem can be
formulated as a Named Entity
Recognition (NER) problem.
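The regex sketch mentioned above; the pattern is deliberately simple, which is exactly why rules top out at "~OK" even for phones:

import re

# Matches digit runs with optional +, spaces, dots, dashes and parentheses.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

text = "Call us: +7 (343) 123-45-67, fax +7 343 765-43-21. Ask for John Smith."
print(PHONE_RE.findall(text))   # finds the phone-like substrings,
                                # but 'John Smith' needs something smarter (NER)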
Named Entity
Recognition
For English it is often solved using machine
learning:

1. Define what we want to find.

2. Annotate web pages manually.

3. Train a ML model using annotated pages.

4. Extract information from new, unseen pages.


Named Entity
Recognition
For English it is often solved using machine
learning:

1. Define what we want to find.

2. Annotate web pages manually.

3. Train a ML model using annotated pages.

4. Extract information from new, unseen pages.


1. Named Entities:
examples
organization name
person name
person function/position/job title
street address
city
state
country
phone
fax
open hours
1. Named Entities:
examples
organization name - ORG
person name - PER
person function/position/job title - FUNC
street address - STREET
city - CITY
state - STATE
country - COUNTRY
phone - TEL
fax - FAX
open hours - HOURS
Named Entity
Recognition
For English it is often solved using machine
learning:

1. Define what we want to find.

2. Annotate web pages manually.

3. Train a ML model using annotated pages.

4. Extract information from new, unseen pages.


2. Tools for manual
annotation

https://github.com/xtannier/WebAnnotator

https://gate.ac.uk/

http://brat.nlplab.org/
WebAnnotator
(Firefox extension):
demo
Named Entity
Recognition
For English it is often solved using machine
learning:

1. Define what we want to find.

2. Annotate web pages manually.

3. Train a ML model using annotated pages.

4. Extract information from new, unseen pages.


3. Reduce the problem to a form
suitable for machine learning

Web page => an array of tokens;

for each token, keep information about its
position in the HTML tree;

each token is assigned a label (one of the
named entity labels).
Tool:
https://github.com/scrapinghub/webstruct
Named entity -> one
or more tokens

ORG = "Old Tea Cafe"
(from the text "© Old Tea Cafe Rights Reserved")

This data format is not convenient for ML
algorithms
IOB encoding

©   Old     Tea     Cafe    Rights   Reserved
O   B-ORG   I-ORG   I-ORG   O        O

Tokens 'outside' named entities - tag O
The first token of an entity - tag B-ENTITY
Other tokens of an entity - tag I-ENTITY
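A toy sketch of IOB encoding (not webstruct's actual API): turn tokens plus entity spans into one tag per token.

def iob_encode(tokens, entities):
    """entities: list of (start, end, label) token spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["©", "Old", "Tea", "Cafe", "Rights", "Reserved"]
print(iob_encode(tokens, [(1, 4, "ORG")]))
# ['O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O']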
The problem is reduced to a
"standard" ML classification task

Input data - information about tokens
(== features)

Output - named entity label, encoded as IOB

... + an important detail: to get better
prediction quality, use a classifier which takes
the sequence of predicted labels into account
(Conditional Random Fields is a common choice)
Examples of features
token == "Cafe"?

is the first letter uppercase?

is the token a name of a month?

are the two previous tokens "© 2014"?

is the token inside a <title> HTML element?

is the token the last one in its HTML element?
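A sketch of such features as the per-token dicts that python-crfsuite accepts; the token attributes passed in are assumptions about whatever token objects your pipeline produces:

MONTHS = {"january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"}

def token_features(text, prev2, prev1, parent_tag, is_last_in_elem):
    return {
        "token": text.lower(),                                   # token == "cafe"?
        "first_upper": text[:1].isupper(),                       # first letter uppercase?
        "is_month": text.lower() in MONTHS,                      # a name of a month?
        "after_copyright": (prev2 == "©" and prev1 == "2014"),   # previous tokens "© 2014"?
        "in_title": parent_tag == "title",                       # inside a <title> element?
        "last_in_element": is_last_in_elem,                      # last in its HTML element?
    }

print(token_features("Cafe", "Old", "Tea", "p", False))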


Putting it all together
(one of the approaches)
Use WebAnnotator to annotate pages manually

Use WebStruct to load training data (annotated
pages) and encode named entity labels to IOB

Write Python functions to extract features (and/or
use some of the WebStruct feature extraction
functions)

Train a CRF model using python-crfsuite

Use WebStruct to combine all the pieces
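A condensed sketch of the training/tagging step with python-crfsuite; load_annotated_pages() is a placeholder standing in for the WebStruct loading / tokenizing / IOB-encoding steps, not a real API:

import pycrfsuite

def load_annotated_pages():
    """Placeholder: yields (feature_dicts, iob_tags) for each annotated page."""
    yield (
        [{"token": "old", "first_upper": True},
         {"token": "tea", "first_upper": True},
         {"token": "cafe", "first_upper": True}],
        ["B-ORG", "I-ORG", "I-ORG"],
    )

trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in load_annotated_pages():
    trainer.append(xseq, yseq)

trainer.set_params({"c1": 1.0, "c2": 1e-3, "max_iterations": 50})
trainer.train("ner.crfsuite")           # writes the model to a file

tagger = pycrfsuite.Tagger()
tagger.open("ner.crfsuite")
print(tagger.tag([{"token": "old"}, {"token": "tea"}, {"token": "cafe"}]))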


Disadvantages
Many training pages are necessary to get
good quality (it is good to have at least
several hundred manually annotated pages)

100% accuracy is impossible

When features are extracted using many
Python functions it can become slow (5-20
pages/sec)
Advantages
It works;

it is possible to improve;

it is possible to adapt to a new problem
domain;

parts of the job (manual data annotation) can
be done by non-developers.
Hints

Understand what's inside and how it works;

don't make changes blindly, don't treat the
libraries as black boxes;

Coursera / ... courses are helpful;

to dive deeper, read books.
