Web Scraping: Applications and Tools
Table of Contents
Keywords
Abstract / Executive Summary
1 Introduction
2 What is web scraping?
   The necessity to scrape web sites and PDF documents
   The API mana
3 Web scraping tools
   Partial tools for PDF extraction
   Partial tools to extract from HTML
   Complete tools
   Confronting the three tools
   Other tools
4 Decision map
5 Scraping and legislation
6 Conclusions and recommendations
About the Author
Copyright information
Keywords
web scraping, data extracting, web content extracting, data mining, data harvester, crawler,
open government data
1 Introduction
Every single day, several petabytes of information are published via the Internet in various
formats, such as HTML, PDF, CSV or XML. Curiously, HTML is not necessarily the most common
format to publish content. For instance, 70% of the content indexed by Google is extracted
from PDF documents. This is an additional obstacle for the different roles involved in data
extraction from multiple sources. Journalists, researchers, data analysts, sales agents or
software developers are some examples of professionals typically using the copy-and-paste
technique to get information in a specific format and export it to a spreadsheet, an executive
report or to some data exchange format such as XML, JSON or any of the several vocabularies
available based on them.
Focusing on data acquisition from HTML (i.e., a web document), this article explains the
mechanisms and tools that may help us to minimize the tedious duties of data extraction from
the Internet.
With regard to the commitment to transparency and data openness that public
administrations have assumed over the last decade, scraping and crawling are techniques
that may be useful. Whereas some current information systems in use by public administrations
consider these techniques, a relevant number of web sites, content management systems and
ECMs (Enterprise Content Management systems) do not. Replacing these systems with new ones
implies a considerable economic effort. Scraping tools provide an alternative to replacement
that minimizes such effort. Open Government Data and Transparency Policies should take
advantage of this opportunity.
Figure: the stages of a scraping process: extraction, transformation and reuse.
Scraping processes may be written in different programming languages; the most popular are
Java, Python, Ruby and Node. Obviously, expert programmers are required to develop and
evolve them, and even to use them. Nonetheless, some software companies have designed
tools that enable other people to apply scraping techniques by means of attractive and
powerful user interfaces.
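To make this concrete, the following minimal sketch (in Python, one of the languages mentioned above) illustrates the extraction, transformation and reuse stages; the URL and the CSS selector are only illustrative assumptions, not a reference to any real site.

    # Minimal scraping sketch: extraction, transformation and reuse.
    # Assumes the 'requests' and 'beautifulsoup4' packages are installed;
    # the URL and the CSS selector below are only examples.
    import json

    import requests
    from bs4 import BeautifulSoup

    # Extraction: download the HTML document.
    html = requests.get("https://example.com/news").text

    # Transformation: parse the document and pick the interesting elements.
    soup = BeautifulSoup(html, "html.parser")
    headlines = [h.get_text(strip=True) for h in soup.select("h2.headline")]

    # Reuse: export the extracted data to a data exchange format such as JSON.
    print(json.dumps({"headlines": headlines}, ensure_ascii=False, indent=2))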
Web scraping tools are also referred to as Web Data Extractors, Data Harvesters, Crawling Tools
or Web Content Mining Tools.
Partial tools for PDF extraction

Tool | Web site | Description
PDFtoExcelOnLine | www.pdftoexcelonline.com | Works online. Converts existing PDF documents to Excel.
Zamzar | www.zamzar.com | Works online. Converts documents between a large number of formats (Word, Excel, PowerPoint and many others).
CometDocs | www.cometdocs.com | Works online. Cloud-based document storage and conversion to various formats; smartphone applications and user accounts available.
PDFTables | pdftables.com | Converts the tables of a PDF document to Excel. API available; licensing terms oriented to software developers.
Tabula | tabula.technology | Desktop tool focused on the extraction of tables in a PDF document. Off-line use. Source code available.
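As an illustration of the programmatic end of this category, the sketch below uses tabula-py, a Python wrapper around Tabula; treating that wrapper as available is an assumption (the table above only covers the desktop tool), and the file names are examples.

    # Sketch of programmatic PDF table extraction with the tabula-py wrapper
    # around Tabula (an assumption; the report covers only the desktop tool).
    import tabula

    # read_pdf returns a list of pandas DataFrames, one per table detected.
    tables = tabula.read_pdf("report.pdf", pages="all")

    # Mirror the PDF-to-Excel converters above by exporting the first table.
    tables[0].to_excel("report_table_1.xlsx", index=False)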
Partial tools to extract from HTML
Testing the IMPORTHTML formula is fairly simple, as long as the value N is known. This value
represents the order of the table in the list of tables available in the HTML code of the target document.
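For instance, typing =IMPORTHTML("http://example.com/stats.html", "table", 2) in a Google Sheets cell imports the second table found in that page; the URL here is only an example, and the third argument is the value N discussed above.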
Table Capture, a Google Chrome extension
Table Capture is an extension to the popular Google Chrome web browser. It enables users to
copy the content of tables included in a web page in an easy manner. The extension is available
from the aforementioned browser, typing the following URL in its address bar:
https://chrome.google.com/webstore/detail/table-capture/iebpjdmgckacbodjpijphcplhebcmeop
Once installed, if Chrome detects tables in the web page currently rendered, a red icon is
shown to the right of the address bar. Clicking on it shows a listing of the tables detected and
two controls: the first one copies the content to the clipboard, and the second one opens a Google
Docs document into which the content of the table can be pasted. An example is shown in the following
figure.
Example of use of Table Capture
Table to Clipboard, a Mozilla Firefox add-on
A similar add-on, Table to Clipboard, is available for the Mozilla Firefox browser and can be installed from
https://addons.mozilla.org/es/firefox/addon/dafizilla-table2clipboard/?src=userprofile
In this case, a context menu shown upon right-clicking on a table allows copying the whole table or just
the clicked row or column. This mechanism is quite useful, less intrusive, and offers
interesting functionality in many cases.
Example of use of Table to Clipboard
Complete tools
Over the last years, a set of companies, many of them start-ups, have realized that the market
was demanding tools to extract, store, process and render data via APIs. The software industry
moves fast and in many directions, and a good web scraper can help in application
development, Public Administration transparency, Big Data processes, data analysis, online
marketing, digital reputation, information reuse or content comparison and aggregation,
among other typical scenarios.
But what should a tool of this kind include, at a minimum, to be considered a serious alternative? In
the opinion of the author:
Import.io
A company located in London, specialized in data mining and in transforming Internet data so that users can
exploit the information.
Web site import.io and enterprise.import.io
Motto Instantly turn web pages into data
Indubitably, this is one of the reference tools in the market. It may be used in four different
ways. The first one (named Magic, which can be classified as basic) consists of accessing import.io
and typing the address of the web site on which we want to perform scraping. The result is
shown in an attractive visual tabular format. The main drawback is that the process is not
configurable at all.
The second way of use is named Extractor. It is the most common usage of import.io
technology: download, install and execute it on your own computer. This tool is a customized web
browser available for Windows, OS X and Linux. This way requires some previous skills using
software tools and some time to learn how to use the tool. However, picking is offered in a
quite reasonable manner, although open to improvement. Picking is
performed by clicking on the parts of the scraped web site that we want to extract, in a simple
and visual way. This is a feature that any web scraping tool must include these days.
Once queries have been created, only two output formats are available: JSON and CSV. Queries may
also be combined in order to paginate results (Bulk Extract) or aggregate them (URLs from
another API). It is also relevant to note that users get a RESTful API endpoint to access
the data source, which is a mandatory feature in any relevant complete scraping tool
nowadays.
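As a rough sketch of what consuming such an endpoint looks like, the snippet below requests JSON from a scraping API; the URL, the apikey parameter and the response layout are hypothetical, not the actual import.io interface.

    # Hedged sketch of consuming a RESTful scraping endpoint.
    # The endpoint URL, the 'apikey' parameter and the response structure
    # are hypothetical; the real import.io API is defined by its own docs.
    import requests

    ENDPOINT = "https://api.example-scraper.com/extractors/12345/results"

    response = requests.get(ENDPOINT, params={"apikey": "YOUR_KEY", "format": "json"})
    response.raise_for_status()

    # Iterate over the extracted rows returned by the (hypothetical) service.
    for row in response.json().get("results", []):
        print(row)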
The import.io application requires a simple user registration process. With a username and
password, the application can be used for free, with a set of basic features that may be
sufficient for small developer teams without complex scraping requirements. For companies or
professionals demanding more flexibility and backend resources, contact with the import.io sales
team is required.
The third and fourth ways of use provide more value to the tool. They are the Crawler and the
Connector, respectively. The Crawler tries to follow all the links in the document indicated via
its URL and allows information extraction based on the picking process carried out in the
initial document. In the tests carried out to write this article, we did not manage to complete
this process, as it seemed to keep running indefinitely without producing any results. The
Connector permits recording a browsing script to reach the web document from which to
extract information. This approach is very interesting, for instance, if the data to be scraped are
the result of a search.
In summary, import.io is a free-to-use tool with interesting functionality, a high level of
maturity, an attractive and modern graphical user interface, and cloud storage of
queries, but it demands local installation to take full advantage of its features.
Strengths: visual interface; browsing recording (Connector).
Weaknesses: desktop installation; learning curve.
Strengths vs Weaknesses for import.io
View of the basic use of import.io (available at https://import.io)
import.io desktop application
Screen to select data extraction mode in import.io
Kimonolabs
Kimono Labs is a company located in Mountain View, California. They offer a SaaS tool used through
a Chrome extension. It allows extracting data from web documents in an intuitive and easy way. They
provide results in JSON, CSV and RSS formats.
Web site www.kimonolabs.com
Motto Turn websites into structured APIs from your browser in seconds
Kimono Labs is another key player in the field of web scraping. It uses a strategy similar to
import.io's, based on a web browser. In their case, they offer a Chrome extension, rather than
embedding a web browser in a native desktop application. Therefore, the first step to use this
tool is to install Google Chrome and then this extension. Registration is optional, but it is a quick
and simple process and is recommended.
After installation and registration, we can use Chrome to reach a web document with
interesting information. At that moment, we click the icon installed by the Kimono extension
and the picking process starts. This process provides help to the user on first execution in a
visual and attractive format. Its graphical user interface is really polished and the tool is
very friendly.
Initial help view offered by Kimono
To start working with Kimono, the user must create a ball at the upper end of the screen. By
default, a ball is already created. The various balls created are shown in different colours.
Afterwards, sections of the document may be selected to extract data and, subsequently, they
are highlighted with the colour of the associated ball. At the same time, the ball shows the
number of elements that match the selection. Balls may be used to select different zones of
the same document, although their purpose is to refer to zones with a certain semantic
consistency.
For instance, in the web site of a newspaper, we might create a ball named Title and then
select a headline in a web page. Then, the tool highlights one or two additional headlines as
interesting. After selecting a new headline, the tool starts highlighting another 20 interesting
elements in the web page. We may notice that there are some headlines which are not
highlighted by the tool. We select one of them and now over 60 headlines are highlighted. This
process may be repeated until the tool highlights all the headlines after selecting only a small
number of them. This process is known as selector overload and is available in several
scraping tools.
Once all the headlines have been selected, we can try doing the same with the opening
paragraph of each news item: create a ball, click on the text area of an opening paragraph,
then another one, and continue the process until all the desired information is ready for
extraction.
Although the idea is really good, our tests have found the process somewhat
bothersome. Sometimes, box highlighting in web pages does not work well; for instance, there are
problems with texts which are links at the same time.
Once the picking process is finished by clicking on the Done button, we can name our
new API and configure its schedule. With this, the system may execute the
API every 15 minutes, hour, day, etc. and store the results in the cloud. Whenever the user
calls the API by accessing the associated URL, Kimono does not access the target site but
returns the most recent data stored in its cache. This caching mechanism is highly useful but
not exclusive to Kimono.
Example of highlighting in the picking process of kimono
The query management console, available at www.kimonolabs.com, provides access to the
newly created API and to various controls and panels to read data and configure how they are
obtained. This includes an interesting feature: email alerts received when data
change on the target site.
There is an additional interesting option named Mobile App that integrates the content of
the created API in a view resembling a mobile application, allowing some styling configuration.
However, the view generated by this option is a web document accessible at the announced URL,
intended to be rendered in a mobile browser. Unfortunately, the name of the option
misleads users: it does not generate a mobile application to be published in any mobile
application store. Still, it may be a useful option for rapid prototyping.
The console menu also offers the Combine APIs option. Initially, it may look like an
aggregator, assembling the data obtained from several heterogeneous APIs into a single API.
Nevertheless, the help information for this option indicates that the aggregated APIs must have
exactly the same data collection names. The conclusion is that this option is useful to paginate
information, but not to aggregate it.
In summary, Kimono is a free tool, with a high level of maturity, a very good graphical user
interface and cloud storage for queries, but it requires the Chrome browser and its extension,
both installed locally.
Strengths: visual interface; documentation; picker.
Weaknesses: requires the Chrome browser and its extension.
myTrama
myTrama is a web crawling tool developed by Vitesia Mobile Solutions, a company located in Gijón,
Spain. myTrama allows any user to extract data located in different Internet sites and obtain it in
an ordered and structured way in various formats (JSON, XML, CSV and PDF).
Web sites www.mytrama.com and www.vitesia.com
Motto Data is the new oil
myTrama is a new web crawling tool positioned as a clear competitor to those previously
discussed. It is a purely SaaS service, thus avoiding the need for users to install any software
or to depend on a specific web browser. myTrama works on Chrome, Firefox, Internet Explorer
and Safari. It is available at https://www.mytrama.com.
A general analysis of this tool suggests that myTrama takes the best ideas of import.io and
Kimono. It presents information in a graphical user interface, perhaps not as polished but more
compact and with the look and feel of a project management tool. Some of the features which
seem most interesting in this tool are discussed below:
The main view is organized in a way similar to an email client, with three vertical zones: 1)
folders, 2) queries, and 3) query detail. It is efficient and friendly.
Besides JSON, XML and CSV, the classical structured formats for B2B integrations, it
adds PDF for quick reporting and sends results in an easily viewable and printable
format.
It includes a query language named Trama-WQL (quite similar to SQL), which is simple
to use while powerful. It is useful when visual picking is not sufficient, providing a
programmatic manner to define the picking process. Documentation of this language is
available in the tool as a menu option.
The Audit menu option gives access to a compact control panel with information
about the requests currently being made to each of the APIs (EndPoints).
The picker is completely integrated; no additional software of any type is necessary.
It is similar to the approach used in Kimono, although it uses boxes instead of balls.
A subtle differentiation is that a magic wand replaces the default mouse pointer when
picking is available. In addition, the picking process may be stopped by right-clicking on
the area being picked.
myTrama permits grouping boxes within boxes, although only one level of grouping
and only one group per query are allowed. This is a very useful feature in order to have
results properly grouped. Hopefully, the development team will improve this feature
soon to provide users with more flexibility.
APIs may be programmed using parameters sent via GET and POST requests.
Unfortunately, the dev team has not published sufficient documentation related to this
feature. For example, it is possible to use the URL of an API and overwrite the FROM
parameter (the URL referencing the target document) in real time. It is also possible to
pass parameters via GET and POST in the same API. Additionally, there is a service that
allows the execution of a Trama-WQL sentence without any query created in the tool.
As these features are not very well documented, the best option is to contact the
people at Vitesia.
For those preferring the browser extension way of scraping, a Chrome extension is also
available. This mechanism allows users to browse sites and start the scraping process
by clicking on the button installed by the extension. This plugin is not yet published but
can be requested from Vitesia.
PDF is not only available as an output format, but also as input. Therefore, a URL may
reference not only HTML documents but also PDF ones. For instance, users will be able to
extract information from PDFs and generate JSON documents that feed a database for
later analysis. The business hypothesis supporting this is the evidence mentioned in the
introduction of this article: that around 70% of the content published on the Internet is
contained in PDF documents. Vitesia considers that this may be a differentiating feature
between myTrama and its competitors.
APIs preserve session state. This allows chaining calls to queries in myTrama to fulfil
business processes, such as searches, step-based forms (wizards) or access to information
available behind a login mechanism.
Access to this platform is by invitation. Users remain active for 30 days; after that,
contact with the dev team is required in order to become a permanent user.
Among all the tools analysed, myTrama seems to be the most complete and compact, although
its user interface is one step behind kimono and import.io. For users with software
development skills, myTrama seems to be the best choice, although it requires direct contact
with Vitesia.
Initial screen of myTrama
Picking process in myTrama
Dashboard screen in myTrama
In summary, myTrama is a tool offered solely as a SaaS service, very complete for carrying out
scraping processes, with cloud storage, and it may be operated with any web browser. Its
major weakness is the lack of documentation for many of the differentiating features relevant to
developers interested in taking advantage of scraping processes.
Strengths: dashboards; picker.
Weaknesses: limited documentation.
Confronting the three tools

Feature | import.io | Kimono | myTrama
SaaS model | No (requires desktop application) | Yes (requires installation of a Chrome extension) | Yes
Desktop installer (Windows, OS X and Linux) | Yes | No | No
Chrome extension | No | Yes (required) | Yes (optional)
Free license | Yes | Yes | By invitation
Cross-browser compatibility | Own browser | Only Chrome | Any browser
Visual picking | Good | Very good | Good
Query organization | | | Folders
Trama-WQL query language | No | No | Yes
PDF extraction | No | No | Yes
Output formats | JSON, CSV | JSON, CSV, RSS | JSON, XML, CSV, PDF
Automatic crawling | Yes (Crawler) | |
Maturity level | High | High | Medium/High
Complexity of use | Medium/High | Medium | Medium
Cloud storage | Yes | Yes | Yes
Query pagination | Yes (Bulk Extract) | Yes (Combine APIs) |
Query aggregation | Medium/Low | Medium/Low | Medium
Level of documentation | High | High | Low
Other tools
This article analyses three tools in the category of complete tools, but many others exist. For the
sake of brevity, and for those readers interested in this kind of tool, some other interesting
tools, utilities or frameworks permitting web scraping are listed below; a minimal code sketch
using one of them follows the list.
Mozenda, QuBole, ScraperWiki, Scrapy, Scrapinghub, ParseHub, Ubot Studio 5, Scraper (Chrome
Plugin), Outwit Hub, Apache Nutch, Fminer.com, 80legs, Content Grabber, CloudScrape,
Webhose.io, UIPath, Winautomation, AddTolt, Agent Community, Automation Anywhere Enterprise,
Clarabridge, Darcy Ripper, Data Integration, Data Crops, Dataddo, Diffbot, Espion, Feedity,
Ficstar Web Grabber, PDF Collector, Plain Text, Kapow Katalyst Platform, RedCritter, Scrape.it,
Solid Converter, TextfromPDF, Trapeze, Unit Miner, Web Extractor, Spinn3r, SyncFirst Standard,
Web Content Extractor, Web Data Extraction, Scraping Robots, WebHarvy.
If you are not convinced by any of the three recommended tools, Mozenda or ParseHub
may be interesting alternatives.
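To give a flavour of the framework end of this list, the sketch below is a minimal Scrapy spider; the target URL and the CSS selector are only illustrative assumptions.

    # Minimal Scrapy spider; the URL and selector are illustrative only.
    # Run with: scrapy runspider headlines_spider.py -o headlines.json
    import scrapy

    class HeadlinesSpider(scrapy.Spider):
        name = "headlines"
        start_urls = ["https://example.com/news"]

        def parse(self, response):
            # Yield one item per headline found on the page.
            for headline in response.css("h2.headline::text").getall():
                yield {"headline": headline.strip()}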
4 Decision map
The following diagram may be helpful in the process of deciding which tool meets which
scraping requirements. Obviously, the diagram could be more complex, as more questions may
be asked in a decision process. However, the resulting increase in complexity suggests keeping
the figure simple yet sufficiently illustrative.
Copyright information
© 2015 European PSI Platform. This document and all material therein have been compiled
with great care. However, the author, editor and/or publisher and/or any party within the
European PSI Platform or its predecessor projects, the ePSIplus Network project or the ePSINet
consortium, cannot be held liable in any way for the consequences of using the content of this
document and/or any material referenced therein. This report has been published under the
auspices of the European Public Sector Information Platform.
The report may be reproduced providing acknowledgement is made to the European Public
Sector Information (PSI) Platform.