Data Scraping
Data scraping is a technique where a computer program extracts data from human-readable output coming
from another program.
Description
Normally, data transfer between programs is accomplished using data structures suited for automated
processing by computers, not people. Such interchange formats and protocols are typically rigidly
structured, well-documented, easily parsed, and minimize ambiguity. Very often, these transmissions are not
human-readable at all.
Thus, the key element that distinguishes data scraping from regular parsing is that the output being scraped
is intended for display to an end-user, rather than as an input to another program. It is therefore usually
neither documented nor structured for convenient parsing. Data scraping often involves ignoring binary
data (usually images or multimedia data), display formatting, redundant labels, superfluous commentary,
and other information which is either irrelevant or hinders automated processing.
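The filtering described above can be sketched in a few lines. The snippet below is a minimal illustration, not a real tool: the "ACME" banner, the field labels, and the layout of the display are invented for the example. It discards decoration and key hints and keeps only the labelled values.

```python
import re

# Hypothetical human-readable output from a legacy billing program.
# The banner and "(Press F1...)" hint are exactly the kind of
# display-only material a scraper must ignore.
display = """\
=== ACME BILLING SYSTEM v2.1 ===
Customer: Jane Doe          Acct#: 10042
Balance : $1,204.50         Due   : 2024-07-01
(Press F1 for help)
"""

def scrape_fields(text):
    """Extract 'Label: value' pairs, treating runs of two or more
    spaces (or end of line) as field separators."""
    pairs = re.findall(r"([A-Za-z#]+)\s*:\s*(.+?)(?:\s{2,}|$)", text, flags=re.M)
    return dict(pairs)

fields = scrape_fields(display)
# Superfluous formatting ('$', thousands separator) is stripped to
# recover machine-usable data.
balance = float(fields["Balance"].lstrip("$").replace(",", ""))
```

Everything that makes the display readable to a person, the banner, the alignment padding, the currency symbol, is exactly what the program has to strip away.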
Data scraping is most often done either to interface to a legacy system that has no other mechanism compatible with current hardware, or to interface to a third-party system that does not provide a more convenient API. In the second case, the operator of the third-party system will often see screen scraping as unwanted, for reasons such as increased system load, the loss of advertisement revenue, or the loss of control of the information content.
Data scraping is generally considered an ad hoc, inelegant technique, often used only as a "last resort" when no other mechanism for data interchange is available. Aside from the higher programming and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but a computer program may fail outright. Depending on the quality and extent of the error-handling logic present in the program, this failure can result in error messages, corrupted output, or even program crashes.
Technical variants
Screen scraping
As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the
1960s—the dawn of computerized data processing. Computer to user interfaces from that era were often
simply text-based dumb terminals which were not much more than virtual teleprinters (such systems are still
in use today, for various reasons). The desire to interface such a system to more modern systems is
common. A robust solution will often require things no longer available, such as source code, system
documentation, APIs, or programmers with experience in a 50-year-old computer system. In such cases, the
only feasible solution may be to write a screen scraper that "pretends" to be a user at a terminal. The screen
scraper might connect to the legacy system via Telnet, emulate the keystrokes needed to navigate the old
user interface, process the resulting display output, extract the desired data, and pass it on to the modern
system. A sophisticated and resilient implementation of this kind, built on a platform providing the
governance and control required by a major enterprise—e.g. change control, security, user management,
data protection, operational audit, load balancing, and queue management, etc.—could be said to be an
example of robotic process automation (RPA) software; self-guided variants based on artificial intelligence are sometimes styled RPA 2.0 or RPAAI.
In the 1980s, financial data providers such as Reuters, Telerate, and Quotron displayed data in 24×80
format intended for a human reader. Users of this data, particularly investment banks, wrote applications to capture and convert this character data into numeric data for inclusion in calculations for trading decisions without re-keying it. The common term for this practice, especially in the United Kingdom, was page shredding, since the results could be imagined to have passed through a paper shredder. Internally,
Reuters used the term 'logicized' for this conversion process, running a sophisticated computer system on
VAX/VMS called the Logicizer.[2]
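The "logicizing" of a character-cell page can be sketched as follows. The page layout below is invented for the example (real vendor pages differed), but the principle is the same: the screen is just rows of characters at fixed columns, and the scraper converts known column ranges into numbers.

```python
# A hypothetical 24x80-style quote page as a trader might have seen it.
# Column positions are fixed by the (invented) page layout:
#   cols 0-10: instrument, 11-18: bid, 19-26: ask.
page = [
    "ACME QUOTES          PAGE 001",
    "",
    "INSTRUMENT   BID      ASK   ",
    "GBP/USD    1.2701   1.2704  ",
    "EUR/USD    1.0832   1.0835  ",
]

def logicize(lines):
    """Convert character-cell quote rows into numeric records,
    skipping banner, blank, and header rows."""
    quotes = {}
    for line in lines:
        name = line[0:11].strip()
        bid, ask = line[11:19].strip(), line[19:27].strip()
        try:
            quotes[name] = (float(bid), float(ask))
        except ValueError:
            continue  # row carries no numeric data
    return quotes

quotes = logicize(page)
```

Note how the page header and column titles are rejected simply because they do not parse as numbers; real systems of this kind needed far more careful handling of page updates and partial refreshes.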
More modern screen scraping techniques include capturing the bitmap data from the screen and running it through an OCR engine, or, for some specialised automated testing systems, matching the screen's bitmap data against expected results.[3] In the case of GUI applications, this can be combined with querying the graphical controls by programmatically obtaining references to their underlying programming objects. A sequence of screens is automatically captured and converted into a database.
Another modern adaptation to these techniques is to use, instead of a sequence of screens as input, a set of
images or PDF files, so there are some overlaps with generic "document scraping" and report mining
techniques.
There are many tools that can be used for screen scraping.[4]
Web scraping
Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a
wealth of useful data in text form. However, most web pages are designed for human end-users and not for
ease of automated use. Because of this, toolkits that scrape web content were created. A web scraper is an API or tool to extract data from a website.[5] Companies such as Amazon Web Services and Google provide web scraping tools, services, and public data free of cost to end-users. Newer forms of web scraping involve listening to data feeds from web servers; for example, JSON is commonly used as a transport mechanism between the client and the web server.
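A small web scraper can be built on Python's standard-library HTMLParser alone. The HTML fragment, the element classes, and the `PriceScraper` name below are invented for the example; in practice the page would be fetched over HTTP (e.g. with urllib.request) and would be far messier, which is what makes web scraping fragile.

```python
from html.parser import HTMLParser

# A fragment of a hypothetical product-listing page.
html = """
<table id="products">
  <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
  <tr><td class="name">Gadget</td><td class="price">19.50</td></tr>
</table>
"""

class PriceScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and data.strip():
            self._row.append(data.strip())

scraper = PriceScraper()
scraper.feed(html)
products = {name: float(price) for name, price in scraper.rows}
```

The scraper depends on the page's incidental structure (one name cell and one price cell per row), so a cosmetic redesign of the page can silently break it, the fragility discussed above.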
Recently, companies have developed web scraping systems that rely on using techniques in DOM parsing,
computer vision and natural language processing to simulate the human processing that occurs when
viewing a webpage to automatically extract useful information.[6][7]
Large websites usually use defensive algorithms to protect their data from web scrapers and to limit the
number of requests an IP or IP network may send. This has caused an ongoing battle between website
developers and scraping developers.[8]
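On the scraper's side, the usual response to per-IP request limits is client-side throttling. The class below is a minimal sketch of that idea; the `Throttle` name and the injectable clock/sleep parameters are inventions for this example (real scrapers would also honour robots.txt and back off on HTTP 429 responses).

```python
import time

class Throttle:
    """Allow at most `rate` requests per second, sleeping as needed
    before each request. clock/sleep are injectable for testing."""
    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self.clock, self.sleep = clock, sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the
        last call, then record the current time."""
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now
```

A scraping loop would call `wait()` before each HTTP request, spacing requests evenly instead of sending them as fast as the network allows.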
Report mining
Report mining is the extraction of data from human-readable computer reports. Conventional data
extraction requires a connection to a working source system, suitable connectivity standards or an API, and
usually complex querying. By using the source system's standard reporting options, and directing the output
to a spool file instead of to a printer, static reports can be generated suitable for offline analysis via report
mining.[9] This approach can avoid intensive CPU usage during business hours, can minimise end-user
licence costs for ERP customers, and can offer very rapid prototyping and development of custom reports.
Whereas data scraping and web scraping involve interacting with dynamic output, report mining involves
extracting data from files in a human-readable format, such as HTML, PDF, or text. These can be easily
generated from almost any system by intercepting the data feed to a printer. This approach can provide a
quick and simple route to obtaining data without the need to program an API to the source system.
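Mining such a spooled report can be sketched as below. The report text and its column layout are invented for the example; the heuristic used to separate data rows from headers and trailers (a row is three whitespace-separated fields whose middle field is an integer) is specific to this made-up format.

```python
# A hypothetical spooled report, as an ERP system might print it.
report = """\
DAILY SALES REPORT                          PAGE 1
DATE: 2024-06-30

REGION      UNITS     REVENUE
North         120     3600.00
South          85     2550.00

*** END OF REPORT ***
"""

def mine_report(text):
    """Pull the data rows out of a static report, skipping titles,
    column headers, page decoration, and trailer lines."""
    records = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1].isdigit():
            records.append((parts[0], int(parts[1]), float(parts[2])))
    return records

records = mine_report(report)
```

Because the report is static, the same extraction can be re-run offline at any time, which is what makes this approach attractive when live access to the source system is costly or unavailable.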
See also
Comparison of feed aggregators
Data cleansing
Data munging
Importer (computing)
Information extraction
Open data
Mashup (web application hybrid)
Metadata
Web scraping
Search engine scraping
References
1. "Back in the 1990s ... 2002 ... 2016 ... still, according to Chase Bank, a major issue." Ron Lieber (May 7, 2016). "Jamie Dimon Wants to Protect You From Innovative Start-Ups" (https://www.nytimes.com/2016/05/07/your-money/jamie-dimon-wants-to-protect-you-from-innovative-start-ups.html). The New York Times.
2. Contributors Fret About Reuters' Plan To Switch From Monitor Network To IDN (http://www.fxweek.com/fx-week/news/1539599/contributors-fret-about-reuters-plan-to-switch-from-monitor-network-to-idn), FX Week, 2 Nov 1990.
3. Yeh, Tom (2009). "Sikuli: Using GUI Screenshots for Search and Automation" (https://web.archive.org/web/20100214184939/http://groups.csail.mit.edu/uid/projects/sikuli/sikuli-uist2009.pdf) (PDF). UIST. Archived from the original (http://groups.csail.mit.edu/uid/projects/sikuli/sikuli-uist2009.pdf) (PDF) on 2010-02-14. Retrieved 2015-02-16.
4. "What is Screen Scraping" (http://www.prowebscraper.com/blog/screen-scraping/). June 17, 2019.
5. Thapelo, Tsaone Swaabow; Namoshe, Molaletsa; Matsebe, Oduetse; Motshegwa, Tshiamo; Bopape, Mary-Jane Morongwa (2021-07-28). "SASSCAL WebSAPI: A Web Scraping Application Programming Interface to Support Access to SASSCAL's Weather Data" (http://datascience.codata.org/articles/10.5334/dsj-2021-024/). Data Science Journal. 20: 24. doi:10.5334/dsj-2021-024 (https://doi.org/10.5334%2Fdsj-2021-024). ISSN 1683-1470 (https://www.worldcat.org/issn/1683-1470). S2CID 237719804 (https://api.semanticscholar.org/CorpusID:237719804).
6. "Diffbot aims to make it easier for apps to read Web pages the way humans do" (http://www.technologyreview.com/news/428056/a-startup-hopes-to-help-computers-understand-web-pages/). MIT Technology Review. Retrieved 1 December 2014.
7. "This Simple Data-Scraping Tool Could Change How Apps Are Made" (https://web.archive.org/web/20150511050542/http://www.wired.com/2014/03/kimono). WIRED. Archived from the original (https://www.wired.com/2014/03/kimono/) on 11 May 2015. Retrieved 8 May 2015.
8. ""Unusual traffic from your computer network" - Search Help" (https://support.google.com/websearch/answer/86640?hl=en). support.google.com. Retrieved 2017-04-04.
9. Scott Steinacher, "Data Pump transforms host data" (https://web.archive.org/web/20160304205109/http://connection.ebscohost.com/c/product-reviews/2235513/data-pump-transforms-host-data), InfoWorld, 30 August 1999, p. 55.
Further reading
Hemenway, Kevin and Calishain, Tara. Spidering Hacks. Cambridge, Massachusetts:
O'Reilly, 2003. ISBN 0-596-00577-6.