practical perl

Web Automation
by Adam Turoff

Adam is a consultant who specializes in using Perl to manage big data. He is a long-time Perl Monger, a technical editor for The Perl Review, and a frequent presenter at Perl conferences.
[email protected]

Introduction
Web service protocols like XML-RPC and SOAP are great for automating common tasks on the Web. But these protocols aren’t always available. Sometimes interacting with HTML-based interfaces is still necessary. Thankfully, Perl has the tools to help you get your job done.

In my last column, I introduced Web services using XML-RPC. Web services are commonly used as a high-level RPC (remote procedure call) mechanism to allow two programs to share data. They enable programs to exchange information with each other by sending XML documents over HTTP.

There are many advantages to using Web service tools like XML-RPC and its cousin, SOAP. First, all of the low-level details of writing client and server programs are handled by reusable libraries. No longer is it necessary to master the arcana of socket programming and protocol design to implement or use a new service or daemon. Because information is exchanged as text, Web services are programming-language agnostic. You could write a service in Perl to deliver weather information, and access it with clients written in Python, Tcl, Java, or C#. Or vice versa.

Yet for all of the benefits Web services bring, they are hardly a panacea. Protocols like SOAP and XML-RPC focus on how programs interact with each other, not on how people interact with programs. For example, I cannot scribble down the address of an XML-RPC service on a napkin and expect someone to use that service easily. Nor can I send a link to an XML-RPC service in the body of an email message.

Generally speaking, in order to use a Web service, I need to write some code, and have an understanding of how to use that particular service. This is why, after about five years, Web services are still a niche technology. They work great if you want to offer programmatic access to a service like, say, eBay, Google, or Amazon.com. But if you want to publish information or offer a service to the widest possible audience, you still need to build a Web site.

The HTML Problem
Before Web services, automatic processing of data from the Web usually involved fetching HTML documents and scanning them to find new or interesting bits of data. Web services offer a more robust alternative, but do not eliminate the need to sift through HTML documents and “screen scrape” data off a Web page.

Processing HTML is the worst possible solution, but it is often the only solution available. HTML is a difficult format to parse. Many documents contain invalid or otherwise broken formatting. Using regular expressions to extract information from HTML documents is a common coping strategy, but it is quite error-prone and notoriously brittle.

Nevertheless, HTML is the universal format for data on the Web. Programmers who are building systems may consider alternatives like XML-RPC or SOAP Web services. But publishers and service providers are still focused on HTML, because it is the one format that everyone with a Web browser can always use.

Automating the Web
Since the early days of the Web, people have used programs that automatically scan, monitor, mirror, and fetch information from the Web. These programs are generally called robots or spiders. Today, other kinds of programs traverse the Web, too. Spammers use email harvesters to scour Web pages for email addresses they can spam. In the Semantic Web community, “scutters” follow links to metadata files to build up databases of information about who’s who and what’s what on the Web.

There are many other mundane uses for Web automation programs. Link checkers rigorously fetch all the resources on a Web site to find and report broken links. With software development moving to the Web, testers use scripts to simulate a user session to make sure Web applications behave properly.

Fortunately, there are a great many Perl modules on CPAN to help with all of these tasks.

Most Web automation programs in Perl start with libwww-perl, more commonly known as LWP. This library of modules is Gisle Aas’s Swiss Army knife for interacting with the Web. The easiest way to get started with LWP is with the LWP::Simple module, which provides a simple interface to fetch Web resources:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
## Grab a Web page, and throw the content in a Perl variable.
my $content = get("http://www.usenix.org/publications/login/");
## Grab a Web page, and write the content to disk.
getstore("http://www.usenix.org/publications/login/", "login.html");
## Grab a Web page, and write the content to disk if it has changed.
mirror("http://www.usenix.org/publications/login/", "login.html");
LWP has other interfaces that enable you to customize exactly how your program will interact with the Web sites it visits. For more
details about LWP’s capabilities, check out the documentation that comes with the module, including the lwpcook and lwptut man
pages. Sean Burke’s book Perl & LWP also provides an introduction to and overview of LWP.
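For instance, if a site requires a particular User-Agent header or a shorter timeout, you can drop down from LWP::Simple to LWP::UserAgent. The sketch below shows the general shape; the agent string and timeout are arbitrary example values, not anything LWP prescribes:
#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;
## LWP::UserAgent is the full-featured interface underneath LWP::Simple.
## The agent string and timeout below are arbitrary example values.
my $ua = LWP::UserAgent->new(
    agent   => "my-robot/0.1",
    timeout => 30,
);
my $response = $ua->get("http://www.usenix.org/publications/login/");
if ($response->is_success) {
    print $response->content;
} else {
    warn "Request failed: ", $response->status_line, "\n";
}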

Screen Scraping
Retrieving Web resources is the easy part of automating Web access. Once HTML files have been fetched, they need to be examined.
Simple Web tools like link checkers only care about the URLs for the clickable links, images, and other files embedded in a Web page.
One easy way to find these pieces of data is to use the HTML::LinkExtor module to parse an HTML document and extract only these
links. HTML::LinkExtor is another one of Gisle’s modules that can be found in his HTML::Parser distribution.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::LinkExtor;
my $content = get("http://www.usenix.org/publications/login/");
my $extractor = new HTML::LinkExtor;
$extractor->parse($content);
my @links = $extractor->links();
foreach my $link (@links) {
    ## $link is a 3-element array reference containing
    ## element name, attribute name, and URL:
    ##
    ##    0     1     2
    ##    <a    href="http://....">
    ##    <img  src="http://....">
    print "$link->[2]\n";
}
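Note that links() returns URLs exactly as they appear in the markup, so relative links come back unresolved. A minimal sketch of one way to handle this is to pass a base URL as the optional second argument to the HTML::LinkExtor constructor, which makes every extracted link absolute (the callback argument is simply left undef here):
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::LinkExtor;
## Example starting page; any URL with relative links will do.
my $base = "http://www.usenix.org/publications/login/";
my $content = get($base);
## Passing a base URL as the second constructor argument tells
## HTML::LinkExtor to absolutize every link it extracts.
my $extractor = HTML::LinkExtor->new(undef, $base);
$extractor->parse($content);
foreach my $link ($extractor->links()) {
    print "$link->[2]\n";
}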

Most modern Web sites have common user interface elements that appear on every page. These are elements like page headers, page
footers, and navigation columns. The actual content of a page is embedded inside these repeating interface elements that appear on
every page of a Web site. Sometimes, a screen scraper will want to ignore all of these repeated elements and focus instead on the
page-specific content for each HTML page it examines.
For example, the O’Reilly book catalog (http://www.oreilly.com/catalog/) has each of these three common interface elements. The
header, footer, and navigation column on this page all contain links to ads and to other parts of the O’Reilly Web site. A program
that monitors the book links on this page is only concerned with a small portion of this Web page, the actual list of book titles.
One way to focus on the meaningful content is to examine the structure of the URLs on this page, and create a regular expression
that matches only the URLs on the list of titles. But when the URLs change, your program breaks. Another way to solve this problem
is to write a regular expression that matches the HTML content of the entire book list, and throw out the extraneous parts of this
Web page. Both of these approaches can work, but they are error-prone. Both will fail if the page design changes in a subtle or a
significant manner.
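To make the first approach concrete, here is a minimal sketch of the URL-matching technique; the href pattern is purely hypothetical, since the actual structure of the catalog’s URLs is not described here:
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
my $content = get("http://www.oreilly.com/catalog/");
## Hypothetical pattern: assume each book link looks like /catalog/<bookid>/.
## The moment the site's URL scheme changes, this silently matches nothing.
my @book_links = $content =~ m{href="(/catalog/[\w.-]+/?)"}g;
print "$_\n" for @book_links;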
Of course, this is Perl, so there’s more than one way to do it. Many Web page designs are built using a series of HTML tables. A better
way to find the relevant content on this Web page is to parse the HTML and focus on the portion of the page that contains what we
want to examine. This approach isn’t foolproof, but it is more robust than using a regular expression to match portions of a Web
page and fixing your program each time the Web page you are analyzing changes.
There are a few modules on CPAN that handle parsing HTML content. While HTML::Parser can provide a good general-purpose
solution, I prefer Simon Drabble’s HTML::TableContentParser, which focuses on extracting the HTML tables found in a Web page.
This technique will break if the HTML layout changes drastically, but at least it is less likely to break when insignificant changes to
the HTML structure appear.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TableContentParser;
my $content = get("http://www.oreilly.com/catalog/");
my $parser = new HTML::TableContentParser;
my $tables = $parser->parse($content);
## $tables is an array reference. Select the specific table
## content and process it directly.
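To give a sense of what the parsed structure looks like, here is a small sketch that continues the script above, assuming the rows/cells/data layout documented for HTML::TableContentParser; the table index used is a placeholder, since which table holds the book list depends on the page layout:
## Each table hashref holds a "rows" array, each row a "cells" array,
## and each cell its raw contents under "data". The index 2 below is
## a stand-in; inspect $tables to find the table you actually want.
my $book_table = $tables->[2];
foreach my $row (@{ $book_table->{rows} }) {
    foreach my $cell (@{ $row->{cells} }) {
        print "$cell->{data}\n";
    }
}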

Interacting with the Web


Most Web automation techniques, like the ones described above, focus on fetching a page and processing the result. This kind of
shallow interaction is sufficient for simple automation tasks, like link checking or mirroring. For more complicated automation,
scripts need to be able to do all the things a person could do with a Web browser. This means entering data into forms, clicking on
specific links in a specific order, and using the back and reload buttons.
This is where Andy Lester’s WWW::Mechanize comes in. Mechanize provides a simple programmatic interface to script a virtual
user navigating through a Web site or using a Web application.
Consider a shopping cart application. A user starts by browsing or searching for products, and periodically clicks on “Add to Shopping
Cart.” On the shopping cart page, the user can click on the “Continue shopping” button, click on the back button, browse elsewhere
on the Web site, or search for products.
If you were developing this application, how would you test it? Would you write down detailed instructions for the people on your
test team to repeat by rote? Or would you write a program to simulate a user, checking each and every intermediate result along the
way? Mechanize is the tool you need to write your simulated user scripts. That user script might look something like this:
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
my $mech = new WWW::Mechanize;
## Start with the homepage.
$mech->get("http://localhost/myshop.cgi");
## Browse for a book.
$mech->follow_link( text => "Books" );
$mech->follow_link( text_regex => qr/Computers/ );
$mech->follow_link( text_regex => qr/Perl/ );
## Put "Programming Perl" in the shopping cart.

52 Vol. 29, No.3 ;login:


$mech->follow_link( text_regex => qr/Programming Perl/);

PROGRAMMING
## Add this to the shopping cart.
$mech->click_button( name => "AddToCart");
## Click the "back button."
$mech->back();
## Check out.
$mech->click_button( name => "Checkout");
## Fill in the shipping and billing information.
....
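The checkout details above are elided in the original script. When using a script like this for testing, it also pays to verify each intermediate result along the way, as discussed earlier. Here is a small, hypothetical sketch of such a check, as it might appear right after the click_button call that adds the book to the cart; the page text being matched is an assumption about this imaginary shop, not something defined above:
## Hypothetical sanity checks after adding the book to the cart:
## make sure the request succeeded and the cart mentions the title.
die "Add to cart failed: " . $mech->status() . "\n"
    unless $mech->success();
die "Programming Perl is not in the cart!\n"
    unless $mech->content() =~ /Programming Perl/;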

Mechanize is also an excellent module for scripting common actions. Every other week, I need to use a Web-based time-tracking
application to tally up how much time I’ve worked in the current pay period. I could fire up a browser and type in the same thing I
typed in two weeks ago. Or I could use Mechanize:
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
my $mech = new WWW::Mechanize;
$mech->get('...');
## Log in.
$mech->set_fields(
    user => "my_username",
    pass => "my_password",
);
$mech->submit();
## Put in a standard work week.
## Log in manually later if this needs to be adjusted.
## (Timesheet is the 2nd form. Skip the calendar.)
$mech->submit_form(
    ## form_number counts from 1, so 2 selects the second form.
    form_number => 2,
    fields => {
        0 => 7.5,
        1 => 7.5,
        ...
        9 => 7.5,
    },
    button => "Save",
);
## That's it. Run this again in two weeks.

Mechanize is also a great module for writing simple Web automation. Scripts that rely on HTML layout or specific textual artifacts
in HTML documents are prone to breaking whenever a page layout changes. For example, whenever I am reading a multi-page article
on the Web, I invariably click on the “Print” link to read the article all at once.
I could use regular expressions, or modules like HTML::LinkExtor or HTML::TableContentParser, to examine the content of a Web
page to find the printable version of an article. But these techniques are both site-specific and prone to breakage. With Mechanize, I
can analyze the text of a link — the stuff that appears underlined in blue in my Web browser. Using Mechanize, I can look for the
“Print” link and just follow it:

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
my $mech = new WWW::Mechanize;
my $url = shift(@ARGV);
$mech->get($url);
if ($mech->find_link(text_regex => qr/^2|Next$/i)) {
    ## This is a multipage document.
    ## Open the "print" version instead.
    $url = $mech->find_link(text_regex => qr/Print/)->url();
}
## Open the file in a browser (on MacOS X).
system("open '$url'");

Conclusion
Perl is well known for automating away the drudgery of system administration. But Perl is also very capable of automating Web-based
interactions. Whether you are using Web service interfaces like XML-RPC and SOAP or interacting with standard HTML-based
interfaces, Perl has the tools to help you automate frequent, repetitive tasks.
Perl programmers have a host of tools available to help them automate the Web. Simple automation can be accomplished quickly
and easily with LWP::Simple and a couple of regular expressions. More intensive HTML analysis can be done using modules like
HTML::LinkExtor, HTML::Parser, HTML::TableContentParser, or WWW::Mechanize, to name a few. Whatever you need to automate
on the Web, there’s probably a Perl module ready to help you quickly write a robust tool to solve your problem.

