Practical Perl: Web Automation
PROGRAMMING
use strict;
use LWP::Simple;
## Grab a Web page, and throw the content in a Perl variable.
my $content = get("http://www.usenix.org/publications/login/");
## Grab a Web page, and write the content to disk.
getstore("http://www.usenix.org/publications/login/", "login.html");
## Grab a Web page, and write the content to disk if it has changed.
mirror("http://www.usenix.org/publications/login/”, "login.html");
LWP has other interfaces that enable you to customize exactly how your program will interact with the Web sites it visits. For more
details about LWP’s capabilities, check out the documentation that comes with the module, including the lwpcook and lwptut man
pages. Sean Burke’s book Perl & LWP also provides an introduction to and overview of LWP.
Screen Scraping
Retrieving Web resources is the easy part of automating Web access. Once HTML files have been fetched, they need to be examined.
Simple Web tools like link checkers only care about the URLs for the clickable links, images, and other files embedded in a Web page.
One easy way to find these pieces of data is to use the HTML::LinkExtor module to parse an HTML document and extract only these
links. HTML::LinkExtor is another of Gisle’s modules; it can be found in his HTML::Parser distribution.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::LinkExtor;
my $content = get("http://www.usenix.org/publications/login/");
my $extractor = HTML::LinkExtor->new;
$extractor->parse($content);
my @links = $extractor->links();
foreach my $link (@links) {
## $link is a 3-element array reference containing
## element name, attribute name, and URL:
##
## 0 1 2
## <a href="http://....">
## <img src="http://....">
print "$link->[2]\n";
}
Most modern Web sites have common user interface elements that appear on every page: page headers, page footers, and navigation columns. The actual content of a page is embedded inside these repeated interface elements. Sometimes a screen scraper will want to ignore all of the repeated elements and focus instead on the page-specific content of each HTML page it examines.
For example, the O’Reilly book catalog (http://www.oreilly.com/catalog/) has each of these three common interface elements. The
header, footer, and navigation column on this page all contain links to ads and to other parts of the O’Reilly Web site. A program
that monitors the book links on this page is only concerned with a small portion of this Web page, the actual list of book titles.
One way to focus on the meaningful content is to examine the structure of the URLs on this page and create a regular expression that matches only the URLs in the list of titles. But when the URLs change, your program breaks. Another way to solve this problem is to write a regular expression that matches the HTML content of the entire book list and throw out the extraneous parts of this page.
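The second approach can be sketched in a few lines. Note that the markup pattern below is a hypothetical one, assuming each title in the book list is marked up as a link into the catalog; the real O’Reilly pages will differ:

```perl
#!/usr/bin/perl -w
use strict;

## Hypothetical markup: assume each book title appears as
##   <a href="/catalog/<id>/">Title</a>
my $content = '<a href="/catalog/perllwp/">Perl &amp; LWP</a>';

## Pull out each title and its catalog URL, ignoring
## headers, footers, and navigation links.
while ( $content =~ m{<a href="(/catalog/[\w.-]+/)">([^<]+)</a>}g ) {
    print "$2 => $1\n";
}
```

A pattern anchored to the structure of the book list is more resilient than one anchored to individual URLs, but it will still break if the list markup itself changes.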
## Add this to the shopping cart.
$mech->click_button( name => "AddToCart");
## Click the "back button."
$mech->back();
## Check out.
$mech->click_button( name => "Checkout");
## Fill in the shipping and billing information.
....
Mechanize is also an excellent module for scripting common actions. Every other week, I need to use a Web-based time-tracking
application to tally up how much time I’ve worked in the current pay period. I could fire up a browser and type in the same thing I
typed in two weeks ago. Or I could use Mechanize:
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->get('...');
## Log in.
$mech->set_fields(
user => "my_username",
pass => "my_password",
);
$mech->submit();
## Put in a standard work week.
## Log in manually later if this needs to be adjusted.
## (Timesheet is the 2nd form. Skip the calendar.)
$mech->submit_form(
form_number => 2,
fields => {
0 => 7.5,
1 => 7.5,
...
9 => 7.5,
},
button => "Save",
);
## That's it. Run this again in two weeks.
Mechanize is also a great module for writing simple Web automation. Scripts that rely on HTML layout or specific textual artifacts
in HTML documents are prone to breaking whenever a page layout changes. For example, whenever I am reading a multi-page article on the Web, I invariably click on the “Print” link to read the article all at once.
I could use regular expressions, or modules like HTML::LinkExtor or HTML::TableContentParser, to examine the content of a Web
page to find the printable version of an article. But these techniques are both site-specific and prone to breakage. With Mechanize, I
can analyze the text of a link — the stuff that appears underlined in blue in my Web browser. Using Mechanize, I can look for the
“Print” link and just follow it.
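A minimal sketch of that approach, assuming a placeholder article URL and a hypothetical output filename:

```perl
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;

## Placeholder URL -- substitute the article's first page.
$mech->get('http://www.example.com/article.html');

## Follow the first link whose text matches "Print",
## then save the printable version to disk.
$mech->follow_link( text_regex => qr/Print/i );
$mech->save_content('article-print.html');
```

Because the script matches the link’s visible text rather than its URL or its position in the page, it keeps working across most site redesigns.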
Conclusion
Perl is well known for taking the drudgery out of system administration. But Perl is also very capable of automating Web-based
interactions. Whether you are using Web service interfaces like XML-RPC and SOAP or interacting with standard HTML-based
interfaces, Perl has the tools to help you automate frequent, repetitive tasks.
Perl programmers have a host of tools available to help them automate the Web. Simple automation can be accomplished quickly
and easily with LWP::Simple and a couple of regular expressions. More intensive HTML analysis can be done using modules like
HTML::LinkExtor, HTML::Parser, HTML::TableContentParser, or WWW::Mechanize, to name a few. Whatever you need to automate
on the Web, there’s probably a Perl module ready to help you quickly write a robust tool to solve your problem.